The Bart Test - Part 5: Redesigning From Scratch
After my teens ghosted the frontier model evaluation, I sat with a choice: give up on this whole thing, or try again.
The doubt was real. Maybe the Bart Test would never work. Maybe asking teenagers to evaluate AI-generated slang was fundamentally flawed. But I couldn't shake the insights from [Part 3](/blog/bart-test-part-3-the-zoo-not-duck-problem)—the "zoo not duck" problem, the slang half-life, the "trying too hard" pattern. Those felt real.
So I decided to try again. Not because I was confident it would work, but because I wasn't ready to give up.
The Bart Test - Part 4: When My Teen Judges Ghosted Me
I tested GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro with the baseline prompt. The [outputs](https://github.com/bart-mosaicmeshai/bart-test/tree/main/results/03_experiment_runs) were ready. I sent the first story ([GPT's 1,540-word epic](https://github.com/bart-mosaicmeshai/bart-test/blob/main/results/03_experiment_runs/03a_gpt5.2_baseline_20251218_202909.json)) to my kids via text.
No response.
I waited a few days. Still nothing.
A week passed. They weren't being difficult. They just... didn't respond.
The Bart Test - Part 3: The Zoo-Not-Duck Problem
When I asked what made the AI output feel unnatural, Teen #1 said:
> "Just didn't seem like very effective communication. It's like if you are trying to paint a picture of a duck and you paint a picture of a zoo with a tiny duck exhibit in the corner. Too much noise."
This metaphor captured the core problem.
The Bart Test - Part 2: Testing the Overthinking Hypothesis
After seeing OLMo 3 overthink Gen-Alpha slang (scores of 4-5/10), I wondered: can I tune this to reduce the overthinking? If the model is trying too hard, maybe I could adjust parameters or prompts to make it more natural.
Spoiler: Both directions made it worse.
The Bart Test - Part 1: When AI Does Its Homework Too Well
I asked my teenagers to judge an AI's attempt at Gen-Alpha slang.
Teen #1: "It's definitely AI... a little too much." Score: 4/10.
Teen #2: "It sounds like my ELA project where we had to use as much slang as possible." Score: 6/10 (if a teen wrote it), 2/10 (if an adult did).
The AI did its homework. That's the problem.
Fine-Tuning Gemma for Personality - Part 8: Lessons Learned
Eight posts about teaching an AI to talk like a cartoon dog. Here's what actually mattered.
"Fine-Tuning Gemma for Personality - Part 7: From PyTorch to Browser
Browser-based inference with a fine-tuned model. No backend server, no API keys, no monthly fees. Just client-side WebGPU running entirely on your device.
Fine-Tuning Gemma for Personality - Part 6: Testing Personality (Not Just Accuracy)
How do you test if an AI sounds like a personified 6-year-old dog? You can't unit test personality. There's no accuracy metric for "sounds like Bluey."
Fine-Tuning Gemma for Personality - Part 5: Base Models vs Instruction-Tuned
Same training data. Same hardware. Same 5 minutes. I tested both base (-pt) and instruction-tuned (-it) models to see if one would handle personality better. Both struggled with consistency.
Fine-Tuning Gemma for Personality - Part 4: When Your Model Learns Too Well
The model reproduced Bluey's speech patterns. Then it stopped generating after 76 words. The training data's average response length may have become a constraint.
Fine-Tuning Gemma for Personality - Part 3: Training on Apple Silicon
Five minutes and no API costs to fine-tune a 1-billion-parameter language model on my laptop. No cloud GPUs, no monthly fees, no time limits. Just Apple Silicon and PyTorch.
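If you just want the flavor of it, here's a minimal sketch of training on the MPS backend, assuming a PyTorch build with Apple Silicon support. The tiny linear model and single training step are illustrative stand-ins, not the Gemma fine-tuning code from the post.

```python
import torch
import torch.nn as nn

# Pick the Apple Silicon backend when available, otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = nn.Linear(16, 2).to(device)            # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 16, device=device)          # dummy batch
y = torch.randint(0, 2, (8,), device=device)   # dummy labels

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()                               # one step, entirely on-device
print(f"device={device}, loss={loss.item():.4f}")
```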
Fine-Tuning Gemma for Personality - Part 2: Building the Training Dataset
One hundred eleven conversations. That's what it took to demonstrate personality-style learning. Not thousands—just 111 AI-generated examples of how she talks, thinks, and helps.
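For a sense of what one example can look like on disk, here's a hypothetical conversation record serialized as JSONL. The field names and the sample dialogue are illustrative assumptions, not the dataset's actual schema.

```python
import json

# Hypothetical training-example format; field names and dialogue are
# illustrative assumptions, not the dataset's actual schema.
example = {
    "messages": [
        {"role": "user", "content": "I can't find my shoes and we're late!"},
        {"role": "assistant", "content": "Ooh, let's play detectives! Where did you last see them?"},
    ]
}

# One JSON object per line is the usual JSONL convention for chat datasets.
with open("conversations.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```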
Fine-Tuning Gemma for Personality - Part 1: Why Fine-Tune a 6-Year-Old?
I taught an AI to talk like Bluey Heeler from the kids' show. Not through prompt engineering or RAG—through fine-tuning a small language model on 111 conversation examples. Five minutes of training on my MacBook. The model learned to mimic her speech patterns.
Removing Friction: Automating Nano Banana Image Workflows
Five minutes of manual work for every blog image—convert PNG to JPEG, resize, move to the right folder. While generating images for a 9-part blog series, I'd had enough. Twenty minutes of coding eliminated the friction.
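A script along these lines covers the whole loop. This is a minimal sketch assuming Pillow is installed; the folder names, resize target, and JPEG quality are hypothetical, not the exact workflow from the post.

```python
from pathlib import Path
from PIL import Image  # assumes Pillow is installed

SRC = Path("generated")          # hypothetical input folder of PNGs
DST = Path("blog/assets")        # hypothetical destination folder
MAX_WIDTH = 1200                 # illustrative resize target

DST.mkdir(parents=True, exist_ok=True)

for png in SRC.glob("*.png"):
    img = Image.open(png).convert("RGB")       # drop alpha for JPEG
    if img.width > MAX_WIDTH:
        ratio = MAX_WIDTH / img.width
        img = img.resize((MAX_WIDTH, round(img.height * ratio)))
    out = DST / png.with_suffix(".jpg").name
    img.save(out, "JPEG", quality=85)          # convert, resize, relocate in one pass
    print(f"{png} -> {out}")
```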
Building an Agentic Personal Trainer - Part 9: Lessons Learned
Nine posts later, what actually worked? What would I do differently? Here's my retrospective.
Building an Agentic Personal Trainer - Part 8: Testing and Debugging Agents
How do you test an AI agent? Unit tests don't cover "it gave bad advice." Verbose mode became my best debugging tool—watching the agent think out loud.
Building an Agentic Personal Trainer - Part 7: LLM Provider Abstraction
Running AI locally has no API costs—just electricity. Cloud providers charge per token. I wanted to switch between local and cloud models without rewriting my agent code.
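In sketch form, the pattern looks like this: the agent codes against a small interface, and providers get swapped behind it. The class and method names here are hypothetical, not the post's actual abstraction, and the provider bodies are stand-ins for real inference calls.

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Hypothetical interface the agent codes against."""

    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class LocalProvider(LLMProvider):
    def complete(self, prompt: str) -> str:
        # Stand-in for a call to a locally hosted model; real inference
        # code is elided.
        return f"[local model reply to: {prompt!r}]"

class CloudProvider(LLMProvider):
    def complete(self, prompt: str) -> str:
        # Stand-in for a metered cloud API call; real client code is elided.
        return f"[cloud model reply to: {prompt!r}]"

def run_agent(provider: LLMProvider) -> None:
    # The agent only sees the interface, so swapping providers is one line
    # at construction time.
    print(provider.complete("Plan tomorrow's workout."))

run_agent(LocalProvider())   # free, just electricity
run_agent(CloudProvider())   # pay per token
```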
Building an Agentic Personal Trainer - Part 6: Memory and Learning
"Didn't we talk about my knee yesterday?" If your AI coach can't remember last session, it's not coaching—it's starting over every time.
Building an Agentic Personal Trainer - Part 5: Smart Duplicate Detection
When I do an indoor bike workout, my bike computer records it. So does [Wahoo SYSTM](https://support.wahoofitness.com/hc/en-us/sections/5183647197586-SYSTM). Then both the bike computer and SYSTM sync to [Garmin Connect](https://connect.garmin.com/). Now I have two records of the same ride. My agent thinks I'm training twice as much as I am.
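One common way to catch this, sketched below, is to flag two activities as duplicates when their start times and durations agree within a tolerance. The thresholds and field names are assumptions for illustration, not the agent's actual logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Activity:
    source: str          # e.g. "bike_computer" or "wahoo_systm"
    start: datetime
    duration: timedelta

def is_duplicate(a: Activity, b: Activity,
                 start_tol: timedelta = timedelta(minutes=2),
                 dur_tol: timedelta = timedelta(minutes=1)) -> bool:
    # Two records of the same ride should start at nearly the same time and
    # run for nearly the same length; the tolerances absorb clock drift.
    return (abs(a.start - b.start) <= start_tol
            and abs(a.duration - b.duration) <= dur_tol)

ride_a = Activity("bike_computer", datetime(2025, 12, 18, 6, 0), timedelta(minutes=62))
ride_b = Activity("wahoo_systm", datetime(2025, 12, 18, 6, 1), timedelta(minutes=61))
print(is_duplicate(ride_a, ride_b))  # True: one ride, two sources
```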
Building an Agentic Personal Trainer - Part 4: Garmin Integration
A personal trainer who doesn't know your recent workouts isn't very personal. The agent needed to connect to Garmin Connect—where my watch, bike computer, and other services like Wahoo SYSTM sync workout data automatically.
