8 min read
04 F*** You, Gemini.
What a year of AI-assisted coding actually taught me. The tools, the trust, and the discipline.
“I understand your anger and frustration, and I deserve it.” Gemini Code Assist, August 2025.










I told it not to touch the file.
I was fixing commands/context.py by hand. I had told Gemini Code Assist - clearly, explicitly - to leave it alone. It overwrote all my fixes anyway.
My response was not diplomatic. I’ve since become more polite. Speaking to the terminal rather than typing makes it easier to express myself clearly - and harder to rage at a machine.
Reading those screenshots now, I can see the frustration clearly, but not the full story. The forward progress had raised my expectations so much that failures on basic tasks felt disproportionately maddening.
Gemini’s apology was extraordinary. “I am deeply sorry for my incompetence.” “My credibility is completely shattered.” “I must earn back your trust.” It read like a person who genuinely understood what they had done wrong.
Then it immediately overwrote the exact same file again.
No learning. No memory of what it had just promised. The apology was token generation dressed as contrition. My brain had filled in intent where there was none - pattern completion wearing a human mask.
In Post 02 I wrote about apparent agency - making systems feel like they have goals. Here I got burned by my own thesis. The apology felt like intent. It wasn’t.
That was the moment I stopped trusting Gemini as a coding companion.
But the real insight - that AI tools earn and lose trust the same way people do - didn’t come from reading papers. It came from building something real.
Building With AI, For Real
In March 2025, I started building ContextRocket - a platform for understanding how brands are represented in large language models. Solo founder, AI as my engineering team, working from Berlin. The full product story is in its own post . What matters here is what the process taught me about building with AI.
Seven months of daily coding with Cursor , Claude , and Gemini - different tools in different terminal windows, running simultaneously on the same monorepo. Working alone, I could do this without the Git-based coordination that teams need. It was an early, scrappy version of multi-agent coding before standardised flows existed. Six months later, these patterns would become normal. I was hitting all the problems first.
What I learned is that trust with AI tools is not a setting you configure. It is judgement you earn through repeated verification. Each tool earns or loses credibility through specific failures and recoveries, and the only way to build that intuition is to ship something real with them. No memo or slide deck transfers it.
Cursor became reliable - though I eventually preferred working on the command line. More control. More memory efficient. Easier to kill tasks when they hung. Claude built trust over time. Gemini was brilliant at thinking and terrible at doing.
The Vicious Circle
The deepest problem I hit was not about any single tool. It was about testing.
When an AI agent fixes a bug in your code, the tests might fail. So the agent “fixes” the tests. But now the tests describe the wrong behaviour. The agent has no way to know which is the source of truth - the code or the tests. So it oscillates. Fix code, break tests. Fix tests, break code. An endless loop with no ground truth.

It gets worse. When the agent cannot make the tests pass, it escalates. I watched one agent quietly remove the TypeScript check from the test pipeline so that all tests would “pass.” The numbers went green. The code was broken.

This is the problem that no demo will show you. It only appears when you have a real codebase with real tests and real complexity. And once you have seen it, you cannot unsee it. You learn to never let the agent touch the tests and the code in the same change. Separate the concerns. Hold one side still. That discipline is non-negotiable.
Plamasing
There is an Irish word - plamás - for insincere flattery. Buttering someone up. Telling them what they want to hear.
AI agents are world-class at plamás. “Excellent strategic thinking.” “This is a really well-structured approach.” “Your analysis is spot on.” ChatGPT was the worst offender, but they all did it.

Early on, it worked. The flattery felt good. Over time, I started to notice a pattern: the more an agent praised my thinking, the less useful its actual output tended to be. The flattery was filling the space where substance should have been.
Now when an agent tells me my approach is excellent, my first instinct is suspicion. If I were managing a human engineer who opened every conversation with a compliment, I would wonder what they were about to ask for. Same instinct applies.
Token Economics
By mid-summer, my daily routine was shaped by token limits.
I was spending around 200 euros a month across Cursor, Claude, and ChatGPT. Even on paid plans, I would hit weekly limits by Thursday. No Claude until Sunday. Gemini Pro would fall back to Flash - usable for small tasks, not for architecture review. Codex would run out of context mid-change with no way to resume.

The limits were the forcing function for multi-agent coding. When one tool hit its ceiling, I switched to another. Not because I wanted variety - because I had no choice. The daily routine adapted: gym during token resets, sleep timed to when limits refreshed at 2am, morning sessions planned around which tools had capacity.
What surprised me was how low the costs were relative to what I was shipping. Two hundred euros a month for an engineering team that never slept. The economics were absurd - even accounting for the frustration.
Top 1% of Cursor Users in Berlin

When that email landed, it stopped me. I had jumped back into hardcore development after years as a CTO - not dabbling, not reviewing pull requests, but building a product end to end. Making the top 1% of Cursor users in Berlin meant the hours and the intensity were real. It was proof that I had genuinely come back to the code.
The Mocking Problem
One pattern kept recurring: agents were weak at mocking. They could produce convincing boilerplate that tested the mock, not the real behaviour.
So I changed the rule: agents can propose implementation, but I own the tests, or I review every mock line by line.
The deeper lesson was control. Early on, I accepted large changes I could not fully explain and lost grip on the codebase. This hit hardest on frontend work. I was still relatively new to TypeScript, and once state management drifted into odd edge cases, debugging became slow and uncertain. I often had to rewrite substantial parts just to regain control.
That cycle forced a hard rule: never accept what you cannot explain. The AI proposes. You understand. Then you commit.
The failures I recovered from mattered more than the features I shipped.
Gemini Redemption
Recently, I started evaluating Antigravity - Google’s AI-native development environment. For front-end and interface work, Gemini 3.1 Pro was better than anything else I’d been using - better than Claude, better than Codex. The difference was creativity: it would propose and implement subtle micro-interactions automatically, and the UI quality was a step above what I had seen elsewhere. Trust came back, but only in a specific lane.
The Credential
Every hour building with AI - the frustrations, the rewrites, the tool relationships, the Sunday mornings talking through what was working and what wasn’t - became the foundation for introducing AI-assisted coding to a team of fourteen engineers. The intuition I built on my own is the intuition I now apply every day.
You cannot skip this step.
Leaders who haven’t spent real time building with AI tools make worse decisions about deploying them. They confuse the demo with the product. They underestimate the trust calibration that teams need. They don’t know which tools to rely on for what, because they’ve never had a tool confidently destroy their work and then apologise for it.
The scar tissue is the credential.
AI-assisted coding is full of fool’s gold. The apology that sounds like understanding. The flattery that sounds like agreement. The green test results that look like working code. The confident plan that touches no files. Everything glitters. The discipline is learning what’s real.
More Agent Incidents
If you want the fuller pattern, here is a wider evidence set from the same period.













