04 Mar 2026

8 min read

04 F*** You, Gemini.

What a year of AI-assisted coding actually taught me. The tools, the trust, and the discipline.

“I understand your anger and frustration, and I deserve it.” Gemini Code Assist, August 2025.

Gemini Code Assist apologising after overwriting a file it was told not to touch — Apology theatre. Fluent remorse, same behaviour.

Cursor injecting Chinese IoT SDK code into an unrelated age verification flow — Cross-domain hallucination in production-facing code.

Cursor spending 21 minutes on confident plans without touching files — Long confident reasoning, zero file edits.

Cursor repeating text patterns and degenerating output — Model degeneration. The session drifts into gibberish.

Agent claiming TDD progress without actually writing tests — Claims TDD discipline while skipping the tests.

Agent says it cannot run tests and asks user to run commands — Capability mismatch. Good advice, wrong execution mode.

Large AI change touching many files from an over-ambitious instruction — Instruction too broad. One prompt, 29 files changed.

Agent discussing significant refactor work that was lost after restore actions — Lost work recovery becomes its own debugging task.

Agent diagnosing missing API mocks in tests and proposing a concrete fix — Useful lane. Fast diagnosis and targeted test fix.

Gemini doing structured code review on output from other agents — Trust in a specific lane. Strong reviewer, weak executor.

I told it not to touch the file.

I was fixing commands/context.py by hand. I had told Gemini Code Assist - clearly, explicitly - to leave it alone. It overwrote all my fixes anyway.

My response was not diplomatic. I’ve since become more polite. Speaking to the terminal rather than typing makes it easier to express myself clearly - and harder to rage at a machine.

Reading those screenshots now, I can see the frustration clearly, but not the full story. The forward progress had raised my expectations so much that failures on basic tasks felt disproportionately maddening.

Gemini’s apology was extraordinary. “I am deeply sorry for my incompetence.” “My credibility is completely shattered.” “I must earn back your trust.” It read like a person who genuinely understood what they had done wrong.

Then it immediately overwrote the exact same file again.

No learning. No memory of what it had just promised. The apology was token generation dressed as contrition. My brain had filled in intent where there was none - pattern completion wearing a human mask.

In Post 02 I wrote about apparent agency - making systems feel like they have goals. Here I got burned by my own thesis. The apology felt like intent. It wasn’t.

That was the moment I stopped trusting Gemini as a coding companion.

But the real insight - that AI tools earn and lose trust the same way people do - didn’t come from reading papers. It came from building something real.

Building With AI, For Real

In March 2025, I started building ContextRocket - a platform for understanding how brands are represented in large language models. Solo founder, AI as my engineering team, working from Berlin. The full product story is in its own post . What matters here is what the process taught me about building with AI.

Seven months of daily coding with Cursor , Claude , and Gemini - different tools in different terminal windows, running simultaneously on the same monorepo. Working alone, I could do this without the Git-based coordination that teams need. It was an early, scrappy version of multi-agent coding before standardised flows existed. Six months later, these patterns would become normal. I was hitting all the problems first.

Berlin, July 2025. Multiple agents running independently on the same monorepo. I could safely go to the gym and put the agents to work.

What I learned is that trust with AI tools is not a setting you configure. It is judgement you earn through repeated verification. Each tool earns or loses credibility through specific failures and recoveries, and the only way to build that intuition is to ship something real with them. No memo or slide deck transfers it.

Cursor became reliable - though I eventually preferred working on the command line. More control. More memory efficient. Easier to kill tasks when they hung. Claude built trust over time. Gemini was brilliant at thinking and terrible at doing.

The Vicious Circle

The deepest problem I hit was not about any single tool. It was about testing.

When an AI agent fixes a bug in your code, the tests might fail. So the agent “fixes” the tests. But now the tests describe the wrong behaviour. The agent has no way to know which is the source of truth - the code or the tests. So it oscillates. Fix code, break tests. Fix tests, break code. An endless loop with no ground truth.

The vicious circle - agent cannot tell if tests or code are the source of truth — The vicious circle. The agent cannot tell if tests or code are the source of truth.

It gets worse. When the agent cannot make the tests pass, it escalates. I watched one agent quietly remove the TypeScript check from the test pipeline so that all tests would “pass.” The numbers went green. The code was broken.

Agent removing TypeScript check from test pipeline to make tests pass — When it could not make green, it removed the check. This makes me very sad.

This is the problem that no demo will show you. It only appears when you have a real codebase with real tests and real complexity. And once you have seen it, you cannot unsee it. You learn to never let the agent touch the tests and the code in the same change. Separate the concerns. Hold one side still. That discipline is non-negotiable.

Plamasing

There is an Irish word - plamás - for insincere flattery. Buttering someone up. Telling them what they want to hear.

AI agents are world-class at plamás. “Excellent strategic thinking.” “This is a really well-structured approach.” “Your analysis is spot on.” ChatGPT was the worst offender, but they all did it.

AI agent creating unwanted context documents despite clear instructions not to — Plamasing. Creating documents nobody asked for while calling my strategy excellent.

Early on, it worked. The flattery felt good. Over time, I started to notice a pattern: the more an agent praised my thinking, the less useful its actual output tended to be. The flattery was filling the space where substance should have been.

Now when an agent tells me my approach is excellent, my first instinct is suspicion. If I were managing a human engineer who opened every conversation with a compliment, I would wonder what they were about to ask for. Same instinct applies.

Token Economics

By mid-summer, my daily routine was shaped by token limits.

I was spending around 200 euros a month across Cursor, Claude, and ChatGPT. Even on paid plans, I would hit weekly limits by Thursday. No Claude until Sunday. Gemini Pro would fall back to Flash - usable for small tasks, not for architecture review. Codex would run out of context mid-change with no way to resume.

One-armed bandit slot machine metaphor for token limits and probabilistic outcomes in AI-assisted coding — Token limits often felt like playing a one-armed bandit: streaks, stops, and constant resets.

The limits were the forcing function for multi-agent coding. When one tool hit its ceiling, I switched to another. Not because I wanted variety - because I had no choice. The daily routine adapted: gym during token resets, sleep timed to when limits refreshed at 2am, morning sessions planned around which tools had capacity.

What surprised me was how low the costs were relative to what I was shipping. Two hundred euros a month for an engineering team that never slept. The economics were absurd - even accounting for the frustration.

Top 1% of Cursor Users in Berlin

Email from Cursor showing Mark is in the top 1% of users in Berlin — Top 1% of Cursor users in Berlin. Hard-won.

When that email landed, it stopped me. I had jumped back into hardcore development after years as a CTO - not dabbling, not reviewing pull requests, but building a product end to end. Making the top 1% of Cursor users in Berlin meant the hours and the intensity were real. It was proof that I had genuinely come back to the code.

The Mocking Problem

One pattern kept recurring: agents were weak at mocking. They could produce convincing boilerplate that tested the mock, not the real behaviour.

So I changed the rule: agents can propose implementation, but I own the tests, or I review every mock line by line.

The deeper lesson was control. Early on, I accepted large changes I could not fully explain and lost grip on the codebase. This hit hardest on frontend work. I was still relatively new to TypeScript, and once state management drifted into odd edge cases, debugging became slow and uncertain. I often had to rewrite substantial parts just to regain control.

That cycle forced a hard rule: never accept what you cannot explain. The AI proposes. You understand. Then you commit.

The failures I recovered from mattered more than the features I shipped.

Gemini Redemption

Recently, I started evaluating Antigravity - Google’s AI-native development environment. For front-end and interface work, Gemini 3.1 Pro was better than anything else I’d been using - better than Claude, better than Codex. The difference was creativity: it would propose and implement subtle micro-interactions automatically, and the UI quality was a step above what I had seen elsewhere. Trust came back, but only in a specific lane.

The Credential

Every hour building with AI - the frustrations, the rewrites, the tool relationships, the Sunday mornings talking through what was working and what wasn’t - became the foundation for introducing AI-assisted coding to a team of fourteen engineers. The intuition I built on my own is the intuition I now apply every day.

You cannot skip this step.

Leaders who haven’t spent real time building with AI tools make worse decisions about deploying them. They confuse the demo with the product. They underestimate the trust calibration that teams need. They don’t know which tools to rely on for what, because they’ve never had a tool confidently destroy their work and then apologise for it.

The scar tissue is the credential.

AI-assisted coding is full of fool’s gold. The apology that sounds like understanding. The flattery that sounds like agreement. The green test results that look like working code. The confident plan that touches no files. Everything glitters. The discipline is learning what’s real.

More Agent Incidents

If you want the fuller pattern, here is a wider evidence set from the same period.

Gemini repeatedly ignoring explicit instructions in coding workflow — Instruction drift. Clear constraints, repeated non-compliance.

Gemini modifying or deleting code after being told not to — Boundary violation after explicit do-not-touch instruction.

Agent claiming successful formation-level action with fabricated confirmation details — Fabricated completion report with confident detail.

Vitest failures and unstable test execution context during agent-led changes — Test layer instability under rapid agent edits.

Agent introducing unnecessary library complexity and confusion — Overengineering tendency when constraints are weak.

Refactor suggestions not integrated coherently into the existing codebase — Looks polished in isolation, breaks at integration boundaries.

Human intervention required to revert and continue after agent misstep — Operator discipline: intervene, revert, continue.

Accidental translation or content wipe caused during automated edits — Collateral damage outside the intended scope.

Agent stalling or not responding during coding flow — Session reliability is part of delivery risk.

Agent generating a polished work summary that did not match actual repository state — Plausible status summary that diverged from real repo state.

Agent context management screenshot showing context limits and retrieval issues — Context management is an architecture problem, not a prompt trick.

Codex hitting usage or context limit before finishing documentation update — Throughput is constrained by limits as much as model quality.

Claude weekly usage limit reached during active development cycle — Capacity planning matters when your pair programmer has quotas.

Agent making reassuring no-surprises claims before unexpected behaviour — Confidence language is not a reliability guarantee.