Vibe Engineering: What I've Learned Working with AI Coding Agents
Combining vibes with rigorous engineering practices
I’ve spent the past few months doing almost nothing but working with AI coding agents. No job, just me and LLMs building things together. What follows is everything I wish someone had told me when I started: the mindset shifts, the hard-won lessons, and the techniques that actually work for me.
Fair warning: if you tried ChatGPT for coding in 2023, decided “this can’t code,” and haven’t looked back since, you’re operating on outdated information. Things have changed. A lot.
Agents Are Just Loops (And That’s Fine)
“Agent” has been the word of 2025. I always found it confusing. What even is an agent? It felt like marketing hype until I came across Simon Willison’s definition:
An agent runs tools in a loop to achieve a goal.
That’s it. Very abstract, very simple. What surprised me is that from an engineering perspective, these agents are really lame. It’s literally just a for loop. User input goes to the LLM, response comes back with maybe some tool calls, results go back into the loop, and you keep going. That’s the whole thing.
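To make that concrete, here’s roughly what that loop looks like as a Python sketch. It’s not any particular product’s implementation; call_llm() and run_tool() are hypothetical stand-ins for your model API and your tool dispatcher.

```python
# A minimal sketch of the agent loop. call_llm() and run_tool() are
# hypothetical stand-ins for your model API and your tool dispatcher.
def run_agent(goal: str, max_steps: int = 50) -> str:
    messages = [{"role": "user", "content": goal}]   # this list IS the context window
    for _ in range(max_steps):
        response = call_llm(messages)                # the model sees the whole array every time
        messages.append(response)
        if not response.get("tool_calls"):           # no tools requested: we (hopefully) hit the goal
            return response["content"]
        for call in response["tool_calls"]:          # run each tool, feed the results back in
            result = run_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "content": result})
    return "ran out of steps"
```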
But that’s actually nice, because it means these systems aren’t magic. You’re just harnessing the power of models that are incredibly capable when given the right setup.
The caveat: you need state-of-the-art models. This is the first time I’ve bought something that costs a hundred dollars a month (other than my rent). But the open source alternatives are nowhere close. Try the frontier models, go into debt if you have to. (Okay, don’t actually go into debt. But you get the idea.) Open source models like GLM/MiniMax are getting better, but Opus 4.5 is really on another level.
Your Context Window Is a Precious Resource
The context window is maybe the most important concept to internalize. It’s just an array of tokens. The model has no other state; everything it knows about your conversation and your code lives in that window.
The problem is that the more you stuff in there, the worse the model performs. I’ve vibe-picked a number, but ideally you want to stay under 50-75% capacity. Above that, the model starts hallucinating, making mistakes, calling tools incorrectly, etc.
The question you should constantly be asking yourself: “What even is in my context?”
Current tooling is terrible at answering this. GitHub Copilot, for example, puts 30,000 tokens into the context by default. Even for the simplest project, that’s fifteen percent of your memory gone before you’ve done anything.
Geoffrey Huntley recently compared it to a Commodore 64. I’m too young to have used one, but I had a Windows 98 computer with 20 MB of RAM. That’s the right mental model. You’ve got maybe 200 to 500 kilobytes to work with. When the context fills up completely, you get “compaction,” which is basically death for your useful assistant. Everything gets confused and degraded.
This is what some people are starting to call “context engineering.” What you put in the context dictates how effective you’ll be. If you just ask “make me an app with 100M ARR” yeah, it’s not going to work. You need to think about things more carefully.
The Great Decoupling
Gemini called this “the great decoupling,” and I think it’s a useful frame for understanding why some people love LLMs and others hate them.
You have to decouple programming (the craft of physically typing code) from engineering (the architecture of your system, the goals, the “why” of what you’re building).
If you’re focused on formatting, syntax, implementing loops and algorithms, LLMs will soon be better at this than you. In your “expert” domain (for me that’s C++ and Windows internals) they may still be worse than you. But LLMs can already do passable work in every domain and every programming language. I don’t really know Haskell. But with the right setup, LLMs could write entire systems in Haskell for me.
The mindset shift is that you can suddenly use all programming languages and technologies instead of just the ones you’re already familiar with.
But there are questions the LLM cannot answer: How do these components fit together? Does this provide value to users? How does it provide value? That’s the thinking you still need to do. It’s a natural shift as you become more senior, but working with agents accelerates it.
You Just Got Promoted
Think of it this way: you just got promoted. Your new title is Tech Lead + QA Lead.
Your job is to write clear specifications, plans, and examples. Define what “done” means, how to test it. Review outcomes, not lines of code.
The model is like a genius intern with amnesia. It works at 100x speed but has zero long-term memory. It doesn’t remember anything you tell it, and when the context fills up, it forgets things from the beginning of the conversation. It needs strict guardrails, a kind of “model CI.”
The big mindset shift: code is now extremely cheap.
If I want a UI that displays what files are being edited on my system in real-time, I can get that in ten minutes. If the model messes up your project, just revert and try again. You spend ten bucks instead of eight hours coding by hand.
For a long time, when I wrote something, it was my baby. My blood, sweat, and tears went into it. I didn’t want to delete it. Now I need to change my thinking. I can throw out a whole project because I’m not invested in it the same way.
This has been the hardest part for me: letting go of the details and trusting this system with my precious code.
To be clear, I’m not saying you should only vibe-code everything or never write code yourself. It’s more that your default mode has to change. If you need to dig into details or debug tough issues, you still do that. But the default mindset shifts.
Stop Fighting the Model
One of the most common mistakes I see is people fighting the model. You have to stop.
When you watch it work, you get this tendency to micromanage because you feel like you own this code. If you catch yourself thinking things like:
That’s not how I would do it!
The variable names don’t follow my conventions!
*Fixes formatting by hand*
*Reviews every single line of code*
That is fighting the model. It’s counterproductive; you’ll work slower. That’s a big part of why people say “AI makes me slower”: they’re making themselves the bottleneck in the process.
You could have thirty agents working for you simultaneously. You cannot watch all of them and micromanage each one. You wouldn’t do this with your human team either.
Instead: write a better harness. If you have tests and the model does something wrong, revert, write a better test, let the model try again. It will see “this failed” and implicitly do what you want because you gave it better feedback. For variable names and formatting? Make a linter rule that automatically fixes everything. The model will edit, build, see “variable name doesn’t follow convention,” and fix it. One extra step, but then you get code the way you want it.
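For example, the naming convention can come from the harness itself. Here’s a sketch of a tiny custom lint rule (the script and the convention are made up for illustration) that fails the check when a function name isn’t snake_case, so the model gets that feedback from the tooling on the next loop instead of from you.

```python
# check_naming.py: a made-up custom lint rule. It fails the check when a
# function name isn't snake_case, so the convention is enforced by the
# harness instead of by a human reviewing every line.
import ast
import re
import sys

SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9_]*$")

def check(path: str) -> list[str]:
    tree = ast.parse(open(path).read(), filename=path)
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not SNAKE_CASE.match(node.name):
                problems.append(f"{path}:{node.lineno}: function '{node.name}' is not snake_case")
    return problems

if __name__ == "__main__":
    issues = [problem for path in sys.argv[1:] for problem in check(path)]
    print("\n".join(issues) or "naming: OK")    # minimal output: one line per problem, or one OK line
    sys.exit(1 if issues else 0)
```

Wire something like this into the lint step and the naming conversation never has to happen in chat.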
Wrong approach entirely? That means your plan was bad. You didn’t explain properly. Delete the code, go back to planning, articulate better, and let it rip again. You can do something else in the background while it works.
This is the most common issue I see. It makes sense; you’re running a system you can’t fully control, and that’s uncomfortable. There’s no way around it except to play with the stuff and notice yourself doing it. But hopefully knowing about this problem up front helps you recognize it faster than I did.
Project Setup Is Everything
Your project setup is probably the single most important human time investment you can make.
Hard requirement: your project has to build, test, and lint with a single command. No README that says “pass this flag” or “set this library path manually.” Put that in a configuration file once and make the single-command thing work.
If the model runs a build and gets a bunch of CMake output saying “library not found,” that’s wasted context. And context only ever accumulates; things don’t get removed from it (look up “cache pricing” for the why).
Print minimal, actionable error messages. If you have a thousand tests, don’t print “test passed” a thousand times. That’s satisfying/reassuring for humans, but for the model you just want “1000/1000 success” and exit code zero. If a test fails, print the assertion that failed, maybe a call stack, actionable information the model can use to fix the problem.
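As a sketch of what that can look like, here’s a hypothetical check.py that puts build, test, and lint behind one command and keeps every passing step down to a single line. The specific commands are placeholders; use whatever your project actually needs.

```python
# check.py: one command that builds, tests, and lints, and keeps stdout tiny.
# The commands below are placeholders; swap in whatever your project uses.
import subprocess
import sys

STEPS = {
    "build": ["cmake", "--build", "build"],
    "test":  ["ctest", "--test-dir", "build", "--output-on-failure"],
    "lint":  ["ruff", "check", "."],
}

def run(name: str, cmd: list[str]) -> bool:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        print(f"{name}: OK")                    # one line per passing step
        return True
    print(f"{name}: FAILED")                    # on failure, show only that step's output
    print(proc.stdout.strip())
    print(proc.stderr.strip())
    return False

if __name__ == "__main__":
    results = [run(name, cmd) for name, cmd in STEPS.items()]   # run every step even after a failure
    sys.exit(0 if all(results) else 1)
```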
Everything you output goes directly into the context. This is context engineering in practice: how you prompt the model, and how the model interacts with your project.
A good heuristic: if a human gets lost onboarding to your codebase, an LLM is definitely going to be lost.
For Python specifically, uv is the best thing ever created. Before, we had virtualenvs and all that mess. Now it’s just uv run main.py, and you tell the model “we’re using uv in this project” and it knows. It was designed with this kind of workflow in mind.
With human teammates, once they learn the project setup, they know it forever. But with LLMs, if you have thirty agents and they all spend the first twenty percent of their context figuring out how to build, that cost accumulates quickly. Your team is huge now. Every new context is essentially a new team member who needs to onboard.
Trust the Harness, Not Your Eyes
Agents run tools in a loop until they achieve a goal. That means you need to build a feedback loop, and you want to trust the output without having to watch everything.
Models can only fix what they see is wrong. If stuff crashes, print a stack trace. Test failures need actionable output. Compilation and linter errors need to be clear and concise.
C++ is tricky here; one character change can produce a thousand lines of error output. That’s terrible for the model because you’re burning context on noise. Higher-level languages are sometimes better purely because of error handling. You could invest in error message parsing that runs in a separate context and returns only the actual problem lines. (Or just write simpler C++ 🤷♂️)
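If you do go the parsing route, it doesn’t need to be fancy. Here’s a rough sketch of a filter you could pipe a noisy build through; the regex is a heuristic I made up for illustration, not a real C++ diagnostics parser.

```python
# filter_errors.py: pipe a noisy C++ build through this to keep only lines
# the model can act on. The pattern is a rough heuristic, not a real parser.
import re
import sys

KEEP = re.compile(r"(error|warning):|undefined reference|required from here")

def main() -> int:
    kept = [line.rstrip() for line in sys.stdin if KEEP.search(line)]
    print("\n".join(kept[:200]) if kept else "build output: no errors or warnings")
    return 1 if any("error" in line for line in kept) else 0

if __name__ == "__main__":
    sys.exit(main())
```

Something like make 2>&1 | python filter_errors.py keeps the context down to the lines that actually matter.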
Design for Black Boxes
Design small, isolated systems. I call them “black box microservices,” though I don’t mean the gRPC/Kafka kind necessarily. I mean components with clear inputs and outputs where you can be confident they work without having to look inside and see what slop code is there.
If you have this as a primitive, you can compose black boxes into larger systems. Individual boxes can be rewritten in other languages, removed, replaced easily. You can measure them individually.
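A toy sketch of what I mean, with made-up names: the only things the rest of the system (or a reviewing human) cares about are the signature and the contract test.

```python
# A toy black-box component with made-up names. Everything outside cares only
# about the signature and the contract test; the body is free to be rewritten,
# regenerated, or ported to another language.
from dataclasses import dataclass

@dataclass
class Invoice:
    subtotal_cents: int
    country: str

def total_with_tax(invoice: Invoice) -> int:
    """Input: an Invoice. Output: the total in cents. No hidden state, no I/O."""
    rates = {"DE": 0.19, "US": 0.0}     # illustrative rates only
    rate = rates.get(invoice.country, 0.0)
    return round(invoice.subtotal_cents * (1 + rate))

def test_total_with_tax():
    # The contract test is what you actually trust; the implementation is replaceable.
    assert total_with_tax(Invoice(10_000, "DE")) == 11_900
    assert total_with_tax(Invoice(10_000, "US")) == 10_000
```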
This matters more now because if you build a monolith of 100,000 lines, LLMs won’t do as good a job. Small, isolated, composable is the name of the game.
CLI Over IDE
The age-old fight: IDE or terminal? I think CLI wins, but not for the usual reasons.
IDE integrations are a local optimum. IDEs are human interfaces; they encourage fighting the model by showing you diffs and code. Great for humans, problematic when working with agents.
The CLI acts as a forcing function. Models know how to work in terminals; they can’t meaningfully work in GUIs (yet). The single-command requirement is something humans want too, but for models it’s essential. Stdout directly controls the context, so you have direct control over what goes in. With an IDE it all depends on how the agent harness is programmed and this is usually opaque to the user.
If you tell a model to use a tool it doesn’t know, and that tool is naturally discoverable (run it with no arguments and it prints its subcommands and usage examples), the LLM can figure out any tool. CLI encourages the manager mindset because you simply don’t see the code.
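Here’s a minimal sketch of what “naturally discoverable” means, using a made-up tool: run it with no arguments and it documents itself, which is all an agent needs to start using it.

```python
# toolctl.py: a made-up, self-documenting CLI. Running it with no arguments
# prints its subcommands, which is all an agent needs to start using it.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="toolctl", description="Example discoverable tool.")
    sub = parser.add_subparsers(dest="command")

    ingest = sub.add_parser("ingest", help="load a data file")
    ingest.add_argument("path")

    sub.add_parser("status", help="print the current state")

    args = parser.parse_args()
    if args.command is None:
        parser.print_help()             # discoverability: bare invocation prints full usage
    elif args.command == "ingest":
        print(f"ingesting {args.path}")
    else:
        print("status: ok")

if __name__ == "__main__":
    main()
```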
IDE integrations are also often just bad. GitHub Copilot stuffs 30,000 tokens of context in there and you have no visibility into what it’s doing. CLI tools are designed to be more transparent. I’m sure this will improve over time, but for now CLI is where it’s at.
As an aside: if you are on Windows, get off. Agents are trained on Unix-y shells and often trip over backslash-versus-forward-slash paths, background tasks, and so on. Every command failure pollutes the context, so don’t make it harder on the model than you have to. WSL or dev containers work great too if you don’t want to switch to Linux or macOS.
TDD Is No Longer a Scam
I’ve always thought orthodox Test Driven Development was silly for humans. You end up writing lots of test code, and when you refactor, you have to change all the tests too. Very costly.
But for an LLM, TDD is the best thing ever because it’s a feedback loop by design:
Write a failing test
Implement the feature
Test should now pass
That’s it. That’s the gold standard for working with agents. You can do crazy things when you have a test spec for something: set the model up in a good environment and it can rewrite your project in another language in a day.
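A minimal sketch of step one, with a hypothetical slugify() feature: the tests exist before the code does, and the agent’s only job is to make them pass.

```python
# test_slugify.py: the failing tests come first. slugify() is a hypothetical
# feature that doesn't exist yet, so the first run fails at the import; the
# agent's job is to make these pass without touching the tests.
from myproject.text import slugify

def test_slugify_basic():
    assert slugify("Hello, World!") == "hello-world"

def test_slugify_collapses_whitespace():
    assert slugify("  spaced   out  ") == "spaced-out"
```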
Golden Master Testing
This doesn’t apply to every problem, but it works incredibly well when porting systems to another language or API, or doing major refactoring.
The idea: augment your existing system with lots of debug prints. Print state, print decisions made. Capture that output as a file. That’s your “golden master.” Then do the port (or have the LLM do it). The debug prints in the new system have to match byte-for-byte. Maybe you need custom matching for things like pointer values, but that’s the general approach.
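The comparison step can be as simple as a line-by-line diff with some normalization. Here’s a rough sketch, assuming both systems dump their debug trace to a text file; normalizing hex addresses is one example of the custom matching mentioned above.

```python
# compare_golden.py: diff a new system's debug trace against the golden master.
# Normalizing hex addresses lets "same decisions, different pointers" still match.
import re
import sys

ADDRESS = re.compile(r"0x[0-9a-fA-F]+")

def normalize(line: str) -> str:
    return ADDRESS.sub("0xADDR", line.rstrip("\n"))

def main(golden_path: str, candidate_path: str) -> int:
    golden = [normalize(line) for line in open(golden_path)]
    candidate = [normalize(line) for line in open(candidate_path)]
    for i, (g, c) in enumerate(zip(golden, candidate), start=1):
        if g != c:
            print(f"first mismatch at line {i}:\n  golden:    {g}\n  candidate: {c}")
            return 1
    if len(golden) != len(candidate):
        print(f"length mismatch: golden={len(golden)} candidate={len(candidate)}")
        return 1
    print(f"golden master: {len(golden)} lines match")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```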
This is insanely good. I wrote LLVM Python bindings with this approach and it just works. My mind was completely blown.
There’s a great post from someone who wrote an HTML5 parser from scratch in Python with an LLM. The LLM wrote the vast majority of the code and it passes 100% of the spec tests. That’s the golden master approach. Give it lots of tests and let the LLM make them pass.
Sometimes they cheat, but that’s a discussion for another time and will improve as the models get better.
Spend Time on Planning
If you ask “build me an interactive disassembler,” it’s probably going to fail. Maybe in a year it won’t, but for now it pays to have a good plan.
The plan can be as detailed as your problem requires. When I was doing those LLVM Python bindings, the memory management of LLVM objects wasn’t obvious at all. I spent serious time coming up with examples, pseudo-code for how to manage memory and state. Go as deep as needed for your problem.
Models are great sounding boards too. Deep Research is excellent for generating ideas. You can get the model to ask you questions about your approach.
And remember: failed attempts are cheap. Eight hours of human work can compress to 15 minutes. In eight hours you can try your plan many times. That doesn’t mean you shouldn’t try to make a good plan to begin with. But if the plan fails and the result is bad, just delete everything and try again.
DevDocs: Surviving Context Resets
This is a technique I’ve been using that actually works. I read about it in a long Reddit post a few months back.
Create a subfolder in your project called devdocs:
devdocs/plan.md: goals, implementation phases, approach
devdocs/progress.md: current status, checkboxes
Once your context gets full (say the model has implemented one phase, or half of one), you can kill the session, start fresh, and say “We were working on this progress.md and this plan. Continue the work.”
It picks up where it left off, maybe re-reads some code, then continues working.
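Scaffolding this is trivial to script if you want to. Here’s a small sketch; the templates are illustrative placeholders, not a required format.

```python
# scaffold_devdocs.py: create devdocs/ with starter plan.md and progress.md.
# The templates are illustrative placeholders, not a required format.
from pathlib import Path

PLAN = """# Plan

## Goal

## Approach

## Phases
1. ...
"""

PROGRESS = """# Progress
- [ ] Phase 1
- [ ] Phase 2

## Notes for the next session
"""

def scaffold(root: str = ".") -> None:
    devdocs = Path(root) / "devdocs"
    devdocs.mkdir(exist_ok=True)
    for name, body in {"plan.md": PLAN, "progress.md": PROGRESS}.items():
        path = devdocs / name
        if not path.exists():           # never clobber an existing plan
            path.write_text(body)

if __name__ == "__main__":
    scaffold()
```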
The important thing to internalize: your plan is permanent, code is ephemeral. You want to be able to kill any single session at any moment. If the model gets stuck, adjust the plan, revert, regenerate.
Commit often. I’m still experimenting with this; you can see an example in my llvm-nanobind/devdocs. I also wrote a (currently broken) CLI tool to scaffold your project in this structure: devdocs-cli. That project uses beads instead of progress.md, though; you need to see what works for you...
On Slop and Accumulation
I should acknowledge the concern: what happens as things accumulate over time? You accept one change, build on it, build more on that. What does the codebase look like in six months or a year?
If we’re not careful, you can end up with slop instead of well-architected code. This is real. I don’t have all the answers here.
But the difficulty I see is that people don’t even try this stuff because of various concerns. They tried ChatGPT in 2023, decided it couldn’t code, and never looked back. Or they use it as a chat interface for quick questions but never let it actually build anything autonomously.
If you say from the beginning “I don’t trust this,” you’ll never get familiar with how these tools work. It becomes self-reinforcing. The slop concern is valid, but you have to engage with the tools to develop the judgment for when and how to use them.
Beyond Coding: Other Use Cases
A few other things these agents are surprisingly good at:
Debugging failures. When I need GitHub Actions fixed, the agent handles it. Fixing things I don’t really understand/care about is a perfect use case.
Researching existing codebases. When an API doesn’t work the way I want, or I can’t figure out how to do something, I ask Claude Code. It responds in three minutes with a detailed explanation, tracing through dozens of function calls and data structures. I don’t have to bother coworkers asking “what’s going on here?”
System administration. I was trying a Linux desktop and had Fedora configuration issues. I have passwordless sudo on the machine, asked the agent “the network connection isn’t showing,” and it just fixed it. Ran diagnostic commands, figured it out, resolved it. Same with my Proxmox cluster: I needed to resize an encrypted volume, which would have taken me hours. The agent just did it. The year of the Linux desktop might come after all; it just requires a $200/month subscription to keep your system stable.
What’s Next: Sub-Agents
Sub-agents are the next thing I want to get good at. The key problem is that your context is limited. When you ask it to fix an API, a sub-agent can explore the codebase, read all the relevant code, and return only “we need to modify this file, here’s how the flow looks.”
The main model doesn’t do the actual fixing. It spawns another sub-agent that writes the fix and reports back. This prevents context rot and allows much longer sessions.
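In code terms it’s the same loop from earlier, just nested. Here’s a sketch reusing the hypothetical run_agent() from the first example; only the short summaries flow back into the orchestrator’s context.

```python
# A sketch of the sub-agent pattern, reusing the hypothetical run_agent() from
# the loop sketch above. Each call starts with a fresh, empty context; only the
# short summary strings flow back into the orchestrator's window.
def fix_api(bug_report: str) -> str:
    findings = run_agent(
        "Explore the codebase and report which files and functions are involved "
        "in this bug. Reply with a short summary only.\n\n" + bug_report
    )
    fix_summary = run_agent(
        "Apply a fix based on these findings, run the tests, and report the result "
        "in a few sentences.\n\n" + findings
    )
    return fix_summary      # the orchestrator never sees the exploration details
```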
I see people work with orchestrator agents that have sub-agents with more sub-agents, trees of agents doing insane projects. I’m not there yet, but I want to be.
Just Try It
The only way to get good at working with agents is to try. The first month, you’re probably not going to be productive. You’re going to swear a lot. You have to play with it.
I see people try for two hours and say “this doesn’t work at all.” No. It can work, and if you can’t get productive work out of it, honestly, it’s a skill issue. That’s not an insult. It’s a skill, which means you have to develop it.
I’m not saying I’m super good at this. I see people doing crazy stuff with swarms of sub-agents working independently, and I’m not at that level. But I got somewhere by playing with it. It is also just a lot of fun!
The jump in capability from recent models has been insane. A year ago with earlier models, I thought “this might work.” Obviously it was janky. But more and more people are starting to see there’s real value here.
Things change constantly. If you haven’t been paying attention, you’re probably lagging behind. There’s definitely fear of missing out: excitement about your actual work, but feeling like you’re not learning the AI stuff fast enough. There’s so much to explore, it can be overwhelming.
But it’s also fun, right? To learn new stuff. At least for me.
So go try it!
Links
Different version of this article, generated from a transcript of me ranting for 1h15m: The Pragmatic Guide to Vibe Coding: Techniques That Actually Work. It contains many more specific examples and might be helpful for some people.
devdocs-cli/METHODOLOGY.md - More condensed/formalized version of the devdocs pattern I was experimenting with
Z.ai referral link - cheaper version of Claude Max, starting from $28/year (with an additional discount). I use this for chatbot experiments and other non-critical stuff.
https://shittycodingagent.ai - A minimal, opinionated coding agent. This has been my main driver for the past month.

