
In the last edition of We Should Do Something With AI, I discussed my Daily Doom experiment. How long can a coding agent build software before it all breaks down?
I set up a system to autonomously build a Doom clone. Every night, the system would pick up any open GitHub issues and implement them. If there were no issues, Claude would make something up. New features, new game mechanics. No humans involved.
The first few days were rough. If you look at the git history, you’ll catch me intervening here and there. But after a few days, the system became fully autonomous. And now, as I write this, we are on day 40 of silicon John Carmack building the engine that lets us fight the infernal hordes.
So, what did we learn over the last few weeks?
AI is an autonomous coder.
Let’s start with the big one. While I expected the AI slop to break down pretty fast, I’m pleasantly surprised to report that over those 40 days, I only had to flag one game-crashing bug. Think about how crazy that is! A machine has been building working software for 40 days straight. Claude Code has gotten a lot better at producing reliable code.
AI is an uninspired product manager.
Checking if Claude could continuously build a game was half the experiment. The fun (and more interesting) part was to see if the LLM could come up with ideas. I explicitly set out to let the machine roam free. Most of my interventions were related to bugs. I remember waking up one morning to a notification that, instead of a single gun, I now had 6 different weapons to destroy ghouls. That felt extremely impressive for a cold silicon mind.
But more often than not, the ideas Claude came up with were safe. Contained. Bland. The coding agent would rave about adding more particles to the lava pools it added earlier on. That’s a nice cosmetic improvement, but no big leap. Every addition was a small, uncreative, gradual increment.
To make the game stand out a bit, human intervention was needed. I pointed Claude to free sprites to use. I asked it to build more than one map. The small details are machine-driven, but the improvements that mattered required a human. That gives us a hint of where humans will stay in the loop.
QA remains the bottleneck.
Every morning, I playtested the game to check out the new mechanics and to report the occasional bug. That was easy and low-effort, until I got tired of the single map. I asked Claude to come up with 5 distinct maps, and all of a sudden, testing got a lot harder. I went from a 5-minute daily check to half an hour a day of speedrunning the game.
Coding agents, on average, write more test cases than their carbon-based colleagues. But as any TDD fanatic can tell you, you can have 100% code coverage and still have built the wrong thing. There is a difference between verifying that software does what it was instructed to do and verifying that it does the right thing.
So, all tests were green, but some doors wouldn’t open. Some demons would spawn in a room where they could never be killed. Only humans can catch those game-breaking bugs.
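To make that coverage gap concrete, here is a minimal Python sketch. The names and data structures are invented for illustration; they are not from Daily Doom’s actual code. The test exercises every line of the spawn function and passes, yet the demon it places can never be reached.

```python
# Hypothetical sketch: full coverage, passing test, broken game.

def spawn_enemy(rooms, room_id):
    """Spawn an enemy in the given room and return it."""
    enemy = {"room": room_id, "hp": 100}
    rooms[room_id]["enemies"].append(enemy)
    return enemy

def test_spawn_enemy():
    # A room with no doors at all: sealed off from the player.
    rooms = {0: {"enemies": [], "doors": []}}
    enemy = spawn_enemy(rooms, 0)
    # This assertion covers every line of spawn_enemy and passes.
    assert enemy in rooms[0]["enemies"]

test_spawn_enemy()
```

The test only checks that the code did what it was told: put an enemy in room 0. Nothing checks whether room 0 is reachable, so the suite stays green while the level is unwinnable. Only a playtester notices.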
This gives us the final clue of where AI-powered software development is headed.
We are far away from a future where software engineering resembles lights-out manufacturing. At the same time, I’m convinced we are close to embracing autonomous development pipelines. A machine will read our tickets and automatically give them a first go. Humans might not even be involved in reviewing their coding agent’s output. But for the foreseeable future, we will be involved in the first and last gates of the process: telling the machine what to do and verifying whether our silicon colleagues’ output is good enough to go live. Boilerplate features and code quality seem pretty much solved. But creative ideas and exploratory testing still require a human.
By no means is Daily Doom a good game. But it is better than it has any right to be, given how low-effort it has been on my side. So, for the coming weeks, I will let it do its thing. I invite you to play it and come up with your own crazy, creative additions. Seriously, I’d love to incorporate your feedback in the last few weeks of this experiment. Let me know by replying to this email.
And this experiment, while fun and whimsical, gives us a glimpse into the future of software development. It is very likely that our machines will automatically fix Sentry bugs and performance issues. It is unlikely that they will build a product that sets us apart from the competition.
For now, we humans are still very much needed.
