The Experiment

What happens when you ask four AI CLI agents to do the simplest thing in programming — a Hello World in Java — but whisper the cursed words “TDD, DDD, BDD, and hexagonal architecture” into their ears?

The prompt:

“Create a Hello World in Java. Apply TDD, DDD, BDD and hexagonal architecture principles.”

The results are… magnificent. Every single agent took the bait and produced enterprise-grade atrocities for a program that should be one file with a single println statement.

The Contenders

| Agent | Files | Test Cases | Most Absurd Moment |
|---|---|---|---|
| Claude | 10 | 11 | Built a full spy test double framework to verify “Hello, World!” gets printed |
| Codex | 11 | 2 | Defined InMemoryGreetingPresenter twice because copy-paste is also a design pattern |
| Gemini | 6 | 2 | Implemented the Repository Pattern to retrieve a hardcoded string from… nowhere |
| Vibe | 9 | 9 | Wrote 9 separate test cases including distinct tests for null vs "" vs " ", because those are clearly different business requirements |

Highlights

Claude: The Architect

Went full hexagonal with input ports, output ports, adapters, a domain value object with equals/hashCode/toString, and nested @DisplayName test classes. Clean, professional, and absolutely unhinged for printing two words to a console.

Layers deep to reach System.out.println: 5
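
For a sense of what that layering looks like, here is a minimal sketch of the shape described above. It is not Claude's actual code: every class and interface name (Greeting, GreetUseCase, GreetingOutput, GreetService, ConsoleGreetingOutput, SpyGreetingOutput) is a hypothetical stand-in, and the real project has more layers than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Domain value object (hypothetical name) with equals/hashCode/toString.
final class Greeting {
    private final String text;

    Greeting(String text) {
        this.text = Objects.requireNonNull(text, "text");
    }

    String text() { return text; }

    @Override public boolean equals(Object o) {
        return o instanceof Greeting && ((Greeting) o).text.equals(text);
    }

    @Override public int hashCode() { return text.hashCode(); }

    @Override public String toString() { return "Greeting[" + text + "]"; }
}

// Input port: what the application offers to the outside world.
interface GreetUseCase {
    void greet();
}

// Output port: what the application needs from the outside world.
interface GreetingOutput {
    void show(Greeting greeting);
}

// Application service connecting the input port to the output port.
final class GreetService implements GreetUseCase {
    private final GreetingOutput output;

    GreetService(GreetingOutput output) { this.output = output; }

    @Override public void greet() { output.show(new Greeting("Hello, World!")); }
}

// Driven adapter: the only class that touches System.out.
final class ConsoleGreetingOutput implements GreetingOutput {
    @Override public void show(Greeting greeting) {
        System.out.println(greeting.text());
    }
}

// Spy test double: records what was shown instead of printing it.
final class SpyGreetingOutput implements GreetingOutput {
    final List<Greeting> shown = new ArrayList<>();

    @Override public void show(Greeting greeting) { shown.add(greeting); }
}
```

Calling `new GreetService(new ConsoleGreetingOutput()).greet()` prints the greeting. Swapping in the spy lets a test assert on what was sent to the output port without capturing stdout.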

Codex: The Framework Enthusiast

Not content with just JUnit, Codex brought in the entire Cucumber ecosystem: feature files, step definitions, a test runner class, the whole nine yards. For a greeting. Also used Java records, which is admittedly the least insane choice anyone made here.

Codex also quietly added support for Hello <Name> — a feature nobody asked for. The prompt said “Hello World,” but Codex decided the greeting needed to be parameterizable. Scope creep, meet overengineering.

It also gave the most polished summary of its own work when it finished, neatly listing every file it created with a clear explanation of the architecture. It even apologized for not running the tests — because mvn wasn’t installed in its sandbox environment. Points for self-awareness.

Test frameworks used: 2 (JUnit 5 + Cucumber)
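
The record-plus-parameterizable-greeting idea is small enough to sketch. This is an illustration, not Codex's code: the record name PersonalGreeting, the world() factory, and the "Hello, <name>!" punctuation are all assumptions. Records require Java 16 or later.

```java
// Hypothetical record (Java 16+); name and message format are assumptions.
record PersonalGreeting(String name) {

    // Default greeting matching what the prompt actually asked for.
    static PersonalGreeting world() {
        return new PersonalGreeting("World");
    }

    // Builds the greeting text for whatever name the record holds.
    String message() {
        return "Hello, " + name + "!";
    }
}
```

The record supplies equals, hashCode, and toString for free, which is most of what a hand-written value object would otherwise spell out.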

Gemini: The Minimalist Extremist

Produced the fewest files but compensated by introducing a GreetingRepository interface — a pattern designed for database abstraction — to return the string "Hello, World!". The repository is dutifully implemented by InMemoryGreetingRepository, a class whose entire purpose is to new up a hardcoded greeting. It’s like building a database abstraction layer for a Post-it note.

On the testing front, Gemini did write 2 tests with proper Given-When-Then structure using BDDMockito, but interpreted TDD as “write a test before the implementation” rather than “explore the design space through tests.” It ticked the ceremony boxes — just at the bare minimum, with no edge cases or negative scenarios in sight.

Empty packages created “for future use”: 2
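
To make the Post-it comparison concrete, here is roughly what a repository for a constant looks like. GreetingRepository and InMemoryGreetingRepository are the names mentioned above. The Message type and the find() method are assumptions made for this sketch, not Gemini's actual signatures.

```java
// Hypothetical domain type wrapping the greeting text.
final class Message {
    private final String text;

    Message(String text) { this.text = text; }

    String text() { return text; }
}

// Port named in the write-up; the method name is an assumption.
interface GreetingRepository {
    Message find();
}

// "In-memory" here means: constructs the same hardcoded value on every call.
final class InMemoryGreetingRepository implements GreetingRepository {
    @Override public Message find() {
        return new Message("Hello, World!");
    }
}
```

Nothing is stored, looked up, or persisted. The interface exists only so the hardcoded string can sit behind an abstraction.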

Vibe: The QA Department

Vibe took testing personally. Five unit tests AND four Cucumber scenarios, covering edge cases like null names, empty names, blank names, and names with spaces. Every method has full Javadoc explaining what it does. The createGreeting method is one line of string interpolation wrapped in 9 tests and enough documentation to fill a wiki.

Its self-generated summary is also the most earnestly corporate of the bunch, complete with emoji section headers and the unironic claim that this Hello World “serves as a foundation that can be extended with additional features while maintaining the architectural integrity.” No one tell Vibe it’s a print statement.

Ratio of test code to actual logic: approximately ∞:1
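
For scale, here is a sketch of the kind of one-liner those nine tests surround. createGreeting is the method name mentioned above. The enclosing class name, the fallback to "World" for null, empty, and blank input, and the message format are assumptions, since the write-up does not say how Vibe handled those cases. String.isBlank requires Java 11 or later.

```java
// Hypothetical holder class; only the method name comes from the write-up.
final class GreetingService {

    /**
     * Returns a greeting for the given name.
     * Assumption: null, empty, and blank names all fall back to "World".
     */
    static String createGreeting(String name) {
        String target = (name == null || name.isBlank()) ? "World" : name;
        return String.format("Hello, %s!", target);
    }
}
```

Under these assumptions, null, "" and " " collapse into the same branch, which is why testing them separately reads as thoroughness rather than necessity.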

The Scoreboard

| Category | Winner |
|---|---|
| Most architecturally pure | Claude |
| Most frameworks per line of business logic | Codex |
| Best use of the Repository Pattern for no reason | Gemini |
| Most thorough testing of a known constant | Vibe |
| Actually printing “Hello, World!” | All of them (somehow) |

What We Learned

  1. Every AI agent will happily build a cathedral when you ask for a shed
  2. The words “hexagonal architecture” are an AI’s bat signal for overengineering
  3. Somewhere, a GreetingRepository interface exists for a string that will never change
  4. No AI questioned the premise — not one said “hey, maybe this is overkill?”
  5. They all still got the right output: Hello, World!

Findings

Speed

Ranking: Codex > Claude Code > Mistral Vibe > Gemini CLI

Codex was the fastest to complete the task, with Claude Code close behind. The bottom two come with a caveat: since Mistral Vibe and Gemini CLI aren’t part of my daily toolkit, I had to manually approve permissions for file creation, command execution, and other actions along the way — time that the agents themselves can’t be blamed for. Gemini was especially permission-hungry, requiring approval at seemingly every step.

Gemini had an additional handicap: I ran it against the Gemini 3 preview model, which was clearly feeling the load. The servers were frequently busy, requiring multiple retries before requests would go through. Between the permission prompts and the server queues, Gemini’s wall-clock time was more of a patience test than a speed test.

How to Run

Each subdirectory contains a Maven project. To witness the overengineering firsthand:

cd <agent-name>  # claude, codex, gemini, or vibe
mvn clean test

Then weep gently as dozens of files collaborate to print two words.
This experiment was conducted for science, comedy, and as a cautionary tale about prompt engineering.