Why LLMs Write Bad Code (And How to Fix It)
For founders whose teams are adopting AI coding tools, this guide shows how structured acceptance criteria turn unreliable LLM output into predictable delivery — without adding process overhead.
Your team is using AI coding assistants and still missing deadlines — the problem isn't the model, it's the missing contract between the engineer and the tool. If your engineers are prompting LLMs the way they'd ask a colleague a quick question, you're burning cycles on iteration loops that compound into slipped sprints. This article is for founders and engineering leads who've adopted AI tooling but aren't seeing the productivity gains they expected. By the end, you'll have a concrete framework to enforce structured AI usage that produces reliable, reviewable output — without adding process overhead.
Why Your Team's AI Experiments Are Wasting Time
The pattern we're seeing across startup engineering teams is consistent: AI coding tools get adopted bottom-up, engineers experiment freely, and a few weeks later the productivity narrative quietly dies. Standups still run long. PRs still come back half-baked. Deadlines still slip.
The root cause is almost never the model. As this analysis from Katana Quant makes clear, LLMs don't write correct code — they write plausible code. The distinction matters enormously. A model optimizes for coherent output, not for satisfying requirements it was never given. When an engineer prompts with "write a function to process payments," the model has no idea what "correct" means in that context. It fills the gap with assumptions, and those assumptions become bugs.
This is the founder time tax in a new form. Instead of founders being pulled into delivery decisions, they're now absorbing the cost of AI rework — through delayed features, surprise bugs in review, and engineers who feel productive but aren't shipping.
The fix is structural, not technical. You don't need a better model. You need acceptance criteria before the prompt.
What "Acceptance Criteria First" Actually Means in Practice
Acceptance criteria aren't a new concept — they come from agile story writing. But most teams treat them as a documentation artifact, not a prompt input. The shift is treating your AC as the first line of every AI coding request.
Here's what that looks like in practice:
Without AC (what most teams do):
```
Write a function that validates a user's email address.
```
With AC (what high-output teams do):
```
Write a function that validates a user's email address.

Acceptance criteria:
- Returns true for standard RFC 5321 compliant addresses
- Returns false for addresses missing @ or domain
- Returns false for addresses with consecutive dots
- Handles null and empty string inputs without throwing
- Includes unit tests covering each case above
- No external dependencies
```
The difference in output quality is not marginal — it's categorical. The model now has a definition of "done" that it can optimize toward. More importantly, you have a definition of done that makes the output reviewable without running the code.
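To make "reviewable without running the code" concrete, here is a minimal sketch of output that can be checked line by line against those criteria. It is illustrative rather than actual model output: the function name, the simplified regexes, and the Jest-style test globals are assumptions, and a production validator would need a fuller treatment of RFC 5321 syntax.

```typescript
// Illustrative sketch only: a pragmatic validator, not a full RFC 5321 parser.
export function isValidEmail(input: string | null | undefined): boolean {
  // AC: handles null and empty string inputs without throwing
  if (!input) return false;

  // AC: returns false for addresses with consecutive dots
  if (input.includes("..")) return false;

  // AC: returns false for addresses missing @ or domain
  const parts = input.split("@");
  if (parts.length !== 2) return false;
  const [local, domain] = parts;
  if (!local || !domain) return false;

  // Domain needs at least one dot-separated label pair, e.g. "example.com"
  if (!/^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$/.test(domain)) return false;

  // Local part: a simplified subset of the characters RFC 5321 allows
  return /^[A-Za-z0-9._%+-]+$/.test(local);
}

// AC: unit tests covering each case above. Assumes a Jest/Vitest-style runner
// already in the project; no new runtime dependencies are introduced.
describe("isValidEmail", () => {
  it("accepts a standard address", () => {
    expect(isValidEmail("a.user@example.com")).toBe(true);
  });
  it("rejects an address missing @", () => {
    expect(isValidEmail("userexample.com")).toBe(false);
  });
  it("rejects an address missing a domain", () => {
    expect(isValidEmail("user@")).toBe(false);
  });
  it("rejects consecutive dots", () => {
    expect(isValidEmail("a..user@example.com")).toBe(false);
  });
  it("handles null and empty input without throwing", () => {
    expect(isValidEmail(null)).toBe(false);
    expect(isValidEmail("")).toBe(false);
  });
});
```

A reviewer can now map every test back to a stated criterion instead of guessing what the engineer intended the function to do.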
This is exactly the kind of delivery standard gap our engineering process work at 10ex is designed to close — before it shows up as a slipped sprint.
For a broader look at how AI tooling fits into your startup's technical decision-making, see How to Evaluate New AI Models for Your Startup.
A Framework Your Team Can Adopt This Week
This is the three-layer structure we recommend for startup teams embedding AI into their coding workflow. It works whether your team uses Copilot, Cursor, Claude, or any other assistant.
Layer 1: The Prompt Contract
Every AI coding prompt should include four components:
| Component | What It Contains | Example |
|---|---|---|
| Context | What system/module this touches | "This is in our billing service, Stripe integration" |
| Task | What needs to be built | "Add a retry handler for failed webhook deliveries" |
| Acceptance Criteria | Testable conditions for "done" | "Retries 3x with exponential backoff, logs each attempt, fails gracefully" |
| Constraints | What the output must NOT do | "No new dependencies, must be idempotent" |
This isn't bureaucracy — it's the minimum viable spec. A senior engineer holds all four of these in their head before writing a line of code. You're just making it explicit for the model.
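If it helps to make the contract concrete, here is a minimal sketch of the four fields as a typed structure, filled in with the example rows from the table above. The interface and function names are hypothetical; most teams keep this as a shared doc snippet rather than code, but typing it shows exactly which field a weak prompt is missing.

```typescript
// Sketch of the four-field prompt contract. Names are illustrative.
interface PromptContract {
  context: string;              // what system/module this touches
  task: string;                 // what needs to be built
  acceptanceCriteria: string[]; // testable conditions for "done"
  constraints: string[];        // what the output must NOT do
}

function renderPrompt(c: PromptContract): string {
  return [
    `Context: ${c.context}`,
    `Task: ${c.task}`,
    "Acceptance criteria:",
    ...c.acceptanceCriteria.map((ac) => `- ${ac}`),
    "Constraints:",
    ...c.constraints.map((k) => `- ${k}`),
  ].join("\n");
}

// Filled in from the table above.
const webhookRetryPrompt = renderPrompt({
  context: "This is in our billing service, Stripe integration",
  task: "Add a retry handler for failed webhook deliveries",
  acceptanceCriteria: [
    "Retries 3x with exponential backoff",
    "Logs each attempt",
    "Fails gracefully after the final attempt",
  ],
  constraints: ["No new dependencies", "Must be idempotent"],
});
```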
Layer 2: The PR Gate
The prompt contract only works if it's enforced at review time. Add one question to your PR template:
If AI tooling was used: paste the acceptance criteria that were provided in the prompt.
This does two things. First, it creates accountability — engineers who skipped the AC step have to admit it. Second, it gives reviewers a checklist. Instead of reading code to infer intent, they're verifying output against stated criteria. Review cycles get faster and more consistent.
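For teams on GitHub, the gate can live directly in the repository's pull request template. A minimal sketch, with wording you should adapt to your own review checklist:

```markdown
<!-- .github/pull_request_template.md (excerpt) -->
## AI tooling disclosure

- [ ] No AI coding tools were used on this change
- [ ] AI tooling was used: the acceptance criteria provided in the prompt are pasted below

### Acceptance criteria given to the model

<!-- Paste the AC from your prompt here. Reviewers verify the diff against these,
     not against inferred intent. -->
```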
Teams that skip this gate report the same failure mode: AI usage becomes invisible, quality variance stays high, and the productivity gains never materialize in delivery metrics.
Layer 3: The Iteration Budget
Here's the insight most AI productivity content ignores: if a prompt requires more than two rounds of iteration to produce usable output, the acceptance criteria were underspecified — not the model.
Set an explicit team norm: if you're on your third prompt iteration for the same task, stop and write better AC before continuing. This reframes the problem correctly. Engineers stop blaming the tool and start improving their own specification skills — which makes them better engineers regardless of AI.
Track iteration counts informally in standups for the first few weeks. The signal is directional, not precise. Teams that adopt this norm typically see iteration loops collapse within a sprint.
The relationship between delivery constraints and team output is something we explore further in Is the Iron Triangle Finally Dead? — worth reading alongside this framework.
What Good Looks Like — and Where Teams Go Wrong
Good: An engineer opens a task, writes four bullet-point acceptance criteria before touching any AI tool, pastes them into the prompt, gets output, verifies against the criteria, and opens a PR with the AC attached. Total time: less than writing the code manually, with higher confidence in the result.
Common failure mode #1: Teams adopt the framework for new features but not for bug fixes. Bug fixes are where AI output is most dangerous — the model doesn't know what behavior is correct, only what behavior is different. AC matters more for bugs, not less.
Common failure mode #2: Acceptance criteria get written after the prompt, to justify output that already exists. This is rationalization, not specification. The order matters — AC before prompt, always.
Common failure mode #3: Leadership mandates the framework but doesn't model it. If the tech lead is still prompting casually, the team will follow. This is a behavior change that requires visible adoption from the top of the engineering org.
The Broader Signal: AI Amplifies Your Existing Process Debt
This pattern matters beyond the immediate productivity question because of what it reveals about your engineering process overall. Teams that struggle to write acceptance criteria for AI prompts are the same teams that struggle to write acceptance criteria for sprint stories. The AI just makes the gap visible faster.
In our work with startup engineering orgs, the teams that get the most out of AI tooling are the ones that already had disciplined delivery practices — clear story definitions, reviewable PRs, explicit definitions of done. AI accelerates what's already working. It amplifies what isn't.
This is why the fix isn't a prompt template — it's a process standard. The template is just the artifact that makes the standard visible.
Your 7-Day Action Plan
Day 1–2: Audit one recent PR where AI tooling was used. Ask the engineer: what acceptance criteria did you give the model? If they can't answer, you've confirmed the gap.
Day 3: Draft a four-field prompt template (Context / Task / AC / Constraints) and share it with your engineering team in Slack or Notion. Keep it one page.
Day 4–5: Add the AC disclosure question to your PR template. Make it optional for now — you're measuring adoption, not mandating it yet.
Day 6–7: Run a 15-minute team retro specifically on AI usage. Ask: where did iteration loops run long this sprint? What would better AC have changed? Let the team surface the pattern themselves.
By the end of the week, you'll have a baseline and a team that's starting to think about AI output as something to specify, not just generate.
If this kind of process work — turning AI adoption from an experiment into a delivery multiplier — is something your team needs embedded leadership on, this is exactly the type of standard we help startup engineering orgs build and enforce. See how 10ex approaches engineering delivery.