Fixing bugs reduces to having a reproduction test case.
A root cause is only useful once it can be reproduced. The practical debugging loop is to localize the fault, turn the alert into a reproduction, and test hypotheses until the fix is validated.

CI failures are easy mode.
One red test. One branch. One stack trace. Usually, you can root-cause the failure, patch it, and verify the fix within an hour.
Production alerts are different.
You can burn hours reconstructing distributed system state from partial logs, stale traces, and "what changed recently?" After that, the fix is often still just a hypothesis.
The only reliable way through is to turn the alert into a reproduction test case.
localize the fault
The first question is not "what is the fix?"
Not yet.
The first question is: what made the system leave the correct path?
That could be:
- An upstream API change.
- Unexpected user input.
- A scheduler edge case.
- A race between two state transitions.
- A missing assumption in a background job.
At this stage, the goal is not to be clever. The goal is to collect enough evidence to test a hypothesis.
Logs. Metrics. Traces. Recent changes. Customer reports. Issue comments. The one person who remembers the weird deploy from last Friday.
Getting the right team members involved early helps, but only if the conversation keeps moving toward a testable claim:
If this hypothesis is true, this specific behavior should reproduce.
That sentence is the bridge from root cause analysis to engineering work.
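As a minimal sketch of what a testable claim can look like, take the "race between two state transitions" case from the list above. Everything here is hypothetical: the `Order` class, the `begin_transition` helper, and the state names are invented for illustration. The hypothesis is "if both transitions check the state before either one writes, both succeed", and the reproduction forces exactly that interleaving instead of waiting for timing to cooperate.

```python
class Order:
    """Hypothetical entity with a single state field."""

    def __init__(self):
        self.state = "paid"


def begin_transition(order, allowed_from, target):
    """Check phase: read the state now, return the write phase as a callable.

    Splitting check and write lets the test choose the interleaving
    deterministically, without threads or sleeps.
    """
    observed = order.state  # stale by the time commit() runs

    def commit():
        # Deliberate bug under test: decides on the stale observation
        # instead of re-reading order.state.
        if observed == allowed_from:
            order.state = target
            return True
        return False

    return commit


def reproduce_cancel_ship_race():
    """Return True if the hypothesized behavior reproduces."""
    order = Order()
    # Both transitions run their check before either one writes.
    commit_ship = begin_transition(order, "paid", "shipped")
    commit_cancel = begin_transition(order, "paid", "cancelled")
    shipped = commit_ship()
    cancelled = commit_cancel()
    # An order that was both shipped and cancelled is the bug.
    return shipped and cancelled
```

If `reproduce_cancel_ship_race()` returns `True`, the hypothesis has survived a real test. If it returns `False`, the claim was wrong or the model of the code path was incomplete, and either result is more useful than another hour of reading logs.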
build the reproduction
At VMware, on the vSAN team, the brutal bugs were network partitions and split-brain failures.
Concurrency. Consensus. State transitions. Timing.
Those bugs rarely arrived as neat unit-test failures. They arrived as traces, logs, and reports from systems that had already moved on.
Eventually, we got a better grip on them by treating the reproduction as the product of the investigation, not as a nice-to-have after the fact. A ZooKeeper-like process recorded major system events. We pulled those traces, replayed them in code, added logs, and kept experimenting to narrow the scenario until the bug showed up.
The reproduction did not have to be perfect on the first pass. It had to be specific enough to make a hypothesis answerable.
That is the critical shift.
You are not asking an engineer to "look into it." You are asking the system to prove or disprove a theory.
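The replay idea can be sketched in a few lines. This is a toy model, not the vSAN code: the event names, the `ToyCluster` class, and the `require_quorum` flag (standing in for "unpatched" versus "patched" election logic) are all assumptions made for illustration. The sketch feeds a recorded event sequence into the model and reports the first event after which the "at most one leader" invariant breaks.

```python
class ToyCluster:
    """Hypothetical three-node model with a deliberately weak election rule."""

    def __init__(self, nodes, require_quorum):
        self.nodes = set(nodes)
        self.require_quorum = require_quorum
        # Before any partition, every node can reach every node.
        self.reachable = {node: set(nodes) for node in nodes}
        self.leaders = set()

    def apply(self, event):
        kind, *args = event
        if kind == "partition":
            # Each argument is one side of the partition.
            for side in args:
                for node in side:
                    self.reachable[node] = set(side)
        elif kind == "elect":
            (node,) = args
            has_quorum = len(self.reachable[node]) > len(self.nodes) / 2
            # With require_quorum=False, a minority node can still win.
            if has_quorum or not self.require_quorum:
                self.leaders.add(node)


def first_violation(trace, require_quorum):
    """Replay the trace; return the index of the first split-brain event."""
    cluster = ToyCluster({"a", "b", "c"}, require_quorum)
    for index, event in enumerate(trace):
        cluster.apply(event)
        if len(cluster.leaders) > 1:
            return index
    return None


# Hypothetical trace, shaped like what an event recorder might capture.
RECORDED_TRACE = [
    ("elect", "a"),
    ("partition", {"a", "b"}, {"c"}),
    ("elect", "c"),
]
```

Under these assumptions, `first_violation(RECORDED_TRACE, require_quorum=False)` points at the second election, and the same trace with `require_quorum=True` returns `None`. Narrowing the scenario then means dropping events from the trace and checking that the violation still appears.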
run the loop
Best case, the evidence gives you the exact reproduction.
Reality is messier. You rarely get 100% information from production. Logs are sampled. Traces expire. Deploy history is incomplete. The triggering input was not preserved. The bug depends on timing.
So you recreate the most promising hypothesis first.
If it fails, adjust.
Try the next hypothesis.
The loop looks like this:
- Localize the suspicious path.
- Convert the hypothesis into a reproduction.
- Run the reproduction against the broken code.
- If it does not fail, refine the hypothesis.
- If it does fail, patch the code.
- Run the reproduction again against the fixed code.
- Check that unrelated behavior did not change.
This is where teams often punt.
Add logs. Redeploy. Wait for recurrence.
Sometimes that is unavoidable. But it is also a sign that the investigation did not yet produce a testable artifact. The team has more information than before, but not enough to validate a fix.
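The loop listed above can be sketched as a small driver. The functions and hypotheses here are invented for illustration: `broken_average` and `fixed_average` stand in for the code before and after a patch, and each hypothesis is paired with a reproduction that returns `True` when the failure shows up.

```python
def broken_average(values):
    """Hypothetical production code: crashes on an empty batch."""
    return sum(values) / len(values)


def fixed_average(values):
    """Hypothetical patch: treats an empty batch as 0.0."""
    return sum(values) / len(values) if values else 0.0


def crashes_on(values):
    """Build a reproduction: True if the implementation crashes on values."""
    def repro(impl):
        try:
            impl(values)
        except ZeroDivisionError:
            return True
        return False
    return repro


# Ordered from most to least promising; both claims are hypothetical.
HYPOTHESES = [
    ("negative values crash the job", crashes_on([-1.0, -2.0])),
    ("empty batches crash the job", crashes_on([])),
]


def unrelated_behavior_unchanged(impl):
    """Regression check on a case the patch should not affect."""
    return impl([2.0, 4.0]) == 3.0


def run_loop(broken, fixed):
    """Return the first hypothesis that reproduces and is fixed by the patch."""
    for claim, repro in HYPOTHESES:
        if not repro(broken):
            continue  # did not fail on broken code: refine or move on
        if repro(fixed):
            continue  # still fails after the patch: the patch is wrong
        if unrelated_behavior_unchanged(fixed):
            return claim
    return None
```

Calling `run_loop(broken_average, fixed_average)` skips the first hypothesis because it never fails against the broken code, then accepts the second one because its reproduction fails before the patch, passes after it, and leaves the unrelated case alone.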
where FixBugs fits
The FixBugs workflow is built around that loop:
Alert -> FixBugs -> Hypothesis <-> Reproduce -> Fix -> Validate
The point is to push the grunt work into frontier models while keeping the engineering standard intact.
That means using models for:
- Log analysis.
- Code path search.
- Hypothesis generation.
- Reproduction scaffolding.
- Validation planning.
- Adjustment based on failed evidence.
The useful output is not "AI says this is the bug."
The useful output is a reproduction that fails before the patch and passes after it.
Once you have that, the fix is no longer guesswork.