We gave AI agents our bug backlog—here's what we learned

Give engineers a free week and a stack of small, gnarly problems and they do not chase tickets. They build tools. They write the scripts they have been threatening to write for a year. They pull forward fixes that have been quietly rotting in the backlog because no single feature ever justified the detour.

We called it Cleaning with Claude. May 4 to May 8. Our whole engineering team participated. On-call and incident response ran as normal. The rest of us chased tech debt.

One unbreakable rule: AI agents had to drive.

Claude Code, Codex, Cursor in agent mode, Devin, all fair game. But "open the IDE and hit tab" did not count. The engineer was the navigator. The agent was expected to investigate, reason, edit, test, and propose work a human could review.

We thought we were running a bug bash. We ended up with the first sketch of a workflow we had been waiting for somebody else to invent.

The numbers

In five days: 267 cleanup tickets pulled into scope, 217 closed (about 81%). Twelve demo videos submitted on AI workflows, agents, and small skills people built mid-week.

That is the easy thing to measure. It is not the most interesting thing that happened.

The workflow that built itself

The shape of a new bug pipeline emerged while engineers were supposed to be cleaning house. It went like this.

Yodel captures the bug where the user sees it. One of our PMs built it as a Chrome extension. Cmd+Shift+Y, type what you saw, hit submit. Screenshot, console logs, network errors, and browser diagnostics bundle in automatically. No "please file a Linear ticket using the template" friction.

Linear receives the report.

Snippy, an internal triage tool, picks it up. Snippy reads the Yodel report and routes it to the correct squad's board.

Baloo takes it from there. Baloo is an agent we built. It runs on Claude with Devin doing the heavier codebase context. Every ten minutes, Baloo checks the triage queue, pulls context from Notion, Linear, Slack, and the codebase, and attaches a vetted context packet to the ticket.

If Baloo is more than 80% confident, it opens three draft PRs: a minimal version, a maximal version, and a Goldilocks version. It reflects on the tradeoffs, recommends one, and closes the other two without deleting them. The losing PRs stay linked on the ticket so a reviewer can see the alternatives.

If Baloo is less than 80% confident, it says so in a comment and leaves the ticket with enough context that a human starts the work several steps ahead of where they would otherwise begin.
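
The confidence gate can be sketched in a few lines. This is a minimal illustration, not Baloo's actual code: the 80% threshold and the three variants come from the description above, while `Ticket`, `handle_ticket`, the return values, and the choice to treat exactly 80% as a handoff are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# The 80% threshold is from the article; every name below is hypothetical.
CONFIDENCE_THRESHOLD = 0.80
VARIANTS = ("minimal", "maximal", "goldilocks")


@dataclass
class Ticket:
    id: str
    comments: list[str] = field(default_factory=list)
    draft_prs: list[dict] = field(default_factory=list)


def handle_ticket(ticket: Ticket, confidence: float,
                  recommended: str = "goldilocks") -> str:
    """Route one triaged ticket based on the agent's self-reported confidence."""
    if confidence > CONFIDENCE_THRESHOLD:
        for variant in VARIANTS:
            ticket.draft_prs.append({
                "variant": variant,
                # Losing drafts are closed, not deleted, so they stay
                # linked on the ticket for the reviewer.
                "state": "open" if variant == recommended else "closed",
            })
        return "drafted"

    # Assumption: exactly 80% is treated as "not confident enough".
    ticket.comments.append(
        f"Confidence {confidence:.0%} is not above the threshold; "
        "context attached for a human to pick up."
    )
    return "handed_off"
```

Calling `handle_ticket(Ticket(id="BUG-1"), 0.9)` leaves three draft PRs on the ticket with only the recommended one open. Calling it with `0.6` leaves no PRs and one explanatory comment.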

Forget the names for a second. What collapsed was the distance between "a user noticed something is off" and "an engineer is solving the right problem." The boring middle started moving on its own: capture, routing, context gathering, first pass.

What did not work

Not every ticket was a good fit for an agent.

Some bugs needed product judgment we had not written down anywhere, which left agents guessing. Some agent-generated changes were too broad: a one-line fix would arrive as a five-file refactor. Some context packets looked convincing and missed the actual failure mode. A few PRs solved the wrong half of the problem because the agent latched onto the loudest signal in the logs instead of the root cause.

The three-PR approach also created real review overhead when the right answer was obvious from the start. Sometimes Baloo handed us three flavors of the same fix and we only needed one.

By the end of the week we had a better feel for which problems agents should drive, which ones they should prepare, and which ones still need a human to take the wheel immediately. That judgment is a decaying skill and has to be practiced regularly. The terrain shifts as new models and agent harnesses ship.

If you want to try this with your team

Commit and block a full week. It has to be that black or white. If the rule is something like "spend 20% of your time on AI workflows," engineers get pulled back into feature work and end up spending somewhere between 0% and 5%.

Pick small, high-friction problems. Agents shine when the problem is bounded but annoying. Keep on-call and incident response running normally so nothing customer-facing degrades.

Require agent-driving, not agent-assisted typing.

Review everything. Agents can drive. Humans still own correctness.

Reward workflow improvements, not just ticket count. The best output may be tools, scripts, and routing systems that did not exist on Monday.

Let engineers build the missing machinery. Baloo's three-PR approach started as a skill I built for local use in Claude Code, then got wedged into Baloo's workflow mid-week. Think Katamari Damacy, not Rampage. Do not overdesign before the week starts. Most of what makes this work is emergent.

What surprised me

Going in, I assumed the headline would be the ticket count. The count is real and it matters. It is also the least interesting artifact of the week.

The artifact I would point at if someone asked what made the week valuable is not the tickets and not even the new triage pipeline. It is that the team now has a shared idea of what our workflow looks like when we treat agents as part of the team instead of a side project. That is better for long-term growth than any amount of tech debt paid off, although that was wildly successful too.

Our team shipped AI workflows for observability and alerting, automating dashboard creation and tuning. That was not on anyone's list. They just did it.

Agents earned their keep by absorbing the boring middle between noticing a problem and having a credible fix in review. Engineers stopped fighting their own process long enough to design around the tools instead of around the calendar.

Do not bolt agents onto the old workflow and call it transformation. Change the rules of the week so engineers have to design around agents, then watch what they build.

We're building teams that work this way. If that is the kind of engineering culture you want, we're hiring.