Multi-Protocol LLM Debugging Agent

AI Agents, LLM Systems, Benchmarking, Python, OpenAI, Debugging | January 2026 - March 2026

I built an LLM debugging agent framework from scratch to benchmark systematically how different agentic workflow protocols, models, and fix strategies perform on a structured set of real debugging tasks. The goal was to move beyond anecdotal comparisons and produce reproducible evidence of which configurations work, and at what cost.

Agent-Computer Interface

The framework is built around a minimal Agent-Computer Interface that exposes 5 tools: file view, file edit, search, test runner, and directory listing. Together these tools give the agent what it needs to localize a bug, generate a fix, and verify that the fix passes the tests, without exposing extra surface area that could confuse its reasoning.
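As a rough illustration of what a five-tool interface like this could look like, here is a minimal sketch. It is not the framework's actual code: the function names, the `TOOLS` registry, the `dispatch` helper, the restriction of search to `*.py` files, and the use of pytest as the test runner are all assumptions made for the example.

```python
import subprocess
import sys
from pathlib import Path


def view_file(path: str) -> str:
    """Return the full text of a file."""
    return Path(path).read_text()


def edit_file(path: str, old: str, new: str) -> str:
    """Replace the first occurrence of `old` with `new` in a file."""
    text = Path(path).read_text()
    if old not in text:
        return "error: target text not found"
    Path(path).write_text(text.replace(old, new, 1))
    return "ok"


def search(root: str, needle: str) -> list[str]:
    """List 'path:line: text' hits for a substring in .py files under root."""
    hits = []
    for p in sorted(Path(root).rglob("*.py")):
        for i, line in enumerate(p.read_text().splitlines(), 1):
            if needle in line:
                hits.append(f"{p}:{i}: {line.strip()}")
    return hits


def list_dir(path: str) -> list[str]:
    """Return the sorted entry names of a directory."""
    return sorted(child.name for child in Path(path).iterdir())


def run_tests(path: str) -> str:
    """Run the tests under `path` (assumes pytest is installed)."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", path],
        capture_output=True,
        text=True,
    )
    return proc.stdout + proc.stderr


# Hypothetical registry: the only surface the agent is allowed to call.
TOOLS = {
    "view_file": view_file,
    "edit_file": edit_file,
    "search": search,
    "list_dir": list_dir,
    "run_tests": run_tests,
}


def dispatch(name: str, **kwargs):
    """Route a tool call from the model to the matching function."""
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    return TOOLS[name](**kwargs)
```

Keeping every call behind a single `dispatch` function means an unknown tool name comes back as an error string the model can read, rather than as an exception that ends the run.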

The same interface runs identically across 4 LLM providers: OpenAI, Ollama, DeepSeek, and TritonAI. Every run tracks token usage and estimates cost, so results can be compared not just by accuracy but by cost efficiency per task type and provider.
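Per-run cost tracking of this kind can be sketched as follows. The model names and per-token prices are placeholders rather than real provider rates, and the `Usage` class and `estimate_cost` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

# Placeholder USD prices per 1M tokens; NOT real provider rates.
PRICES_PER_MILLION = {
    "example-hosted-model": {"input": 1.00, "output": 2.00},
    "example-local-model": {"input": 0.00, "output": 0.00},
}


@dataclass
class Usage:
    """Running token totals for a single agent run."""
    input_tokens: int = 0
    output_tokens: int = 0

    def add(self, input_tokens: int, output_tokens: int) -> None:
        """Accumulate the usage reported for one model call."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens


def estimate_cost(model: str, usage: Usage) -> float:
    """Estimate the USD cost of a run from its token totals."""
    price = PRICES_PER_MILLION[model]
    return (
        usage.input_tokens * price["input"]
        + usage.output_tokens * price["output"]
    ) / 1_000_000
```

Accumulating usage per call and pricing it once at the end keeps the estimate independent of which provider served the run.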

Protocol Benchmarking

I benchmarked 3 agentic workflow protocols (ReAct, Reflexion, and act-only) across 90 debugging tasks. Each protocol takes a different approach to how the agent reasons and recovers from failed attempts. ReAct interleaves reasoning and action steps. Reflexion adds a self-critique loop after failures. Act-only skips explicit reasoning and goes directly to tool calls.
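The difference between the three protocols can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the framework's implementation: `llm` and `env` are hypothetical callables standing in for a provider client and the tool environment, and the prompt layout is invented for the example.

```python
from typing import Callable

LLM = Callable[[str], str]               # prompt -> completion (hypothetical)
Env = Callable[[str], tuple[str, bool]]  # action -> (observation, solved)


def run_attempt(llm: LLM, env: Env, task: str, use_thoughts: bool,
                memory: str = "", max_steps: int = 5):
    """One attempt at a task; returns (solved, history)."""
    history = [f"Task: {task}"]
    if memory:
        history.append(f"Lessons from earlier attempts: {memory}")
    for _ in range(max_steps):
        if use_thoughts:
            # ReAct-style: ask for a reasoning step before each action.
            thought = llm("\n".join(history) + "\nThought:")
            history.append(f"Thought: {thought}")
        action = llm("\n".join(history) + "\nAction:")
        observation, solved = env(action)
        history += [f"Action: {action}", f"Observation: {observation}"]
        if solved:
            return True, history
    return False, history


def run_protocol(protocol: str, llm: LLM, env: Env, task: str,
                 max_attempts: int = 2) -> bool:
    """Run one task under the named protocol and report success."""
    if protocol == "act-only":
        return run_attempt(llm, env, task, use_thoughts=False)[0]
    if protocol == "react":
        return run_attempt(llm, env, task, use_thoughts=True)[0]
    if protocol == "reflexion":
        memory = ""
        for _ in range(max_attempts):
            solved, history = run_attempt(llm, env, task, True, memory)
            if solved:
                return True
            # Self-critique after a failed attempt, carried into the retry.
            memory = llm("\n".join(history) + "\nReflection:")
        return False
    raise ValueError(f"unknown protocol: {protocol}")
```

In this sketch, act-only and ReAct differ only in whether a thought is requested before each action, while Reflexion wraps the ReAct attempt in a retry loop that feeds a written reflection into the next attempt.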

Every run is logged as a structured JSON trajectory capturing the full sequence of tool calls, reasoning steps, and outcomes. Automated report generation processes these trajectories across all protocol and model combinations, making the analysis reproducible without manual data wrangling.
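A trajectory record and a small aggregation step might look like the sketch below. The field names (`task_id`, `kind`, `solved`, and so on) and the `success_rates` helper are assumptions for illustration; the actual schema is not specified here.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Step:
    kind: str                   # e.g. "thought", "tool_call", "observation"
    content: str
    tool: Optional[str] = None  # set only for tool calls


@dataclass
class Trajectory:
    task_id: str
    protocol: str
    model: str
    steps: list[Step] = field(default_factory=list)
    solved: bool = False

    def to_json(self) -> str:
        """Serialize the whole run, including nested steps, to JSON."""
        return json.dumps(asdict(self), indent=2)


def success_rates(records: list[dict]) -> dict:
    """Aggregate parsed trajectories into a rate per (protocol, model)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [solved count, run count]
    for r in records:
        key = (r["protocol"], r["model"])
        totals[key][0] += int(r["solved"])
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}
```

Because the report step reads only the serialized records, it can be rerun over stored trajectories without repeating any model calls.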

Execution Tuning at Scale

Identifying the right configuration required 2,100+ runs across combinations of protocols, models, localization strategies, and fix generation approaches. I refined bug localization context from function-level to line-level precision, which improved fix accuracy for certain bug types by giving the model a tighter target. I also tested 4 distinct fix generation strategies to understand how the way the agent generates patches interacts with each protocol and model.
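Enumerating a configuration space like this is a straightforward Cartesian product. In the sketch below, the fix strategy names are placeholders (the four strategies are not named here), providers stand in for models, and the resulting grid size is a property of this example only, not a breakdown of the 2,100+ runs.

```python
from itertools import product

PROTOCOLS = ["react", "reflexion", "act-only"]
PROVIDERS = ["openai", "ollama", "deepseek", "tritonai"]
LOCALIZATION = ["function-level", "line-level"]
# Placeholder names: the actual fix strategies are not listed in this write-up.
FIX_STRATEGIES = ["strategy-1", "strategy-2", "strategy-3", "strategy-4"]


def build_grid() -> list[dict]:
    """Return one config dict per combination of the four dimensions."""
    return [
        {
            "protocol": protocol,
            "provider": provider,
            "localization": localization,
            "fix_strategy": fix_strategy,
        }
        for protocol, provider, localization, fix_strategy in product(
            PROTOCOLS, PROVIDERS, LOCALIZATION, FIX_STRATEGIES
        )
    ]
```

Each dict can then be paired with a task ID to define a single run, which keeps the sweep declarative and easy to resume.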

The result is a structured map of which protocol and strategy combinations are most effective per bug type, and which configurations offer the best tradeoff between accuracy and cost across the 4 supported providers.
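One common way to read an accuracy-versus-cost tradeoff is to keep only the configurations that no other configuration beats on both axes (a Pareto front). The sketch below assumes each configuration has already been reduced to a `(name, accuracy, cost)` tuple; the `pareto_front` helper is hypothetical and is not necessarily how the analysis here was done.

```python
def pareto_front(configs: list[tuple[str, float, float]]) -> list[str]:
    """Return names of configs not dominated on (accuracy, cost).

    A config is dominated if another one is at least as accurate and at
    least as cheap, and strictly better on one of the two.
    """
    front = []
    for name, acc, cost in configs:
        dominated = any(
            other_acc >= acc
            and other_cost <= cost
            and (other_acc > acc or other_cost < cost)
            for _, other_acc, other_cost in configs
        )
        if not dominated:
            front.append(name)
    return front
```

Anything left off the front is matched or beaten on both accuracy and cost by some other configuration, so it can be dropped from the comparison.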