Same Agent Workflow, Three Model Routes: A Real Crazyrouter Benchmark

Crazyrouter Team
June 3, 2026

Dynamic workflows are exciting, but they introduce a practical question: which model should run each step?

If a workflow has a planner, implementer, adversarial reviewer, and verifier, you can route all steps to one strong model. Or you can route different steps to different models.

The wrong answer is to guess.

So we ran a small real benchmark through Crazyrouter.

[Figure: Dynamic workflow routing benchmark]

Test setup#

We used the Crazyrouter OpenAI-compatible endpoint:

text
https://cn.crazyrouter.com/v1

The workflow task was:

text
Add a CSV export for account billing history with user-level authorization, timezone-safe timestamps, CSV escaping, tests, and rollback notes.

The workflow had four steps:

  1. planner
  2. implementer
  3. adversarial reviewer
  4. verifier

We tested three routing policies:

| Policy | Planner | Implementer | Reviewer | Verifier |
|---|---|---|---|---|
| all_opus_47 | claude-opus-4-7 | claude-opus-4-7 | claude-opus-4-7 | claude-opus-4-7 |
| all_opus_48 | claude-opus-4-8 | claude-opus-4-8 | claude-opus-4-8 | claude-opus-4-8 |
| routed_47_48 | claude-opus-4-7 | claude-opus-4-8 | claude-opus-4-7 | claude-opus-4-8 |
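
The three policies can also be written as plain data, so that switching routes is a lookup rather than a code change. This is a minimal sketch: `STEPS`, `POLICIES`, and `model_for` are illustrative names, not part of any Crazyrouter API. Only the policy names and model IDs come from the table.

```python
# Workflow steps in execution order (from the article).
STEPS = ("planner", "implementer", "reviewer", "verifier")

# Routing policies from the table; POLICIES is an illustrative name.
POLICIES = {
    "all_opus_47": dict.fromkeys(STEPS, "claude-opus-4-7"),
    "all_opus_48": dict.fromkeys(STEPS, "claude-opus-4-8"),
    "routed_47_48": {
        "planner": "claude-opus-4-7",
        "implementer": "claude-opus-4-8",
        "reviewer": "claude-opus-4-7",
        "verifier": "claude-opus-4-8",
    },
}


def model_for(policy: str, step: str) -> str:
    """Return the model ID a policy assigns to a step (hypothetical helper)."""
    try:
        return POLICIES[policy][step]
    except KeyError as exc:
        raise ValueError(f"unknown policy or step: {policy!r}, {step!r}") from exc
```

Keeping policies as data means a new route is one more dictionary entry, and the workflow code itself does not change.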

Raw benchmark artifact:

text
generated/dynamic_workflow_routing_20260603/benchmark_results.json

Results#

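| Route | Calls | Total latency | Total tokens | Output tokens | Score |
|---|---|---|---|---|---|
| all Opus 4.7 | 4 | 100.939s | 8,853 | 5,977 | 14/17 |
| all Opus 4.8 | 4 | 82.598s | 8,357 | 5,782 | 15/17 |
| routed 4.7/4.8 | 4 | 85.975s | 8,652 | 5,873 | 15/17 |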

In this run, all_opus_48 had the lowest latency and the fewest total tokens, and it tied routed_47_48 for the top score at 15/17.

That does not mean every workflow should use Opus 4.8 everywhere. It means routing needs evidence.

What the score measured#

This was not a generic benchmark. Each workflow step had step-specific checks.

For example:

  • planner needed affected files, risks, acceptance criteria, tests, rollback;
  • implementer needed CSV handling, authorization, timestamps, tests;
  • reviewer needed security, privacy, tests, rollback;
  • verifier needed commands, tests, evidence, inspection.

The score was a simple keyword-based quality gate. It is not a substitute for human evaluation, but it does check whether the output covered the required workflow concerns.
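
A keyword gate of this kind might look like the sketch below. The benchmark's actual keyword list is not published, so the `REQUIRED` table is an assumption built from the example checks listed above, and `score_step` and `score_workflow` are hypothetical helpers. With these example lists the total comes to 5 + 4 + 4 + 4 = 17 checks, which matches the denominator in the results, but that does not confirm these were the exact checks used.

```python
# Required concerns per step, taken from the examples in the article.
# Assumption: the real benchmark's keyword list is not published.
REQUIRED = {
    "planner": ["affected files", "risk", "acceptance criteria", "test", "rollback"],
    "implementer": ["csv", "authorization", "timestamp", "test"],
    "reviewer": ["security", "privacy", "test", "rollback"],
    "verifier": ["command", "test", "evidence", "inspect"],
}


def score_step(step: str, output: str) -> tuple[int, int]:
    """Return (checks passed, checks total) using case-insensitive substring matching."""
    text = output.lower()
    keywords = REQUIRED[step]
    passed = sum(1 for keyword in keywords if keyword in text)
    return passed, len(keywords)


def score_workflow(outputs: dict[str, str]) -> tuple[int, int]:
    """Sum step scores over every step in REQUIRED; a missing output scores zero."""
    passed = total = 0
    for step in REQUIRED:
        step_passed, step_total = score_step(step, outputs.get(step, ""))
        passed += step_passed
        total += step_total
    return passed, total
```

Substring matching is deliberately crude: it rewards mentioning a concern, not handling it well, which is why a gate like this works as a floor rather than a full evaluation.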

Why this matters#

Dynamic workflows can multiply model calls.

A simple AI coding request might become:

text
planner call
+ implementer call
+ reviewer call
+ verifier call
+ retry calls
+ patch-fix calls
+ final summary call

If you always use the most expensive model, cost can grow quickly. If you always use the cheapest model, failures and retries can grow quickly.

The useful metric is not token price.

The useful metric is:

text
cost and latency per successful workflow
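
One way to compute that metric is sketched below. The `WorkflowRun` record, the `per_successful_workflow` function, and the `price_per_1k_tokens` parameter are illustrative assumptions, and any price you pass in is your own figure, not one taken from this benchmark.

```python
from dataclasses import dataclass


@dataclass
class WorkflowRun:
    latency_s: float   # wall-clock time for the whole workflow, retries included
    tokens: int        # total tokens billed for the run
    succeeded: bool    # whether the run passed your quality gate


def per_successful_workflow(
    runs: list[WorkflowRun], price_per_1k_tokens: float
) -> dict[str, float]:
    """Divide total spend and total latency by the number of successful runs.

    Failed runs still count toward the totals, so a route that needs
    retries looks more expensive than its per-call price suggests.
    """
    successes = sum(run.succeeded for run in runs)
    if successes == 0:
        raise ValueError("no successful runs; the metric is undefined")
    total_tokens = sum(run.tokens for run in runs)
    total_latency = sum(run.latency_s for run in runs)
    return {
        "cost": total_tokens / 1000 * price_per_1k_tokens / successes,
        "latency_s": total_latency / successes,
    }
```

Because failures stay in the numerator, a cheap model that fails one run in three can cost more per successful workflow than a pricier model that passes every time.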

What we learned#

1. Opus 4.8 was faster in this workflow#

The all-4.8 route finished in 82.598 seconds. The all-4.7 route took 100.939 seconds.

That is an 18.341-second difference, or about 18% less wall-clock time, for the same four-step workflow.

In a single call, that may not matter. In a background agent system with many workflow steps, it does.
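
The arithmetic behind that comparison can be reproduced from the reported totals. The snippet below uses only the numbers in the results table; `relative_saving` is a hypothetical helper, and the figures come from a single run, so treat them as indicative.

```python
# Totals reported in the results table for the two single-model routes.
latency_s = {"all_opus_47": 100.939, "all_opus_48": 82.598}
total_tokens = {"all_opus_47": 8853, "all_opus_48": 8357}


def relative_saving(baseline: float, candidate: float) -> float:
    """Fraction saved by the candidate relative to the baseline."""
    return (baseline - candidate) / baseline


latency_delta = latency_s["all_opus_47"] - latency_s["all_opus_48"]
latency_saving = relative_saving(latency_s["all_opus_47"], latency_s["all_opus_48"])
token_saving = relative_saving(total_tokens["all_opus_47"], total_tokens["all_opus_48"])

print(f"{latency_delta:.3f} s saved ({latency_saving:.1%} of the all-4.7 latency)")
print(f"{token_saving:.1%} fewer total tokens")
```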

2. Mixed routing was close, but not better here#

The mixed route used 4.7 for planning/review and 4.8 for implementation/verification.

It scored 15/17, the same as all-4.8, but took 85.975 seconds and used 8,652 tokens: 3.377 seconds slower and 295 tokens more than all-4.8.

That is still good. But in this run, all-4.8 was simpler and slightly better.

3. A static rule is risky#

A different task might favor a different route. Security review, legal summarization, long-context extraction, frontend implementation, and test generation are not the same workload.

The point is not “always use model X.”

The point is to create a workflow trace and compare model routes on your actual task types.

Minimal reproduction code#

python
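import os
import time

from openai import OpenAI

# The environment variable name is an assumption; use whatever your deployment provides.
client = OpenAI(
    api_key=os.environ.get("CRAZYROUTER_API_KEY", "YOUR_CRAZYROUTER_API_KEY"),
    base_url="https://cn.crazyrouter.com/v1",
)

# The workflow task from the test setup.
TASK = (
    "Add a CSV export for account billing history with user-level authorization, "
    "timezone-safe timestamps, CSV escaping, tests, and rollback notes."
)

# One instruction per workflow step.
steps = {
    "planner": "Create a concise implementation plan with risks, tests, and rollback.",
    "implementer": "Write a minimal pseudo-patch plan and code sketch.",
    "reviewer": "Adversarially review security, correctness, privacy, tests, and rollback.",
    "verifier": "Create a verification checklist with concrete commands and evidence.",
}

# Routing policy: which model runs each step (this is the all_opus_48 policy).
route = {
    "planner": "claude-opus-4-8",
    "implementer": "claude-opus-4-8",
    "reviewer": "claude-opus-4-8",
    "verifier": "claude-opus-4-8",
}

for role, instruction in steps.items():
    started = time.perf_counter()
    response = client.chat.completions.create(
        model=route[role],
        # Assumption: the benchmark's exact prompts are not published; here the
        # task is prepended to each step instruction so every call is self-contained.
        messages=[{"role": "user", "content": f"Task: {TASK}\n\n{instruction}"}],
        temperature=0,
    )
    latency = time.perf_counter() - started
    # Log the fields the benchmark compares: model, latency, and token usage.
    tokens = response.usage.total_tokens if response.usage else None
    print(f"[{role}] model={route[role]} latency={latency:.3f}s tokens={tokens}")
    print(response.choices[0].message.content)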

Keep the API base URL clean. Do not add UTM parameters to code endpoints.

Start with this loop:

  1. define workflow steps;
  2. define success checks per step;
  3. run the same task across 2-3 routing policies;
  4. log model, latency, tokens, and score;
  5. choose the route based on successful workflow outcome, not model hype.
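
The loop above can be sketched as a small harness. Everything here is illustrative: `run_policy`, `pick_route`, and the `StepCaller` type are hypothetical names, the model call is injected as a function so the harness can run without network access, and the tie-break order in `pick_route` (score first, then latency, then tokens) is an assumption rather than the benchmark's stated rule.

```python
import time
from typing import Callable

# Takes (model, step) and returns (output text, total tokens).
StepCaller = Callable[[str, str], tuple[str, int]]


def run_policy(
    name: str,
    route: dict[str, str],
    call_model: StepCaller,
    score: Callable[[str, str], int],
) -> dict:
    """Run every step of one routing policy and collect a trace row per step."""
    rows = []
    for step, model in route.items():
        started = time.perf_counter()
        output, tokens = call_model(model, step)
        rows.append({
            "step": step,
            "model": model,
            "latency_s": time.perf_counter() - started,
            "tokens": tokens,
            "score": score(step, output),
        })
    return {
        "policy": name,
        "latency_s": sum(row["latency_s"] for row in rows),
        "tokens": sum(row["tokens"] for row in rows),
        "score": sum(row["score"] for row in rows),
        "steps": rows,
    }


def pick_route(results: list[dict]) -> dict:
    """Prefer the highest score, then the lowest latency, then the fewest tokens."""
    return min(results, key=lambda r: (-r["score"], r["latency_s"], r["tokens"]))
```

Applied to the totals reported in this post, that ordering selects all_opus_48: it ties the routed policy on score and has the lower latency.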

A gateway makes this practical because the application code can keep the same client and base URL while changing model IDs.

Why Crazyrouter fits this pattern#

Dynamic workflows need three things:

  • model variety;
  • centralized routing;
  • traceable usage.

Crazyrouter gives one OpenAI-compatible API surface for multiple models. That makes it easier to test planner/reviewer/verifier routes without rewriting the product around each provider.

This matters more as AI coding moves from single-agent chats to orchestrated workflows.

Final take#

Dynamic workflows are not just a Claude Code or Codex feature. They are an engineering pattern.

Once you split work into planner, implementer, reviewer, and verifier, model choice becomes a routing problem.

In this run, all Opus 4.8 was the best route. In your workflow, it might be a mixed route. The only way to know is to measure.

Try model routing here: Crazyrouter
