
I Measured My AI Router This Week. It Hit 94 Percent Top-1 Versus a 48.9 Percent Baseline.

Teh Kim Guan, ACMA · CGMA
2026-05-27 · 8 min read

The Number

I measured my AI router this week. It hit 94 percent top-1 versus a 48.9 percent baseline. The gap matters because it tells you exactly where the quality work actually lives: not in model selection, not in prompt engineering tricks, but in skill descriptions that remove ambiguity before the router ever makes a call.

I run a 103-skill corpus on top of Claude Code. Over three months I built the corpus up from a flat list of invocation shortcuts to something that looks more like a staffed operations layer. Two named staff agents handle strategic and operational registers. Decision records preserve the architectural choices. A telemetry pipeline captures session events as structured data. Last week I finished building the measurement layer that tells me how well it all routes. The 94 percent number is what came out.

Here is what I measured, how I built the thing that moved the number, and what a working operator can take from this tonight.


What I Actually Measured

The benchmark is a 50-case adversarial fixture. Each case contains a prompt, an expected primary skill, a set of acceptable secondary skills, a set of rejected skills, and a written reason string. The cases are adversarial by design: they are not happy-path prompts like "run my morning briefing." They are prompts that sit at the boundary between two adjacent skills, where the default assumption about which skill should fire is plausible but wrong.

Here is a real example from the fixture:

Prompt: "Update the skills registry with the new kg-ce entry."
Expected: kg-skills-registry
Rejected: kg-ce, skill-creator
Reason: Adding a skill to the registry is kg-skills-registry's job, not kg-ce's. kg-ce assesses; kg-skills-registry indexes.

That is the shape of every case. The router is correct if the expected skill appears as the top-ranked output. A top-3 pass counts if the expected skill appears anywhere in the top three.
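The scoring rule is simple enough to sketch in a few lines of Python. This is a minimal illustration, not the scorer I actually run: the `Case` class and its field names are assumptions chosen to mirror the fields listed above, and the ranked output is a made-up router response.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    # Field names are illustrative; the fixture's real keys may differ.
    prompt: str
    expected: str
    acceptable: list[str] = field(default_factory=list)
    rejected: list[str] = field(default_factory=list)
    reason: str = ""


def top1_hit(case: Case, ranked: list[str]) -> bool:
    """True when the expected skill is the top-ranked output."""
    return bool(ranked) and ranked[0] == case.expected


def top3_hit(case: Case, ranked: list[str]) -> bool:
    """True when the expected skill appears anywhere in the first three ranks."""
    return case.expected in ranked[:3]


def score(cases: list[Case], rankings: list[list[str]]) -> dict[str, float]:
    """Aggregate top-1 and top-3 accuracy as fractions between 0 and 1."""
    if not cases or len(cases) != len(rankings):
        raise ValueError("need at least one case and exactly one ranking per case")
    n = len(cases)
    top1 = sum(top1_hit(c, r) for c, r in zip(cases, rankings))
    top3 = sum(top3_hit(c, r) for c, r in zip(cases, rankings))
    return {"top1": top1 / n, "top3": top3 / n}


registry_case = Case(
    prompt="Update the skills registry with the new kg-ce entry.",
    expected="kg-skills-registry",
    rejected=["kg-ce", "skill-creator"],
    reason="kg-ce assesses; kg-skills-registry indexes.",
)
# Hypothetical router output that ranks the expected skill second.
print(score([registry_case], [["kg-ce", "kg-skills-registry", "skill-creator"]]))
# prints {'top1': 0.0, 'top3': 1.0}
```

A misroute like the one above fails top-1 but still passes top-3, which is why the two numbers are reported separately.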

The benchmark dispatches the 50 cases as sub-agents in 5 parallel batches of 10. Each sub-agent reads the prompt, considers the full corpus description block (about 19.7K tokens), and returns a ranked top-3 with a one-sentence rationale. The scorer aggregates the results against the fixture.
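The batching step can be sketched as below. This is an assumption-heavy illustration: `route_case` is a hypothetical stand-in for one sub-agent call (the real dispatch happens inside Claude Code, not in a thread pool), and the threads only demonstrate the 5-by-10 split and the order-preserving merge.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items: list, size: int) -> list[list]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def route_case(prompt: str) -> list[str]:
    """Hypothetical stand-in for one sub-agent call; returns a ranked top-3."""
    return ["skill-a", "skill-b", "skill-c"]


def run_batch(prompts: list[str]) -> list[list[str]]:
    """Route every prompt in one batch, in order."""
    return [route_case(p) for p in prompts]


def run_benchmark(prompts: list[str], batch_size: int = 10) -> list[list[str]]:
    """Run the batches in parallel; return rankings in the original prompt order."""
    batches = make_batches(prompts, batch_size)
    with ThreadPoolExecutor(max_workers=len(batches) or 1) as pool:
        results = pool.map(run_batch, batches)  # map preserves batch order
        return [ranking for batch in results for ranking in batch]


prompts = [f"prompt {i}" for i in range(50)]
rankings = run_benchmark(prompts)
print(len(make_batches(prompts, 10)), len(rankings))  # prints 5 50
```

Keeping the rankings in prompt order matters because the scorer pairs each ranking with its fixture case by position.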

The run cost approximately 4.5 million tokens against subscription quota and about 22 minutes of cumulative compute time. It produced 178 structured telemetry events that persist after the session closes, so any anomaly is recoverable after the fact. The fixture itself is committed as activation-cases.jsonl with a written reason for every entry. That is what makes the score auditable rather than just a claimed number.
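To show why persisted events make an anomaly recoverable, here is a minimal sketch of an append-only JSONL event log. The event schema (`ts`, `case_id`, `status`) and the function names are assumptions for illustration, not the schema my pipeline uses.

```python
import json
import tempfile
import time
from pathlib import Path


def log_event(path: Path, case_id: str, status: str) -> None:
    """Append one event as a JSON line; the file outlives the session."""
    event = {"ts": time.time(), "case_id": case_id, "status": status}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")


def unfinished_cases(path: Path) -> list[str]:
    """Return case ids that have a 'started' event but no 'completed' event."""
    started, completed = [], set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            event = json.loads(line)
            if event["status"] == "started":
                started.append(event["case_id"])
            elif event["status"] == "completed":
                completed.add(event["case_id"])
    return [cid for cid in started if cid not in completed]


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "events.jsonl"  # hypothetical file name
    log_event(log, "case-01", "started")
    log_event(log, "case-01", "completed")
    log_event(log, "case-02", "started")  # interrupted before completion
    print(unfinished_cases(log))  # prints ['case-02']
```

Because every line is written before and after the work it describes, an interrupted run leaves behind exactly the list of cases that need a retry.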

Final score: 94 percent top-1 (47 of 50), 98 percent top-3 (49 of 50), 0.90 average confidence.


Why the Number Is High: Dual-Tier Architecture and NOT-Clauses

This is the structural answer. The number did not come from running a better model. It came from two architectural choices that I made before any measurement existed.

Named staff agents with distinct registers

The corpus is not a flat list of 103 tools. It is a two-tier structure. At the top sit four staff agents, two of them named personas:

  • Ara: strategic thinking partner. ENFP-aligned. Invoked for weekly pattern analysis, coaching reflection, big-picture framing.
  • Deeno: operational foreman. ISTJ-aligned. Invoked for daily research insight sweeps, implementation follow-through, routine closure.
  • kg-ce: context engineer. Invoked for corpus health audits, router benchmarks, mechanism registry maintenance.
  • kg-selfcritique: hypothesis lifecycle manager. Invoked when an experiment needs promotion, retirement, or evidence review.

Below these sit the 99 operational skills: project-specific agents, content pipeline tools, finance tools, SEO agents, client-specific PM skills.

Why does the tier split matter for routing? Because a flat corpus forces the router to distinguish between two skills with overlapping scopes based on description text alone. A tiered corpus lets the descriptions claim their register explicitly. Ara's description can say "this skill is your strategic thinking partner, not an execution tool." Deeno's description can say "this skill handles implementation follow-through, not open-ended strategic analysis." Those claims do not overlap. A flat corpus cannot make them.

The personification was not aesthetic. It was a disambiguation mechanism.

Explicit NOT-clauses in skill descriptions

This was the single highest-return intervention. Every skill in my corpus includes a block that reads something like:

Do not activate this skill for: [list of adjacent prompts that look similar but belong to a different skill]

The NOT-clause forces the description author (me) to think through the boundaries of a skill's scope before it goes live, not after the first misroute. It converts tacit routing knowledge ("I know Ara is not for implementation tasks") into explicit signal that the router can read at inference time.
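One way to keep the habit honest is a small lint that flags any description missing its NOT-clause. The sketch below assumes descriptions are available as plain strings keyed by skill name; the helper names, the lowercase skill keys, and the sample wording are all hypothetical.

```python
NOT_CLAUSE_MARKER = "Do not activate this skill for:"


def build_description(summary: str, not_for: list[str]) -> str:
    """Join a one-line summary with an explicit NOT-clause."""
    if not not_for:
        raise ValueError("a NOT-clause needs at least one adjacent prompt type")
    return f"{summary}\n{NOT_CLAUSE_MARKER} {'; '.join(not_for)}"


def missing_not_clause(descriptions: dict[str, str]) -> list[str]:
    """Return the names of skills whose description lacks a NOT-clause."""
    return sorted(
        name for name, text in descriptions.items()
        if NOT_CLAUSE_MARKER.lower() not in text.lower()
    )


corpus = {
    "ara": build_description(
        "Strategic thinking partner.",
        ["implementation follow-through", "routine closure"],
    ),
    "deeno": "Operational foreman for daily follow-through.",  # no NOT-clause yet
}
print(missing_not_clause(corpus))  # prints ['deeno']
```

A check like this only proves the clause exists. Whether it names the right adjacent prompts is still a writing problem.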

In the same session where I ran the baseline benchmark, I rewrote the descriptions for three borderline skills. Each rewrite added or sharpened a NOT-clause. Three rewrites lifted top-1 by 14 percentage points, nearly five points per rewrite. That is a strong return on what amounts to ten minutes of writing per skill.

The comparison is direct: before the rewrites, the corpus scored around 80 percent. After three targeted NOT-clause additions, it scored 94 percent. The model did not change. The corpus did not grow. Only the descriptions changed.


The Honest Framing

I want to be clear about where the apples-and-oranges problem lives.

The 48.9 percent reference baseline comes from a published benchmark on a different fixture, by a different author, on a different corpus. I did not run my corpus against that external fixture. I ran my corpus against my own fixture, which I authored with full knowledge of how my descriptions are written. That is a meaningful limitation.

A fixture authored by the same person who wrote the descriptions can, even unintentionally, reflect the description vocabulary. An external blind fixture, authored by someone with no access to the description text, would be a stricter test. That is a Phase 3 task: commission a 50-case blind fixture from an external contributor.

A second limitation: 50 cases is small. At 47 correct out of 50, a 95 percent Wilson interval spans roughly 84 to 98 percent, so the true number could plausibly be several points lower than 94. Expanding to 200 cases would tighten the interval and increase confidence in the result.
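To put a number on "small", here is a minimal sketch of the Wilson score interval for a binomial proportion. The choice of the Wilson method and the 95 percent level (z = 1.96) are my assumptions for illustration; the 200-case line assumes the same 94 percent hit rate holds, which is not yet measured.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95 percent)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# 47/50 is the measured result; 188/200 is a hypothetical run at the same rate.
for successes, n in ((47, 50), (188, 200)):
    low, high = wilson_interval(successes, n)
    print(f"n={n}: {low:.1%} to {high:.1%}")
```

With 50 cases the interval is about 14 points wide. At 200 cases and the same hit rate it narrows to under 7 points, which is the practical argument for the larger fixture.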

What I am confident about: the absolute 94 percent figure is a useful benchmark for what is achievable when you treat skill descriptions as a quality problem rather than a coverage problem. If you are building a Claude Code corpus and you are getting routing accuracy in the 60 to 70 percent range on adversarial cases, the limiting factor is almost certainly description ambiguity, not model ceiling.

The 45-point gap between 48.9 and 94 is not a quirk of favorable test conditions. It is the consequence of a specific set of description investments: NOT-clauses, named tier registers, and deliberate boundary writing. Those investments are repeatable.


What This Means for an Operator

If you are building a skill corpus on any agentic platform, the most useful thing I can tell you is this: you already have a measurement problem, and you probably do not know it yet.

Most operators discover routing failures through output errors. A wrong skill fires, a deliverable comes out wrong, the operator corrects manually, and the cause is attributed to "the model" rather than to the description. That attribution prevents the fix. The model was not confused. The description was ambiguous.

The experiment you can run tonight is a 10-case fixture. Pick 10 prompts from your actual usage history. For each one, write down: the skill you expected to fire, the skill that actually fired, and the reason you expected what you did. Run those 10 cases against your corpus. If you are below 80 percent top-1 on prompts from your own usage history, you have a description quality problem, not a corpus size problem.

The fixture format does not need to be elaborate. A flat JSONL file with one object per case, each with a prompt, an expected skill, and a reason, is enough to start. Run it once. Fix the bottom three descriptions. Run it again. The improvement is immediate and attributable.
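A starter version of that loop might look like the sketch below. The three cases, the key names, and the `fired` list are made up for illustration; in practice `fired` would hold whatever skill your router actually selected for each prompt.

```python
import json
from collections import Counter


def load_cases(lines: list[str]) -> list[dict]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def worst_skills(cases: list[dict], actual: list[str], k: int = 3) -> list[tuple[str, int]]:
    """Count misroutes per expected skill and return the k most-missed skills."""
    misses = Counter(
        case["expected"] for case, fired_skill in zip(cases, actual)
        if fired_skill != case["expected"]
    )
    return misses.most_common(k)


# Hypothetical three-case fixture; a real one would live in a .jsonl file.
fixture = """
{"prompt": "Review last week's patterns.", "expected": "ara", "reason": "strategic register"}
{"prompt": "Close out today's follow-ups.", "expected": "deeno", "reason": "operational register"}
{"prompt": "Audit the corpus health.", "expected": "kg-ce", "reason": "corpus audit"}
""".splitlines()

cases = load_cases(fixture)
fired = ["ara", "ara", "kg-ce"]  # hypothetical router output, one skill per case
top1 = sum(f == c["expected"] for c, f in zip(cases, fired)) / len(cases)
print(f"top-1: {top1:.0%}", worst_skills(cases, fired))
# prints top-1: 67% [('deeno', 1)]
```

The `worst_skills` list is the worklist: the skills that were expected but did not fire are the descriptions to rewrite first.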

The reason this matters at scale: as a corpus grows past 30 or 40 skills, manual routing oversight becomes impossible. You cannot verify at session start that the right 5 skills out of 103 are being considered. A written fixture, re-run weekly, gives you a leading indicator before the failures surface in production output.


What I Am Building Next

The 94 percent score closes Phase 2 of my context engineering build. Phase 3 has two concrete milestones.

First: expand the fixture from 50 to 200 cases. The additional 150 cases will draw from a wider range of session types, including the client-specific PM skills and the finance tools that are underrepresented in the current fixture. This tightens the confidence interval and surfaces any category-level blind spots.

Second: commission an external blind fixture. I want a set of 50 adversarial cases authored by someone who has no knowledge of my skill descriptions. That external fixture will be the closest I can get to a controlled test. The score against that fixture will be the number I trust most.

If the external fixture comes in at 80 percent or above, I will consider the architecture validated for production use at scale. If it comes in below 80, I will have a specific set of failing cases to work from, which is still a better outcome than not knowing.

The 45-point gap between the reference baseline and my current score is defensible and attributable. I know what made it. The next test is whether it holds under a stricter fixture.


Part of the Knowledge Management series from KG Consultancy.

About the Author
Teh Kim Guan
Product Consultant · General Manager, PEPS Ventures

Strategy and technology are the same decision. Over 15 years in fintech (CTOS, D&B), prop-tech (PropertyGuru DataSense), and digital startups, I have built frameworks that help founders and executives make both moves at once. Based in Kuala Lumpur.
