Model Routing for AI Coding Tools: Bedrock vs Azure Foundry vs Vertex
Most enterprises already run AI coding tools through Bedrock, Azure AI Foundry, or Vertex. Here's what model routing means on each gateway — and how to cut cost without switching vendors.
If your company has standardized AI access, your developers' coding tools almost certainly run through one of three gateways: AWS Bedrock, Azure AI Foundry, or Google Vertex AI. That's the right call for governance — central billing, access control, and data residency.
It also means your AI coding spend is already concentrated in one place. The question isn't which gateway — you've chosen. It's how efficiently you use the models on it. That's what model routing addresses.
What model routing actually is
Model routing is simple to state: send each task to the cheapest model that can still do it well.
Left at their defaults, AI coding tools tend to send every request to the most capable (and most expensive) model. But the work is wildly uneven:
- "Rename this variable across the file" — a small model handles it fine.
- "Refactor this service to be concurrency-safe" — you want the top model.
Routing makes that decision per task, automatically, against a quality bar — so the expensive model is reserved for work that benefits from it, and the rest runs cheaper. Crucially, it does this with the models already available on your gateway. No new vendor, no data leaving your environment.
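As a rough illustration of that per-task decision, here is a minimal sketch in Python. The model names, capability scores, and prices are invented placeholders, and a real router would need some way to estimate the capability a task requires, which this sketch simply takes as an input.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str           # hypothetical identifier, not a real model ID
    capability: int     # assumed 1-10 capability score
    cost_per_1k: float  # placeholder price in USD per 1,000 tokens

# Illustrative catalog: names, scores, and prices are made up for the example.
CATALOG = [
    Model("small", capability=3, cost_per_1k=0.001),
    Model("medium", capability=6, cost_per_1k=0.004),
    Model("large", capability=9, cost_per_1k=0.020),
]

def route(required_capability: int, catalog=CATALOG) -> Model:
    """Return the cheapest model whose capability meets the task's bar."""
    capable = [m for m in catalog if m.capability >= required_capability]
    if not capable:
        # Nothing clears the bar: fall back to the most capable model.
        return max(catalog, key=lambda m: m.capability)
    return min(capable, key=lambda m: m.cost_per_1k)

rename = route(required_capability=2)    # e.g. "rename this variable"
refactor = route(required_capability=9)  # e.g. "make this service concurrency-safe"
print(rename.name, refactor.name)
```

The fallback to the most capable model when nothing clears the bar is one possible design choice; a stricter router could refuse or escalate instead.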
On AWS Bedrock
Bedrock gives you a catalog of models behind one API, with provisioned and on-demand options and its own prompt-caching support. Routing on Bedrock means choosing among the Bedrock-hosted models per task and measuring savings against Bedrock's actual pricing — including the cache discount. The win: you keep Bedrock's governance and billing while stopping the "top model on everything" default.
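To make "including the cache discount" concrete, the sketch below prices a single request net of caching. It assumes cached input tokens are billed at a lower rate than uncached ones and ignores any cache-write charges; the function name and all prices are hypothetical placeholders, so substitute the rates from your own Bedrock bill.

```python
def request_cost(input_tokens, cached_tokens, output_tokens,
                 input_price, cached_price, output_price):
    """Dollar cost of one request. All prices are USD per 1,000 tokens.

    Simplification: cached input tokens are charged at cached_price,
    the remaining input tokens at input_price, and cache-write
    charges (if your pricing has them) are left out.
    """
    uncached_tokens = input_tokens - cached_tokens
    total = (uncached_tokens * input_price
             + cached_tokens * cached_price
             + output_tokens * output_price)
    return total / 1000

# Placeholder prices, not real Bedrock rates.
with_cache = request_cost(10_000, 8_000, 500, 0.01, 0.001, 0.05)
without_cache = request_cost(10_000, 0, 500, 0.01, 0.001, 0.05)
print(f"with cache: ${with_cache:.3f}, without cache: ${without_cache:.3f}")
```

With these placeholder numbers the cached request comes to about $0.053 against $0.125 uncached, which is why savings computed from raw token counts can differ sharply from what is actually billed.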
On Azure AI Foundry
Foundry centers on model deployments and quota. Routing here means directing each task to the appropriate Foundry deployment rather than defaulting every request to your most capable one. Because quota and deployment cost vary, routing also helps you use provisioned capacity more evenly instead of saturating the expensive deployment.
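One way to picture quota-aware routing is the sketch below. It assumes each deployment has a tokens-per-minute quota and tracks usage in memory; the deployment names, capability scores, prices, and limits are hypothetical, and a production router would read live quota from the platform rather than keep its own counter.

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    name: str           # hypothetical deployment name
    capability: int     # assumed 1-10 capability score
    cost_per_1k: float  # placeholder price in USD per 1,000 tokens
    tpm_limit: int      # assumed tokens-per-minute quota
    tpm_used: int = 0   # tokens consumed in the current minute

    def headroom(self) -> int:
        return self.tpm_limit - self.tpm_used

def pick_deployment(deployments, required_capability, est_tokens):
    """Cheapest capable deployment that still has quota for this request."""
    candidates = [
        d for d in deployments
        if d.capability >= required_capability and d.headroom() >= est_tokens
    ]
    if not candidates:
        return None  # caller decides whether to queue, wait, or fail
    chosen = min(candidates, key=lambda d: d.cost_per_1k)
    chosen.tpm_used += est_tokens  # reserve the quota
    return chosen

deployments = [
    Deployment("economy", capability=4, cost_per_1k=0.002, tpm_limit=10_000),
    Deployment("premium", capability=9, cost_per_1k=0.020, tpm_limit=20_000),
]
first = pick_deployment(deployments, required_capability=3, est_tokens=6_000)
second = pick_deployment(deployments, required_capability=3, est_tokens=6_000)
print(first.name, second.name)
```

Here the second simple task spills over to the premium deployment only because the economy deployment has run out of headroom, so the expensive capacity is used as overflow rather than as the default.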
On Google Vertex AI
Vertex offers its own model catalog, context caching, and pricing. Routing on Vertex follows the same pattern: match the task to the cheapest capable Vertex model, and measure against Vertex's billed rates. The routing logic is the same as on the other gateways — what changes is the catalog and the pricing it's measured against.
The common thread
The routing decision — "which model is cheapest-but-capable for this task?" — is gateway-agnostic. What's gateway-specific is:
- The model catalog available.
- The pricing each model is billed at.
- The caching behavior that affects real cost.
So a routing layer worth using has to be portable across Bedrock, Foundry, and Vertex, and it has to measure savings against your gateway's actual prices — not a generic token estimate. (For why token estimates mislead, see Why Token-Reduction Percentages Lie About AI Savings.)
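A small sketch of what "measure against your gateway's actual prices" could look like: the same routed workload is priced against three per-gateway price tables. Every price below is a made-up placeholder, the model labels are hypothetical, and caching is left out to keep the example short.

```python
# Placeholder prices in USD per 1,000 tokens. These are NOT real gateway rates.
PRICES = {
    "bedrock": {"small": 0.001, "large": 0.020},
    "foundry": {"small": 0.002, "large": 0.018},
    "vertex": {"small": 0.0015, "large": 0.021},
}

def savings(gateway, tasks, top_model="large"):
    """Compare 'top model on everything' with the routed assignment.

    tasks is a list of (routed_model, tokens) pairs.
    Returns (baseline, routed, saved) in dollars.
    """
    prices = PRICES[gateway]
    baseline = sum(tokens * prices[top_model] for _, tokens in tasks) / 1000
    routed = sum(tokens * prices[model] for model, tokens in tasks) / 1000
    return baseline, routed, baseline - routed

# The same routing decisions, priced per gateway.
workload = [("small", 4_000), ("large", 6_000)]
for gateway in PRICES:
    baseline, routed, saved = savings(gateway, workload)
    print(f"{gateway}: baseline ${baseline:.3f}, routed ${routed:.3f}, saved ${saved:.3f}")
```

The routing decisions are identical in all three cases, yet the dollar savings differ because each price table differs, which is the gap a generic token estimate cannot see.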
Keep it self-hosted
The reason you adopted Bedrock/Foundry/Vertex in the first place was control. A routing layer should preserve that: run it self-hosted, inside your own infrastructure, in front of your gateway. Prompts and code stay in your environment; routing and measurement happen privately. Adding a routing layer shouldn't mean adding a new place your source code travels to.
Set the policy centrally
Finally, the controls belong with whoever owns the AI budget. Managers should set:
- Routing aggressiveness (from quality-first to max-savings).
- Prompt-compression level.
- Per-team and per-repo policy.
- A quality floor that routing never crosses.
Developers keep their existing tools and workflow; the cost policy is set above them.
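The controls listed above could be represented as a small policy structure, sketched here in Python. The field names, the compression levels, and the rule that a repo policy overrides a team policy are assumptions made for illustration, not a description of any product's configuration format.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Policy:
    aggressiveness: float = 0.5  # assumed scale: 0.0 quality-first, 1.0 max-savings
    compression: str = "off"     # hypothetical levels: "off", "light", "heavy"
    quality_floor: int = 5       # minimum capability score routing may select

@dataclass
class PolicySet:
    default: Policy
    teams: dict = field(default_factory=dict)  # team name -> Policy
    repos: dict = field(default_factory=dict)  # repo name -> Policy

    def resolve(self, team: str, repo: str) -> Policy:
        """Most specific policy wins: repo, then team, then the default."""
        if repo in self.repos:
            return self.repos[repo]
        if team in self.teams:
            return self.teams[team]
        return self.default

def allowed(policy: Policy, model_capability: int) -> bool:
    """True if a model clears the policy's quality floor."""
    return model_capability >= policy.quality_floor

policies = PolicySet(
    default=Policy(),
    teams={"platform": Policy(aggressiveness=0.8, quality_floor=4)},
    repos={"payments-core": Policy(aggressiveness=0.1, quality_floor=8)},
)
active = policies.resolve(team="platform", repo="payments-core")
print(active.quality_floor, allowed(active, model_capability=6))
```

In this sketch the hypothetical `payments-core` repo keeps its strict floor even though its team is configured for aggressive savings, which is the kind of override a budget owner might want for sensitive code.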
How Tokor does this
Tokor is KorPro's AI cost optimization product, built for exactly this setup. It's a self-hosted model router and measurement layer that runs in front of your team's AI coding usage on Bedrock, Azure AI Foundry, or Vertex, starting with Claude Code. It routes each task to the cheapest capable model, lets managers set routing and compression centrally, and reports the savings in net-of-cache dollars measured against your gateway's real pricing.
Tokor is in early access for teams running 20+ developers. Apply to the design partner program.
Related: How to Reduce AI Coding Costs Across a Developer Team.
Written by
KorPro Team