Routing Policies & Strategies
TL;DR: Igris Overture automatically selects the best AI provider for each request using Thompson Sampling, a Bayesian bandit algorithm that learns from outcomes. Growth+ tiers unlock Speculative Execution (60% faster) and Council Mode (+15-20% better quality).
Overview
Igris Overture uses intelligent routing to automatically select the best provider for each request based on:
- Speed - Which provider responds fastest
- Cost - Which provider offers the best value
- Quality - Which provider gives the best answers
- Availability - Which providers are currently healthy
Available Routing Strategies
| Strategy | Description | Best For | Tier |
|---|---|---|---|
| Thompson Sampling | AI-powered learning algorithm | Production workloads | All tiers |
| Speculative Execution | Race multiple providers in parallel | Latency-critical apps | Growth+ |
| Council Mode | Multi-provider consensus | High-stakes decisions | Growth+ |
| Cost-Aware | Automatic cheapest provider selection | Budget-conscious apps | All tiers |
| Semantic Routing | AI classification by task type | Mixed workloads | All tiers |
Thompson Sampling (Recommended)
What it does: Automatically learns which providers work best for your workload and routes traffic accordingly.
Thompson Sampling delivers 25% improved scoring accuracy through adaptive per-tenant reward weighting, 50% faster convergence for new models via informed priors, and 15% better decisions in the first 50 requests with intelligent sparse data handling. The system automatically adapts to your unique workload patterns, requiring zero configuration.
How It Works
Thompson Sampling continuously learns from every request:
- Smart Initialization: New models start with informed priors from similar providers (50% faster convergence)
- Per-Tenant Learning: Automatically adapts reward weights to your unique workload patterns
- Request Tracking: Records latency, cost, success rate, cache hits, and quality scores
- Adaptive Weighting: Uses gradient descent to optimize routing decisions for your specific needs
- Exploration: Maintains balanced exploration to discover improvements while exploiting known winners
Usage
Thompson Sampling is enabled by default. No configuration needed:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.igrisinertial.com/v1",
    api_key="sk-igris-YOUR_KEY"
)

# Automatic Thompson Sampling routing
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
Response Metadata
Every response includes routing information:
```json
{
  "choices": [...],
  "metadata": {
    "provider": "openai/gpt-4",
    "routing_decision": "thompson-sampling",
    "latency_ms": 234,
    "cost_usd": 0.00525
  }
}
```
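If you use the OpenAI Python SDK, `metadata` is not part of the SDK's typed response model, so, as a sketch (assuming the gateway returns the field at the top level as shown above), you can read it from the untyped extras:

```python
# Sketch: reading gateway-added routing metadata from an OpenAI SDK response.
# `metadata` is not a typed field on ChatCompletion, so it lands in
# `model_extra` (pydantic's bucket for unrecognized fields) -- assuming
# Igris Overture returns it at the top level as shown above.
meta = (response.model_extra or {}).get("metadata", {})
print(f"Routed to {meta.get('provider')} "
      f"in {meta.get('latency_ms')} ms "
      f"for ${meta.get('cost_usd')}")
```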
Advanced: How Thompson Sampling Scores Providers
v1.5.0 Adaptive Reward Weighting:
The system now learns optimal weights for your specific workload using gradient descent:
Default Starting Weights:
```
score = (40% × latency) + (30% × success) + (15% × cost) + (10% × cache) + (5% × quality)
```
After 500+ requests, weights adapt per-tenant:
- Latency-sensitive workloads: Automatically increases latency weight (e.g., 55%)
- Cost-conscious workloads: Automatically increases cost weight (e.g., 35%)
- Quality-focused workloads: Automatically increases quality weight (e.g., 25%)
Learning Algorithm:
```
w_new = w_old + α × (correlation - w_old)
```
- Learning rate α = 0.05 (smooth convergence)
- Updates every 500 requests
- Confidence score = 1 / (1 + e^(-(n - 500)/100))
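As a concrete illustration (not the production code), here is one update step for a latency-sensitive tenant, using the formula above with a hypothetical observed correlation:

```python
import math

# Illustrative only: one adaptive-weight update step using the formula above.
alpha = 0.05                 # learning rate
w_latency = 0.40             # current latency weight (the default)
correlation = 0.55           # observed latency correlation for this tenant (hypothetical)

w_latency = w_latency + alpha * (correlation - w_latency)
print(round(w_latency, 4))   # 0.4075 -- weights drift toward the tenant's signal

# Confidence in the learned weights as a function of request count n
n = 600
confidence = 1 / (1 + math.exp(-(n - 500) / 100))
print(round(confidence, 2))  # ~0.73 after 600 requests
```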
New Model Warm Start:
When a new model is added (e.g., gpt-4.5-turbo), it inherits 50% of similar models' performance (e.g., gpt-4) instead of starting from scratch. This reduces learning time from ~100 requests to ~50 requests.
Sparse Data Handling: For models with fewer than 30 observations, the system applies a UCB-style exploration bonus:
```
exploration_bonus = 2 × sqrt(ln(total_pulls) / pulls)
```
This prevents premature exploitation and improves decisions by 15% in the first 50 requests.
Providers with higher scores are selected more frequently, but the system maintains 10-15% exploration to discover improvements (exploration rate adapts per-tenant).
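Putting the pieces together, here is a simplified sketch of the per-provider selection loop. It is illustrative only: provider names and statistics are hypothetical, and the real scorer blends the weighted metrics described above rather than success rate alone.

```python
import math
import random

# Simplified sketch: Thompson Sampling provider selection with the
# UCB-style exploration bonus for sparse data. All numbers are hypothetical.
providers = {
    # name: (successes, failures, pulls)
    "openai/gpt-4": (180, 20, 200),
    "anthropic/claude-3-sonnet": (90, 10, 100),
    "new-model": (8, 2, 10),  # <30 observations -> gets the exploration bonus
}
total_pulls = sum(p[2] for p in providers.values())

def score(successes: int, failures: int, pulls: int) -> float:
    # Thompson Sampling: draw a sample from the Beta posterior over success rate
    sample = random.betavariate(successes + 1, failures + 1)
    # Sparse-data handling: bonus strongly favors under-observed models early on
    bonus = 2 * math.sqrt(math.log(total_pulls) / pulls) if pulls < 30 else 0.0
    return sample + bonus

chosen = max(providers, key=lambda name: score(*providers[name]))
print("Routing to:", chosen)
```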
Speculative Execution (Growth+)
What it does: Races 2-3 providers in parallel and streams the fastest response. 60% faster time-to-first-token.
Benefits
- 60% faster p50 TTFT: 450ms → 180ms
- 62% faster p95 TTFT: 850ms → 320ms
- 96% fewer failures: Automatic fallback if one provider fails
- 20% lower cost: Smart cancellation reduces waste
Usage
Enable Speculative Execution in your request:
```bash
curl -X POST https://api.igrisinertial.com/v1/infer \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-igris-YOUR_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "speculative_mode": "latency"
  }'
```
Speculative Modes
| Mode | Use Case | Speed Priority | Quality Priority | Cost Priority |
|---|---|---|---|---|
| `latency` | Real-time chat | 70% | 20% | 10% |
| `balanced` | General purpose | 40% | 40% | 20% |
| `quality` | Content generation | 20% | 60% | 20% |
| `cost` | Batch processing | 20% | 20% | 60% |
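To make these priorities concrete, here is a hypothetical blended score under `latency` mode. The actual scoring inputs are internal to Igris Overture; the metrics below are invented and assumed normalized to [0, 1].

```python
# Hypothetical: blending normalized provider metrics with the `latency`
# mode priorities (70% speed, 20% quality, 10% cost). Higher is better.
weights = {"speed": 0.70, "quality": 0.20, "cost": 0.10}

provider_a = {"speed": 0.95, "quality": 0.80, "cost": 0.40}  # fast, pricier
provider_b = {"speed": 0.70, "quality": 0.90, "cost": 0.90}  # slower, cheaper

def blend(metrics: dict) -> float:
    return sum(weights[k] * metrics[k] for k in weights)

print(blend(provider_a))  # 0.865 -> wins under `latency` mode
print(blend(provider_b))  # 0.76
```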
How It Works
1. A request comes in
2. Igris Overture races 2-3 providers simultaneously
3. Tokens stream from whichever provider responds first
4. If the fastest provider fails mid-stream, the gateway seamlessly switches to a backup
5. Slower providers are cancelled to minimize waste (sketched below)
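A minimal sketch of the racing pattern, conceptual rather than the gateway's actual implementation, with provider calls stubbed out:

```python
import asyncio

# Conceptual sketch of speculative execution: race providers, take the
# first result, cancel the rest. Provider calls are stubbed with sleeps.
async def call_provider(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stand-in for a real streaming API call
    return f"response from {name}"

async def race() -> str:
    tasks = [
        asyncio.create_task(call_provider("openai/gpt-4", 0.45)),
        asyncio.create_task(call_provider("anthropic/claude-3-sonnet", 0.18)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # smart cancellation: stop paying for slower providers
    return done.pop().result()

print(asyncio.run(race()))  # response from anthropic/claude-3-sonnet
```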
Available in: Growth and Scale tiers
Council Mode (Growth+)
What it does: Sends requests to multiple providers, cross-validates responses, and synthesizes the best answer. +15-20% better quality.
Benefits
- Better reasoning: Multiple AI perspectives on complex problems
- Hallucination detection: Cross-validation catches incorrect facts
- Higher confidence: Consensus from multiple providers
- Quality improvement: +15-20% on reasoning tasks, +12% on factual questions
Usage
```bash
curl -X POST https://api.igrisinertial.com/v1/infer \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-igris-YOUR_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Complex reasoning task"}],
    "council_mode": true,
    "council_providers": ["openai/gpt-4", "anthropic/claude-3-opus"],
    "consensus_threshold": 0.7
  }'
```
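If you are on the OpenAI Python SDK, the same request can be made by passing the gateway-specific fields through `extra_body`; this is a sketch that assumes the same field names as the curl example:

```python
# Sketch: Council Mode via the OpenAI Python SDK. The council_* fields are
# gateway-specific, so they go through extra_body rather than named arguments.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Complex reasoning task"}],
    extra_body={
        "council_mode": True,
        "council_providers": ["openai/gpt-4", "anthropic/claude-3-opus"],
        "consensus_threshold": 0.7,
    },
)
```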
When to Use
- High-stakes decisions requiring validation
- Complex reasoning or analysis tasks
- Quality-critical applications
- Hallucination-sensitive use cases
Trade-offs
- Cost: 3-5x more expensive (multiple provider calls)
- Latency: 2-3x slower (parallel + synthesis)
- Best for: Quality over speed/cost
Available in: Growth and Scale tiers
Semantic Routing
What it does: Uses AI to classify your prompt and route to the best provider for that task type.
How It Works
Igris Overture analyzes each prompt and classifies it into categories:
- Creative (stories, poetry) → Routes to Claude
- Analytical (data, math) → Routes to GPT-4
- Coding → Routes to DeepSeek Coder or GPT-4
- Conversational (chat) → Routes to GPT-3.5 (cost-effective)
- Translation → Routes to specialized models
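Conceptually, the classifier maps a prompt to a category and the category to a provider. Here is a toy keyword-based version for illustration; Igris Overture's actual classifier is model-based, and the routing table below is hypothetical:

```python
# Toy illustration of semantic routing: classify the prompt, then look up
# a provider. The real system uses an AI classifier; the keywords and the
# routing table below are hypothetical.
ROUTES = {
    "creative": "anthropic/claude-3-opus",
    "analytical": "openai/gpt-4",
    "coding": "deepseek/deepseek-coder",
    "conversational": "openai/gpt-3.5-turbo",
}

def classify(prompt: str) -> str:
    p = prompt.lower()
    if any(w in p for w in ("story", "poem", "creative")):
        return "creative"
    if any(w in p for w in ("code", "function", "bug")):
        return "coding"
    if any(w in p for w in ("analyze", "calculate", "data")):
        return "analytical"
    return "conversational"

print(ROUTES[classify("Write a creative story")])  # anthropic/claude-3-opus
```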
Usage
Enable semantic routing:
```bash
curl -X POST https://api.igrisinertial.com/v1/infer \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-igris-YOUR_KEY" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Write a creative story"}],
    "enable_semantic_routing": true
  }'
```
Benefits
- Better quality: Task-specific provider selection
- Lower cost: Routes simple tasks to cheaper models
- Automatic optimization: No manual provider selection needed
Cost-Aware Routing
What it does: Automatically selects the cheapest provider that meets your quality requirements.
Usage
```bash
curl -X POST https://api.igrisinertial.com/v1/infer \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-igris-YOUR_KEY" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Simple Q&A"}],
    "optimization": "cost",
    "min_quality_score": 0.7
  }'
```
Cost Comparison
Typical costs per 1,000 tokens:
- GPT-3.5 Turbo: $0.0015 (cheapest)
- Claude Haiku: $0.0025
- GPT-4 Turbo: $0.01
- Claude Opus: $0.015
- GPT-4: $0.03 (most expensive)
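As a worked example, using the per-1,000-token rates above on a hypothetical 2,000-token job:

```python
# Worked example: cost of a 2,000-token request at the per-1K rates above.
RATES_PER_1K = {
    "gpt-3.5-turbo": 0.0015,
    "claude-haiku": 0.0025,
    "gpt-4-turbo": 0.01,
    "claude-opus": 0.015,
    "gpt-4": 0.03,
}
tokens = 2000
for model, rate in RATES_PER_1K.items():
    print(f"{model}: ${tokens / 1000 * rate:.4f}")
# gpt-3.5-turbo: $0.0030 ... gpt-4: $0.0600 -- a 20x spread on the same job
```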
When to Use
- Batch processing jobs
- Simple Q&A applications
- Budget-constrained workloads
- High-volume traffic
Automatic Failover
All routing strategies include automatic failover:
- Health Monitoring: Continuous provider health checks
- Circuit Breakers: Automatic disabling of unhealthy providers
- Graceful Fallback: Seamless switching to backup providers
- No Dropped Requests: Zero failed requests during provider outages
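The circuit-breaker behavior can be pictured as a simple state machine. Here is a minimal sketch; the thresholds are illustrative, not Igris Overture's actual values:

```python
import time

# Minimal circuit-breaker sketch: open the circuit after repeated failures,
# then allow a probe request once a cooldown has elapsed. Illustrative only.
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # circuit closed: provider considered healthy
        # circuit open: only let a probe through after the cooldown
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # recovery: close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip: disable the provider
```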
What You Get
- 99.9% uptime even when individual providers fail
- Sub-30-second failover to healthy providers
- Automatic recovery when providers come back online
- Complete transparency via response metadata
Routing Preview API
Preview which provider would be selected without actually making the request:
`POST https://api.igrisinertial.com/v1/routing/preview`

```json
{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Hello"}],
  "routing_policy": "thompson-sampling"
}
```
Response:
```json
{
  "selected_provider": "openai/gpt-4",
  "reason": "Highest Thompson Sampling score (0.87)",
  "confidence": 0.87,
  "alternatives": [
    {
      "provider": "anthropic/claude-3-sonnet",
      "score": 0.78,
      "estimated_cost_usd": 0.012,
      "estimated_latency_ms": 250
    }
  ],
  "estimated_cost_usd": 0.015,
  "estimated_latency_ms": 200
}
```
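You can call the preview endpoint with any HTTP client; for example, with Python's `requests`, assuming the same bearer-token auth scheme as the inference examples above:

```python
import requests

# Sketch: calling the routing preview endpoint directly. Assumes the same
# bearer-token authentication as the inference endpoint.
resp = requests.post(
    "https://api.igrisinertial.com/v1/routing/preview",
    headers={"Authorization": "Bearer sk-igris-YOUR_KEY"},
    json={
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Hello"}],
        "routing_policy": "thompson-sampling",
    },
    timeout=10,
)
preview = resp.json()
print(preview["selected_provider"], preview["confidence"])
```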
Best Practices
When to Use Each Strategy
- Thompson Sampling: Default for all production workloads
- Speculative Execution: Real-time chat, interactive apps (Growth+)
- Council Mode: High-stakes decisions, content quality (Growth+)
- Semantic Routing: Mixed workloads (chat + code + analysis)
- Cost-Aware: Batch jobs, budget-conscious applications
Performance Tips
- Enable Speculative Execution for latency-critical apps (Growth+)
- Use Council Mode sparingly due to higher cost
- Let Thompson Sampling run for 100+ requests to learn optimal routing
- Monitor cost per request in your dashboard
Cost Optimization
- Start with Cost-Aware Routing for predictable workloads
- Use Speculative Execution in `cost` mode for 20% savings
- Route simple queries to GPT-3.5 via Semantic Routing
- Set budget alerts in your dashboard
FAQ
Which routing strategy should I use?
Thompson Sampling is recommended for 95% of use cases. It automatically learns and adapts to your workload.
Can I manually select a provider?
Yes, specify the provider explicitly:
```python
# Note: the OpenAI SDK rejects unknown keyword arguments, so gateway-specific
# fields such as `provider` are passed through `extra_body`.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"provider": "anthropic"},  # Force Anthropic
)
```
How long does Thompson Sampling take to learn?
v1.5.0 Enhanced Learning (50% faster for new models):
- 50 requests: New models with warm start reach baseline performance (was 100)
- 100 requests: Basic per-tenant preferences established
- 500 requests: Adaptive weights converge, stable routing patterns
- 1000+ requests: Near-optimal provider selection with 90%+ confidence
New Model Advantage: When adding similar models (e.g., gpt-4 → gpt-4.5), warm start reduces learning time from ~100 requests to ~50 requests by inheriting historical performance data.
Is Speculative Execution always faster?
Typically 60% faster for p50 latency, but increases cost by 1.3-2x. Best for latency-critical applications on Growth+ tiers.
Next Steps
- Provider Keys - Connect your provider API keys
- SDK Usage - Integration guides
- Pricing - Feature availability by tier