Routing Policies & Strategies
TL;DR: Igris Overture automatically selects the best AI provider for each request using Thompson Sampling, a Bayesian bandit algorithm that learns from outcomes. Growth+ tiers unlock Speculative Execution (60% faster) and Council Mode (+15-20% better quality).
Overview
Igris Overture uses intelligent routing to automatically select the best provider for each request based on:
- Speed - Which provider responds fastest
- Cost - Which provider offers the best value
- Quality - Which provider gives the best answers
- Availability - Which providers are currently healthy
Available Routing Strategies
| Strategy | Description | Best For | Tier |
|---|---|---|---|
| Thompson Sampling | AI-powered learning algorithm | Production workloads | All tiers |
| Speculative Execution | Race multiple providers in parallel | Latency-critical apps | Growth+ |
| Council Mode | Multi-provider consensus | High-stakes decisions | Growth+ |
| Cost-Aware | Automatic cheapest provider selection | Budget-conscious apps | All tiers |
| Semantic Routing | AI classification by task type | Mixed workloads | All tiers |
Thompson Sampling (Recommended)
What it does: Automatically learns which providers work best for your workload and routes traffic accordingly.
Thompson Sampling delivers 25% improved scoring accuracy through adaptive per-tenant reward weighting, 50% faster convergence for new models via informed priors, and 15% better decisions in the first 50 requests with intelligent sparse data handling. The system automatically adapts to your unique workload patterns, requiring zero configuration.
How It Works
Thompson Sampling continuously learns from every request:
- Smart Initialization: New models start with informed priors from similar providers (50% faster convergence)
- Per-Tenant Learning: Automatically adapts reward weights to your unique workload patterns
- Request Tracking: Records latency, cost, success rate, cache hits, and quality scores
- Adaptive Weighting: Uses gradient descent to optimize routing decisions for your specific needs
- Exploration: Maintains balanced exploration to discover improvements while exploiting known winners
Usage
Thompson Sampling is enabled by default. No configuration needed:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.igrisinertial.com/v1",
    api_key="sk-igris-YOUR_KEY"
)

# Automatic Thompson Sampling routing
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
Response Metadata
Every response includes routing information:
```json
{
  "choices": [...],
  "metadata": {
    "provider": "openai/gpt-4",
    "routing_decision": "thompson-sampling",
    "latency_ms": 234,
    "cost_usd": 0.00525
  }
}
```
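If you use the OpenAI Python SDK, `metadata` is not part of the SDK's typed response model, so, as a sketch (assuming the gateway returns the field at the top level as shown above), you can read it from the untyped extras:

```python
# Sketch: reading gateway-added routing metadata from an OpenAI SDK response.
# `metadata` is not a typed field on ChatCompletion, so it lands in
# `model_extra` (pydantic's bucket for unrecognized fields) -- assuming
# Igris Overture returns it at the top level as shown above.
meta = (response.model_extra or {}).get("metadata", {})
print(f"Routed to {meta.get('provider')} "
      f"in {meta.get('latency_ms')} ms "
      f"for ${meta.get('cost_usd')}")
```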
Advanced: How Thompson Sampling Scores Providers
v1.5.0 Adaptive Reward Weighting:
The system now learns optimal weights for your specific workload using gradient descent:
Default Starting Weights:
```
score = (40% × latency) + (30% × success) + (15% × cost) + (10% × cache) + (5% × quality)
```
After 500+ requests, weights adapt per-tenant:
- Latency-sensitive workloads: Automatically increases latency weight (e.g., 55%)
- Cost-conscious workloads: Automatically increases cost weight (e.g., 35%)
- Quality-focused workloads: Automatically increases quality weight (e.g., 25%)
Learning Algorithm:
```
w_new = w_old + α × (correlation - w_old)
```
- Learning rate α = 0.05 (smooth convergence)
- Updates every 500 requests
- Confidence score = 1 / (1 + e^(-(n - 500)/100))
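As a concrete illustration (not the production code), here is one update step for a latency-sensitive tenant, using the formula above with a hypothetical observed correlation:

```python
import math

# Illustrative only: one adaptive-weight update step using the formula above.
alpha = 0.05                 # learning rate
w_latency = 0.40             # current latency weight (the default)
correlation = 0.55           # observed latency correlation for this tenant (hypothetical)

w_latency = w_latency + alpha * (correlation - w_latency)
print(round(w_latency, 4))   # 0.4075 -- weights drift toward the tenant's signal

# Confidence in the learned weights as a function of request count n
n = 600
confidence = 1 / (1 + math.exp(-(n - 500) / 100))
print(round(confidence, 2))  # ~0.73 after 600 requests
```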
New Model Warm Start:
When a new model is added (e.g., gpt-4.5-turbo), it inherits 50% of similar models' performance (e.g., gpt-4) instead of starting from scratch. This reduces learning time from ~100 requests to ~50 requests.
Sparse Data Handling: For models with fewer than 30 observations, the system applies a UCB-style exploration bonus:
```
exploration_bonus = 2 × sqrt(ln(total_pulls) / pulls)
```
This prevents premature exploitation and improves decisions by 15% in the first 50 requests.
Providers with higher scores are selected more frequently, but the system maintains 10-15% exploration to discover improvements (exploration rate adapts per-tenant).
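Putting the pieces together, here is a simplified sketch of the per-provider selection loop. It is illustrative only: provider names and statistics are hypothetical, and the real scorer blends the weighted metrics described above rather than success rate alone.

```python
import math
import random

# Simplified sketch: Thompson Sampling provider selection with the
# UCB-style exploration bonus for sparse data. All numbers are hypothetical.
providers = {
    # name: (successes, failures, pulls)
    "openai/gpt-4": (180, 20, 200),
    "anthropic/claude-3-sonnet": (90, 10, 100),
    "new-model": (8, 2, 10),  # <30 observations -> gets the exploration bonus
}
total_pulls = sum(p[2] for p in providers.values())

def score(successes: int, failures: int, pulls: int) -> float:
    # Thompson Sampling: draw a sample from the Beta posterior over success rate
    sample = random.betavariate(successes + 1, failures + 1)
    # Sparse-data handling: bonus strongly favors under-observed models early on
    bonus = 2 * math.sqrt(math.log(total_pulls) / pulls) if pulls < 30 else 0.0
    return sample + bonus

chosen = max(providers, key=lambda name: score(*providers[name]))
print("Routing to:", chosen)
```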
Speculative Execution (Growth+)
What it does: Races 2-3 providers in parallel and streams the fastest response. 60% faster time-to-first-token.
Benefits
- 60% faster p50 TTFT: 450ms → 180ms
- 62% faster p95 TTFT: 850ms → 320ms
- 96% fewer failures: Automatic fallback if one provider fails
- 20% lower cost: Smart cancellation reduces waste
Usage
Enable Speculative Execution in your request:
```bash
curl -X POST https://api.igrisinertial.com/v1/infer \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-igris-YOUR_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "speculative_mode": "latency"
  }'
```
Speculative Modes
| Mode | Use Case | Speed Priority | Quality Priority | Cost Priority |
|---|---|---|---|---|
| `latency` | Real-time chat | 70% | 20% | 10% |
| `balanced` | General purpose | 40% | 40% | 20% |
| `quality` | Content generation | 20% | 60% | 20% |
| `cost` | Batch processing | 20% | 20% | 60% |
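To make these priorities concrete, here is a hypothetical blended score under `latency` mode. The actual scoring inputs are internal to Igris Overture; the metrics below are invented and assumed normalized to [0, 1].

```python
# Hypothetical: blending normalized provider metrics with the `latency`
# mode priorities (70% speed, 20% quality, 10% cost). Higher is better.
weights = {"speed": 0.70, "quality": 0.20, "cost": 0.10}

provider_a = {"speed": 0.95, "quality": 0.80, "cost": 0.40}  # fast, pricier
provider_b = {"speed": 0.70, "quality": 0.90, "cost": 0.90}  # slower, cheaper

def blend(metrics: dict) -> float:
    return sum(weights[k] * metrics[k] for k in weights)

print(blend(provider_a))  # 0.865 -> wins under `latency` mode
print(blend(provider_b))  # 0.76
```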
How It Works
1. A request comes in
2. Igris Overture races 2-3 providers simultaneously
3. Tokens stream from whichever provider responds first
4. If the fastest provider fails mid-stream, the gateway seamlessly switches to a backup
5. Slower providers are cancelled to minimize waste (sketched below)
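A minimal sketch of the racing pattern, conceptual rather than the gateway's actual implementation, with provider calls stubbed out:

```python
import asyncio

# Conceptual sketch of speculative execution: race providers, take the
# first result, cancel the rest. Provider calls are stubbed with sleeps.
async def call_provider(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stand-in for a real streaming API call
    return f"response from {name}"

async def race() -> str:
    tasks = [
        asyncio.create_task(call_provider("openai/gpt-4", 0.45)),
        asyncio.create_task(call_provider("anthropic/claude-3-sonnet", 0.18)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # smart cancellation: stop paying for slower providers
    return done.pop().result()

print(asyncio.run(race()))  # response from anthropic/claude-3-sonnet
```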
Available in: Growth and Scale tiers
Council Mode (Growth+)
What it does: Sends requests to multiple providers, cross-validates responses, and synthesizes the best answer. +15-20% better quality.
Benefits
- Better reasoning: Multiple AI perspectives on complex problems
- Hallucination detection: Cross-validation catches incorrect facts
- Higher confidence: Consensus from multiple providers
- Quality improvement: +15-20% on reasoning tasks, +12% on factual questions
Usage
```bash
curl -X POST https://api.igrisinertial.com/v1/infer \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-igris-YOUR_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Complex reasoning task"}],
    "council_mode": true,
    "council_providers": ["openai/gpt-4", "anthropic/claude-3-opus"],
    "consensus_threshold": 0.7
  }'
```
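If you are on the OpenAI Python SDK, the same request can be made by passing the gateway-specific fields through `extra_body`; this is a sketch that assumes the same field names as the curl example:

```python
# Sketch: Council Mode via the OpenAI Python SDK. The council_* fields are
# gateway-specific, so they go through extra_body rather than named arguments.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Complex reasoning task"}],
    extra_body={
        "council_mode": True,
        "council_providers": ["openai/gpt-4", "anthropic/claude-3-opus"],
        "consensus_threshold": 0.7,
    },
)
```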
When to Use
- High-stakes decisions requiring validation
- Complex reasoning or analysis tasks
- Quality-critical applications
- Hallucination-sensitive use cases
Trade-offs
- Cost: 3-5x more expensive (multiple provider calls)
- Latency: 2-3x slower (parallel + synthesis)
- Best for: Quality over speed/cost
Available in: Growth and Scale tiers
Semantic Routing
What it does: Uses AI to classify your prompt and route to the best provider for that task type.
How It Works
Igris Overture analyzes each prompt and classifies it into categories:
- Creative (stories, poetry) → Routes to Claude
- Analytical (data, math) → Routes to GPT-4
- Coding → Routes to DeepSeek Coder or GPT-4
- Conversational (chat) → Routes to GPT-3.5 (cost-effective)
- Translation → Routes to specialized models
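Conceptually, the classifier maps a prompt to a category and the category to a provider. Here is a toy keyword-based version for illustration; Igris Overture's actual classifier is model-based, and the routing table below is hypothetical:

```python
# Toy illustration of semantic routing: classify the prompt, then look up
# a provider. The real system uses an AI classifier; the keywords and the
# routing table below are hypothetical.
ROUTES = {
    "creative": "anthropic/claude-3-opus",
    "analytical": "openai/gpt-4",
    "coding": "deepseek/deepseek-coder",
    "conversational": "openai/gpt-3.5-turbo",
}

def classify(prompt: str) -> str:
    p = prompt.lower()
    if any(w in p for w in ("story", "poem", "creative")):
        return "creative"
    if any(w in p for w in ("code", "function", "bug")):
        return "coding"
    if any(w in p for w in ("analyze", "calculate", "data")):
        return "analytical"
    return "conversational"

print(ROUTES[classify("Write a creative story")])  # anthropic/claude-3-opus
```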
Usage
Enable semantic routing:
```bash
curl -X POST https://api.igrisinertial.com/v1/infer \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-igris-YOUR_KEY" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Write a creative story"}],
    "enable_semantic_routing": true
  }'
```
Benefits
- Better quality: Task-specific provider selection
- Lower cost: Routes simple tasks to cheaper models
- Automatic optimization: No manual provider selection needed
Cost-Aware Routing
What it does: Automatically selects the cheapest provider that meets your quality requirements.
Usage
```bash
curl -X POST https://api.igrisinertial.com/v1/infer \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-igris-YOUR_KEY" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Simple Q&A"}],
    "optimization": "cost",
    "min_quality_score": 0.7
  }'
```
Cost Comparison
Typical costs per 1,000 tokens:
- GPT-3.5 Turbo: $0.0015 (cheapest)
- Claude Haiku: $0.0025
- GPT-4 Turbo: $0.01
- Claude Opus: $0.015
- GPT-4: $0.03 (most expensive)
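As a worked example, using the per-1,000-token rates above on a hypothetical 2,000-token job:

```python
# Worked example: cost of a 2,000-token request at the per-1K rates above.
RATES_PER_1K = {
    "gpt-3.5-turbo": 0.0015,
    "claude-haiku": 0.0025,
    "gpt-4-turbo": 0.01,
    "claude-opus": 0.015,
    "gpt-4": 0.03,
}
tokens = 2000
for model, rate in RATES_PER_1K.items():
    print(f"{model}: ${tokens / 1000 * rate:.4f}")
# gpt-3.5-turbo: $0.0030 ... gpt-4: $0.0600 -- a 20x spread on the same job
```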
When to Use
- Batch processing jobs
- Simple Q&A applications
- Budget-constrained workloads
- High-volume traffic
Automatic Failover
All routing strategies include automatic failover:
- Health Monitoring: Continuous provider health checks
- Circuit Breakers: Automatic disabling of unhealthy providers
- Graceful Fallback: Seamless switching to backup providers
- No Dropped Requests: Zero failed requests during provider outages
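The circuit-breaker behavior can be pictured as a simple state machine. Here is a minimal sketch; the thresholds are illustrative, not Igris Overture's actual values:

```python
import time

# Minimal circuit-breaker sketch: open the circuit after repeated failures,
# then allow a probe request once a cooldown has elapsed. Illustrative only.
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # circuit closed: provider considered healthy
        # circuit open: only let a probe through after the cooldown
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # recovery: close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip: disable the provider
```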
What You Get
- 99.9% uptime even when individual providers fail
- Sub-30-second failover to healthy providers
- Automatic recovery when providers come back online
- Complete transparency via response metadata
Routing Preview API
Preview which provider would be selected without actually making the request:
`POST https://api.igrisinertial.com/v1/routing/preview`

```json
{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Hello"}],
  "routing_policy": "thompson-sampling"
}
```
Response:
```json
{
  "selected_provider": "openai/gpt-4",
  "reason": "Highest Thompson Sampling score (0.87)",
  "confidence": 0.87,
  "alternatives": [
    {
      "provider": "anthropic/claude-3-sonnet",
      "score": 0.78,
      "estimated_cost_usd": 0.012,
      "estimated_latency_ms": 250
    }
  ],
  "estimated_cost_usd": 0.015,
  "estimated_latency_ms": 200
}
```
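You can call the preview endpoint with any HTTP client; for example, with Python's `requests`, assuming the same bearer-token auth scheme as the inference examples above:

```python
import requests

# Sketch: calling the routing preview endpoint directly. Assumes the same
# bearer-token authentication as the inference endpoint.
resp = requests.post(
    "https://api.igrisinertial.com/v1/routing/preview",
    headers={"Authorization": "Bearer sk-igris-YOUR_KEY"},
    json={
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Hello"}],
        "routing_policy": "thompson-sampling",
    },
    timeout=10,
)
preview = resp.json()
print(preview["selected_provider"], preview["confidence"])
```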
Best Practices
When to Use Each Strategy
- Thompson Sampling: Default for all production workloads
- Speculative Execution: Real-time chat, interactive apps (Growth+)
- Council Mode: High-stakes decisions, content quality (Growth+)
- Semantic Routing: Mixed workloads (chat + code + analysis)
- Cost-Aware: Batch jobs, budget-conscious applications
Performance Tips
- Enable Speculative Execution for latency-critical apps (Growth+)
- Use Council Mode sparingly due to higher cost
- Let Thompson Sampling run for 100+ requests to learn optimal routing
- Monitor cost per request in your dashboard
Cost Optimization
- Start with Cost-Aware Routing for predictable workloads
- Use Speculative Execution in `cost` mode for 20% savings
- Route simple queries to GPT-3.5 via Semantic Routing
- Set budget alerts in your dashboard
FAQ
Which routing strategy should I use?
Thompson Sampling is recommended for 95% of use cases. It automatically learns and adapts to your workload.
Can I manually select a provider?
Yes, specify the provider explicitly:
```python
# Note: the OpenAI SDK rejects unknown keyword arguments, so gateway-specific
# fields such as `provider` are passed through `extra_body`.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"provider": "anthropic"},  # Force Anthropic
)
```
How long does Thompson Sampling take to learn?
v1.5.0 Enhanced Learning (50% faster for new models):
- 50 requests: New models with warm start reach baseline performance (was 100)
- 100 requests: Basic per-tenant preferences established
- 500 requests: Adaptive weights converge, stable routing patterns
- 1000+ requests: Near-optimal provider selection with 90%+ confidence
New Model Advantage: When adding similar models (e.g., gpt-4 → gpt-4.5), warm start reduces learning time from ~100 requests to ~50 requests by inheriting historical performance data.
Is Speculative Execution always faster?
Typically 60% faster for p50 latency, but increases cost by 1.3-2x. Best for latency-critical applications on Growth+ tiers.
Next Steps
- Provider Keys - Connect your provider API keys
- SDK Usage - Integration guides
- Pricing - Feature availability by tier