Routing Policies & Strategies

TL;DR: Igris Overture automatically selects the best AI provider for each request using Thompson Sampling (a smart learning algorithm). Growth+ tiers unlock Speculative Execution (60% faster) and Council Mode (+15-20% better quality).


Overview

Igris Overture uses intelligent routing to automatically select the best provider for each request based on:

  • Speed - Which provider responds fastest
  • Cost - Which provider offers the best value
  • Quality - Which provider gives the best answers
  • Availability - Which providers are currently healthy

Available Routing Strategies

Strategy              | Description                           | Best For              | Tier
----------------------|---------------------------------------|-----------------------|----------
Thompson Sampling     | AI-powered learning algorithm         | Production workloads  | All tiers
Speculative Execution | Race multiple providers in parallel   | Latency-critical apps | Growth+
Council Mode          | Multi-provider consensus              | High-stakes decisions | Growth+
Cost-Aware            | Automatic cheapest provider selection | Budget-conscious apps | All tiers
Semantic Routing      | AI classification by task type        | Mixed workloads       | All tiers

Thompson Sampling (Recommended)

What it does: Automatically learns which providers work best for your workload and routes traffic accordingly.

Thompson Sampling delivers 25% improved scoring accuracy through adaptive per-tenant reward weighting, 50% faster convergence for new models via informed priors, and 15% better decisions in the first 50 requests with intelligent sparse data handling. The system automatically adapts to your unique workload patterns, requiring zero configuration.

How It Works

Thompson Sampling continuously learns from every request:

  1. Smart Initialization: New models start with informed priors from similar providers (50% faster convergence)
  2. Per-Tenant Learning: Automatically adapts reward weights to your unique workload patterns
  3. Request Tracking: Records latency, cost, success rate, cache hits, and quality scores
  4. Adaptive Weighting: Uses gradient descent to optimize routing decisions for your specific needs
  5. Exploration: Maintains balanced exploration to discover improvements while exploiting known winners
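The learning loop above can be pictured as a Beta-Bernoulli bandit: each provider keeps success/failure counts, each request samples a plausible success rate per provider, and the highest sample wins. This is an illustrative sketch only, not Igris Overture's actual implementation; the provider names and the binary reward signal are placeholders.

```python
import random

class ThompsonRouter:
    """Minimal Beta-Bernoulli Thompson Sampling over providers."""

    def __init__(self, providers, prior=(1.0, 1.0)):
        # (alpha, beta) per provider; an informed prior from a similar
        # model would start these above (1, 1) for a warm start
        self.stats = {p: list(prior) for p in providers}

    def select(self):
        # Sample a plausible success rate per provider, pick the best
        samples = {p: random.betavariate(a, b)
                   for p, (a, b) in self.stats.items()}
        return max(samples, key=samples.get)

    def record(self, provider, success):
        # Update the posterior with the observed outcome (1 or 0)
        a, b = self.stats[provider]
        self.stats[provider] = [a + success, b + (1 - success)]

router = ThompsonRouter(["openai/gpt-4", "anthropic/claude-3-sonnet"])
choice = router.select()
router.record(choice, success=1)
```

Because selection samples from a distribution rather than always taking the best mean, under-explored providers still get occasional traffic, which is what keeps exploration alive while exploiting known winners.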

Usage

Thompson Sampling is enabled by default. No configuration needed:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.igrisinertial.com/v1",
    api_key="sk-igris-YOUR_KEY"
)

# Automatic Thompson Sampling routing
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)

Response Metadata

Every response includes routing information:

{
  "choices": [...],
  "metadata": {
    "provider": "openai/gpt-4",
    "routing_decision": "thompson-sampling",
    "latency_ms": 234,
    "cost_usd": 0.00525
  }
}

Advanced: How Thompson Sampling Scores Providers

v1.5.0 Adaptive Reward Weighting:

The system now learns optimal weights for YOUR specific workload using gradient descent:

Default Starting Weights:

score = (40% × latency) + (30% × success) + (15% × cost) + (10% × cache) + (5% × quality)

After 500+ requests, weights adapt per-tenant:

  • Latency-sensitive workloads: Automatically increases latency weight (e.g., 55%)
  • Cost-conscious workloads: Automatically increases cost weight (e.g., 35%)
  • Quality-focused workloads: Automatically increases quality weight (e.g., 25%)

Learning Algorithm:

w_new = w_old + α * (correlation - w_old)

  • Learning rate α = 0.05 (smooth convergence)
  • Updates every 500 requests
  • Confidence score = 1 / (1 + e^(-(n - 500)/100))
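The update rule is an exponential moving average: each cycle nudges the weight toward the observed correlation. A sketch using the constants stated above (α = 0.05, sigmoid confidence centered at 500 requests); the starting weight and correlation value are illustrative:

```python
import math

ALPHA = 0.05  # learning rate from the formula above

def update_weight(w_old, correlation):
    # Move the weight a small step toward the observed correlation
    return w_old + ALPHA * (correlation - w_old)

def confidence(n_requests):
    # Sigmoid centered at 500 requests, scale 100
    return 1.0 / (1.0 + math.exp(-(n_requests - 500) / 100))

w = 0.40  # default latency weight
for _ in range(10):  # ten update cycles for a latency-sensitive tenant
    w = update_weight(w, correlation=0.55)

print(round(w, 3))              # drifts from 0.40 toward 0.55
print(round(confidence(500), 2))  # 0.5 at the 500-request midpoint
```

Because each step only closes 5% of the remaining gap, the weight converges smoothly rather than oscillating with noisy per-cycle correlations.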

New Model Warm Start: When a new model is added (e.g., gpt-4.5-turbo), it inherits 50% of similar models' performance (e.g., gpt-4) instead of starting from scratch. This reduces learning time from ~100 requests to ~50 requests.

Sparse Data Handling: For models with <30 observations, applies UCB-style exploration bonus:

exploration_bonus = 2 × sqrt(ln(total_pulls) / pulls)

This prevents premature exploitation and improves decisions by 15% in the first 50 requests.
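The bonus grows when a model has few pulls relative to total traffic, so sparse models get tried before being written off. A sketch of the formula above (assuming the natural log, as in standard UCB1):

```python
import math

def exploration_bonus(total_pulls, pulls):
    # 2 * sqrt(ln(total) / pulls): large for rarely-tried models
    return 2 * math.sqrt(math.log(total_pulls) / pulls)

# A new model with 5 pulls out of 1,000 gets a large bonus...
new_model = exploration_bonus(1000, 5)
# ...while an established model with 600 pulls gets almost none
established = exploration_bonus(1000, 600)

assert new_model > established
```

Adding this bonus to a sparse model's score inflates it temporarily; as pulls accumulate past the ~30-observation mark, the bonus shrinks and the true posterior takes over.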

Providers with higher scores are selected more frequently, but the system maintains 10-15% exploration to discover improvements (exploration rate adapts per-tenant).


Speculative Execution (Growth+)

What it does: Races 2-3 providers in parallel and streams the fastest response. 60% faster time-to-first-token.

Benefits

  • 60% faster p50 TTFT: 450ms → 180ms
  • 62% faster p95 TTFT: 850ms → 320ms
  • 96% fewer failures: Automatic fallback if one provider fails
  • 20% lower cost: Smart cancellation reduces waste

Usage

Enable Speculative Execution in your request:

curl -X POST https://api.igrisinertial.com/v1/infer \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-igris-YOUR_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "speculative_mode": "latency"
  }'

Speculative Modes

Mode     | Use Case           | Speed Priority | Quality Priority | Cost Priority
---------|--------------------|----------------|------------------|--------------
latency  | Real-time chat     | 70%            | 20%              | 10%
balanced | General purpose    | 40%            | 40%              | 20%
quality  | Content generation | 20%            | 60%              | 20%
cost     | Batch processing   | 20%            | 20%              | 60%

How It Works

  1. Request comes in
  2. Igris Overture races 2-3 providers simultaneously
  3. Stream tokens from whoever responds first
  4. If the fastest provider fails mid-stream, seamlessly switch to backup
  5. Cancel slower providers to minimize waste
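Steps 2-5 can be sketched with asyncio: launch all providers in parallel, take the first to finish, and cancel the rest. The provider calls here are stand-in coroutines with hard-coded delays, not real API calls, and mid-stream failover (step 4) is omitted for brevity:

```python
import asyncio

async def call_provider(name, delay):
    # Stand-in for a streaming provider call
    await asyncio.sleep(delay)
    return name

async def speculative_race(providers):
    # Launch all providers in parallel, take the first to finish,
    # then cancel the slower ones to limit wasted spend
    tasks = [asyncio.create_task(call_provider(n, d)) for n, d in providers]
    done, pending = await asyncio.wait(
        tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

winner = asyncio.run(speculative_race([
    ("openai/gpt-4", 0.05),
    ("anthropic/claude-3-sonnet", 0.2),
]))
print(winner)  # the faster provider wins the race
```

Cancelling the pending tasks is what caps the cost overhead: only the winner streams to completion.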

Available in: Growth and Scale tiers


Council Mode (Growth+)

What it does: Sends requests to multiple providers, cross-validates responses, and synthesizes the best answer. +15-20% better quality.

Benefits

  • Better reasoning: Multiple AI perspectives on complex problems
  • Hallucination detection: Cross-validation catches incorrect facts
  • Higher confidence: Consensus from multiple providers
  • Quality improvement: +15-20% on reasoning tasks, +12% on factual questions

Usage

curl -X POST https://api.igrisinertial.com/v1/infer \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-igris-YOUR_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Complex reasoning task"}],
    "council_mode": true,
    "council_providers": ["openai/gpt-4", "anthropic/claude-3-opus"],
    "consensus_threshold": 0.7
  }'
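A simplified view of how a consensus_threshold might gate the result: collect each provider's answer, measure agreement, and accept the majority answer only if agreement clears the threshold. This is an illustrative sketch; Igris Overture's actual cross-validation and synthesis step is more involved than a majority vote.

```python
from collections import Counter

def council_consensus(answers, threshold=0.7):
    """Return the majority answer if agreement clears the threshold."""
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    if agreement >= threshold:
        return top, agreement
    return None, agreement  # no consensus: flag for review

answers = ["Paris", "Paris", "Paris", "Lyon"]
best, agreement = council_consensus(answers, threshold=0.7)
print(best, agreement)  # Paris 0.75
```

The dissenting answer is what surfaces potential hallucinations: a low agreement score signals that at least one provider disagrees on the facts.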

When to Use

  • High-stakes decisions requiring validation
  • Complex reasoning or analysis tasks
  • Quality-critical applications
  • Hallucination-sensitive use cases

Trade-offs

  • Cost: 3-5x more expensive (multiple provider calls)
  • Latency: 2-3x slower (parallel + synthesis)
  • Best for: Quality over speed/cost

Available in: Growth and Scale tiers


Semantic Routing

What it does: Uses AI to classify your prompt and route to the best provider for that task type.

How It Works

Igris Overture analyzes each prompt and classifies it into categories:

  • Creative (stories, poetry) → Routes to Claude
  • Analytical (data, math) → Routes to GPT-4
  • Coding → Routes to DeepSeek Coder or GPT-4
  • Conversational (chat) → Routes to GPT-3.5 (cost-effective)
  • Translation → Routes to specialized models
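The category-to-provider mapping above can be pictured with a toy classifier. Real semantic routing uses an AI model to classify prompts; the keyword matching and exact provider IDs below are placeholder assumptions for illustration.

```python
ROUTES = {
    "creative": "anthropic/claude-3-opus",
    "analytical": "openai/gpt-4",
    "coding": "deepseek/deepseek-coder",
    "conversational": "openai/gpt-3.5-turbo",
}

KEYWORDS = {
    "creative": ("story", "poem", "poetry"),
    "analytical": ("analyze", "math", "data"),
    "coding": ("function", "bug", "code"),
}

def classify(prompt):
    # Crude keyword matching as a stand-in for an AI classifier
    lowered = prompt.lower()
    for category, words in KEYWORDS.items():
        if any(w in lowered for w in words):
            return category
    return "conversational"  # default: route chat to the cheap model

def route(prompt):
    return ROUTES[classify(prompt)]

print(route("Write a creative story"))   # Claude for creative work
print(route("Fix this bug in my code"))  # coding model for code
```

The default branch is where the cost savings come from: anything that doesn't need a specialist falls through to the cheapest conversational model.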

Usage

Enable semantic routing:

curl -X POST https://api.igrisinertial.com/v1/infer \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-igris-YOUR_KEY" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Write a creative story"}],
    "enable_semantic_routing": true
  }'

Benefits

  • Better quality: Task-specific provider selection
  • Lower cost: Routes simple tasks to cheaper models
  • Automatic optimization: No manual provider selection needed

Cost-Aware Routing

What it does: Automatically selects the cheapest provider that meets your quality requirements.

Usage

curl -X POST https://api.igrisinertial.com/v1/infer \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-igris-YOUR_KEY" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Simple Q&A"}],
    "optimization": "cost",
    "min_quality_score": 0.7
  }'

Cost Comparison

Typical costs per 1,000 tokens:

  • GPT-3.5 Turbo: $0.0015 (cheapest)
  • Claude Haiku: $0.0025
  • GPT-4 Turbo: $0.01
  • Claude Opus: $0.015
  • GPT-4: $0.03 (most expensive)
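Cost-aware selection can be pictured as picking the cheapest model whose quality clears min_quality_score. The prices below are the per-1,000-token figures listed above; the quality scores are made-up placeholders for illustration.

```python
# (price_per_1k_tokens, quality_score) — quality scores are illustrative
MODELS = {
    "gpt-3.5-turbo": (0.0015, 0.72),
    "claude-haiku":  (0.0025, 0.75),
    "gpt-4-turbo":   (0.01,   0.90),
    "claude-opus":   (0.015,  0.92),
    "gpt-4":         (0.03,   0.93),
}

def cheapest_meeting_quality(min_quality):
    # Filter to models that clear the bar, then take the cheapest
    eligible = {m: p for m, (p, q) in MODELS.items() if q >= min_quality}
    if not eligible:
        raise ValueError("no model meets the quality floor")
    return min(eligible, key=eligible.get)

print(cheapest_meeting_quality(0.7))   # gpt-3.5-turbo
print(cheapest_meeting_quality(0.85))  # gpt-4-turbo
```

Raising the quality floor trades cost for quality: at 0.7 the $0.0015 model qualifies, while at 0.85 the cheapest eligible option is 6-7x more expensive.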

When to Use

  • Batch processing jobs
  • Simple Q&A applications
  • Budget-constrained workloads
  • High-volume traffic

Automatic Failover

All routing strategies include automatic failover:

  1. Health Monitoring: Continuous provider health checks
  2. Circuit Breakers: Automatic disabling of unhealthy providers
  3. Graceful Fallback: Seamless switching to backup providers
  4. No Dropped Requests: Zero failed requests during provider outages
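A minimal circuit breaker, as in step 2: trip open after consecutive failures, stop routing to the provider, and re-admit a probe request after a cooldown. The thresholds here are arbitrary examples, not Igris Overture's actual settings.

```python
import time

class CircuitBreaker:
    """Trips open after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3, cooldown_s=30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def is_available(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a probe request once the cooldown elapses
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None  # close the circuit on recovery
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker(max_failures=3)
for _ in range(3):
    breaker.record(success=False)
print(breaker.is_available())  # False: provider removed from rotation
breaker.record(success=True)
print(breaker.is_available())  # True: provider healthy again
```

Routing each request only to providers whose breaker reports available is what makes fallback seamless: unhealthy providers simply drop out of the candidate set.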

What You Get

  • 99.9% uptime even when individual providers fail
  • Sub-30-second failover to healthy providers
  • Automatic recovery when providers come back online
  • Complete transparency via response metadata

Routing Preview API

Preview which provider would be selected without actually making the request:

POST https://api.igrisinertial.com/v1/routing/preview

{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Hello"}],
  "routing_policy": "thompson-sampling"
}

Response:

{
  "selected_provider": "openai/gpt-4",
  "reason": "Highest Thompson Sampling score (0.87)",
  "confidence": 0.87,
  "alternatives": [
    {
      "provider": "anthropic/claude-3-sonnet",
      "score": 0.78,
      "estimated_cost_usd": 0.012,
      "estimated_latency_ms": 250
    }
  ],
  "estimated_cost_usd": 0.015,
  "estimated_latency_ms": 200
}
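Once you have a preview response, you can compare the selected provider against its alternatives before committing to a request. A sketch over the sample payload above, picking the cheapest option:

```python
# Sample preview response, as shown above
preview = {
    "selected_provider": "openai/gpt-4",
    "confidence": 0.87,
    "estimated_cost_usd": 0.015,
    "estimated_latency_ms": 200,
    "alternatives": [
        {"provider": "anthropic/claude-3-sonnet", "score": 0.78,
         "estimated_cost_usd": 0.012, "estimated_latency_ms": 250},
    ],
}

def cheapest_option(preview):
    # Consider the selected provider plus all alternatives by cost
    options = [(preview["selected_provider"],
                preview["estimated_cost_usd"])]
    options += [(a["provider"], a["estimated_cost_usd"])
                for a in preview["alternatives"]]
    return min(options, key=lambda pc: pc[1])

provider, cost = cheapest_option(preview)
print(provider, cost)  # the Claude alternative is cheaper here
```

This is useful for budget guards: if the cheapest option differs from the routing engine's pick, you can override the provider explicitly for that request.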

Best Practices

When to Use Each Strategy

  1. Thompson Sampling: Default for all production workloads
  2. Speculative Execution: Real-time chat, interactive apps (Growth+)
  3. Council Mode: High-stakes decisions, content quality (Growth+)
  4. Semantic Routing: Mixed workloads (chat + code + analysis)
  5. Cost-Aware: Batch jobs, budget-conscious applications

Performance Tips

  • Enable Speculative Execution for latency-critical apps (Growth+)
  • Use Council Mode sparingly due to higher cost
  • Let Thompson Sampling run for 100+ requests to learn optimal routing
  • Monitor cost per request in your dashboard

Cost Optimization

  • Start with Cost-Aware Routing for predictable workloads
  • Use Speculative Execution in cost mode for 20% savings
  • Route simple queries to GPT-3.5 via Semantic Routing
  • Set budget alerts in your dashboard

FAQ

Which routing strategy should I use?

Thompson Sampling is recommended for 95% of use cases. It automatically learns and adapts to your workload.

Can I manually select a provider?

Yes, specify the provider explicitly:

response = client.chat.completions.create(
    model="gpt-4",
    extra_body={"provider": "anthropic"},  # Force Anthropic
    messages=[{"role": "user", "content": "Hello"}]
)

(The official OpenAI Python SDK rejects unknown keyword arguments, so custom fields like provider must be passed via extra_body.)

How long does Thompson Sampling take to learn?

v1.5.0 Enhanced Learning (50% faster for new models):

  • 50 requests: New models with warm start reach baseline performance (was 100)
  • 100 requests: Basic per-tenant preferences established
  • 500 requests: Adaptive weights converge, stable routing patterns
  • 1000+ requests: Near-optimal provider selection with 90%+ confidence

New Model Advantage: When adding similar models (e.g., gpt-4 → gpt-4.5), warm start reduces learning time from ~100 requests to ~50 requests by inheriting historical performance data.

Is Speculative Execution always faster?

Typically 60% faster for p50 latency, but increases cost by 1.3-2x. Best for latency-critical applications on Growth+ tiers.


Next Steps