Observability

TL;DR: Complete visibility into every request with real-time metrics, distributed tracing, and cost tracking. Export to your existing monitoring tools.


What You Can Track

Igris Overture provides comprehensive observability:

  • Request Metrics: Success rate, latency, throughput
  • Cost Tracking: Per-request, per-provider, per-tenant costs
  • Provider Performance: Which providers are fastest, cheapest, most reliable
  • Routing Decisions: See why each request went to a specific provider
  • Distributed Traces: Follow requests across the entire system
  • Error Tracking: Detailed error rates and types

Metrics Endpoint

All metrics are exposed in Prometheus format at /metrics (the endpoint requires an admin or metrics-specific token; see the API Reference below):

curl -H "Authorization: Bearer $ADMIN_TOKEN" https://api.igrisinertial.com/metrics

This endpoint is compatible with:

  • Prometheus
  • Datadog
  • Grafana Cloud
  • New Relic
  • Any Prometheus-compatible monitoring system

Key Metrics Available

Request Metrics

Track overall request health:

  • Total requests: Count of all requests
  • Success rate: Percentage of successful requests
  • Request duration: Latency histograms (p50, p95, p99)
  • Requests per second: Current throughput

Inference Metrics

LLM-specific metrics:

  • Requests by provider: OpenAI, Anthropic, Google, etc.
  • Requests by model: GPT-4, Claude 3, Gemini, etc.
  • Token usage: Prompt tokens, completion tokens
  • Cost per request: Real-time cost in USD

Routing Metrics

Understand routing decisions:

  • Thompson Sampling scores: Which provider is winning
  • Semantic routing classifications: Creative, analytical, coding, etc.
  • Speculative execution: Race winners and latency improvements
  • Circuit breaker status: Which providers are healthy/unhealthy
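
The metric names below are the ones referenced in the Example Queries section; this is a hedged illustration of what the /metrics exposition output can look like, and the exact labels and series emitted may differ:

# HELP http_requests_total Total HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{provider="anthropic",model="claude-3-sonnet",status="200"} 18423
# HELP http_request_duration_seconds Request latency histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.25"} 15210
http_request_duration_seconds_bucket{le="0.5"} 17980
http_request_duration_seconds_bucket{le="+Inf"} 18423
# HELP cost_usd_total Accumulated spend in USD
# TYPE cost_usd_total counter
cost_usd_total{provider="anthropic",model="claude-3-sonnet"} 222.22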

Distributed Tracing

Every request gets a unique trace ID for end-to-end visibility:

{
  "id": "chatcmpl-abc123",
  "metadata": {
    "trace_id": "550e8400-e29b-41d4-a716-446655440000",
    "provider": "anthropic",
    "latency_ms": 187
  }
}
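
A minimal Python sketch of capturing that trace ID on the client side so failed requests can be looked up later; the endpoint path, auth scheme, and payload shape are assumptions based on the response example above:

import logging
import os

import requests

log = logging.getLogger("igris-client")

def chat(prompt: str) -> dict:
    # Call the gateway (OpenAI-compatible chat completions endpoint assumed).
    resp = requests.post(
        "https://api.igrisinertial.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['IGRIS_API_KEY']}"},
        json={"model": "gpt-4", "messages": [{"role": "user", "content": prompt}]},
        timeout=30,
    )
    body = resp.json()
    meta = body.get("metadata", {})
    # Log the trace ID next to the outcome so it can be pasted into Jaeger, Datadog, etc.
    log.info(
        "igris request trace_id=%s provider=%s latency_ms=%s http_status=%s",
        meta.get("trace_id"), meta.get("provider"), meta.get("latency_ms"), resp.status_code,
    )
    return body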

Trace Spans

Each request creates spans for:

  1. HTTP Request (parent span)
  2. Authentication (validating API key)
  3. Rate Limiting (checking tenant limits)
  4. Routing Decision (Thompson Sampling or semantic routing)
  5. Provider Request (actual LLM API call)
  6. Cost Tracking (recording usage and cost)

Viewing Traces

Use trace IDs to search in your tracing system:

  • Jaeger
  • Zipkin
  • Honeycomb
  • Datadog APM
  • New Relic

Example trace timeline:

HTTP Request (total: 234ms)
├─ Auth (2ms)
├─ Rate Limit (1ms)
├─ Routing Decision (5ms)
│  └─ Thompson Sampling (4ms)
├─ Provider Request (187ms)
│  └─ Anthropic API (185ms)
└─ Cost Tracking (3ms)

Cost Tracking

Per-Request Cost

Every response includes cost breakdown:

{
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 42,
    "total_tokens": 57
  },
  "metadata": {
    "cost_usd": 0.00171,
    "provider": "openai",
    "model": "gpt-4",
    "cost_breakdown": {
      "prompt_cost": 0.00045,
      "completion_cost": 0.00126
    }
  }
}
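
A short Python sketch of client-side spend tracking that simply sums the reported cost_usd across responses; it assumes responses shaped like the example above:

class CostMeter:
    """Accumulates the per-request cost reported by the gateway."""

    def __init__(self) -> None:
        self.total_usd = 0.0
        self.requests = 0

    def record(self, response: dict) -> None:
        self.total_usd += response.get("metadata", {}).get("cost_usd", 0.0)
        self.requests += 1

# Usage: call record() after each gateway response, e.g.
#   meter = CostMeter()
#   meter.record(chat_response)
#   print(f"session spend: ${meter.total_usd:.4f} over {meter.requests} requests")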

Aggregate Cost Metrics

Track spending over time:

# Get cost metrics by provider
curl https://api.igrisinertial.com/v1/metrics/cost?group_by=provider

# Get cost metrics by tenant
curl https://api.igrisinertial.com/v1/metrics/cost?group_by=tenant

# Get cost metrics by model
curl https://api.igrisinertial.com/v1/metrics/cost?group_by=model

Response:

{
  "period": "today",
  "total_cost_usd": 456.78,
  "by_provider": {
    "openai": 234.56,
    "anthropic": 222.22
  },
  "by_model": {
    "gpt-4": 178.90,
    "gpt-3.5-turbo": 55.66,
    "claude-3-sonnet": 222.22
  }
}

Dashboards

Cloud-Hosted Dashboards

If you're using the cloud-hosted Igris Overture, dashboards are built in:

  1. Overview Dashboard

    • Requests per second
    • Success rate
    • Average latency
    • Total cost today
  2. Provider Performance

    • Latency by provider
    • Success rate by provider
    • Cost efficiency comparison
  3. Cost Analytics

    • Spend over time
    • Top spending tenants
    • Cost per provider/model
    • Budget alerts
  4. Routing Insights

    • Thompson Sampling scores
    • Provider selection distribution
    • Speculative execution wins

View Dashboard →

Self-Hosted Monitoring

For self-hosted deployments, export metrics to your existing stack:

Prometheus + Grafana:

# prometheus.yml
scrape_configs:
  - job_name: 'igris-overture'
    static_configs:
      - targets: ['api.igris.internal:8081']
    metrics_path: '/metrics'
    scrape_interval: 15s

Datadog:

# conf.d/prometheus.d/conf.yaml (Datadog Agent Prometheus check)
instances:
  - prometheus_url: http://api.igris.internal:8081/metrics
    namespace: igris
    metrics:
      - '*'

Grafana Cloud:

Use the Prometheus remote write endpoint to send metrics directly to Grafana Cloud.
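
A hedged prometheus.yml sketch of that remote write setup; the push URL and credentials are placeholders you'd copy from your Grafana Cloud stack settings:

# prometheus.yml (additions)
remote_write:
  - url: https://<your-stack>.grafana.net/api/prom/push
    basic_auth:
      username: "<grafana-cloud-metrics-instance-id>"
      password: "<grafana-cloud-api-token>"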


Alerts

Set up alerts for critical events:

Budget Alerts

Get notified when spending approaches limits:

POST /v1/tenants/{tenant_id}/budget
{
  "monthly_budget_usd": 5000.00,
  "alert_threshold": 0.90,
  "notification_channels": ["email", "webhook"]
}
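
The same call as a curl command; the Authorization header and token variable are assumptions, and tenant_abc123 is a placeholder:

curl -X POST https://api.igrisinertial.com/v1/tenants/tenant_abc123/budget \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"monthly_budget_usd": 5000.00, "alert_threshold": 0.90, "notification_channels": ["email", "webhook"]}'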

Performance Alerts

Monitor latency and error rates:

  • High Latency: Alert when p95 latency > 2000ms
  • High Error Rate: Alert when error rate > 5%
  • Provider Failure: Alert when circuit breaker opens
  • Rate Limit: Alert when approaching rate limits
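
These thresholds map directly onto Prometheus alerting rules. A hedged sketch using the metric names from the Example Queries section (durations are in seconds, so 2000ms becomes 2):

# alerts.yml
groups:
  - name: igris-overture
    rules:
      - alert: HighP95Latency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "p95 latency above 2s"
      - alert: HighErrorRate
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "error rate above 5%"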

Custom Webhooks

Send alerts to your own systems:

{
  "event": "high_latency_alert",
  "tenant_id": "tenant_abc123",
  "metric": "p95_latency_ms",
  "current_value": 2345,
  "threshold": 2000,
  "provider": "openai",
  "timestamp": "2025-11-30T12:00:00Z"
}
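
A minimal Python (Flask) sketch of a receiver for that payload; the route path and the handling logic are assumptions, and Flask is just one convenient choice:

from flask import Flask, request

app = Flask(__name__)

@app.route("/igris/alerts", methods=["POST"])
def handle_alert():
    event = request.get_json(force=True)
    if event.get("event") == "high_latency_alert":
        # Forward to your paging/chat system here; printing keeps the sketch small.
        print(f"[ALERT] {event['provider']} p95={event['current_value']}ms "
              f"(threshold {event['threshold']}ms) tenant={event['tenant_id']}")
    # Acknowledge quickly; do any heavy processing asynchronously.
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)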

Log Integration

Structured Logging

All logs are JSON-formatted for easy parsing:

{
  "timestamp": "2025-11-30T12:00:00Z",
  "level": "info",
  "message": "inference_request_completed",
  "trace_id": "550e8400-e29b-41d4-a716-446655440000",
  "tenant_id": "tenant_abc123",
  "provider": "anthropic",
  "model": "claude-3-sonnet",
  "latency_ms": 187,
  "cost_usd": 0.00034,
  "status": "success"
}
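
Because each log line is a standalone JSON object, standard tools can slice the stream. For example, a jq one-liner (the log file name is illustrative) that pulls out failed requests:

jq -c 'select(.status != "success")' igris.log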

Log Aggregation

Compatible with standard log aggregation tools:

  • ELK Stack (Elasticsearch, Logstash, Kibana)
  • Loki + Grafana
  • Splunk
  • Datadog Logs
  • CloudWatch Logs

Example Queries

Prometheus Queries

Requests per second:

rate(http_requests_total[5m])

Success rate:

sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))

P95 latency:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Cost per hour:

sum(increase(cost_usd_total[1h]))

Best Practices

Monitoring

  1. Set up dashboards early - Don't wait until you have issues
  2. Monitor all three: latency, error rate, throughput
  3. Track costs daily - Catch unexpected spending quickly
  4. Use trace IDs to debug specific failed requests

Alerting

  1. Start with budget alerts - Most critical for cost control
  2. Alert on trends, not just thresholds (e.g., latency increasing)
  3. Use different channels for different severity (email vs. PagerDuty)
  4. Test your alerts before going to production

Optimization

  1. Review provider performance weekly - Thompson Sampling adapts automatically, but a manual review catches longer-term shifts
  2. Check for cost anomalies - Unusual spikes might indicate issues
  3. Monitor circuit breaker state - Frequent opens = provider reliability issues
  4. Track speculative execution waste - Should be <30%

API Reference

GET /metrics

Prometheus-formatted metrics endpoint.

Response format: Prometheus exposition format

Access: Requires admin token or metrics-specific token

GET /v1/metrics/cost

Aggregate cost metrics.

Query parameters:

  • period: today, week, month
  • group_by: provider, model, tenant
  • tenant_id: Filter by tenant (admin only)
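
For example, today's spend grouped by provider:

curl "https://api.igrisinertial.com/v1/metrics/cost?period=today&group_by=provider"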

Response:

{
  "period": "today",
  "total_cost_usd": 456.78,
  "breakdown": {...}
}

GET /v1/metrics/latency

Latency percentiles.

Query parameters:

  • period: 5m, 1h, 1d
  • provider: Filter by provider
  • model: Filter by model

Response:

{
  "period": "1h",
  "p50_ms": 234,
  "p95_ms": 456,
  "p99_ms": 678
}
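
For example, to pull the last hour's percentiles for a single provider:

curl "https://api.igrisinertial.com/v1/metrics/latency?period=1h&provider=anthropic"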

Next Steps