Observability
Monitor local inference performance, fallback activations, and system health with built-in Prometheus metrics and structured logging.
What You Can Track
Igris Runtime provides comprehensive observability for offline AI:
- Local Inference Performance: Token generation speed, latency, throughput
- Fallback Activations: When and why cloud-to-local fallback triggered
- Model Loading: Startup times, memory usage
- Agent Execution: Reflection iterations, planning steps, swarm consensus
- Tool Use: Tool execution times, success rates
- MCP Swarm: Peer discovery, context sync status
- QLoRA Training: Training progress, adapter generation
Metrics Endpoint
All metrics are exposed in Prometheus format at /metrics:
curl http://localhost:8080/metrics
This endpoint is compatible with:
- Prometheus
- Grafana
- Datadog
- Victoria Metrics
- Any Prometheus-compatible monitoring system
Key Metrics Available
Inference Metrics
Track local model performance:
- igris_inference_requests_total: Total inference requests
- igris_inference_duration_seconds: Inference latency histogram (p50, p95, p99)
- igris_tokens_generated_total: Total tokens generated
- igris_tokens_per_second: Current token generation speed
- igris_model_load_duration_seconds: Model loading time at startup
Example metrics output:
# HELP igris_inference_requests_total Total number of inference requests
# TYPE igris_inference_requests_total counter
igris_inference_requests_total{model="phi-3-mini",mode="standard"} 1234
# HELP igris_inference_duration_seconds Inference request duration
# TYPE igris_inference_duration_seconds histogram
igris_inference_duration_seconds_bucket{le="0.1"} 45
igris_inference_duration_seconds_bucket{le="0.5"} 234
igris_inference_duration_seconds_bucket{le="1.0"} 456
# HELP igris_tokens_per_second Current token generation speed
# TYPE igris_tokens_per_second gauge
igris_tokens_per_second 24.5
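Because igris_inference_duration_seconds is a Prometheus histogram, the standard _sum and _count series can be combined to chart average latency. A minimal PromQL sketch, assuming those series are exported alongside the buckets as Prometheus client libraries normally do:
rate(igris_inference_duration_seconds_sum[5m])
  /
rate(igris_inference_duration_seconds_count[5m])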
Fallback Metrics
Track cloud-to-local fallback behavior:
- igris_fallback_activations_total: Number of times fallback triggered
- igris_cloud_request_failures_total: Cloud provider failures by type
- igris_fallback_duration_seconds: Time to switch from cloud to local
Why fallback activated:
- timeout - Cloud provider took too long
- unreachable - Network connectivity issue
- error - Cloud API returned an error
- offline - Intentional offline-only mode
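To see which reasons dominate, the activation counter can be broken down in PromQL. This sketch assumes the runtime attaches the reason as a label on igris_fallback_activations_total; if it does not, the same breakdown can be derived from the cloud_fallback_triggered log events shown later:
sum by (reason) (rate(igris_fallback_activations_total[5m]))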
Agent Metrics
Track advanced agent execution:
- igris_reflection_iterations_total: Total reflection loops executed
- igris_reflection_quality_score: Quality scores before/after reflection
- igris_planning_steps_total: Planning agent steps executed
- igris_swarm_agents_active: Number of active agents in swarm
- igris_swarm_consensus_votes: Consensus voting results
Tool Use Metrics
Monitor tool execution:
- igris_tool_executions_total: Tool calls by type (http, shell, filesystem)
- igris_tool_duration_seconds: Tool execution time
- igris_tool_failures_total: Failed tool executions by error type
MCP Swarm Metrics
Track peer-to-peer context sharing:
- igris_mcp_peers_discovered: Number of discovered peers
- igris_mcp_contexts_synced_total: Contexts synchronized
- igris_mcp_sync_duration_seconds: Context sync latency
QLoRA Training Metrics
Monitor on-device training:
- igris_lora_training_started_total: Training sessions started
- igris_lora_training_duration_seconds: Training time per adapter
- igris_lora_adapter_size_bytes: Generated adapter size
Health Check Endpoint
The runtime provides a health check endpoint at /v1/health:
curl http://localhost:8080/v1/health
Response (healthy):
{
"status": "healthy",
"model_loaded": true,
"uptime_seconds": 3600,
"version": "1.6.0"
}
Response (unhealthy):
{
"status": "unhealthy",
"model_loaded": false,
"error": "Model file not found"
}
Use this endpoint for:
- Kubernetes liveness/readiness probes
- Docker health checks
- Load balancer health monitoring
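A minimal Kubernetes probe sketch against this endpoint, assuming the container listens on port 8080 (as in the curl examples) and that /v1/health returns a non-2xx status when unhealthy:
livenessProbe:
  httpGet:
    path: /v1/health
    port: 8080
  initialDelaySeconds: 30   # give the model time to load before the first check
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /v1/health
    port: 8080
  periodSeconds: 5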
Structured Logging
The runtime emits structured JSON logs; set the verbosity with RUST_LOG:
export RUST_LOG=info
./igris-runtime
Log levels:
- error - Only errors
- warn - Warnings and errors
- info - General info (recommended)
- debug - Detailed debugging
- trace - Very verbose (development only)
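RUST_LOG also accepts per-target filters (the usual Rust log filter syntax), which helps when only one component needs verbose output. The igris_server target below is taken from the JSON log example that follows; other target names may differ:
# Keep global logging at info, but enable debug output for the server target
export RUST_LOG=info,igris_server=debug
./igris-runtime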
JSON Log Format
{
"timestamp": "2025-12-18T10:30:45Z",
"level": "info",
"target": "igris_server",
"message": "inference_completed",
"model": "phi-3-mini",
"mode": "reflection",
"tokens_generated": 128,
"duration_ms": 432,
"fallback_activated": false
}
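If the runtime writes these JSON lines to stdout, they can be filtered during development with jq; adjust the source if your logs go to a file instead:
# Show only warnings and errors, keeping a few fields
./igris-runtime 2>&1 | jq -c 'select(.level == "warn" or .level == "error") | {timestamp, message}'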
Log Examples
Successful local inference:
{
"level": "info",
"message": "local_inference_completed",
"model": "phi-3-mini",
"tokens": 64,
"duration_ms": 234,
"tokens_per_sec": 27.3
}
Fallback activation:
{
"level": "warn",
"message": "cloud_fallback_triggered",
"provider": "openai",
"reason": "timeout",
"timeout_ms": 5000,
"switched_to": "local"
}
Reflection agent:
{
"level": "info",
"message": "reflection_iteration_completed",
"iteration": 2,
"quality_score": 0.85,
"threshold": 0.7,
"accepted": true
}
MCP peer discovered:
{
"level": "info",
"message": "mcp_peer_discovered",
"peer_id": "runtime-2",
"peer_address": "192.168.1.45:8080"
}
Prometheus Integration
Scrape Configuration
Add to prometheus.yml:
scrape_configs:
  - job_name: 'igris-runtime'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'localhost:8080'
Grafana Dashboards
Key panels to create:
- Inference Performance
  - Token generation speed (gauge)
  - Request latency histogram
  - Requests per second
- Fallback Monitoring
  - Fallback activations over time
  - Cloud vs local request ratio
  - Fallback reasons breakdown
- Resource Usage
  - CPU usage
  - Memory usage
  - Model memory footprint
- Agent Activity
  - Reflection iterations
  - Planning steps
  - Swarm agent count
Example PromQL Queries
Tokens per second:
igris_tokens_per_second
P95 inference latency:
histogram_quantile(0.95, rate(igris_inference_duration_seconds_bucket[5m]))
Fallback rate:
rate(igris_fallback_activations_total[5m])
Reflection quality improvement:
avg(igris_reflection_quality_score{stage="after"}) - avg(igris_reflection_quality_score{stage="before"})
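Tool failure rate by type (a sketch that assumes igris_tool_failures_total and igris_tool_executions_total carry a matching type label, as the Tool Use Metrics section suggests):
sum by (type) (rate(igris_tool_failures_total[5m]))
  /
sum by (type) (rate(igris_tool_executions_total[5m]))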
Monitoring MCP Swarm
Multi-Instance Dashboards
When running MCP swarm mode, monitor all instances:
scrape_configs:
  - job_name: 'igris-runtime-swarm'
    static_configs:
      - targets:
          - 'runtime-1:8080'
          - 'runtime-2:8080'
          - 'runtime-3:8080'
Track:
- Total peers discovered across cluster
- Context sync latency between instances
- Load distribution (requests per instance)
MCP-Specific Metrics
# Total peers in swarm
sum(igris_mcp_peers_discovered)
# Context sync rate (syncs per second)
rate(igris_mcp_contexts_synced_total[5m])
# Average sync latency
avg(igris_mcp_sync_duration_seconds)
Log Aggregation
ELK Stack (Elasticsearch + Kibana)
Filebeat configuration:
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/igris-runtime/*.log
    json.keys_under_root: true
    json.add_error_key: true

output.elasticsearch:
  hosts: ["localhost:9200"]
Loki + Grafana
Promtail configuration:
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: igris-runtime
    static_configs:
      - targets:
          - localhost
        labels:
          job: igris-runtime
          __path__: /var/log/igris-runtime/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
Alerting
Recommended Alerts
High inference latency:
- alert: HighInferenceLatency
  expr: histogram_quantile(0.95, rate(igris_inference_duration_seconds_bucket[5m])) > 2.0
  for: 5m
  annotations:
    summary: "Inference latency above 2 seconds"
Frequent fallback activations:
- alert: FrequentFallbacks
  expr: rate(igris_fallback_activations_total[5m]) > 0.5
  for: 10m
  annotations:
    summary: "Fallback activating more than 30 times per minute"
Runtime instance down:
- alert: RuntimeDown
  expr: up{job="igris-runtime"} == 0
  for: 1m
  annotations:
    summary: "Runtime instance is down"
Low token throughput:
- alert: LowTokenThroughput
  expr: igris_tokens_per_second < 5
  for: 5m
  annotations:
    summary: "Token generation below 5 tokens/sec"
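In Prometheus these rules live in a rules file, wrapped in a group and referenced from prometheus.yml; the file and group names below are examples only:
# igris-alerts.yml
groups:
  - name: igris-runtime
    rules:
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.95, rate(igris_inference_duration_seconds_bucket[5m])) > 2.0
        for: 5m
        annotations:
          summary: "Inference latency above 2 seconds"

# prometheus.yml
rule_files:
  - 'igris-alerts.yml'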
Performance Monitoring
Baseline Metrics
Track these metrics to understand normal behavior:
Phi-3 Mini Q4 (4 CPU cores):
- Tokens/sec: 15-30
- First token latency: 50-100ms
- P95 request latency: <1000ms
Mistral 7B Q4 (8 CPU cores):
- Tokens/sec: 10-20
- First token latency: 100-200ms
- P95 request latency: <2000ms
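One way to track drift against these baselines is a recording rule that smooths tokens/sec over a longer window; the rule name below is a suggested convention, not something the runtime defines:
groups:
  - name: igris-baselines
    rules:
      - record: igris:tokens_per_second:avg_1h
        expr: avg_over_time(igris_tokens_per_second[1h])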
Optimization Targets
If metrics are outside expected ranges:
Low tokens/sec:
- Increase threads in config
- Enable GPU layers (n_gpu_layers)
- Use smaller quantization (Q4 instead of Q8)
High latency:
- Enable prompt caching (prompt_cache_dir)
- Reduce context_size
- Check CPU usage (should be near 100%)
High fallback rate:
- Increase cloud timeout (first_token_timeout_ms)
- Check network connectivity
- Verify cloud API keys
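As a rough sketch of how these settings might sit together in the runtime configuration (the key names come from the tuning advice above; the exact file layout is documented on the Configuration page, and the values here are illustrative only):
threads: 8                           # match physical CPU cores for low tokens/sec
n_gpu_layers: 32                     # offload layers to the GPU if one is available
context_size: 4096                   # smaller contexts reduce latency and memory use
prompt_cache_dir: /var/cache/igris   # enables prompt caching
first_token_timeout_ms: 10000        # longer cloud timeout reduces spurious fallbacks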
Development vs Production
Development Logging
Verbose logging for debugging:
export RUST_LOG=debug
./igris-runtime
Logs include:
- Request/response bodies
- Detailed tool execution
- Model inference steps
Production Logging
Minimal logging for performance:
export RUST_LOG=info
./igris-runtime
Logs include:
- Request completion
- Errors and warnings
- Performance metrics
Troubleshooting with Metrics
High Memory Usage
Check:
process_resident_memory_bytes{job="igris-runtime"}
Common causes:
- Large context_size
- Multiple concurrent requests
- Memory leak (restart needed)
Slow Inference
Check:
igris_inference_duration_seconds
igris_tokens_per_second
Common causes:
- Too few threads
- Model too large for hardware
- High CPU usage from other processes
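To tell whether the runtime itself is CPU-bound or being starved by other processes, compare its own CPU rate against the available cores. This uses the standard process_cpu_seconds_total series from the same process metrics family as process_resident_memory_bytes above; verify it appears on /metrics:
# Cores consumed by the runtime process (a value near the core count means it is CPU-bound)
rate(process_cpu_seconds_total{job="igris-runtime"}[5m])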
Fallback Not Working
Check logs for:
{
"level": "error",
"message": "local_model_not_loaded",
"error": "Model file not found"
}
Verify:
- Model file exists at model_path
- Sufficient RAM available
- File permissions correct
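A quick shell check for these, using a placeholder path where your configured model_path points:
ls -lh /path/to/model-file   # replace with your configured model_path; shows existence and permissions
free -h                      # confirm enough RAM is free to load the model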
Next Steps
- Configuration - Tune performance settings
- Deployment - Production monitoring setup
- Architecture - Understand system internals