Observability
Monitor local inference performance, fallback activations, and system health with built-in Prometheus metrics and structured logging.
What You Can Track
Igris Runtime provides comprehensive observability for offline AI:
- Local Inference Performance: Token generation speed, latency, throughput
- Fallback Activations: When and why cloud-to-local fallback triggered
- Model Loading: Startup times, memory usage
- Agent Execution: Reflection iterations, planning steps, swarm consensus
- Tool Use: Tool execution times, success rates
- MCP Swarm: Peer discovery, context sync status
- QLoRA Training: Training progress, adapter generation
Metrics Endpoint
All metrics are exposed in Prometheus format at /metrics:
curl http://localhost:8080/metrics
This endpoint is compatible with:
- Prometheus
- Grafana
- Datadog
- Victoria Metrics
- Any Prometheus-compatible monitoring system
Key Metrics Available
Inference Metrics
Track local model performance:
- igris_inference_requests_total: Total inference requests
- igris_inference_duration_seconds: Inference latency histogram (p50, p95, p99)
- igris_tokens_generated_total: Total tokens generated
- igris_tokens_per_second: Current token generation speed
- igris_model_load_duration_seconds: Model loading time at startup
Example metrics output:
# HELP igris_inference_requests_total Total number of inference requests
# TYPE igris_inference_requests_total counter
igris_inference_requests_total{model="phi-3-mini",mode="standard"} 1234
# HELP igris_inference_duration_seconds Inference request duration
# TYPE igris_inference_duration_seconds histogram
igris_inference_duration_seconds_bucket{le="0.1"} 45
igris_inference_duration_seconds_bucket{le="0.5"} 234
igris_inference_duration_seconds_bucket{le="1.0"} 456
# HELP igris_tokens_per_second Current token generation speed
# TYPE igris_tokens_per_second gauge
igris_tokens_per_second 24.5
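Because igris_inference_duration_seconds is a Prometheus histogram, the standard _sum and _count series can be combined to chart average latency. A minimal PromQL sketch, assuming those series are exported alongside the buckets as Prometheus client libraries normally do:
rate(igris_inference_duration_seconds_sum[5m])
  /
rate(igris_inference_duration_seconds_count[5m])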
Fallback Metrics
Track cloud-to-local fallback behavior:
- igris_fallback_activations_total: Number of times fallback triggered
- igris_cloud_request_failures_total: Cloud provider failures by type
- igris_fallback_duration_seconds: Time to switch from cloud to local
Why fallback activated:
- timeout - Cloud provider took too long
- unreachable - Network connectivity issue
- error - Cloud API returned an error
- offline - Intentional offline-only mode
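To see which reasons dominate, the activation counter can be broken down in PromQL. This sketch assumes the runtime attaches the reason as a label on igris_fallback_activations_total; if it does not, the same breakdown can be derived from the cloud_fallback_triggered log events shown later:
sum by (reason) (rate(igris_fallback_activations_total[5m]))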
Agent Metrics
Track advanced agent execution:
- igris_reflection_iterations_total: Total reflection loops executed
- igris_reflection_quality_score: Quality scores before/after reflection
- igris_planning_steps_total: Planning agent steps executed
- igris_swarm_agents_active: Number of active agents in swarm
- igris_swarm_consensus_votes: Consensus voting results
Tool Use Metrics
Monitor tool execution:
- igris_tool_executions_total: Tool calls by type (http, shell, filesystem)
- igris_tool_duration_seconds: Tool execution time
- igris_tool_failures_total: Failed tool executions by error type
MCP Swarm Metrics
Track peer-to-peer context sharing:
- igris_mcp_peers_discovered: Number of discovered peers
- igris_mcp_contexts_synced_total: Contexts synchronized
- igris_mcp_sync_duration_seconds: Context sync latency
QLoRA Training Metrics
Monitor on-device training:
- igris_lora_training_started_total: Training sessions started
- igris_lora_training_duration_seconds: Training time per adapter
- igris_lora_adapter_size_bytes: Generated adapter size
Health Check Endpoint
The runtime provides a health check endpoint at /v1/health:
curl http://localhost:8080/v1/health
Response (healthy):
{
"status": "healthy",
"model_loaded": true,
"uptime_seconds": 3600,
"version": "1.6.0"
}
Response (unhealthy):
{
"status": "unhealthy",
"model_loaded": false,
"error": "Model file not found"
}
Use this endpoint for:
- Kubernetes liveness/readiness probes
- Docker health checks
- Load balancer health monitoring
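A minimal Kubernetes probe sketch against this endpoint, assuming the container listens on port 8080 (as in the curl examples) and that /v1/health returns a non-2xx status when unhealthy:
livenessProbe:
  httpGet:
    path: /v1/health
    port: 8080
  initialDelaySeconds: 30   # give the model time to load before the first check
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /v1/health
    port: 8080
  periodSeconds: 5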
Structured Logging
The runtime emits structured JSON logs; set the verbosity with RUST_LOG:
export RUST_LOG=info
./igris-runtime
Log levels:
- error - Only errors
- warn - Warnings and errors
- info - General info (recommended)
- debug - Detailed debugging
- trace - Very verbose (development only)
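RUST_LOG also accepts per-target filters (the usual Rust log filter syntax), which helps when only one component needs verbose output. The igris_server target below is taken from the JSON log example that follows; other target names may differ:
# Keep global logging at info, but enable debug output for the server target
export RUST_LOG=info,igris_server=debug
./igris-runtime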
JSON Log Format
{
"timestamp": "2025-12-18T10:30:45Z",
"level": "info",
"target": "igris_server",
"message": "inference_completed",
"model": "phi-3-mini",
"mode": "reflection",
"tokens_generated": 128,
"duration_ms": 432,
"fallback_activated": false
}
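If the runtime writes these JSON lines to stdout, they can be filtered during development with jq; adjust the source if your logs go to a file instead:
# Show only warnings and errors, keeping a few fields
./igris-runtime 2>&1 | jq -c 'select(.level == "warn" or .level == "error") | {timestamp, message}'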
Log Examples
Successful local inference:
{
"level": "info",
"message": "local_inference_completed",
"model": "phi-3-mini",
"tokens": 64,
"duration_ms": 234,
"tokens_per_sec": 27.3
}
Fallback activation:
{
"level": "warn",
"message": "cloud_fallback_triggered",
"provider": "openai",
"reason": "timeout",
"timeout_ms": 5000,
"switched_to": "local"
}
Reflection agent:
{
"level": "info",
"message": "reflection_iteration_completed",
"iteration": 2,
"quality_score": 0.85,
"threshold": 0.7,
"accepted": true
}
MCP peer discovered:
{
"level": "info",
"message": "mcp_peer_discovered",
"peer_id": "runtime-2",
"peer_address": "192.168.1.45:8080"
}
Prometheus Integration
Scrape Configuration
Add to prometheus.yml:
scrape_configs:
  - job_name: 'igris-runtime'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'localhost:8080'
Grafana Dashboards
Key panels to create:
- Inference Performance
  - Token generation speed (gauge)
  - Request latency histogram
  - Requests per second
- Fallback Monitoring
  - Fallback activations over time
  - Cloud vs local request ratio
  - Fallback reasons breakdown
- Resource Usage
  - CPU usage
  - Memory usage
  - Model memory footprint
- Agent Activity
  - Reflection iterations
  - Planning steps
  - Swarm agent count
Example PromQL Queries
Tokens per second:
igris_tokens_per_second
P95 inference latency:
histogram_quantile(0.95, rate(igris_inference_duration_seconds_bucket[5m]))
Fallback rate:
rate(igris_fallback_activations_total[5m])
Reflection quality improvement:
avg(igris_reflection_quality_score{stage="after"}) - avg(igris_reflection_quality_score{stage="before"})
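Tool failure rate by type (a sketch that assumes igris_tool_failures_total and igris_tool_executions_total carry a matching type label, as the Tool Use Metrics section suggests):
sum by (type) (rate(igris_tool_failures_total[5m]))
  /
sum by (type) (rate(igris_tool_executions_total[5m]))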
Monitoring MCP Swarm
Multi-Instance Dashboards
When running MCP swarm mode, monitor all instances:
scrape_configs:
  - job_name: 'igris-runtime-swarm'
    static_configs:
      - targets:
          - 'runtime-1:8080'
          - 'runtime-2:8080'
          - 'runtime-3:8080'
Track:
- Total peers discovered across cluster
- Context sync latency between instances
- Load distribution (requests per instance)
MCP-Specific Metrics
# Total peers in swarm
sum(igris_mcp_peers_discovered)
# Context sync rate (syncs per second)
rate(igris_mcp_contexts_synced_total[5m])
# Average sync latency
avg(igris_mcp_sync_duration_seconds)
Log Aggregation
ELK Stack (Elasticsearch + Kibana)
Filebeat configuration:
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/igris-runtime/*.log
    json.keys_under_root: true
    json.add_error_key: true

output.elasticsearch:
  hosts: ["localhost:9200"]
Loki + Grafana
Promtail configuration:
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: igris-runtime
    static_configs:
      - targets:
          - localhost
        labels:
          job: igris-runtime
          __path__: /var/log/igris-runtime/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
Alerting
Recommended Alerts
High inference latency:
- alert: HighInferenceLatency
  expr: histogram_quantile(0.95, rate(igris_inference_duration_seconds_bucket[5m])) > 2.0
  for: 5m
  annotations:
    summary: "Inference latency above 2 seconds"
Frequent fallback activations:
- alert: FrequentFallbacks
  expr: rate(igris_fallback_activations_total[5m]) > 0.5
  for: 10m
  annotations:
    summary: "Fallback activating more than 30 times per minute"
Runtime instance down:
- alert: RuntimeDown
  expr: up{job="igris-runtime"} == 0
  for: 1m
  annotations:
    summary: "Runtime instance is down"
Low token throughput:
- alert: LowTokenThroughput
  expr: igris_tokens_per_second < 5
  for: 5m
  annotations:
    summary: "Token generation below 5 tokens/sec"
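In Prometheus these rules live in a rules file, wrapped in a group and referenced from prometheus.yml; the file and group names below are examples only:
# igris-alerts.yml
groups:
  - name: igris-runtime
    rules:
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.95, rate(igris_inference_duration_seconds_bucket[5m])) > 2.0
        for: 5m
        annotations:
          summary: "Inference latency above 2 seconds"

# prometheus.yml
rule_files:
  - 'igris-alerts.yml'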
Performance Monitoring
Baseline Metrics
Track these metrics to understand normal behavior:
Phi-3 Mini Q4 (4 CPU cores):
- Tokens/sec: 15-30
- First token latency: 50-100ms
- P95 request latency: <1000ms
Mistral 7B Q4 (8 CPU cores):
- Tokens/sec: 10-20
- First token latency: 100-200ms
- P95 request latency: <2000ms
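One way to track drift against these baselines is a recording rule that smooths tokens/sec over a longer window; the rule name below is a suggested convention, not something the runtime defines:
groups:
  - name: igris-baselines
    rules:
      - record: igris:tokens_per_second:avg_1h
        expr: avg_over_time(igris_tokens_per_second[1h])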
Optimization Targets
If metrics are outside expected ranges:
Low tokens/sec:
- Increase threads in config
- Enable GPU layers (n_gpu_layers)
- Use smaller quantization (Q4 instead of Q8)
High latency:
- Enable prompt caching (prompt_cache_dir)
- Reduce context_size
- Check CPU usage (should be near 100%)
High fallback rate:
- Increase cloud timeout (first_token_timeout_ms)
- Check network connectivity
- Verify cloud API keys
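As a rough sketch of how these settings might sit together in the runtime configuration (the key names come from the tuning advice above; the exact file layout is documented on the Configuration page, and the values here are illustrative only):
threads: 8                           # match physical CPU cores for low tokens/sec
n_gpu_layers: 32                     # offload layers to the GPU if one is available
context_size: 4096                   # smaller contexts reduce latency and memory use
prompt_cache_dir: /var/cache/igris   # enables prompt caching
first_token_timeout_ms: 10000        # longer cloud timeout reduces spurious fallbacks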
Development vs Production
Development Logging
Verbose logging for debugging:
export RUST_LOG=debug
./igris-runtime
Logs include:
- Request/response bodies
- Detailed tool execution
- Model inference steps
Production Logging
Minimal logging for performance:
export RUST_LOG=info
./igris-runtime
Logs include:
- Request completion
- Errors and warnings
- Performance metrics
Troubleshooting with Metrics
High Memory Usage
Check:
process_resident_memory_bytes{job="igris-runtime"}
Common causes:
- Large context_size
- Multiple concurrent requests
- Memory leak (restart needed)
Slow Inference
Check:
igris_inference_duration_seconds
igris_tokens_per_second
Common causes:
- Too few threads
- Model too large for hardware
- High CPU usage from other processes
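To tell whether the runtime itself is CPU-bound or being starved by other processes, compare its own CPU rate against the available cores. This uses the standard process_cpu_seconds_total series from the same process metrics family as process_resident_memory_bytes above; verify it appears on /metrics:
# Cores consumed by the runtime process (a value near the core count means it is CPU-bound)
rate(process_cpu_seconds_total{job="igris-runtime"}[5m])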
Fallback Not Working
Check logs for:
{
"level": "error",
"message": "local_model_not_loaded",
"error": "Model file not found"
}
Verify:
- Model file exists at model_path
- Sufficient RAM available
- File permissions correct
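A quick shell check for these, using a placeholder path where your configured model_path points:
ls -lh /path/to/model-file   # replace with your configured model_path; shows existence and permissions
free -h                      # confirm enough RAM is free to load the model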
Next Steps
- Configuration - Tune performance settings
- Deployment - Production monitoring setup
- Architecture - Understand system internals