Local LLM Fallback
Automatic failover to on-device models when cloud providers are unreachable.
Overview
The local LLM fallback is Runtime's killer feature. When all cloud providers fail or are unreachable, Runtime automatically switches to an on-device model (Phi-3, Mistral, Llama, etc.) to continue serving requests.
Key benefits:
- Zero downtime when internet is unavailable
- Works in air-gapped environments
- Free inference (no API costs)
- Low latency (~50-100 ms to first token on CPU; see Performance below)
How It Works
1. Cloud first: Runtime tries the configured cloud providers.
2. Timeout detection: if no provider responds within 5 seconds (configurable), the request is treated as failed.
3. Automatic fallback: Runtime switches to the local model and serves the request on-device.
4. User gets a response: from the API's perspective, the response looks exactly the same as one served from the cloud.
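The sketch below illustrates this sequence in TypeScript. It is illustrative only: the helper names (tryCloudProviders, runLocalModel, withTimeout) are placeholders, not Runtime's actual internals or API.

```ts
// Illustrative sketch of the failover sequence above.
// tryCloudProviders and runLocalModel are placeholder stubs, not Runtime's real API.

type Message = { role: "system" | "user" | "assistant"; content: string };
type ChatRequest = { model: string; messages: Message[] };

const CLOUD_TIMEOUT_MS = 5_000; // the configurable 5-second timeout from step 2

async function handleChat(req: ChatRequest): Promise<{ text: string; servedBy: "cloud" | "local" }> {
  try {
    // Steps 1-2: race the cloud providers against the timeout.
    const text = await withTimeout(tryCloudProviders(req), CLOUD_TIMEOUT_MS);
    return { text, servedBy: "cloud" };
  } catch {
    // Step 3: any timeout or provider error triggers the local fallback.
    const text = await runLocalModel(req);
    // Step 4: the caller receives the same response shape either way.
    return { text, servedBy: "local" };
  }
}

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error("cloud providers timed out")), ms)
    ),
  ]);
}

// Placeholder stubs so the sketch type-checks; real implementations would call
// the provider pool and the llama.cpp-backed local model respectively.
async function tryCloudProviders(_req: ChatRequest): Promise<string> {
  throw new Error("no cloud provider reachable");
}
async function runLocalModel(_req: ChatRequest): Promise<string> {
  return "response generated on-device";
}
```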
Configuration
{
  local_fallback: {
    enabled: true,
    model_path: "models/phi-3-mini-4k-instruct-q4.gguf",
    context_size: 4096,
    threads: 4,              // CPU threads
    max_tokens: 512,
    temperature: 0.7,
    cost_per_1k_tokens: 0.0  // Free!
  }
}
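If you generate or validate this configuration programmatically, the block above maps to a shape like the following. This is a sketch for illustration; the interface name is ours, not part of Runtime.

```ts
// Hedged sketch: a TypeScript shape mirroring the local_fallback block above.
interface LocalFallbackConfig {
  enabled: boolean;
  model_path: string;         // path to a GGUF model file
  context_size: number;       // context window the model is loaded with, in tokens
  threads: number;            // CPU threads used for inference
  max_tokens: number;         // cap on generated tokens per request
  temperature: number;
  cost_per_1k_tokens: number; // 0.0 - local inference is free
}
```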
Supported Models
Any GGUF model compatible with llama.cpp:
- Phi-3 Mini (2.3 GB) - Fast, CPU-optimized
- Mistral 7B (4.1 GB) - Better quality
- Llama 3 8B (4.7 GB) - High quality
- Gemma 7B (4.5 GB) - Google's model
See Local Models for the complete list.
Performance
| Metric | Typical Performance |
|---|---|
| First token latency | 50-100ms |
| Tokens/second | 15-30 (CPU) |
| Model load time | 2-3 seconds |
| Memory usage | ~2.5 GB |
Usage
Fallback activates automatically; no code changes are needed:
# This request uses a cloud provider if one is available, and the local model if not
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
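As the curl example suggests, the endpoint follows the OpenAI chat-completions format, so an OpenAI-style client can be pointed at Runtime. The TypeScript sketch below uses the official openai npm package; the baseURL and the placeholder apiKey value are assumptions about your deployment, not documented Runtime settings.

```ts
// Hedged sketch: calling Runtime through the openai npm package.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1", // assumed: Runtime's OpenAI-compatible endpoint
  apiKey: "unused-locally",            // assumed: Runtime manages the real provider keys
});

const completion = await client.chat.completions.create({
  model: "gpt-4",
  messages: [{ role: "user", content: "Hello" }],
});

// The request looks the same whether it is served by a cloud provider or the local model.
console.log(completion.choices[0].message.content);
```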
Testing Offline
Enable airplane mode (or otherwise disconnect from the network) and make a request; you'll still get a response, generated by the local model.
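For a scripted check, the sketch below sends the same request with fetch and prints the reply. It assumes an OpenAI-style response body (choices[0].message.content), which is an assumption about Runtime's output format rather than something documented here.

```ts
// Offline smoke test: run with the network disabled and Runtime on localhost:8080.
// Assumes an OpenAI-compatible response shape (choices[0].message.content).
const res = await fetch("http://localhost:8080/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gpt-4",
    messages: [{ role: "user", content: "Are you running locally?" }],
  }),
});

if (!res.ok) throw new Error(`local fallback did not answer: HTTP ${res.status}`);
const body = await res.json();
console.log(body.choices[0].message.content);
```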
See Quick Start for details.