Quick Start
TL;DR: Get Igris Runtime up and running with local LLM fallback in 10 minutes. Download a model, configure the fallback, and start serving AI requests that work offline.
Get Started in 3 Steps
Step 1: Download a Local Model
Runtime uses GGUF models from Hugging Face. The included script downloads Phi-3-mini (recommended for getting started):
cd igris-runtime
./download-model.sh
What this does:
- Downloads Phi-3-mini-4k-instruct-q4.gguf (~2.3 GB)
- Saves to the models/ directory
- Takes 2-5 minutes depending on your connection
Alternative models: You can use any GGUF model compatible with llama.cpp:
- Phi-3 Mini Q4 (2.3 GB) - Best for CPU, fast inference
- Mistral 7B Q4 (4.1 GB) - Better quality, slower
- Llama 3 8B Q4 (4.7 GB) - High quality, requires more RAM
See Local Models for the full list.
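Prefer pulling one of the alternatives from Python instead of the bundled script? Here is a minimal sketch using the huggingface_hub package (pip install huggingface-hub). The repo ID and filename are examples only, so check the model page for the exact quantized file you want, then point model_path at the downloaded file in Step 2.
# Sketch: download an alternative GGUF model into models/.
# repo_id and filename are examples; verify them on the Hugging Face model page.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",  # example repo
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",   # example Q4 quantization
    local_dir="models",                                # Runtime's models/ directory
)
print(f"Saved to {path}")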
Step 2: Configure Local Fallback
Create a config.json5 file in the project root:
{
// Server configuration
server: {
host: "0.0.0.0",
port: 8080
},
// Local LLM fallback (THE KEY FEATURE)
local_fallback: {
enabled: true,
model_path: "models/phi-3-mini-4k-instruct-q4.gguf",
context_size: 4096,
threads: 4, // Adjust based on your CPU cores
max_tokens: 512,
temperature: 0.7,
cost_per_1k_tokens: 0.0 // Free!
},
// Cloud providers (optional - works without them)
providers: [
// Add cloud providers if you want cloud+local hybrid mode
// Leave empty for pure offline operation
],
// Routing configuration
routing: {
speculative: {
enabled: true,
max_providers: 3,
first_token_timeout_ms: 5000 // Fallback to local after 5s
}
}
}
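Before starting the server you can sanity-check the config. A small sketch, assuming the third-party json5 Python package (pip install json5) and the keys shown above:
# Sketch: validate config.json5 before launching the server.
# Uses the `json5` package because the config contains comments.
import os
import json5

with open("config.json5") as f:
    cfg = json5.load(f)

fallback = cfg["local_fallback"]
assert os.path.exists(fallback["model_path"]), "model not downloaded yet (run ./download-model.sh)"
assert fallback["threads"] <= (os.cpu_count() or 1), "threads exceeds available CPU cores"
print("config looks sane:", fallback["model_path"])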
Step 3: Run the Server
Development mode:
cargo run
Production mode:
cargo run --release
Expected output:
Igris Runtime v1.6 starting...
Local LLM provider initialized successfully
Model: models/phi-3-mini-4k-instruct-q4.gguf
Context size: 4096
Threads: 4
Server listening on 0.0.0.0:8080
Local LLM fallback: ENABLED ✓
That's it! Your Runtime is now serving requests.
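If you are scripting the startup (CI, systemd, container healthchecks), you can poll the health endpoint described under Health Check & Metrics until the server is ready. A minimal Python sketch:
# Sketch: wait until the Runtime answers on /v1/health before sending requests.
import time
import urllib.request

for _ in range(30):
    try:
        with urllib.request.urlopen("http://localhost:8080/v1/health", timeout=2) as resp:
            if resp.status == 200:
                print("Runtime is up")
                break
    except OSError:
        time.sleep(1)
else:
    raise SystemExit("Runtime did not come up within 30 seconds")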
Make Your First Request
Using cURL
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi3",
"messages": [
{"role": "user", "content": "What is 2+2?"}
]
}'
Response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "phi3",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "2+2 equals 4."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 6,
"total_tokens": 14
}
}
Using Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-required" # Auth is optional for local
)
response = client.chat.completions.create(
model="phi3",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Using Node.js
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8080/v1',
apiKey: 'not-required' // Auth optional
});
const response = await client.chat.completions.create({
model: 'phi3',
messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);
Test Offline Capability
Enable airplane mode, then make a request:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "What is machine learning?"}]
}'
What happens:
- Runtime tries to reach cloud provider for GPT-4
- Request times out after 5 seconds (no internet)
- Runtime automatically falls back to local Phi-3 model
- You get a response even with no internet connection
This is the killer feature: your AI keeps working offline.
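To verify the fallback from code, time the request: with the network down, the cloud attempt should give up after roughly first_token_timeout_ms and the local model still answers. Whether the returned model field reflects the local model is an assumption; inspect it on your setup.
# Sketch: confirm the local fallback by timing an offline request.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")

start = time.time()
response = client.chat.completions.create(
    model="gpt-4",  # cloud model, unreachable offline
    messages=[{"role": "user", "content": "What is machine learning?"}],
)
# Expect roughly the 5 s cloud timeout plus local inference time
print(f"Answered in {time.time() - start:.1f}s, model={response.model}")
print(response.choices[0].message.content)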
Advanced Features
Streaming Responses
Get tokens as they're generated:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi3",
"messages": [{"role": "user", "content": "Tell me a short story"}],
"stream": true
}'
Python with streaming:
response = client.chat.completions.create(
model="phi3",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True
)
for chunk in response:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end='', flush=True)
Reflection Mode
Self-improving responses with critique loops:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi3",
"mode": "reflection",
"messages": [{"role": "user", "content": "Write a professional email"}]
}'
The model will:
- Generate an initial draft
- Critique its own output
- Regenerate based on critique
- Repeat until quality threshold is met
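From the Python SDK, the non-standard mode field can be passed through extra_body, which merges extra keys into the request JSON. A sketch mirroring the curl example above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")

response = client.chat.completions.create(
    model="phi3",
    messages=[{"role": "user", "content": "Write a professional email"}],
    extra_body={"mode": "reflection"},  # same "mode" field as the curl example
)
print(response.choices[0].message.content)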
Tool Use
Let the LLM call external tools:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi3",
"mode": "tools",
"messages": [{"role": "user", "content": "Check the weather in SF"}]
}'
Configure tools in config.json5:
{
tools: {
enable_http: true,
allowed_http_domains: ["api.weather.com"],
max_execution_time_ms: 30000
}
}
Multi-Agent Swarm
Multiple agents collaborate on complex tasks:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi3",
"mode": "swarm",
"messages": [{"role": "user", "content": "Design a REST API for a blog"}]
}'
What happens:
- Researcher agent gathers requirements
- Engineer agent proposes implementation
- Critic agent finds potential issues
- Synthesizer agent combines insights into final answer
Health Check & Metrics
Check if Runtime is Running
curl http://localhost:8080/v1/health
# => "OK"
View Prometheus Metrics
curl http://localhost:8080/metrics
Sample metrics:
# HELP igris_requests_total Total number of requests
igris_requests_total{provider="local",model="phi3"} 42
# HELP igris_request_duration_seconds Request duration
igris_request_duration_seconds_sum{provider="local"} 12.34
# HELP igris_fallback_activations_total Fallback activations
igris_fallback_activations_total 15
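To watch the fallback counter from a script, the Prometheus text format is easy to parse by hand. A sketch using the metric names from the sample above:
# Sketch: scrape /metrics and print the fallback activation counter.
import urllib.request

with urllib.request.urlopen("http://localhost:8080/metrics") as resp:
    text = resp.read().decode()

for line in text.splitlines():
    if line.startswith("igris_fallback_activations_total"):
        # Prometheus text format: "<name>{labels} <value>" or "<name> <value>"
        print("fallback activations:", line.rsplit(" ", 1)[-1])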
Swagger UI
Interactive API documentation:
http://localhost:8080/swagger-ui
Next Steps
1. Try Different Models
Download and configure alternative models.
2. Enable MCP Swarm Mode
Share context across multiple Runtime instances.
3. Set Up On-Device Training
Fine-tune your local model based on usage.
4. Deploy to Production
Docker, Kubernetes, or bare metal.
5. Configure Advanced Features
Reflection, planning, tools, and more.
Troubleshooting
Model Not Found
Error: Model file not found: models/phi-3-mini-4k-instruct-q4.gguf
Solution: Run ./download-model.sh to download the model.
Out of Memory
thread 'main' panicked at 'out of memory'
Solution: Reduce context_size to 2048 and threads to 2 in config.json5.
Slow Inference
If responses take >5 seconds:
- Increase threads: Set threads: 8 if you have 8+ CPU cores
- Reduce max_tokens: Set max_tokens: 256 for faster responses
- Use a smaller model: Try Phi-3 Q2 quantization (faster but lower quality)
Port Already in Use
Error: Address already in use (os error 48)
Solution: Change port in config.json5 or kill the process using port 8080:
lsof -ti:8080 | xargs kill -9
Get Help
- Documentation: You're reading it!
- GitHub Issues: Report bugs and request features
- GitHub Discussions: Ask questions and share tips