Frequently Asked Questions
Common questions about Igris Runtime.
General
What is Igris Runtime?
Igris Runtime is an offline-first AI inference engine that runs local LLM models (Phi-3, Mistral, Llama) with automatic cloud fallback. It keeps your AI working even when the internet doesn't.
How is it different from Ollama or LM Studio?
| Feature | Igris Runtime | Ollama | LM Studio |
|---|---|---|---|
| Cloud fallback | ✅ Automatic | ❌ No | ❌ No |
| AI Agents | ✅ Reflection, Planning, Swarm | ❌ No | ❌ No |
| Tool calling | ✅ HTTP, Shell, FS | ❌ No | ❌ No |
| On-device training | ✅ QLoRA | ❌ No | ❌ No |
| MCP Swarm | ✅ P2P context sync | ❌ No | ❌ No |
Setup & Installation
What are the system requirements?
Minimum:
- 4 GB RAM
- 4 CPU cores
- 5 GB disk space (including model)
- Rust 1.75+
Recommended:
- 8 GB RAM
- 8 CPU cores
- 10 GB disk space
- GPU (optional, for acceleration)
Can I run this on a Raspberry Pi?
Yes! Raspberry Pi 4/5 with 4GB+ RAM works well with Phi-3 Mini. Expect 10-15 tokens/sec.
What models are supported?
Any GGUF model compatible with llama.cpp:
- Phi-3 Mini (2.3 GB) - Recommended
- Mistral 7B (4.1 GB)
- Llama 3 8B (4.7 GB)
- Gemma 7B (4.5 GB)
See Local Models for complete list.
Local LLM Fallback
How does the fallback work?
1. A request comes in for "gpt-4"
2. The Runtime tries the cloud provider (5-second timeout)
3. If the cloud fails or is unreachable, it automatically switches to local Phi-3
4. You get a response either way

It's completely transparent: no code changes are needed.
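For illustration, a minimal fallback config might look like the sketch below. The `providers` array and the GGUF filename appear elsewhere in this FAQ; the per-provider fields (`name`, `api_key`, `timeout_ms`) and the `local_model` key are assumptions, so check the Configuration docs for the actual schema.
```json5
{
  // Cloud providers are tried first. The field names inside each entry
  // are illustrative assumptions, not the documented schema.
  providers: [
    { name: "openai", api_key: "${OPENAI_API_KEY}", timeout_ms: 5000 }
  ],
  // Assumed key for the local fallback model; the file matches the
  // default GGUF referenced elsewhere in this FAQ.
  local_model: "models/phi-3-mini-4k-instruct-q4.gguf"
}
```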
Does it work 100% offline?
Yes! Configure `providers: []` (empty) and the Runtime works entirely offline using only the local model.
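The only documented requirement for offline mode is the empty `providers` list; the `local_model` key below is an assumed name for where the GGUF path goes:
```json5
{
  providers: [],  // no cloud providers: the runtime never makes a network call
  local_model: "models/phi-3-mini-4k-instruct-q4.gguf"  // assumed key name
}
```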
How fast is the local model?
Typical performance (Phi-3 Mini Q4):
- First token: 50-100ms
- Throughput: 15-30 tokens/sec (CPU)
- With GPU: 30-60 tokens/sec
Can I use multiple models?
Currently one model at a time. Multi-model support is planned for future releases.
Advanced Features
What are Reflection Agents?
Self-improving AI that critiques and refines its own responses. The model generates an answer, scores it, and regenerates if quality is below threshold.
Use cases: Content generation, code review, quality assurance.
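If it helps to picture the setup, here is a purely hypothetical config sketch; none of these `agents.reflection` keys come from this FAQ, so treat them as placeholders and consult the agents documentation for the real options:
```json5
{
  agents: {
    reflection: {
      enabled: true,          // hypothetical flag
      quality_threshold: 0.8, // regenerate when the self-score falls below this
      max_revisions: 2        // cap on the critique-and-regenerate loop
    }
  }
}
```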
What is MCP Swarm Mode?
Peer-to-peer context sharing across multiple Runtime instances. All instances auto-discover each other and sync conversation context in real-time.
Use cases: High availability, edge AI, distributed workloads.
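As a hypothetical sketch only (no swarm config keys are documented in this FAQ), enabling swarm mode could look something like:
```json5
{
  swarm: {
    enabled: true,          // hypothetical flag
    discovery: "mdns",      // assumed mechanism for peer auto-discovery
    sync_interval_ms: 500   // assumed cadence for context sync
  }
}
```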
Can I fine-tune the local model?
Yes! QLoRA training enables on-device fine-tuning. After 100 requests (configurable), Runtime automatically trains a domain-specific adapter.
Benefits:
- Model specializes to your use case
- No data leaves the device
- Small adapters (< 64 MB)
See QLoRA Training for details.
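As a sketch only: the "after 100 requests (configurable)" behaviour suggests settings along these lines, but every key name here is an assumption; the QLoRA Training page has the actual options:
```json5
{
  training: {
    enabled: true,               // hypothetical flag
    trigger_after_requests: 100, // matches the default mentioned above
    adapter_dir: "adapters/"     // assumed output location for LoRA adapters
  }
}
```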
Performance
Why is inference slow?
Common causes:
- Too few threads: Increase `threads` in config
- Large model: Use Phi-3 instead of Llama 3
- High quantization: Use Q4 instead of Q8
- Large context: Reduce `context_size`
See Configuration for optimization tips.
How can I make it faster?
- Increase CPU threads: Match your core count
- Enable GPU layers: Set `n_gpu_layers` if you have a GPU
- Use prompt caching: Enable `prompt_cache_dir`
- Reduce context: Lower `context_size` to 2048
- Smaller model: Use Q4 quantization
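Pulling the keys named above into one place, a tuning sketch might look like this. `threads`, `context_size`, `n_gpu_layers`, and `prompt_cache_dir` are the settings referenced in this FAQ; the flat layout and the example values are assumptions to adapt to your hardware:
```json5
{
  threads: 8,                          // match your physical core count
  context_size: 2048,                  // smaller context uses less memory
  n_gpu_layers: 32,                    // only if a GPU is available; -1 offloads all layers
  prompt_cache_dir: ".cache/prompts"   // assumed path for the prompt cache
}
```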
Will GPU acceleration help?
Yes! With an NVIDIA GPU, expect 5-10x faster inference. Set `n_gpu_layers: 32` (or `-1` to offload all layers).
Deployment
Can I run this in Docker?
Yes! We provide official Dockerfiles. See Deployment Guide.
Does it work on Kubernetes?
Yes! We provide example Kubernetes manifests. See Deployment Guide.
Can I deploy to AWS/GCP/Azure?
Yes! Runtime runs anywhere you can run a Docker container or Linux binary. Works on EC2, GCE, Azure VMs, etc.
What about edge devices?
Runtime works great on edge:
- Raspberry Pi 4/5
- NVIDIA Jetson
- Intel NUC
- Other ARM64/x86_64 devices
See Deployment - Edge Devices.
Security
Is my data safe?
Yes:
- Local inference keeps all data on device
- LoRA adapters encrypted at rest (AES-256-GCM)
- MCP context encrypted
- No telemetry or data collection
Should I enable authentication?
Yes for production! Enable API key authentication:
```json5
{
  auth: {
    enabled: true,
    api_key: "${IGRIS_API_KEY}"
  }
}
```
Can I use it in air-gapped environments?
Absolutely! That's a core use case. Just:
- Download model offline
- Copy Runtime binary
- Run with `providers: []` (no cloud)
Works 100% offline with zero network calls.
Tool Use
Is tool calling safe?
Tools are disabled by default and require explicit whitelisting:
- HTTP: Domain whitelist
- Shell: Command whitelist
- Filesystem: Path whitelist
Always use whitelisting in production.
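To make the whitelist idea concrete, a locked-down `tools` section could look roughly like the sketch below. The `enable_*` flags appear elsewhere in this FAQ; the whitelist key names are assumptions, so check the Tool Use docs for the real ones:
```json5
{
  tools: {
    enable_http: true,
    allowed_domains: ["api.example.com"],  // assumed HTTP domain whitelist key
    enable_shell: true,
    allowed_commands: ["ls", "cat"],       // assumed shell command whitelist key
    enable_filesystem: true,
    allowed_paths: ["/data/igris"]         // assumed filesystem path whitelist key
  }
}
```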
Can I disable tools?
Yes:
```json5
{
  tools: {
    enable_http: false,
    enable_shell: false,
    enable_filesystem: false
  }
}
```
What tools are available?
- HTTP: GET/POST requests to external APIs
- Shell: Execute shell commands (sandboxed)
- Filesystem: Read/write/list files (sandboxed)
See Tool Use for details.
Troubleshooting
Model file not found
Error: `Model file not found: models/phi-3-mini-4k-instruct-q4.gguf`
Solution: Run `./download-model.sh` to download the model.
Out of memory
`thread 'main' panicked at 'out of memory'`
Solutions:
- Use smaller model (Phi-3 instead of Llama)
- Reduce `context_size` to 2048
- Reduce `threads` to 2
- Close other applications
Port already in use
Error: `Address already in use (os error 48)`
Solutions:
- Change the port in `config.json5` (see the sketch below)
- Kill the process using the port: `lsof -ti:8080 | xargs kill -9`
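For the first option, the change might look like this; the exact location of the port setting in `config.json5` is an assumption, so adjust to your config's actual layout:
```json5
{
  server: {
    port: 8081  // assumed key path; moves the listener off the conflicting 8080
  }
}
```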
Binary size too large
Solution: Compress with UPX: `upx --best --lzma target/release/igris-runtime`
Reduces the binary from ~12 MB to ~4 MB.
Costs
Is Igris Runtime free?
Yes! It's open source (MIT/Apache-2.0 license).
What about inference costs?
- Local inference: Free (electricity only)
- Cloud fallback: You pay your cloud provider directly (OpenAI, Anthropic, etc.)
Do I need API keys?
Only if you want cloud fallback. For pure offline mode, no API keys needed.
Support
Where can I get help?
- Documentation: You're reading it!
- GitHub Issues: Report bugs
- GitHub Discussions: Ask questions
How do I report a bug?
Open an issue on GitHub with:
- Runtime version (`./igris-runtime --version`)
- Config file (redact secrets!)
- Error logs
- Steps to reproduce
Can I contribute?
Yes! Pull requests welcome. See CONTRIBUTING.md.
Roadmap
What's coming next?
- Model hot-swapping (no restart needed)
- Multi-model support (run multiple models)
- WebAssembly plugins
- Distributed training across swarm
- More agent types (code generation, research)
How often are releases?
Approximately monthly for minor releases, quarterly for major features.
Comparisons
Runtime vs Ollama?
Choose Runtime if you need:
- Cloud fallback for reliability
- AI agents (reflection, planning, swarm)
- Tool calling
- On-device training
- MCP context sharing
Choose Ollama if you need:
- Simpler setup
- Just local inference
- Community model library
Runtime vs LM Studio?
Runtime advantages:
- Production-ready server
- Advanced agents
- Cloud fallback
- Rust performance
- Smaller binary
LM Studio advantages:
- GUI interface
- Easier for beginners
Still have questions? Ask on GitHub Discussions.