Frequently Asked Questions
Common questions about Igris Runtime.
General
What is Igris Runtime?
Igris Runtime is an offline-first AI inference engine that runs local LLM models (Phi-3, Mistral, Llama) with automatic cloud fallback. It keeps your AI working even when the internet doesn't.
How is it different from Ollama or LM Studio?
| Feature | Igris Runtime | Ollama | LM Studio |
|---|---|---|---|
| Cloud fallback | ✅ Automatic | ❌ No | ❌ No |
| AI Agents | ✅ Reflection, Planning, Swarm | ❌ No | ❌ No |
| Tool calling | ✅ HTTP, Shell, FS | ❌ No | ❌ No |
| On-device training | ✅ QLoRA | ❌ No | ❌ No |
| MCP Swarm | ✅ P2P context sync | ❌ No | ❌ No |
Setup & Installation
What are the system requirements?
Minimum:
- 4 GB RAM
- 4 CPU cores
- 5 GB disk space (including model)
- Rust 1.75+
Recommended:
- 8 GB RAM
- 8 CPU cores
- 10 GB disk space
- GPU (optional, for acceleration)
Can I run this on a Raspberry Pi?
Yes! Raspberry Pi 4/5 with 4GB+ RAM works well with Phi-3 Mini. Expect 10-15 tokens/sec.
What models are supported?
Any GGUF model compatible with llama.cpp:
- Phi-3 Mini (2.3 GB) - Recommended
- Mistral 7B (4.1 GB)
- Llama 3 8B (4.7 GB)
- Gemma 7B (4.5 GB)
See Local Models for complete list.
Local LLM Fallback
How does the fallback work?
1. A request comes in for "gpt-4"
2. The Runtime tries the cloud provider (5-second timeout)
3. If the cloud fails or is unreachable, it automatically switches to local Phi-3
4. You get a response either way

It's completely transparent: no code changes are needed.
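For illustration, a minimal fallback config might look like the sketch below. The `providers` array and the GGUF filename appear elsewhere in this FAQ; the per-provider fields (`name`, `api_key`, `timeout_ms`) and the `local_model` key are assumptions, so check the Configuration docs for the actual schema.
```json5
{
  // Cloud providers are tried first. The field names inside each entry
  // are illustrative assumptions, not the documented schema.
  providers: [
    { name: "openai", api_key: "${OPENAI_API_KEY}", timeout_ms: 5000 }
  ],
  // Assumed key for the local fallback model; the file matches the
  // default GGUF referenced elsewhere in this FAQ.
  local_model: "models/phi-3-mini-4k-instruct-q4.gguf"
}
```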
Does it work 100% offline?
Yes! Configure `providers: []` (empty) and the Runtime works entirely offline using only the local model.
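The only documented requirement for offline mode is the empty `providers` list; the `local_model` key below is an assumed name for where the GGUF path goes:
```json5
{
  providers: [],  // no cloud providers: the runtime never makes a network call
  local_model: "models/phi-3-mini-4k-instruct-q4.gguf"  // assumed key name
}
```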
How fast is the local model?
Typical performance (Phi-3 Mini Q4):
- First token: 50-100ms
- Throughput: 15-30 tokens/sec (CPU)
- With GPU: 30-60 tokens/sec
Can I use multiple models?
Currently one model at a time. Multi-model support is planned for future releases.
Advanced Features
What are Reflection Agents?
Self-improving AI that critiques and refines its own responses. The model generates an answer, scores it, and regenerates if quality is below threshold.
Use cases: Content generation, code review, quality assurance.
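If it helps to picture the setup, here is a purely hypothetical config sketch; none of these `agents.reflection` keys come from this FAQ, so treat them as placeholders and consult the agents documentation for the real options:
```json5
{
  agents: {
    reflection: {
      enabled: true,          // hypothetical flag
      quality_threshold: 0.8, // regenerate when the self-score falls below this
      max_revisions: 2        // cap on the critique-and-regenerate loop
    }
  }
}
```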
What is MCP Swarm Mode?
Peer-to-peer context sharing across multiple Runtime instances. All instances auto-discover each other and sync conversation context in real-time.
Use cases: High availability, edge AI, distributed workloads.
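As a hypothetical sketch only (no swarm config keys are documented in this FAQ), enabling swarm mode could look something like:
```json5
{
  swarm: {
    enabled: true,          // hypothetical flag
    discovery: "mdns",      // assumed mechanism for peer auto-discovery
    sync_interval_ms: 500   // assumed cadence for context sync
  }
}
```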
Can I fine-tune the local model?
Yes! QLoRA training enables on-device fine-tuning. After 100 requests (configurable), Runtime automatically trains a domain-specific adapter.
Benefits:
- Model specializes to your use case
- No data leaves the device
- Small adapters (< 64 MB)
See QLoRA Training for details.
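As a sketch only: the "after 100 requests (configurable)" behaviour suggests settings along these lines, but every key name here is an assumption; the QLoRA Training page has the actual options:
```json5
{
  training: {
    enabled: true,               // hypothetical flag
    trigger_after_requests: 100, // matches the default mentioned above
    adapter_dir: "adapters/"     // assumed output location for LoRA adapters
  }
}
```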
Performance
Why is inference slow?
Common causes:
- Too few threads: Increase `threads` in config
- Large model: Use Phi-3 instead of Llama 3
- High quantization: Use Q4 instead of Q8
- Large context: Reduce `context_size`
See Configuration for optimization tips.
How can I make it faster?
- Increase CPU threads: Match your core count
- Enable GPU layers: Set `n_gpu_layers` if you have a GPU
- Use prompt caching: Enable `prompt_cache_dir`
- Reduce context: Lower `context_size` to 2048
- Smaller model: Use Q4 quantization
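Pulling the keys named above into one place, a tuning sketch might look like this. `threads`, `context_size`, `n_gpu_layers`, and `prompt_cache_dir` are the settings referenced in this FAQ; the flat layout and the example values are assumptions to adapt to your hardware:
```json5
{
  threads: 8,                          // match your physical core count
  context_size: 2048,                  // smaller context uses less memory
  n_gpu_layers: 32,                    // only if a GPU is available; -1 offloads all layers
  prompt_cache_dir: ".cache/prompts"   // assumed path for the prompt cache
}
```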
Will GPU acceleration help?
Yes! With an NVIDIA GPU, expect 5-10x faster inference. Set `n_gpu_layers: 32` (or `-1` to offload all layers).
Deployment
Can I run this in Docker?
Yes! We provide official Dockerfiles. See Deployment Guide.
Does it work on Kubernetes?
Yes! We provide example Kubernetes manifests. See Deployment Guide.
Can I deploy to AWS/GCP/Azure?
Yes! Runtime runs anywhere you can run a Docker container or Linux binary. Works on EC2, GCE, Azure VMs, etc.
What about edge devices?
Runtime works great on edge:
- Raspberry Pi 4/5
- NVIDIA Jetson
- Intel NUC
- Other ARM64/x86_64 devices
See Deployment - Edge Devices.
Security
Is my data safe?
Yes:
- Local inference keeps all data on device
- LoRA adapters encrypted at rest (AES-256-GCM)
- MCP context encrypted
- No telemetry or data collection
Should I enable authentication?
Yes for production! Enable API key authentication:
```json5
{
  auth: {
    enabled: true,
    api_key: "${IGRIS_API_KEY}"
  }
}
```
Can I use it in air-gapped environments?
Absolutely! That's a core use case. Just:
- Download model offline
- Copy Runtime binary
- Run with `providers: []` (no cloud)
Works 100% offline with zero network calls.
Tool Use
Is tool calling safe?
Tools are disabled by default and require explicit whitelisting:
- HTTP: Domain whitelist
- Shell: Command whitelist
- Filesystem: Path whitelist
Always use whitelisting in production.
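To make the whitelist idea concrete, a locked-down `tools` section could look roughly like the sketch below. The `enable_*` flags appear elsewhere in this FAQ; the whitelist key names are assumptions, so check the Tool Use docs for the real ones:
```json5
{
  tools: {
    enable_http: true,
    allowed_domains: ["api.example.com"],  // assumed HTTP domain whitelist key
    enable_shell: true,
    allowed_commands: ["ls", "cat"],       // assumed shell command whitelist key
    enable_filesystem: true,
    allowed_paths: ["/data/igris"]         // assumed filesystem path whitelist key
  }
}
```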
Can I disable tools?
Yes:
```json5
{
  tools: {
    enable_http: false,
    enable_shell: false,
    enable_filesystem: false
  }
}
```
What tools are available?
- HTTP: GET/POST requests to external APIs
- Shell: Execute shell commands (sandboxed)
- Filesystem: Read/write/list files (sandboxed)
See Tool Use for details.
Troubleshooting
Model file not found
Error: `Model file not found: models/phi-3-mini-4k-instruct-q4.gguf`
Solution: Run `./download-model.sh` to download the model.
Out of memory
`thread 'main' panicked at 'out of memory'`
Solutions:
- Use smaller model (Phi-3 instead of Llama)
- Reduce `context_size` to 2048
- Reduce `threads` to 2
- Close other applications
Port already in use
Error: `Address already in use (os error 48)`
Solutions:
- Change the port in `config.json5` (see the sketch below)
- Kill the process using the port: `lsof -ti:8080 | xargs kill -9`
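For the first option, the change might look like this; the exact location of the port setting in `config.json5` is an assumption, so adjust to your config's actual layout:
```json5
{
  server: {
    port: 8081  // assumed key path; moves the listener off the conflicting 8080
  }
}
```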
Binary size too large
Solution: Compress with UPX: `upx --best --lzma target/release/igris-runtime`
Reduces the binary from ~12 MB to ~4 MB.
Costs
Is Igris Runtime free?
Yes! It's open source (MIT/Apache-2.0 license).
What about inference costs?
- Local inference: Free (electricity only)
- Cloud fallback: You pay your cloud provider directly (OpenAI, Anthropic, etc.)
Do I need API keys?
Only if you want cloud fallback. For pure offline mode, no API keys needed.
Support
Where can I get help?
- Documentation: You're reading it!
- GitHub Issues: Report bugs
- GitHub Discussions: Ask questions
How do I report a bug?
Open an issue on GitHub with:
- Runtime version (`./igris-runtime --version`)
- Config file (redact secrets!)
- Error logs
- Steps to reproduce
Can I contribute?
Yes! Pull requests welcome. See CONTRIBUTING.md.
Roadmap
What's coming next?
- Model hot-swapping (no restart needed)
- Multi-model support (run multiple models)
- WebAssembly plugins
- Distributed training across swarm
- More agent types (code generation, research)
How often are releases?
Approximately monthly for minor releases, quarterly for major features.
Comparisons
Runtime vs Ollama?
Choose Runtime if you need:
- Cloud fallback for reliability
- AI agents (reflection, planning, swarm)
- Tool calling
- On-device training
- MCP context sharing
Choose Ollama if you need:
- Simpler setup
- Just local inference
- Community model library
Runtime vs LM Studio?
Runtime advantages:
- Production-ready server
- Advanced agents
- Cloud fallback
- Rust performance
- Smaller binary
LM Studio advantages:
- GUI interface
- Easier for beginners
Still have questions? Ask on GitHub Discussions.