## Requests Hang Indefinitely

**Symptoms:**

- `orpheus run` never returns
- No error message, just waiting
- Request shows as “STARTED” in the execlog but never completes

**Diagnosis:**

```bash
# Check if the request is stuck
orpheus execlog list <agent> --status STARTED

# Check queue metrics
curl http://localhost:7777/v1/stats
```

**Common Causes:**

- Worker crashed during execution (OOM, segfault)
- Agent code has an infinite loop
- External API not responding

**Solutions:**

Immediate fix:

```bash
# Restart the daemon to clear stuck workers
sudo systemctl restart orpheusd   # Linux
# or
orpheus vm restart                # macOS
```

Long-term fix:

```yaml
# In agent.yaml - increase the timeout if slow tasks are legitimate
timeout: 120  # Allow more time for slow operations
```
## “Queue is Full” Errors

**Symptoms:**

- HTTP 503 “agent queue is full”
- Requests rejected immediately

**Diagnosis:**

```bash
# Check queue depth
curl http://localhost:7777/v1/agents/<agent>/stats
```
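If the stats endpoint returns JSON, you can also watch queue pressure programmatically and alert before requests start bouncing. A sketch assuming the payload carries `queue_depth` and `queue_size` fields (these field names are an assumption; inspect your actual `/stats` response):

```python
import json
import urllib.request

def queue_pressure(stats: dict) -> float:
    """Fraction of queue capacity in use; 1.0 means the next request gets a 503."""
    # NOTE: field names are assumed - check your actual stats payload.
    return stats["queue_depth"] / stats["queue_size"]

def fetch_stats(url: str) -> dict:
    """Fetch and decode the agent's stats endpoint."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

Usage: `queue_pressure(fetch_stats("http://localhost:7777/v1/agents/my-agent/stats"))` returns, e.g., `0.8` when the queue is 80% full.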
**Common Causes:**

- Queue size too small for burst traffic
- All workers busy, so the queue fills up
- max_workers limit reached

**Solutions:**

Increase queue capacity:

```yaml
# agent.yaml
scaling:
  queue_size: 100  # Was 50
```

Increase worker count:

```yaml
scaling:
  max_workers: 20  # Was 10
```
Client-side backoff:

```python
import time

for attempt in range(5):
    try:
        response = orpheus.run(agent, input)
        break
    except QueueFullError:
        time.sleep(pow(2, attempt))  # Exponential backoff: 1s, 2s, 4s, 8s, 16s
else:
    raise RuntimeError("agent queue still full after 5 attempts")
## Task Timeout Errors

**Symptoms:**

- Error: “task timeout after 60s”
- Task is killed mid-execution

**Diagnosis:**

```bash
# Check timeout configuration
orpheus config show <agent> | grep timeout

# Check task duration distribution
orpheus execlog list <agent> --limit 100 | grep duration
```

**Common Causes:**

- Slow external API calls (LLM APIs can take 10-30s)
- Large file processing
- Inefficient code

**Solutions:**

Increase the timeout:

```yaml
# agent.yaml
timeout: 120  # Allow 2 minutes
```

Optimize agent code:

```python
# Add progress logging to identify the bottleneck
logger.info("Starting LLM call...")
result = llm.call(prompt)  # Identify slow operations
logger.info("LLM call completed")
```
Chunk large operations:

```python
# Process in chunks to stay within the timeout
def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

for chunk in chunks(large_data, size=100):
    process(chunk)
```
## “Seccomp Blocked Syscall” Errors

**Symptoms:**

- Error: “Operation not permitted”
- EPERM errors in logs

**Diagnosis:**

```bash
# Check agent logs for syscall errors
sudo journalctl -u orpheusd | grep <agent> | grep -i "permission\|eperm"
```

**Common Causes:**

- Using os.system() or subprocess for admin commands (mount, reboot, etc.)
- Trying to debug with ptrace
- Using native extensions that make blocked syscalls

**Solutions:**

Check the allowed syscalls in the Security Model.

Rewrite code to avoid blocked operations:

```python
# Instead of mounting (blocked):
import os
os.mount("/dev/sda", "/mnt", "ext4")  # ✗ Blocked by seccomp

# Use the provided mounts:
with open("/workspace/data.txt", "w") as f:  # ✓ Allowed
    f.write(data)
```

If you have a legitimate use case for a blocked syscall, file an issue on GitHub.
## Circuit Breaker Opened

**Symptoms:**

- Error: “circuit breaker open (cooldown: 45s remaining)”
- Workers not spawning
- Pool stuck at low capacity

**Diagnosis:**

```bash
# Check logs for spawn failures
sudo journalctl -u orpheusd | grep "circuit breaker"
sudo journalctl -u orpheusd | grep "spawn failed"
```

**Common Causes:**

- Repeated worker spawn failures (5 consecutive failures open the circuit)
- Image not available
- runc errors

**Solutions:**
Wait for the cooldown (60 seconds):

```bash
# The circuit breaker automatically retries after 60s.
# Monitor logs while you wait:
sudo journalctl -u orpheusd -f
```
Fix the underlying issue:

```bash
# Check if the image exists
ls ~/.orpheus/images/<agent>/rootfs

# Check that runc works
runc --version

# Manually test a spawn
orpheus test-spawn <agent>
```

Manual reset (restart the daemon):

```bash
sudo systemctl restart orpheusd
```
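The behavior described above (open after 5 consecutive spawn failures, retry after a 60-second cooldown) follows the standard circuit-breaker pattern. A minimal sketch of that logic, for intuition only; this is an illustrative model, not the daemon's actual implementation:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; retries after `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # Injectable for testing
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: allow another attempt
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()  # Trip the breaker

    def record_success(self):
        self.failures = 0
        self.opened_at = None
```

The key property: while the breaker is open, no spawns are attempted at all, which is why the pool appears stuck until the cooldown expires or the daemon restarts.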
## Agent Won’t Deploy

Error: `agent.yaml not found`

Fix: Ensure your directory has agent.yaml at the root:

```
my-agent/
├── agent.yaml   # Required
└── agent.py
```

Error: `entrypoint function not found`

Fix: Check that your handler function exists and matches agent.yaml:

```yaml
# agent.yaml
entrypoint: handler
```

```python
# agent.py
def handler(input_data):  # Must match
    ...
```
## Daemon Not Running

Error: `connection refused` or `daemon not reachable`

Fix:

```bash
# Check status
orpheus status

# Start the daemon (macOS)
orpheus vm start

# Start the daemon (Linux)
sudo systemctl start orpheusd
```
## Out of Memory (OOM)

**Symptoms:**

- Worker crashes during execution
- Exit code 137 (SIGKILL)
- Logs show “killed” or “OOM”

**Diagnosis:**

```bash
# Check the memory limit
orpheus config show <agent> | grep memory
```

**Solutions:**

Increase the memory limit:

```yaml
# agent.yaml
memory: 1024  # Increase to 1GB
```

Optimize memory usage:

```python
# Stream large files instead of loading them all at once
with open("large_file.txt", "r") as f:
    for line in f:  # Iterate, don't load everything
        process(line)
```
## Slow Autoscaling

**Symptoms:**

- First concurrent request is slow
- Takes several seconds to scale up

**Common Causes:**

- min_workers too low (starting from 1 worker)
- Cold start (the first worker spawn takes 50-200ms)

**Solutions:**

Pre-warm workers:

```yaml
# agent.yaml
scaling:
  min_workers: 3   # Keep 3 workers always running
  max_workers: 10
```
## Cold Start Delays

Symptom: First request is slow

Cause: `min_workers: 0`, or a worker died

Fix: Keep workers warm by setting `scaling.min_workers` above zero.
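For example, pinning a minimum pool size in agent.yaml (the value shown is illustrative):

```yaml
# agent.yaml
scaling:
  min_workers: 1  # At least one warm worker avoids cold starts
```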
## Session Not Sticky

Symptom: Requests with the same session hit different workers

Possible causes:

- The worker was busy, so the request went to another
- The worker died between requests

This is expected behavior: session affinity is best-effort. For guaranteed state, use the workspace.
## Files Missing After Restart

Cause: Files were written to /tmp (ephemeral) instead of /workspace (persistent)

Fix: Always use /workspace for data you need to keep:

```python
# Wrong (lost on restart)
with open("/tmp/data.json", "w") as f: ...

# Right (persists across restarts)
with open("/workspace/data.json", "w") as f: ...
```
## High Memory Usage on Host

**Symptoms:**

- Host memory usage growing over time
- Many orpheus-worker processes

**Diagnosis:**

```bash
# Check the number of containers
ps aux | grep runc | wc -l

# Check the worker count
orpheus status
```

**Solutions:**

Reduce max_workers:

```yaml
# agent.yaml
scaling:
  max_workers: 10  # Reduce from a higher value
```

Enable more aggressive scale-down:

```yaml
scaling:
  scale_down_threshold: 0.8  # Scale down earlier
```
## Port Already in Use

Error: `address already in use :7777`

Fix:

```bash
# Find what's using the port
lsof -i :7777

# Kill it, or bind to a different port
orpheusd --tcp-bind :7778
```
## Getting Help

### Before Filing an Issue

- Check this troubleshooting guide
- Check the Security Model for security-related issues
- Check Capacity Planning for scaling issues
- Search existing issues on GitHub

### Filing a Good Issue

Include:

- Orpheus version: `orpheus version`
- Operating system: `uname -a`
- Daemon logs: `sudo journalctl -u orpheusd | tail -50`
- Agent config: `cat agent.yaml`
- Steps to reproduce
**Next:** Debug with ExecLog to find what went wrong.