
Requests Hang Indefinitely

Symptoms:
  • orpheus run never returns
  • No error message, just waiting
  • Request shows as “STARTED” in execlog but never completes
Diagnosis:
# Check if request is stuck
orpheus execlog list <agent> --status STARTED

# Check queue metrics
curl http://localhost:7777/v1/stats
Common Causes:
  1. Worker crashed during execution (OOM, segfault)
  2. Agent code has infinite loop
  3. External API not responding
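For the third cause, a client-side deadline turns a hung external call into a fast failure instead of an indefinitely stuck task. A minimal sketch using only the standard library (call_with_deadline is illustrative, not an Orpheus API):

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def call_with_deadline(fn, *args, deadline=10.0, **kwargs):
    """Run fn(*args, **kwargs) but give up after `deadline` seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args, **kwargs).result(timeout=deadline)
    except FutureTimeout:
        raise TimeoutError(f"call exceeded {deadline}s deadline")
    finally:
        # Don't block on a hung thread; it is abandoned, not killed.
        pool.shutdown(wait=False)
```

This bounds how long the task waits, not the work itself: the abandoned thread keeps running until the underlying call returns.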
Solutions:
Immediate fix:
# Restart daemon to clear stuck workers
sudo systemctl restart orpheusd  # Linux
# or
orpheus vm restart  # macOS
Long-term fix:
# In agent.yaml - increase timeout if legitimate
timeout: 120  # Allow more time for slow operations

“Queue is Full” Errors

Symptoms:
  • HTTP 503 “agent queue is full”
  • Requests rejected immediately
Diagnosis:
# Check queue depth
curl http://localhost:7777/v1/agents/<agent>/stats
Common Causes:
  1. Queue size too small for burst traffic
  2. All workers busy, queue fills up
  3. max_workers limit reached
Solutions:
Increase queue capacity:
# agent.yaml
scaling:
  queue_size: 100  # Was 50
Increase worker count:
scaling:
  max_workers: 20  # Was 10
Client-side backoff:
import time

for attempt in range(5):
    try:
        response = orpheus.run(agent, input_data)
        break
    except QueueFullError:
        time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, 8s, 16s
else:
    raise RuntimeError("queue still full after 5 attempts")

Task Timeout Errors

Symptoms:
  • Error: “task timeout after 60s”
  • Task is killed mid-execution
Diagnosis:
# Check timeout configuration
orpheus config show <agent> | grep timeout

# Check task duration distribution
orpheus execlog list <agent> --limit 100 | grep duration
Common Causes:
  1. Slow external API calls (LLM APIs can take 10-30s)
  2. Large file processing
  3. Inefficient code
Solutions:
Increase timeout:
# agent.yaml
timeout: 120  # Allow 2 minutes
Optimize agent code:
# Add progress logging to identify bottleneck
logger.info("Starting LLM call...")
result = llm.call(prompt)  # Identify slow operations
logger.info("LLM call completed")
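To make "identify slow operations" systematic, a small timing context manager (a sketch, not part of any Orpheus SDK) can wrap each suspect step instead of paired log lines:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("agent")

@contextmanager
def timed(label):
    """Log how long the wrapped block took."""
    start = time.monotonic()
    try:
        yield
    finally:
        logger.info("%s took %.2fs", label, time.monotonic() - start)

# In the handler:
# with timed("llm call"):
#     result = llm.call(prompt)
```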
Chunk large operations:
# Process in chunks to stay within the timeout
def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

for chunk in chunks(large_data, size=100):
    process(chunk)

“Seccomp Blocked Syscall” Errors

Symptoms:
  • Error: “Operation not permitted”
  • EPERM error in logs
Diagnosis:
# Check agent logs for syscall errors
sudo journalctl -u orpheusd | grep <agent> | grep -i "permission\|eperm"
Common Causes:
  1. Using os.system() or subprocess for admin commands (mount, reboot, etc.)
  2. Trying to debug with ptrace
  3. Using native extensions with blocked syscalls
Solutions:
Check the allowed syscalls in the Security Model, and rewrite code to avoid blocked operations:
# Instead of mounting (blocked):
import subprocess
subprocess.run(["mount", "/dev/sda", "/mnt"])  # ✗ mount(2) is blocked by seccomp

# Use the mounts the runtime provides:
with open("/workspace/data.txt", "w") as f:  # ✓ Allowed
    f.write(data)
If you have a legitimate use case for a blocked syscall, file an issue on GitHub.

Circuit Breaker Opened

Symptoms:
  • Error: “circuit breaker open (cooldown: 45s remaining)”
  • Workers not spawning
  • Pool stuck at low capacity
Diagnosis:
# Check logs for spawn failures
sudo journalctl -u orpheusd | grep "circuit breaker"
sudo journalctl -u orpheusd | grep "spawn failed"
Common Causes:
  1. Repeated worker spawn failures (5 consecutive failures opens circuit)
  2. Image not available
  3. runc errors
Solutions:
Wait for the cooldown (60 seconds):
# Circuit breaker will automatically retry after 60s
# Monitor logs:
sudo journalctl -u orpheusd -f
Fix underlying issue:
# Check if image exists
ls ~/.orpheus/images/<agent>/rootfs

# Check runc works
runc --version

# Manually test spawn
orpheus test-spawn <agent>
Manual reset (restart daemon):
sudo systemctl restart orpheusd

Agent Won’t Deploy

Error: agent.yaml not found
Fix: Ensure your directory has agent.yaml at the root:
my-agent/
├── agent.yaml    # Required
└── agent.py

Error: entrypoint function not found
Fix: Check that your handler function exists and matches agent.yaml:
# agent.yaml
entrypoint: handler
# agent.py
def handler(input_data):  # Must match
    ...

Daemon Not Running

Error: connection refused or daemon not reachable
Fix:
# Check status
orpheus status

# Start daemon (macOS)
orpheus vm start

# Start daemon (Linux)
sudo systemctl start orpheusd

Out of Memory (OOM)

Symptoms:
  • Worker crashes during execution
  • Exit code 137 (SIGKILL)
  • Logs show “killed” or “OOM”
Diagnosis:
# Check memory limit
orpheus config show <agent> | grep memory
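Before raising the limit, it can help to confirm where the memory actually goes. If your agent is Python, the built-in tracemalloc module gives a quick read (a sketch; do_work stands in for your real agent logic):

```python
import tracemalloc

def do_work(input_data):
    # Stand-in for your agent logic.
    return [bytes(1_000) for _ in range(1_000)]

def handler(input_data):
    tracemalloc.start()
    result = do_work(input_data)
    current, peak = tracemalloc.get_traced_memory()
    print(f"peak Python allocations: {peak / 1e6:.1f} MB")
    for stat in tracemalloc.take_snapshot().statistics("lineno")[:3]:
        print(stat)  # top three allocation sites by line
    tracemalloc.stop()
    return result
```

If the peak is well under the configured limit, suspect native extensions or child processes, which tracemalloc does not see.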
Solutions:
Increase memory limit:
# agent.yaml
memory: 1024  # Increase to 1GB
Optimize memory usage:
# Stream large files instead of loading all at once
with open("large_file.txt", "r") as f:
    for line in f:  # Iterate, don't load all
        process(line)

Slow Autoscaling

Symptoms:
  • The first request after a concurrency spike is slow
  • Takes several seconds to scale up
Common Causes:
  1. min_workers too low (starting from 1 worker)
  2. Cold start (first worker spawn takes 50-200ms)
Solutions:
Pre-warm workers:
# agent.yaml
scaling:
  min_workers: 3  # Keep 3 workers always running
  max_workers: 10

Cold Start Delays

Symptom: First request is slow
Cause: min_workers: 0 or worker died
Fix: Keep workers warm:
scaling:
  min_workers: 1

Session Not Sticky

Symptom: Requests with same session hit different workers
Possible causes:
  1. Worker was busy, request went to another
  2. Worker died between requests
This is expected behavior: session affinity is best-effort. For guaranteed state, persist it in /workspace.

Files Missing After Restart

Cause: Files were in /tmp (ephemeral) instead of /workspace (persistent)
Fix: Always use /workspace for data you need to keep:
# Wrong (lost on restart)
with open('/tmp/data.json', 'w') as f: ...

# Right (persists across restarts)
with open('/workspace/data.json', 'w') as f: ...

High Memory Usage on Host

Symptoms:
  • Host memory usage growing over time
  • Many orpheus-worker processes
Diagnosis:
# Check number of containers
ps aux | grep runc | wc -l

# Check worker count
orpheus status
Solutions:
Reduce max_workers:
# agent.yaml
scaling:
  max_workers: 10  # Reduce from higher value
Enable aggressive scale-down:
scaling:
  scale_down_threshold: 0.8  # Scale down earlier

Port Already in Use

Error: address already in use :7777
Fix:
# Find what's using the port
lsof -i :7777

# Kill it...
kill $(lsof -ti :7777)

# ...or run the daemon on a different port
orpheusd --tcp-bind :7778

Getting Help

Before Filing an Issue

  1. Check this troubleshooting guide
  2. Check the Security Model for security-related issues
  3. Check Capacity Planning for scaling issues
  4. Search existing issues on GitHub

Filing a Good Issue

Include:
  • Orpheus version: orpheus version
  • Operating system: uname -a
  • Daemon logs: sudo journalctl -u orpheusd | tail -50
  • Agent config: cat agent.yaml
  • Steps to reproduce
