RAG Search Agent

This agent implements a Retrieval-Augmented Generation (RAG) pipeline. It demonstrates how to maintain a persistent vector index and cache search results using the Workspace.

How it Works

  1. Ingestion: You send documents to the agent to be indexed.
  2. Persistence: The agent saves the document embeddings (simulated) into a vector store under /workspace/index/.
  3. Search: When you send a query, the agent:
    • Checks the result cache in /workspace/cache/.
    • On a cache miss, scans the persistent index.
    • Returns the most relevant document (see the sketch below).
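
The flow above can be sketched in a few lines of Python. This is a minimal illustration only, not the shipped rag_agent module: the embed and scan_index helpers, the JSON file layout, and the hash-based cache keys are assumptions made for the example.

# rag_agent.py - illustrative sketch of the handler flow, not the actual module
import hashlib
import json
import os

INDEX_DIR = "/workspace/index"
CACHE_DIR = "/workspace/cache"

def embed(text):
    # Simulated embedding: a bag of lowercase tokens
    # (the real agent would call an embedding model instead)
    return sorted(set(text.lower().split()))

def scan_index(query):
    # Naive relevance: pick the stored document sharing the most tokens with the query
    query_tokens = set(embed(query))
    best = {"result": None, "score": -1}
    if not os.path.isdir(INDEX_DIR):
        return best
    for name in os.listdir(INDEX_DIR):
        with open(os.path.join(INDEX_DIR, name)) as f:
            entry = json.load(f)
        score = len(query_tokens & set(entry["embedding"]))
        if score > best["score"]:
            best = {"result": entry["doc"], "score": score}
    return best

def handler(event):
    action = event.get("action")

    if action == "index":
        doc = event["doc"]
        doc_id = hashlib.sha256(doc.encode()).hexdigest()[:16]
        os.makedirs(INDEX_DIR, exist_ok=True)
        # Persist the raw text and its (simulated) embedding in the workspace
        with open(os.path.join(INDEX_DIR, f"{doc_id}.json"), "w") as f:
            json.dump({"doc": doc, "embedding": embed(doc)}, f)
        return {"status": "indexed", "doc_id": doc_id}

    if action == "search":
        query = event["query"]
        cache_key = hashlib.sha256(query.encode()).hexdigest()[:16]
        cache_path = os.path.join(CACHE_DIR, f"{cache_key}.json")
        if os.path.exists(cache_path):
            # Cache hit: return the previously computed result
            with open(cache_path) as f:
                return json.load(f)
        # Cache miss: scan the persistent index, then cache the result
        result = scan_index(query)
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(cache_path, "w") as f:
            json.dump(result, f)
        return result

    return {"error": f"unknown action: {action}"}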

Key Features

  • Persistent Index: The vector database lives in /workspace and survives redeploys (see the example layout below).
  • Result Caching: Results of expensive search operations are cached to disk to save compute.
  • Stateful Updates: You can incrementally add documents to the index over time.
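
After a few index and search requests, the on-disk layout under /workspace might look something like this (illustrative only; the file names and query-hash cache keys are assumptions, not the agent's actual scheme):

/workspace/
  index/
    3fa1c2d4.json    # one entry per indexed document (text + simulated embedding)
    9be04d71.json
  cache/
    7d21aa5e.json    # cached search result, keyed by a hash of the query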

Usage

# Add a document
orpheus run rag-search '{"action": "index", "doc": "Orpheus uses warm workers."}'

# Search
orpheus run rag-search '{"action": "search", "query": "warm workers"}'

Source Code

name: rag-search
runtime: python3
module: rag_agent
entrypoint: handler

memory: 1024
timeout: 600  # 10 min - LLM inference can be slow

# Model specification - ServiceManager handles model server lifecycle
model: mistral
engine: ollama  # Using Ollama for local inference (macOS-friendly)

# Environment variables
env:
  - INDEX_DIR=/agent/data/faiss_index
  - EMBEDDING_MODEL=nomic-embed-text
  - OLLAMA_MODEL=mistral

# Telemetry configuration - custom labels for Prometheus filtering
telemetry:
  enabled: true
  labels:
    team: ml_platform
    tier: standard
    use_case: rag

# Scaling configuration
scaling:
  min_workers: 1
  max_workers: 20
  target_utilization: 1.5
  scale_up_threshold: 2.0
  scale_down_threshold: 0.3
  scale_up_delay: "15s"
  scale_down_delay: "60s"
  queue_size: 50