Technical Reference

This page covers architecture, configuration, the adapter protocol, and operational details. If you are looking for how to get started or choose a topology, see Getting Started and Topologies.

Architecture

┌──────────────────────────────────────────────────────────┐
│            orchestrators/tree_orchestrator.py            │
│           (team decomposition + git branching)           │
│                                                          │
│    Decompose → team branches → merge → integration QA    │
└─────────┬────────────────────────────────────────────────┘
          │
          ▼
┌───────────────────────────────────────────────┐
│       orchestrators/dag_orchestrator.py       │
│          (orchestrator + wave loop)           │
│                                               │
│  Decompose → BUILD wave → QA wave → FIX wave  │
└──────────┬──────────────────────────┬─────────┘
           │                          │
    ┌──────▼───────┐          ┌───────▼───────┐
    │ MessageBroker│          │ Shared        │
    │ ROUTER :5555 │          │ Workspace     │
    │ PUB    :5556 │          │ runs/shared-* │
    └──────┬───────┘          └───────────────┘
           │
    ┌──────┼──────────┬───────────┐
    ▼      ▼          ▼           ▼
  Alice   Bob      Charlie      Dave
 (DEALER) (DEALER)  (DEALER)   (DEALER)
  + SUB    + SUB     + SUB      + SUB

Each agent is an independent subprocess running runtime/builtin_agent_runner.py with:

  • a DEALER socket to the broker router
  • a SUB socket to the broker publisher
  • a shared filesystem workspace
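
A minimal sketch of that agent-side socket setup using pyzmq (the helper name and return shape here are illustrative, not the project's actual API; the addresses are the broker defaults):

```python
import zmq


def make_agent_sockets(agent_id: str, router_addr: str,
                       sub_addr: str, topics: list[str]):
    """Open the two broker-facing sockets an agent needs (illustrative)."""
    ctx = zmq.Context.instance()
    dealer = ctx.socket(zmq.DEALER)              # request/reply lane to the broker
    dealer.setsockopt_string(zmq.IDENTITY, agent_id)
    dealer.connect(router_addr)                  # e.g. tcp://localhost:5555
    sub = ctx.socket(zmq.SUB)                    # broadcast lane from the broker
    for topic in topics:
        sub.setsockopt_string(zmq.SUBSCRIBE, topic)
    sub.connect(sub_addr)                        # e.g. tcp://localhost:5556
    return ctx, dealer, sub
```

Because ZeroMQ connects are asynchronous, the agent can open both sockets before the broker is reachable; messages queue until the connection completes.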

Adapter Protocol

You can plug in your own agent implementation without changing the orchestrators. There are two adapter styles:

Function adapter

The simplest path. Write a Python function with this signature:

def run(input: dict, *, session=None, **kwargs) -> dict:
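
For example, a trivial adapter that just acknowledges the task (the {"status", "summary"} result shape mirrors the run_result message of the process adapter; any input keys beyond "task" are implementation-defined):

```python
def run(input: dict, *, session=None, **kwargs) -> dict:
    """Toy function adapter: acknowledge the task and report success."""
    task = input.get("task", "")
    return {"status": "ok", "summary": f"received task: {task[:60]}"}
```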

Run it with:

epsilon runs create \
  --topology dag \
  --task "your task" \
  --implementation python:path/to/file.py:run

Process adapter

Use this for non-Python implementations, or when you need full subprocess isolation. Your process reads JSON lines from stdin and writes JSON lines to stdout:

Input (first line on stdin):

{"type": "run_request", "task": "...", "workspace": "...", "agent_id": "...", "topics": ["general"]}

Output (final line on stdout):

{"type": "run_result", "status": "ok", "summary": "..."}

Optional stdout messages during execution:

{"type": "log", "message": "..."}
{"type": "send_message", "content": "...", "topic": "general"}
{"type": "check_messages"}

Message polling reply (from the runner, on stdin):

{"type": "message_batch", "messages": [...]}
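
Putting the pieces together, a minimal process adapter might look like this (a sketch of the line protocol above; stream parameters are used so the logic is testable, and the message handling is deliberately trivial):

```python
"""Minimal process adapter sketch: read run_request, poll once, emit run_result."""
import json
import sys


def emit(obj, out=sys.stdout):
    out.write(json.dumps(obj) + "\n")
    out.flush()


def main(stdin=sys.stdin, stdout=sys.stdout):
    request = json.loads(stdin.readline())        # first line: run_request
    emit({"type": "log", "message": "starting: " + request["task"][:60]}, stdout)
    emit({"type": "check_messages"}, stdout)      # ask the runner for peer messages
    batch = json.loads(stdin.readline())          # runner replies with message_batch
    n = len(batch.get("messages", []))
    emit({"type": "run_result", "status": "ok",   # final line must be run_result
          "summary": f"done; saw {n} peer message(s)"}, stdout)

# As a real adapter, call main() so stdin/stdout carry the protocol:
#   python3 path/to/adapter.py
```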

Run it with:

epsilon runs create \
  --topology dag \
  --task "your task" \
  --implementation "python3 path/to/adapter.py"

Agent Tools

The built-in agent has access to these tools:

Tool            Description
run_bash        Execute shell commands
read_file       Read file contents
write_file      Create or overwrite files
edit_file       Surgical string replacement
sql_query       Parameterized SQL query execution
web_search      Search the web
fetch_url       Fetch URL contents
call_llm        Call an allowlisted delegate LLM
plan            Enter planning mode
submit_plan     Submit subtasks after planning
mark_complete   Advance to the next subtask
done            Signal task completion
send_message    Broadcast or direct a message to another agent
check_messages  Receive messages from other agents
submit_task     Add a task to the shared queue
request_task    Pull the next task from the queue

Configuration

runtime_settings.json

Controls the built-in agent's model, iteration limits, and delegate LLM settings:

{
  "defaultSettingsPack": "default",
  "settingsPacks": {
    "default": {
      "description": "Default configuration",
      "model": "openai/gpt-5.2",
      "max_iterations": 100,
      "max_runtime_seconds": 600,
      "max_tokens": 4096
    },
    "anthropic": {
      "description": "Anthropic model configuration",
      "model": "anthropic/claude-opus-4-6",
      "max_iterations": 100,
      "max_runtime_seconds": 600,
      "max_tokens": 4096
    }
  },
  "delegate_llm": {
    "enabled": true,
    "default_model": "openai/gpt-5.2",
    "allowed_models": [
      "openai/gpt-5.2",
      "anthropic/claude-opus-4-6"
    ],
    "limits": {
      "max_tokens_default": 512,
      "max_tokens_min": 32,
      "max_tokens_max": 1024,
      "temperature_default": 0.0,
      "temperature_min": 0.0,
      "temperature_max": 0.7,
      "timeout_seconds_default": 30,
      "timeout_seconds_min": 5,
      "timeout_seconds_max": 60,
      "prompt_max_chars": 12000,
      "response_max_chars": 12000
    }
  }
}

The call_llm tool only allows models listed in delegate_llm.allowed_models. Token counts, temperature, and timeouts are bounded by delegate_llm.limits.
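
As an illustration of those bounds, an out-of-range request would be clamped roughly like this (hypothetical helper; the real enforcement lives in the agent runtime):

```python
# Values taken from the delegate_llm.limits block above.
LIMITS = {
    "max_tokens_min": 32, "max_tokens_max": 1024,
    "temperature_min": 0.0, "temperature_max": 0.7,
}


def clamp_call(max_tokens: int, temperature: float) -> tuple[int, float]:
    """Clamp a call_llm request into the configured delegate limits."""
    mt = min(max(max_tokens, LIMITS["max_tokens_min"]), LIMITS["max_tokens_max"])
    t = min(max(temperature, LIMITS["temperature_min"]), LIMITS["temperature_max"])
    return mt, t
```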

Environment Variables

Core

Variable              Default      Description
OPENAI_API_KEY                     OpenAI API key
ANTHROPIC_API_KEY                  Anthropic API key
SETTINGS_PACK         default      Config pack from runtime_settings.json
AGENT_MODEL           from pack    LiteLLM model override
ORCHESTRATOR_MODEL    from pack    Model for decomposition and review calls
LLM_TIMEOUT_SECONDS   120          Timeout per model call
LLM_MAX_RETRIES       2            Retries per model call
LLM_API_BASE          unset        Optional LiteLLM API base override
SQL_DATABASE_URL      unset        Default SQLAlchemy DB URL
MAX_ITERATIONS        100          Max tool calls per agent
MAX_RUNTIME_SECONDS   600          Hard timeout per agent
SHARED_WORKSPACE      auto         Shared directory path

Multi-Agent Protocol

Variable                                 Default                Description
PROTOCOL_ENABLED                         false                  Enable ZeroMQ messaging
AGENT_ID                                 auto                   Unique agent identifier
BROKER_MODE                              host                   host or connect
BROKER_ROUTER                            tcp://localhost:5555   Broker router address
BROKER_SUB                               tcp://localhost:5556   Broker pub address
AGENT_TOPICS                             general                Subscription topics
WORK_QUEUE_ENABLED                       false                  Enable work queue tools
PROTOCOL_HEARTBEAT_INTERVAL_SECONDS      5                      Agent heartbeat interval
BROKER_HEARTBEAT_TIMEOUT_SECONDS         30                     Broker liveness timeout
BROKER_LEASE_TIMEOUT_SECONDS             60                     Task lease timeout
BROKER_SWEEP_INTERVAL_SECONDS            1                      Broker maintenance sweep interval
BROKER_MAX_REDELIVERIES                  5                      Max redeliveries before dead-letter
BROKER_MAX_FAIL_RETRIES                  0                      Max retries after explicit TASK_FAIL
BROKER_REDELIVERY_BACKOFF_BASE_SECONDS   0                      Redelivery backoff base
BROKER_REDELIVERY_BACKOFF_MAX_SECONDS    30                     Max redelivery backoff
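
One plausible reading of the backoff pair is capped exponential backoff; a hypothetical sketch (the broker's actual formula may differ):

```python
def redelivery_delay(attempt: int, base: float, cap: float) -> float:
    """Delay before the Nth redelivery: base doubled per attempt, capped.

    With base 0 (the default), redelivery is immediate.
    """
    if base <= 0:
        return 0.0
    return min(base * (2 ** (attempt - 1)), cap)
```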

Orchestrator

Variable                   Default      Description
MAX_WAVES                  3            QA/fix retry waves
QA_ITERATIONS              30           QA agent iteration budget
FIX_ITERATIONS             15           Fix agent iteration budget
FIX_RUNTIME_SECONDS        120          Fix agent runtime budget
ORCHESTRATOR_MODEL         from pack    Model for task decomposition
COLLAB_EXECUTOR            host         host or docker backend
COLLAB_DOCKER_IMAGE        epsilon      Docker image
COLLAB_DOCKER_AUTO_BUILD   0            Auto-build missing image
COLLAB_DOCKER_USER         unset        Optional container user override

Tree Orchestrator

Variable            Default   Description
MAX_WAVES           2         QA/fix waves per team
INTEGRATION_WAVES   2         Integration QA waves after merge

QA Loop

When MAX_WAVES > 0, the orchestrator runs a QA agent after each build wave. The QA agent:

  1. reads source files
  2. installs dependencies
  3. runs tests
  4. starts the server and exercises endpoints
  5. checks for common integration mistakes
  6. writes qa_report.json

If QA fails, the orchestrator assigns errors back to responsible agents, reruns fix tasks, and repeats until QA passes or the wave budget is exhausted.
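
That control flow can be sketched as follows (run_qa and run_fixes are hypothetical stand-ins for the orchestrator's internals, not its real API):

```python
def qa_fix_loop(run_qa, run_fixes, max_waves: int) -> bool:
    """Run QA after each build wave; dispatch fixes until QA passes or budget ends."""
    for _wave in range(max_waves):
        report = run_qa()              # e.g. the parsed qa_report.json
        if report["passed"]:
            return True                # QA clean: stop early
        run_fixes(report["errors"])    # reassign errors to responsible agents
    return False                       # wave budget exhausted
```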

Messaging Protocol

The protocol is split into three planes:

  • Transport plane: ZeroMQ sockets move bytes
  • Topology plane: routing policy decides delivery
  • Coordination plane: heartbeats, leases, renewals, and redelivery

Current reliability semantics:

  • at-least-once delivery for work queue tasks
  • lease-based queue ownership
  • heartbeat-driven liveness eviction
  • dead-letter protection for poison tasks
  • bounded retries for explicit task failure
  • broadcast and directed messaging
  • last-value cache replay for topic state

Detailed contract: PROTOCOL_CONTRACT.md

Docker

Build the image:

docker build -t epsilon .

Run with Docker:

docker run --env-file .env epsilon "Build a URL shortener microservice"

Or use Docker as the executor backend:

COLLAB_EXECUTOR=docker COLLAB_DOCKER_IMAGE=epsilon \
  epsilon runs create --topology dag --task "Build a URL shortener microservice"

Scale Benchmark Harness

Start a benchmark run:

python scripts/run_scale_benchmark.py \
  --benchmark wiki \
  --task-count 300 \
  --executor direct_wiki \
  --start-broker \
  --broker-router tcp://<broker-host>:5555 \
  --broker-sub tcp://<broker-host>:5556

Start worker daemons:

python runtime/worker_daemon.py \
  --worker-id worker-01 \
  --broker-router tcp://<broker-host>:5555 \
  --broker-sub tcp://<broker-host>:5556 \
  --max-concurrent-local 1

Benchmark modes: --benchmark wiki, --benchmark compiler

Executors: --executor direct_wiki, --executor agent

Project Structure

├── orchestrate.py              # pattern dispatcher
├── orchestrators/
│   ├── patterns.py             # pattern registry
│   ├── dag_orchestrator.py
│   ├── tree_orchestrator.py
│   ├── pipeline_orchestrator.py
│   ├── supervisor_orchestrator.py
│   ├── work_queue_orchestrator.py
│   ├── sharded_queue_orchestrator.py
│   ├── map_reduce_orchestrator.py
│   ├── population_search_orchestrator.py
│   ├── population_search_engine.py
│   └── queue_runtime.py
├── runtime/
│   ├── builtin_agent_runner.py  # native agent startup
│   ├── epsilon_sdk.py           # adapter SDK
│   ├── epsilon_runner.py        # process adapter bridge
│   ├── epsilon_function_runner.py # function adapter bridge
│   └── worker_daemon.py         # queue worker daemon
├── agent/
│   ├── worker.py                # agent main loop
│   ├── tool_registry.py         # tool definitions
│   ├── prompts.py               # system prompts
│   └── models/                  # LiteLLM client
├── agent_protocol/              # ZeroMQ messaging
├── epsilon/                     # CLI and Python client
├── examples/                    # SDK starter templates
├── manifests/                   # sample task manifests
├── runtime_settings.json        # model and agent config
└── runs/                        # recorded run outputs