Technical Reference

This page covers architecture, configuration, the adapter protocol, and operational details. If you are looking for how to get started or choose a topology, see Getting Started and Topologies.

Architecture

┌──────────────────────────────────────────────────────────┐
│            orchestrators/tree_orchestrator.py            │
│           (team decomposition + git branching)           │
│                                                          │
│    Decompose → team branches → merge → integration QA    │
└─────────┬────────────────────────────────────────────────┘
          │
          ▼
┌───────────────────────────────────────────────┐
│       orchestrators/dag_orchestrator.py       │
│          (orchestrator + wave loop)           │
│                                               │
│  Decompose → BUILD wave → QA wave → FIX wave  │
└──────────┬──────────────────────────┬─────────┘
           │                          │
    ┌──────▼───────┐          ┌───────▼───────┐
    │ MessageBroker│          │ Shared        │
    │ ROUTER :5555 │          │ Workspace     │
    │ PUB    :5556 │          │ runs/shared-* │
    └──────┬───────┘          └───────────────┘
           │
    ┌──────┼──────────┬───────────┐
    ▼      ▼          ▼           ▼
  Alice   Bob      Charlie      Dave
 (DEALER) (DEALER)  (DEALER)   (DEALER)
  + SUB    + SUB     + SUB      + SUB

Each agent is an independent subprocess running runtime/builtin_agent_runner.py with:

  • a DEALER socket to the broker router
  • a SUB socket to the broker publisher
  • a shared filesystem workspace
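
A minimal sketch of that agent-side socket setup using pyzmq (the helper name and return shape here are illustrative, not the project's actual API; the addresses are the broker defaults):

```python
import zmq


def make_agent_sockets(agent_id: str, router_addr: str,
                       sub_addr: str, topics: list[str]):
    """Open the two broker-facing sockets an agent needs (illustrative)."""
    ctx = zmq.Context.instance()
    dealer = ctx.socket(zmq.DEALER)              # request/reply lane to the broker
    dealer.setsockopt_string(zmq.IDENTITY, agent_id)
    dealer.connect(router_addr)                  # e.g. tcp://localhost:5555
    sub = ctx.socket(zmq.SUB)                    # broadcast lane from the broker
    for topic in topics:
        sub.setsockopt_string(zmq.SUBSCRIBE, topic)
    sub.connect(sub_addr)                        # e.g. tcp://localhost:5556
    return ctx, dealer, sub
```

Because ZeroMQ connects are asynchronous, the agent can open both sockets before the broker is reachable; messages queue until the connection completes.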

Adapter Protocol

You can plug in your own agent implementation without changing the orchestrators. There are two adapter styles:

Function adapter

The simplest path. Write a Python function with this signature:

def run(input: dict, *, session=None, **kwargs) -> dict:
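
For example, a trivial adapter that just acknowledges the task (the {"status", "summary"} result shape mirrors the run_result message of the process adapter; any input keys beyond "task" are implementation-defined):

```python
def run(input: dict, *, session=None, **kwargs) -> dict:
    """Toy function adapter: acknowledge the task and report success."""
    task = input.get("task", "")
    return {"status": "ok", "summary": f"received task: {task[:60]}"}
```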

Run it with:

epsilon runs create \
  --topology dag \
  --task "your task" \
  --implementation python:path/to/file.py:run

Process adapter

Use this for non-Python implementations, or when you need full subprocess isolation. Your process reads JSON lines from stdin and writes JSON lines to stdout:

Input (first line on stdin):

{"type": "run_request", "task": "...", "workspace": "...", "agent_id": "...", "topics": ["general"]}

Output (final line on stdout):

{"type": "run_result", "status": "ok", "summary": "..."}

Optional stdout messages during execution:

{"type": "log", "message": "..."}
{"type": "send_message", "content": "...", "topic": "general"}
{"type": "check_messages"}

Message polling reply (from the runner, on stdin):

{"type": "message_batch", "messages": [...]}
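
Putting the pieces together, a minimal process adapter might look like this (a sketch of the line protocol above; stream parameters are used so the logic is testable, and the message handling is deliberately trivial):

```python
"""Minimal process adapter sketch: read run_request, poll once, emit run_result."""
import json
import sys


def emit(obj, out=sys.stdout):
    out.write(json.dumps(obj) + "\n")
    out.flush()


def main(stdin=sys.stdin, stdout=sys.stdout):
    request = json.loads(stdin.readline())        # first line: run_request
    emit({"type": "log", "message": "starting: " + request["task"][:60]}, stdout)
    emit({"type": "check_messages"}, stdout)      # ask the runner for peer messages
    batch = json.loads(stdin.readline())          # runner replies with message_batch
    n = len(batch.get("messages", []))
    emit({"type": "run_result", "status": "ok",   # final line must be run_result
          "summary": f"done; saw {n} peer message(s)"}, stdout)

# As a real adapter, call main() so stdin/stdout carry the protocol:
#   python3 path/to/adapter.py
```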

Run it with:

epsilon runs create \
  --topology dag \
  --task "your task" \
  --implementation "python3 path/to/adapter.py"

Agent Tools

The built-in agent has access to these tools:

Tool            Description
run_bash        Execute shell commands
read_file       Read file contents
write_file      Create or overwrite files
edit_file       Surgical string replacement
sql_query       Parameterized SQL query execution
web_search      Search the web
fetch_url       Fetch URL contents
call_llm        Call an allowlisted delegate LLM
plan            Enter planning mode
submit_plan     Submit subtasks after planning
mark_complete   Advance to the next subtask
done            Signal task completion
send_message    Broadcast or direct a message to another agent
check_messages  Receive messages from other agents
submit_task     Add a task to the shared queue
request_task    Pull the next task from the queue

Configuration

runtime_settings.json

Controls the built-in agent's model, iteration limits, and delegate LLM settings:

{
  "defaultSettingsPack": "default",
  "settingsPacks": {
    "default": {
      "description": "Default configuration",
      "model": "openai/gpt-5.2",
      "max_iterations": 100,
      "max_runtime_seconds": 600,
      "max_tokens": 4096
    },
    "anthropic": {
      "description": "Anthropic model configuration",
      "model": "anthropic/claude-opus-4-6",
      "max_iterations": 100,
      "max_runtime_seconds": 600,
      "max_tokens": 4096
    }
  },
  "delegate_llm": {
    "enabled": true,
    "default_model": "openai/gpt-5.2",
    "allowed_models": [
      "openai/gpt-5.2",
      "anthropic/claude-opus-4-6"
    ],
    "limits": {
      "max_tokens_default": 512,
      "max_tokens_min": 32,
      "max_tokens_max": 1024,
      "temperature_default": 0.0,
      "temperature_min": 0.0,
      "temperature_max": 0.7,
      "timeout_seconds_default": 30,
      "timeout_seconds_min": 5,
      "timeout_seconds_max": 60,
      "prompt_max_chars": 12000,
      "response_max_chars": 12000
    }
  }
}

The call_llm tool only allows models listed in delegate_llm.allowed_models. Token counts, temperature, and timeouts are bounded by delegate_llm.limits.
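
As an illustration of those bounds, an out-of-range request would be clamped roughly like this (hypothetical helper; the real enforcement lives in the agent runtime):

```python
# Values taken from the delegate_llm.limits block above.
LIMITS = {
    "max_tokens_min": 32, "max_tokens_max": 1024,
    "temperature_min": 0.0, "temperature_max": 0.7,
}


def clamp_call(max_tokens: int, temperature: float) -> tuple[int, float]:
    """Clamp a call_llm request into the configured delegate limits."""
    mt = min(max(max_tokens, LIMITS["max_tokens_min"]), LIMITS["max_tokens_max"])
    t = min(max(temperature, LIMITS["temperature_min"]), LIMITS["temperature_max"])
    return mt, t
```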

Environment Variables

Core

Variable              Default      Description
OPENAI_API_KEY                     OpenAI API key
ANTHROPIC_API_KEY                  Anthropic API key
SETTINGS_PACK         default      Config pack from runtime_settings.json
AGENT_MODEL           from pack    LiteLLM model override
ORCHESTRATOR_MODEL    from pack    Model for decomposition and review calls
LLM_TIMEOUT_SECONDS   120          Timeout per model call
LLM_MAX_RETRIES       2            Retries per model call
LLM_API_BASE          unset        Optional LiteLLM API base override
SQL_DATABASE_URL      unset        Default SQLAlchemy DB URL
MAX_ITERATIONS        100          Max tool calls per agent
MAX_RUNTIME_SECONDS   600          Hard timeout per agent
SHARED_WORKSPACE      auto         Shared directory path

Multi-Agent Protocol

Variable                                 Default                Description
PROTOCOL_ENABLED                         false                  Enable ZeroMQ messaging
AGENT_ID                                 auto                   Unique agent identifier
BROKER_MODE                              host                   host or connect
BROKER_ROUTER                            tcp://localhost:5555   Broker router address
BROKER_SUB                               tcp://localhost:5556   Broker pub address
AGENT_TOPICS                             general                Subscription topics
WORK_QUEUE_ENABLED                       false                  Enable work queue tools
PROTOCOL_HEARTBEAT_INTERVAL_SECONDS      5                      Agent heartbeat interval
BROKER_HEARTBEAT_TIMEOUT_SECONDS         30                     Broker liveness timeout
BROKER_LEASE_TIMEOUT_SECONDS             60                     Task lease timeout
BROKER_SWEEP_INTERVAL_SECONDS            1                      Broker maintenance sweep interval
BROKER_MAX_REDELIVERIES                  5                      Max redeliveries before dead-letter
BROKER_MAX_FAIL_RETRIES                  0                      Max retries after explicit TASK_FAIL
BROKER_REDELIVERY_BACKOFF_BASE_SECONDS   0                      Redelivery backoff base
BROKER_REDELIVERY_BACKOFF_MAX_SECONDS    30                     Max redelivery backoff
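
One plausible reading of the backoff pair is capped exponential backoff; a hypothetical sketch (the broker's actual formula may differ):

```python
def redelivery_delay(attempt: int, base: float, cap: float) -> float:
    """Delay before the Nth redelivery: base doubled per attempt, capped.

    With base 0 (the default), redelivery is immediate.
    """
    if base <= 0:
        return 0.0
    return min(base * (2 ** (attempt - 1)), cap)
```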

Orchestrator

Variable                   Default      Description
MAX_WAVES                  3            QA/fix retry waves
QA_ITERATIONS              30           QA agent iteration budget
FIX_ITERATIONS             15           Fix agent iteration budget
FIX_RUNTIME_SECONDS        120          Fix agent runtime budget
ORCHESTRATOR_MODEL         from pack    Model for task decomposition
COLLAB_EXECUTOR            host         host or docker backend
COLLAB_DOCKER_IMAGE        epsilon      Docker image
COLLAB_DOCKER_AUTO_BUILD   0            Auto-build missing image
COLLAB_DOCKER_USER         unset        Optional container user override

Tree Orchestrator

Variable            Default   Description
MAX_WAVES           2         QA/fix waves per team
INTEGRATION_WAVES   2         Integration QA waves after merge

QA Loop

When MAX_WAVES > 0, the orchestrator runs a QA agent after each build wave. The QA agent:

  1. reads source files
  2. installs dependencies
  3. runs tests
  4. starts the server and exercises endpoints
  5. checks for common integration mistakes
  6. writes qa_report.json

If QA fails, the orchestrator assigns errors back to responsible agents, reruns fix tasks, and repeats until QA passes or the wave budget is exhausted.
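
That control flow can be sketched as follows (run_qa and run_fixes are hypothetical stand-ins for the orchestrator's internals, not its real API):

```python
def qa_fix_loop(run_qa, run_fixes, max_waves: int) -> bool:
    """Run QA after each build wave; dispatch fixes until QA passes or budget ends."""
    for _wave in range(max_waves):
        report = run_qa()              # e.g. the parsed qa_report.json
        if report["passed"]:
            return True                # QA clean: stop early
        run_fixes(report["errors"])    # reassign errors to responsible agents
    return False                       # wave budget exhausted
```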

Messaging Protocol

The protocol is split into three planes:

  • Transport plane: ZeroMQ sockets move bytes
  • Topology plane: routing policy decides delivery
  • Coordination plane: heartbeats, leases, renewals, and redelivery

Current reliability semantics:

  • at-least-once delivery for work queue tasks
  • lease-based queue ownership
  • heartbeat-driven liveness eviction
  • dead-letter protection for poison tasks
  • bounded retries for explicit task failure
  • broadcast and directed messaging
  • last-value cache replay for topic state

Detailed contract: PROTOCOL_CONTRACT.md

Docker

Build the image:

docker build -t epsilon .

Run with Docker:

docker run --env-file .env epsilon "Build a URL shortener microservice"

Or use Docker as the executor backend:

COLLAB_EXECUTOR=docker COLLAB_DOCKER_IMAGE=epsilon \
  epsilon runs create --topology dag --task "Build a URL shortener microservice"

Scale Benchmark Harness

Start a benchmark run:

python scripts/run_scale_benchmark.py \
  --benchmark wiki \
  --task-count 300 \
  --executor direct_wiki \
  --start-broker \
  --broker-router tcp://<broker-host>:5555 \
  --broker-sub tcp://<broker-host>:5556

Start worker daemons:

python runtime/worker_daemon.py \
  --worker-id worker-01 \
  --broker-router tcp://<broker-host>:5555 \
  --broker-sub tcp://<broker-host>:5556 \
  --max-concurrent-local 1

Benchmark modes: --benchmark wiki, --benchmark compiler

Executors: --executor direct_wiki, --executor agent

Project Structure

├── orchestrate.py              # pattern dispatcher
├── orchestrators/
│   ├── patterns.py             # pattern registry
│   ├── dag_orchestrator.py
│   ├── tree_orchestrator.py
│   ├── pipeline_orchestrator.py
│   ├── supervisor_orchestrator.py
│   ├── work_queue_orchestrator.py
│   ├── sharded_queue_orchestrator.py
│   ├── map_reduce_orchestrator.py
│   ├── population_search_orchestrator.py
│   ├── population_search_engine.py
│   └── queue_runtime.py
├── runtime/
│   ├── builtin_agent_runner.py  # native agent startup
│   ├── epsilon_sdk.py           # adapter SDK
│   ├── epsilon_runner.py        # process adapter bridge
│   ├── epsilon_function_runner.py # function adapter bridge
│   └── worker_daemon.py         # queue worker daemon
├── agent/
│   ├── worker.py                # agent main loop
│   ├── tool_registry.py         # tool definitions
│   ├── prompts.py               # system prompts
│   └── models/                  # LiteLLM client
├── agent_protocol/              # ZeroMQ messaging
├── epsilon/                     # CLI and Python client
├── examples/                    # SDK starter templates
├── manifests/                   # sample task manifests
├── runtime_settings.json        # model and agent config
└── runs/                        # recorded run outputs