Technical Reference¶
This page covers architecture, configuration, the adapter protocol, and operational details. If you are looking for how to get started or choose a topology, see Getting Started and Topologies.
Architecture¶
┌──────────────────────────────────────────────────────────┐
│ orchestrators/tree_orchestrator.py │
│ (team decomposition + git branching) │
│ │
│ Decompose → team branches → merge → integration QA │
└─────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ orchestrators/dag_orchestrator.py │
│ (orchestrator + wave loop) │
│ │
│ Decompose → BUILD wave → QA wave → FIX wave │
└──────────┬──────────────────────────┬───────────┘
│ │
┌──────▼──────┐ ┌───────▼───────┐
│ MessageBroker│ │ Shared │
│ ROUTER :5555 │ │ Workspace │
│ PUB :5556 │ │ runs/shared-* │
└──────┬──────┘ └───────────────┘
│
┌──────┼──────────┬───────────┐
▼ ▼ ▼ ▼
Alice Bob Charlie Dave
(DEALER) (DEALER) (DEALER) (DEALER)
+ SUB + SUB + SUB + SUB
Each agent is an independent subprocess running runtime/builtin_agent_runner.py with:
- a DEALER socket to the broker router
- a SUB socket to the broker publisher
- a shared filesystem workspace
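The socket setup can be sketched with pyzmq. This is a minimal illustration, not the runner's actual code; the identity and topic handling in `runtime/builtin_agent_runner.py` may differ:

```python
import zmq

def connect_agent(agent_id,
                  router_addr="tcp://localhost:5555",
                  sub_addr="tcp://localhost:5556",
                  topics=("general",)):
    """Open the two sockets each agent uses: a DEALER to the broker
    router for request/reply traffic, and a SUB to the broker
    publisher for broadcasts on the subscribed topics."""
    ctx = zmq.Context.instance()

    dealer = ctx.socket(zmq.DEALER)
    # The broker routes replies back by socket identity.
    dealer.setsockopt_string(zmq.IDENTITY, agent_id)
    dealer.connect(router_addr)

    sub = ctx.socket(zmq.SUB)
    sub.connect(sub_addr)
    for topic in topics:
        sub.setsockopt_string(zmq.SUBSCRIBE, topic)

    return dealer, sub
```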
Adapter Protocol¶
You can plug in your own implementation without changing orchestrators. There are two adapter styles:
Function adapter¶
The simplest path. Write a Python function with this signature:
def run(input: dict, *, session=None, **kwargs) -> dict:
Run it with:
epsilon runs create \
--topology dag \
--task "your task" \
--implementation python:path/to/file.py:run
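A minimal adapter under this signature might look like the following. The `task` and `workspace` input keys and the `status`/`summary` result keys are assumptions mirroring the process-adapter schema; check the SDK for the exact contract:

```python
def run(input: dict, *, session=None, **kwargs) -> dict:
    """Minimal function adapter: read the task, do the work in the
    workspace, and return a result dict."""
    task = input.get("task", "")
    workspace = input.get("workspace", ".")

    # ... perform the task here, writing outputs under `workspace` ...

    return {"status": "ok", "summary": f"handled task: {task!r}"}
```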
Process adapter¶
For non-Python implementations or when you need full subprocess isolation. Your process reads JSON from stdin and writes JSON to stdout:
Input (first line on stdin):
{"type": "run_request", "task": "...", "workspace": "...", "agent_id": "...", "topics": ["general"]}
Output (final line on stdout):
{"type": "run_result", "status": "ok", "summary": "..."}
Optional stdout messages during execution:
{"type": "log", "message": "..."}
{"type": "send_message", "content": "...", "topic": "general"}
{"type": "check_messages"}
Message polling reply (from the runner, on stdin):
{"type": "message_batch", "messages": [...]}
Run it with:
epsilon runs create \
--topology dag \
--task "your task" \
--implementation "python3 path/to/adapter.py"
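Under the stdin/stdout contract above, a skeleton process adapter might look like this (a sketch; the work itself is elided and the log/message traffic is illustrative):

```python
import json
import sys

def main() -> None:
    # First line on stdin is the run request.
    request = json.loads(sys.stdin.readline())
    task = request.get("task", "")

    # Optional progress messages: one JSON object per stdout line.
    print(json.dumps({"type": "log", "message": f"starting: {task}"}), flush=True)

    # Ask the runner for queued messages; the reply arrives on stdin.
    print(json.dumps({"type": "check_messages"}), flush=True)
    batch = json.loads(sys.stdin.readline())
    assert batch.get("type") == "message_batch"

    # ... do the actual work in request["workspace"] here ...

    # Final line on stdout is the run result.
    print(json.dumps({"type": "run_result", "status": "ok",
                      "summary": f"completed: {task}"}), flush=True)

if __name__ == "__main__":
    main()
```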
Agent Tools¶
The built-in agent has access to these tools:
| Tool | Description |
|---|---|
| `run_bash` | Execute shell commands |
| `read_file` | Read file contents |
| `write_file` | Create or overwrite files |
| `edit_file` | Surgical string replacement |
| `sql_query` | Parameterized SQL query execution |
| `web_search` | Search the web |
| `fetch_url` | Fetch URL contents |
| `call_llm` | Call an allowlisted delegate LLM |
| `plan` | Enter planning mode |
| `submit_plan` | Submit subtasks after planning |
| `mark_complete` | Advance to the next subtask |
| `done` | Signal task completion |
| `send_message` | Broadcast or direct a message to another agent |
| `check_messages` | Receive messages from other agents |
| `submit_task` | Add a task to the shared queue |
| `request_task` | Pull the next task from the queue |
Configuration¶
runtime_settings.json¶
Controls the built-in agent's model, iteration limits, and delegate LLM settings:
{
"defaultSettingsPack": "default",
"settingsPacks": {
"default": {
"description": "Default configuration",
"model": "openai/gpt-5.2",
"max_iterations": 100,
"max_runtime_seconds": 600,
"max_tokens": 4096
},
"anthropic": {
"description": "Anthropic model configuration",
"model": "anthropic/claude-opus-4-6",
"max_iterations": 100,
"max_runtime_seconds": 600,
"max_tokens": 4096
}
},
"delegate_llm": {
"enabled": true,
"default_model": "openai/gpt-5.2",
"allowed_models": [
"openai/gpt-5.2",
"anthropic/claude-opus-4-6"
],
"limits": {
"max_tokens_default": 512,
"max_tokens_min": 32,
"max_tokens_max": 1024,
"temperature_default": 0.0,
"temperature_min": 0.0,
"temperature_max": 0.7,
"timeout_seconds_default": 30,
"timeout_seconds_min": 5,
"timeout_seconds_max": 60,
"prompt_max_chars": 12000,
"response_max_chars": 12000
}
}
}
The call_llm tool only allows models listed in delegate_llm.allowed_models. Token counts, temperature, and timeouts are bounded by delegate_llm.limits.
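The bounding behavior amounts to clamping each requested parameter into its configured range, with the `*_default` value used when a parameter is omitted. A sketch (the function name is illustrative, not the actual implementation):

```python
def clamp_delegate_params(limits: dict, *, max_tokens=None,
                          temperature=None, timeout_seconds=None) -> dict:
    """Clamp call_llm parameters into the ranges from delegate_llm.limits,
    falling back to the configured defaults when a value is omitted."""
    def clamp(value, default, lo, hi):
        if value is None:
            value = default
        return max(lo, min(hi, value))

    return {
        "max_tokens": clamp(max_tokens, limits["max_tokens_default"],
                            limits["max_tokens_min"], limits["max_tokens_max"]),
        "temperature": clamp(temperature, limits["temperature_default"],
                             limits["temperature_min"], limits["temperature_max"]),
        "timeout_seconds": clamp(timeout_seconds, limits["timeout_seconds_default"],
                                 limits["timeout_seconds_min"],
                                 limits["timeout_seconds_max"]),
    }
```

For example, with the limits above, a request for 5000 tokens is clamped down to 1024 and an omitted temperature resolves to 0.0.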
Environment Variables¶
Core¶
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | — | OpenAI API key |
| `ANTHROPIC_API_KEY` | — | Anthropic API key |
| `SETTINGS_PACK` | `default` | Config pack from runtime_settings.json |
| `AGENT_MODEL` | from pack | LiteLLM model override |
| `ORCHESTRATOR_MODEL` | from pack | Model for decomposition and review calls |
| `LLM_TIMEOUT_SECONDS` | `120` | Timeout per model call |
| `LLM_MAX_RETRIES` | `2` | Retries per model call |
| `LLM_API_BASE` | unset | Optional LiteLLM API base override |
| `SQL_DATABASE_URL` | unset | Default SQLAlchemy DB URL |
| `MAX_ITERATIONS` | `100` | Max tool calls per agent |
| `MAX_RUNTIME_SECONDS` | `600` | Hard timeout per agent |
| `SHARED_WORKSPACE` | auto | Shared directory path |
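Model resolution follows the precedence implied by the table: an explicit `AGENT_MODEL` wins, otherwise the model comes from the pack selected by `SETTINGS_PACK`. A sketch of that resolution (an assumption about the lookup order, not the actual implementation):

```python
import json
import os

def resolve_model(settings_path="runtime_settings.json"):
    """Resolve the agent model: the AGENT_MODEL env var wins; otherwise
    use the model from the pack named by SETTINGS_PACK, falling back to
    defaultSettingsPack from runtime_settings.json."""
    override = os.environ.get("AGENT_MODEL")
    if override:
        return override
    with open(settings_path) as f:
        settings = json.load(f)
    pack = os.environ.get("SETTINGS_PACK") or settings["defaultSettingsPack"]
    return settings["settingsPacks"][pack]["model"]
```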
Multi-Agent Protocol¶
| Variable | Default | Description |
|---|---|---|
| `PROTOCOL_ENABLED` | `false` | Enable ZeroMQ messaging |
| `AGENT_ID` | auto | Unique agent identifier |
| `BROKER_MODE` | — | `host` or `connect` |
| `BROKER_ROUTER` | `tcp://localhost:5555` | Broker router address |
| `BROKER_SUB` | `tcp://localhost:5556` | Broker pub address |
| `AGENT_TOPICS` | `general` | Subscription topics |
| `WORK_QUEUE_ENABLED` | `false` | Enable work queue tools |
| `PROTOCOL_HEARTBEAT_INTERVAL_SECONDS` | `5` | Agent heartbeat interval |
| `BROKER_HEARTBEAT_TIMEOUT_SECONDS` | `30` | Broker liveness timeout |
| `BROKER_LEASE_TIMEOUT_SECONDS` | `60` | Task lease timeout |
| `BROKER_SWEEP_INTERVAL_SECONDS` | `1` | Broker maintenance sweep interval |
| `BROKER_MAX_REDELIVERIES` | `5` | Max redeliveries before dead-letter |
| `BROKER_MAX_FAIL_RETRIES` | `0` | Max retries after explicit TASK_FAIL |
| `BROKER_REDELIVERY_BACKOFF_BASE_SECONDS` | `0` | Redelivery backoff base |
| `BROKER_REDELIVERY_BACKOFF_MAX_SECONDS` | `30` | Max redelivery backoff |
Orchestrator¶
| Variable | Default | Description |
|---|---|---|
| `MAX_WAVES` | `3` | QA/fix retry waves |
| `QA_ITERATIONS` | `30` | QA agent iteration budget |
| `FIX_ITERATIONS` | `15` | Fix agent iteration budget |
| `FIX_RUNTIME_SECONDS` | `120` | Fix agent runtime budget |
| `ORCHESTRATOR_MODEL` | from pack | Model for task decomposition |
| `COLLAB_EXECUTOR` | `host` | `host` or `docker` backend |
| `COLLAB_DOCKER_IMAGE` | `epsilon` | Docker image |
| `COLLAB_DOCKER_AUTO_BUILD` | `0` | Auto-build missing image |
| `COLLAB_DOCKER_USER` | unset | Optional container user override |
Tree Orchestrator¶
| Variable | Default | Description |
|---|---|---|
| `MAX_WAVES` | `2` | QA/fix waves per team |
| `INTEGRATION_WAVES` | `2` | Integration QA waves after merge |
QA Loop¶
When MAX_WAVES > 0, the orchestrator runs a QA agent after each build wave. The QA agent:
- reads source files
- installs dependencies
- runs tests
- starts the server and exercises endpoints
- checks for common integration mistakes
- writes qa_report.json
If QA fails, the orchestrator assigns errors back to responsible agents, reruns fix tasks, and repeats until QA passes or the wave budget is exhausted.
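The wave loop described above is roughly the following. This is a sketch: `run_qa` and `run_fixes` stand in for the orchestrator's actual QA and fix steps, and the report shape is assumed:

```python
def qa_loop(max_waves, run_qa, run_fixes):
    """Run QA after the build; on failure, hand the errors back for
    fixes and retry until QA passes or the wave budget is exhausted."""
    for wave in range(max_waves):
        report = run_qa()            # QA agent writes qa_report.json
        if report["passed"]:
            return True
        run_fixes(report["errors"])  # reassign errors to responsible agents
    return False
```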
Messaging Protocol¶
The protocol is split into three planes:
- Transport plane: ZeroMQ sockets move bytes
- Topology plane: routing policy decides delivery
- Coordination plane: heartbeats, leases, renewals, and redelivery
Current reliability semantics:
- at-least-once delivery for work queue tasks
- lease-based queue ownership
- heartbeat-driven liveness eviction
- dead-letter protection for poison tasks
- bounded retries for explicit task failure
- broadcast and directed messaging
- last-value cache replay for topic state
Detailed contract: PROTOCOL_CONTRACT.md
Docker¶
Build the image:
docker build -t epsilon .
Run with Docker:
docker run --env-file .env epsilon "Build a URL shortener microservice"
Or use Docker as the executor backend:
COLLAB_EXECUTOR=docker COLLAB_DOCKER_IMAGE=epsilon \
epsilon runs create --topology dag --task "Build a URL shortener microservice"
Scale Benchmark Harness¶
Start a benchmark run:
python scripts/run_scale_benchmark.py \
--benchmark wiki \
--task-count 300 \
--executor direct_wiki \
--start-broker \
--broker-router tcp://<broker-host>:5555 \
--broker-sub tcp://<broker-host>:5556
Start worker daemons:
python runtime/worker_daemon.py \
--worker-id worker-01 \
--broker-router tcp://<broker-host>:5555 \
--broker-sub tcp://<broker-host>:5556 \
--max-concurrent-local 1
Benchmark modes: --benchmark wiki, --benchmark compiler
Executors: --executor direct_wiki, --executor agent
Project Structure¶
├── orchestrate.py # pattern dispatcher
├── orchestrators/
│ ├── patterns.py # pattern registry
│ ├── dag_orchestrator.py
│ ├── tree_orchestrator.py
│ ├── pipeline_orchestrator.py
│ ├── supervisor_orchestrator.py
│ ├── work_queue_orchestrator.py
│ ├── sharded_queue_orchestrator.py
│ ├── map_reduce_orchestrator.py
│ ├── population_search_orchestrator.py
│ ├── population_search_engine.py
│ └── queue_runtime.py
├── runtime/
│ ├── builtin_agent_runner.py # native agent startup
│ ├── epsilon_sdk.py # adapter SDK
│ ├── epsilon_runner.py # process adapter bridge
│ ├── epsilon_function_runner.py # function adapter bridge
│ └── worker_daemon.py # queue worker daemon
├── agent/
│ ├── worker.py # agent main loop
│ ├── tool_registry.py # tool definitions
│ ├── prompts.py # system prompts
│ └── models/ # LiteLLM client
├── agent_protocol/ # ZeroMQ messaging
├── epsilon/ # CLI and Python client
├── examples/ # SDK starter templates
├── manifests/ # sample task manifests
├── runtime_settings.json # model and agent config
└── runs/ # recorded run outputs