Scaling & Multi-Region
How Olympus scales from a single instance to a globally distributed deployment.
Single-writer, distributed reads
A matching engine is a single-writer system by design. There's one sequencer, one order book, one ledger — this is how every major exchange operates, from Nasdaq to CME. You can't shard a central limit order book across regions without breaking price-time priority or accepting split-brain state.
Olympus doesn't fight this constraint. The matching engine runs in one location. Everything around it — market data, read APIs, WebSocket streaming — can be distributed globally.
Architecture
```
          Region A (primary)                 Region B / C (read replicas)
          +-----------------------+          +---------------------------+
          | API (reads + writes)  |          | API (reads only)          |
orders >  | Sequencer             | snapshot | Snapshot consumer         |
          | CoreEngine            | stream > | WebSocket fan-out         |
          | Event Processor       |          | depth / balance / kline   |
          | RocksDB               |          +---------------------------+
          | reth (EVM)            |
          | Bridge / Settlement   |          +---------------------------+
          |                       | tick log | Standby (DR)              |
          |                       | stream > | Continuous replay         |
          +-----------------------+          | Warm failover             |
                                             +---------------------------+
```
Primary instance
Runs the full binary as it exists today: API server, sequencer, core engine, event processor, RocksDB persistence, embedded reth node, bridge and settlement. All writes (order placement, cancellation, admin operations) are handled here.
Two additions for the distributed model:
- Snapshot publishing — after each tick, the engine already publishes an `EngineSnapshot` via `ArcSwap` for local API reads. The same snapshot gets serialized and published to a message bus (NATS, Redis pub/sub, or Kafka) for remote consumers.
- Tick log streaming — the append log write that already happens in the event processor gets mirrored to a durable stream for the standby instance.
Read replicas
Lightweight instances deployed in each target region. They consume the snapshot stream and serve read-only API endpoints:
- `GET /api/v1/depth` — order book depth
- `GET /api/v1/exchangeInfo` — instrument configs
- `GET /api/v1/ticker/bookTicker` — best bid/ask
- `GET /api/v1/balance` — account balances
- `GET /ws` — WebSocket streaming (order book, trades, klines, allMids)
Read replicas don't run a sequencer, engine, or reth node. They receive snapshots, store them in local memory via ArcSwap (the same pattern the primary uses), and serve requests from that. The data is one tick stale plus network latency to the region — the same staleness model the primary's API layer already operates under.
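The replica-side read path can be sketched with std types. This uses `RwLock<Arc<T>>` as a dependency-free stand-in for the `ArcSwap` crate, and the `Snapshot` fields are illustrative, not the real `EngineSnapshot`:

```rust
use std::sync::{Arc, RwLock};

// Illustrative stand-in for EngineSnapshot; fields are invented for the sketch.
#[derive(Clone, Debug, PartialEq)]
struct Snapshot {
    tick: u64,
    best_bid: u64,
}

// Replica-side holder mimicking the ArcSwap pattern with std types:
// readers clone an Arc cheaply; the bus consumer swaps in new snapshots.
struct SnapshotCell {
    inner: RwLock<Arc<Snapshot>>,
}

impl SnapshotCell {
    fn new(initial: Snapshot) -> Self {
        Self { inner: RwLock::new(Arc::new(initial)) }
    }

    // Called by the message-bus consumer when a new snapshot arrives.
    fn store(&self, snap: Snapshot) {
        *self.inner.write().unwrap() = Arc::new(snap);
    }

    // Called by API handlers on every read request.
    fn load(&self) -> Arc<Snapshot> {
        self.inner.read().unwrap().clone()
    }
}

fn main() {
    let cell = SnapshotCell::new(Snapshot { tick: 1, best_bid: 100 });
    let in_flight = cell.load();
    cell.store(Snapshot { tick: 2, best_bid: 101 });
    // In-flight readers keep their old Arc; new reads see the new tick.
    println!("old tick = {}, new tick = {}", in_flight.tick, cell.load().tick);
}
```

In-flight requests keep serving from the snapshot they loaded, while new requests see the latest one, which is exactly the staleness model described above.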
Write requests (POST /api/v1/order, DELETE /api/v1/order, admin endpoints) are either rejected at the replica or proxied to the primary. Proxying is simpler for clients (they hit the same endpoint regardless of region) but adds a round trip.
Warm standby
A second full instance in a different region or availability zone, continuously replaying the tick log stream. It doesn't serve traffic — it just maintains near-current engine state.
On primary failure:
- Standby finishes replaying any buffered ticks
- Standby starts accepting writes (becomes the new primary)
- DNS or load balancer cuts over
- Read replicas switch to the new primary's snapshot stream
Recovery time is bounded by how far behind the standby's replay is, which under normal conditions is sub-second. The deterministic replay guarantee means the standby produces identical state from the same tick log — no reconciliation needed.
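The no-reconciliation claim rests on determinism: replaying the same tick log always yields the same state. A toy illustration of that property, with `Tick` and `EngineState` invented for the sketch:

```rust
// Minimal deterministic engine: same tick log in, same state out.
#[derive(Clone, Debug)]
enum Tick {
    Deposit(u64),
    Trade { qty: u64, price: u64 },
}

#[derive(Default, Debug, PartialEq)]
struct EngineState {
    balance: u64,
    last_price: u64,
    seq: u64,
}

impl EngineState {
    fn apply(&mut self, t: &Tick) {
        self.seq += 1;
        match t {
            Tick::Deposit(amt) => self.balance += amt,
            Tick::Trade { qty, price } => {
                self.balance -= qty * price;
                self.last_price = *price;
            }
        }
    }
}

// Both primary and standby run this over the same log.
fn replay(log: &[Tick]) -> EngineState {
    let mut s = EngineState::default();
    for t in log {
        s.apply(t);
    }
    s
}

fn main() {
    let log = vec![Tick::Deposit(1_000), Tick::Trade { qty: 2, price: 10 }];
    // Replaying the identical log on two instances produces identical state.
    assert_eq!(replay(&log), replay(&log));
    println!("states identical after replay");
}
```

As long as `apply` is a pure function of the prior state and the tick, the standby's state after catch-up equals the primary's, so promotion needs no diffing step.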
The embedded reth node would need a separate replication strategy (database snapshot + block replay) since it has its own state. For the initial version, the standby can re-deploy the settlement contract and start fresh — committed batch history is immutable on the primary's chain, and the standby's chain is a new instance.
Implementation phases
Phase 1: WebSocket broadcast optimisation
Status: Done — `MarketDataChannels` with `tokio::broadcast` is already in place.
The previous implementation serialized the order book independently for each WebSocket connection. The current architecture uses broadcast channels: a `MarketDataPublisher` serializes each message once and sends it to all subscribers via `tokio::broadcast`. This is the prerequisite for handling high connection counts.
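The serialize-once idea can be shown with std channels standing in for `tokio::broadcast` (the names and message payload here are illustrative):

```rust
use std::sync::mpsc;
use std::sync::Arc;

// Serialize once, then hand the same shared buffer to every subscriber,
// which is the core idea behind the broadcast-based publisher.
fn publish(subs: &[mpsc::Sender<Arc<[u8]>>], msg: &str) {
    let bytes: Arc<[u8]> = Arc::from(msg.as_bytes()); // one serialization
    for s in subs {
        let _ = s.send(Arc::clone(&bytes)); // cheap pointer clone per subscriber
    }
}

fn main() {
    let (tx1, rx1) = mpsc::channel();
    let (tx2, rx2) = mpsc::channel();
    publish(&[tx1, tx2], "{\"depth\":[[100,5]]}");
    let a = rx1.recv().unwrap();
    let b = rx2.recv().unwrap();
    // Both subscribers hold the same allocation, not a per-connection copy.
    assert!(Arc::ptr_eq(&a, &b));
    println!("both subscribers share one serialized buffer");
}
```

With N connections, the cost of a market data update drops from N serializations to one serialization plus N pointer clones.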
Phase 2: Snapshot publishing
Add a snapshot serialization + publish step to the engine thread or event processor:
```rust
// After ArcSwap::store(), serialize and publish
let snap_bytes = bincode::serialize(&snapshot)?;
message_bus.publish("olympus.snapshots", snap_bytes).await;
```

Key decisions:
- Format: `bincode` for speed, or MessagePack for cross-language consumers
- Transport: NATS for low latency, Kafka for durability and replay
- Frequency: every tick (1ms default) or throttled (e.g., every 50ms) depending on bandwidth
The snapshot struct (`EngineSnapshot`) already derives `Serialize`/`Deserialize`, so the serialization path exists.
Continuous mode interaction: In continuous matching mode, snapshot publishing is debounced (`OLYMPUS_SNAPSHOT_INTERVAL_US`, default 500µs). Read replicas consuming the snapshot stream will see updates at this interval rather than per-order. The debounce interval sets the floor for read replica staleness in continuous mode.
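A minimal sketch of the debounce rule, assuming a simple last-publish timestamp check (the `Debouncer` type is hypothetical, not the engine's actual implementation):

```rust
use std::time::{Duration, Instant};

// Debounced publisher: drops snapshots that arrive within `interval`
// of the last publish, mirroring the OLYMPUS_SNAPSHOT_INTERVAL_US idea.
struct Debouncer {
    interval: Duration,
    last: Option<Instant>,
}

impl Debouncer {
    fn new(interval: Duration) -> Self {
        Self { interval, last: None }
    }

    // Returns true if a snapshot arriving at `now` should be published.
    fn should_publish(&mut self, now: Instant) -> bool {
        match self.last {
            Some(prev) if now.duration_since(prev) < self.interval => false,
            _ => {
                self.last = Some(now);
                true
            }
        }
    }
}

fn main() {
    let mut d = Debouncer::new(Duration::from_micros(500));
    let t0 = Instant::now();
    assert!(d.should_publish(t0));                                  // first snapshot goes out
    assert!(!d.should_publish(t0 + Duration::from_micros(100)));    // inside the window, dropped
    assert!(d.should_publish(t0 + Duration::from_micros(600)));     // window elapsed, published
    println!("debounce ok");
}
```

Under a burst of per-order updates, subscribers see at most one snapshot per interval, which is exactly why the interval bounds replica staleness.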
Phase 3: Read replica binary
A second binary (or a `--read-replica` mode of the existing binary) that:
- Connects to the message bus and subscribes to the snapshot stream
- Deserializes snapshots into `ArcSwap<EngineSnapshot>`
- Runs the same Axum API server with read-only routes
- Runs the same WebSocket handler with the same `MarketDataChannels` broadcast
Most of the code is reusable — the API handlers already read from ArcSwap<EngineSnapshot> with no awareness of how the snapshot gets there. The only change is the snapshot source: instead of a local engine thread calling ArcSwap::store(), a consumer task receives snapshots from the message bus.
Phase 4: Tick log streaming + warm standby
Mirror the RocksDB append log writes to a durable stream:
```rust
// In the event processor, after append_log.append():
tick_stream.publish("olympus.ticks", &serialized_tick).await;
```

The standby instance consumes this stream and replays ticks through its own CoreEngine. On failover, it has near-current state and can be promoted.
Phase 5: Write proxying
Read replicas accept write requests and forward them to the primary:
```
Client (Singapore) > Read Replica (Singapore) > Primary (London) > Sequencer
                   <      Response            <      Response
```

This simplifies client configuration (single endpoint per region) at the cost of one additional network hop on writes. The alternative — clients connect directly to the primary for writes — is lower latency but requires clients to know about two endpoints.
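The replica's routing decision reduces to a small predicate. A sketch under the assumption that only non-admin GETs are served locally (the rule and names here are illustrative, not the actual route table):

```rust
// Illustrative routing rule for a read replica: reads are served from
// the local snapshot, writes are forwarded to the single-writer primary.
#[derive(Debug, PartialEq)]
enum Route {
    ServeLocally,
    ProxyToPrimary,
}

fn route(method: &str, path: &str) -> Route {
    // Order mutations and admin endpoints must reach the sequencer.
    if method == "GET" && !path.starts_with("/admin") {
        Route::ServeLocally
    } else {
        Route::ProxyToPrimary
    }
}

fn main() {
    assert_eq!(route("GET", "/api/v1/depth"), Route::ServeLocally);
    assert_eq!(route("POST", "/api/v1/order"), Route::ProxyToPrimary);
    assert_eq!(route("DELETE", "/api/v1/order"), Route::ProxyToPrimary);
    println!("routing ok");
}
```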
Cloud provider requirements
| Requirement | Purpose |
|---|---|
| Persistent block storage (EBS, Persistent Disk) | RocksDB on primary and standby |
| Message bus (NATS, Kafka, Redis) | Snapshot + tick log streaming |
| Multi-region compute | Read replicas and standby |
| DNS failover or global load balancer | Traffic routing, DR cutover |
| Private network peering between regions | Low-latency snapshot delivery |
Railway's single-volume-per-service constraint and lack of cross-region volume sharing make it unsuitable for this model. AWS, GCP, or Azure provide the primitives needed: regional compute, durable messaging, and network peering.
What doesn't change
The core architecture stays the same. The matching engine is still a single-writer on a dedicated thread. The sequencer still assigns monotonic sequences. The event processor still persists to RocksDB and processes bridge instructions. The settlement contract still commits batch roots on-chain.
Scaling is additive — publishing snapshots, adding read replicas, streaming the tick log — not a rewrite. The ArcSwap pattern that makes local API reads lock-free is the same pattern that makes remote API reads work: swap in a new snapshot, serve from it.