
Continuous Mode Tuning

Environment variables, deployment profiles, and optimization details for continuous matching mode.

Overview

Continuous matching mode (OLYMPUS_CONTINUOUS_MATCHING=true) matches orders individually on arrival instead of batching into ticks. The hot path is optimized with three techniques:

  1. Debounced snapshot publishing — EngineSnapshot::from_engine() runs at a configurable interval instead of per-order
  2. Batched broadcast sends — market data and persistence broadcasts are accumulated and flushed with the snapshot
  3. Three-phase receive loop — drain via try_recv, spin briefly, then fall back to a blocking recv_timeout (the timer check shown as Phase 2 below runs between draining and spinning; it is not itself a receive phase)
  +---------------------------------------------------------------+
  |                      Engine Thread Loop                       |
  |                                                               |
  |  Phase 1: Drain --> try_recv() in tight loop (no syscalls)    |
  |       |                                                       |
  |       v                                                       |
  |  Phase 2: Timers --> debounced snapshot + commit check        |
  |       |                                                       |
  |       v                                                       |
  |  Phase 3: Spin --> spin_loop() x OLYMPUS_SPIN_ITERS           |
  |       |                                                       |
  |       v                                                       |
  |  Phase 4: Block --> recv_timeout(remaining commit interval)   |
  +---------------------------------------------------------------+

Architecture: Decoupled Matching and Market Data

Olympus follows the same architecture as production exchanges like Binance: the matching engine and market data feed operate on independent cadences.

┌────────────────────────────────────────────────────────┐
│  Matching Engine (event-driven)                        │
│  Each order matched immediately on arrival (~µs)       │
│  Results accumulated in memory                         │
├────────────────────────────────────────────────────────┤
│  Snapshot Publisher (OLYMPUS_SNAPSHOT_INTERVAL_US)     │
│  Publishes ArcSwap snapshot for REST API reads         │
├────────────────────────────────────────────────────────┤
│  Market Data Feed (OLYMPUS_MARKET_DATA_INTERVAL_MS)    │
│  Batches depth diffs + mids → WS broadcast             │
│  Trades + account updates sent per-tick (no batching)  │
├────────────────────────────────────────────────────────┤
│  Commitment (OLYMPUS_TICK_INTERVAL_MS)                 │
│  Batches matched results → hash chain + persistence    │
└────────────────────────────────────────────────────────┘

Matching is event-driven in continuous mode — each order is processed immediately via try_recv(). There is no "tick interval" for matching; the engine processes orders as fast as they arrive.

Market data is published on a fixed timer, independent of matching speed. This prevents WS channel overflow when the engine processes thousands of orders per second. Binance uses 100ms/1000ms intervals; Olympus defaults to 100ms.

Persistence (commitment batching) controls how often matched results are hashed and written to RocksDB. Wider batches reduce I/O overhead but increase the crash window (data loss on unclean shutdown).
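
As a rough sketch, the commitment check might look like the following; hash_chain_extend and persist_batch are placeholders rather than Olympus APIs, and only the tick-interval behavior comes from this page:

```rust
use std::time::{Duration, Instant};

struct MatchedResult; // stand-in for one matched order's outcome
struct CommitBuffer { results: Vec<MatchedResult> }

// Placeholders for the real hash-chain and RocksDB persistence calls.
fn hash_chain_extend(_batch: &[MatchedResult]) -> [u8; 32] { [0; 32] }
fn persist_batch(_batch: &[MatchedResult], _digest: [u8; 32]) {}

/// Runs in the engine loop's timer phase. A wider `tick` means fewer,
/// larger writes, and a larger window of results lost on a crash.
fn maybe_commit(buf: &mut CommitBuffer, last_commit: &mut Instant, tick: Duration) {
    if last_commit.elapsed() >= tick && !buf.results.is_empty() {
        let batch = std::mem::take(&mut buf.results);
        let digest = hash_chain_extend(&batch); // extend the hash chain
        persist_batch(&batch, digest);          // one I/O burst per tick
        *last_commit = Instant::now();
    }
}
```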

Environment Variable Reference

| Variable | Default | Description | Tradeoff |
|----------|---------|-------------|----------|
| OLYMPUS_CONTINUOUS_MATCHING | false | Enable continuous (event-driven) matching | Crash window vs latency |
| OLYMPUS_SNAPSHOT_INTERVAL_US | 500 | Microseconds between snapshot publishes (REST API freshness) | API staleness vs CPU overhead |
| OLYMPUS_MARKET_DATA_INTERVAL_MS | 100 | Milliseconds between WS market data broadcasts (depth diffs, mids) | UI freshness vs WS throughput |
| OLYMPUS_WS_CHANNEL_CAPACITY | 512 | Broadcast channel buffer size for WS messages | Memory vs lag tolerance |
| OLYMPUS_TICK_INTERVAL_MS | 1 | Commitment batch interval — persistence + hash chain (ms) | Crash window vs I/O overhead |
| OLYMPUS_SPIN_ITERS | 256 | Spin iterations before blocking on the channel | CPU usage vs wake-up latency |
| OLYMPUS_EVM_BLOCK_TIME_MS | 1000 | EVM block production interval | Block explorer update rate |
| OLYMPUS_ENGINE_CORE | (unset) | Pin engine thread to a CPU core | Latency consistency |
| OLYMPUS_HASHER_CORE | (unset) | Pin hasher thread to a CPU core | Hash throughput |
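
As an illustration of how these knobs might be wired up at startup, here is a minimal sketch; the EngineConfig struct and env_parse helper are hypothetical, but the variable names and defaults match the table above:

```rust
use std::env;
use std::time::Duration;

/// Hypothetical config struct; only the variable names and defaults
/// come from the reference table above.
struct EngineConfig {
    continuous_matching: bool,
    snapshot_interval: Duration,
    market_data_interval: Duration,
    tick_interval: Duration,
    spin_iters: u32,
    engine_core: Option<usize>, // unset means "do not pin"
}

// Parse an env var, falling back to the documented default.
fn env_parse<T: std::str::FromStr>(key: &str, default: T) -> T {
    env::var(key).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

impl EngineConfig {
    fn from_env() -> Self {
        Self {
            continuous_matching: env_parse("OLYMPUS_CONTINUOUS_MATCHING", false),
            snapshot_interval: Duration::from_micros(env_parse("OLYMPUS_SNAPSHOT_INTERVAL_US", 500)),
            market_data_interval: Duration::from_millis(env_parse("OLYMPUS_MARKET_DATA_INTERVAL_MS", 100)),
            tick_interval: Duration::from_millis(env_parse("OLYMPUS_TICK_INTERVAL_MS", 1)),
            spin_iters: env_parse("OLYMPUS_SPIN_ITERS", 256),
            engine_core: env::var("OLYMPUS_ENGINE_CORE").ok().and_then(|v| v.parse().ok()),
        }
    }
}
```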

What does each interval control?

  • SNAPSHOT_INTERVAL_US — how fresh the REST API is (depth, bookTicker, balance). Lower = more current reads, higher CPU.
  • MARKET_DATA_INTERVAL_MS — how often WS clients receive order book updates. 100ms = 10 updates/sec (industry standard). Lower = faster UI, higher WS bandwidth.
  • TICK_INTERVAL_MS — how often matched results are persisted. Does NOT affect matching speed. Higher = less I/O, larger crash window.

Deployment Profiles

Cloud / Railway

Shared CPU cores, network-attached storage. Save CPU, accept higher latency.

OLYMPUS_CONTINUOUS_MATCHING=true
OLYMPUS_SNAPSHOT_INTERVAL_US=1000
OLYMPUS_MARKET_DATA_INTERVAL_MS=100
OLYMPUS_SPIN_ITERS=0
OLYMPUS_TICK_INTERVAL_MS=100
OLYMPUS_EVM_BLOCK_TIME_MS=1000
# No core pinning — shared infrastructure

Docker / dev

Local development. Low overhead, good enough latency.

OLYMPUS_CONTINUOUS_MATCHING=true
OLYMPUS_SNAPSHOT_INTERVAL_US=1000
OLYMPUS_MARKET_DATA_INTERVAL_MS=100
OLYMPUS_SPIN_ITERS=0
OLYMPUS_TICK_INTERVAL_MS=1

Bare metal / production

Dedicated cores, pinned threads. Balance freshness and overhead.

OLYMPUS_CONTINUOUS_MATCHING=true
OLYMPUS_SNAPSHOT_INTERVAL_US=500
OLYMPUS_MARKET_DATA_INTERVAL_MS=100
OLYMPUS_SPIN_ITERS=256
OLYMPUS_TICK_INTERVAL_MS=1
OLYMPUS_ENGINE_CORE=2
OLYMPUS_HASHER_CORE=3

HFT / ultra-low-latency

Aggressive spinning, tight snapshots, wider commit batches.

OLYMPUS_CONTINUOUS_MATCHING=true
OLYMPUS_SNAPSHOT_INTERVAL_US=250
OLYMPUS_MARKET_DATA_INTERVAL_MS=50
OLYMPUS_SPIN_ITERS=1024
OLYMPUS_TICK_INTERVAL_MS=5
OLYMPUS_ENGINE_CORE=2
OLYMPUS_HASHER_CORE=3
# Also: isolcpus=2,3 in kernel boot params

HFT profile crash window

Setting OLYMPUS_TICK_INTERVAL_MS=5 means up to 5ms of matched orders can be lost on crash. This is an acceptable tradeoff for HFT workloads where latency matters more than durability, but should be documented in operational runbooks.
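
This page does not say which mechanism backs OLYMPUS_ENGINE_CORE and OLYMPUS_HASHER_CORE; one plausible sketch uses the core_affinity crate, called once when each pinned thread starts:

```rust
// Sketch only: assumes the core_affinity crate; the docs do not name
// the mechanism behind OLYMPUS_ENGINE_CORE / OLYMPUS_HASHER_CORE.
fn pin_current_thread(core: usize) -> bool {
    let Some(ids) = core_affinity::get_core_ids() else {
        return false; // affinity not supported on this platform
    };
    match ids.into_iter().find(|id| id.id == core) {
        // set_for_current returns false if the OS rejects the mask.
        Some(id) => core_affinity::set_for_current(id),
        None => false, // core index out of range
    }
}
```

Pinning alone only keeps the engine thread on its core; the isolcpus boot parameter above is what keeps the scheduler from placing other tasks there.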

Optimization Details

Snapshot debounce

Before: EngineSnapshot::from_engine() after every match_order(). This method iterates all instruments, calls bid_depth(1000) and ask_depth(1000) on each book (walking up to 1000 BTreeMap levels, summing VecDeque remaining quantities), and clones the full balance map. At 1000 orders/sec with 4 instruments and deep books, snapshot overhead was 50-200ms/sec.

After: Snapshot publishes at most once per OLYMPUS_SNAPSHOT_INTERVAL_US. At 1000 orders/sec with a 500µs interval, overhead drops to ~2ms/sec. API reads may be up to OLYMPUS_SNAPSHOT_INTERVAL_US stale — at 500µs, this is well within acceptable market data latency for any UI refreshing at ≥10ms intervals.
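
A minimal sketch of the debounce, assuming an Instant-based timer and an arc_swap slot for lock-free reads; the SnapshotDebouncer type and maybe_publish name are illustrative, while ArcSwap and EngineSnapshot::from_engine come from this page:

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};
use arc_swap::ArcSwap;

struct Engine;         // stand-in for the real engine state
struct EngineSnapshot; // stand-in; the real one holds depth + balances
impl EngineSnapshot {
    fn from_engine(_e: &Engine) -> Self { EngineSnapshot }
}

struct SnapshotDebouncer {
    interval: Duration,   // OLYMPUS_SNAPSHOT_INTERVAL_US
    last_publish: Instant,
    dirty: bool,          // set true by each match_order()
}

impl SnapshotDebouncer {
    /// Called from the timer phase of the engine loop: the expensive
    /// from_engine() walk runs at most once per interval, not per order.
    fn maybe_publish(&mut self, engine: &Engine, slot: &ArcSwap<EngineSnapshot>) {
        if self.dirty && self.last_publish.elapsed() >= self.interval {
            slot.store(Arc::new(EngineSnapshot::from_engine(engine)));
            self.last_publish = Instant::now();
            self.dirty = false;
        }
    }
}
```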

Metric: snapshot_debounce_batch_size histogram shows how many orders accumulate between snapshot publishes. Higher values indicate the debounce is absorbing more per-order overhead.

Batched broadcasts

Before: tokio::broadcast::send() after every order — each send allocates an Arc, performs atomic reference counting for all subscribers, and may contend with the tokio runtime.

After: Trades and order events are accumulated in Vec buffers and flushed when the snapshot debounce fires. The Vec growth is amortized across orders (no per-order heap allocation once capacity is reached). Broadcast subscribers receive trades in micro-batches (up to OLYMPUS_SNAPSHOT_INTERVAL_US worth) instead of per-order.
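
Sketched below under the assumption of a tokio broadcast channel carrying micro-batches; MarketDataBuffer and its method names are illustrative:

```rust
use tokio::sync::broadcast;

#[derive(Clone, Debug)]
struct Trade; // stand-in; the real trade carries price, qty, ids

struct MarketDataBuffer {
    pending: Vec<Trade>,               // accumulated between flushes
    tx: broadcast::Sender<Vec<Trade>>, // subscribers get micro-batches
}

impl MarketDataBuffer {
    // Hot path: append moves the trades in; no per-order broadcast.
    fn on_match(&mut self, trades: &mut Vec<Trade>) {
        self.pending.append(trades);
    }

    /// Flushed when the snapshot debounce fires, not per order: one
    /// broadcast send (one Arc allocation) per interval.
    fn flush(&mut self) {
        if !self.pending.is_empty() {
            let _ = self.tx.send(std::mem::take(&mut self.pending));
        }
    }
}
```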

Three-phase receive loop

Before: recv_timeout(Duration::from_micros(100)) on every iteration. Even when orders are flowing, each call reads the clock to arm the timeout (Instant::now() inside the timeout check) and can park the thread, paying that overhead between orders.

After:

  • Phase 1 (drain): try_recv() in a tight loop. No timer checks, no Instant::now(), no syscalls. During a burst of 1000 orders, the engine processes all of them before checking any timer.
  • Phase 2 (timers): After the drain exhausts, check the snapshot debounce timer and commitment timer. This runs once per burst, not per order.
  • Phase 3 (spin): std::hint::spin_loop() × OLYMPUS_SPIN_ITERS. Each iteration is ~1ns. At 256 iterations, this is ~256ns of CPU time before giving up. Catches orders that arrive just after the drain.
  • Phase 4 (block): recv_timeout with the remaining commitment interval. Only reached when the engine is genuinely idle. The full loop is sketched below.
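
Putting the four steps together, a skeleton of the loop might look like this; Order, handle_order, run_timers, and remaining_commit_interval are stand-ins, and a std::sync::mpsc receiver is assumed:

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError, TryRecvError};
use std::time::Duration;

struct Order;        // stand-in for an inbound order message
struct EngineConfig; // stand-in; holds the intervals from the table
fn handle_order(_o: Order) {}         // match + buffer the results
fn run_timers(_cfg: &EngineConfig) {} // snapshot debounce + commit check
fn remaining_commit_interval(_cfg: &EngineConfig) -> Duration {
    Duration::from_millis(1) // placeholder for "time until next commit"
}

fn engine_loop(rx: Receiver<Order>, spin_iters: u32, cfg: &EngineConfig) {
    loop {
        // Phase 1 (drain): tight loop, no clock reads, no syscalls.
        loop {
            match rx.try_recv() {
                Ok(order) => handle_order(order),
                Err(TryRecvError::Empty) => break,
                Err(TryRecvError::Disconnected) => return,
            }
        }

        // Phase 2 (timers): runs once per burst, not once per order.
        run_timers(cfg);

        // Phase 3 (spin): ~1ns per hint; catches orders that arrive
        // just after the drain without paying for a blocking park.
        for _ in 0..spin_iters {
            std::hint::spin_loop();
        }
        if let Ok(order) = rx.try_recv() {
            handle_order(order);
            continue; // back to the drain phase
        }

        // Phase 4 (block): only reached when genuinely idle; wakes in
        // time for the next commitment tick.
        match rx.recv_timeout(remaining_commit_interval(cfg)) {
            Ok(order) => handle_order(order),
            Err(RecvTimeoutError::Timeout) => {} // timers run next iteration
            Err(RecvTimeoutError::Disconnected) => return,
        }
    }
}
```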

Reduced cloning

Before: Per-order Vec::clone() on single.trades, single.order_updates, and single.bridge_instructions to accumulate into commitment buffers.

After: Vec::append(&mut single.trades) moves elements without allocating. The only remaining clone is for trade stamping (market data needs timestamp_ns set, commitment needs un-stamped trades), which is one clone per trade rather than a full Vec allocation. Additionally, InstrumentId clones are now stack-local (CompactString, 24-byte memcpy) rather than heap allocations, and ledger operations no longer clone InstrumentId per call thanks to the nested map structure (AccountId → InstrumentId → AccountBalance), further reducing per-order overhead.
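
A sketch of the accumulation step under these constraints; the types and field names are assumptions, but the append-for-commitment and clone-for-stamping split follows the description above:

```rust
#[derive(Clone)]
struct Trade {
    timestamp_ns: u64, // stamped for market data; other fields omitted
}

struct SingleMatchResult { trades: Vec<Trade> } // per-order match output
struct CommitBuffer { trades: Vec<Trade> }      // accumulated until commit

fn accumulate(
    commit: &mut CommitBuffer,
    market_data: &mut Vec<Trade>,
    single: &mut SingleMatchResult,
    now_ns: u64,
) {
    // Market data wants stamped trades: one clone per trade, not a
    // clone of the whole Vec.
    market_data.extend(single.trades.iter().cloned().map(|mut t| {
        t.timestamp_ns = now_ns;
        t
    }));
    // Commitment takes the un-stamped originals by move: append
    // reuses the source buffer and allocates nothing per order.
    commit.trades.append(&mut single.trades);
}
```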

Monitoring

The following metrics are specific to continuous mode:

| Metric | Type | What to watch |
|--------|------|---------------|
| continuous_orders_matched_total | Counter | Order throughput in continuous mode |
| snapshot_debounce_batch_size | Histogram | Orders per snapshot publish — higher means better amortization |
| snapshot_publish_latency_ns | Histogram | Per-snapshot cost — should be stable regardless of order rate |
| matching_latency_ns | Histogram | Per-tick matching (batch mode only — not recorded in continuous mode) |
