
Latency

Matching engine latency profile — percentile distributions for batch and continuous matching modes, methodology, and how to reproduce.

The matching engine supports two matching modes with different latency characteristics. These numbers were measured on an Apple M-series processor using the HDR histogram benchmark suite.

208ns

Continuous mode p50 — single order, resting on book

541ns

Continuous mode p50 — crossing order, producing a trade

2.6µs

Batch mode p50 — match_tick, 10 orders per tick

3.9M/sec

Sustained throughput — continuous mode, orders per second

How to run

cargo bench -p olympus-core --bench latency

Batch Mode

In batch mode, the sequencer groups transactions into ticks at a configurable interval (default 1ms). The engine processes the entire tick atomically.
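
The data shapes this implies, as a rough sketch. The Tick, Transaction, PlaceOrder, and CancelOrder names appear in the methodology below; the fields shown here are assumptions, not the crate's actual definitions:

```rust
// Hypothetical shapes; all fields and types are assumptions.
type OrderId = u128;

struct Order {
    id: OrderId,
    price: i64, // fixed-point, per-instrument scale (see "Native i64 arithmetic")
    qty: i64,
}

enum Transaction {
    PlaceOrder(Order),
    CancelOrder(OrderId),
}

struct Tick {
    sequence: u64,
    timestamp_ns: u64,
    // Everything the sequencer collected during one interval (default 1ms).
    // process_tick applies the whole batch before any state is published.
    transactions: Vec<Transaction>,
}
```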

process_tick — matching + hashing

Full tick pipeline including Blake3 state hash and merkle root computation.

| | min | p50 | p95 | p99 | p99.9 | max | mean |
|---|---|---|---|---|---|---|---|
| 10 orders | 2,416 ns | 2,793 ns | 3,251 ns | 7,375 ns | 13,919 ns | 34,239 ns | 2,938 ns |

match_tick — matching only

Engine hot path. Hash computation runs on a separate hasher thread in production.

| | min | p50 | p95 | p99 | p99.9 | max | mean |
|---|---|---|---|---|---|---|---|
| 10 orders | 2,374 ns | 2,625 ns | 3,167 ns | 7,503 ns | 13,671 ns | 25,135 ns | 2,784 ns |

compute_commitments — hashing only

Blake3 state hash + merkle tree over trade leaves. Runs on the hasher thread.

| | min | p50 | p95 | p99 | p99.9 | max | mean |
|---|---|---|---|---|---|---|---|
| 10 trades | 4,292 ns | 4,503 ns | 4,875 ns | 5,627 ns | 13,503 ns | 30,383 ns | 4,567 ns |

Hash overhead

The difference between process_tick and match_tick is ~168ns at p50 for 10 orders. In production this cost is absorbed by the hasher thread, so the engine's effective tick latency is the match_tick number.

Batch throughput

| Metric | ops/sec |
|---|---|
| Sustained throughput (1,000 ticks × 10 orders) | 3.5M orders/sec |
| compute_commitments (10 trades) | 219K ops/sec |

Worst-case order-to-match latency in batch mode = tick interval + match_tick latency. At the default 1ms interval: ~1.003ms.


Continuous Mode

In continuous mode (OLYMPUS_CONTINUOUS_MATCHING=true), orders are matched individually on arrival, bypassing the sequencer. Commitment batching happens asynchronously on a timer.

match_order — resting (no trade)

Order rests on the book. No crossing, no settlement.

| | min | p50 | p95 | p99 | p99.9 | max | mean |
|---|---|---|---|---|---|---|---|
| single order | 125 ns | 208 ns | 292 ns | 1,208 ns | 2,625 ns | 635,903 ns | 261 ns |

match_order — crossing (1 trade)

Order crosses the spread and fills against one resting order.

| | min | p50 | p95 | p99 | p99.9 | max | mean |
|---|---|---|---|---|---|---|---|
| single order | 208 ns | 541 ns | 750 ns | 917 ns | 1,333 ns | 17,631 ns | 548 ns |

Sub-microsecond matching

A resting order matches in 208ns (p50). A crossing order that produces a trade takes ~2.6x as long, due to fill settlement (4 ledger operations) and book mutation. No batching delay — latency is the match_order time alone.

Continuous throughput

| Metric | ops/sec |
|---|---|
| match_order (resting, no trade) | 3.83M orders/sec |
| match_order (crossing, 1 fill) | 1.82M orders/sec |

Multi-Fill Sweep

Edge case

Most orders fill against 1-2 resting orders. This section characterises worst-case latency for large aggressive orders sweeping multiple price levels.

Latency scales linearly with fill count — each fill requires a full settlement cycle and book pop. Per-fill cost is approximately 180ns at p50.

| Fills | min | p50 | p95 | p99 | p99.9 | max | mean | ops/sec |
|---|---|---|---|---|---|---|---|---|
| 1 | 416 ns | 541 ns | 1,833 ns | 5,043 ns | 23,295 ns | 84,927 ns | 782 ns | 1.28M |
| 5 | 1,083 ns | 1,250 ns | 1,958 ns | 5,087 ns | 13,671 ns | 44,095 ns | 1,445 ns | 692K |
| 10 | 2,082 ns | 2,251 ns | 2,625 ns | 4,001 ns | 14,543 ns | 45,183 ns | 2,340 ns | 427K |
| 25 | 4,580 ns | 4,959 ns | 5,667 ns | 7,711 ns | 17,711 ns | 24,511 ns | 5,105 ns | 196K |
| 50 | 8,912 ns | 9,503 ns | 10,591 ns | 14,007 ns | 25,967 ns | 65,855 ns | 9,723 ns | 103K |

A 50-fill sweep at ~9.5µs is still well under the 1ms commitment interval.
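
Eyeballing the p50 column, latency grows as roughly 0.36µs + 0.18µs × N for N fills: a fixed cost for validation, balance reservation, and the initial book walk, plus the ~180ns per-fill settlement cycle noted above (N = 25 predicts ~4.9µs against the measured 4,959ns).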

Tail Latency

The p99.9 numbers show occasional spikes to 2-26µs depending on the operation:

  • Memory allocation — Vec resizing when accumulating trades or order updates
  • BTreeMap rebalancing — inserting into a new price level triggers tree rotation
  • CPU cache effects — even with core pinning, L2/L3 cache pressure from other processes can cause occasional stalls

The max values (17-636µs) represent extreme outliers from OS scheduling jitter or page faults. The 636µs max on the resting match_order benchmark is a single outlier across 50,000 samples — the p99.9 is 2.6µs.


Mode Comparison

| | Batch | Continuous |
|---|---|---|
| Order-to-match | tick interval + match_tick (~1ms) | match_order only (~208ns) |
| Throughput | 3.5M orders/sec | 3.83M orders/sec |
| Crash recovery | Zero loss | Up to 1 commit interval |
| Use case | Production default, audit | Low-latency, HFT |

Methodology

What Each Benchmark Measures

process_tick — Full batch pipeline

The timer starts immediately before CoreEngine::process_tick() and stops when it returns. This single call executes:

  1. Iterate every Transaction in the tick (10 orders in the benchmark)
  2. For each PlaceOrder: validate the instrument is active, read precomputed base/quote assets from config, reserve balance (move from available to reserved in the ledger), walk the opposite side of the order book checking price levels for a match, and if no match, insert the order into the BTreeMap at the correct price level
  3. For each CancelOrder: look up the instrument via the global order→instrument index, then look up the order in the book's FxHashMap index, remove it from the VecDeque at its price level, release reserved balance back to available
  4. Compute the Blake3 state hash over tick sequence, timestamp, all trade details, and transaction count
  5. Hash each trade into a 32-byte leaf (Blake3 over sequence, instrument, price, quantity, buyer/seller accounts, order IDs, timestamp)
  6. Build a binary merkle tree from the trade leaves and extract the root
  7. Record Prometheus metrics (matching latency, ticks processed, orders total, trades total)

The benchmark creates a fresh engine pre-filled with 200 resting orders across 10 funded accounts for each sample. Orders are placed far from mid-price so no trades occur — this isolates the cost of order validation, balance reservation, book insertion, and hash computation without fill settlement.

match_tick — Matching only (engine hot path)

The timer starts immediately before CoreEngine::match_tick() and stops when it returns. This call executes:

  1. Iterate every Transaction in the tick (10 orders in the benchmark)
  2. For each PlaceOrder: validate the instrument is active, read precomputed base/quote assets from config, reserve balance (move from available to reserved in the ledger), walk the opposite side of the order book checking price levels for a match, and if no match, insert the order into the BTreeMap at the correct price level
  3. For each CancelOrder: look up the instrument via the global order→instrument index, then look up the order in the book's FxHashMap index, remove it from the VecDeque at its price level, release reserved balance back to available

Blake3 state hash, trade hashing, and merkle tree construction are excluded — in production these run on a separate hasher thread. The difference in latency between process_tick and match_tick isolates the cost of cryptographic commitment computation.

The benchmark setup is identical to process_tick — same pre-filled engine state and the same tick of 10 orders, all of which rest without trading.

match_order (no trade) — Resting order

The timer wraps a single call to CoreEngine::match_order() with one PlaceOrder transaction. The order's price is far from the spread, so it does not cross — it rests on the book. The call executes:

  1. Validate instrument status
  2. Read precomputed base/quote assets from the instrument config (no string splitting)
  3. Reserve balance: for a buy, move price * quantity of the quote asset from available to reserved; for a sell, move quantity of the base asset
  4. Walk the opposite side of the order book — find no matchable price levels
  5. Insert the order into the BTreeMap at the correct price level, append to the VecDeque for FIFO ordering, add to the FxHashMap index

Unlike the tick benchmarks, this uses a shared engine across all 50,000 samples — the book grows as orders accumulate, which is realistic for continuous mode where the engine state persists across orders.
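
A minimal sketch of the book side behind steps 4-5 and the cancel path. The container choices (BTreeMap of price levels, VecDeque per level for FIFO, FxHashMap id index) come from this page; the type and method names are assumptions:

```rust
use std::collections::{BTreeMap, VecDeque};

use rustc_hash::FxHashMap;

type OrderId = u128;

struct Order {
    id: OrderId,
    price: i64,
    qty: i64,
}

/// One side of the book: price levels in sorted order (BTreeMap), FIFO
/// queues within a level (VecDeque), and an O(1) id -> price index
/// (FxHashMap) used by cancels.
#[derive(Default)]
struct BookSide {
    levels: BTreeMap<i64, VecDeque<Order>>,
    index: FxHashMap<OrderId, i64>,
}

impl BookSide {
    /// Step 5 of the resting path: insert at the correct price level,
    /// append for FIFO ordering, record in the id index.
    fn rest(&mut self, order: Order) {
        self.index.insert(order.id, order.price);
        self.levels.entry(order.price).or_default().push_back(order);
    }

    /// Cancel path (balance release elided): index lookup, remove from
    /// the level's queue, drop the price level if it is now empty.
    fn cancel(&mut self, id: OrderId) -> Option<Order> {
        let price = self.index.remove(&id)?;
        let level = self.levels.get_mut(&price)?;
        let pos = level.iter().position(|o| o.id == id)?;
        let order = level.remove(pos)?;
        if level.is_empty() {
            self.levels.remove(&price);
        }
        Some(order)
    }
}
```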

match_order (with trade) — Crossing order

Same as above, but the order's price crosses the spread and matches against a resting order. The engine is pre-filled with 20,000 resting orders to ensure liquidity. Each call executes:

  1. Validate instrument status
  2. Read precomputed base/quote assets from the instrument config (no string splitting)
  3. Reserve balance: for a buy, move price * quantity of the quote asset from available to reserved; for a sell, move quantity of the base asset
  4. Walk the opposite side — find a matchable resting order at the best price level (single peek, no double-peek)
  5. Self-trade check — if the resting order belongs to the same account, cancel it (release reserved balance) and continue to the next order. In benchmarks, all orders are from different accounts so this is a single comparison that falls through.
  6. Compute fill quantity: min(incoming_remaining, resting_remaining)
  7. Create a Trade struct with the fill details (price is always the resting order's price)
  8. Settle the buyer: deduct fill_cost from reserved quote balance, credit fill_qty to available base balance
  9. Settle the seller: deduct fill_qty from reserved base balance, credit fill_cost to available quote balance
  10. Update or remove the resting order from the book — if fully filled, pop from the VecDeque and remove from the FxHashMap index; if partially filled, update the front order's remaining quantity in O(1)

The ~2.6x latency relative to the resting case comes from balance settlement (4 ledger operations per fill) and book mutation.
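
Steps 8-9 as a sketch: the four ledger operations for one fill, assuming i64 fixed-point amounts. Names are illustrative, not the crate's API:

```rust
/// Minimal balance cell; the real ledger is described elsewhere on this page
/// as a nested AccountId -> InstrumentId -> AccountBalance map.
struct Balance {
    available: i64,
    reserved: i64,
}

fn settle_fill(
    buyer_quote: &mut Balance,
    buyer_base: &mut Balance,
    seller_base: &mut Balance,
    seller_quote: &mut Balance,
    fill_cost: i64, // price * quantity, in quote units
    fill_qty: i64,  // quantity, in base units
) {
    buyer_quote.reserved -= fill_cost;   // 1. buyer pays from reserved quote
    buyer_base.available += fill_qty;    // 2. buyer receives base
    seller_base.reserved -= fill_qty;    // 3. seller delivers from reserved base
    seller_quote.available += fill_cost; // 4. seller receives quote
}
```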

match_order (multi-fill sweep) — Multiple price levels

Places an aggressive buy with quantity N against an order book where each resting ask is quantity 1 at successive price levels (101.00, 101.01, 101.02, ...). The engine walks N price levels, executing a full fill cycle at each:

  1. Peek the best ask level
  2. Price check — confirm the aggressive price >= ask price
  3. Self-trade check — compare account IDs (always different accounts in benchmarks, so this falls through)
  4. Compute fill quantity (always 1 since resting quantity is 1)
  5. Create a Trade struct
  6. Settle buyer: deduct cost from reserved quote, credit base to available
  7. Settle seller: deduct base from reserved, credit quote to available
  8. Pop the fully-filled resting order from the VecDeque, remove from FxHashMap index
  9. Clean up the empty price level from the BTreeMap
  10. Repeat for the next level

Parameterised over fill count: 1, 5, 10, 25, 50. Each sample uses a fresh engine to ensure consistent resting liquidity.

compute_commitments — Hashing only

The timer wraps CoreEngine::compute_commitments(), a static function that takes a Tick and a slice of Trade references. This is what the hasher thread runs in production. It executes:

  1. Create a Blake3 hasher and feed it: tick sequence (8 bytes LE), timestamp (8 bytes LE), then for each trade: trade sequence, instrument ID bytes, price string bytes, quantity string bytes. Finally the transaction count (8 bytes LE). Finalise to produce a 32-byte state hash.
  2. For each of the 10 trades, compute a Blake3 leaf hash over: sequence, instrument ID, price, quantity, buyer account (20 bytes), seller account (20 bytes), buyer order ID (16 bytes UUID), seller order ID (16 bytes UUID), timestamp (8 bytes LE).
  3. Build a binary merkle tree: pair leaves, hash each pair with Blake3, repeat until a single root remains. Odd leaves are duplicated.

The benchmark first runs match_tick to produce trades, then times only the commitment step.
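
Step 3 sketched with the blake3 crate: pair leaves, hash each pair, repeat until one root remains, duplicating the last leaf on odd counts. The leaf encoding itself (step 2) is elided:

```rust
/// Build a binary merkle root over 32-byte Blake3 leaves.
fn merkle_root(mut layer: Vec<[u8; 32]>) -> Option<[u8; 32]> {
    if layer.is_empty() {
        return None;
    }
    while layer.len() > 1 {
        if layer.len() % 2 == 1 {
            layer.push(*layer.last().unwrap()); // duplicate the odd leaf
        }
        layer = layer
            .chunks(2)
            .map(|pair| {
                let mut h = blake3::Hasher::new();
                h.update(&pair[0]);
                h.update(&pair[1]);
                *h.finalize().as_bytes()
            })
            .collect();
    }
    Some(layer[0])
}
```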

HDR Histogram Approach

The latency percentiles use HdrHistogram to capture per-iteration timing with nanosecond precision. Each benchmark:

  1. Creates engine state (fresh per-sample for tick benchmarks, shared for continuous mode)
  2. Runs 100 warmup iterations to stabilise CPU caches and branch predictors
  3. Records 10,000-50,000 individual timing samples into an HDR histogram (3 significant digits, range 1ns to 100ms)
  4. Reports p50, p95, p99, p99.9, min, max, mean, and derived throughput (1/mean)
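
A minimal version of that harness, assuming the hdrhistogram crate (the actual benchmark code may differ). run_one_sample stands in for the timed call — match_order, match_tick, and so on — with engine setup done by the caller:

```rust
use std::time::Instant;

use hdrhistogram::Histogram;

fn profile(mut run_one_sample: impl FnMut(), samples: usize) {
    // 3 significant digits, range 1ns to 100ms, as described above.
    let mut hist = Histogram::<u64>::new_with_bounds(1, 100_000_000, 3).unwrap();

    for _ in 0..100 {
        run_one_sample(); // warmup: stabilise caches and branch predictors
    }
    for _ in 0..samples {
        let t0 = Instant::now();
        run_one_sample();
        hist.record(t0.elapsed().as_nanos() as u64).unwrap();
    }

    for q in [0.50, 0.95, 0.99, 0.999] {
        println!("p{}: {} ns", q * 100.0, hist.value_at_quantile(q));
    }
    println!("min {} / max {} / mean {:.0} ns", hist.min(), hist.max(), hist.mean());
    println!("throughput ~{:.2}M ops/sec", 1_000.0 / hist.mean()); // 1e9 / mean_ns / 1e6
}
```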

Criterion Benchmarks

The throughput and regression-detection benchmarks use Criterion.rs, which provides statistical rigour (confidence intervals, change detection against previous runs) but only reports mean/median — not percentiles.
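
A skeleton of how such a bench is declared with Criterion.rs (the body is elided; only the bench name is taken from this page):

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn sustained_throughput(c: &mut Criterion) {
    // Name matches the filter used in the Reproducing section below.
    c.bench_function("engine/sustained_throughput", |b| {
        b.iter(|| {
            // elided: drive 1,000 ticks x 10 orders through the engine
        })
    });
}

criterion_group!(benches, sustained_throughput);
criterion_main!(benches);
```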

Production Deployment Considerations

The benchmark numbers above measure the pure matching path on bare metal. In a deployed binary, several factors add overhead:

Docker Desktop VM overhead. Docker Desktop on macOS runs Linux in an Apple Hypervisor Framework VM. This adds ~2-5x overhead on memory operations and syscalls. Expect p50 to be 2-3x higher than the numbers on this page. For production, run bare metal or in a VM with dedicated cores.

Snapshot publishing cost. Before the debouncing optimization, EngineSnapshot::from_engine() ran after every order — iterating all instruments, calling bid_depth(1000) and ask_depth(1000) on each, and cloning all balances. With 4 instruments and deep books, this cost 50-200µs per call. At 1000 orders/sec, that was 50-200ms/sec spent on snapshots alone. After debouncing (OLYMPUS_SNAPSHOT_INTERVAL_US, default 500µs), the same snapshot runs at most once per interval regardless of order rate — snapshot overhead drops from ~100ms/sec to ~2ms/sec at 1000 orders/sec.
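
The debounce reduces the per-order cost to a single time check, roughly as sketched here (names are assumptions):

```rust
use std::time::{Duration, Instant};

struct SnapshotDebouncer {
    interval: Duration, // from OLYMPUS_SNAPSHOT_INTERVAL_US, default 500µs
    last: Instant,
}

impl SnapshotDebouncer {
    fn new(interval: Duration) -> Self {
        Self { interval, last: Instant::now() }
    }

    /// Called after every order; returns true at most once per interval,
    /// so snapshot cost no longer scales with order rate.
    fn should_publish(&mut self) -> bool {
        let now = Instant::now();
        if now.duration_since(self.last) >= self.interval {
            self.last = now;
            true // caller builds and publishes the snapshot
        } else {
            false // skip: per-order cost is just this time check
        }
    }
}
```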

CPU pinning impact. Use OLYMPUS_ENGINE_CORE and OLYMPUS_HASHER_CORE to pin threads to dedicated cores. Without pinning, OS scheduler migration causes L1/L2 cache invalidation, which shows up as p99.9 spikes. With pinning on bare metal, p99.9 typically drops by 3-5x.
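
What pinning amounts to, sketched with the core_affinity crate (an assumption; the page does not say which mechanism reads these variables):

```rust
/// Pin the current thread to the core named by an env var such as
/// OLYMPUS_ENGINE_CORE. Parsing and crate choice here are assumptions.
fn pin_from_env(var: &str) {
    if let Some(id) = std::env::var(var).ok().and_then(|v| v.parse::<usize>().ok()) {
        // set_for_current returns false if the OS rejects the request.
        core_affinity::set_for_current(core_affinity::CoreId { id });
    }
}
```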

Busy-spin vs timeout polling. recv_timeout(100µs) calls Instant::now() on every iteration and may park/unpark the thread. Busy-spin with try_recv avoids both costs. The tradeoff is CPU utilization: busy-spin uses 100% of the pinned core even when idle. Use OLYMPUS_SPIN_ITERS=0 in development and 256+ in production.
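
A hybrid receive loop along these lines, sketched with std::sync::mpsc (the actual channel type is not specified here):

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError, TryRecvError};
use std::time::Duration;

/// Busy-spin for spin_iters attempts (OLYMPUS_SPIN_ITERS), then fall back
/// to a parking recv_timeout. With spin_iters = 0 this degenerates to pure
/// timeout polling; disconnect handling is elided for brevity.
fn next_message<T>(rx: &Receiver<T>, spin_iters: u32) -> Option<T> {
    for _ in 0..spin_iters {
        match rx.try_recv() {
            Ok(msg) => return Some(msg), // hot path: no syscall, no park
            Err(TryRecvError::Empty) => std::hint::spin_loop(),
            Err(TryRecvError::Disconnected) => return None,
        }
    }
    // Cold path: Instant::now() and park/unpark costs land here instead of
    // on every iteration.
    match rx.recv_timeout(Duration::from_micros(100)) {
        Ok(msg) => Some(msg),
        Err(RecvTimeoutError::Timeout | RecvTimeoutError::Disconnected) => None,
    }
}
```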

Heap allocation elimination. InstrumentId uses CompactString (inline up to 24 bytes) instead of heap-allocated String, eliminating ~7-9 malloc/memcpy operations per fill. Internal maps use FxHashMap (~3-4x faster than SipHash for small keys), reducing hash overhead on every ledger and book lookup. The ledger uses a nested map (AccountId → InstrumentId → AccountBalance) so that read-only balance lookups require zero InstrumentId clones, and Quantity::new_unchecked removes the Decimal modulo check from the fill path (both inputs are already validated at placement). Per-order metrics (Instant::now() + Prometheus observe) have been moved out of match_order into the caller, saving ~200ns per order. These changes reduce p99 tail latency by eliminating malloc lock contention and page fault jitter.
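
The CompactString point in isolation, as a sketch assuming the compact_str crate that the type name suggests:

```rust
use compact_str::CompactString;

fn main() {
    // "BTC-USD" is well under CompactString's 24-byte inline limit, so
    // construction and clone are stack copies: no malloc, no free.
    let id = CompactString::from("BTC-USD");
    let key = id.clone(); // memcpy of the inline buffer, no heap allocation

    // The String equivalent allocates on creation and again on every clone,
    // adding malloc traffic (and lock contention) to the fill path.
    let s = String::from("BTC-USD");
    let t = s.clone(); // malloc + memcpy
    assert_eq!(key.as_str(), t.as_str());
}
```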

Native i64 arithmetic. All monetary values use i64 fixed-point representation with per-instrument scale factors. Price x quantity uses a single i128 intermediate multiplication (~4ns), replacing rust_decimal::Decimal arithmetic (~200-500ns). This eliminates the ~5-8µs Decimal overhead that was the dominant remaining cost after heap allocation elimination.
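
The arithmetic, sketched with an assumed scale convention (the page does not specify one):

```rust
/// price and qty are i64 fixed-point values; the i128 intermediate avoids
/// overflow in the multiply, and dividing by the quantity scale returns the
/// result to quote units. One widening multiply instead of Decimal math.
fn fill_cost(price: i64, qty: i64, qty_scale: i64) -> i64 {
    ((price as i128 * qty as i128) / qty_scale as i128) as i64
}

// e.g. at scale 1e4: price 101.25 -> 1_012_500, qty 2.5 -> 25_000
// fill_cost = 1_012_500 * 25_000 / 10_000 = 2_531_250 -> 253.125 quote units
```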

Expected production improvement. The match_order benchmark numbers (208ns p50 resting, 541ns crossing) measure the pure matching path. In the production binary, overhead from snapshot publishing, broadcast sends, and timer checks adds to each order. The debouncing and batching optimizations move that overhead off the per-order path, bringing production latency closer to the benchmark numbers.

See Continuous Mode Tuning for detailed variable reference and deployment profiles.

Reproducing

Stable results

Disable low-power / battery-saver mode before running. CPU frequency scaling significantly distorts latency measurements.

Full latency profile

cargo bench -p olympus-core --bench latency

Capture to file

cargo bench -p olympus-core --bench latency 2>/dev/null | tee latency-results.txt

Criterion sustained throughput (with change detection)

cargo bench -p olympus-core --bench engine -- "engine/sustained_throughput"
