
Latency

Matching engine latency profile — percentile distributions for batch and continuous matching modes, methodology, and how to reproduce.

The matching engine supports two matching modes with different latency characteristics. These numbers were measured on an Apple M-series processor using the HDR histogram benchmark suite.

208ns

Continuous mode p50 — single order, resting on book

541ns

Continuous mode p50 — crossing order, producing a trade

2.6µs

Batch mode p50 — match_tick, 10 orders per tick

3.9M/sec

Sustained throughput — continuous mode, orders per second

How to run

cargo bench -p olympus-core --bench latency

Batch Mode

In batch mode, the sequencer groups transactions into ticks at a configurable interval (default 1ms). The engine processes the entire tick atomically.
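
The data shapes this implies, as a rough sketch. The Tick, Transaction, PlaceOrder, and CancelOrder names appear in the methodology below; the fields shown here are assumptions, not the crate's actual definitions:

```rust
// Hypothetical shapes; all fields and types are assumptions.
type OrderId = u128;

struct Order {
    id: OrderId,
    price: i64, // fixed-point, per-instrument scale (see "Native i64 arithmetic")
    qty: i64,
}

enum Transaction {
    PlaceOrder(Order),
    CancelOrder(OrderId),
}

struct Tick {
    sequence: u64,
    timestamp_ns: u64,
    // Everything the sequencer collected during one interval (default 1ms).
    // process_tick applies the whole batch before any state is published.
    transactions: Vec<Transaction>,
}
```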

process_tick — matching + hashing

Full tick pipeline including Blake3 state hash and merkle root computation.

| | min | p50 | p95 | p99 | p99.9 | max | mean |
|---|---|---|---|---|---|---|---|
| 10 orders | 2,416 ns | 2,793 ns | 3,251 ns | 7,375 ns | 13,919 ns | 34,239 ns | 2,938 ns |

match_tick — matching only

Engine hot path. Hash computation runs on a separate hasher thread in production.

| | min | p50 | p95 | p99 | p99.9 | max | mean |
|---|---|---|---|---|---|---|---|
| 10 orders | 2,374 ns | 2,625 ns | 3,167 ns | 7,503 ns | 13,671 ns | 25,135 ns | 2,784 ns |

compute_commitments — hashing only

Blake3 state hash + merkle tree over trade leaves. Runs on the hasher thread.

| | min | p50 | p95 | p99 | p99.9 | max | mean |
|---|---|---|---|---|---|---|---|
| 10 trades | 4,292 ns | 4,503 ns | 4,875 ns | 5,627 ns | 13,503 ns | 30,383 ns | 4,567 ns |

Hash overhead

The difference between process_tick and match_tick is ~168ns at p50 for 10 orders. In production this cost is absorbed by the hasher thread, so the engine's effective tick latency is the match_tick number.

Batch throughput

| Metric | ops/sec |
|---|---|
| Sustained throughput (1,000 ticks × 10 orders) | 3.5M orders/sec |
| compute_commitments (10 trades) | 219K ops/sec |

Worst-case order-to-match latency in batch mode = tick interval + match_tick latency. At the default 1ms interval: ~1.003ms.


Continuous Mode

In continuous mode (OLYMPUS_CONTINUOUS_MATCHING=true), orders are matched individually on arrival, bypassing the sequencer. Commitment batching happens asynchronously on a timer.

match_order — resting (no trade)

Order rests on the book. No crossing, no settlement.

| | min | p50 | p95 | p99 | p99.9 | max | mean |
|---|---|---|---|---|---|---|---|
| single order | 125 ns | 208 ns | 292 ns | 1,208 ns | 2,625 ns | 635,903 ns | 261 ns |

match_order — crossing (1 trade)

Order crosses the spread and fills against one resting order.

| | min | p50 | p95 | p99 | p99.9 | max | mean |
|---|---|---|---|---|---|---|---|
| single order | 208 ns | 541 ns | 750 ns | 917 ns | 1,333 ns | 17,631 ns | 548 ns |

Sub-microsecond matching

A resting order matches in 208ns (p50). A crossing order that produces a trade takes ~2.6x as long, due to fill settlement (4 ledger operations) and book mutation. No batching delay — latency is the match_order time alone.

Continuous throughput

| Metric | ops/sec |
|---|---|
| match_order (resting, no trade) | 3.83M orders/sec |
| match_order (crossing, 1 fill) | 1.82M orders/sec |

Multi-Fill Sweep

Edge case

Most orders fill against 1-2 resting orders. This section characterises worst-case latency for large aggressive orders sweeping multiple price levels.

Latency scales linearly with fill count — each fill requires a full settlement cycle and book pop. Per-fill cost is approximately 180ns at p50.

| Fills | min | p50 | p95 | p99 | p99.9 | max | mean | ops/sec |
|---|---|---|---|---|---|---|---|---|
| 1 | 416 ns | 541 ns | 1,833 ns | 5,043 ns | 23,295 ns | 84,927 ns | 782 ns | 1.28M |
| 5 | 1,083 ns | 1,250 ns | 1,958 ns | 5,087 ns | 13,671 ns | 44,095 ns | 1,445 ns | 692K |
| 10 | 2,082 ns | 2,251 ns | 2,625 ns | 4,001 ns | 14,543 ns | 45,183 ns | 2,340 ns | 427K |
| 25 | 4,580 ns | 4,959 ns | 5,667 ns | 7,711 ns | 17,711 ns | 24,511 ns | 5,105 ns | 196K |
| 50 | 8,912 ns | 9,503 ns | 10,591 ns | 14,007 ns | 25,967 ns | 65,855 ns | 9,723 ns | 103K |

A 50-fill sweep at ~9.5µs is still well under the 1ms commitment interval.
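
Eyeballing the p50 column, latency grows as roughly 0.36µs + 0.18µs × N for N fills: a fixed cost for validation, balance reservation, and the initial book walk, plus the ~180ns per-fill settlement cycle noted above (N = 25 predicts ~4.9µs against the measured 4,959ns).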

Tail Latency

The p99.9 numbers show occasional spikes to 2-26µs depending on the operation:

  • Memory allocation — Vec resizing when accumulating trades or order updates
  • BTreeMap rebalancing — inserting into a new price level triggers tree rotation
  • CPU cache effects — even with core pinning, L2/L3 cache pressure from other processes can cause occasional stalls

The max values (17-636µs) represent extreme outliers from OS scheduling jitter or page faults. The 636µs max on the resting match_order benchmark is a single outlier across 50,000 samples — the p99.9 is 2.6µs.


Mode Comparison

| | Batch | Continuous |
|---|---|---|
| Order-to-match | tick interval + match_tick (~1ms) | match_order only (~208ns) |
| Throughput | 3.5M orders/sec | 3.83M orders/sec |
| Crash recovery | Zero loss | Up to 1 commit interval |
| Use case | Production default, audit | Low-latency, HFT |

Methodology

What Each Benchmark Measures

process_tick — Full batch pipeline

The timer starts immediately before CoreEngine::process_tick() and stops when it returns. This single call executes:

  1. Iterate every Transaction in the tick (10 orders in the benchmark)
  2. For each PlaceOrder: validate the instrument is active, read precomputed base/quote assets from config, reserve balance (move from available to reserved in the ledger), walk the opposite side of the order book checking price levels for a match, and if no match, insert the order into the BTreeMap at the correct price level
  3. For each CancelOrder: look up the instrument via the global order→instrument index, then look up the order in the book's FxHashMap index, remove it from the VecDeque at its price level, release reserved balance back to available
  4. Compute the Blake3 state hash over tick sequence, timestamp, all trade details, and transaction count
  5. Hash each trade into a 32-byte leaf (Blake3 over sequence, instrument, price, quantity, buyer/seller accounts, order IDs, timestamp)
  6. Build a binary merkle tree from the trade leaves and extract the root
  7. Record Prometheus metrics (matching latency, ticks processed, orders total, trades total)

The benchmark creates a fresh engine pre-filled with 200 resting orders across 10 funded accounts for each sample. Orders are placed far from mid-price so no trades occur — this isolates the cost of order validation, balance reservation, book insertion, and hash computation without fill settlement.

match_tick — Matching only (engine hot path)

The timer starts immediately before CoreEngine::match_tick() and stops when it returns. This call executes:

  1. Iterate every Transaction in the tick (10 orders in the benchmark)
  2. For each PlaceOrder: validate the instrument is active, read precomputed base/quote assets from config, reserve balance (move from available to reserved in the ledger), walk the opposite side of the order book checking price levels for a match, and if no match, insert the order into the BTreeMap at the correct price level
  3. For each CancelOrder: look up the instrument via the global order→instrument index, then look up the order in the book's FxHashMap index, remove it from the VecDeque at its price level, release reserved balance back to available

Blake3 state hash, trade hashing, and merkle tree construction are excluded — in production these run on a separate hasher thread. The difference in latency between process_tick and match_tick isolates the cost of cryptographic commitment computation.

The benchmark setup is identical to process_tick — same pre-filled engine state and the same tick of 10 orders, all of which rest without trading.

match_order (no trade) — Resting order

The timer wraps a single call to CoreEngine::match_order() with one PlaceOrder transaction. The order's price is far from the spread, so it does not cross — it rests on the book. The call executes:

  1. Validate instrument status
  2. Read precomputed base/quote assets from the instrument config (no string splitting)
  3. Reserve balance: for a buy, move price * quantity of the quote asset from available to reserved; for a sell, move quantity of the base asset
  4. Walk the opposite side of the order book — find no matchable price levels
  5. Insert the order into the BTreeMap at the correct price level, append to the VecDeque for FIFO ordering, add to the FxHashMap index

Unlike the tick benchmarks, this uses a shared engine across all 50,000 samples — the book grows as orders accumulate, which is realistic for continuous mode where the engine state persists across orders.
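
A minimal sketch of the book side behind steps 4-5 and the cancel path. The container choices (BTreeMap of price levels, VecDeque per level for FIFO, FxHashMap id index) come from this page; the type and method names are assumptions:

```rust
use std::collections::{BTreeMap, VecDeque};

use rustc_hash::FxHashMap;

type OrderId = u128;

struct Order {
    id: OrderId,
    price: i64,
    qty: i64,
}

/// One side of the book: price levels in sorted order (BTreeMap), FIFO
/// queues within a level (VecDeque), and an O(1) id -> price index
/// (FxHashMap) used by cancels.
#[derive(Default)]
struct BookSide {
    levels: BTreeMap<i64, VecDeque<Order>>,
    index: FxHashMap<OrderId, i64>,
}

impl BookSide {
    /// Step 5 of the resting path: insert at the correct price level,
    /// append for FIFO ordering, record in the id index.
    fn rest(&mut self, order: Order) {
        self.index.insert(order.id, order.price);
        self.levels.entry(order.price).or_default().push_back(order);
    }

    /// Cancel path (balance release elided): index lookup, remove from
    /// the level's queue, drop the price level if it is now empty.
    fn cancel(&mut self, id: OrderId) -> Option<Order> {
        let price = self.index.remove(&id)?;
        let level = self.levels.get_mut(&price)?;
        let pos = level.iter().position(|o| o.id == id)?;
        let order = level.remove(pos)?;
        if level.is_empty() {
            self.levels.remove(&price);
        }
        Some(order)
    }
}
```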

match_order (with trade) — Crossing order

Same as above, but the order's price crosses the spread and matches against a resting order. The engine is pre-filled with 20,000 resting orders to ensure liquidity. Each call executes:

  1. Validate instrument status
  2. Read precomputed base/quote assets from the instrument config (no string splitting)
  3. Reserve balance: for a buy, move price * quantity of the quote asset from available to reserved; for a sell, move quantity of the base asset
  4. Walk the opposite side — find a matchable resting order at the best price level (single peek, no double-peek)
  5. Self-trade check — if the resting order belongs to the same account, cancel it (release reserved balance) and continue to the next order. In benchmarks, all orders are from different accounts so this is a single comparison that falls through.
  6. Compute fill quantity: min(incoming_remaining, resting_remaining)
  7. Create a Trade struct with the fill details (price is always the resting order's price)
  8. Settle the buyer: deduct fill_cost from reserved quote balance, credit fill_qty to available base balance
  9. Settle the seller: deduct fill_qty from reserved base balance, credit fill_cost to available quote balance
  10. Update or remove the resting order from the book — if fully filled, pop from the VecDeque and remove from the FxHashMap index; if partially filled, update the front order's remaining quantity in O(1)

The ~2.6x latency relative to the resting case comes from balance settlement (4 ledger operations per fill) and book mutation.
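
Steps 8-9 as a sketch: the four ledger operations for one fill, assuming i64 fixed-point amounts. Names are illustrative, not the crate's API:

```rust
/// Minimal balance cell; the real ledger is described elsewhere on this page
/// as a nested AccountId -> InstrumentId -> AccountBalance map.
struct Balance {
    available: i64,
    reserved: i64,
}

fn settle_fill(
    buyer_quote: &mut Balance,
    buyer_base: &mut Balance,
    seller_base: &mut Balance,
    seller_quote: &mut Balance,
    fill_cost: i64, // price * quantity, in quote units
    fill_qty: i64,  // quantity, in base units
) {
    buyer_quote.reserved -= fill_cost;   // 1. buyer pays from reserved quote
    buyer_base.available += fill_qty;    // 2. buyer receives base
    seller_base.reserved -= fill_qty;    // 3. seller delivers from reserved base
    seller_quote.available += fill_cost; // 4. seller receives quote
}
```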

match_order (multi-fill sweep) — Multiple price levels

Places an aggressive buy with quantity N against an order book where each resting ask is quantity 1 at successive price levels (101.00, 101.01, 101.02, ...). The engine walks N price levels, executing a full fill cycle at each:

  1. Peek the best ask level
  2. Price check — confirm the aggressive price >= ask price
  3. Self-trade check — compare account IDs (always different accounts in benchmarks, so this falls through)
  4. Compute fill quantity (always 1 since resting quantity is 1)
  5. Create a Trade struct
  6. Settle buyer: deduct cost from reserved quote, credit base to available
  7. Settle seller: deduct base from reserved, credit quote to available
  8. Pop the fully-filled resting order from the VecDeque, remove from FxHashMap index
  9. Clean up the empty price level from the BTreeMap
  10. Repeat for the next level

Parameterised over fill count: 1, 5, 10, 25, 50. Each sample uses a fresh engine to ensure consistent resting liquidity.

compute_commitments — Hashing only

The timer wraps CoreEngine::compute_commitments(), a static function that takes a Tick and a slice of Trade references. This is what the hasher thread runs in production. It executes:

  1. Create a Blake3 hasher and feed it: tick sequence (8 bytes LE), timestamp (8 bytes LE), then for each trade: trade sequence, instrument ID bytes, price string bytes, quantity string bytes. Finally the transaction count (8 bytes LE). Finalise to produce a 32-byte state hash.
  2. For each of the 10 trades, compute a Blake3 leaf hash over: sequence, instrument ID, price, quantity, buyer account (20 bytes), seller account (20 bytes), buyer order ID (16 bytes UUID), seller order ID (16 bytes UUID), timestamp (8 bytes LE).
  3. Build a binary merkle tree: pair leaves, hash each pair with Blake3, repeat until a single root remains. Odd leaves are duplicated.

The benchmark first runs match_tick to produce trades, then times only the commitment step.
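
Step 3 sketched with the blake3 crate: pair leaves, hash each pair, repeat until one root remains, duplicating the last leaf on odd counts. The leaf encoding itself (step 2) is elided:

```rust
/// Build a binary merkle root over 32-byte Blake3 leaves.
fn merkle_root(mut layer: Vec<[u8; 32]>) -> Option<[u8; 32]> {
    if layer.is_empty() {
        return None;
    }
    while layer.len() > 1 {
        if layer.len() % 2 == 1 {
            layer.push(*layer.last().unwrap()); // duplicate the odd leaf
        }
        layer = layer
            .chunks(2)
            .map(|pair| {
                let mut h = blake3::Hasher::new();
                h.update(&pair[0]);
                h.update(&pair[1]);
                *h.finalize().as_bytes()
            })
            .collect();
    }
    Some(layer[0])
}
```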

HDR Histogram Approach

The latency percentiles use HdrHistogram to capture per-iteration timing with nanosecond precision. Each benchmark:

  1. Creates engine state (fresh per-sample for tick benchmarks, shared for continuous mode)
  2. Runs 100 warmup iterations to stabilise CPU caches and branch predictors
  3. Records 10,000-50,000 individual timing samples into an HDR histogram (3 significant digits, range 1ns to 100ms)
  4. Reports p50, p95, p99, p99.9, min, max, mean, and derived throughput (1/mean)
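
A minimal version of that harness, assuming the hdrhistogram crate (the actual benchmark code may differ). run_one_sample stands in for the timed call — match_order, match_tick, and so on — with engine setup done by the caller:

```rust
use std::time::Instant;

use hdrhistogram::Histogram;

fn profile(mut run_one_sample: impl FnMut(), samples: usize) {
    // 3 significant digits, range 1ns to 100ms, as described above.
    let mut hist = Histogram::<u64>::new_with_bounds(1, 100_000_000, 3).unwrap();

    for _ in 0..100 {
        run_one_sample(); // warmup: stabilise caches and branch predictors
    }
    for _ in 0..samples {
        let t0 = Instant::now();
        run_one_sample();
        hist.record(t0.elapsed().as_nanos() as u64).unwrap();
    }

    for q in [0.50, 0.95, 0.99, 0.999] {
        println!("p{}: {} ns", q * 100.0, hist.value_at_quantile(q));
    }
    println!("min {} / max {} / mean {:.0} ns", hist.min(), hist.max(), hist.mean());
    println!("throughput ~{:.2}M ops/sec", 1_000.0 / hist.mean()); // 1e9 / mean_ns / 1e6
}
```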

Criterion Benchmarks

The throughput and regression-detection benchmarks use Criterion.rs, which provides statistical rigour (confidence intervals, change detection against previous runs) but only reports mean/median — not percentiles.
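
A skeleton of how such a bench is declared with Criterion.rs (the body is elided; only the bench name is taken from this page):

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn sustained_throughput(c: &mut Criterion) {
    // Name matches the filter used in the Reproducing section below.
    c.bench_function("engine/sustained_throughput", |b| {
        b.iter(|| {
            // elided: drive 1,000 ticks x 10 orders through the engine
        })
    });
}

criterion_group!(benches, sustained_throughput);
criterion_main!(benches);
```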

Production Deployment Considerations

The benchmark numbers above measure the pure matching path on bare metal. In a deployed binary, several factors add overhead:

Docker Desktop VM overhead. Docker Desktop on macOS runs Linux in an Apple Hypervisor Framework VM. This adds ~2-5x overhead on memory operations and syscalls. Expect p50 to be 2-3x higher than the numbers on this page. For production, run bare metal or in a VM with dedicated cores.

Snapshot publishing cost. Before the debouncing optimization, EngineSnapshot::from_engine() ran after every order — iterating all instruments, calling bid_depth(1000) and ask_depth(1000) on each, and cloning all balances. With 4 instruments and deep books, this cost 50-200µs per call. At 1000 orders/sec, that was 50-200ms/sec spent on snapshots alone. After debouncing (OLYMPUS_SNAPSHOT_INTERVAL_US, default 500µs), the same snapshot runs at most once per interval regardless of order rate — snapshot overhead drops from ~100ms/sec to ~2ms/sec at 1000 orders/sec.
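
The debounce reduces the per-order cost to a single time check, roughly as sketched here (names are assumptions):

```rust
use std::time::{Duration, Instant};

struct SnapshotDebouncer {
    interval: Duration, // from OLYMPUS_SNAPSHOT_INTERVAL_US, default 500µs
    last: Instant,
}

impl SnapshotDebouncer {
    fn new(interval: Duration) -> Self {
        Self { interval, last: Instant::now() }
    }

    /// Called after every order; returns true at most once per interval,
    /// so snapshot cost no longer scales with order rate.
    fn should_publish(&mut self) -> bool {
        let now = Instant::now();
        if now.duration_since(self.last) >= self.interval {
            self.last = now;
            true // caller builds and publishes the snapshot
        } else {
            false // skip: per-order cost is just this time check
        }
    }
}
```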

CPU pinning impact. Use OLYMPUS_ENGINE_CORE and OLYMPUS_HASHER_CORE to pin threads to dedicated cores. Without pinning, OS scheduler migration causes L1/L2 cache invalidation, which shows up as p99.9 spikes. With pinning on bare metal, p99.9 typically drops by 3-5x.
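
What pinning amounts to, sketched with the core_affinity crate (an assumption; the page does not say which mechanism reads these variables):

```rust
/// Pin the current thread to the core named by an env var such as
/// OLYMPUS_ENGINE_CORE. Parsing and crate choice here are assumptions.
fn pin_from_env(var: &str) {
    if let Some(id) = std::env::var(var).ok().and_then(|v| v.parse::<usize>().ok()) {
        // set_for_current returns false if the OS rejects the request.
        core_affinity::set_for_current(core_affinity::CoreId { id });
    }
}
```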

Busy-spin vs timeout polling. recv_timeout(100µs) calls Instant::now() on every iteration and may park/unpark the thread. Busy-spin with try_recv avoids both costs. The tradeoff is CPU utilization: busy-spin uses 100% of the pinned core even when idle. Use OLYMPUS_SPIN_ITERS=0 in development and 256+ in production.
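
A hybrid receive loop along these lines, sketched with std::sync::mpsc (the actual channel type is not specified here):

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError, TryRecvError};
use std::time::Duration;

/// Busy-spin for spin_iters attempts (OLYMPUS_SPIN_ITERS), then fall back
/// to a parking recv_timeout. With spin_iters = 0 this degenerates to pure
/// timeout polling; disconnect handling is elided for brevity.
fn next_message<T>(rx: &Receiver<T>, spin_iters: u32) -> Option<T> {
    for _ in 0..spin_iters {
        match rx.try_recv() {
            Ok(msg) => return Some(msg), // hot path: no syscall, no park
            Err(TryRecvError::Empty) => std::hint::spin_loop(),
            Err(TryRecvError::Disconnected) => return None,
        }
    }
    // Cold path: Instant::now() and park/unpark costs land here instead of
    // on every iteration.
    match rx.recv_timeout(Duration::from_micros(100)) {
        Ok(msg) => Some(msg),
        Err(RecvTimeoutError::Timeout | RecvTimeoutError::Disconnected) => None,
    }
}
```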

Heap allocation elimination. InstrumentId uses CompactString (inline up to 24 bytes) instead of heap-allocated String, eliminating ~7-9 malloc/memcpy operations per fill. Internal maps use FxHashMap (~3-4x faster than SipHash for small keys), reducing hash overhead on every ledger and book lookup. The ledger uses a nested map (AccountId → InstrumentId → AccountBalance) so that read-only balance lookups require zero InstrumentId clones, and Quantity::new_unchecked removes the Decimal modulo check from the fill path (both inputs are already validated at placement). Per-order metrics (Instant::now() + Prometheus observe) have been moved out of match_order into the caller, saving ~200ns per order. These changes reduce p99 tail latency by eliminating malloc lock contention and page fault jitter.
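
The CompactString point in isolation, as a sketch assuming the compact_str crate that the type name suggests:

```rust
use compact_str::CompactString;

fn main() {
    // "BTC-USD" is well under CompactString's 24-byte inline limit, so
    // construction and clone are stack copies: no malloc, no free.
    let id = CompactString::from("BTC-USD");
    let key = id.clone(); // memcpy of the inline buffer, no heap allocation

    // The String equivalent allocates on creation and again on every clone,
    // adding malloc traffic (and lock contention) to the fill path.
    let s = String::from("BTC-USD");
    let t = s.clone(); // malloc + memcpy
    assert_eq!(key.as_str(), t.as_str());
}
```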

Native i64 arithmetic. All monetary values use i64 fixed-point representation with per-instrument scale factors. Price x quantity uses a single i128 intermediate multiplication (~4ns), replacing rust_decimal::Decimal arithmetic (~200-500ns). This eliminates the ~5-8µs Decimal overhead that was the dominant remaining cost after heap allocation elimination.
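
The arithmetic, sketched with an assumed scale convention (the page does not specify one):

```rust
/// price and qty are i64 fixed-point values; the i128 intermediate avoids
/// overflow in the multiply, and dividing by the quantity scale returns the
/// result to quote units. One widening multiply instead of Decimal math.
fn fill_cost(price: i64, qty: i64, qty_scale: i64) -> i64 {
    ((price as i128 * qty as i128) / qty_scale as i128) as i64
}

// e.g. at scale 1e4: price 101.25 -> 1_012_500, qty 2.5 -> 25_000
// fill_cost = 1_012_500 * 25_000 / 10_000 = 2_531_250 -> 253.125 quote units
```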

Expected production improvement. The match_order benchmark numbers (208ns p50 resting, 541ns crossing) measure the pure matching path. In the production binary, overhead from snapshot publishing, broadcast sends, and timer checks adds to each order. The debouncing and batching optimizations move that overhead off the per-order path, bringing production latency closer to the benchmark numbers.

See Continuous Mode Tuning for detailed variable reference and deployment profiles.

Reproducing

Stable results

Disable low-power / battery-saver mode before running. CPU frequency scaling significantly distorts latency measurements.

Full latency profile

cargo bench -p olympus-core --bench latency

Capture to file

cargo bench -p olympus-core --bench latency 2>/dev/null | tee latency-results.txt

Criterion sustained throughput (with change detection)

cargo bench -p olympus-core --bench engine -- "engine/sustained_throughput"
