Olympus
System Design

Matching Engine

Deep dive into the matching engine — order book structure, matching algorithm, balance model, determinism, and thread architecture.

Overview

The matching engine is the deterministic core of Olympus. It processes batches of transactions called ticks, executes order matching with price-time priority, manages balances, and produces cryptographic commitments for on-chain verifiability.

The engine sits at the centre of the data pipeline:

API > Sequencer > Engine > TickResult > Persistence / Bridge / WebSocket

Every operation inside the engine is pure and deterministic — given the same sequence of ticks, any instance will produce identical state hashes. This property is what makes the validium architecture possible: the on-chain settlement contract can verify that off-chain execution was correct by checking committed state roots.

For system-level architecture see the Architecture page. This page focuses on the engine internals.

Data Flow

The system supports two matching modes: batch (default) and continuous (feature-gated).

Batch Mode (default)

+-----------+    +------------+    +------------+    +--------+    +------------------+
|  REST     |--->| Sequencer  |--->|   Core     |--->| Hasher |--->| Event Processor  |
|  API      |    | (batching) |    |  Engine    |    | Thread |    |                  |
+-----------+    +------------+    +------------+    +--------+    +------------------+
     |                                  |                               |
     |                                  |                               +--> Append Log (RocksDB)
     |                                  |                               +--> Bridge > EVM
     |                                  |                               +--> Block Submitter
     |                                  |
     |                           +------+------+
     |                           |  ArcSwap    |---> API reads (lock-free)
     |                           |  Snapshot   |
     |                           +-------------+
     |                                  |
     |                           +------+------+
     |                           |  Broadcast  |---> WebSocket subscribers
     |                           |  Channel    |
     |                           +-------------+
  1. API receives orders via REST and pushes Transaction objects to a crossbeam channel.
  2. Sequencer buffers transactions and flushes them as a Tick at a configurable interval (default 1ms). Each tick gets a monotonically increasing sequence number and a nanosecond timestamp.
  3. Core Engine receives the tick, calls match_tick() to process all transactions, and produces a MatchResult. Snapshot and market data are published immediately.
  4. Hasher Thread receives the MatchResult and computes cryptographic commitments (Blake3 state hash + merkle trades root) off the engine hot path. Constructs the final TickResult.
  5. Event Processor persists the tick to the append log, triggers periodic snapshots, and forwards bridge instructions to the EVM layer.
  6. ArcSwap Snapshot is published after every tick for lock-free reads by API handlers.
  7. Broadcast Channel fans out trade data to WebSocket subscribers via TickMarketData.
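The sequencer's buffering step (2 above) can be sketched roughly as follows. This is a minimal std-only illustration, not the actual implementation: the real Transaction is an enum of variants, and the Sequencer struct and method names here are assumptions.

```rust
use std::mem;
use std::time::{SystemTime, UNIX_EPOCH};

// Illustrative stand-in; the real Transaction is an enum of variants.
struct Transaction;

struct Tick {
    sequence: u64,     // monotonically increasing
    timestamp_ns: u64, // nanoseconds since Unix epoch
    transactions: Vec<Transaction>,
}

struct Sequencer {
    next_sequence: u64,
    buffer: Vec<Transaction>,
}

impl Sequencer {
    fn push(&mut self, tx: Transaction) {
        self.buffer.push(tx);
    }

    /// Called at each tick interval (default 1ms): drain the buffer into a
    /// Tick and stamp it with the next sequence number and a timestamp.
    fn flush(&mut self) -> Tick {
        let tick = Tick {
            sequence: self.next_sequence,
            timestamp_ns: SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .expect("clock before Unix epoch")
                .as_nanos() as u64,
            transactions: mem::take(&mut self.buffer),
        };
        self.next_sequence += 1;
        tick
    }
}

fn main() {
    let mut seq = Sequencer { next_sequence: 0, buffer: Vec::new() };
    seq.push(Transaction);
    seq.push(Transaction);
    let tick = seq.flush();
    println!("tick {} carries {} transactions", tick.sequence, tick.transactions.len());
}
```

Because the timestamp is assigned here, before the tick enters the engine, the engine itself never reads a clock (see Determinism below).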

Continuous Mode (OLYMPUS_CONTINUOUS_MATCHING=true)

+-----------+                      +------------+    +--------+    +------------------+
|  REST     |--------------------->|   Core     |--->| Hasher |--->| Event Processor  |
|  API      |   (bypasses          |  Engine    |    | Thread |    |                  |
+-----------+    sequencer)        +------------+    +--------+    +------------------+
                                        |
                                 +------+------+
                                 |  Debounced  |--> ArcSwap (API reads)
                                 |  snapshot   |--> Broadcast (WS + persist)
                                 |  (500us)    |
                                 +-------------+

In continuous mode, the engine receives individual Transaction objects directly from the API channel, bypassing the sequencer entirely. Each order is matched immediately via match_order(). Snapshot publishing and broadcast sends are debounced to amortize their cost across multiple orders:

  • Snapshot publish fires at a configurable interval (OLYMPUS_SNAPSHOT_INTERVAL_US, default 500µs) instead of after every order, eliminating per-order EngineSnapshot::from_engine() overhead.
  • Market data and persistence broadcasts are accumulated and flushed with the snapshot, reducing per-order Arc allocations and broadcast channel contention.
  • Receive loop uses a three-phase strategy: drain all available orders via try_recv (no syscalls), spin briefly (OLYMPUS_SPIN_ITERS iterations), then fall back to blocking recv_timeout with the remaining commit interval.

A commitment timer (configurable, default 1ms) periodically batches accumulated results into a Tick for hashing, persistence, and bridge processing.

This yields sub-microsecond matching latency at the cost of a wider crash window (see Crash Recovery).
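The three-phase receive strategy can be sketched as follows. This sketch uses std::sync::mpsc as a stand-in for the crossbeam channel, and the SPIN_ITERS constant stands in for the OLYMPUS_SPIN_ITERS setting; the function name is illustrative.

```rust
use std::sync::mpsc;
use std::time::Instant;

// Illustrative stand-in for an order transaction.
struct Transaction(u64);

// Stand-in for OLYMPUS_SPIN_ITERS.
const SPIN_ITERS: u32 = 1_000;

/// Three-phase receive: drain, spin, then block until the commit deadline.
fn receive_batch(rx: &mpsc::Receiver<Transaction>, commit_deadline: Instant) -> Vec<Transaction> {
    let mut batch = Vec::new();
    loop {
        // Phase 1: drain everything already queued (no syscalls).
        while let Ok(tx) = rx.try_recv() {
            batch.push(tx);
        }
        if !batch.is_empty() {
            return batch;
        }
        // Phase 2: spin briefly in the hope a new order arrives.
        for _ in 0..SPIN_ITERS {
            std::hint::spin_loop();
        }
        if let Ok(tx) = rx.try_recv() {
            batch.push(tx);
            continue; // back to phase 1 to drain any followers
        }
        // Phase 3: block until an order arrives or the commit timer expires.
        let remaining = commit_deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(tx) => batch.push(tx),
            Err(_) => return batch, // timeout: let the commitment timer fire
        }
    }
}

fn main() {
    use std::time::Duration;
    let (tx, rx) = mpsc::channel();
    tx.send(Transaction(1)).unwrap();
    tx.send(Transaction(2)).unwrap();
    let batch = receive_batch(&rx, Instant::now() + Duration::from_millis(1));
    println!("drained {} orders", batch.len());
}
```

The design intent is that the cheap path (phase 1) handles bursts, while the blocking path (phase 3) keeps the thread from burning a core when the market is quiet.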

Order Book Data Structure

Each instrument has its own OrderBook with three internal structures:

Structure   Type                                        Purpose
---------   -----------------------------------------   -------------------------------
bids        BTreeMap<Reverse<Price>, VecDeque<Order>>   Buy orders, highest price first
asks        BTreeMap<Price, VecDeque<Order>>            Sell orders, lowest price first
index       FxHashMap<OrderId, (Side, Price)>           O(1) lookup for cancellations

Price-time priority is maintained naturally:

  • The BTreeMap sorts by price (highest bid first via Reverse, lowest ask first).
  • Within each price level, the VecDeque maintains FIFO (time) ordering — the first order inserted is the first to be matched.
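A minimal sketch of the bid side (the real OrderBook also carries the asks map and the cancellation index; the types here are trimmed stand-ins):

```rust
use std::cmp::Reverse;
use std::collections::{BTreeMap, VecDeque};

type Price = i64;

#[derive(Debug)]
struct Order { id: u64, qty: i64 }

// Reverse<Price> sorts the highest bid first; the VecDeque at each level
// keeps FIFO (time) order.
#[derive(Default)]
struct BidBook {
    levels: BTreeMap<Reverse<Price>, VecDeque<Order>>,
}

impl BidBook {
    fn insert(&mut self, price: Price, order: Order) {
        self.levels.entry(Reverse(price)).or_default().push_back(order);
    }

    /// Best bid = first key; the front of its queue is the oldest order there.
    fn best(&self) -> Option<(Price, &Order)> {
        self.levels
            .iter()
            .next()
            .map(|(Reverse(p), q)| (*p, q.front().expect("levels are never left empty")))
    }
}

fn main() {
    let mut book = BidBook::default();
    book.insert(101, Order { id: 1, qty: 5 });
    book.insert(102, Order { id: 2, qty: 3 });
    book.insert(102, Order { id: 3, qty: 7 });
    // Highest price wins; at 102, the earlier order (id 2) is first in line.
    let (price, order) = book.best().unwrap();
    println!("best bid {} -> order {}", price, order.id);
}
```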

Complexity

Operation           Complexity                      Notes
---------           ----------                      -----
Insert order        O(log P)                        BTreeMap insertion, where P = number of price levels
Cancel order        O(1) lookup + O(N) level scan   Global order→instrument index for O(1) book lookup, then FxHashMap lookup to find (side, price), then a linear scan within the VecDeque at that price level
Peek best bid/ask   O(1)                            BTreeMap first-element access
Depth query         O(L)                            Iterate L price levels

Matching Algorithm

When an aggressive order arrives, the engine walks the opposite side of the book:

  • Buy order walks asks lowest-price-first
  • Sell order walks bids highest-price-first

For each resting order at the front of the best price level:

  1. Price check: buyer must be willing to pay ≥ ask price; seller must accept ≤ bid price.
  2. Self-trade prevention: if the resting order belongs to the same account as the incoming order, the resting order is cancelled (popped from the book, reserved balance released) and the engine continues to the next resting order. No trade is created. This prevents market makers from consuming their own stale orders when requoting after price movements.
  3. Fill quantity: min(incoming_remaining, resting_remaining).
  4. Trade price: always the resting order's price (price improvement for the aggressor).
  5. Balance settlement: buyer pays quote from reserved, receives base; seller pays base from reserved, receives quote.
  6. Resting order update: if fully filled, pop from the level; if partially filled, update remaining.

After exhausting all matchable resting orders:

Order Type          Remaining Quantity Handling
----------          ---------------------------
Limit               Rests on the book (fill-or-rest)
Market              Cancelled, reserved balance released
ImmediateOrCancel   Cancelled, reserved balance released
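The walk above can be sketched for the buy side as follows. Self-trade prevention and the fill-at-resting-price rule are shown; balance settlement and order-update events are elided, and all names are illustrative.

```rust
use std::collections::{BTreeMap, VecDeque};

type Price = i64;

#[derive(Debug)]
struct Resting { account: u64, qty: i64 }

#[derive(Debug, PartialEq)]
struct Fill { price: Price, qty: i64 }

/// Walk the asks lowest-price-first for an incoming buy order.
fn match_buy(
    asks: &mut BTreeMap<Price, VecDeque<Resting>>,
    limit: Price,
    mut remaining: i64,
    taker_account: u64,
) -> Vec<Fill> {
    let mut fills = Vec::new();
    while remaining > 0 {
        // Best (lowest) ask; stop if the book is empty or the price check fails.
        let Some((&price, level)) = asks.iter_mut().next() else { break };
        if price > limit {
            break;
        }
        while remaining > 0 {
            let Some(front) = level.front_mut() else { break };
            if front.account == taker_account {
                // Self-trade prevention: cancel the resting order, no fill.
                level.pop_front();
                continue;
            }
            let qty = remaining.min(front.qty);
            fills.push(Fill { price, qty }); // trade at the resting price
            remaining -= qty;
            front.qty -= qty;
            if front.qty == 0 {
                level.pop_front(); // fully filled: pop from the level
            }
        }
        if level.is_empty() {
            asks.remove(&price);
        }
    }
    fills
}

fn main() {
    let mut asks = BTreeMap::new();
    asks.insert(10, VecDeque::from(vec![Resting { account: 1, qty: 5 }]));
    asks.insert(11, VecDeque::from(vec![Resting { account: 2, qty: 5 }]));
    let fills = match_buy(&mut asks, 11, 7, 99);
    println!("{} fills", fills.len());
}
```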

Balance Model

The ledger tracks balances per (AccountId, InstrumentId) pair using a three-tier model:

+-----------------------------------------------------+
|                   Total Balance                       |
|                                                      |
|  +----------+  +----------+  +----------+            |
|  |Available |  | Reserved |  |  Locked  |            |
|  |          |  |          |  |          |            |
|  | Free to  |  | Backing  |  | Bridge   |            |
|  | trade or |  |  open    |  | ops      |            |
|  | withdraw |  |  orders  |  | (mint/   |            |
|  |          |  |          |  |  burn)   |            |
|  +----------+  +----------+  +----------+            |
+-----------------------------------------------------+

State Transitions

Event                    Transition
-----                    ----------
Deposit / Admin Credit   available += amount
Place Order              available -= cost, reserved += cost
Cancel Order             reserved -= cost, available += cost
Fill (resting side)      reserved -= cost, counterparty available += proceeds
Lock for Bridge          available -= amount, locked += amount
Unlock from Bridge       locked -= amount, available += amount

Asset identifiers follow a BASE-QUOTE convention derived from the instrument ID. For example, instrument AAPL-USD derives base asset AAPL and quote asset USD. A buyer reserves quote currency; a seller reserves base currency.
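The transitions in the table above can be sketched as methods on a three-tier balance. Error handling is reduced to a string, the counterparty credit and the bridge's on-chain side are elided, and all names are illustrative.

```rust
#[derive(Default, Debug)]
struct Balance {
    available: i64,
    reserved: i64,
    locked: i64,
}

impl Balance {
    /// Deposit / admin credit.
    fn deposit(&mut self, amount: i64) {
        self.available += amount;
    }

    /// Place order: move cost from available to reserved; fail if short.
    fn reserve(&mut self, cost: i64) -> Result<(), &'static str> {
        if self.available < cost {
            return Err("InsufficientBalance");
        }
        self.available -= cost;
        self.reserved += cost;
        Ok(())
    }

    /// Cancel order: return the reserved cost to available.
    fn release(&mut self, cost: i64) {
        self.reserved -= cost;
        self.available += cost;
    }

    /// Fill (resting side): the reserve is consumed; the counterparty's
    /// available balance is credited elsewhere.
    fn settle(&mut self, cost: i64) {
        self.reserved -= cost;
    }

    /// Lock for bridge: available -> locked.
    fn lock(&mut self, amount: i64) -> Result<(), &'static str> {
        if self.available < amount {
            return Err("InsufficientBalance");
        }
        self.available -= amount;
        self.locked += amount;
        Ok(())
    }

    fn total(&self) -> i64 {
        self.available + self.reserved + self.locked
    }
}

fn main() {
    let mut b = Balance::default();
    b.deposit(100);
    b.reserve(40).unwrap();
    println!("available={} reserved={}", b.available, b.reserved);
}
```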

Price Improvement

When a buy limit order at price P fills at a better price F (F < P), the reserve was P × qty but settlement is F × qty. The difference (P - F) × qty is released back to available after matching completes. This applies to:

  • Fully filled orders — excess = reserve_amount - total_fill_cost
  • Partially filled + resting — excess on filled portion released, rest stays reserved for the resting order
  • Market/IOC orders — unfilled reserve + price improvement excess both released

Market Order Reserve

Market buy orders reserve based on best_ask × 1.1 (10% slippage buffer). If the order fills at the actual ask price, the 10% excess is released. If no asks exist, the order uses a minimal reserve and cancels immediately.
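The reserve arithmetic from the two sections above can be worked through in integer raw units. Function names are illustrative.

```rust
/// Limit buy: reserve at the limit price P; after matching at a better
/// price F, release the excess (P - F) * qty back to available.
fn limit_buy_excess(limit_price: i64, fill_price: i64, qty: i64) -> i64 {
    limit_price * qty - fill_price * qty
}

/// Market buy: reserve best_ask * qty plus a 10% slippage buffer,
/// in integer arithmetic.
fn market_buy_reserve(best_ask: i64, qty: i64) -> i64 {
    best_ask * qty * 11 / 10
}

fn main() {
    // Limit buy 10 @ 105, fully filled at resting price 100:
    println!("released: {}", limit_buy_excess(105, 100, 10)); // 50
    // Market buy 10 with best ask 100: reserve 1100; if filled at 100,
    // the 100-unit buffer is released.
    println!("reserve: {}", market_buy_reserve(100, 10)); // 1100
}
```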

Safety Checks

The matching engine enforces several invariants to prevent state corruption and denial-of-service (DoS) attacks:

Check                   When                Behavior
-----                   ----                --------
Duplicate order ID      Before placement    Rejected with DuplicateOrderId — prevents uncancellable phantom orders
Max open orders         Before placement    Rejected with TooManyOpenOrders if the account has ≥ 1000 resting orders
Insufficient balance    Before placement    Rejected with InsufficientBalance when ledger.reserve() fails
Instrument halted       Before placement    Rejected with InstrumentHalted
Ledger invariant        During settlement   Returns LedgerInvariant error — logged, metrics incremented
Self-trade prevention   During matching     Resting order from the same account is cancelled (not filled)

Determinism

The engine guarantees that the same tick sequence produces the same state_hash. This is enforced by:

  1. No randomness during matching — all decisions are deterministic given the input.
  2. No I/O during matching — the engine runs in-memory only.
  3. No clock reads during matching — timestamps come from the Tick struct, assigned by the sequencer before the tick enters the engine.
  4. Deterministic iteration — BTreeMap iterates in sorted order; VecDeque maintains insertion order.

State Hash Computation

After processing a tick, the engine computes a Blake3 hash over:

Blake3(
  tick.sequence          (u64, little-endian)
  tick.timestamp_ns      (u64, little-endian)
  for each trade:
    trade.sequence       (u64, little-endian)
    trade.instrument_id  (UTF-8 bytes)
    trade.price          (string bytes)
    trade.quantity       (string bytes)
  tick.transactions.len  (u64, little-endian)
)
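A std-only sketch of that preimage layout follows; in the real engine these bytes feed a Blake3 hasher, and the Trade struct here is a trimmed, illustrative stand-in.

```rust
// Illustrative stand-in for the trade fields that enter the state hash.
struct Trade {
    sequence: u64,
    instrument_id: String,
    price: String,
    quantity: String,
}

/// Build the byte preimage in the exact field order shown above.
fn state_hash_preimage(sequence: u64, timestamp_ns: u64, trades: &[Trade], tx_count: u64) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&sequence.to_le_bytes()); // u64, little-endian
    buf.extend_from_slice(&timestamp_ns.to_le_bytes()); // u64, little-endian
    for t in trades {
        buf.extend_from_slice(&t.sequence.to_le_bytes()); // u64, little-endian
        buf.extend_from_slice(t.instrument_id.as_bytes()); // UTF-8 bytes
        buf.extend_from_slice(t.price.as_bytes()); // string bytes
        buf.extend_from_slice(t.quantity.as_bytes()); // string bytes
    }
    buf.extend_from_slice(&tx_count.to_le_bytes()); // u64, little-endian
    buf
}

fn main() {
    let preimage = state_hash_preimage(1, 2, &[], 3);
    println!("empty-tick preimage is {} bytes", preimage.len()); // 24
}
```

Fixed-width little-endian integers and raw string bytes keep the preimage unambiguous across platforms, which is what makes the hash reproducible by any verifier.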

Merkle Trades Root

A binary merkle tree is built over trade leaf hashes. Each leaf is:

Blake3(
  trade.sequence | instrument_id | price | quantity |
  buyer_account | seller_account | buyer_order_id |
  seller_order_id | timestamp_ns
)

The merkle root is included in the TickResult and can be committed on-chain. Individual trade inclusion can be proven with a merkle proof.
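A sketch of the binary tree construction, using std's DefaultHasher as a stand-in for Blake3 so the example stays dependency-free. How the real tree handles an odd node count is not specified above; promoting it unchanged, as done here, is an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in leaf/node hash (the real engine uses Blake3).
fn h(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(data);
    hasher.finish()
}

/// Fold leaf hashes pairwise until a single root remains.
fn merkle_root(mut level: Vec<u64>) -> Option<u64> {
    if level.is_empty() {
        return None;
    }
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| match pair {
                [a, b] => {
                    let mut buf = a.to_le_bytes().to_vec();
                    buf.extend_from_slice(&b.to_le_bytes());
                    h(&buf)
                }
                [a] => *a, // odd node promoted unchanged (assumption)
                _ => unreachable!(),
            })
            .collect();
    }
    Some(level[0])
}

fn main() {
    let leaves = vec![h(b"trade-1"), h(b"trade-2"), h(b"trade-3")];
    println!("root: {:x}", merkle_root(leaves).unwrap());
}
```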

Crash Recovery

The engine supports crash recovery through snapshots and log replay:

+----------------------------------------------------------+
|                    Startup Sequence                        |
|                                                           |
|  1. Check for latest snapshot in RocksDB                  |
|     +- Found: deserialize CoreEngine from snapshot        |
|     +- Not found: start with empty engine                 |
|                                                           |
|  2. Replay ticks from append log                          |
|     +- If snapshot exists: replay from snapshot_seq + 1   |
|     +- If no snapshot: replay from sequence 0             |
|                                                           |
|  3. Resume sequencer at last_sequence + 1                 |
|                                                           |
|  Result: identical engine state as before crash            |
+----------------------------------------------------------+

Snapshots are full serializations of the CoreEngine struct (including all order books, the ledger, and bridge state). They are saved to RocksDB periodically (every 1000 ticks by default). The append log stores every tick as MessagePack-encoded binary, keyed by sequence number in big-endian byte order for lexicographic ordering.
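The snapshot-plus-replay rule can be sketched with stub types. Engine, Tick, and the delta field are illustrative, and ticks are numbered from 1 here (an assumption) so that a fresh engine with last_sequence 0 replays the whole log.

```rust
#[derive(Default, Debug)]
struct Engine {
    last_sequence: u64,
    state: i64,
}

struct Tick {
    sequence: u64,
    delta: i64, // stand-in for the tick's transactions
}

impl Engine {
    /// Deterministic: the same tick always produces the same state change.
    fn process_tick(&mut self, tick: &Tick) {
        self.state += tick.delta;
        self.last_sequence = tick.sequence;
    }
}

fn recover(snapshot: Option<Engine>, log: &[Tick]) -> Engine {
    // 1. Latest snapshot if present, otherwise an empty engine.
    let mut engine = snapshot.unwrap_or_default();
    // 2. Replay every tick after the snapshot's sequence.
    let replay_from = engine.last_sequence;
    for tick in log.iter().filter(|t| t.sequence > replay_from) {
        engine.process_tick(tick);
    }
    // 3. The caller resumes the sequencer at engine.last_sequence + 1.
    engine
}

fn main() {
    let log = vec![
        Tick { sequence: 1, delta: 5 },
        Tick { sequence: 2, delta: -2 },
        Tick { sequence: 3, delta: 10 },
    ];
    // Cold start and snapshot-resume converge on the same state.
    let fresh = recover(None, &log);
    let resumed = recover(Some(Engine { last_sequence: 2, state: 3 }), &log);
    println!("fresh={} resumed={}", fresh.state, resumed.state);
}
```

The key property is convergence: because process_tick() is deterministic, replaying from any consistent snapshot yields the same final state as replaying the full log.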

Batch Mode Recovery

In batch mode, every tick processed by the engine is sent to the hasher and then to the event processor, which persists it to the append log before any further processing. A crash at any point results in at most one tick of lost work (the tick currently in flight between the engine and the event processor). On recovery, process_tick() replays deterministically and produces identical state.

Continuous Mode Recovery Tradeoffs

In continuous mode, there is a crash window between individual match_order() calls and the next commitment timer fire. During this window:

What happens                                                                            Risk
------------                                                                            ----
Engine state is mutated in memory by match_order()                                      State changes are lost on crash
Trades may have been broadcast to WebSocket subscribers (if the debounce timer fired)   Subscribers may see trades that were never persisted
Snapshot (ArcSwap) may reflect post-match state (if the debounce timer fired)           API reads may show state ahead of persistence

Bounded loss: The crash window is at most one commitment interval (default 1ms). Trades matched in the last interval before a crash are lost. On recovery, the engine replays from the last committed tick, producing a consistent but slightly behind state.

Replay correctness: The committed ticks contain all raw Transaction objects. On recovery, process_tick() re-derives trades, order updates, and bridge instructions deterministically. The replayed state is self-consistent — it just may not include the very last sub-millisecond of activity.

Mitigation (not yet implemented): A per-transaction write-ahead log could persist each Transaction before match_order() is called, closing the crash window entirely. This would add per-order I/O latency, partially negating the benefit of continuous mode.

Thread Model

The engine runs on a dedicated OS thread — not a tokio task. This gives it:

  • No preemption from async task scheduling
  • Predictable latency without cooperative yield points
  • Sole ownership of the CoreEngine struct — no mutex needed

A separate hasher thread computes cryptographic commitments (Blake3 state hash + merkle trades root) off the engine hot path. This means the engine can begin processing the next tick immediately after matching, without waiting for hash computation.

Communication Channels

Channel                   Type                       Direction                  Purpose
-------                   ----                       ---------                  -------
tick_receiver             crossbeam::unbounded       Sequencer → Engine         Incoming ticks (batch mode)
tx_receiver               crossbeam::unbounded       API → Engine               Incoming transactions (continuous mode)
hash_sender               crossbeam::unbounded       Engine → Hasher            PendingHash for commitment
result_sender             crossbeam::unbounded       Hasher → Event Processor   ProcessedTick output
ArcSwap<EngineSnapshot>   Lock-free atomic pointer   Engine → API               Snapshot for reads
tokio::broadcast          Broadcast channel          Engine → WS layer          TickMarketData (batched per snapshot interval in continuous mode)
tokio::broadcast          Broadcast channel          Engine → Persistence       TickPersistenceData (batched per snapshot interval in continuous mode)
                    crossbeam              crossbeam              crossbeam
  Sequencer -----------------> Engine -----------------> Hasher ----------> Event Processor
  (tokio task)                 (OS thread)                (OS thread)         (tokio task)
                                  |
                                  +--> ArcSwap (lock-free snapshot)
                                  |
                                  +--> broadcast (market data)

Lock-free reads via ArcSwap: API handlers call snapshot.load() which returns an Arc<EngineSnapshot>. In batch mode, reads may be one tick stale. In continuous mode, reads may be up to OLYMPUS_SNAPSHOT_INTERVAL_US (default 500µs) stale due to debounced snapshot publishing.

CPU Pinning

Both the engine and hasher threads can be pinned to dedicated CPU cores to avoid cache invalidation from OS core migration:

Env Var               Thread          Purpose
-------               ------          -------
OLYMPUS_ENGINE_CORE   Engine thread   Pin matching to a dedicated core
OLYMPUS_HASHER_CORE   Hasher thread   Pin hash computation to a dedicated core

Tick Lifecycle

Batch Mode

  1. API handler receives an order request and pushes a Transaction to the sequencer channel.
  2. Sequencer buffers transactions in a Vec<Transaction>.
  3. At each tick interval, the sequencer flushes the buffer into a Tick:
    Tick {
        sequence: u64,        // monotonically increasing
        timestamp_ns: u64,    // nanoseconds since Unix epoch
        transactions: Vec<Transaction>,
    }
  4. The tick is sent to the engine thread via crossbeam channel.
  5. Engine calls match_tick(&tick) which iterates all transactions and returns a MatchResult. Snapshot and market data are published immediately.
  6. The MatchResult is sent to the hasher thread, which computes state_hash and trades_root and produces the final TickResult.

Continuous Mode

  1. API handler pushes a Transaction directly to the engine channel.
  2. Engine drains all available orders via try_recv in a tight loop — no timer checks or syscalls during the drain.
  3. Each order is matched immediately via match_order(). Results are accumulated into broadcast buffers (stamped trades for market data, order events for persistence) and commitment buffers (un-stamped trades, order updates, bridge instructions).
  4. After the drain exhausts, the engine checks the snapshot debounce timer (OLYMPUS_SNAPSHOT_INTERVAL_US, default 500µs). If dirty, it publishes a new EngineSnapshot via ArcSwap and flushes batched market data + persistence broadcasts.
  5. A commitment timer (OLYMPUS_TICK_INTERVAL_MS, default 1ms) batches accumulated transactions into a Tick, which is sent to the hasher thread for commitment computation.
  6. If no orders are available, the engine spins briefly (OLYMPUS_SPIN_ITERS iterations of spin_loop()) before falling back to a blocking recv_timeout with the remaining commit interval.

In continuous mode, ticks represent commitment boundaries rather than batch boundaries. The matching happens order-by-order, but snapshot publishing, broadcast sends, and hashing happen in periodic batches.

Transaction Variants

Variant                 Description
-------                 -----------
PlaceOrder              Submit a new order (limit, market, or IOC)
CancelOrder             Cancel an existing resting order
CancelReplace           Atomic cancel-and-replace (not yet implemented)
LockAsset               Lock balance for bridge withdrawal (generates mint instruction)
UnlockAsset             Unlock balance after bridge deposit (burn confirmation)
AdminCreditAccount      Credit an account with assets (deposits)
AdminAddInstrument      Register a new trading instrument
AdminHaltInstrument     Halt trading on an instrument
AdminResumeInstrument   Resume trading on a halted instrument
AdminPauseBridge        Pause all bridge operations
AdminResumeBridge       Resume bridge operations

Instrument Management

Each instrument is defined by an InstrumentConfig:

InstrumentConfig {
    instrument_id: InstrumentId,  // e.g. "AAPL-USD"
    price_scale: u8,              // decimal places for prices (from tick_size)
    qty_scale: u8,                // decimal places for quantities (from lot_size)
    tick_raw: i64,                // tick_size in raw units at price_scale
    lot_raw: i64,                 // lot_size in raw units at qty_scale
    qty_divisor: i64,             // 10^qty_scale, precomputed for price*qty
    max_order_size_raw: i64,      // max order size in raw units at qty_scale
    status: InstrumentStatus,     // Active or Halted
    base_asset: Option<InstrumentId>,   // precomputed base asset (e.g. "AAPL")
    quote_asset: Option<InstrumentId>,  // precomputed quote asset (e.g. "USD")
}

Validation

  • Price validation: Price::new_raw(value, tick_raw) ensures the price is a positive exact multiple of tick_raw.
  • Quantity validation: Quantity::new(value, lot_size) ensures the quantity is a positive exact multiple of lot_size.
  • Halted instruments: orders placed on a halted instrument are rejected.
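The tick-size check can be sketched as follows: a raw price is valid only if it is a positive exact multiple of tick_raw. The real check lives in Price::new_raw; the function name here is illustrative.

```rust
/// Validate a raw price against the instrument's tick size.
fn validate_price_raw(value: i64, tick_raw: i64) -> Result<i64, &'static str> {
    if value <= 0 {
        return Err("price must be positive");
    }
    if value % tick_raw != 0 {
        return Err("price is not a multiple of tick size");
    }
    Ok(value)
}

fn main() {
    // tick_raw = 50, e.g. a tick size of 0.50 at price_scale 2:
    println!("{:?}", validate_price_raw(1050, 50)); // Ok(1050)
    println!("{:?}", validate_price_raw(1049, 50)); // Err(..)
}
```

Quantity validation against lot_size follows the same shape.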

Asset Derivation

Instrument IDs follow the BASE-QUOTE convention. Base and quote assets are precomputed at instrument registration via InstrumentConfig::with_derived_assets() and stored in the config's base_asset and quote_asset fields. This eliminates per-order string splitting on the hot path.

"AAPL-USD" > base: "AAPL", quote: "USD"
"ETH-USD"  > base: "ETH",  quote: "USD"

This derivation determines which asset is reserved when placing an order:

  • Buy order: reserves price × quantity of the quote asset
  • Sell order: reserves quantity of the base asset
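The derivation and the reserve-side rule can be sketched together; split_once is the obvious std tool for the one-time split done at registration, and the function names here are illustrative.

```rust
enum Side {
    Buy,
    Sell,
}

/// "AAPL-USD" -> ("AAPL", "USD"); None for a malformed instrument ID.
fn derive_assets(instrument_id: &str) -> Option<(&str, &str)> {
    instrument_id.split_once('-')
}

/// Which asset an order reserves: quote for buys, base for sells.
fn reserved_asset<'a>(side: &Side, base: &'a str, quote: &'a str) -> &'a str {
    match side {
        Side::Buy => quote,
        Side::Sell => base,
    }
}

fn main() {
    let (base, quote) = derive_assets("AAPL-USD").expect("malformed instrument id");
    println!(
        "buy reserves {}, sell reserves {}",
        reserved_asset(&Side::Buy, base, quote),
        reserved_asset(&Side::Sell, base, quote)
    );
}
```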
