Performance Benchmark

Summary:

A series of stress tests were performed to assess the overall performance of the matching engine package.

The testing methodology is designed to establish realistic performance metrics that represent typical everyday use, and avoids any attempt to optimize order flow. It should also be noted that the monitoring processes which track these tests result in a small reduction in overall performance.

System Architecture:

For the purpose of this article we can consider the matching engine package to be composed of two key components:

  • An order book
  • A matching process

The orderbook contains an ordered list of all open orders. New orders are inserted into the orderbook at the correct price and time priority position so that the list stays ordered. With each new order insertion, the matching process identifies whether any of the open orders can be matched and, if so, processes the matches and updates the orderbook.
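As a rough illustration only, and not the package's actual data structures or API, a price-time priority insert for a bid book could be sketched in Erlang as follows:

  -module(book_sketch).
  -export([insert/2]).

  %% An order is represented here as {Price, Time, Qty}. For a bid book,
  %% higher prices sort first; among equal prices, earlier times keep priority.
  %% The real engine's internal structure (and therefore its cheapest and most
  %% expensive insertion positions) may differ from this simple list.
  insert(Order, []) ->
      [Order];
  insert({P, T, Q}, [{P2, T2, _} = Head | Rest])
          when P2 > P; (P2 =:= P andalso T2 =< T) ->
      [Head | insert({P, T, Q}, Rest)];
  insert(Order, Book) ->
      [Order | Book].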

The orderbook system can be considered a finite state machine where the orderbook is locked during each order insertion, and only unlocked once the insertion and any executions have been completed. This means that only one order can be entered at a time, and there are no race conditions which could result in improper matching.
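In an Erlang-based system this locking behaviour falls out naturally if each order book is owned by a single process, since a process handles one message at a time. The following is a minimal sketch assuming a gen_server per order book, not the package's actual implementation:

  -module(book_server_sketch).
  -behaviour(gen_server).
  -export([start_link/0, enter_order/2]).
  -export([init/1, handle_call/3, handle_cast/2]).

  start_link() ->
      gen_server:start_link(?MODULE, [], []).

  %% Synchronous entry: the caller blocks until the insertion (and any matching
  %% it triggers) has completed, so orders are processed strictly one at a time.
  enter_order(Pid, Order) ->
      gen_server:call(Pid, {enter, Order}).

  init([]) ->
      {ok, []}.

  handle_call({enter, Order}, _From, Book) ->
      Book1 = book_sketch:insert(Order, Book),  %% insert/2 from the sketch above
      %% matching of any crossing orders would run here, before the reply
      {reply, ok, Book1}.

  handle_cast(_Msg, Book) ->
      {noreply, Book}.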

Environment:

We chose to carry out these performance tests on a physical server to avoid the uncertainty and variation which may be found when using cloud computing services. Cloud computing providers use virtualization technologies to allocate your assigned resources across a range of different physical servers, and depending on the ratio of virtual CPUs (vCPUs) to physical CPUs (pCPUs), you may be competing with other users for the same CPU resources.

Software Stack:

  • Ubuntu: 18.04
  • Yaws: 2.0.6
  • Erlang: 20.3
  • Mnesia: 4.15.4
  • Postgres: 11.5
  • Matching Engine: 0.8.4

Hardware Specifications:

  • PowerEdge R440 Server
  • 2x Intel Xeon Silver 4110 2.1 GHz, 8C/16T
  • 16 CPU Cores
  • 32 CPU Threads
  • 64 GB RAM (RDIMM 2666MT/s)
  • Standard Mixed Mode SATA SSD

Order Entry:

New orders are entered into a single order book with prices and quantities drawn from a binomial distribution. Orders are only entered into one side of the orderbook to prevent any executions from taking place. As the orderbook is populated with more and more orders, the system time is recorded at regular intervals, and the time deltas are used to calculate the order entry rate.
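As an illustration of the measurement loop (the module and function names here are assumptions, not the actual test harness), the entry rate can be sampled as follows, reusing enter_order/2 from the server sketch above:

  -module(entry_rate_sketch).
  -export([run/3]).

  %% Enter Orders one at a time into the book process Pid, sampling the clock
  %% every Interval orders; returns the list of observed rates in orders/second.
  run(Pid, Orders, Interval) ->
      run(Pid, Orders, Interval, 0, erlang:monotonic_time(millisecond), []).

  run(_Pid, [], _Interval, _Count, _T0, Rates) ->
      lists:reverse(Rates);
  run(Pid, [Order | Rest], Interval, Count, T0, Rates) ->
      ok = book_server_sketch:enter_order(Pid, Order),
      Count1 = Count + 1,
      case Count1 rem Interval of
          0 ->
              T1 = erlang:monotonic_time(millisecond),
              Rate = Interval * 1000 / erlang:max(T1 - T0, 1),
              run(Pid, Rest, Interval, Count1, T1, [Rate | Rates]);
          _ ->
              run(Pid, Rest, Interval, Count1, T0, Rates)
      end.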

To get a better idea of the expected performance, we consider the following cases:

  • Optimal Insertion, where orders are sorted prior to insertion such that each new order is simply appended at the end of the order list.
  • Random Insertion, where orders are entered at random and need to be inserted at the correct place in the order list.
  • Worst Case Insertion, where orders are sorted prior to insertion such that each new order must be inserted at the start of the list (a sketch of how the three order streams can be prepared follows this list).
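For illustration, the three streams could be prepared from a single randomly generated order set along these lines; which sorted direction gives the optimal (append-at-the-end) case and which gives the worst case depends on the book's own sort order, represented here by an assumed comparison function SortFun:

  -module(stream_sketch).
  -export([streams/2]).

  %% A rough sketch: sorting with a comparison function that matches the book's
  %% own ordering yields the append-only optimal stream; its reverse yields the
  %% worst-case stream, and the unsorted set is the random stream.
  streams(Orders, SortFun) ->
      Sorted = lists:sort(SortFun, Orders),
      #{optimal    => Sorted,
        random     => Orders,
        worst_case => lists:reverse(Sorted)}.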

In this test we are entering orders one at a time into a single orderbook.

From this graph it can be seen that for orderbooks of less than one million orders, the optimal and random order entry rates are almost linear. The significant noise seen in the worst-case order entry is primarily due to system processes such as the allocation of memory.

For all three cases, 1 million orders are entered in approximately three seconds, which puts the order entry rate at around 300,000 orders per second.

If we were to enter orders simultaneously into 100 separate order books, then based on available system resources we could expect to achieve approximately a 100-fold performance improvement over the above. It is also possible to shard each orderbook onto a separate physical or virtual host to gain access to an even greater resource pool.
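Since each order book is independent, concurrent entry into many books is straightforward to sketch; the module below is illustrative only and reuses the assumed start_link/0 and enter_order/2 from the server sketch above:

  -module(shard_sketch).
  -export([run/2]).

  %% Spawn NumBooks independent book processes and enter OrdersPerBook orders
  %% into each one concurrently; entries into different books do not serialize
  %% against each other, so throughput can scale with available CPU threads.
  run(NumBooks, OrdersPerBook) ->
      Parent = self(),
      Pids = [begin {ok, Pid} = book_server_sketch:start_link(), Pid end
              || _ <- lists:seq(1, NumBooks)],
      [spawn_link(fun() ->
                          [ok = book_server_sketch:enter_order(Pid, {P, erlang:monotonic_time(), 1})
                           || P <- lists:seq(1, OrdersPerBook)],
                          Parent ! done
                  end)
       || Pid <- Pids],
      [receive done -> ok end || _ <- Pids],
      ok.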

Execution Performance:

The execution performance is measured by entering a large market order which matches against a pool of n open orders. The open orders are distributed in a binomial fashion by both price and quantity. Due to the "random" nature of the order distribution between tests, we expect some variation in execution time between successive tests, which presents on the graph as noise. However, the most significant source of variation is OS-level memory garbage collection, which slows down the matching process.
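For reference, binomially distributed prices and quantities for the resting orders can be generated along these lines (the base price and trial counts below are arbitrary assumptions, not the values used in these tests):

  -module(order_pool_sketch).
  -export([resting_order/1]).

  %% A Binomial(N, P) sample taken as the number of successes in N Bernoulli trials.
  binomial(N, P) ->
      length([ok || _ <- lists:seq(1, N), rand:uniform() < P]).

  %% A resting order ({Price, Time, Qty}) with price and quantity centred around
  %% assumed values.
  resting_order(BasePrice) ->
      {BasePrice + binomial(40, 0.5), erlang:monotonic_time(), 1 + binomial(20, 0.5)}.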

This graph was produced by varying the number of open orders, n, and entering a large market order to extinguish them. This testing methodology was chosen as it avoids over-optimization, produces repeatable results, and varies only a single variable, n.

There are two approaches which can be used to verify the time required to complete the large market order:

  1. Check the status of the orderbook every millisecond and verify the orderbook length.
  2. Change the order entry mechanism from async to sync, such that the order entry API will only return once the execution is complete. It is then trivial to calculate the duration from when the order is entered to when the API returns. In async mode the order entry API returns immediately and does not wait for the result of any potential matching.

For simplicity we chose option 2. When the order is entered into the exchange in synchronous mode, a number of other operations need to be completed after the execution finishes and before the API returns. These additional processes significantly reduce the apparent performance of the system, and include tasks such as pushing executions and updated orderbooks to the websockets and propagating all data to the various nodes in the exchange system.
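Measuring this in Erlang is straightforward; the sketch below assumes the synchronous enter_order/2 from the earlier server sketch rather than the package's real API:

  -module(execution_timing_sketch).
  -export([time_sync_execution/2]).

  %% With synchronous entry, timer:tc/1 measures the full duration from order
  %% submission until the call returns: the matching itself plus the
  %% post-execution work described above. Result is in milliseconds.
  time_sync_execution(Pid, MarketOrder) ->
      {Micros, ok} = timer:tc(fun() -> book_server_sketch:enter_order(Pid, MarketOrder) end),
      Micros / 1000.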

Memory Usage:

All open orders are stored in memory. The memory footprint of the order book increases linearly with order book depth. As the number of orderbooks increases, the memory footprint is primarily driven by the total number of open orders across all order books.
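As a rough illustration, per-book and node-wide memory can be sampled with standard Erlang introspection calls (the book process here is the assumed book_server_sketch process from the sketches above):

  -module(memory_sketch).
  -export([book_memory_bytes/1, total_process_memory_bytes/0]).

  %% Heap footprint of a single order book process, in bytes.
  book_memory_bytes(BookPid) ->
      {memory, Bytes} = erlang:process_info(BookPid, memory),
      Bytes.

  %% Total memory used by all Erlang processes in the node, in bytes.
  total_process_memory_bytes() ->
      erlang:memory(processes).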

Significant variability in memory allocation can occur due to the frequency of OS-level memory garbage collection. As orders churn through the system, the OS will need to regularly process garbage and reclaim unused memory blocks, and in the interim the allocated memory can rise significantly above what is actually in use.

The memory usage indicated in this graph is the sum of both the client and the server. That is to say, both the matching engine server and the performance testing process, which creates threads to enter orders.

Available memory resources are unlikely to be the limiting factor in any modern computer.

Conclusion:

When establishing the computing resources required to run a matching engine system, the critical parameter to consider is the expected number of orderbooks. The greater the number of orderbooks, the more performance benefits can be realised by increasing the number of CPU threads. Memory and disk space are far less likely to be critical in the decision making process due to their relatively small footprint in a modern computing system.