Performance engineering at Jane Street


Keeping up with the market

As markets grow, our trading infrastructure needs to process ever-growing amounts of data in ever-shorter time windows. That’s why we build highly optimized packet-processing systems capable of handling millions of multicast messages per second on a single core. Building this kind of system requires a disciplined approach to measurement, a focus on determinism and tail events, and a good dose of mechanical sympathy.
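To make the focus on tail events concrete, here is a minimal sketch of why a median alone can mislead. It assumes latencies are recorded as integer nanoseconds and uses the nearest-rank percentile definition; the `percentile` helper and the synthetic `latencies_ns` data are hypothetical and only for illustration, not taken from any Jane Street system.

```ocaml
(* Nearest-rank percentile over a sample of latencies (in nanoseconds).
   Hypothetical helper for illustration only. *)
let percentile (samples : int array) (p : float) : int =
  if Array.length samples = 0 then invalid_arg "percentile: empty sample";
  if p <= 0.0 || p > 100.0 then invalid_arg "percentile: p must be in (0, 100]";
  let sorted = Array.copy samples in
  Array.sort compare sorted;
  let n = Array.length sorted in
  (* Nearest rank: the smallest value with at least p% of samples at or below it. *)
  let rank = int_of_float (ceil (p /. 100.0 *. float_of_int n)) in
  sorted.(max 0 (rank - 1))

(* Synthetic data: 997 fast messages at 800 ns and 3 slow ones at 50 us. *)
let latencies_ns =
  Array.init 1000 (fun i -> if i >= 997 then 50_000 else 800)

let () =
  Printf.printf "p50 = %d ns, p99 = %d ns, p99.9 = %d ns\n"
    (percentile latencies_ns 50.0)
    (percentile latencies_ns 99.0)
    (percentile latencies_ns 99.9)
```

In this made-up sample the median and the 99th percentile look identical, and only the 99.9th percentile exposes the slow messages, which is why tail percentiles are usually tracked alongside averages.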

Performance isn’t just important for the most latency-sensitive trading. We’ve built a distributed systems framework based on state machine replication (and inspired by the architecture of financial exchanges) which provides high throughput, low latency, and strong reliability guarantees to a wide variety of internal applications. The architecture of this system depends on a very high-performance backbone for sequencing, distributing, and filtering the transactions that drive these applications.
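The general idea behind state machine replication can be sketched in a few lines: a sequencer assigns every transaction a position in a single ordered log, and each replica applies that log with a deterministic transition function, so all replicas reach the same state. The example below is a toy under stated assumptions; the `transaction` type and the `Sequencer` and `Replica` modules are hypothetical names chosen for illustration and do not describe the actual framework.

```ocaml
(* A toy state machine replication sketch. All names are hypothetical. *)
type transaction =
  | Deposit of string * int
  | Withdraw of string * int

(* The sequencer stamps each transaction with its slot in the global order. *)
type sequenced = { seqno : int; txn : transaction }

module Sequencer = struct
  type t = { mutable next : int }

  let create () = { next = 0 }

  let sequence t txn =
    let s = { seqno = t.next; txn } in
    t.next <- t.next + 1;
    s
end

module Replica = struct
  module Balances = Map.Make (String)

  type t = { mutable applied : int; mutable balances : int Balances.t }

  let create () = { applied = 0; balances = Balances.empty }

  let balance t account =
    Option.value (Balances.find_opt account t.balances) ~default:0

  (* Deterministic transition function: same log in, same state out. *)
  let apply t { seqno; txn } =
    if seqno <> t.applied then invalid_arg "Replica.apply: gap or replay";
    (match txn with
     | Deposit (account, amount) ->
       t.balances <- Balances.add account (balance t account + amount) t.balances
     | Withdraw (account, amount) ->
       (* Overdrafts are rejected the same way on every replica. *)
       if balance t account >= amount then
         t.balances <- Balances.add account (balance t account - amount) t.balances);
    t.applied <- t.applied + 1
end

(* Two replicas consume the same sequenced log and end in the same state. *)
let sequencer = Sequencer.create ()

let log =
  List.map (Sequencer.sequence sequencer)
    [ Deposit ("a", 100); Withdraw ("a", 30); Withdraw ("a", 500) ]

let r1 = Replica.create ()
let r2 = Replica.create ()

let () =
  List.iter (Replica.apply r1) log;
  List.iter (Replica.apply r2) log
```

Because the replicas share no state and only read the log, the sequencing and distribution path is the part whose performance bounds the whole system in a design like this.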

At the lowest level, we use FPGA accelerators to achieve performance that CPUs alone can’t deliver. We lead development of Hardcaml (described in the arxiv.org paper “Hardcaml: an OCaml Hardware Domain-Specific Language for Efficient and Robust Design”), an open-source hardware design library written in OCaml (github.com, janestreet/hardcaml). By building our own tools, we’ve been able to build a highly productive hardware design workflow, with fast feedback for engineers, and integrated simulation and testing. If you’re interested in seeing Hardcaml at work, some of our engineers used it to win the 2022 ZPrize competition on accelerating zero-knowledge cryptography, and we’ve posted the detailed results at zprize.hardcaml.com.

Accelerating machine learning

We do a lot of machine learning, and performance engineering is a critical part of that work. Making good use of our GPU clusters requires careful profiling and optimization of our training runs across the whole stack, from storage to network to host.

In most of the ML world, inference is largely a throughput problem, with responses aimed at human timescales. Because our models drive microsecond-scale trading, we need to architect for latencies far below those that are typical for ML workflows, while handling high-throughput market data. This leads us towards a variety of techniques, from writing heavily optimized CUDA code that stretches the bounds of what GPUs were designed for, to leveraging custom hardware, to writing our own compilers.

Industry-leading tools for performance debugging

Magic-trace

We developed magic-trace, a powerful open-source tool for collecting and displaying high-resolution traces of what a process is doing. It’s useful not just for detailed performance debugging, but also simply for understanding what your program does. Magic-trace uses Intel Processor Trace (Intel PT, an extension of Intel Architecture documented at man7.org) to snapshot a ring buffer of all control flow leading up to a chosen point in time, which it then presents to users in an interactive timeline.

[Animation: demonstration of the magic-trace tool]
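The "snapshot a ring buffer" idea can be illustrated in software: events are continuously recorded into a fixed-size buffer that overwrites its oldest entries, and a snapshot copies out whatever led up to the moment of interest. This is only a sketch of the data structure; magic-trace relies on the processor's tracing hardware rather than code like this, and the `Ring` module and the event names are hypothetical.

```ocaml
(* Fixed-capacity ring buffer that keeps only the most recent events.
   Hypothetical sketch of the data structure, not magic-trace's code. *)
module Ring = struct
  type 'a t = {
    slots : 'a option array;
    mutable next : int;   (* index of the slot to overwrite next *)
    mutable count : int;  (* number of valid events, at most the capacity *)
  }

  let create capacity =
    if capacity <= 0 then invalid_arg "Ring.create: capacity must be positive";
    { slots = Array.make capacity None; next = 0; count = 0 }

  (* Record an event, overwriting the oldest one once the buffer is full. *)
  let record t event =
    let capacity = Array.length t.slots in
    t.slots.(t.next) <- Some event;
    t.next <- (t.next + 1) mod capacity;
    t.count <- min capacity (t.count + 1)

  (* Copy out the retained events, oldest first. *)
  let snapshot t =
    let capacity = Array.length t.slots in
    let start = (t.next - t.count + capacity) mod capacity in
    List.init t.count (fun i -> t.slots.((start + i) mod capacity))
    |> List.filter_map Fun.id
end

(* Record six function entries into a buffer that only holds four. *)
let trace : string Ring.t = Ring.create 4

let () =
  List.iter (Ring.record trace)
    [ "main"; "parse"; "lookup"; "send"; "flush"; "slow_path" ]

(* Taken at the "chosen point in time": only the latest four survive. *)
let last_calls = Ring.snapshot trace
```

Recording is constant-time and never allocates after creation, which is the property that makes an always-on buffer cheap enough to leave running until something interesting happens.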

Memtrace

We also built memtrace, a tool for understanding memory usage and finding leaks. Memtrace builds on OCaml’s statistical memory profiler to get callbacks on GC events for a sample of a program’s allocations. The Memtrace viewer then analyzes these events and presents graphical views of them, as well as filters for interactively narrowing the view until the source of the memory problem becomes clear.


Pushing programming language design for high performance

We write our lowest-latency software systems in OCaml, which combines a powerful type system with good, predictable performance and a low-overhead runtime. Over the last couple of years, Jane Street has developed major extensions to OCaml, in particular:


  • The addition of modal types (see icfp24.sigplan.org) opens up a variety of ambitious features, such as memory-safe stack allocation, type-level tracking of effects, and data-race-freedom guarantees for multicore code.

  • Unboxed types (github.com, “Unboxed types in OCaml”) provide more control over how data is represented in memory, in particular allowing structured data to be laid out in a cache- and prefetch-friendly tabular form.


Together, these extensions bring in some of Rust’s best features for writing high-performance code, with a simpler and more ergonomic type system that maintains the relative simplicity of programming in OCaml.
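To give a feel for what a tabular, cache-friendly layout means, the sketch below contrasts a row-oriented array of records with a column-oriented table built from flat arrays. It uses only stock OCaml (`floatarray` and `int array`) and deliberately does not use the unboxed types extension or its syntax; the `quote` and `quote_table` types and all function names are hypothetical.

```ocaml
(* Row-oriented: an array of pointers to separately allocated records.
   In stock OCaml the float field of this mixed record is itself boxed. *)
type quote = { price : float; size : int }

let total_size_rows (quotes : quote array) =
  Array.fold_left (fun acc q -> acc + q.size) 0 quotes

(* Column-oriented ("tabular"): one flat array per field. *)
type quote_table = {
  prices : floatarray;  (* unboxed floats stored contiguously *)
  sizes : int array;    (* immediate ints stored contiguously *)
}

let of_rows (quotes : quote array) : quote_table =
  { prices = Float.Array.init (Array.length quotes) (fun i -> quotes.(i).price);
    sizes = Array.map (fun q -> q.size) quotes }

(* Scans one contiguous array and never touches the prices. *)
let total_size_columns (t : quote_table) =
  Array.fold_left ( + ) 0 t.sizes

(* Walks both columns in lockstep with sequential memory access. *)
let notional (t : quote_table) =
  let total = ref 0.0 in
  for i = 0 to Array.length t.sizes - 1 do
    total := !total +. Float.Array.get t.prices i *. float_of_int t.sizes.(i)
  done;
  !total

let rows = [| { price = 10.0; size = 3 }; { price = 20.0; size = 2 } |]
let table = of_rows rows
```

The column form has to be maintained by hand here, with one array per field kept in sync; language-level control over memory representation is aimed at getting this kind of layout without giving up structured data.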


We’ve also made a wide variety of other improvements to make OCaml more efficient, from adding prefetching to the GC (github.com, “Speed up GC by prefetching during marking”) to working on the middle and back end of the compiler to improve code generation and register allocation.

If the work detailed here resonates with you, you may find these opportunities particularly interesting: