System Jitter and Where to Find It: A Whack-a-Mole Experience

In this Tech Talk, Tudor Brindus, a software engineer at Jane Street, shares his expertise on reducing jitter—deviations from mean input processing times—in low-latency systems. Maintaining consistently low jitter in a trading system allows us to offer tighter markets without incurring additional risk. Using a simple memory ping-pong application as a case study, Tudor will walk through how to identify and resolve common sources of jitter caused by both Linux kernel and microarchitectural conditions.

Transcript

Thank you, Harry. So in case we haven’t interacted yet, I’m Tudor. I am a software engineer at Jane Street. I’ve been here for about four and a half years working on the ultra-low-latency trading team, and we deal with all sorts of things that involve low latency: fast software, fast hardware, custom hardware, analysis tooling for all these sorts of things. So one thing that comes up pretty frequently is jitter and how to control it, so that’s what we’re going to be looking at today. Some industry-standard tricks of the trade, if you will, for high-performance, low-latency systems.

In today’s presentation, we are going to be looking at a toy example: distributing messages in a machine between multiple cores. While this is a toy example, in real life this might be something like market data, ticks from an exchange, and market data tends to be pretty voluminous, on the order of about a gigabyte per second, depending on the exchange. And it is latency-sensitive that you process this. If you are slow, if jitter causes you to trade on data that is a millisecond stale, even 10 microseconds stale, you may make bad trades, and that is, I dunno, pretty sad when you’re a trading firm.

Hopefully you know that Jane Street really cares about OCaml, and OCaml is a fundamentally single-threaded runtime. This is not quite true: there is an Async scheduler, and the Async scheduler will dispatch blocking system calls into some thread pool, upon which they will wait and complete later. But it does mean that if you want parallelism, more than one thread of true code execution, you want to do something with multiple processes, maybe passing some memory between them. If you’re familiar with Python and the global interpreter lock, it’s a similar situation here.

Let’s talk about how to pass memory between two processes on a box. Throughout this talk, please, please, please, I know it’s hard, but interrupt me to ask questions if you have them. If you are confused about something, chances are your peers are also confused, and you’ll be doing them a favor if you speak up. You’ll also be doing me a favor if you speak up, ‘cause I really like talking about this stuff.

Great, without further ado, let’s introduce the ring buffer. We’re gonna be using this ring buffer a lot in this talk. A ring buffer we can think of as an array with fixed-size cells, and once you get to the end of the array, you loop back right around to the start. Very simple. We can establish some ring buffer between two processes on a machine and go from there.
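To make the picture concrete, here is a minimal OCaml sketch of that idea; the integer cells and the `create`/`write` names are just illustrative, not the interface used in the talk.

```ocaml
(* A minimal fixed-capacity ring of int cells: writes go into consecutive
   cells and wrap back to index 0 at the end. Purely illustrative; the real
   benchmark uses a shared-memory buffer between two processes. *)
type t =
  { cells : int array    (* fixed-size storage *)
  ; mutable next : int   (* index of the next cell to write *)
  }

let create capacity = { cells = Array.make capacity 0; next = 0 }

let write t value =
  t.cells.(t.next) <- value;
  (* Once we reach the end of the array, loop right back around. *)
  t.next <- (t.next + 1) mod Array.length t.cells
```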

In fact, let’s play ping pong. We’re gonna have two processes, a pinger and a ponger and two ring buffers, a pinger to ponger ring and a ponger to pinger ring. The thing we’re gonna do is pretty simple. The pinger is going to write an integer, one, into the first cell of the pinger to ponger ring. The ponger is going to read that value one, add one to it resulting in two, write that back to the ponger to pinger ring, at which point the pinger will read it back, add one to it resulting in three, write it in the second cell, and so on and so forth. Once the pinger reaches the end of the ring, it’ll start right back at the start, so this just continues infinitely.

Expressed as OCaml code, this is what it would look like. I’m gonna give you just a second to glance over it. An Iobuf is similar to Java’s ByteBuffer API if that helps, but it should be fairly straightforward.
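The slide itself isn’t reproduced in this transcript, but a pinger written against a ring interface of this shape might look roughly like the sketch below; the `Ring` signature and function names are stand-ins, not the Iobuf-based code shown on the slide.

```ocaml
(* Hypothetical sketch of the pinger's half of the loop, written against a
   minimal ring interface rather than the Iobuf-based code on the slide.
   [read_opt] returns [Some n] once the counterpart has written a new value;
   [write] publishes a value into the next cell. *)
module type Ring = sig
  type t
  val write : t -> int -> unit
  val read_opt : t -> int option
end

module Make_pinger (R : Ring) = struct
  let rec loop ~(to_ponger : R.t) ~(from_ponger : R.t) n =
    R.write to_ponger n;
    (* Spin until the ponger's reply shows up, then bump it and go again. *)
    let rec await () =
      match R.read_opt from_ponger with
      | Some reply -> reply
      | None -> await ()
    in
    loop ~to_ponger ~from_ponger (await () + 1)
end
```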

The metric that we are going to be considering throughout this talk is if we took a timestamp, T1, at the time that the pinger wrote the first value into the pinger to ponger ring and another timestamp, T2, when a round trip through the system has been completed, what is the value of T2 minus T1? Again expressed as code, this is where we would add those two timing calls and we can assume that these timing calls are free.
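As a sketch of where those two timestamps land, with `now`, `record_sample`, `send_ping`, and `await_pong` all passed in as hypothetical stand-ins:

```ocaml
(* Hypothetical instrumentation: timestamp right before the ping is written
   (T1) and right after the matching pong has been read back (T2), and record
   T2 - T1. All four functions are passed in, so nothing here is tied to a
   particular clock or histogram implementation. *)
let timed_round_trip ~now ~record_sample ~send_ping ~await_pong =
  let t1 = now () in
  send_ping ();
  let reply = await_pong () in   (* spins until the pong arrives *)
  let t2 = now () in
  record_sample (t2 - t1);
  reply
```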

How long do people think this should take? For reference, this is our benchmark setup. It’s a typical Intel server, second gen Cascade Lake running at 3.6 gigahertz, has some RAM attached to it. I don’t think the Linux kernel version is relevant, but I have included it for completeness. How long do people think this should take? Who thinks it should take more than 10 microseconds? Put your hand up. More than one microsecond? Okay, got a hand. More than a hundred nanoseconds? Okay, okay, more than 10 nanoseconds? Great, so turns out that you are all correct, yay.

Because on an unoptimized system, this is sort of what it will look like. The jitter will be as high as 16 microseconds. This is the first of many examples that we’re going to walk through tonight. This is a virtualized machine running this pinger and ponger setup and you’ll see that there is a lot going on here. So I’m gonna briefly talk about what you’re seeing.

The top plots here have a linear scale and the bottom plots have a logarithmic scale. As we start, the linear-scale plots are gonna be pretty easy to interpret. There’s a ton of noise going on here, but as we shave down some of that jitter, some of that noise, the logarithmic plots are going to become more useful.

The left-hand plots are histograms of all these samples, while the right-hand plots are time series, that is, the x-axis is time and every single blue dot is one sample. Finally, there is a thin orange line cutting through all of these plots: that is the median latency in the system. It is also shown in the legend. So in this case the median latency was 637 nanoseconds. That’s kind of a lot for not that much going on.

What’s going on here? Well, given that it’s a virtual machine, we might have a noisy neighbor problem where somebody else is running on the same physical host as us and stealing cycles from us. Maybe there could be some spooky virtual machine overhead. I don’t know. It’s pretty hard to think about. People have built entire careers in optimizing the performance of virtual machines, especially when it comes to cloud computing, but it’s not really our forte and there is a winning move here and it’s just to not play. We can just drop the virtual machine, run on the bare metal hardware and we have gotten 414 nanoseconds faster. We’re now sitting at 223 nanos. That’s a lot better.

We didn’t really do much. We’re running the exact same code, just not in a VM. A lot of the noise has cleared up, but we can kind of clearly see there’s something going on around one microsecond. There seems to be some density there, but we have a lot of samples that are quite high up there.

So what’s going on here? How do we figure out what these tails are? You can hook up a sampling profiler like perf or VTune, but it won’t really tell us what’s going on here. It’ll tell you, yeah, you know, 99.5% of your time you’re spinning on the loop waiting for your pinger counterpart to respond. Excellent, zero cache misses, perfect efficiency, but it’s not really what we care about. This, of course, also mirrors what a lot of our trading systems look like. They spin waiting on market data to arrive, and only infrequently do they do something interesting, and we only really care about what they did in that rare interesting case.

Where do we go from here if perf doesn’t help us? Well, it turns out we’re not the only people to have had this issue, and Intel, on processors after Broadwell, that’s at least 10 years ago now, has this feature called Intel Processor Trace. And what Intel Processor Trace does is track the control flow transfers, that is the branches, calls, and returns, that your program takes, in a very compressed format. You can imagine that if every time you branched you had to write down a 64-bit instruction pointer, mmh, you’re just going to run out of memory bandwidth on any contemporary system. So instead it compresses taken and not-taken branches down to a single bit: whether that branch is taken or not. And if you know your starting instruction pointer and you have an x86 disassembler on hand, you can sort of walk through all of the instructions until you get to a conditional branch, inspect the Intel Processor Trace buffer, and now you know which direction that branch went. This makes recording all of this losslessly a tractable problem, and the cool part is that you can instruct Intel Processor Trace to write all of this data into a fixed-size ring buffer that just loops back around, and when an interesting event has occurred, you can snapshot this buffer, decode it, and figure out what happened.
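For reference, the stock tooling for capturing and decoding an Intel Processor Trace recording on Linux is perf; this is generic usage, not the setup from the talk, and `./pinger` is just a placeholder for whatever you want to trace.

```sh
# Generic illustration, not the talk's setup: capture user-space Intel PT
# control flow for a run of a program, then decode it with perf.
ls /sys/devices/intel_pt/               # present if the CPU/kernel support PT
perf record -e intel_pt//u -- ./pinger
perf script --insn-trace | head         # add --xed for disassembly if the XED plugin is installed
```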

Now this is kind of tricky, but we have built a tool to do this, magic-trace. You can pull it off of our GitHub and just run it locally, and what it does is generate you these charts where the y-axis is call stack depth and the x-axis is time. So this is actually a small snippet of a C “Hello world” program that just printfs “hello world”; right here we’re in _dl_map_object. This is part of the linker linking in stuff like libc so that you can printf anything at all.

If we were to draw this to scale, this part of the trace would extend well off into the other room. There is a lot of detail here. We can zoom in however much we want, down to a couple nanoseconds of resolution, and figure out exactly what our program is doing. So now that we have this, we have an answer for how to investigate these tails in our system. We can just use magic-trace: we know what our median latency is, so we can define an event that is interesting to us as a round trip that took more than, say, 300 nanoseconds. And if we do this and pull it up in magic-trace, we immediately see these spikes. Everything else that you see, this green and purple hair-grass-looking thing, that’s just the normal mode of operation when we are exchanging integers with our ponger counterpart. But here and here, something interesting happened, and we can zoom in and figure out what it is, and it’s network activity. That’s pretty weird, and you’re gonna have to take my word for it, but it does say ixgbe_poll over there, and ixgbe is the Linux kernel driver for Intel 10-gig network cards.
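One common way to express “this round trip was interesting” to a tracer is to call a dedicated, never-inlined marker function when the threshold is exceeded, and have the tracing tool snapshot whenever that symbol shows up; the names below are made up for illustration and are not magic-trace’s actual interface.

```ocaml
(* Hypothetical marker pattern, not magic-trace's actual interface: a
   never-inlined, do-nothing function that we call only when a round trip
   exceeds the threshold, so a tracer can be told to snapshot whenever this
   symbol shows up in the trace. *)
let[@inline never] round_trip_was_slow () = Sys.opaque_identity ()

let threshold_ns = 300

let note_round_trip ~round_trip_ns =
  if round_trip_ns > threshold_ns then round_trip_was_slow ()
```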

But that’s kind of weird, right? Our application is not networked at all. It’s exchanging memory over, like, a shared virtual address space. What is going on here? Turns out that CPUs constantly get interrupted. When a network card receives a packet, it somehow has to let the CPU know, the kernel know: hey, you have a packet, please dispatch it to whatever application cares about it. And the way it does this is through interrupts. A card receives a packet, an interrupt fires, the kernel gets to run and gets to dispatch that packet, and it makes sense for load-balancing purposes that all of these interrupts would normally be distributed across all cores. If you are doing a ton of network I/O, you really do care about this sharding.

However, it does mean that occasionally your application that has nothing to do with networking will get interrupted by a network card driver. In our case, we don’t actually need this: we don’t care about the networking, we don’t have any bandwidth concerns on these boxes, they’re not really doing any networking. So how can we get rid of this? Well, you can use this kernel parameter called isolcpus, and you tell the kernel, hey, please isolate everything that is not cores one and two, let’s say. So you leave something so that network activity can flow and you can SSH into this box, and you leave the rest of the cores free of interrupts. It does mean that as a side effect of specifying isolcpus, the scheduler no longer gets a chance to run on those cores, so if you want stuff put on certain cores, you have to manually assign them with taskset.

It’s kind of icky, but it’s fine. We control the system. We are the ones starting the pinger and the ponger; we’re the ones starting our trading systems. We can just taskset them to the core that they will be running on, and the scheduler will not move them throughout their lifetime.
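Roughly, the two knobs just described look like this; the core numbers are arbitrary and depend on your machine’s topology.

```sh
# Illustrative only: core numbers depend on your machine's topology.
#
# Kernel command line (e.g. via your GRUB config): keep cores 2-15 out of
# general scheduling, leaving cores 0-1 for the kernel, sshd, and the
# network IRQs:
#   isolcpus=2-15
#
# Since the scheduler won't place anything on the isolated cores, pin the
# processes there by hand:
taskset -c 2 ./pinger &
taskset -c 3 ./ponger &
```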

If we do this and we restart the box, we notice that all of the interrupts for our ethernet devices have gone away. Well, they’re still on CPU zero, something is still handling those, and our latencies have gotten better. We are now at 194 nanoseconds, down another 29 nanoseconds. Now this sort of one-microsecond tail has become much more pronounced, but magic-trace has been working pretty well so far, so why not try again?

If we try again we get another one, an APIC timer interrupt. Now we know where to look for timer interrupts: it’s in this file, /proc/interrupts, and we can find it under local timer interrupts. You can see that by far the local timer interrupts make up the most interrupts the system is receiving; the network activity we were seeing is a small fraction compared to this. Turns out that about a thousand times a second, the kernel stops to think: hmm, this process sure has been running for a long time. Should I switch it out for something else so something else can make forward progress?
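If you want to see those counters yourself, the local timer line is tagged LOC in /proc/interrupts:

```sh
# "LOC" is the local (APIC) timer interrupt line in /proc/interrupts.
# Sample it twice, a second apart, to see roughly a thousand ticks per
# second accumulating on each busy core.
grep LOC /proc/interrupts; sleep 1; grep LOC /proc/interrupts
```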

For older CPUs, it’s also a chance for the kernel to adjust the frequency of the computer. You can imagine if you are on a laptop reading Wikipedia, you’re gonna be pretty sad if it’s running its fans at full tilt and burning your lap. So the kernel in those situations would clock the CPU down so your fans can spin down. It’s also a chance for the kernel to update process statistics. If you run htop or top and you see how many pages of memory a process is using or how much time it’s been running, that’s all updated in the timer interrupt. But we can also just turn it off with this parameter called nohz_full.

This is a theme in this talk: as long as there is a single runnable process per core, which we have already ensured by setting isolcpus earlier, we can actually turn off these timer ticks. And if we do this and reboot the box, the timer ticks go to approximately zero. This is a doctored chart, because the kernel really does care about process statistics at least a little bit. The most you can do is decrease the frequency to once a second, but once a second is still much better than a thousand times a second.
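In kernel-command-line terms, that looks something like this, continuing the illustrative core list from the isolcpus example:

```sh
# Kernel command line, continuing the illustrative core list from the
# isolcpus example: with at most one runnable task per isolated core,
# nohz_full stops the ~1 kHz tick there (it still fires about once a
# second for bookkeeping).
#   isolcpus=2-15 nohz_full=2-15
```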

In fact, this does play out; we get much better latencies here. We got 24 nanoseconds faster, and that one-microsecond mode was basically entirely attributable to the timer interrupt. We still have these 11 and 16 microsecond modes, but unfortunately at this point, if we hook up magic-trace again, we don’t see anything interesting. We just see that, yeah, sometimes it just took longer to execute these instructions. So at that point, that’s an indication that probably something microarchitectural is going on.

In this case, let’s talk about turbo boost. I earlier mentioned that you know because power is expensive and you don’t want to burn your lap, older kernels, sorry, kernels on older CPUs would manually adjust the frequency of the CPU. But at some point CPU designers were like, well this is silly. We know exactly what’s running on the chip at any given time. There are many distinct subsystems in the chip with different power governors. We can just clock the CPU for the user more intelligently than the kernel could clock it.

In this case, Cascade Lake has a base frequency of 3.6 gigahertz. Well, our Cascade Lake server has a base frequency of 3.6 gigahertz, and it can boost up to 4.4 gigahertz, with 31 possible steps, which Intel calls P-states, in between. Now the question here is: are all of these frequency transitions free? Well, it turns out no, these frequency transitions take time as the CPU rescales its frequency, and we might hypothesize that the jitter that we see is caused by transitions occurring between, I dunno, probably the maximum P-state and the maximum P-state minus one, as we oscillate back and forth.

In fact, if we go back to the plot, the time series plot earlier and with my excellent MS Paint skills, I have connected the dots and you will see that they make this little sawtooth pattern. I think my best guess here is that one direction of the frequency transition takes longer than the other direction. I don’t know which is which, but it does seem to hold for all of the samples that we have here.

As is customary, we can turn this off. We can just turn off turbo boost, we don’t want it, it increases the jitter. And we actually get slower in doing this: we are now one nanosecond slower, but the jitter has gone away. In this case, we have optimized for determinism, but because we are now running at 3.6 gigahertz instead of 4.4 gigahertz plus or minus epsilon, we have gotten slower, and it’s likely that if we had a workload that did more than just increment a number, this difference would become much more pronounced. Nonetheless, we have gotten rid of the noise.
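For example, on a box using the intel_pstate driver, turbo can be switched off from sysfs; other cpufreq drivers expose a different knob, and it can also be done in the BIOS.

```sh
# One way to switch turbo off when the intel_pstate driver is in use; other
# cpufreq drivers expose a different knob, and it can also be done in BIOS.
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```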

We still have these three weird peaks, though, that show up in the logarithmic plots. At this point the linear plots are more or less useless; they’re just one nice line. This is also probably microarchitectural. This is a plot of RAM latency, sorry, RAM speed compared to CPU speed going from the 1980s to 2006. Why 2006, you might ask, why not 2024? Well, I couldn’t find a plot like this online that went up to 2024. I can only expect that the reason is that designers are embarrassed, and this gap has really only gotten worse over time. As chips have become larger with more cores, memory latency has really fallen off a cliff.

This is kind of a problem, because everything that CPUs do depends on having their memory operands. If you add two registers together, that’s fine, but if one of the operands of your addition is some pointer that you’ve just dereferenced, you have to go and get it. The gap is massive. You would really grind the CPU to a halt if you couldn’t hide this in some way. And CPU designers have gotten increasingly clever in finding ways to hide this latency. One method that they employ is speculative execution.

The idea behind speculative execution is: consider our code, where in this while-true loop we query whether we have any data from our counterpart. 99% of the time, the answer to this is going to be no, there is no data; the other side is still reading the data you just sent it, adding one, sending it back. So the CPU learns that, actually, yeah, there’s never going to be data there. I can just assume that because there has been no data the last hundred times I’ve been through this loop, there will be no data the 101st time through, and it will speculatively loop back around and issue more loads.

If we visualize the CPU pipeline here, very simplified, if there are any hardware engineers in the room, I’m so sorry, it might look something like this. The CPU will issue a load from the ring buffer, and because the last hundred times this load has come up empty, it will speculatively jump to the start of the loop again. It’ll issue another load, assume it’ll come back empty, jump back to the start, and so on and so forth. We get this pipeline that’s quite deep.

However, eventually that load does complete and it does have data, and that’s kind of a problem. We have a lot of work in flight that has partially modified CPU state. If you didn’t roll back the CPU state, who knows what would happen. So the CPU has to throw all that work away, and throwing it away takes time. If you know where to look, there’s this tool called pcm, the Processor Counter Monitor, that Intel publishes, and if you run it on a box, it’ll tell you where pipeline stalls are occurring. In this case, 8% are being attributed to bad speculation.

This is the bucket that this sort of thing goes into. Now you might say, but Tudor, backend bound is 44% of the pipeline stalls, frontend bound is 22%, retiring is 23%. Why do you care about bad speculation when it’s 8%? Well, in the Intel software optimization manual, they have wording that sort of goes like: yeah, if bad speculation is greater than 1%, something is catastrophically wrong and that should be your number one priority to fix. Okay, Intel, let’s fix it.

How do we fix it? Again, we’re not the first people to have this problem. If you think about what our pinger is doing, spinning on one cache line being updated, this kind of rhymes with a spinlock. And here is an excerpt from the documentation for the x86 pause instruction. It is meant to improve the performance of spin-wait loops, because when executing a spin-wait loop, processors will suffer a severe performance penalty when exiting the loop because they detect a possible memory order violation, and the pause instruction provides a hint to the processor that the code sequence is a loop, yada yada, yada yada. It avoids the bad speculation by preventing speculation past the pause instruction.

Very cool. How do we do this in OCaml? Well, we ask our compiler devs: hello, pretty please, can we please have an intrinsic that generates the x86 pause instruction? And a couple of days later they come back and say, yes, here you have it. We have this one-line code diff here where we have called the pause instruction. This is the only change we have made to this program, and the bad speculation has gone down to 1%. This is much better.
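The transcript doesn’t show the actual diff or the intrinsic’s real name, so here is a rough OCaml sketch of the shape of that change, with the intrinsic modeled as a hypothetical external.

```ocaml
(* Rough sketch of the change, with the intrinsic modeled as a hypothetical
   external; the real intrinsic added by the compiler team may have a
   different name and home. *)
external cpu_pause : unit -> unit = "caml_hypothetical_pause_hint" [@@noalloc]

(* The one-line difference: issue [pause] on every empty poll so the CPU
   doesn't speculate a deep pipeline of loads past the spin-wait loop. *)
let rec await_pong read_opt =
  match read_opt () with
  | Some reply -> reply
  | None ->
    cpu_pause ();
    await_pong read_opt
```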

This is now within a range that Intel would consider good. How do the numbers look? Well, we got another eight nanoseconds faster. The three peaks that we saw have sort of collapsed into one. We still have some noise; we have a lot of samples here that are, I don’t know, about 15, 20 nanoseconds up, in some bins. I don’t know what’s going on there. I’d like to know; I’d like all of my samples to be in the fast bin, but this is where my bag of tricks runs dry. However, we did massively improve the performance here. We went from 637 nanoseconds with tons of jitter to 163 nanoseconds with very low jitter. That’s pretty neat.

But there’s at least one thing that we haven’t tried yet, and that’s, if you go back to our benchmarking setup, when I said that, you know, we had this CPU and that RAM, it’s been a while since I’ve been in university, but I’m pretty sure that I had better laptop RAM and a better laptop CPU than this. What if we bought the gamer RAM with the RGB lights and overclocked the CPU? Like, everybody knows that the RGB RAM makes you go faster, please. What would the numbers look like here?

They actually get much better. This almost feels like cheating. We have gotten 94 nanoseconds faster by just running on better hardware. Now, better hardware asterisk. The real story is right after we took this benchmark, we’re like, wow, it’s so fast. Let’s try seeing what some of our real trading systems perform like on this hardware. They made it halfway through a testing run before the box died. Memory corruption. Turns out that if you overclock things too high, one plus one is equal to three and that’s not great. So trade-offs.

Nonetheless, we have gotten about an order of magnitude faster from where we started. Pretty good. The takeaways here are that if you care about performance and determinism in any respect, maybe don’t use virtual machines. You might not always have the choice; typically cloud providers will upcharge you for non-virtualized instances. It is what it is. At least know what you’re getting yourself into.

Tudor: Yes?

Audience Member: Sorry, I’ll ask a question. What do you consider a virtual machine? Like, is a hypervisor in AWS a virtual machine?

Tudor: Yes. AWS is a pretty reputable cloud provider; they will tend to not have that much jitter, but they definitely will have some. Lower-end cloud providers will frequently oversubscribe you. AWS also does sell metal instances, and those are the ones where you would get the most deterministic performance. Thanks for the question.

So as long as Docker is running on a bare metal machine, it should be fine. Docker just uses namespace capabilities in the kernel. Your interactions with the kernel will get moderately slower from all the cgroup accounting and things like that, but as long as you’re not doing system calls, it mostly doesn’t matter. It certainly won’t introduce more jitter. Thanks for the questions.

The second point is that a lot of performance can be squeezed out of your system by understanding the underlying architecture. We made a one-line code change here; pretty good for what we got. The choice of hardware matters a lot, as we just discussed. And finally, probably the most salient takeaway from this is that it’s really important to constantly measure. If some trading team handed me a system whose tails looked like what we started off with, I would have a pretty hard time getting it into a state that looks like what we ended off with.

More importantly, if we have a system that has tons of tails, I can repeatedly introduce some regression by accident and we’re never going to notice. It’s already so noisy, we can make it 10x worse before somebody’s like, wait a second, the frog’s kind of boiling, we’ve gotta figure out what’s going on. Whereas if you had a system that looks like what we ended off with, where the median latency is not very different from the P99 latency, any new regression that you introduce that causes the P99 to go off a cliff, you’re gonna notice immediately in testing. And that is just a much more stable place to be if you care about determinism.

Tudor: Cool, thanks everyone.
