Caveat Configurator: How to replace configs with code, and why you might not want to
Dominick LoBraico
Jane Street
Complicated systems require expressive configuration languages. But language design is hard: It's no surprise that many applications have either limited configurability or an unwieldy configuration format with complex semantics. One solution to this problem is to write your configs in an existing language and treat them more like code. This pattern allows you to use the same familiar workflows and tools to configure applications that you use when developing them.
At Jane Street, we developed ocaml_plugin, a library for embedding OCaml within an application, to leverage this approach and have made use of it in a variety of systems. But giving config-writers the full flexibility of a Turing-complete language introduces its own problems. In this talk, we'll discuss the merits of writing configs in code and cover some of the different approaches we've taken to harness this power.
Transcript
All right. Thank you all for coming out tonight. I am very satisfied with the pun in this talk title, so I’m just going to rest on the title slide for a few more seconds and let you all take it in. So, as Ron said, the talk is called Caveat Configurator, and this is a little bit about how Jane Street thinks about configuring systems and some of the lessons that we’ve learned from experience doing that.
So, first, a little bit more about Jane Street. I’m guessing most of you know at least the very basics, which is that we’re a proprietary trading firm, so we trade all around the world and in all different markets, we trade all different financial instruments, and most of our business is based on in-house tools and systems. There are kind of two constraints that I want to talk about today. The first is flexibility, and the second is safety. So, I’ll say a little bit more about each of those.
We’re relatively small. We’re about 500 people. Maybe 150 of those are developers of some kind, and the number of systems per engineer here at Jane Street is high. We trade, like I said, in something like, I don’t know, 90 countries or something like that, and we are constantly looking for new opportunities and new places that we can do business. As a result, we really, really need to kind of maximize the amount of output from each person to be able to do that.
So, that means designing flexible systems, being able to reuse them in lots of different places, being able to sort of have new instantiations of things with slight tweaks without lots and lots of extra development effort is really valuable. That can be kind of the difference between getting into production in one day for a new opportunity somewhere and 30 days, if we have to go and build a new system or something like that. 30 days is a completely made-up number and definitely way too optimistic.
On the other hand, I mentioned that we’re a proprietary trading firm, which means that we’re trading our own money, and the significance of that is that if we lose all of our money, then we aren’t here anymore, and it’s really hard to find jobs as OCaml programmers, especially for, like, 150 of us. So, safety is kind of the other constraint that we consider in our development work, and that means thinking about the different kinds of risks that a particular system is going to be taking and how we can constrain those, and sort of considering the fact that a small bug could mean missed opportunity cost, it could mean that we’re not able to trade in some place that we want to be able to trade for some amount of time, but a big bug in a tight loop could mean that it’s an existential risk for us. It could mean that we don’t have a job tomorrow.
So, with these two things in mind, configuration is something that we really think a lot about. Being able to have these general systems which we kind of instantiate in specific cases is something that lets us get a lot of mileage for our time.
In 2014, a colleague and I, Drew, sitting right there … Drew, cool. We wrote a talk. We wrote a talk about a new configuration paradigm that was starting to take hold here at Jane Street based on a new library that we had written called ocaml_plugin. The idea, as you probably can guess from the name, is we wanted to be able to have some kind of plugin architecture for our system, some way to write code that we could plug in to a system to change the behavior in more flexible ways than previous configuration architectures had allowed us to do.
This is a pretty common approach. We certainly were not the first to do this, and there’s some built-in support in the OCaml compiler and runtime to be able to do this kind of dynamic loading, but we wrote a library to make it easier to do and it started to take hold here in various places. ocaml_plugin was really new at the time, and we were really excited. As software developers, we’re all kind of attracted to shiny things, and it was really shiny. At the time, we here at Jane Street were working on, just to give an example, a new implementation of our email system which was written and configured in OCaml. So, there was real production effort going into using ocaml_plugin.
We went around, we gave the talk on a few college campuses, the crowds went completely insane, I got famous, there was much rejoicing. I figured out how to use emoji in a LaTeX document which made me really satisfied, but okay. We did that for a few months, we came back to work, and it turns out to have been kind of a flaming disaster. At the time, we were taking what was a pretty naïve approach to this plugin architecture and it turns out when we tried to actually use it in production the way that seemed obvious to us at the time, we ran into a bunch of issues, so this talk tonight is going to be first about some of the issues that we ran into with kind of a little bit of background about how we got to ocaml_plugin, and then I’m going to talk a little bit about how we’ve actually ended up dealing with that at Jane Street, what our current sort of strategies for dealing with this config problem tend to be.
So, what is the point of configuration? I’ve alluded to it in the beginning of the talk so far. The sort of first idea that you think of when you think of configuration is that you want to be able to sort of easily change the behavior of some system at runtime without having to potentially recompile, repackage, redeploy the application and deal with sort of all the rigamarole that goes into that. So, there’s some set of customizations that we’re going to apply to a general system and make it useful in a particular instance.
A little bit more generally, in thinking about configuration, there’s some sort of necessary implied idea about separating the things that you decide to configure, to make configurable I should say, and the things that are left to the kind of core functionality of the system. The last thing is configuration makes it easier to handle cases that you might not have thought about at the time that you wrote the application. So, you write an application, you roll it into production, something changes in the world, and now you want to be able to make it do your bidding in some other way. Even if you didn’t think about that at the time that you actually wrote the application, if you made your application configurable and flexible enough, then you may have left yourself some kind of escape hatch, some way to actually make it still work for that thing that you hadn’t even ever thought about.
So, let’s just consider an example. I’m going to use as an example tonight the idea of an email server because it’s familiar to me, but for those who it isn’t familiar to, the basic idea is pretty straightforward: you’ve got some system, it’s going to accept connections from some remote set of hosts, it’s going to receive email messages over those connections, it’s going to process them in some way, maybe it’s going to transform them, add headers, pull out virus-ridden attachments, things like that, and then it’s going to deliver or forward the messages on to some other host.
So, you can think about configuring an email system in a few different ways. To start with, we have this kind of monolithic approach where we don’t actually have any configuration, we just have the core functionality and any of the bits that you might otherwise consider as configurable just bundled into the application. You build the application, you deploy it. You want to change something? You change it, you build it, you deploy it again. Is this crazy? No, it’s not. Right answer. In fact, I think if you can get away with zeroconf, with the idea that you don’t have any actual configuration for your system in your particular use case, then that’s actually … it sidesteps a bunch of problems that we’re going to talk about tonight, and there are, I think, real cases in the world where that is true. I think mail systems tend not to be because of the fact that you do want that ease of changing behavior at runtime if something changes in the world, but the point is, you shouldn’t write that off, and when you’re thinking about these configuration problems, you should think about “Maybe I don’t actually need configuration, maybe I can just build a system that has some sane set of defaults and make it easy to deploy new versions of it.”
So, I’m not going to talk about that anymore, I’m going to continue talking about configuration now. Going back to our mail server, the natural thing to do is “Oh, we decided we need a little bit more flexibility, we need to be able to change certain things at runtime,” so we kind of push this line that I’m using to kind of separate the core functionality and the configurable bits of the application over to the right, and I’ve got this … this is kind of a pseudo x-axis here of flexibility. As we move this line over, as we increase the set of things that are configurable and sort of pull things out of the core functionality of the application as we consider it, the application, in theory, becomes more flexible, more configurable. As a sort of natural progression from that, you can just keep pulling things out of the core functionality of the application to the point that the configuration is handling everything, or close to it.
So, with all of that in mind, that’s great as a concept, but in practice, you have to write the configuration down somewhere. You have to put it into a file or into a database, or you need some way of specifying what this is that the application can read or process in some way at runtime. The more general and configurable that your system is going to be, the more complex the config is going to become, the more complexity, at least, the config has to be able to handle. So, there’s kind of a natural progression here: I think the place that most applications start is with something that I’m going to refer to as pure data, where this could mean anything from a bunch of command line parameters that you specify when you run the application, some environment variables, a flat file on disk with a list of kind of key equals value pairs or something like that. Maybe use some common serialization format: YAML, CSV, JSON, S-expressions if you’re at Jane Street.
This often starts out as kind of unstructured data, and … Oh, that was not smart. Okay, cool. There’s a few kind of obvious pros here. This is easy to understand. You can look at one of these small, simple flat config files and say “All right, the port that the mail server’s going to listen on is 25,” that’s it. There it is. There’s no logic to have to decipher. There’s nothing opaque about it. Following on from that, files like this can be pretty easy to version. When you roll a new version of the application that changes the config format, it’s pretty easy to have an upgrader for the config file that just knows how to read the old version and turn it into the right internal type for the new version or something.
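As a rough illustration of that upgrader idea, here is a hypothetical OCaml sketch; the module and field names are made up, not code from the talk:

    (* Sketch: keep a reader for the old on-disk format alongside the
       current one, plus a function that converts old configs forward. *)
    module V1 = struct
      type t = { port : int }
    end

    module V2 = struct
      type t =
        { ports : int list
        ; log_level : string
        }

      (* Fill in anything the old format couldn't express. *)
      let of_v1 (old : V1.t) : t = { ports = [ old.port ]; log_level = "info" }
    end

The application then only ever works with the latest type, and old files get converted on the way in.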
The kind of obvious drawbacks to this approach are that it doesn’t give you much power. You can’t specify things of much complexity in this way, and when you try to, it can get pretty verbose. When you’ve kind of shattered the structure of your application, or of the thing that you’re configuring, to the point that you’ve got this really simple configuration format, it can get really repetitive.
I’ll just give a little example here: so, consider the case where we have added a little bit of structure. This is a completely made up config format in something that is JSON-esque. I haven’t actually checked if it’s fully coherent, but the idea is we have some keys, maybe we have a list of ports that we listen on, we have the log directory, we have the log level, and then we have this more structured data here for the next hop that we’re going to take, the next mail server we’re going to deliver messages to, based on the domain of the sender. Just for kicks here, we also specify the retry intervals: we’re going to retry sending this message every minute, and then every five minutes, and then every 15 minutes, just to show what it might look like when you add a little structure.
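A made-up, JSON-esque sketch of the kind of config being described (not the actual slide):

    {
      "ports": [25, 587],
      "log_dir": "/var/log/mail-server",
      "log_level": "info",
      "routes": [
        { "domain": "example.com",
          "next_hop": "relay.example.com",
          "port": 25,
          "retry_intervals": ["1m", "5m", "15m"] },
        { "domain": "example.org",
          "next_hop": "relay.example.org",
          "port": 25,
          "retry_intervals": ["1m", "5m", "15m"] }
      ]
    }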
Now, this looks okay. It’s not too long, but it can get pretty messy pretty fast. So, as soon as we add another domain, we need to handle … I could’ve written a file that went all the way off the board and it would not be necessarily a fun thing to work with. So, the natural thing to want to do in this case is to factor some of these common bits out. We’ve got a lot of values that are just the same in each of these fields. We’re always going to use this port, we’re always going to use these retry intervals, et cetera. So, why don’t we just pull these out to some defaults?
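The same made-up config with the common values factored out might look something like this:

    {
      "ports": [25, 587],
      "log_dir": "/var/log/mail-server",
      "log_level": "info",
      "route_defaults": { "port": 25, "retry_intervals": ["1m", "5m", "15m"] },
      "routes": [
        { "domain": "example.com", "next_hop": "relay.example.com" },
        { "domain": "example.org", "next_hop": "relay.example.org" }
      ]
    }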
All right, so, now we’ve pulled out the common things and then, in the case where we’re using the defaults, we just don’t specify anything. It looks a lot nicer, but you probably are going to end up with some field that doesn’t have a clear default that has maybe a few really typical cases, and in that case, you’re right back where you started. You’ve got this long config again, it’s hard to read, it’s not easy to work with. So, the kind of natural thing to want to do at this point is to just introduce macros, maybe variables, something like that. It can’t hurt, right? It can hurt. So, you go ahead and you start to develop this weird little ad hoc language which I have not done here, but you maybe even find a few different parameters that can expand in different ways to make it easier to write a nice, concise configuration file, and now, you’re kind of off to the races.
So, what if you actually set out to do that right? What if you set out to write a domain-specific language for your application? You could specify all of the things that you want it to be able to do and really take into account from the very beginning the kinds of things you want to be able to specify, which is great. You’re probably going to end up with something a little bit less cobbled together if you’ve stepped back and thought about the overall structure of the thing, and it’ll be a lot easier to add support for that one little feature that you could never quite make work with your previous attempt. To be honest, everyone really likes to pull the language designer hat off the shelf once in a while. It’s fun. It’s a fun thing to do. Finally, in your DSL, you can make it do all of the things that you might want to do. It can be kind of arbitrarily powerful, and you have full control, so it’s nice.
This is a badly photoshopped dog with a language designer hat on. I’ll just rest on that for a sec. So, that’s all great. These all sound like really nice things, but there is one major con, and that is that you are a terrible language designer. You’re just not good at it. Even if you’re good at it, you’re bad at it because you don’t have the time to actually do it, and so, as a result, you’ve now built this whole language and there are going to be problems with it. It’s going to have bugs. How are you going to distinguish bugs in your config language parser or interpreter from bugs in your core application? What is that going to look like at runtime when something’s broken and it’s a production issue?
The semantics of your DSL are going to be kind of new and different from anything that anybody else has ever done. You’ll probably take inspiration from “Oh, I really liked the way this language handled it,” and then “Oh, I really thought this was cool in that other language.” Now, you’ve got this mishmash of things that seemed good in your head but that maybe don’t seem good in anybody else’s head. So, you’ve got this thing that nobody else is going to understand, nobody else is going to come in having any experience with. It’s a brand new thing.
The tooling is going to be bad because it doesn’t exist, because it’s a new language, so you’re not going to have your IDE features like jumping to definitions, syntax highlighting, auto-completion and all that kind of stuff, you’re probably not going to have a way to write tests for code written in your DSL, and all of these things are going to be things you’re going to have to do yourself, if you decide that they’re important enough. You don’t have time to do that.
Then, I think finally, the transferability of skills is really hampered by a DSL. So, you might have somebody who is an expert in mail systems, they’ve run a million production mail servers, but they come in and now they’ve got to write some code in your crazy DSL. All of their knowledge, their experience, kind of goes out the window, to a certain extent, and when something’s broken at 2:00 in the morning, do you want to be the only person who understands how your DSL works? Because you’re going to be getting that phone call. I don’t want to be that person. I just included the dog again because it’s fun.
So, let’s look at a real example here. Let’s say that somebody in here, maybe me, decided to run their own mail server. You decide “Ah, I’m not going to write my own because I just want something stable,” so you decide to use a popular open-source mail server. This is a config that you might use to run an instance of this system. So, the idea here is we essentially extract the local part from the address, which is the part before the @ sign there, and we just deliver mail to a file that is named that in /var/mail. We look in /var/users to decide who we’re accepting email for and otherwise, we return some error, unknown user.
So, now you’ve got this, it works. In reality, it probably took a lot of time to get to that point, but now it works, and you decide “All right, I’m going to take on some users, I want to add some features.” So, those of you who have Gmail accounts might know that Gmail has this cool feature where you can append a plus sign and then some string to the first part of your address, to that local part, and it’ll still get delivered to your actual address and it’ll still have this full recipient, so you can do things like filter on it. So, let’s say we want to add support for that.
Here’s our config. We’ve now gone through and specified a few other things here. We decide to make it configurable, who has this feature enabled and who doesn’t? I’m not even going to go through this but hopefully you can see that there’s a lot of curly braces and there’s some weird things. There’s the fact that there’s this weird plus star thing over by the local part suffix there, and does the order matter between these different things? What is the syntax for expanding variables? Well, in some cases, it’s a dollar sign over here, but if you’re actually checking if it’s defined, there’s no dollar sign. There’s a bunch of just weird stuff here and, unless you’re familiar with the particular system, this is probably all going to be new to you. I think this config would actually work and it took a non-trivial amount of time to get it to the point where it would.
This is a relatively basic config and a relatively basic thing to want to do, and something that is actually supported pretty easily within this system, but you could imagine from this that, in a real production environment, it might start to get pretty messy. What are the scoping rules? Where does ordering matter? All of these things are questions that you’re going to have to understand the answers to if you’re going to try to run this in production.
So, what do you do? You’ve come down this road, you’ve built up this complex config language which … What do you do? You already have a tool that is good at describing these kinds of problems, you already have something that you are able to use very flexibly, that you have lots of libraries you’re familiar with and that you use every day. It’s your programming language.
So, what if we just tried to model this problem of running a mail server in your programming language? This is OCaml syntax. For those of you who don’t know OCaml, I’ll just walk through it real quick. So, we define some message type, which has all the things you’d expect an email message to have, in particular a sender, some recipients, and the actual body of the email, and then we define the type of the function, called handle_message … this should just not be there, ignore that … which takes a message and returns one of these three things: either delivered to signify that it has accepted and done something with the message, relay to signify that you want to take that message or some new message and send it to some further destinations, or reject with some error message to say “We’re not accepting this message,” for whatever reason.
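A rough reconstruction of the interface being described, in OCaml; the exact names and fields are guesses, not the code from the slide:

    type message =
      { sender : string
      ; recipients : string list
      ; body : string
      }

    type action =
      | Delivered          (* we accepted the message and did something with it *)
      | Relay of message   (* send this, or some rewritten message, onward *)
      | Reject of string   (* refuse the message with an error *)

    (* A config plugin supplies a single function of this type. *)
    type handle_message = message -> action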
That seems intuitively, to me at least, like the kind of thing that a mail server is responsible for doing. It takes messages in, like we talked about before, it either sends them on to somewhere else or accepts them itself. So, what if we then tried to add that feature that we were talking about a few minutes ago using this kind of OCaml-based config format? We’ve got here … imagine that this says handle message. We’ve got here this function which matches that type that I showed on the last slide, we split out the part before the plus and the part after the plus from the local part, and we check our user preferences to see if that user wants to support this plus functionality. If they don’t, then we reject it, and if they do, then we deliver the message. Straightforward, no weird syntax except for all the people who don’t know OCaml in here, but in your programming language, try to envision it. No weird syntax.
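A minimal sketch of that plus-addressing handler, using the types from the sketch above; wants_plus_addressing is a hypothetical stand-in for the per-user preference lookup:

    (* Hypothetical per-user preference lookup. *)
    let wants_plus_addressing user = List.mem user [ "dominick"; "drew" ]

    let handle_message (message : message) : action =
      (* Assume a single recipient of the form "local-part@domain". *)
      let recipient = List.hd message.recipients in
      let local_part =
        match String.index_opt recipient '@' with
        | Some i -> String.sub recipient 0 i
        | None -> recipient
      in
      match String.index_opt local_part '+' with
      | None -> Delivered
      | Some i ->
        let user = String.sub local_part 0 i in
        if wants_plus_addressing user
        then Delivered
        else Reject "plus addressing not enabled for this user"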
This is something that we can actually do in OCaml. This is what that ocaml_plugin library that I mentioned before makes it easy to do, but it’s something that you can do in lots and lots of different languages. The list is here. I’m not even going to bother reading it off, but this is a pretty common workflow. So, yay! Everything’s great. We’ve got our language, we’ve got our tooling, everything’s happy again. We don’t have to worry about the weird semantics of some DSL.
This is the part in the old talk where we’d stop and pull up some flashy example, we’d do a live demo, everybody would be like “Oh my God, look! It worked!” Then, the crowd would go completely insane and I would just stand here, bask in it for a while and then I’d go home. Obviously, that’s not what’s going to happen tonight, but we got really excited about this. ocaml_plugin seemed really cool. It seemed like a new superpower, a new way that we could build extensible systems without having to deal with all of the pain of the other solutions that I talked about.
So, we started with kind of configs this big in whatever DSL or pure data format that we had, and then we moved to OCaml and they got nice and short, concise, easy, easy to read, understandable, but when you give people this power, in particular software developers who like to write code and like to generate lots and lots of lines of code, they’re going to use it. So, usage really ballooned. People were writing all kinds of applications using ocaml_plugin, and we ran into a few issues. The first of these is it’s dangerous. Code is code. It can do anything and you probably will have a hard time preventing it from doing things that you don’t want it to do. In the case where you’re using this code to configure a production system, it’s going to be running inside your production process as your production user on your production box and, if you don’t trust the people writing your configs, then you probably shouldn’t be happy to do that.
So, we rolled out a bunch of production systems, including some where users were actually writing the configs, where there was this separation between the application developers and the config writers. Everything was great. The users were really happy, they wrote lots of configs and the world was merry. Then, we wanted to fix some bugs in the core application, wanted to add some features, whatever, wanted to roll a new version, and “Oh, we’re completely screwed.” We didn’t think about versioning.
Why is this a problem? So, when you’re compiling code in your application to use as configuration, that code has to be able to compile against whatever version of all the libraries that are linked into your application, and if you didn’t build some kind of discipline around versioning, then if the core application has linked in some new version of a bunch of libraries and your plugins haven’t been updated to use the interfaces that have changed correctly, then your plugins just aren’t going to work anymore, and we literally abandoned applications because we made this mistake, didn’t think about it. We’re like “Huh, what do we do now? Guess we’ll just start again.” That actually happened in a few different cases, I think. So, that’s a complication of plugins that we hadn’t really considered.
Related to that is the fact that now you’ve got a bunch of other files with code in them floating around. Maybe they live in various sort of users’ directories or maybe they are next to the application, but the point is there’s a bunch more code hanging around out there and you need some story around how you deploy it, how you track it, how you update versions of your plugin to match new versions of your application, and the lack of centralization by default would prove to be an issue, in some cases. This is related to the versioning thing, as well.
I think maybe the most surprising thing, at least to me and, maybe, to many of you is that the tools were still really bad. This is a dog with a tool. There’s one more dog image, I promise. This seemed surprising because the promise of plugins was “All right, we can use our own programming language, we can use all of our tools. We’ve got our editor, we’ve got our syntax highlighting, we can write tests in all the libraries we’re familiar with,” whatever. All that’s great, but it turns out that the plugin architecture actually doesn’t typically slot right in to all those tools in the way you might expect. You’ve now got this standalone file. How does your editor know where to find dependencies to jump to definition and stuff like that? You’ve got just a bunch of things where you’re treating this a little bit different from the rest of the code in your code base because it does not live inside your code base necessarily, and it turns out that actually breaks a bunch of assumptions in various places.
You might not care about this. You might be like “Oh, well, it works well enough and it’s still easier to write than crazy DSL,” but when you realize that most of your application is now written in these plugins, suddenly, that becomes a lot more of a problem. Remember that line that we pushed over earlier? When you get to the point that the core application is like this and the plugins and the configuration are like this, you might care a lot more that your tools don’t work well.
So, looking back to that little configuration before, that’s what we started with: nice and simple, it’s in 12 or 13 lines. Here’s our real production configuration for our mail server. It’s four thousand lines of OCaml, which is a lot more than 13 lines. This is a real thing that we … like, I did this today, and the fact that our tools wouldn’t work nicely with plugins sort of naïvely was a big issue for our own development processes.
So, the natural inclination here is to just roll back that line, get to the point where you’re back in zeroconf territory, where you just deploy new versions of your application when you need to change things, and that’s actually what we did, in a lot of cases. We took the plugins and we bundled them with the application. We kept them separate, we still loaded them with ocaml_plugin, but we deployed them all together and, when we changed the plugins, we deployed a whole new version of the application.
This is actually how we solved this problem in various places, and it kind of points out the fact that it wasn’t the plugin functionality that we necessarily were looking for, it was this nice, clean separation between the core application and the configurable bits of the application. Having the kind of escape hatch of being able to go at runtime and change the plugin if you need to make an emergency change felt kind of nice, too, and so we were relatively happy with that approach, but it didn’t really solve the original problem that we were looking to solve. Let’s think a little bit about that.
So, we took this journey from that nice, simple config to our DSL to just naïve, completely unfettered plugins that can do whatever they want and they’re scattered all over the place, and we kind of came to the conclusion that there were two different routes that we could take to deal with this. There’s the run away approach, which is what we actually took, like I said, in a lot of places, like “Okay, that was cool, let’s go do something else now.” I think, in practice, for several applications that used ocaml_plugin, we ended up saying “Actually, the flexibility of plugins wasn’t really necessary. Let’s just deploy the plugins with the application,” and we’ve still got OCaml for the configurable parts of our application, but we don’t have any of these annoying problems to deal with, as far as plugins go.
Then, the other approach was to basically improve the tooling as much as we could. We did a bunch of work on systems to make the plugin environment less painful to work with. We developed entire systems with UIs for editing files and for keeping them in version control and for rebasing them and merging the sort of disparate branches when changes occurred … things that exist in the world and that there are tools to do, but that we kind of did again without necessarily realizing it to make this plugin environment a little bit nicer to work with.
So, if we step back for a second here and just think about the things that we like about writing code, the things that we like about solving problems with code, first of all, the obvious one: code is expressive. You can express anything in your programming language. You aren’t limited by what some config language designer happened to think of beforehand, you can be really general and you can tackle lots of different problems in your language, so that’s nice. Code scales well. We’re familiar with the scaling problem as far as code goes. We’ve worked with lots of big applications. It’s something that we have tools to deal with. We have tools, and our tools work nicely with our programming language, and being able to use those to solve problems is something that gives us superpowers, as well. Then, this is the last dog photo.
Then, there’s this thing that I call the culture of code, which is that we treat code differently from the way that we treat other things, the way we treat config languages, or config files I should say, and particularly, that means we write tests for code. We re-factor code when it gets hard to read. We do code review on code. These are all things that, by default, people don’t really think about doing for configuration, and it’s something that you get a lot out of in terms of the quality of the code and the correctness of the code. So, these are all things that are nice about using code to solve problems.
So, we kind of honed in, given those things, on two different approaches for getting these attributes from any configuration system without the problems that we found with ocaml_plugin. The first of these is something I’ll call config generation. The idea behind config generation is what if we took those nice bits about a plugin architecture, the fact that we can use code to solve problems, and we combine them with the sort of predictability and the transparency of that pure data approach, where you can actually just look at the config and see what the application’s going to do at runtime, given that.
The value from the plugin approach comes from writing your configs in a programming language, but that’s not to say that your app needs to read the configs in that language. So, this is the kind of founding principle for this idea. There’s a few requirements that you need to have. This doesn’t work for everything. The first is that you need to be able to finitely specify your configuration in some way. There needs to be some thing that eventually gets spit out that your application reads. It can’t be a function, and you need some way of turning whatever it is in your application that you need to configure into something that you can actually write out in that way.
You need to think about versioning as well here. You need some way of versioning this format that you end up writing into a file or wherever else you decide to store it, but it’s a lot simpler than the plugin versioning problem, you just have to pick a version for your config format on disk and you can maybe have some transformation from one version to the next. It has all the benefits that the pure data approach has, in that respect.
It’s kind of a natural progression from that handwritten pure data approach to then move to some config generation approach where you write some code that actually spits out a thing that is in the same format as that pure data was and you can kind of just pivot to that without having to make any big changes in your application.
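A small sketch of what that pivot might look like in OCaml (all the names here are made up): the routes are computed in ordinary code, but what gets written out is still the same flat data the application already reads.

    (* Hypothetical config-generation script. *)
    type route = { domain : string; next_hop : string; port : int }

    let routes =
      (* Code can loop, share defaults, and compute values... *)
      List.map
        (fun domain -> { domain; next_hop = "relay." ^ domain; port = 25 })
        [ "example.com"; "example.org"; "example.net" ]

    let () =
      (* ...but the output is plain data, which you can read, diff, and
         commit before the application ever sees it. *)
      List.iter
        (fun r -> Printf.printf "route %s -> %s:%d\n" r.domain r.next_hop r.port)
        routes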
Something else that is nice about config generation is that it’s not actually restricted to apps that you wrote either, which is nice. You can write code in whatever programming language you want to generate the configs that go into some third-party system that you’re stuck using, and that can be nice, too.
At Jane Street, practically speaking, we settled on two different approaches for using config generation, and they both … once you have the versioned type, they both kind of fall out naturally from it. One is you just have some script somewhere written in OCaml which generates your config, and that script is maybe pinned to a particular version of the OCaml compiler and libraries so that you are always using that set of libraries to generate this version of the config. Then, the other is to basically have something that looks a lot like our regular dev process, to have a repo in code review with a build daemon and a bunch of tests, and to just keep your config generator there, probably to actually commit your generated config into version control as well so that you can easily see differences from one version to the next and also to be able to sort of easily roll back to some previous version without even needing to run the generator again.
If this sounds like our regular dev process, that’s because it is and that’s what’s great about it. There are a couple of other nice attributes that we get out of it. So, this is a kind of simplification of the configuration file that we use to configure our monitoring system that we have here, and you can see here we give some list of hosts, we have some set of services associated with various hosts that we’re going to monitor and then, in here, we specify the actual commands that we run to check those services and the time periods that those things should run in.
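Something with roughly this flavor, as a hypothetical simplification in s-expression form (not the real format):

    ((hosts (mail-01 mail-02 web-01))
     (services
      ((host mail-01) (checks (smtp disk)))
      ((host mail-02) (checks (smtp disk)))
      ((host web-01)  (checks (http disk))))
     (checks
      ((name smtp) (command "check_smtp") (period every-minute))
      ((name http) (command "check_http") (period every-minute))
      ((name disk) (command "check_disk") (period every-five-minutes))))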
So, this is a format that we can generate, and it is one that we generate. The nice thing about config generation is that you can actually see the generated config file before you load it into your application. You know exactly what your application is going to have for these various fields at runtime just from looking at this flat file in a way that it might be harder to do if you had to look at a bunch of code.
I think the other thing that’s nice about this is you can generate other things, too. You can generate files in whatever format you want and, in this case, I kind of generated a CSV that shows the actual resolved set of hosts and checks that we’ll run. The nice thing here is, if we take the diff between two versions of this generated file, it’s very clear what effect a particular change will have on our system. So, “Oh yeah, we removed some old check on one hostname and we added a new host with a few checks.” That’s something that you don’t really get as cleanly with config that’s actually written in code.
So, the other approach that we took was to try to actually tackle those issues that we talked about with plugins directly. So, just to recap those quickly, we talked about versioning. Versioning is difficult, we need some story for that. We talked about the danger that a plugin that’s got a bug in it or raises a top-level exception or something like that has for the actual stability of the application that’s using the plugin. We talked about the inconsistency with which plugins are managed with respect to the rest of the code, in terms of where the files live, how do we deploy them, how do we keep them up to date, all of that. Finally, we talked about the tools.
So, what we did here at Jane Street is we built a system to address each of these issues. The system’s called plugd, and the basic idea is just centralize everything. Take all of your configs, put them in a repo, treat them like you treat application code. Only serve up safe versions of plugins … that means versions of the plugins that have been code reviewed, that compile, that have been tested, all that kind of stuff, and serve those up to the application in such a way that the application doesn’t have to know where the plugins live. So, the application itself is kind of decoupled from the plugins in this way. It also introduces a workflow for keeping plugins that are used for older versions of the application up to date by basically shuffling patches between versions and sort of keeping them all in sync.
So, this was a lot of work. We had to build real infrastructure to make this work, but the plugin architecture does get you something that the config generation approach doesn’t get you, which is that you do have this real runtime dynamism. It’s the fact that you can have a function that is used for your configuration and you’re not restricted to things that you can actually write down in a static file.
I think I kind of summarized the overall decision tree here for how you should decide what to use, and that’s probably illegible, but the idea is, first of all, do you need Turing completeness? If you don’t need that runtime dynamism, then config gen is a great approach. It works really well, we use it for lots of things here. The next is this question of the decoupling of the application developers and the plugin writers. So, do you actually need the plugins to be able to live separately from the application or can you just bundle them up with the application and deploy them like we found we were able to do with a bunch of our own applications? If you don’t, then you should probably just do that, and if you do, then you should continue down. If you don’t care about stability at all, then just do the naïve plugin thing. Fine, sure. It’s going to break. No big deal, but if you do, then you probably need some story for disciplined versioning of your plugins, and that’s what we ended up with, with plugd.
So, I think one final takeaway which isn’t actually on the slide here is that maybe you just shouldn’t use plugins? We found in a lot of cases that plugins seemed like an attractive tool because they forced us to be really explicit about the separation between the core application and the configurable functionality, but in a bunch of cases, it turned out when we looked back at what we had ended up with, we didn’t really need plugins to get that. We could’ve just structured the application in that way. I think that’s the kind of allure … the siren’s call of plugins drew us in but, in practice, it’s not actually been that beneficial to us in a bunch of cases, so I said, “All righty, sure,” because you probably do need stability.
And that’s it. Any questions?
How is plugd configured?
I think plugd is in more of the kind of zeroconf camp, in the sense that it has pretty sane defaults and then it’s tightly integrated with our code review system and with our build daemon, so a lot of the configuration is kind of offloaded onto that. It depends on there being a feature in our code review system that the plugins are all based on, and so it knows it can get most information from there. Does that match your understanding, Ron?
Affirmative.
Okay. Yeah?
What elements of the config generation versus flat configs versus plugins do you need to have available outside the developer group for researchers, operations, [inaudible 00:43:12] or [inaudible 00:43:12]
That’s a good question. So, one of the problems with plugins that I didn’t say explicitly is that you have to write code to write plugins. If you’re going to configure a system with OCaml then you need to know OCaml. We benefit in a lot of cases from the fact that many people here, even outside the tech groups, are technical and are comfortable reading or, in a lot of cases, even writing code, so that helps with that, but we definitely have lots of cases where we have chosen to stick with a simpler config format, something that a human could just open up and clearly see which bits they need to change to make some functionality change, because of that exact problem.
A thing that we do in various cases is, even in an application where we use ocaml_plugin to get that true dynamism, we’ll define a simpler config format inside the plugin which the plugin then … like “Yo dawg, we put a plugin inside your plugin,” and so, that has actually helped us in a lot of cases as well. Drew, you want to add something?
I just wanted to point out that interestingly for us, the case where the plugin stuff is useful is actually where we want to write the validation of it [inaudible 00:44:16] application [inaudible 00:44:16] the tech group [inaudible 00:44:16] then let the traders go and [inaudible 00:44:16] in their own thing because, actually, the traders [inaudible 00:44:16] like that’s the case where the application and the configuration is so separate [inaudible 00:44:34].
And you can then have people more familiar with the application do code review on the plugins that the traders have written and help to ensure that they’re safe and doing reasonable things, and not going to bring the application down when they run, which is really nice. Yeah?
Can you elaborate a bit more on the [inaudible 00:44:55] version problem? I ask that because when you mentioned it the first time, “The first thing I’m talking about is … “ Wait a minute. These people aren’t using version control? [crosstalk 00:45:05]-
That’s not quite what I meant.
… and actually, in one of your subsequent bullet points, you sort of implied that there was some stuff that wasn’t under version control.
I think we version control most things, and you definitely want to version control things in every case that you can. The point I was trying to make is that with configuration by default, you’re not as sort of forced into version control as you tend to be with kind of core application code, but what I meant by that [inaudible 00:45:31] version problem is, if you’re going to compile at runtime some code and dynamically link it into your application, you need some libraries, some dependencies, that that code is going to use to be available at that time. The way ocaml_plugin works is it basically bundles these dependencies up into the application, it gzips the compiler and tacks that on to the end of the executable for the application, then at runtime, it expands all that stuff, it runs the compiler on the plugin and the dependencies, and then sort of dynamically loads in the library that results.
In the case where you’ve got this static version of all of the libraries that you depend on bundled into the application, if the plugin writer goes and writes their plugin with respect to some newer or older version of the libraries, the plugin isn’t going to compile anymore when you try to actually do that at runtime, and that problem ends up being kind of a complicated one as far as keeping the version of the libraries that the application is going to provide and the version that the plugin is going to expect in line. Does that make sense?
Yes.
Yep?
So, one approach that I maybe didn’t mention … I mean, if you have the end user, if you write your application as a library and have the end user wrap that in some small program that does the configuration. Did you consider doing that? That seems like an approach.
Yeah, where essentially, you have some library and then, instead of configuring an application, you just build an application that uses that library. Yeah, that is an approach that is taken a lot in various open source-y projects, I think. It’s not something that I would call configuration in the same way, but it’s something that we definitely use. Our entire code base is just a bunch of libraries that we have worked on as a collective and that we use in various ways so, depending on where you draw the line between what is configuration and what is just using the libraries that we have, it’s a little bit hard to say, but there are certainly cases where we have built up a library that essentially gives you a function to start an application, take some arguments or whatever, and then people go and build instantiations of that and run them however they see fit. Does that answer your question or … ?
Well, I mean, so when do you choose to use that approach versus the plugin bundling approach that you went to?
I see. I don’t have a really nice, succinct answer for you. I think we tend to just kind of draw the line based on the amount of functionality that is going to be bundled into the application and the amount that is left open to the user. Libraries tend to be more general things, even than very configurable applications, I think, in that they have more components that you can kind of mix and match. I think Ron probably has a nice way to describe it.
I think another answer to your question is you need plugins for true runtime dynamism, which is to say you actually want in the middle of the execution of your program to change the behavior [inaudible 00:48:38], and then I think if you’re not in that world where you’re just like “I need to set something up and run it,” and you use code to [inaudible 00:48:47] library or maybe load [inaudible 00:48:48] plugin, I think it’s just the same thing, slightly different in terms of the tooling but, semantically, it’s just the same scope.
Yeah?
Do you ever wish you had sandboxing for plugins [inaudible 00:49:05]
Yeah, so, we have written some libraries that give us a semblance of this, where they basically spin up a bunch of workers and the workers build the plugins so that the master process is not actually affected by a particular plugin going down. It’s a pretty complicated architecture. It requires some more plumbing than you might want. So, in practice, it hasn’t actually played out as a really dominant use case, I think, but it’s there as an option, for sure. Yeah?
Can you talk about … Excuse me … your [inaudible 00:49:40]
Yeah.
How do you do that without a sandbox [inaudible 00:49:45]
So, that’s where code review, tests, compilation all come into play. So, we-
[inaudible 00:49:53]
… Yeah, exactly. So, we have a build_daemon. It compiles and then also runs some set of tests on every release of a feature, and if the tests don’t pass, then it doesn’t validate the feature.
[inaudible 00:50:09] process [inaudible 00:50:10] core application versus a plugin, or [crosstalk 00:50:12]
That’s right, yeah. So, in the plugd model, we have kind of a separate build workflow for the plugins, where it runs tests for each plugin and builds them sort of independent from the application before the application will ever get served that plugin. Yep, in the back?
How many people does it take on your team to build and maintain this whole infrastructure?
The plugd infrastructure? Two people built it? One person built it?
Yeah, I think … it was mostly Nathan.
Nathan, yeah.
[inaudible 00:50:44] and now [inaudible 00:50:46].
Now Wang owns it, yeah.
So, it’s got one [inaudible 00:50:48]
One-ish person, but-
A fraction of one person.
… Yeah, that’s right, but a lot of time from that combination of people. Yeah?
How are you maintaining survivability if Wang doesn’t show up for work? Because that’s a big issue.
This is a general problem, right? This is a general problem with everything. Somebody’s responsible for it and, if that person disappears, then you need to find a new person to be responsible for it. I don’t think … the sort of production operational responsibility for it, I think there’s at least one other person. There are at least two other people who are pretty familiar with the way that all works. So, if he went on vacation and then it broke, they would know how to fix it, but this is a thing that we face every day, I think.
Do you normally keep it in the workflow or do you just maintain a standing operating procedure for proper commenting or do you structure for when you create it? [inaudible 00:51:43] a personal issue. [inaudible 00:51:44] is making sure that others could step in?
What you build in the application, you mean?
Yeah.
I mean, depending on the application, we require some number of other people to have reviewed the code and reviewed changes to it, which gives some kind of natural understanding of the overall structure of it. We encourage documentation but, just like anywhere else, it’s spotty, and I think we kind of trust the fact that people here are pretty smart and, if they needed to learn how something worked, even relatively quickly, that they would be able to. I don’t know. In practice, it’s been pretty good. Things that are really important and that we’re really worried about for this particular thing? We just make sure that there are lots of people who know how they work and who are keeping up with their day-to-day operation.
Yeah?
Do plugins ever migrate to features in version two [inaudible 00:52:32] your users are going to tell you the things they want [inaudible 00:52:35]
That’s a good question. I think probably yes? I can’t think of specific examples but Drew’s nodding his head kind of really vigorously. Can you think of an example or no?
[inaudible 00:52:50]