<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Michael Bang&apos;s blog</title>
    <description></description>
    <link>https://blog.vbang.dk/</link>
    <atom:link href="https://blog.vbang.dk/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Sun, 29 Jun 2025 19:45:33 +0000</pubDate>
    <lastBuildDate>Sun, 29 Jun 2025 19:45:33 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>

    
      <item>
        <title>Tools I love: mise(-en-place)</title>
        <description>&lt;p&gt;Once in a while you get introduced to a tool that instantly changes the way you work. For me, &lt;a href=&quot;https://github.com/jdx/mise&quot;&gt;mise&lt;/a&gt; is one of those tools.&lt;/p&gt;

&lt;p&gt;mise is the logical conclusion to a lot of the meta-tooling that exists around language-specific version and package managers like &lt;a href=&quot;https://asdf-vm.com/&quot;&gt;asdf&lt;/a&gt;, &lt;a href=&quot;https://github.com/nvm-sh/nvm&quot;&gt;nvm&lt;/a&gt;, &lt;a href=&quot;https://docs.astral.sh/uv/&quot;&gt;uv&lt;/a&gt;, &lt;a href=&quot;https://github.com/pyenv/pyenv&quot;&gt;pyenv&lt;/a&gt; etc. It makes it exceptionally easy to install, use, and manage software. It also allows you to manage &lt;a href=&quot;https://mise.jdx.dev/environments/&quot;&gt;environment variables&lt;/a&gt; and &lt;a href=&quot;https://mise.jdx.dev/tasks/&quot;&gt;declare tasks&lt;/a&gt; (run commands).&lt;/p&gt;

&lt;h1 id=&quot;trying-out-new-tools&quot;&gt;Trying out new tools&lt;/h1&gt;
&lt;p&gt;The first step in getting an intuitive understanding of what mise can help you with is to use it to install a tool. Pick your favorite and try it out; it supports &lt;a href=&quot;https://mise.jdx.dev/registry.html&quot;&gt;&lt;em&gt;a lot&lt;/em&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;I recently read about &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jj&lt;/code&gt; in &lt;a href=&quot;https://registerspill.thorstenball.com/&quot;&gt;Thorsten Ball’s newsletter&lt;/a&gt; and decided to try it out (again). I crossed my fingers and hoped that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jj&lt;/code&gt; was &lt;a href=&quot;https://mise.jdx.dev/registry.html&quot;&gt;one of the tools supported by mise&lt;/a&gt; and, to my delight, it was! The process looked something like this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;jj
command_not_found_handler:5: &lt;span class=&quot;nb&quot;&gt;command &lt;/span&gt;not found: jj

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;mise use jj
mise ~/projects/examples_mise/mise.toml tools: jj@0.30.0

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;jj version
jj 0.30.0

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; ..

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;jj version
command_not_found_handler:5: &lt;span class=&quot;nb&quot;&gt;command &lt;/span&gt;not found: jj

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;examples_mise

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;jj version
jj 0.30.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As the above shows, with mise we’re just one command away from installing and trying out a new tool, e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise use jj&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the above we see that mise printed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise ~/projects/examples_mise/mise.toml tools: jj@0.30.0&lt;/code&gt;. This tells us that mise has created (or updated) the mise configuration &lt;em&gt;at that path&lt;/em&gt;. &lt;br /&gt;
We also see that if we cd out of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/projects/examples_mise&lt;/code&gt;, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jj&lt;/code&gt; command is no longer available. If we cd back into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/projects/examples_mise&lt;/code&gt;, it becomes available again: unless you explicitly install tools globally, mise only makes available the tools mentioned in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise.toml&lt;/code&gt; file somewhere on the path from your current directory up to the root of your file system. That of course means that we may encounter multiple &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise.toml&lt;/code&gt; files on the way up to the root. mise handles this by merging the configurations, resolving conflicting entries in favor of the file furthest down the tree.&lt;/p&gt;
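&lt;p&gt;To make the layering concrete, here’s a small hypothetical setup (paths and versions invented for illustration) where a parent directory pins one Python version and a subdirectory overrides it:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# ~/projects/mise.toml
[tools]
python = &quot;3.8&quot;

# ~/projects/examples_mise/mise.toml
[tools]
python = &quot;3.11&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Inside &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/projects/examples_mise&lt;/code&gt; both files apply, but the deeper one wins, so Python 3.11 is activated; anywhere else under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/projects&lt;/code&gt;, Python 3.8 applies.&lt;/p&gt;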

&lt;p&gt;This is a clever design as it allows us to configure different versions of the same tool to be available in different directories. Let’s have a look at what the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise.toml&lt;/code&gt; file looks like:&lt;/p&gt;

&lt;div class=&quot;language-toml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nn&quot;&gt;[tools]&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;jj&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;latest&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If we want a specific version of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jj&lt;/code&gt; to be installed in a specific directory, we just update the TOML file to say e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jj = &quot;0.30.0&quot;&lt;/code&gt;.&lt;/p&gt;

&lt;h1 id=&quot;managing-multiple-versions-of-a-tool&quot;&gt;Managing multiple versions of a tool&lt;/h1&gt;

&lt;p&gt;Let’s see what it looks like to use mise to manage Python versions for two projects with different requirements:&lt;/p&gt;

&lt;script src=&quot;https://asciinema.org/a/hLKhxRzzoDwHOJyBhkNsVl3pL.js&quot; id=&quot;asciicast-hLKhxRzzoDwHOJyBhkNsVl3pL&quot; async=&quot;true&quot;&gt;&lt;/script&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;tree
&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
├── project_new
│	└── mise.toml
└── project_old
    └── mise.toml

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;project_new/mise.toml
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;tools]
python &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;3.11&quot;&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;project_old/mise.toml
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;tools]
python &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;3.8&quot;&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;project_new
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;python &lt;span class=&quot;nt&quot;&gt;--version&lt;/span&gt;
Python 3.11.13

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; ../project_old
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;python &lt;span class=&quot;nt&quot;&gt;--version&lt;/span&gt;
Python 3.8.20
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When we cd into one of the directories listed above, mise automatically makes the version of the tool configured in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise.toml&lt;/code&gt; available to us. If it isn’t already installed, mise will install it for us. The implication of this is that you can commit a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise.toml&lt;/code&gt; to your repository, and anyone that has mise installed will automatically get and use the expected dev tools when they enter the project directory. And when it’s time to upgrade a dev tool, you can just update the version number in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise.toml&lt;/code&gt; and everyone will start using the new version!&lt;/p&gt;

&lt;h1 id=&quot;use-in-cicd-pipelines&quot;&gt;Use in CI/CD pipelines&lt;/h1&gt;

&lt;p&gt;The fact that mise makes tools available to you according to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise.toml&lt;/code&gt; file in your current working directory has further implications: it’s not just developer machines that can benefit from using mise; CI/CD pipelines can benefit greatly as well! When you use mise in your pipelines, you avoid the problem of out-of-sync versions between developer and build machines. You get a single place to configure the versions of your dev tools, everywhere!&lt;/p&gt;
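&lt;p&gt;As a rough sketch of what this might look like (the install one-liner is the script documented on the mise website; adapt the steps to your CI provider, and assume &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;python&lt;/code&gt; is one of the tools pinned in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise.toml&lt;/code&gt;), a pipeline could boil down to:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# install mise itself
curl https://mise.run | sh

# install all tools pinned in mise.toml
mise install

# run commands with exactly the tool versions developers use
mise exec -- python --version
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;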

&lt;p&gt;As I mentioned in the beginning, besides managing dev tools, mise also allows you to &lt;a href=&quot;https://mise.jdx.dev/tasks/toml-tasks.html&quot;&gt;declare and run so-called tasks&lt;/a&gt;. Think of a task as an advanced invocation of a bash script. Even if we use tasks as just plain bash scripts (they can do a lot more), it can be a major advantage to declare common operations such as building, testing, linting etc. as mise tasks, since all developers get access to them and will run their commands in exactly the same way every time. If you’re diligent in your naming, you can even make the experience of building or testing across projects identical.&lt;/p&gt;

&lt;p&gt;The following are examples of some very simple Python-related tasks declared in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise.toml&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-toml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nn&quot;&gt;[tasks.install-deps]&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;run&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;uv pip install -r requirements.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;nn&quot;&gt;[tasks.test]&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;run&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pytest .&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Adding this to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise.toml&lt;/code&gt; will make the commands &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise install-deps&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise test&lt;/code&gt; available. Again, if you check this in to your repo, the commands will be available to all developers and pipelines. And reusing the same task names in your Rust projects means you can use identical commands to have Cargo fetch your dependencies or run your tests.&lt;/p&gt;
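&lt;p&gt;Tasks can also reference each other. As a small sketch using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;depends&lt;/code&gt; key from the mise task documentation, we could make &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test&lt;/code&gt; install dependencies before running:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[tasks.install-deps]
run = [&quot;uv pip install -r requirements.txt&quot;]

[tasks.test]
depends = [&quot;install-deps&quot;]
run = [&quot;pytest .&quot;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With this, running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mise test&lt;/code&gt; first runs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;install-deps&lt;/code&gt; and then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pytest&lt;/code&gt;.&lt;/p&gt;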

&lt;p&gt;Once you’ve declared your tasks you should of course also use them in your CI/CD pipeline. Doing this makes you less dependent on the particular YAML syntax and arbitrary requirements of your provider, and makes it easier to move to another one if you need to. It also ensures that there’s a standard way to build and test your code, helping to further reduce the amount of “it works on my machine”.&lt;/p&gt;

&lt;p&gt;There’s a lot of depth to what mise can help you automate. It’s a lovely tool and I hope I’ve piqued your interest enough to give it a try!&lt;/p&gt;

&lt;h1 id=&quot;security-concerns&quot;&gt;Security concerns&lt;/h1&gt;

&lt;p&gt;Although this is a very obvious problem, I want to make it explicit: a major concern of all software dependency management is control of your supply chain; how easy it is for somebody to insert malicious code into a binary you will run hugely impacts the integrity of your systems and data. Depending on your industry, it might not be feasible to use mise, as it’s pretty opaque where your dependencies will be downloaded from.&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;I&apos;m hoping to find the time to write a series of posts over the summer on tools that I love. Here&apos;s the first one which I fell in love with just 3 months ago: mise&lt;br /&gt; &lt;a href=&quot;https://blog.vbang.dk/2025/06/29/tools-i-love-mise/&quot;&gt;https://blog.vbang.dk/2025/06/29/tools-i-love-mise/&lt;/a&gt;&lt;/p&gt;&amp;mdash; Michael Vinter Bang (@micvbang) &lt;a href=&quot;https://twitter.com/micvbang/status/1939384107823137162?ref_src=twsrc%5Etfw&quot;&gt;December 26, 2024&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

</description>
        <pubDate>Sun, 29 Jun 2025 17:07:10 +0000</pubDate>
        <link>https://blog.vbang.dk/2025/06/29/tools-i-love-mise/</link>
        <guid isPermaLink="true">https://blog.vbang.dk/2025/06/29/tools-i-love-mise/</guid>
      </item>
    
      <item>
        <title>On a great discussion</title>
        <description>&lt;p&gt;This isn’t for every problem, it doesn’t happen every day, and it isn’t achievable with just any group of people. But sometimes, under exactly the right circumstances, you might be lucky to find yourself in a discussion where everything and everyone seem to just.. click.&lt;/p&gt;

&lt;p&gt;Although the discussion starts out with many, drastically different points of view, the atmosphere is good and smooth and nice throughout. Everyone is respectful of each other and makes sure that there’s enough space for anybody who wants to make a point. Arguments are given and accepted in good faith. Participants iterate and build on each other’s ideas and will help to argue viewpoints different from, and even opposite to, the ones they started with. This isn’t done out of a feeling of fairness or to make someone feel good, but because someone sees an important point that strengthens another person’s argument. Similarly, everybody pitches in with reasons why some arguments are weaker than they otherwise seem. It doesn’t matter who brought up which argument; everybody understands that the point is to find the best possible solutions under the current constraints, and nobody keeps score.&lt;/p&gt;

&lt;p&gt;When everyone has contributed what they want and the discussion dies out, the path forward has hopefully become clearer. The group makes the best decision it can with the information available. Incentives are aligned just right: everyone is in the same boat and, win or lose, the responsibility, the outcome, and the rewards of the decision are shared within the group.&lt;/p&gt;

&lt;p&gt;I participated in one of these discussions today. I’m happy that I didn’t notice what was happening until it was almost over. I’m afraid that I would’ve sat in awe and ruined the moment and the flow of conversation by making everyone aware.&lt;/p&gt;

&lt;p&gt;When the meeting had ended and we were packing up, I did mention it. Everyone agreed that it had been a great experience. Today was a great day!&lt;/p&gt;
</description>
        <pubDate>Mon, 17 Feb 2025 18:07:10 +0000</pubDate>
        <link>https://blog.vbang.dk/2025/02/17/on-a-great-discussion/</link>
        <guid isPermaLink="true">https://blog.vbang.dk/2025/02/17/on-a-great-discussion/</guid>
      </item>
    
      <item>
        <title>Simple event broker: data serialization is expensive</title>
        <description>&lt;p&gt;In the &lt;a href=&quot;/2024/07/10/seb-tiger-style/&quot;&gt;last post&lt;/a&gt; I described my weekend project of using advice from &lt;a href=&quot;https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TIGER_STYLE.md&quot;&gt;Tiger Style&lt;/a&gt; to optimize the write path of &lt;a href=&quot;https://github.com/micvbang/simple-event-broker&quot;&gt;Seb&lt;/a&gt;.
Here, we found that data serialization and memory allocations were big contributors to the application being slower than it could be, and profiling helped us identify places on the write path where batching and buffer reuse could greatly improve the throughput. With a few small changes, we doubled the number of records that Seb can write to disk per second!&lt;/p&gt;

&lt;p&gt;In this post we’re going to use those learnings as a guide to do the same thing on the read path. In order for the posts not to be almost identical, this time we’ll focus on how seemingly minor changes to function signatures can have major impacts on performance.&lt;/p&gt;

&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;/h2&gt;

&lt;p&gt;Since we already covered how to record performance profiles in the last post, we’ll skip it here. Instead we’ll go directly to a high-level picture of Seb’s read path, and then look at a profile of the code (at &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/tree/19a5bde1f5359b2b1c556bb7df288273a6b416d8&quot;&gt;19a5bde1&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;A high-level overview of Seb’s read path:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-09-10-seb-read-performance/architecture_seb_read_path.png&quot; alt=&quot;High-level overview of Seb&apos;s read path&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here, we see that the read path starts with an incoming HTTP request which is handled by an &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/19a5bde1f5359b2b1c556bb7df288273a6b416d8/internal/httphandlers/getrecords.go#L22&quot;&gt;HTTP handler&lt;/a&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(1)&lt;/code&gt; and sent to the &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/19a5bde1f5359b2b1c556bb7df288273a6b416d8/internal/sebbroker/broker.go#L26&quot;&gt;Broker&lt;/a&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(2)&lt;/code&gt;. The Broker ensures that a relevant instance of &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/19a5bde1f5359b2b1c556bb7df288273a6b416d8/internal/sebtopic/topic.go#L38&quot;&gt;Topic&lt;/a&gt; exists and hands it the request &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(3)&lt;/code&gt;. The Topic then checks to see if the requested records are available in the locally cached batches &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(3.1)&lt;/code&gt;, fetching any missing batches from S3 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(3.2)&lt;/code&gt; and caching them on disk. The Topic then finally uses the &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/19a5bde1f5359b2b1c556bb7df288273a6b416d8/internal/sebrecords/records.go#L81&quot;&gt;Parser&lt;/a&gt; to extract the requested records &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(4)&lt;/code&gt;, which might span one or more files in the cache. Finally it sends the retrieved records all the way back up the stack, where the result is turned back into an HTTP response and sent back over the network to the caller.&lt;/p&gt;

&lt;p&gt;It’s important to mention here that, just as on the write path, the HTTP response is encoded using multipart form-data with one part per record. As was evident when we looked at the write path, this is highly inefficient. To give you an intuition of what multipart form-data looks like, here’s an example HTTP request:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;POST /records HTTP/1.1
Host: localhost:51313
Content-Type: multipart/form-data;boundary=&quot;boundary&quot;

--boundary
Content-Disposition: form-data; name=&quot;0&quot;

record-0-data
--boundary
Content-Disposition: form-data; name=&quot;1&quot;

record-1-data
--boundary--
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;profiling&quot;&gt;Profiling&lt;/h2&gt;

&lt;p&gt;Like we did in the last post, we’ll use Go’s excellent profiling tools to identify where Seb is spending its time on the CPU. In order to do this, we need to put some load on the system. The first task of this project therefore was to implement a simple read benchmark that is easy to run. I won’t go into details of the implementation here, but I will note that having a tool to generate reliable, consistent load on your system makes performance optimizations &lt;em&gt;so&lt;/em&gt; much easier to do, and gives us much better odds of making actual improvements. I highly recommend investing the time in building a tool like this for your next project!&lt;/p&gt;

&lt;p&gt;While using the read benchmark to put some load on the system, I recorded a profile of Seb which resulted in the following flame graph:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-09-10-seb-read-performance/profiling-mime-multipart-slow.png&quot; alt=&quot;Profiling Seb, retrieving records, before&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I’ve highlighted multipart form-data formatting-related code using red boxes, and memory-related operations (allocations, copying, and garbage collection) using black boxes. We saw exactly this behavior on the write path in the last post as well, so if you read that one this result should come as no surprise. What we’re seeing is that we’re spending loads of time writing all of the records according to the multipart form-data format, generating a lot of garbage while doing so.&lt;/p&gt;

&lt;p&gt;Looking at the left-most red box on the flame graph, we see that most of its time is spent in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Syscall6&lt;/code&gt;. Going a bit up the stack, we see that this originates from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.(*conn).Write&lt;/code&gt;, i.e. writing bytes to a network socket. We want to get a response to our callers, so this work looks productive and isn’t something we’re trying to eliminate.&lt;/p&gt;

&lt;p&gt;Looking at the right-most red box, we see that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;multipart.(*Writer).CreateFormField&lt;/code&gt; spends a lot of time serializing our HTTP payload using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fmt.Fprintf&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fmt.Sprintf&lt;/code&gt;, both of which cause a lot of allocations and create tons of work for the garbage collector.&lt;/p&gt;

&lt;p&gt;Lastly, looking at the black boxes in the middle of the flame graph, we see that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sebtopic.(*Topic).ReadRecords&lt;/code&gt; spends a &lt;em&gt;lot&lt;/em&gt; of time allocating and copying bytes around. If you look carefully, you can see that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(*Parser).Records&lt;/code&gt; &lt;em&gt;does disk IO&lt;/em&gt;. And, oh my, although disk IOs are among the most expensive operations we can do, they take up only ~25% of the time spent in that function!&lt;/p&gt;

&lt;p&gt;Now that we have a better understanding of where Seb is spending its precious time on the CPU, we can focus on how to improve it.&lt;/p&gt;

&lt;h2 id=&quot;reflecting&quot;&gt;Reflecting&lt;/h2&gt;

&lt;p&gt;Like we learned in the previous post, data serialization has a major impact on performance. It not only takes time to translate data between formats, it also requires us to allocate and copy bytes between buffers, creating a lot of garbage that has to be cleaned up.&lt;/p&gt;

&lt;p&gt;In the previous post we worked backwards from Seb’s internal on-disk format and redefined the user-facing API to use the same format, thereby avoiding almost all of the serialization-related work we’re now seeing on the read path. Instead of encoding one multipart form-data field per record, we can serialize the payload as one buffer containing all record data plus one list containing the length of each record in that buffer, avoiding a lot of work.&lt;/p&gt;
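&lt;p&gt;As a tiny worked example of that format (record contents borrowed from the multipart example earlier; lengths are in bytes), two records would be encoded as:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;records: &quot;record-0-data&quot;, &quot;record-1-data&quot;

sizes = [13, 13]
data  = &quot;record-0-datarecord-1-data&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Extracting record &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i&lt;/code&gt; is then just a slice of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; at the offset given by summing the first &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i&lt;/code&gt; sizes; no per-record copying or allocation is required.&lt;/p&gt;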

&lt;p&gt;I’ve visualized the difference between the two formats below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-09-10-seb-read-performance/multipart-form-encoding-to-pointers-and-raw.png&quot; alt=&quot;Payload serialization, multipart form-data vs raw data and lengths&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Looking at the flame graph again, what would it look like if we removed all of the serialization and unproductive allocations that we currently see?&lt;/p&gt;

&lt;p&gt;Assuming that we don’t have to restructure the data but can hand the caller the raw bytes, we can read them from disk and pass them up the stack. This should remove all of the unproductive allocations we saw.&lt;/p&gt;

&lt;p&gt;Since the format shown above only requires us to create two form fields instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; (one per record), we would also expect the time spent in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CreateFormField&lt;/code&gt; to almost go away.&lt;/p&gt;

&lt;p&gt;I’ve visualized what these changes might look like, with blue boxes representing avoidable work:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-09-10-seb-read-performance/profiling-mime-multipart-slow-parts-avoid.png&quot; alt=&quot;Profiling Seb, retrieving records, work to avoid&quot; /&gt;&lt;/p&gt;

&lt;p&gt;When we disregard the contents of the blue boxes in the above flame graph, we see that we’re almost left with only the essential (and most expensive!) operations: reading from disk and writing to the network.&lt;/p&gt;

&lt;p&gt;This is all well and good in theory, but how do we achieve this in code?&lt;/p&gt;

&lt;h2 id=&quot;fixing&quot;&gt;Fixing&lt;/h2&gt;
&lt;p&gt;Although the specific implementation changes could be interesting to look at, we’ll continue using only the high-level information we already know. I want to highlight that the speedups we’re going to see don’t have as much to do with the exact implementation as with the structure: the flow of data. Both of course play a role, but I think the most important learnings in this case come from focusing on just the structure.&lt;/p&gt;

&lt;p&gt;If you’re interested in digging into implementation details, I suggest you look at the source: &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/tree/19a5bde1f5359b2b1c556bb7df288273a6b416d8&quot;&gt;this is where we start&lt;/a&gt;, &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/tree/d0e3cd56e97e43d68d9df74bc47424a4572cb176&quot;&gt;this is where we end&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the first diagram of this post, we saw the functions that make up the read path. Here, we see it again, this time with function signatures:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;type Batch struct {
	Sizes []uint32
	Data  []byte
}

func (s *Broker) GetRecords(ctx context.Context, topicName string, offset uint64, maxRecords int, softMaxBytes int) ([][]byte, error)

func (s *Topic) ReadRecords(ctx context.Context, offset uint64, maxRecords int, softMaxBytes int) (sebrecords.Batch, error)

func (rb *Parser) Records(recordIndexStart uint32, recordIndexEnd uint32) (Batch, error)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At the bottom of the read path, we see that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Parser.Records()&lt;/code&gt; returns a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Batch&lt;/code&gt;. Since this is at the bottom of the call hierarchy, the returned &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Batch&lt;/code&gt;es must be allocated within &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Parser.Records()&lt;/code&gt; itself. From the description at the beginning of the post, we know that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Topic.ReadRecords()&lt;/code&gt; will call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Parser.Records()&lt;/code&gt; once per file that we need to read. This means that, with the current function signature, we will see at least one allocation per file read. Depending on the number of records requested, this could cause many allocations.&lt;/p&gt;

&lt;p&gt;We are looking to eliminate unproductive allocations, so how do we avoid the current requirement that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Parser.Records()&lt;/code&gt; must allocate a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Batch&lt;/code&gt; per call? By giving &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*Batch&lt;/code&gt; as an argument instead of requiring it as a return value:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;func (rb *Parser) Records(batch *Batch, recordIndexStart uint32, recordIndexEnd uint32) error
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The small change we just made to the signature has a very important impact: we moved the responsibility of allocating &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Batch&lt;/code&gt; one level up the stack, from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Parser.Records()&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Topic.ReadRecords()&lt;/code&gt;. We can of course do this same trick all the way up the stack, which changes all signatures to the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;func (s *Broker) GetRecords(ctx context.Context, batch *sebrecords.Batch, topicName string, offset uint64, maxRecords int, softMaxBytes int) error

func (s *Topic) ReadRecords(ctx context.Context, batch *sebrecords.Batch, offset uint64, maxRecords int, softMaxBytes int) error

func (rb *Parser) Records(batch *Batch, recordIndexStart uint32, recordIndexEnd uint32) error
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This minor change has moved the responsibility of allocating &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Batch&lt;/code&gt;es from the bottom of the stack to the top. It’s now the responsibility of the code that calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Broker.GetRecords()&lt;/code&gt; (in our case an HTTP handler) to provide a pre-allocated batch to be used for each request. As long as the given &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*Batch&lt;/code&gt; is large enough to satisfy the request, we can now guarantee &lt;em&gt;at most one&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Batch&lt;/code&gt; allocation per request, regardless of how many files we need to read data from. And, with allocations being made at the top of the call stack, it’s possible to reuse buffers across requests, leading to even fewer allocations.&lt;/p&gt;

&lt;p&gt;To show you what this could look like from the caller’s perspective, here’s a simplified version of the HTTP handler:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;type RecordsGetter interface {
	GetRecords(ctx context.Context, batch *sebrecords.Batch, topicName string, offset uint64, maxRecords int, softMaxBytes int) error
}

func GetRecords(log logger.Logger, batchPool *syncy.Pool[*sebrecords.Batch], rg RecordsGetter) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// do http stuff

		batch := batchPool.Get()
		defer batchPool.Put(batch)

		err := rg.GetRecords(r.Context(), batch, topicName, offset, maxRecords, softMaxBytes)
		if err != nil {
			// handle various errors
		}

		err = httphelpers.RecordsToMultipartFormDataHTTP(mw, batch.Sizes, batch.Data)
		if err != nil {
			// handle various errors
		}
	}
}

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since the write path already uses the same structure, these changes also allow us to share the pool of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Batch&lt;/code&gt;es between the read- and write paths!&lt;/p&gt;

&lt;p&gt;Additionally, since Seb &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/master/cmd/seb/app/serve.go#L105&quot;&gt;limits how many HTTP requests it wants to handle in parallel&lt;/a&gt;, an extra benefit is that it’s now possible to allocate all buffers that the program needs at startup! This of course comes with some drawbacks, e.g. it puts hard limits on the size of payloads, but it also comes with some superhero-like benefits: with all buffers allocated at startup, we can now determine &lt;em&gt;at deployment time&lt;/em&gt; how much memory the application will use&lt;sup id=&quot;fnref:0&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:0&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. If the application starts at deployment, we can be confident that &lt;em&gt;it cannot go out-of-memory!&lt;/em&gt; This sounds surreal and is an absolute superpower when doing server planning and provisioning. This one took a few days to sink in for me, but once I realized the power of it, I couldn’t stop thinking about it. Why aren’t we aiming to build our systems like this?&lt;/p&gt;

&lt;p&gt;Alright. With the above changes implemented, it’s time to put some pressure on the system and record another profile. The new recording resulted in the following flame graph:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-09-10-seb-read-performance/profiling-mime-multipart-fast.png&quot; alt=&quot;Profiling Seb, retrieving records, after&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Oh my, this is even better than I dared hope for! We’ve eliminated basically all of the serialization and garbage collection overhead, even removing a large &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memmove&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;multipart.(*part).Write&lt;/code&gt; that I wasn’t expecting to get rid of.&lt;/p&gt;

&lt;p&gt;On the new flame graph we see that we’re almost literally down to spending time only in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Syscall6&lt;/code&gt;. Clicking around, I can tell you that the flame graph reports that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Syscall6&lt;/code&gt; now takes up 91.9% of the total runtime! Approximately half of it is for reading from disk, and the other half is for writing to the network.&lt;/p&gt;

&lt;p&gt;With these very promising changes it’s time to benchmark.&lt;/p&gt;

&lt;h2 id=&quot;benchmarking-digression&quot;&gt;Benchmarking digression&lt;/h2&gt;

&lt;p&gt;Before jumping to benchmarking, I want to digress slightly and note something I’ve learned the hard way (many times by now, so maybe I never really learned it…).&lt;/p&gt;

&lt;p&gt;When benchmarking you should &lt;em&gt;always&lt;/em&gt; record and safely store your benchmark parameters. And, importantly, &lt;em&gt;include the version of the code that was used!&lt;/em&gt; This lets you know exactly which code and configuration gave you the results you’re looking at. This is incredibly valuable when you inevitably make more changes to the code than you expected, as it allows you to understand how (or even if) you can sensibly compare different runs of the benchmarks. If you fail to do this, you’re destined to have to re-run all of your benchmarks &lt;em&gt;just this last time&lt;/em&gt; (for the 7th time.) The best strategy I found for remembering to do this is to just dump the benchmark’s parameters along with the results. The parameters are honestly just as important and valuable as the results themselves!&lt;/p&gt;

&lt;h2 id=&quot;benchmarking&quot;&gt;Benchmarking&lt;/h2&gt;

&lt;p&gt;The benchmarks for this post were run on my laptop, a Lenovo T14, plugged into the wall, with the following specs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;AMD Ryzen 7 PRO 4750U&lt;/li&gt;
  &lt;li&gt;Micron MTFDHBA512TDV 512GB NVMe drive&lt;/li&gt;
  &lt;li&gt;48 gigs of RAM&lt;/li&gt;
  &lt;li&gt;Ubuntu 22.04&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re doing no network requests (all files are cached locally), so the NIC should be irrelevant. Also, since we’re doing buffered IO on 1GiB of records, we expect reads to be mostly served from the page cache.&lt;/p&gt;

&lt;p&gt;The benchmarks were run with the following command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;seb benchmark-read --local-broker=true -r 5 -w 16 --batches=4096 --record-size=1024 --records-per-batch=256 --records-per-request=1024 --requests 20000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This command runs 5 repetitions of a job that utilizes 16 workers to send a total of 20,000 requests. Each request asks for 1024 records (1KiB each, so 1MiB/request), for a total of ~19.5GiB requested. The starting offset for each request is selected uniformly at random from a set of pre-inserted and cached records. The on-disk batch size is 256 records/file, so each request will have to open and read 4 or 5 different files.&lt;/p&gt;

&lt;p&gt;And, as summarized by the benchmark tool:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Benchmark config:
Num workers:            16
Num requests:           20000                                 
Records/request:        1024                                 
Record size:            1KiB (1024B)                                 
Bytes/request:          1MiB (1048576B)
Total bytes requested:  19.5GiB (20971520000B)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note: this workload doesn’t really replicate a production scenario where we would probably expect something like a Poisson distribution heavily skewed towards the most recent records. Also, we’re not looking to understand the absolute performance of Seb here but are just looking for the relative impact of our changes.&lt;/p&gt;

&lt;p&gt;Without further ado, the results of the benchmarks:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;code&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;seconds/run&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;records/second&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;improvement&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/micvbang/simple-event-broker/tree/19a5bde1f5359b2b1c556bb7df288273a6b416d8&quot;&gt;reference&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;35.82 / 35.32 / 37.21&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;572k / 550k / 580k&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/micvbang/simple-event-broker/tree/d0e3cd56e97e43d68d9df74bc47424a4572cb176&quot;&gt;update&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;9.76 / 9.50 / 10.30&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2099k / 1987k / 2154k&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3.67x&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Whoop, a 3.67x improvement; we can now run the same workload in about 1/4 of the time!&lt;/p&gt;

&lt;p&gt;For the second time we’re learning that data serialization and unnecessary memory operations have a &lt;em&gt;major&lt;/em&gt; impact on performance. By changing the user-facing interface to match the format that Seb wants the data to be in internally, we’ve removed a lot of work and with it a lot of allocations and memcopying. By using simple tools and comparatively small refactorings, we’re seeing a &lt;em&gt;massive&lt;/em&gt; 3.67x payoff in performance. Awesome!&lt;/p&gt;

&lt;p&gt;Yet again I’ll end my post by tipping my hat and giving a big THANK YOU to &lt;a href=&quot;https://x.com/jorandirkgreef&quot;&gt;Joran Dirk Greef&lt;/a&gt; at TigerBeetle and &lt;a href=&quot;https://x.com/DominikTornow&quot;&gt;Dominik Tornow&lt;/a&gt; at Resonate for sharing all of their knowledge and helping to light a fire in the systems software community!&lt;/p&gt;

&lt;h2 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h2&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:0&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;this isn’t entirely accurate; I haven’t eliminated all allocations from Seb yet. But the &lt;em&gt;vast&lt;/em&gt; majority of memory used &lt;em&gt;is&lt;/em&gt; coming from these buffers, so the overall point is still valid. &lt;a href=&quot;#fnref:0&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Tue, 10 Sep 2024 13:25:00 +0000</pubDate>
        <link>https://blog.vbang.dk/2024/09/10/seb-tiger-style-read-path/</link>
        <guid isPermaLink="true">https://blog.vbang.dk/2024/09/10/seb-tiger-style-read-path/</guid>
      </item>
    
      <item>
        <title>Simple event broker tries Tiger Style</title>
        <description>&lt;p&gt;I’ve been on a bender for the past few weeks. I haven’t been able to stop reading and watching content about &lt;a href=&quot;https://tigerbeetle.com/&quot;&gt;TigerBeetle&lt;/a&gt;. I was especially enamored by videos in which &lt;a href=&quot;https://x.com/jorandirkgreef&quot;&gt;Joran Dirk Greef&lt;/a&gt; presents &lt;a href=&quot;https://www.youtube.com/watch?v=_jfOk4L7CiY&quot;&gt;TigerBeetle in general&lt;/a&gt;, &lt;a href=&quot;https://www.youtube.com/watch?v=Wii1LX_ltIs&quot;&gt;replication&lt;/a&gt;, and &lt;a href=&quot;https://www.youtube.com/watch?v=w3WYdYyjek4&quot;&gt;Tiger Style&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Joran has been far and wide the past years, doing all he can to spread the message of TigerBeetle and Tiger Style. Lucky for us, this has left a trail of insightful content in his wake!&lt;/p&gt;

&lt;p&gt;My time in the virtual company of Joran has inspired me to try TigerBeetle’s coding style, &lt;a href=&quot;https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TIGER_STYLE.md&quot;&gt;Tiger Style&lt;/a&gt;. Since I’m already working on &lt;a href=&quot;https://blog.vbang.dk/2024/05/26/seb/&quot;&gt;Seb&lt;/a&gt;, my event broker which I want to be fast and keep my data safe, I thought this would be a good place to try it out.&lt;/p&gt;

&lt;p&gt;With inspiration from Joran and Tiger Style, my past weekend’s project was to improve the write path of Seb. My goal was simple: write more records per second while maintaining correctness (duh!)&lt;/p&gt;

&lt;h1 id=&quot;tiger-style&quot;&gt;Tiger Style&lt;/h1&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TIGER_STYLE.md#performance&quot;&gt;parts of Tiger Style&lt;/a&gt; that most inspired this weekend project were:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Perform back-of-the-envelope sketches with respect to the four resources (network, disk, memory, CPU) and their two main characteristics (bandwidth, latency). Sketches are cheap. Use sketches to be “roughly right” and land within 90% of the global maximum.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Amortize network, disk, memory and CPU costs by batching accesses.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These were particularly intriguing to me since, in the first implementation of Seb, records could only be added and retrieved one-by-one. This was a fundamental, architectural problem that had to change in order for the event broker to have any reasonable hope of not remaining the slowest kid in class forever. In my first post, &lt;a href=&quot;https://blog.vbang.dk/2024/05/26/seb/&quot;&gt;Hello World, Simple Event Broker&lt;/a&gt;, I showed that my first naive batching implementation gave an easy 2x improvement in the number of records handled per second, going from ~22k to ~50k. This was obviously a welcome improvement, but honestly not very impressive.&lt;/p&gt;

&lt;p&gt;I’ve been focusing more on correctness than performance while building Seb so far, so I haven’t really taken the opportunity to do any profiling. Until now!&lt;/p&gt;

&lt;h1 id=&quot;profiling&quot;&gt;Profiling&lt;/h1&gt;

&lt;p&gt;It has taken me much longer to learn this than is reasonable, but I now finally know, and act as if I know!, that the very first thing you &lt;em&gt;must&lt;/em&gt; do when trying to make your program faster is to measure it and be &lt;strong&gt;very systematic&lt;/strong&gt; about your measurements.
Yes, I &lt;em&gt;know&lt;/em&gt; it is much more fun to guess at the problem and try out random solutions, crossing your fingers in the hope that one of your guesses magically makes things go brrr. But if you plan to make progress instead of trying your luck all day, going straight to some sort of profiling is the winning move. Every. Single. Time. Even if you’re just printf’ing timestamps; &lt;strong&gt;you must measure&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;Luckily, Go has some excellent tooling for profiling which makes the decision to stop spinning the roulette that much easier. It’s almost trivial to instrument a Go program to be profiled: just &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/master/internal/infrastructure/httphelpers/pprof.go&quot;&gt;start an HTTP server&lt;/a&gt; on an unused port (on localhost!) and request a CPU profile from it:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl http://localhost:5000/debug/pprof/profile?seconds&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;10 &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; profile
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once you’ve got your profile, you can view it using:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;go tool pprof --http &quot;:8001&quot; profile
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This should open up a browser with an interactive view of the profile you just made. If you haven’t done this before, try it out on one of your programs. As the following will show you, you might be surprised by what you find!&lt;/p&gt;

&lt;p&gt;Alright, on to Seb. On Saturday morning I fired up Seb, ran a workload with a bunch of inserts and requested a CPU profile.&lt;/p&gt;

&lt;p&gt;With the profile in hand, I opened the interactive web view and jumped directly to the flame graph. If you haven’t seen one of these before, check out &lt;a href=&quot;https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Description&quot;&gt;this explanation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The graph I got was this (sorry - open the screenshot in a new tab, it doesn’t show in a readable size on my blog and I’m an idiot with CSS):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-07-07-seb-write-performance/profiling-mime-multipart-slow.png&quot; alt=&quot;Profiling Seb, adding records, before&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The red box I put on there outlines the HTTP handler &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/76ee8661d98e6988448d88f543f38e304edb92ae/internal/httphandlers/addrecords.go#L25&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;httphandlers.AddRecords()&lt;/code&gt;&lt;/a&gt;, which takes up almost 50%(!) of the time shown on the graph. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AddRecords()&lt;/code&gt;’s job is to parse incoming HTTP requests, pass them to the Broker, and send an HTTP response to the caller. Admittedly I was surprised to see that Seb is spending around half of the time on its write path parsing multipart data and, in the process, generating heaps of garbage that has to be cleaned up again.&lt;/p&gt;

&lt;p&gt;The green box on the screenshot outlines &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/76ee8661d98e6988448d88f543f38e304edb92ae/internal/sebrecords/records.go#L45&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sebrecords.Write()&lt;/code&gt;&lt;/a&gt; which is responsible for writing data to the underlying storage.&lt;/p&gt;

&lt;p&gt;The black boxes outline runtime memory operations: allocations, memcopies, and garbage collection. This is a large part of the time spent!&lt;/p&gt;

&lt;p&gt;The flame graph basically tells us that Seb is creating a lot of garbage. Like, a lot. Unlike in real life where making a mess can be quite fun, on the computer it’s doubly bad: it’s expensive to clean up &lt;em&gt;and&lt;/em&gt; it’s expensive to make a mess in the first place. And, to make matters even worse, using all of this memory completely ruins the effectiveness of our hardware caches. Ugh!&lt;/p&gt;

&lt;p&gt;Taking another look at Tiger Style, we see that it has more relevant advice:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;All memory must be statically allocated at startup. No memory may be dynamically allocated (or freed and reallocated) after initialization. This avoids unpredictable behavior that can significantly affect performance, and avoids use-after-free. As a second-order effect, it is our experience that this also makes for more efficient, simpler designs that are more performant and easier to maintain and reason about, compared to designs that do not consider all possible memory usage patterns upfront as part of the design.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have never attempted to implement a system of this size that statically allocates everything, but I appreciate that it must be a major effort to do so. I’m absolutely certain that I won’t remove all allocations from Seb’s write path in this small weekend project, but in terms of performance and safety it seems like great advice. Let’s see how far we get.&lt;/p&gt;

&lt;p&gt;Using a stretchy interpretation of the Tiger Style advice of back-of-the-envelope sketching (which is supposed to be done &lt;em&gt;before&lt;/em&gt; you actually write your code), let’s have a high-level look at the implementation of the two functions highlighted by the flame graph. Our aim is to find code that puts pressure on the garbage collector.&lt;/p&gt;

&lt;h1 id=&quot;investigating&quot;&gt;Investigating&lt;/h1&gt;

&lt;p&gt;Since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AddRecords()&lt;/code&gt; is taking up most of the time, we’ll focus on that first. I’ve listed the most relevant code below. The full function is available &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/76ee8661d98e6988448d88f543f38e304edb92ae/internal/httphandlers/addrecords.go#L25&quot;&gt;here&lt;/a&gt; if you’re curious. Since the flame graph told us that this function is doing a lot of allocations, I’ve added comments to highlight the most obvious ones.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;func AddRecords(log logger.Logger, s RecordsAdder) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        defer r.Body.Close()

        // ...

        records := make([]sebrecords.Record, 0, 256) // &amp;gt;= 1 ALLOC
        mr := multipart.NewReader(r.Body, mediaParams[&quot;boundary&quot;])
        for part, err := mr.NextPart(); err == nil; part, err = mr.NextPart() {
            record, err := io.ReadAll(part)  // &amp;gt;= 1 ALLOC PER LOOP
            if err != nil {
                log.Errorf(&quot;reading parts of multipart/form-data: %s&quot;, err)
                w.WriteHeader(http.StatusInternalServerError)
                return
            }
            part.Close()
            records = append(records, record)
        }

        // ...
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’re only doing a back-of-the-envelope kind of investigation here, so we won’t go into the actual implementations of anything but the code listed above. With just this tiny snippet of code we can tell that there is at least one allocation related to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;records&lt;/code&gt; variable (notice the trailing “s”), and at least one allocation for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;record&lt;/code&gt; variable; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;io.ReadAll()&lt;/code&gt; must allocate the byte slice it returns.&lt;/p&gt;

&lt;p&gt;Since the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;record&lt;/code&gt; variable is allocated once in each of the loop’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; iterations, it looks to be the dominating factor in terms of how many allocations are made. In fancy systems lingo we say that there’s &lt;a href=&quot;https://en.wikipedia.org/wiki/Big_O_notation&quot;&gt;&lt;em&gt;on the order of&lt;/em&gt;&lt;/a&gt; N allocations happening here - at least one allocation per record received in the HTTP request.&lt;/p&gt;

&lt;p&gt;This very high-level understanding of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AddRecords()&lt;/code&gt; memory usage is enough to satisfy me for now. Let’s turn to the second offender on the list, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sebrecords.Write()&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;func Write(wtr io.Writer, rb []Record) error {
    header := Header{
        MagicBytes:  FileFormatMagicBytes,
        UnixEpochUs: UnixEpochUs(),
        Version:     FileFormatVersion,
        NumRecords:  uint32(len(rb)),
    }

    err := binary.Write(wtr, byteOrder, header)
    if err != nil {
        return fmt.Errorf(&quot;writing header: %w&quot;, err)
    }

    recordIndexes := make([]uint32, len(rb)) // 1 ALLOC, small
    var recordIndex uint32
    for i, record := range rb {
        recordIndexes[i] = recordIndex
        recordIndex += uint32(len(record))
    }

    err = binary.Write(wtr, byteOrder, recordIndexes)
    if err != nil {
        return fmt.Errorf(&quot;writing record indexes %d: %w&quot;, recordIndex, err)
    }

    records := make([]byte, 0, recordIndex) // 1 ALLOC, large
    for _, record := range rb {
        records = append(records, record...)
    }

    err = binary.Write(wtr, byteOrder, records)
    if err != nil {
        return fmt.Errorf(&quot;writing records length %s: %w&quot;, sizey.FormatBytes(len(rb)), err)
    }
    return nil
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As we saw earlier, the flame graph told us that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Write()&lt;/code&gt; is spending a lot of time copying things around and doing garbage collection. Looking for big memory accesses, we see that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Write()&lt;/code&gt; makes two calls to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make()&lt;/code&gt; - one for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;recordIndexes&lt;/code&gt; and one for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;records&lt;/code&gt;. In preparation for the first loop, a single small allocation is made before &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N uint32&lt;/code&gt;s are memcopied into it. For the second loop, the allocation is likely much larger: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N*[avg record size]&lt;/code&gt; bytes are allocated and then copied into.&lt;/p&gt;

&lt;p&gt;We see that both of these allocations are made in preparation of a call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binary.Write()&lt;/code&gt;; both are done in order to reduce the number of disk IOs. Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binary.Write()&lt;/code&gt; once instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; times will reduce the number of disk IO-related syscalls we make. Since Seb is using buffered IO without fsync (S3 is the source of truth!), we can’t tell exactly how many actual disk IOs each call translates to, but at least we do know that it translates to fewer syscalls and context switches.&lt;/p&gt;

&lt;p&gt;This means that, although it doesn’t look like it on the flame graph, both of these allocations and memcopies are actually beneficial in the current setting. The cost of doing a memory copy is much smaller than the cost of doing a disk IO, so given the chance to trade between a few memory copies and doing a few disk IOs (or syscalls), you’re very likely to get ahead if you bet on memory copying over disk IOs.&lt;/p&gt;

&lt;p&gt;Using &lt;a href=&quot;https://github.com/sirupsen/napkin-math#numbers&quot;&gt;Sirupsen’s napkin math&lt;/a&gt; and a bit of hand waving regarding buffered IOs, we can estimate that it’s on the order of 10 times faster (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;100μ/MB&lt;/code&gt; vs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1ms/MB&lt;/code&gt;) to collect all of our data into a single buffer and then do a single IO instead of doing one IO per record using the fragmented buffers that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Write()&lt;/code&gt; receives as its input.&lt;/p&gt;

&lt;p&gt;Although the flame graph shows that we’re spending a lot of time copying things around in memory, we’ve actually just found that, in this particular example, a bit of memcopy is preferable because it’s done to reduce the number of much more expensive disk IOs.&lt;/p&gt;

&lt;h1 id=&quot;fixing&quot;&gt;Fixing&lt;/h1&gt;

&lt;p&gt;Taking a step back and considering all of the information from our investigation above, we see that the two functions have a common problem: the fact that they’re given records one-by-one impacts how much garbage they generate.&lt;/p&gt;

&lt;p&gt;For &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AddRecords()&lt;/code&gt;, each record received directly translates to at least one allocation. Receiving a multipart form data-formatted list of records means that it needs to parse the records and make an allocation for each one. In &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Write()&lt;/code&gt;, we need to transform the slice of records created in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AddRecords()&lt;/code&gt; into a slice of bytes so that we can write it efficiently to disk.&lt;/p&gt;

&lt;p&gt;It looks a lot like we could do the same job with a lot fewer allocations if we simply didn’t have to transform data between different representations!&lt;/p&gt;

&lt;p&gt;But how do we do this? If we work our way backwards, we can try to change the interface of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Write()&lt;/code&gt; so that it doesn’t have to do any transformations:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;func Write(wtr io.Writer, recordIndexes []uint32, records []byte) error {
    // ...
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That doesn’t look too bad! With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;recordIndexes&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;records&lt;/code&gt; being given directly as inputs, we can write them to disk without further processing.&lt;/p&gt;

&lt;p&gt;Working our way backwards up the stack, we can do the same to the callers of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AddRecords()&lt;/code&gt;. If, instead of requiring users to send data as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; multipart form-encoded fields, we request that they send the sizes of each record as one field and the raw record data as another, the number of allocations goes from &lt;em&gt;the order of&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; to &lt;em&gt;the order of&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt;, meaning that the number of allocations no longer depends on the number of records in the input. Nice!&lt;/p&gt;

&lt;p&gt;With the changes described, the implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Write()&lt;/code&gt; becomes much simpler and is basically just three calls to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binary.Write()&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;func Write(wtr io.Writer, recordIndexes []uint32, allRecords []byte) error {
    header := Header{
        MagicBytes:  FileFormatMagicBytes,
        UnixEpochUs: UnixEpochUs(),
        Version:     FileFormatVersion,
        NumRecords:  uint32(len(recordIndexes)),
    }

    err := binary.Write(wtr, byteOrder, header)
    if err != nil {
        return fmt.Errorf(&quot;writing header: %w&quot;, err)
    }

    err = binary.Write(wtr, byteOrder, recordIndexes)
    if err != nil {
        return fmt.Errorf(&quot;writing record indexes %d: %w&quot;, len(recordIndexes), err)
    }

    err = binary.Write(wtr, byteOrder, allRecords)
    if err != nil {
        return fmt.Errorf(&quot;writing records length %s: %w&quot;, sizey.FormatBytes(len(allRecords)), err)
    }

    return nil
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AddRecords()&lt;/code&gt; becomes slightly worse to read, but I’m sure another pass could improve it:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;func AddRecords(log logger.Logger, s RecordsAdder) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        defer r.Body.Close()

        // ...

        var fileSizes []uint32
        var records []byte
        mr := multipart.NewReader(r.Body, mediaParams[&quot;boundary&quot;])
        for part, err := mr.NextPart(); err == nil; part, err = mr.NextPart() {
            bs, err := io.ReadAll(part)
            if err != nil {
                log.Errorf(&quot;reading parts of multipart/form-data: %s&quot;, err)
                w.WriteHeader(http.StatusInternalServerError)
                return
            }
            part.Close()

            switch part.FormName() {
            case httphelpers.RecordsMultipartRecordsKey:
                records = bs

            case httphelpers.RecordsMultipartSizesKey:
                err = json.Unmarshal(bs, &amp;amp;fileSizes)
                if err != nil {
                    log.Errorf(&quot;reading sizes: %v&quot;, err)
                    w.WriteHeader(http.StatusBadRequest)
                    return
                }

            default:
                log.Errorf(&quot;unknown field %s&quot;, part.FormName())
                w.WriteHeader(http.StatusBadRequest)
                return
            }
        }

        // TODO: verify that both &apos;sizes&apos; and &apos;records&apos; were given

        // ...
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Let’s see what our interpretation of Tiger Style back-of-the-envelope changes (and a bit of other make-the-types-match kind of stuff all along the write path that I’ll sweep under the rug for now) has done to decrease the amount of garbage we generate on Seb’s write path:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-07-07-seb-write-performance/profiling-mime-multipart-medium.png&quot; alt=&quot;Profiling Seb, adding records, mid&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Not bad! &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AddRecords()&lt;/code&gt; has changed quite a bit. What I immediately notice is that half of the multipart parsing code has disappeared from the graph: only the left-most part is still there. It’s not exactly perfect yet, as we’re still spending a lot of time in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;runtime.growslice&lt;/code&gt;. This is likely because each byte slice allocated for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;records&lt;/code&gt; variable must be expanded quite a few times to accommodate all of the record data received.&lt;/p&gt;

&lt;p&gt;Looking at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Write()&lt;/code&gt; (which is named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WriteRaw()&lt;/code&gt; in the new graph), we see that the amount of pressure on the garbage collector has decreased noticeably. You might notice that the allocations have moved from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Write()&lt;/code&gt; up to its parent, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;collectBatches()&lt;/code&gt; - I’ve swept some minor changes under the rug here, but trust me that this isn’t important to our goal.&lt;/p&gt;

&lt;p&gt;Although we’re seeing definite progress, I’m not entirely satisfied with the results of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AddRecords()&lt;/code&gt; yet. The flame graph is showing us that a lot of time is being spent growing slices, which makes sense since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;io.ReadAll()&lt;/code&gt; is a generic function that starts out with a modest allocation which has to grow to accommodate the size of our batches of records.&lt;/p&gt;

&lt;p&gt;In order to fix the problem, we can allocate a pool of larger buffers that can be reused between requests. I’ve highlighted the added lines with comments.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;var bufPool = syncy.NewPool(func() *bytes.Buffer { // NEW
    return bytes.NewBuffer(make([]byte, 5*sizey.MB)) // NEW
})                                                   // NEW

func AddRecords(log logger.Logger, s RecordsAdder) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        defer r.Body.Close()

        // ...

        var fileSizes []uint32
        var records []byte
        mr := multipart.NewReader(r.Body, mediaParams[&quot;boundary&quot;])
        for part, err := mr.NextPart(); err == nil; part, err = mr.NextPart() {
            buf := bufPool.Get()        // NEW
            buf.Reset()                 // NEW
            defer bufPool.Put(buf)      // NEW

            _, err = io.Copy(buf, part) // NEW
            if err != nil {
                log.Errorf(&quot;reading parts of multipart/form-data: %s&quot;, err)
                w.WriteHeader(http.StatusInternalServerError)
                return
            }
            part.Close()

            switch part.FormName() {
            case httphelpers.RecordsMultipartRecordsKey:
                records = buf.Bytes()                         // NEW

            case httphelpers.RecordsMultipartSizesKey:
                err = json.Unmarshal(buf.Bytes(), &amp;amp;fileSizes) // NEW
                if err != nil {
                    log.Errorf(&quot;reading sizes: %v&quot;, err)
                    w.WriteHeader(http.StatusBadRequest)
                    return
                }

            default:
                log.Errorf(&quot;unknown field %s&quot;, part.FormName())
                w.WriteHeader(http.StatusBadRequest)
                return
            }
        }

        // TODO: verify that both &apos;sizes&apos; and &apos;records&apos; were given

        // ...
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Running the same workload again shows that our buffer pool was a great help:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-07-07-seb-write-performance/profiling-mime-multipart-faster.png&quot; alt=&quot;Profiling Seb, adding records, after&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We’re now seeing &lt;em&gt;much&lt;/em&gt; less pressure on the garbage collector, with only a few large &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;runtime.memmove()&lt;/code&gt; calls left.&lt;/p&gt;

&lt;p&gt;This is where we’ll leave the optimization work for now. The only thing left is to do some benchmarking to see how these changes affect the goal of the project, namely increasing the number of records per second we can push through Seb.&lt;/p&gt;

&lt;h1 id=&quot;benchmarking&quot;&gt;Benchmarking&lt;/h1&gt;

&lt;p&gt;Part of the work I did during the weekend was to update &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/76ee8661d98e6988448d88f543f38e304edb92ae/cmd/seb/app/benchmark.go#L64&quot;&gt;Seb’s benchmarking tool&lt;/a&gt;. It’s nothing fancy, but should work well to get an understanding of the relative improvements of the changes implemented above.&lt;/p&gt;

&lt;p&gt;I started out benchmarking using Seb’s S3 storage implementation, but because of highly variable latencies I decided that writing to disk would serve us better for these experiments; the purpose isn’t to show how many records Seb can handle in a production scenario, but rather to see relative improvements of the changes discussed above. A final note is that this workload uses buffered IO without fsync, so don’t read too much into the absolute numbers. We’re looking for relative changes, nothing else.&lt;/p&gt;

&lt;p&gt;All benchmarks were run on one of Hetzner’s tiny, cheap, 2-core CAX11 machines, and were repeated 10 times each. Each benchmark starts a new Seb broker, exposes it on a local HTTP port and starts 16 goroutines that use the Seb client to pepper the broker with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /records&lt;/code&gt;. They were run like this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;./seb benchmark &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; 10 &lt;span class=&quot;nt&quot;&gt;-w&lt;/span&gt; 16
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The configuration for each benchmark is as follows:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Config:
Num workers:            16
Num batches:            4096
Num records/batch:      1024
Record size:            1KiB (1024B)
Total bytes:            4GiB (4294967296B)
Batch block time:       5ms
Batch bytes max:        10MiB (10485760)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And the results, given as avg / min / max, are:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;code&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;seconds/run&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;records/second&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;improvement&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;reference&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;24.21 / 23.37 / 25.11&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;173k / 167k / 179k&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;updated, no buffers&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;15.82 / 15.51 / 16.13&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;265k / 260k / 270k&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1.53x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;updated, with buffers&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;12.41 / 12.17 / 12.57&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;338k / 334k / 345k&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1.95x&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Nice! By running three CPU profiles and looking at Seb’s code at a very high level, we managed to identify a few locations where we could avoid a bunch of unnecessary memory allocations and thereby alleviate pressure on the garbage collector. These simple changes have almost doubled the number of records that we can push through Seb. Not bad for a weekend project!&lt;/p&gt;

&lt;p&gt;With that, I’ll say that it has been fun to try out Tiger Style, and I’ll definitely continue learning from it in the future. I’m particularly interested in deterministic testing; if you happen to have great references and/or code examples to study, please let me know!&lt;/p&gt;

&lt;p&gt;Thanks to Joran and the TigerBeetle team for sharing their many insights with all of us - it’s a major source of inspiration!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If this post resonated with you and you’re looking for someone to help you do hard things with a computer, you can &lt;a href=&quot;/hire_me.html&quot;&gt;hire me&lt;/a&gt;!&lt;/strong&gt;&lt;/p&gt;
</description>
        <pubDate>Wed, 10 Jul 2024 11:20:00 +0000</pubDate>
        <link>https://blog.vbang.dk/2024/07/10/seb-tiger-style/</link>
        <guid isPermaLink="true">https://blog.vbang.dk/2024/07/10/seb-tiger-style/</guid>
      </item>
    
      <item>
        <title>Driplang: triggering when events happen (or don&apos;t)</title>
        <description>&lt;p&gt;This post describes multiple ways I’ve seen projects handle event triggering in the past and suggests a minor tweak that I believe will greatly benefit projects that have nontrivial event triggering requirements. The tweak is simple and helps to avoid creating unnecessary dependencies between unrelated parts of your system.&lt;/p&gt;

&lt;p&gt;It also describes how a tiny domain-specific language can be used in the implementation of this, aiming to make it possible for even non-developers to manage and create event triggers. Perhaps even using a visual tool! I never got this far in my own implementation, but it’s a very obvious next step from where the post ends.&lt;/p&gt;

&lt;p&gt;The ideas discussed here aren’t new. The functional outcome of my ideas has been available in various SaaS solutions for probably a decade. Nonetheless, I think there’s an important lesson here regarding software in general, in how seemingly minor changes in structure can have outsized benefits when it comes to the cost and complexity of developing and maintaining a system.&lt;/p&gt;

&lt;p&gt;Before we really get going I want to note that, although we’ll be talking about sending emails, the point I’m trying to make is much more general. It just so happens that notifications are a &lt;em&gt;very&lt;/em&gt; natural context to describe this problem with. Every time I’ve tried to explain these ideas, I always end up going back to notifications.&lt;/p&gt;

&lt;p&gt;A final thing before we continue: I’ll need a pinky promise that you won’t use this to spam people. No. Yes, &lt;em&gt;seriously&lt;/em&gt;. Spam is easily top 3 on the list of the 7 deadly sins.&lt;/p&gt;

&lt;p&gt;We good? Alright.&lt;/p&gt;

&lt;h1 id=&quot;the-problem&quot;&gt;The problem&lt;/h1&gt;

&lt;p&gt;On most projects I’ve worked on, it has at some point been a requirement to trigger certain functionality when specific events happen. A classic example is &lt;em&gt;“send an email to users who haven’t used feature X within their first week of signing up”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Even though this example is rather basic, it can be surprisingly difficult to implement well. If we’re not careful when we implement event triggering, we can inadvertently start introducing dependencies between otherwise unrelated components, which over time can become a burden that slows development significantly. What once started as a simple one-liner to send an email can suddenly require us to consider large parts of the system whenever we want to make even a small change.&lt;/p&gt;

&lt;h1 id=&quot;simple-triggers&quot;&gt;Simple triggers&lt;/h1&gt;

&lt;p&gt;At the beginning of a project there isn’t a lot of functionality yet. This hopefully means that there aren’t a lot of accidental or unnecessary dependencies between components, and that it’s still pretty cheap and easy to add new features and maintain existing ones. Not wanting to introduce new abstractions before they are truly needed, at this stage it can easily be argued that sending an email when a user is created is most simply done somewhere on the code path that naturally exists for user creation. This could, for example, be just after the user has been persisted to storage:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;class UserController:
   def add_user(self, user):
      self.user_repository.create(user)
      self.email_service.send_intro_email(user)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Depending on the full context of the rest of your system, the project’s goals and scope, your team and the position of the moon, this very well could be a nice and simple, non-overengineered solution to a simple problem. Lovely!&lt;/p&gt;

&lt;p&gt;A benefit of this simple solution is that it’s easy to look up what happens when you add a user: it’s all right there in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;add_user()&lt;/code&gt; function! This of course comes with the assumption that everything that happens when you add a user &lt;em&gt;actually&lt;/em&gt; happens in that function.&lt;/p&gt;

&lt;p&gt;Depending on how many event triggers we need to implement, a potential drawback of this simple implementation is that we will be scattering email-sending code all over the system. This might make it difficult to get an overview of all of the places from which we’re sending emails. Although this &lt;em&gt;could&lt;/em&gt; become a problem, the thing that tickles my spidey sense is that there are examples of reasonably simple event triggering logic that simply cannot be implemented this way. At least not in any advisable way that I know of. Triggers that require more information than naturally exists on existing code paths are super difficult to implement without introducing coupling between otherwise unrelated components. In the above example we wanted to trigger on the “user created” event, which happened right there in the code. For more complex triggers such a code path might simply not exist.&lt;/p&gt;

&lt;h1 id=&quot;more-advanced-triggers&quot;&gt;More advanced triggers&lt;/h1&gt;

&lt;p&gt;As time passes and new and more complex features are added to the project, we might want to create event triggers that aren’t a direct response to something that happens in the system. Such event triggers rarely have an obvious location where we can just add a one-liner. The problem is that we need knowledge from different parts of the system in one place.&lt;/p&gt;

&lt;p&gt;One obvious way to tackle this problem is to create an omniscient cron job-thingy that can pull information from all relevant parts of the system. In my mind I imagine this as an octopus that gets to roam around freely in your database, sticking its tentacles into anything it likes.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-06-19-eventdripper/octopus_grabbing_data.png&quot; alt=&quot;Octopus inside your database&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The benefit of this strategy is that it can make it very explicit what information is required to trigger a certain event and where that information comes from. Additionally, depending on our needs, it might be an advantage that this allows us to place all code relating to sending notifications close together instead of sprinkling it throughout the system. Below is an example of what this might look like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;class OmniscientCronJobThingy:
   def x_not_used_in_first_week(self):
      for user in self.user_repository.list(created_within=&apos;1 week&apos;):
         if not self.feature_x_repository.used_by(user):
            self.email_service.send_feature_x_intro_email(user)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since this is a cron job, we have to run it at some meaningful interval, ensure that it actually runs, probably handle errors asynchronously (we don’t want to stop sending emails to the rest of the users just because sending an email to one of them fails), and so on. All of these are problems that can be overcome, but it does come with the price of added complexity compared with the one-liner we first saw.&lt;/p&gt;

&lt;p&gt;A major drawback that our omniscient octopus introduces is that it adds a dependency on potentially the entire data model of the system. Since it’s basically a component with license to &lt;del&gt;kill&lt;/del&gt; read data from anywhere, we have to take it into account whenever we consider making a change to almost literally any part of the system; &lt;em&gt;did one of our co-workers add an event trigger that requires knowledge from the part of the system we’re currently considering changing&lt;/em&gt;? This problem can be mitigated somewhat by forcing the component to go through repositories instead of raw-dogging the database, but this doesn’t eliminate the problem entirely. When there’s an omniscient octopus tasting various parts of your data, you never quite know whether it’s safe to change your data model or not. At the very least, the loose octopus will make it more cumbersome to change the data model. Been there, done that. Although pets are nice and cute, you really don’t want them running around your database!&lt;/p&gt;

&lt;p&gt;Another problem we haven’t discussed yet is that of using existing data models to infer the state of something that we want to trigger on. In some cases we’re lucky that the data model naturally happens to contain exactly the information we want to trigger on. In other cases, not so much. What do we do then? Do we muddy the existing data model by adding &lt;em&gt;just one more&lt;/em&gt; field, to keep our omniscient cron job satisfied? I would personally be looking for different options very quickly.&lt;/p&gt;

&lt;p&gt;To summarize: we are looking for a solution that&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;avoids sprinkling email-sending code all across our application&lt;/li&gt;
  &lt;li&gt;avoids unwanted tentacles fiddling around our tables&lt;/li&gt;
  &lt;li&gt;does not create unnecessary dependencies between components&lt;/li&gt;
  &lt;li&gt;does not lead us into the temptation of introducing “unnecessary” data into our existing data models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As advertised earlier, the path I’m suggesting is in no way new nor sophisticated. It’s fundamental programming. One of the classics. It’s decoupling.&lt;/p&gt;

&lt;p&gt;If we simply separate &lt;em&gt;tracking&lt;/em&gt; of events and &lt;em&gt;reacting&lt;/em&gt; to events, we can have all of the benefits from our two solutions with very few of the drawbacks. We might even be able to move a large part of the human responsibility for declaring event triggers to non-developers!&lt;/p&gt;

&lt;p&gt;The following snippet looks very similar to our first one-liner snippet, but the result is quite different.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;class UserController:
   def add_user(self, user):
      self.user_repository.create(user)
      self.eventdripper.log(
         event_id=&quot;user_created&quot;,
         entity_id=user.id,
         data={&apos;name&apos;: user.name, &apos;email&apos;: user.email},
      )
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Although there’s a new name here that I haven’t introduced yet (eventdripper - yay naming!), there are no tricks and it should be fairly obvious that by logging the occurrence of an event instead of reacting to it immediately, we can move the responsibility of sending emails away from the place that the event naturally occurs. In this case, the responsibility has been moved to the mysterious Ms Eventdripper.&lt;/p&gt;

&lt;p&gt;Besides delegating responsibility, another benefit of logging events is that we no longer need to keep our handsy octopus on staff. Since eventdripper is given all information required to determine which event triggers to trigger, we no longer need an omniscient entity that can snoop on the existing data model to gather information about the current state of things. This also avoids the temptation of adding new fields to our data models just to satisfy the needs of our snoop.&lt;/p&gt;

&lt;p&gt;As you might have guessed from the poor naming, I’ve implemented a service that makes it easy to log events and react to them later. It tries to solve the problems described in this post, and it works for complex event triggers with restrictions on real-world timings. That service is called… Eventdripper!&lt;/p&gt;

&lt;h1 id=&quot;eventdripper&quot;&gt;Eventdripper&lt;/h1&gt;

&lt;p&gt;As indicated by the snippet above, the interface of eventdripper is dead simple:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; POST /event
 {
 	&quot;event&quot;: &quot;user_created&quot;,
 	&quot;entity_id&quot;: &quot;user-id&quot;,
 	&quot;data&quot;: { /* data relevant when reacting to the event */ }
 }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;All the information it needs to do its magic is:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;the name of the event that happened&lt;/li&gt;
  &lt;li&gt;a unique identifier for the entity the event relates to&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The third parameter, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;data&quot;&lt;/code&gt;, is an optional, opaque value that the consumer can use to add metadata needed when reacting to the event. In our example, since we’re sending an email, it might be nice to have the user’s name and email.&lt;/p&gt;

&lt;p&gt;In order to get data into eventdripper, we just have to send the above payload over our preferred transport (&lt;a href=&quot;https://blog.vbang.dk/2024/05/26/seb/&quot;&gt;Seb&lt;/a&gt; anyone?). Eventdripper then collects the events and shoves them into a database, indexing them on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;event&quot;&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;entity id&quot;&lt;/code&gt;. For the purposes of this post, the way data is transported and stored isn’t super important. As long as events are received in-order and the database allows fast lookup by event and entity id, we’re golden.&lt;/p&gt;

&lt;p&gt;With all of our events now happily inhabiting the databases of eventdripper, we have a new problem to solve: how do the users of eventdripper describe which sequences of events should satisfy a trigger? And, related to that, how does eventdripper decide whether the user’s description is satisfied by a given sequence of events? If you’re anything like me, requirements like these simply &lt;em&gt;beg&lt;/em&gt; for an implementation of a domain-specific language. This is the story of how driplang was born!&lt;/p&gt;

&lt;h1 id=&quot;driplang&quot;&gt;Driplang&lt;/h1&gt;

&lt;p&gt;Driplang is a tiny domain-specific language (DSL) inspired by boolean and temporal logic. The DSL makes it easy (okay, possible at least…) to define expressions that can either be satisfied or not by a given sequence of events. A driplang expression can’t be evaluated by itself, but must be evaluated against a sequence of events.&lt;/p&gt;

&lt;p&gt;Driplang has four operators: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AND&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OR&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt;. The only two possible outcomes of expression evaluation are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;true&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The driplang operators work just like you would expect them to in boolean logic, with the caveat that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; is (very) special.&lt;/p&gt;

&lt;p&gt;Let’s start by looking at a few simple boolean examples. The contents of this table shouldn’t be surprising if you already know boolean logic.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;expression&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;events&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;output&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AND&lt;/code&gt; B&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[A]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;false&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AND&lt;/code&gt; B&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[B]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;false&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AND&lt;/code&gt; B&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[B, A]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;true&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AND&lt;/code&gt; B&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[A, B]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;true&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AND&lt;/code&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT&lt;/code&gt; (B &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OR&lt;/code&gt; C))&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[A]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;true&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AND&lt;/code&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT&lt;/code&gt; (B &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OR&lt;/code&gt; C))&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[B, A]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;false&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AND&lt;/code&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT&lt;/code&gt; (B &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OR&lt;/code&gt; C))&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[D, A]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;true&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The important point to notice here is that the &lt;em&gt;order&lt;/em&gt; of events doesn’t matter for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AND&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OR&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As the name hopefully suggests, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; operator is needed when we require an ordering, e.g. if we only want our expression to be satisfied when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; happens before &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt;. In driplang that requirement would look like this: A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; B.&lt;/p&gt;
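&lt;p&gt;To make these semantics concrete, here’s a hypothetical evaluator sketch (driplang’s actual implementation isn’t shown in this post, so all names below are made up). &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AND&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OR&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT&lt;/code&gt; only ask &lt;em&gt;whether&lt;/em&gt; events occurred, while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; evaluates its right-hand side only against the events that occur &lt;em&gt;after&lt;/em&gt; its left-hand side is first satisfied:&lt;/p&gt;

```go
package main

import "fmt"

// Expr is a sketch of a driplang-like expression tree.
type Expr interface {
	Eval(events []string) bool
}

// Event is satisfied when the named event occurs anywhere in the sequence.
type Event string

func (e Event) Eval(events []string) bool {
	for _, ev := range events {
		if ev == string(e) {
			return true
		}
	}
	return false
}

type And struct{ L, R Expr }
type Or struct{ L, R Expr }
type Not struct{ X Expr }

func (a And) Eval(events []string) bool { return a.L.Eval(events) && a.R.Eval(events) }
func (o Or) Eval(events []string) bool  { return o.L.Eval(events) || o.R.Eval(events) }
func (n Not) Eval(events []string) bool { return !n.X.Eval(events) }

// Then is satisfied when its left side first becomes satisfied by some
// prefix of the events and its right side is satisfied by the remaining
// suffix. This is what makes A THEN (NOT B) false for [A, B]: once A has
// happened, B must not appear afterwards.
type Then struct{ L, R Expr }

func (t Then) Eval(events []string) bool {
	for i := 0; i <= len(events); i++ {
		if t.L.Eval(events[:i]) {
			return t.R.Eval(events[i:])
		}
	}
	return false
}

func main() {
	expr := Then{Event("A"), Not{Event("B")}}  // A THEN (NOT B)
	fmt.Println(expr.Eval([]string{"A"}))      // true
	fmt.Println(expr.Eval([]string{"A", "B"})) // false
}
```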

&lt;p&gt;Here’s a table to give you an intuition for how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; works. I left a tiny surprise for you at the end.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;expression&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;events&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;output&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; B&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[A, B]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;true&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; B&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[B, A]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;false&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT&lt;/code&gt; B)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[A]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;true&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT&lt;/code&gt; B)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[A, B]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;false&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOT&lt;/code&gt; B)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[A, C]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;true&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; (B &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WITHIN&lt;/code&gt; 2 days)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[A, B]&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;it depends&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Hopefully everything makes sense until the last expression in the table above.&lt;/p&gt;

&lt;p&gt;A minor but very important detail that I left out is that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; has an optional argument: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WITHIN&lt;/code&gt;. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WITHIN&lt;/code&gt; causes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; to consider real-world time: the events must not only arrive in the required order, they must also arrive within the given time constraint. The expression &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A THEN (B WITHIN 2 days)&lt;/code&gt; from the table above will thus only be satisfied if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; happened within 2 days of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;.&lt;/p&gt;
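To make the semantics concrete, here is a minimal Python sketch of THEN with an optional WITHIN window. It is not the actual driplang implementation; the function name and the (event, day) representation are made up for illustration.

```python
# Hypothetical sketch of THEN/WITHIN semantics; not the actual driplang code.
# Events are (name, day) pairs, ordered by arrival time.
def then_satisfied(events, first, second, within_days=None):
    for i, (name, day) in enumerate(events):
        if name == first:
            # Look for `second` arriving after `first`.
            for later_name, later_day in events[i + 1:]:
                if later_name == second:
                    if within_days is None:
                        return True
                    if within_days >= later_day - day:
                        return True
    return False


# A THEN B: order matters.
assert then_satisfied([("A", 0), ("B", 1)], "A", "B") is True
assert then_satisfied([("B", 0), ("A", 1)], "A", "B") is False

# A THEN (B WITHIN 2 days): "it depends" on the timestamps.
assert then_satisfied([("A", 0), ("B", 1)], "A", "B", within_days=2) is True
assert then_satisfied([("A", 0), ("B", 5)], "A", "B", within_days=2) is False
```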

&lt;p&gt;All the way in the beginning of this post, we talked about triggering on events based on real-world time: &lt;em&gt;“send an email to users who haven’t used feature X within their first week of signing up”&lt;/em&gt;. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WITHIN&lt;/code&gt; is the piece of the puzzle that allows driplang to handle this. We now know enough to express this as a driplang expression: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_created THEN ((NOT use_feature_x) WITHIN 7 days)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A great benefit of using a DSL to implement this is that it can be used two-fold: we can use it both to describe event triggers &lt;em&gt;and&lt;/em&gt; to evaluate them. And, since driplang is easily expressed as text (via a stupid-simple JSON format), we can easily store driplang expressions in a database, close to where the events we need to evaluate them on live.&lt;/p&gt;
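To give an idea of how little is needed, here is one hypothetical JSON shape for the signup expression, sketched in Python; the actual driplang wire format may well look different.

```python
import json

# One hypothetical JSON shape for the expression
# user_created THEN ((NOT use_feature_x) WITHIN 7 days);
# the real driplang format is not shown in the post and may differ.
expression = {
    "op": "THEN",
    "first": {"event": "user_created"},
    "second": {"op": "NOT", "arg": {"event": "use_feature_x"}},
    "within_days": 7,
}

# Because it is plain JSON, it round-trips through text and can be stored
# in an ordinary database column, next to the events it is evaluated on.
serialized = json.dumps(expression)
assert json.loads(serialized) == expression
```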

&lt;p&gt;As I hinted at in the beginning, this post ends at a point where the obvious next step is to make a visual tool that can generate driplang expressions behind the scenes. I’m no UX designer, but I imagine it might look something like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-06-19-eventdripper/boxes_and_arrows.png&quot; alt=&quot;Visual declaration of driplang expressions&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I think this could be helpful by letting non-developers, who often are the people that declare the trigger requirements anyway, be responsible for actually managing event triggers. This would leave developers “only” with the job of logging events and implementing the functionality that must be triggered. In my experience, the functionality to be triggered (sending non-spammy emails) can often be abstracted enough that developers don’t have to be part of this in the long run, e.g. using email templates with variables.&lt;/p&gt;

&lt;h2 id=&quot;performance-and-implementation&quot;&gt;Performance and implementation&lt;/h2&gt;

&lt;p&gt;In terms of performance, there are a few things we can do to optimize the scheduling of expression evaluation. For expressions that don’t contain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; operators with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WITHIN&lt;/code&gt; arguments, we only need to evaluate when a new event is added; that is the only time their output can change.&lt;/p&gt;

&lt;p&gt;Additionally, we only need to evaluate expressions that contain references to the event that just arrived. For example, the expression A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; B will not change if the event C arrives.
So even if we have loads of event triggers declared in eventdripper, waiting to potentially be triggered, we only have to evaluate the expressions that contain the new event. With a bit of semi-clever SQL, we can ensure that we only evaluate expressions when there’s a chance that the output changed.&lt;/p&gt;
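A sketch of that filtering idea, using an in-memory inverted index rather than SQL; the expression ids and event names are made up for illustration.

```python
from collections import defaultdict

# Map each expression id to the set of event names it references
# (hypothetical data for illustration).
expressions = {
    "expr1": {"A", "B"},
    "expr2": {"A", "C"},
    "expr3": {"D"},
}

# Build an inverted index: event name to the expressions mentioning it.
index = defaultdict(set)
for expr_id, events in expressions.items():
    for event in events:
        index[event].add(expr_id)


def to_evaluate(event_name):
    """Only these expressions can change output when `event_name` arrives."""
    return index.get(event_name, set())


assert to_evaluate("A") == {"expr1", "expr2"}
assert to_evaluate("C") == {"expr2"}
assert to_evaluate("Z") == set()  # unknown event: nothing to re-evaluate
```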

&lt;p&gt;The only type of expression we haven’t considered yet is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;THEN&lt;/code&gt; expressions with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WITHIN&lt;/code&gt; arguments. Here, it’s not only the arrival of new events that contributes to whether an expression is satisfied, but also the fact that time continues on its infinite march. I don’t currently see a way of doing this that doesn’t rely on a cron job having to run in the background, reevaluating expressions at some fraction of the interval of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WITHIN&lt;/code&gt;’s time constraint. If you’ve got any ideas for how this could work, do reach out!&lt;/p&gt;
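One way to bound how late such a trigger can fire is to run the cron job at some fraction of the smallest declared WITHIN constraint. A toy calculation with illustrative numbers only:

```python
# Pick a cron interval as a fraction of the smallest WITHIN constraint
# among the declared expressions; hours and the 0.5 fraction are
# arbitrary illustrative choices, not anything eventdripper prescribes.
def cron_interval_hours(within_constraints_hours, fraction=0.5):
    return min(within_constraints_hours) * fraction


# WITHIN constraints of 2 days, 1 day, and 7 days, expressed in hours:
assert cron_interval_hours([48, 24, 168]) == 12.0
```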

&lt;p&gt;Although the implementation of eventdripper and driplang is interesting to discuss, I’ll leave these details for another blog post. For driplang, however, I will say that it requires surprisingly little and rather simple code, especially considering that it allows us to both describe and evaluate rather complex event triggers which otherwise have a tendency to turn into a big ball of mud.&lt;/p&gt;

&lt;h1 id=&quot;heading-back-to-the-surface&quot;&gt;Heading back to the surface&lt;/h1&gt;

&lt;p&gt;Having just been introduced to eventdripper and driplang, you might be thinking that this looks like an overly complex solution to something that in many cases can be solved much more simply, with code closer to the first snippet I showed. In situations where your needs are simple and you don’t expect to need more advanced triggers, I will most likely agree with you. In general we should not waste time overengineering things that we will not need.&lt;/p&gt;

&lt;p&gt;The one thing I hope you take away from this: once your event triggering needs become non-trivial, I think decoupling the code that tracks and the code that reacts to events is definitely worth your while. Whether you use a DSL to implement this is another discussion. So far, it has served me well and helped solve exactly the problems I set out to solve. I’m very happy with the results!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the post resonated with you and you are looking for someone to help you to do hard things with a computer, you can &lt;a href=&quot;/hire_me.html&quot;&gt;hire me&lt;/a&gt; to help you!&lt;/strong&gt;&lt;/p&gt;
</description>
        <pubDate>Wed, 19 Jun 2024 14:20:00 +0000</pubDate>
        <link>https://blog.vbang.dk/2024/06/19/eventdripper/</link>
        <guid isPermaLink="true">https://blog.vbang.dk/2024/06/19/eventdripper/</guid>
      </item>
    
      <item>
        <title>Data exploration using VIM</title>
        <description>&lt;p&gt;I’ve used vim and/or vim bindings for the better part of 10 years. But apparently there’s this tiny piece of magic that has completely escaped me all this time.&lt;/p&gt;

&lt;p&gt;About half a year ago I received a tip from a good friend (thanks Jörn ❤️) that I kind of forgot about and never took the time to actually try out.&lt;/p&gt;

&lt;p&gt;Then, this week I had to do a bunch of random data exploration and, luckily, it somehow jumped back into my brain. Just this week I’ve saved countless hours looking through gigabytes and gigabytes of sketchy data from the Danish Business Authority. Public data is &lt;em&gt;awesome&lt;/em&gt;, but the quality of that data? Often not so much :(&lt;/p&gt;

&lt;p&gt;Anyway. The tip is this: you can use the vim command (is it called that?) &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:%! [cmd]&lt;/code&gt; to invoke CLI programs on data in vim’s buffer. That’s it. It’s crazy powerful and I love it.&lt;/p&gt;

&lt;p&gt;I’ve made a 3.5 minute screencast of how this looks in practice.
If you don’t want to watch the video, I’ll give you a short description of how it works below.&lt;/p&gt;

&lt;script src=&quot;https://asciinema.org/a/662066.js&quot; id=&quot;asciicast-662066&quot; async=&quot;true&quot;&gt;&lt;/script&gt;

&lt;p&gt;For example, let’s say you have the following in your buffer:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;9msAkRIqstFcQAdfpvFZqgWGPBbReNS
3JEFbIfJuIGZBZodTONfnzyCykPtsBR
4KdSIqYYlDEIxpGiHFbRpqiZsFlgLxL
7UNqzFGgxEkzfzWLdTSKabDsUtTcSDs
5IqHRWKquwsekkritCxsnInXbsPeLvx
2ZdEuPTvYKFXNpOkhOytByqaDUQRSQI
0UreGiTTUnRxxrtNtaBfNYfbDhDlKwJ
1aOaHMrQzwGFjFtmwcPwdTfKVwteivR
6abgfdynLiidyiSBPUVMbkhKEsJMNVy
4doltlrfrOLmkuvCdVyJzqZRGkCOzkD
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:%! sort&lt;/code&gt; in vim will pipe the data from our buffer into sort and put it back in the buffer:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;0UreGiTTUnRxxrtNtaBfNYfbDhDlKwJ
1aOaHMrQzwGFjFtmwcPwdTfKVwteivR
2ZdEuPTvYKFXNpOkhOytByqaDUQRSQI
3JEFbIfJuIGZBZodTONfnzyCykPtsBR
4doltlrfrOLmkuvCdVyJzqZRGkCOzkD
4KdSIqYYlDEIxpGiHFbRpqiZsFlgLxL
5IqHRWKquwsekkritCxsnInXbsPeLvx
6abgfdynLiidyiSBPUVMbkhKEsJMNVy
7UNqzFGgxEkzfzWLdTSKabDsUtTcSDs
9msAkRIqstFcQAdfpvFZqgWGPBbReNS
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can continue doing this as much as we like, using all of our normal CLI tools, e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:%! grep &quot;sekkrit&quot;&lt;/code&gt;&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;5IqHRWKquwsekkritCxsnInXbsPeLvx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And, the best part, because we’re in vim, we can undo and redo all of the commands that we run, retry failed commands (remember to add those pesky quotes around spaces for grep!!), search and replace, and the list goes on. You’re only limited by your imagination and the tools you have available on the CLI.&lt;/p&gt;

&lt;p&gt;So, now you also know. Go spread the word!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the post resonated with you and you are looking for someone to help you to do hard things with a computer, you can &lt;a href=&quot;/hire_me.html&quot;&gt;hire me&lt;/a&gt; to help you!&lt;/strong&gt;&lt;/p&gt;
</description>
        <pubDate>Sat, 01 Jun 2024 17:08:00 +0000</pubDate>
        <link>https://blog.vbang.dk/2024/06/01/vim-data-exploration/</link>
        <guid isPermaLink="true">https://blog.vbang.dk/2024/06/01/vim-data-exploration/</guid>
      </item>
    
      <item>
        <title>Hello World, Simple Event Broker!</title>
        <description>&lt;p&gt;For various side projects I’ve worked on, I’ve wanted to introduce event queues in order to simplify some things. Normally, I just go with the “one DB to rule them all”, and shove things into Postgres. Sometimes though, the workload becomes too much and the burst- and credit balance of my puny RDS instances start looking like ski slopes that would kill most skiers.&lt;/p&gt;

&lt;p&gt;Every time this has happened I’ve looked into hosting or renting actual event queuing systems, but never found anything that fit the bill: dedicated event queuing systems are built to scale to insane workloads with the smallest latency possible and, to me at least, they all either seemed like a handful to self-host or were too expensive to rent. I just needed something that would not lose my data if the VM and/or its disk died, something that would run on tiny, cheap hardware, and was able to put up with a reasonable amount of load. I took some time off recently and thought a fun way to spend some of this time would be to build a system that matches these requirements.&lt;/p&gt;

&lt;p&gt;So, I started work on &lt;a href=&quot;https://github.com/micvbang/simple-event-broker&quot;&gt;Seb&lt;/a&gt; (Simple Event Broker. Yay naming!)&lt;/p&gt;

&lt;h2 id=&quot;goals-and-status&quot;&gt;Goals and status&lt;/h2&gt;

&lt;p&gt;Seb is an event broker designed with the goals of being 1) cheap to run 2) easy to manage 3) easy to use, in that order. It actually has “don’t lose my data” as the very first goal on that list, but I wanted a list of three, and I thought not losing data reasonably could be assumed to be table stakes. Let’s call it item 0.&lt;/p&gt;

&lt;p&gt;Seb explicitly does not attempt to reach sub-millisecond latencies nor scale to fantastic workloads. If you need this, there are systems infinitely more capable, designed for exactly these workloads, and which handle them &lt;em&gt;very&lt;/em&gt; well. See Kafka, Redpanda, RabbitMQ et al.&lt;/p&gt;

&lt;p&gt;In order to reach the goals of being both cheap to run and easy to manage, Seb embraces the fact that writing data to disk and &lt;em&gt;ensuring that data is actually written and stays written&lt;/em&gt; is rather difficult. It utilizes the hundreds of thousands of engineering hours that were poured into object stores and pays the price of latency at the gates of the cloud vendors. For the use cases I have in mind, this trade-off is perfect; it gives me reasonable throughput at a (very) low price.&lt;/p&gt;

&lt;p&gt;I expect the target audience for a system like this will be small and niche. Who knows? Maybe there are more people like me who need event queues but aren’t rich enough to rent them!&lt;/p&gt;

&lt;p&gt;Anyway, working on Seb has been a lot of fun and it solves exactly the problem I was looking to solve. It’s by no means “done” yet (is anything ever?), but it’s currently in a state where I can use it for what I need to. There’s of course loads of stuff I’d love to add and improve; only supporting a single, static API-key for authentication, for instance, is laughable. But things take time and this is how far I’ve come.&lt;/p&gt;

&lt;h2 id=&quot;architecture&quot;&gt;Architecture&lt;/h2&gt;

&lt;p&gt;Although Seb doesn’t have a clever play on words including “go” in its name, it’s written in Go. I kinda want to evolve it to be embeddable (even easier to manage when it lives &lt;em&gt;inside&lt;/em&gt; your application!), but for now I’ve hidden everything from the public in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;internal/&lt;/code&gt; folder so that I don’t have to play nice with anyone that might be foolish enough to try and use it just yet. It’s currently &lt;em&gt;very&lt;/em&gt; actively under development, and I might change anything at any time. Force-push-to-master kind of active; be warned!&lt;/p&gt;

&lt;p&gt;Seb is split into three main parts: the &lt;em&gt;&lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/master/internal/sebbroker/broker.go#L27&quot;&gt;Broker&lt;/a&gt;&lt;/em&gt;, which is responsible for managing and multiplexing Topics; the &lt;em&gt;&lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/master/internal/sebtopic/topic.go#L38&quot;&gt;Topic&lt;/a&gt;&lt;/em&gt;, which is responsible for persisting data to the underlying storage; and the &lt;em&gt;&lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/master/internal/sebcache/cache.go#L22&quot;&gt;Cache&lt;/a&gt;&lt;/em&gt;, which is responsible for caching data locally so that we can minimize the number of times we pass through the gates of the cloud vendors, saving both latency and cash money. This is shown below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-05-26-seb/architecture.png&quot; alt=&quot;Seb high-level architecture&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The Broker assumes that data is durably persisted when a Topic’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AddRecords()&lt;/code&gt; method returns. As might be legible from my doodles above, Topic currently has three different storage backends: &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/master/internal/sebtopic/s3storage.go#L21&quot;&gt;S3&lt;/a&gt;, &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/master/internal/sebtopic/diskstorage.go#L16&quot;&gt;local disk&lt;/a&gt;, and &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/master/internal/sebtopic/memorystorage.go#L17&quot;&gt;local memory&lt;/a&gt;. S3 is the only one that anyone should trust with production data (remember I said that writing to disk is hard?). Disk and memory are super-duper only to be used for data that you don’t care about. Pinky-promises required before use!&lt;/p&gt;

&lt;p&gt;The simple but important realization I had when initially trying to design Seb on paper was that if I can trust the cloud vendors’ object stores that a file is durably stored once they’ve given me a 200 OK, the hardest part of the system (besides concurrency?) wouldn’t have to be handled by me. With this assumption it’s a non-event in terms of durability if my VM or local disk dies during operation. The data lives on in the skies and no caller believes that they have added data to the queue which wasn’t actually added. Argument for why this last part is true coming right up!&lt;/p&gt;

&lt;h2 id=&quot;durability-and-latency-money-trade-off&quot;&gt;Durability and latency-money trade-off&lt;/h2&gt;

&lt;p&gt;In order to not have to wait a full roundtrip every time we write data to S3 (and to save money on the $0.005-per-1,000-requests of S3!) we collect records in batches before sending them off to S3. Whenever “the first” record of a batch comes in the door, Seb will wait for a configurable amount of time in the hope that more records will arrive and can be included in the batch. Callers are blocked while waiting for the batch to finish. This is a very direct trade-off between money and latency, and your specific situation will dictate how long it makes sense to wait. Once the wait time has expired, Seb will attempt to write the accumulated records to S3. Only when we’ve gotten our response from the S3 API do we tell the callers whether their request succeeded or not. If it succeeded we send them the offset of their record, and if not we send them an error. This is &lt;em&gt;it&lt;/em&gt;: the main argument that Seb won’t lose our data. There are of course still a lot of other ways that things can go wrong, but, in terms of durability, this is the central argument: Seb only tells callers that their data has been persisted once it has gotten a 200 OK from S3.&lt;/p&gt;
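The batching behaviour described above can be sketched deterministically, ignoring concurrency. The names, the timestamp representation, and the window length are made up; the real BlockingBatcher blocks concurrent callers rather than processing a pre-sorted list.

```python
# Toy sketch of the batching trade-off: the first record opens a collection
# window, and every record arriving inside that window joins the same batch.
# Purely illustrative; not the actual Seb implementation.
def batch_records(arrivals, window):
    """arrivals: list of (timestamp, record) pairs, sorted by timestamp."""
    batches = []
    current = []
    deadline = None
    for ts, record in arrivals:
        if deadline is None:
            deadline = ts + window  # first record opens the window
        if ts > deadline:
            batches.append(current)  # flush: one S3 upload per batch
            current = []
            deadline = ts + window
        current.append(record)
    if current:
        batches.append(current)
    return batches


# Records at t=0 and t=5 share a window of length 10; t=20 starts a new one.
assert batch_records([(0, "a"), (5, "b"), (20, "c")], 10) == [["a", "b"], ["c"]]
```

A longer window means fewer S3 requests (cheaper) but higher latency for every blocked caller, which is exactly the money-versus-latency trade-off described above.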

&lt;p&gt;You might have noticed that it’s still possible that Seb will crash in the time between getting a 200 OK from S3 and replying to the caller. In this situation the data &lt;em&gt;has&lt;/em&gt; been added to the queue, and can be retrieved by consumers, but the caller has no way of knowing. So, if the caller really cares about adding their data to the queue, they will retry the call and the data will be added twice. In fancy systems lingo we would say that the producer has “at-least-once” delivery semantics. This problem is somewhat easily circumvented: if producers include a unique id in each record, consumers can use this to ignore records they’ve already handled. It would of course also be possible to handle this directly in Seb, but that would require that all producers include a unique ID for every record, and that Seb has some way of keeping track of which IDs were already added. In order to keep Seb simple, this is not a goal.&lt;/p&gt;
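A consumer-side deduplication sketch of that idea, with hypothetical names; a real consumer would keep the seen-set in durable storage rather than in memory.

```python
# Deduplicate at-least-once delivery: producers attach a unique id to each
# record, and consumers skip ids they have already seen (illustrative only).
def consume(records, handle, seen=None):
    seen = set() if seen is None else seen
    handled = []
    for record in records:
        if record["id"] in seen:
            continue  # duplicate caused by a producer retry: already handled
        seen.add(record["id"])
        handled.append(handle(record))
    return handled


# The retried record with id=1 is handled exactly once.
out = consume(
    [{"id": 1, "data": "x"}, {"id": 1, "data": "x"}, {"id": 2, "data": "y"}],
    lambda r: r["data"],
)
assert out == ["x", "y"]
```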

&lt;p&gt;The strategy for batching records is configurable and hidden behind the &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/master/internal/sebbroker/broker.go#L17&quot;&gt;RecordBatcher&lt;/a&gt; interface. The strategy described above is implemented as &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/master/internal/sebbroker/blockingbatcher.go#L39&quot;&gt;BlockingBatcher&lt;/a&gt;. There’s also a batching strategy called &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/master/internal/sebbroker/nullbatcher.go#L12&quot;&gt;NullBatcher&lt;/a&gt; which doesn’t do any batching, and just sends records straight through to S3, creating and uploading one file per record. This is mostly useful for testing.&lt;/p&gt;

&lt;h2 id=&quot;data-layout&quot;&gt;Data layout&lt;/h2&gt;

&lt;p&gt;The data format used in a system like this can have a large impact on read and write performance. I initially looked around for existing file formats to use but didn’t manage to find any that would be particularly helpful. Instead, I came up with the simplest and stupidest file format that I thought would work, which would be fast and simple to both write and parse. I started out being kinda inspired by LSM trees, but since I’ve yet to implement support for record keys, I’ve done nothing of the sort. It’s just a tiny header concatenated with pointers into raw record data. Oh, and files are immutable, so they’re infinitely cacheable and only ever have to be “constructed” once.&lt;/p&gt;

&lt;p&gt;This is what the format looks like:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-05-26-seb/file_format.png&quot; alt=&quot;Seb file format&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As I’ve tried to show in the visualization, the file format has three sections:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;header (32 bytes)&lt;/li&gt;
  &lt;li&gt;pointers to each record (N * 4 bytes)&lt;/li&gt;
  &lt;li&gt;record data (however much data the records are)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For anyone who has tried to come up with a custom file format before, one of the things you’re likely to learn the hard way is that you should include a version number in the header. It’s unlikely we’ll get the file format right on the first try, and adding a version number will give us the opportunity to change the format in the future while keeping the parser code compatible with old versions without too many hacks: read the header and dispatch based on the version number.&lt;/p&gt;

&lt;p&gt;The static part of the &lt;a href=&quot;https://github.com/micvbang/simple-event-broker/blob/master/internal/sebrecords/records.go#L24&quot;&gt;header&lt;/a&gt; is declared as follows:&lt;/p&gt;

&lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Header&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;MagicBytes&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;byte&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;Version&lt;/span&gt;     &lt;span class=&quot;kt&quot;&gt;int16&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;UnixEpochUs&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int64&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;NumRecords&lt;/span&gt;  &lt;span class=&quot;kt&quot;&gt;uint32&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;Reserved&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;byte&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It weighs in at 32 bytes and dictates that each file can contain a maximum of 2^32 records (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NumRecords&lt;/code&gt; is uint32). Each offset into the file is given as a uint32, so the maximum offset into the file we can point to is 4GB. Both of these numbers are obviously &lt;em&gt;way&lt;/em&gt; larger than we are likely to want to use in practice. We want to keep the size of each file reasonably small so that it’s not too expensive to fetch it from S3 if we don’t have it in the local cache, but at the same time we don’t want it to be too small because this would mean that we have to go to S3 more often. Trade-offs everywhere!&lt;/p&gt;
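The 32-byte figure can be sanity-checked by mirroring the Go struct above in Python's struct notation. This assumes standard sizes with no padding; the exact byte order and encoding on disk are whatever Seb's Go code uses, so treat this purely as an arithmetic check.

```python
import struct

# Mirror of the header: 4s magic, h int16 version, q int64 timestamp,
# I uint32 record count, 14s reserved. "=" means standard sizes, no padding.
HEADER_FORMAT = "=4shqI14s"
assert struct.calcsize(HEADER_FORMAT) == 32  # 4 + 2 + 8 + 4 + 14

packed = struct.pack(HEADER_FORMAT, b"seb!", 1, 1716897600000000, 3, bytes(14))
magic, version, unix_epoch_us, num_records, _ = struct.unpack(HEADER_FORMAT, packed)
assert (magic, version, num_records) == (b"seb!", 1, 3)
```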

&lt;p&gt;Let’s see what everything looks like when we create a file with a few records. I’ll do the example in human readable format so that you don’t have to dust off the good-ol’ ASCII chart.&lt;/p&gt;

&lt;p&gt;Here’s our file:&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Data                   Field        Size      File offset
----------------------------------------------------------
seb!                 Magic bytes   4 bytes        0
1                    Version       2 bytes        4
2024-05-28 12:00:00  UnixEpochUs   8 bytes        6
3                    NumRecords    4 bytes       14
00000000000000       Reserved     14 bytes       18
44                   Index0        4 bytes       32
61                   Index1        4 bytes       36
79                   Index2        4 bytes       40
first-record-data    Data         17 bytes       44
second-record-data   Data         18 bytes       61
third-record-data    Data         17 bytes       79
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As is hopefully clear from the above snippet, the three records we added to the file contain the rather boring data “first-record-data”, “second-record-data”, and “third-record-data”.&lt;/p&gt;

&lt;p&gt;The first step of reading back records from our file is to read the static part of the header, namely the first 32 bytes. Having read this, we can verify that the magic bytes (“seb!”) and the version number (1) match our expectations and, additionally, we have information on how many records the file contains (3). The second step is to use the number of records to calculate the size of the file’s index (3 records * 4 bytes). Now, having read both the header and the index, we know exactly where each record starts and ends.&lt;/p&gt;

&lt;p&gt;In order to read the second record, for example, we look up entry 1 in our index, which is zero-indexed. Looking at Index1 in our file, we see that the record starts at file offset 61. We can tell the length of our record by looking up the offset of the next one and subtracting the two: 79 - 61. We now know that our record starts at file offset 61 and is 18 bytes long; the code has been cracked and we can continue our adventure!&lt;/p&gt;
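The lookup walked through above, as a tiny Python helper. The offsets are taken from the worked example; the helper name is made up.

```python
# A record's length is the next index entry minus its own start offset;
# the last record runs to the end of the file. Illustrative sketch only.
def record_extent(index, file_size, n):
    """Return (start, length) of record n (zero-indexed)."""
    start = index[n]
    if n + 1 == len(index):
        end = file_size
    else:
        end = index[n + 1]
    return start, end - start


index = [44, 61, 79]  # start offsets of the three example records
assert record_extent(index, 96, 0) == (44, 17)  # first-record-data
assert record_extent(index, 96, 1) == (61, 18)  # second-record-data
assert record_extent(index, 96, 2) == (79, 17)  # third-record-data
```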

&lt;h2 id=&quot;benchmarking&quot;&gt;Benchmarking&lt;/h2&gt;

&lt;p&gt;This post has already become way too long. If you’re still reading: well done! We’re almost through. If you’re out of breath and need to take a break: I hear you. Go lie down. But, if you want to finish this before doing so, I’ve written a summary TLDR below. If you don’t want the spoiler, quickly cover your screen and scroll past the following handful of lines!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TLDR Summary&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Hardware: Hetzner CAX11, 2 core ARM Ampere, 4GB memory&lt;/li&gt;
  &lt;li&gt;Seb configuration: batch collection time: 10ms&lt;/li&gt;
  &lt;li&gt;Each test sends 100k records&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Requests are sent from T14 laptop on fiber in Copenhagen, Denmark to CAX11 in Falkenstein, Germany&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Max performance non-batched: 22k requests/s with 4800 workers (1 record/request)&lt;/li&gt;
  &lt;li&gt;Max performance batched*: 50k requests/s with 600 workers (32 records/request)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that I’ve spent some time building and discussing Seb, I thought it would be nice to understand how it behaves if we put it under a bit of stress. These benchmarks aren’t going to be particularly scientific. I’m aiming for getting an overall feeling for what this thing can do, not winning benchmark of the year. Each test in the following data was run just once, so you don’t have to look at those pesky error bars. Yes, I know. You’re welcome.&lt;/p&gt;

&lt;p&gt;Since Seb was designed to be cheap to run, I wanted to try it out on a cheap machine. At €4.51/month, Hetzner’s CAX11 ARM VMs are exactly what I’m looking for. They come with 2 ARM Ampere cores and 4GB memory. Hetzner provide no specs on their disks, but do state the following&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;They are optimized for high I/O performance and low latency and are especially suited for applications which require fast access to disks with low latency, such as databases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I expect the latency to AWS to be the dominating factor in this test anyway, so the performance of the disk shouldn’t matter &lt;em&gt;too much&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Since we’re going for speed in these benchmarks, I decided to set the batch collection time low at 10ms. This means that, whenever the first request comes in, Seb will collect all incoming requests for the next 10ms into a batch. Once the batch is collected, Seb writes it to a file and sends it to S3 before putting it into the local disk-cache.&lt;/p&gt;

&lt;p&gt;An important detail: since Seb blocks callers while collecting a batch, we have to send a lot of HTTP requests in parallel in order to be able to saturate the system.&lt;/p&gt;

&lt;h3 id=&quot;graphs-and-numbers&quot;&gt;Graphs and numbers&lt;/h3&gt;

&lt;p&gt;The first graph we’re going to look at is runtime vs number of workers for different payloads.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-05-26-seb/benchmark_time_vs_workers.png&quot; alt=&quot;Time vs workers&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We see that it’s faster to use more workers, but that the returns of adding more workers start diminishing at around 1200. I speculate that our small 2-core server starts to buckle at the knees because of the overhead of handling that many HTTP connections simultaneously.&lt;/p&gt;

&lt;p&gt;On the above graph we also see that it’s generally slower to send requests with larger payloads, but that requests of size &amp;lt;= 1024 bytes are roughly the same. This makes sense since we aren’t even filling up our Ethernet packets at this point.&lt;/p&gt;

&lt;p&gt;The next graph is requests/second vs workers for different record sizes.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-05-26-seb/benchmark_requestsps_vs_workers.png&quot; alt=&quot;Requests/second vs workers&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here, we see the maximum number of requests/second hit ~20k for record sizes 64 and 256 bytes. I can’t come up with a reason why 256 bytes should be faster than 64, so I’m going to assume that this is just noise. After all, we are running this on a shared VM and giving it a bit of a hard time. See, I promised: no error bars!&lt;/p&gt;

&lt;p&gt;Starting at 1200 workers, we see that the requests/second drops by roughly half with a quadrupling of the record size. This is another indication to me that we have found the point at which we’re starting to confuse our hardworking CAX11 with the sheer number of requests we’re sending to it. If &lt;em&gt;only&lt;/em&gt; the record size had been the bottleneck, I would expect the number of requests/second to drop by something closer to a factor of four. Another way to look at this: ~3000 requests/second at 16kb/request is around 375 mbit/s, whereas ~7k requests/second at 4kb/request is around 220 mbit/s. Even though the number of requests is much lower, we’re still pushing almost double the amount of data through with our 16kb payload. The record size does seem to have an impact, though, which we can see from how the graphs flatten out a lot quicker for the higher record sizes.&lt;/p&gt;
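&lt;p&gt;If you want to check the back-of-the-envelope arithmetic, it’s a couple of lines of Python. Note the assumption that mbit here means binary megabits (2^20 bits), which is what matches the figures above:&lt;/p&gt;

```python
def mbit_per_second(requests_per_second, bytes_per_request):
    # 8 bits per byte; divide by 2**20 for (binary) megabits.
    return requests_per_second * bytes_per_request * 8 / 2**20

print(round(mbit_per_second(3000, 16 * 1024)))  # 16kb records
print(round(mbit_per_second(7000, 4 * 1024)))   # 4kb records
```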

&lt;p&gt;I didn’t really plan to benchmark any further, but after finding that we’re probably saturating the server with the number of requests rather than the amount of data we’re pushing through, I decided to do one more benchmark. This time I’m using Seb’s batch API, which allows us to queue multiple records per request.&lt;/p&gt;

&lt;p&gt;The final graph shows us records/second vs workers, for batch sizes of 1 and 32 with a record size of 1kb.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/posts/2024-05-26-seb/benchmark_batch_recordsps_vs_workers.png&quot; alt=&quot;Records/second vs workers&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As we would expect from our above analysis, the graph shows that the number of records/second increases dramatically (more than doubling from ~22k to ~50k!) when records are batched. On the graph, we also see that the system starts to deteriorate at 1200 workers. This matches our previous observations. I believe the main difference now is that we’re not just stressing it with the sheer number of requests, but also giving it more work per request than it has time to handle. The system simply can’t keep up anymore and performance starts to degrade.&lt;/p&gt;

&lt;p&gt;Alright, that’s it, folks! I must say I’m pretty happy with how much work we can push through this system. ~22k and ~50k records/second is a lot more than I expect to need in the foreseeable future. Turns out that Seb packs a decent punch!&lt;/p&gt;

&lt;h2 id=&quot;todos-and-missing-features&quot;&gt;TODOs and missing features&lt;/h2&gt;

&lt;p&gt;There’s still a bunch of things I’d love to work on to improve Seb. I’ve spent too much time writing the above, so I’ll just outline the TODOs and missing features in a bullet list below. Perhaps some of these will be the topic of another post?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Authentication
    &lt;ul&gt;
      &lt;li&gt;currently only supports a single, deployment-wide API key&lt;/li&gt;
      &lt;li&gt;considering: certificate-based authentication&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;keep state
    &lt;ul&gt;
      &lt;li&gt;probably sqlite&lt;/li&gt;
      &lt;li&gt;track consumer offsets&lt;/li&gt;
      &lt;li&gt;track record keys&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;record keys
    &lt;ul&gt;
      &lt;li&gt;compaction&lt;/li&gt;
      &lt;li&gt;history of values for key&lt;/li&gt;
      &lt;li&gt;iterate over all keys&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;clean up old data
    &lt;ul&gt;
      &lt;li&gt;LSM compaction (requires record keys)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If the post resonated with you and you are looking for someone to help you to do hard things with a computer, you can &lt;a href=&quot;/hire_me.html&quot;&gt;hire me&lt;/a&gt; to help you!&lt;/strong&gt;&lt;/p&gt;
</description>
        <pubDate>Sun, 26 May 2024 16:20:00 +0000</pubDate>
        <link>https://blog.vbang.dk/2024/05/26/seb/</link>
        <guid isPermaLink="true">https://blog.vbang.dk/2024/05/26/seb/</guid>
      </item>
          <item>
            <title>About</title>
            <description>&lt;p&gt;I’m Michael, based in Copenhagen, Denmark. I’ve made a living as a programmer since 2007 and I now run a consultancy that helps customers do things with computers.&lt;/p&gt;

&lt;p&gt;Since you’re reading this, I’m assuming that you read one of my blog posts and it resonated with you somehow. That makes me happy! Or, perhaps you’re furious and need to tell me how wrong I am. In that case, do tell!&lt;/p&gt;

&lt;p&gt;You can hire me on an hourly, weekly, or monthly basis. Depending on the type of project, I’m also available for longer contracts.&lt;/p&gt;

&lt;p&gt;My contact information is at &lt;a href=&quot;#contact&quot;&gt;the bottom of the page&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;experience&quot;&gt;Experience&lt;/h2&gt;

&lt;p&gt;I love working behind the scenes, doing backend, infrastructure, and systems programming. Basically anywhere that maintainability, correctness, and performance are important factors, and where people mostly expect things to Just Work™.&lt;/p&gt;

&lt;p&gt;I have experience from the following industries:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Health care (Novo Nordisk, Adent Health)&lt;/li&gt;
  &lt;li&gt;Semiconductors/systems programming research (Samsung Research)&lt;/li&gt;
  &lt;li&gt;Banking (Danske Bank)&lt;/li&gt;
  &lt;li&gt;Gaming (noesis.gg)&lt;/li&gt;
  &lt;li&gt;Real estate (Ejendomstorvet)&lt;/li&gt;
  &lt;li&gt;Consulting (Eksponent, Big Bang Holding)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On a contract, I’m happy to help in any way that I can: architecting, writing code, mentoring, doing code reviews, setting up CI/CD pipelines. All of it is important and required for teams to be great. I’m pragmatic, open, easy-going, and I love keeping things light and fun. I’m professional and I’m on time.&lt;/p&gt;

&lt;p&gt;I care deeply about, and have proven experience with:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;technical project leadership&lt;/li&gt;
  &lt;li&gt;mentoring&lt;/li&gt;
  &lt;li&gt;designing and implementing greenfield projects&lt;/li&gt;
  &lt;li&gt;writing testable, maintainable code&lt;/li&gt;
  &lt;li&gt;writing tests that actually provide value&lt;/li&gt;
  &lt;li&gt;improving maintainability and testability of existing systems&lt;/li&gt;
  &lt;li&gt;using performance profiling to guide development&lt;/li&gt;
  &lt;li&gt;doing code reviews&lt;/li&gt;
  &lt;li&gt;technical writing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m also experienced with technical reviewing, having reviewed:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Matt Boyle’s &lt;a href=&quot;https://www.bytesizego.com/the-ultimate-guide-to-debugging-with-go-book&quot;&gt;Foundations of Debugging with Go&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Inanc Gumus’s &lt;a href=&quot;https://www.manning.com/books/go-by-example&quot;&gt;Go by Example&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Bartlomiej Plotka’s &lt;a href=&quot;https://www.oreilly.com/library/view/efficient-go/9781098105709/&quot;&gt;Efficient Go&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;William Kennedy’s &lt;a href=&quot;https://education.ardanlabs.com/courses/ultimate-go-notebook&quot;&gt;Ultimate Go Notebook&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;You don’t have to take my word for it. Below are references from previous employers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/static/references/novo_nordisk_joanna_sharman_soares.pdf&quot;&gt;2023-2024 Novo Nordisk&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/static/references/foss_nicolas_arogvi.pdf&quot;&gt;2023 FOSS&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/static/references/ejendomstorvet_jonas_krat.pdf&quot;&gt;2022 Ejendomstorvet&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/static/references/samsung_javier_gonzalez.pdf&quot;&gt;2021 Samsung&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;2018-2020 Co-founding &lt;a href=&quot;https://noesis.gg&quot;&gt;noesis.gg&lt;/a&gt; and founding &lt;a href=&quot;https://cvr.dev&quot;&gt;cvr.dev&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/static/references/danske_bank_jacob_avlund.pdf&quot;&gt;2017-2018 Danske Bank&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/static/references/eksponent_christian_dalager.pdf&quot;&gt;2014-2017 Eksponent&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;contact&quot;&gt;Contact&lt;/h2&gt;

&lt;p&gt;You can contact me here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Email: &lt;a href=&quot;mailto:project@vbang.dk&quot;&gt;project@vbang.dk&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;LinkedIn: &lt;a href=&quot;https://www.linkedin.com/in/micvbang&quot;&gt;micvbang&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Twitter: &lt;a href=&quot;https://x.com/micvbang&quot;&gt;@micvbang&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
            <link>https://blog.vbang.dk/about.html</link>
          </item>
  </channel>
</rss>