
Optimizing a bounded-memory KV store

For context, I got nerdsniped by this.

fig (aka:[phil]):

[rabbit hole bait] bleh just seems like a disk thing for a - immutable map - no concurrency - bounded memory - durability and crash consistency *not* required should be able to be really fast


See this thread for the specific thoughts prior to this doc:

Sekoia:

I feel like doing some math on this. How big are your values?


In this doc, I'm gonna ramble about figuring out the performance of my algorithm (and, if I have the energy, of fig's generalized initial version).

I have not reread this and I'm posting at 4am.

Let's start by setting the stage:

We want a key-value store in which keys are 64 bits, and values are dynamically sized with an average of 1 KiB and a maximum of a couple MiB

We have in the order of 100 million keys, perfectly uniformly distributed (I might look at the 1-billion key case as an extreme case)

All reads are randomly uniformly distributed, and are all hits

All values get inserted first, then all the reads happen. There are no updates (yay!)

We cannot use more than ~100 MiB (I'm going to use this as a strict limit, and pretty much as a minimum too)

We don't have to think about concurrency or crash consistency (phew)

Our performance is measured by the number of 4 KiB disk reads, without worrying about the page cache or sequential disk reads

There's a short TLDR at the end.

Finding the right algorithm

You can skip this part if you're not interested in re-deriving the algorithm.

As a rule of thumb, we can eliminate a few approaches:

We have about 100 GiB of data. Even if we used all our RAM as a cache, we'd only get a 0.1% hit rate, so values cannot be stored in memory, and since reads are uniformly random, there's no point in trying to cache them. Also, if we stored our data on disk linearly, we could only fit about 3 pairs per page on average, and sometimes only a fraction of one. It is therefore reasonable to store values in their own region of the disk and simply keep 64-bit pointers into that region. Our key-value pairs are now 128 bits (64 bits each).

Since keys are uniformly distributed, any approach that uses sparsity or closeness of keys and reads as an optimization is right out; our RAM budget is nowhere near enough for that to be worth it probabilistically. In particular, this eliminates B-trees as the optimal choice.
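As a rough sketch of that value-region idea (the file handle, the function name, and the length-prefix framing are my assumptions, nothing above fixes them): values get appended to one big file, and the byte offset becomes the 64-bit pointer stored next to the key.

use std::fs::File;
use std::io::{self, Seek, SeekFrom, Write};

/// Append a value to the value region and return the 64-bit pointer
/// (here simply the byte offset) that gets stored alongside the key.
fn append_value(values: &mut File, value: &[u8]) -> io::Result<u64> {
  let offset = values.seek(SeekFrom::End(0))?;
  // Length prefix so a reader knows how many bytes to pull back out.
  values.write_all(&(value.len() as u32).to_le_bytes())?;
  values.write_all(value)?;
  Ok(offset)
}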

Onto the algorithm! There are two general approaches to this kind of map: trees and (hash-)tables. Since our keys are randomly distributed, they're effectively already hashes, so really it's trees and buckets.

Note that we cannot store all our key-value pairs in memory, so part of our data structure will live on disk. Even with our small key-value pairs, we have an absolute maximum hit rate of ~6.5%. As such, we will also move our key-value pairs to disk, for a minimum of 2 reads per lookup (for now). In one block, we can store 4 KiB / 16 B = 256 pairs, minus a header. Let's use 250 pairs per block.

This leaves us with 400,000 blocks, which we'll index with 4 bytes each. Storing all of those indices takes 1.6 MB, which is much more manageable.
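To make the back-of-the-envelope arithmetic concrete, here's a tiny standalone snippet (my own, just redoing the numbers above):

fn main() {
  const N: u64 = 100_000_000; // keys
  const PAIR_BYTES: u64 = 16; // 64-bit key + 64-bit value pointer
  const BLOCK_BYTES: u64 = 4096;

  let pairs_per_block = BLOCK_BYTES / PAIR_BYTES; // 256, before the header
  let blocks = N / 250;                           // 400_000 blocks of 250 pairs
  let index_bytes = blocks * 4;                   // 1_600_000 B = 1.6 MB of indices
  println!("{pairs_per_block} pairs/block, {blocks} blocks, {index_bytes} B of indices");
}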

Trees

Let's first look at trees.

Please note that this section is VERY VERY LONG, kinda pointless and I haven't reread it. I'm pretty sure the math is correct, but unless you're curious about my reasoning, you should probably just jump to the end of this bit.

Since there's no structure to exploit, our trees will be balanced, meaning there will be about log_{2^a}(N) levels to our tree. Let's get a reasonable initial structure for our tree. Each branch will fan out 2^a times (to make the math easy), with each pointer being 32 bits. There are some minor optimizations with saturated branches and such, but let's ignore those. If there are fewer than 2^a values in the entire keyspace below a node, it becomes a leaf. Each leaf is a list of key-value pairs. A leaf is the same size as a branch, so it also has 2^a entries.

At layer n, there are N keys distributed into (2^a)^n nodes. The number of keys per node follows a very convenient binomial distribution:

$$\Pr(X=k)=\binom{N}{k}\left(\frac{1}{(2^a)^n}\right)^{k}\left(1-\frac{1}{(2^a)^n}\right)^{N-k}$$

What interests us is whether there are fewer than 2^a entries. The probability that a random node is a leaf is

$$\Pr(X\leq 2^a)=\sum_{i=0}^{2^a}\Pr(X=i)$$

We know there are ~N/2^a leaves in the tree total, so what interests us is the total number of branches. This is

$$1+\sum_{n=1}^\infty (2^a)^n \Pr(X > 2^a)$$

Great! There's one small mistake, though: we're turning a node into a leaf if there are fewer than 2^a keys in it, and we store a block pointer for each key. That's too much; we already knew that! We'll assume each block contains a contiguous section of 250 keys. This means we instead use

$$1+\sum_{n=1}^\infty (2^a)^n \Pr(X > 250\cdot 2^a)$$

Here's a Desmos graph that does all this math: https://www.desmos.com/calculator/s55ohcccfl
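As a side note (my own shortcut, not something the graph relies on): since N is huge and each node is hit with tiny probability, this binomial is well-approximated by a Poisson distribution, which makes these tail probabilities much easier to evaluate numerically:

$$X \approx \mathrm{Poisson}(\lambda),\qquad \lambda=\frac{N}{(2^a)^n},\qquad \Pr(X>m)\approx 1-e^{-\lambda}\sum_{i=0}^{m}\frac{\lambda^{i}}{i!}$$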

You'll notice that... for all the effort we put into this math, as soon as the branching is greater than ~16, our tree is very quickly dominated by the 1.6 MB needed to store the leaves. This is because our tree is effectively perfectly balanced, so with large amounts of branching, very few leaves are created until right before the end. This could potentially be optimized a little by turning our 32-bit pointers into 31-bit block-or-branch values, but I can't be bothered to figure out the math for that; 1.6 MB is effectively the amount of space we need.

TLDR: a tree structure can be relatively easily optimized to be dominated by its leaves, which need 4 bytes per block.

Buckets

Buckets are very simple: store a really big array, with each entry being a "bucket". Take our key's top bits (equivalently, divide it by the keyspace size over the number of entries), and you know which bucket it's in. A bucket contains a list of blocks. We'll go for 4-byte indices once again. For convenience, we'll assume the number of entries is a power of 2 (this makes the division basically free).
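As a tiny sketch in the same pseudo-Rust as the code later in this doc (the function name is mine), picking a bucket is just a shift, because the keys are uniform and the table size is a power of two:

/// With 2^a buckets and uniformly distributed 64-bit keys,
/// the top a bits of the key pick the bucket.
fn bucket_of(key: u64, a: u32) -> usize {
  (key >> (64 - a)) as usize
}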

Since we don't have too many blocks, each bucket can correspond to a block. The smallest power of 2 that fits this is 2^19, which would take exactly 2 MiB. That's tiny! The biggest power of 2 we can afford is 2^24, which would take exactly 64 MiB. Using this approach would effectively guarantee a maximum of 2 reads, so it seems pretty promising.

Note that in trees, each block corresponds to 250 contiguous keys. In this setup, each block corresponds to exactly one bucket, which isn't necessarily full of 250 keys. If a bucket has one key, it still takes 4 KiB on disk. Let's do the math of the disk overhead:

Similar to before, the number of keys per block is dictated by a binomial distribution, though this time it's N keys distributed into 2^a buckets. The probability that a bucket is empty is the probability that no key ever hits it, which is (1 - 1/2^a)^N. With a very large N, this becomes ridiculously small. So in practice, every bucket requires its 4 KiB (though an optimization could merge small buckets), which amounts to anywhere from 2 GiB to 64 GiB.

The average occupancy of a bucket is N/2^a, which is 190 in the case of 2^19, and about 6 in the case of 2^24.
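Here's a quick standalone snippet (my own sanity check, not part of the store) that evaluates both of those numbers:

fn main() {
  let n: f64 = 100_000_000.0; // number of keys
  for a in [19u32, 24u32] {
    let buckets = (1u64 << a) as f64;
    // Probability that a given bucket receives no keys at all: (1 - 1/2^a)^N
    let p_empty = (1.0 - 1.0 / buckets).powf(n);
    let avg = n / buckets;
    println!("2^{a}: avg {avg:.1} keys/bucket, P(bucket empty) = {p_empty:.2e}");
  }
}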

Picking an option

Trees:

Have little overhead, but need 4 bytes per block

Could be heavily optimized if the tree were stored partially on disk

Scale better at extremely large sizes

Are basically a database B-tree but worse, I think

Buckets:

Can be configured to take anywhere from 2 MiB to 64 MiB trivially

Are extremely fast

Have pretty high disk overhead

Even without the optimizations that could reduce the disk overhead significantly, buckets are just better.

Optimizing it further

We'll pick buckets with an array size of 2^20. If you skipped to here from earlier, know that we treat our KV store as basically a really big hash table: the buckets are in memory, but the linked lists of entries are on disk.

With 2^20 buckets, we are using 4 MiB of RAM out of our 100 MiB. That leaves 96 MiB up for grabs, corresponding to 24,576 blocks or about 6.1 million key-value pairs. That could give us a... 6% performance boost.

Turns out it's pretty trivial. Instead of storing a 32-bit block index, reserve 1 bit for whether it's a memory or a disk index. In memory, we can store our blocks contiguously with no gaps. The header can simply be the number of entries, which takes about a byte and is negligible at this scale, so we get our 6% performance boost. Note that in this situation, we are dedicating 96% of our available RAM purely to key-value pairs, yet 94% of values will still need slightly over 2 reads. I don't think you can reasonably do better.
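A minimal sketch of that tag bit (the constant name and which bit to steal are my choices, not fixed above):

/// The top bit of a 32-bit slot says whether the remaining 31 bits index
/// a block held in RAM or a block on disk.
const MEM_FLAG: u32 = 1 << 31;

fn pack_mem(idx: u32) -> u32 { idx | MEM_FLAG }
fn pack_disk(idx: u32) -> u32 { idx }

fn unpack(slot: u32) -> (bool, u32) {
  // Returns (is_in_memory, 31-bit index)
  (slot & MEM_FLAG != 0, slot & !MEM_FLAG)
}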

The probability that we could need more than 250 entries in a bucket is low enough that Desmos can't calculate it, so we can assume we'll always need exactly 2 disk reads: one for reading our bucket's block, one for reading the actual value.

With 1 billion keys, we can still use 2^24 and comfortably fit each bucket into one block. The 64 GiB used by mostly-empty blocks is negligible compared to the 1 TB of values.

The only remaining interesting case is when we don't know the number of keys in advance. This might make the optimizations on mostly-empty buckets worth it, for the disk space. I'm leaving it as an ✨exercise to the reader✨ though because I'm running out of steam here (if you sort the block contents by key, splitting is really easy, and you can start off with every bucket being the same disk block then split the buckets every time the disk block is full).

Final algorithm

This is for a version without a cache, and without optimizing for disk seeks (which could be done by interleaving data blocks with key-value blocks), or data locality (which would "just" need a big pass after the insertion phase to reorder all the out-of-order blocks (if we do the splitting technique, you could do this during insertion too)).

All of this code is pseudo-Rust btw.

Our in-memory data structure is

let mut buckets = vec![0u32; 1 << 20];

and our on-disk data structure is

struct Block {
  /// Number of valid keypairs
  count: u32,
  /// Index of the next block in this bucket's chain
  next: Option<NonZero<u32>>,
  /// Padding so the whole block is exactly 4 KiB
  reserved: [u8; 4096 - 250*16 - 8],
  /// Data pairs: (key, pointer into the value region)
  keypairs: [(u64, u64); 250],
}

Insertion

fn insert(key: u64, value: u64) {
  // Definitely safe
  let bucket = (key >> (64 - 20)) as usize;
  let block_idx = if buckets[bucket] == 0 {
    // First key for this bucket: allocate a block and point the bucket at it
    let idx = disk.new_block();
    buckets[bucket] = idx;
    idx
  } else {
    buckets[bucket]
  };
  let mut block = disk.read(block_idx);
  if block.count as usize == block.keypairs.len() {
    // The head block is full: chain a fresh block in behind it
    let new_idx = disk.new_block();
    let mut new = disk.read(new_idx);
    new.next = block.next;
    new.count = 1;
    new.keypairs[0] = (key, value);

    block.next = NonZero::new(new_idx);
    disk.write(new);
  } else {
    block.keypairs[block.count as usize] = (key, value);
    block.count += 1;
  }
  disk.write(block);
}

Reads: basically always 1. Never more than N/250.

Writes: up to 2, basically always 1

Mode switch

Nothing to be done, though you could consider optimizing by sorting/coalescing the blocks.

We could also move the actual values so they sit directly after the corresponding keypair block. The 64-bit pointer could store a flag for this as well. This would require moving around a lot of data, though.

Read

fn read(key: u64) -> Option<u64> {
  fn read_block(key: u64, block: Block) -> Option<u64> {
    let next = block.next;
    block
      .keypairs.iter()
      .take(block.count as usize)
      .find_map(|&(bkey, ptr)| (key == bkey).then_some(ptr))
      .or_else(|| next.and_then(|idx| read_block(key, disk.read(idx.get()))))
  }
  let block_idx = buckets[(key >> (64 - 20)) as usize];
  if block_idx == 0 { return None; }
  read_block(key, disk.read(block_idx))
}

(yes I just wanted to write fancy iterator stuff)

Reads: basically always 1. Never more than N/250.

Splitting optimization

Splitting would need a lot less disk space, letting us use 2^24 buckets (supporting many more keypairs). We can do it by making each Block responsible for a range of buckets. When it's full, split its range in two, then partition its keypairs. Replace the block index of every bucket in the range (there's a sketch below).

This does not affect the number of reads or writes significantly.
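Here's a rough sketch of that split step, in the same pseudo-Rust as above (the range bookkeeping and the choice to keep the lower half in the old block are my assumptions): the block currently backing the bucket range [lo, hi) is cut at the midpoint, its sorted keypairs are partitioned, and the upper half of the range is re-pointed at a fresh block.

// a = 24 here, so a bucket is the top 24 bits of the key.
fn split(block_idx: u32, lo: usize, hi: usize) {
  let mut block = disk.read(block_idx);
  let mid = lo + (hi - lo) / 2;

  // Keypairs are kept sorted by key, so everything belonging to
  // buckets [mid, hi) sits in one contiguous tail of the array.
  let cut = block.keypairs[..block.count as usize]
    .partition_point(|&(key, _)| ((key >> (64 - 24)) as usize) < mid);

  let new_idx = disk.new_block();
  let mut upper = disk.read(new_idx);
  upper.count = block.count - cut as u32;
  upper.keypairs[..upper.count as usize]
    .copy_from_slice(&block.keypairs[cut..block.count as usize]);
  block.count = cut as u32;

  // The old block keeps serving [lo, mid); re-point the upper half.
  for b in mid..hi {
    buckets[b] = new_idx;
  }
  disk.write(block);
  disk.write(upper);
}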

TLDR

Store the values in one big file

Store each key and a pointer to its value in 250-element lists, each list stored in a single 4 KiB block

In memory, the block corresponding to a key is determined by a 2^20 (or 2^24, if optimized) element array of u32 block indices

This requires an extra 4 GiB (1.5-3 GiB optimized) of disk space, and 4 MiB of RAM (64 MiB optimized)

Using the remaining RAM available does not significantly improve the number of reads (~6% of reads will need 1 disk read instead of 2)

With 1 billion keys, optimization is needed. Extra required disk space is 32 GiB; RAM required is still 64 MiB.

The best optimization after this would be to interleave the big values with the value pointers (not storing them inline because of very large values), which would reduce seeking.
