
I'm looking for a sanity check on my thinking around the memory layout of a key-value store.

We have a system that periodically polls and monitors host-level jobs. On every poll, a given job can emit multiple timeseries. A timeseries is a {timestamp -> double} mapping. Each timeseries is identified by its timeseries id (e.g. "cpu", "memory"), which can also be user-defined. These timeseries ids are not globally unique, though - what is globally unique is the combination job_id + timeseries_id, which I will call job_ts.

The cardinality of job ids is ~5M and the cardinality of timeseries ids is ~1B.

We want to store the datapoints of every job timeseries in a key-value store. I figured the key could be: hash(job_ts) + fixed_width_unix_timestamp. The value is the datapoint value (e.g. cpu of the job at time T).

This way I can easily write range queries (e.g. give me all datapoints from [T, T+24h] for a given job_ts, such as job=x and timeseries_id=cpu).

The hash function I figured would work is xxhash128, so the hash itself is 16 bytes; with the 8-byte timestamp and the 8-byte double value, that comes to ~32 bytes per entry.
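
For concreteness, here's a rough sketch of the key layout I have in mind (Python, using the third-party xxhash package; the NUL separator and function names are just illustrative):

```python
import struct
import xxhash  # third-party package: pip install xxhash

def make_key(job_id: str, timeseries_id: str, ts: int) -> bytes:
    """16-byte xxhash128(job_ts) + 8-byte big-endian unix timestamp = 24-byte key."""
    job_ts = f"{job_id}\x00{timeseries_id}"           # NUL separator avoids ambiguity
    prefix = xxhash.xxh128(job_ts.encode()).digest()  # 128-bit digest = 16 bytes
    # Big-endian packing keeps all keys for one job_ts sorted by timestamp,
    # which is what makes the [T, T+N] range scan work.
    return prefix + struct.pack(">Q", ts)

# Range scan bounds for one job_ts over 24 hours:
start = make_key("job-x", "cpu", 1_753_228_800)
end = make_key("job-x", "cpu", 1_753_228_800 + 24 * 3600)
```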

Am I on the right track? What I want to do is optimize our memory footprint on the key-value store - what are some techniques/tradeoffs I should consider?

asked Jul 23 at 11:38
  • I would try to figure out a rough estimate of how much more space per month a readable format would cost, and then talk to some hardware people about the price of that memory/storage. Commented Jul 23 at 11:45
  • Is there a use case to query a time range across all or multiple job_ids? Even if there is, is it rare enough that it does not justify more complex storage and handling? Storing one separate time series per job_id would require the job_id only once, without any hash values. Commented Jul 23 at 15:16
  • Not sure if this really fits as an answer, thus only a comment. It might be worth taking a look at some simple compression techniques, e.g. store one or a few base timestamps and make all other timestamps relative to them, so they are much smaller. Depending on the approach this may or may not affect the time to find the entries. Commented Jul 23 at 15:27

3 Answers


If I understand the proposal correctly, you will need to track the hash of the job_ts in some other location (hashes are not reversible).

If you already have to do that, there's no reason you are limited to hashes. You could create a sequential id or any other scheme you like. With 32 bits, you will be good for billions of job_ts before you run out of unique ids. You will need to make sure these ids are unique before you associate datapoints with them, of course.
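
A rough sketch of what I mean (Python; all names are made up). The registry is in-memory here for illustration; in practice it would live in a shared, durable store such as a table with an auto-increment column:

```python
import itertools
import struct

class JobTsRegistry:
    """Illustrative in-memory registry mapping each job_ts to a sequential uint32.

    In practice this lives in a shared, durable store so ids stay unique
    across all writers.
    """
    def __init__(self):
        self._ids = {}
        self._counter = itertools.count()

    def id_for(self, job_id: str, timeseries_id: str) -> int:
        job_ts = (job_id, timeseries_id)
        if job_ts not in self._ids:
            self._ids[job_ts] = next(self._counter)
        return self._ids[job_ts]

def make_key(series_id: int, ts: int) -> bytes:
    # 4-byte sequential id + 8-byte big-endian timestamp = 12-byte key,
    # half the size of the 24-byte hash-based key.
    return struct.pack(">IQ", series_id, ts)
```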

answered Jul 23 at 20:49
  • The idea here is that a 4-byte uint32 should be sufficient to make an ID for each job_ts, since the job_ts cardinality is low, right? I can see how that would reduce my memory footprint. Great suggestion! Commented Jul 23 at 23:17
  • @nz_21 You can store around 4 billion unique values in a 32-bit int. How many unique job_ts ids do you get in a week? How many weeks of these do you need to store? That will tell you if 32 bits is enough. That part is the same whether you use hashing or not. The main upshot here is that you don't seem to need a hash here. Hashing is cool but I think you might be over-engineering a tiny bit. Commented Jul 24 at 13:38
  • Sorry, I should clarify: you can store a lot more ids as a monotonically increasing sequence in 32 bits than you can with a hash in 32 bits (without a high risk of collisions). Your monotonically increasing sequence will not produce collisions if managed properly. But the math for how many unique ids you require is the same. Commented Jul 24 at 14:03

When using a 128-bit hash function for a population of around 5 million values, the risk of hash collisions is extremely low, so that is surely a possible approach. Still, you have to make sure no job_id is used twice for different jobs, not even accidentally, so you will need a central registry for the IDs.

But why complicate things with a hash function at all? Just introduce a unique job number which is automatically generated, maybe padded with leading zeros to get fixed-width numbers. Or, if you want to avoid the central registry, you may add an additional job_GUID field - filled with GUIDs, of course. That is the standard solution for distributed systems.
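
A minimal sketch of the GUID variant (Python; the names are illustrative). The GUID is minted once when a job_ts is first created and then reused for every datapoint:

```python
import struct
import uuid

def make_key(series_guid: uuid.UUID, ts: int) -> bytes:
    """16-byte GUID + 8-byte big-endian timestamp = 24-byte key.

    No central registry is needed, but each key is 12 bytes larger than
    with a 4-byte sequential id.
    """
    return series_guid.bytes + struct.pack(">Q", ts)

# Generated once per job_ts at creation time, then cached and reused:
series_guid = uuid.uuid4()
key = make_key(series_guid, 1_753_228_800)
```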

answered Jul 23 at 12:47
  • Thanks for the answer! The trouble is that some of these job_ts ids can be arbitrarily long, since they can be user-defined - that's why I thought about hashing them. Commented Jul 23 at 12:55
  • @nz_21 be careful about hashing anything user-defined. Sometimes the user is not your friend. Commented Jul 24 at 22:15
  • @DocBrown xxhash128 is a 128-bit (16-byte) hash, not 32 bytes, no? Commented Jul 24 at 22:16
  • @candied_orange: you are right, I need more sleep ;-) Commented Aug 6 at 15:07
  • @nz_21: I changed my answer, according to the comments, hope it fits better now. Commented Aug 6 at 15:17

"We have a system that periodically polls and monitors on host-level jobs. On every poll, a given job emit can multiple timeseries.
...
Each timeseries is identified by its timeseries id (e.g. "cpu", "memory"). It can also be user-defined.
...
The cardinality of jobs id ~5m and cardinality of timeseries ids is ~1B
...
We want to store the datapoints of every job timeseries into a key value store. I figured the key could be: hash(job_ts) + fixed_width_unix_timestamp. The value is the datapoint value (e.g. cpu of the job at time T).
...
I want to optimize for our memory footprint on the key value store - what are some techniques/tradeoffs I should consider?".

Map "CPU", "memory ", and "user-defined string" to a 32-bit number stored in a key map table; slower lookup but considerable memory savings.

Forget the hash: instead of xxhash128(job_id_string + timeseries_id_string), your "series ID" (the unique identifier for job_id + timeseries_id) becomes a composite of the two integer IDs.

You could combine these two int32 values into a single int64 (8-byte) value. For instance, you could concatenate them (e.g., (job_int_id << 32) | timeseries_int_id). This 8-byte int64 would then serve as the first part of your key in the key-value store.

Updated Key Structure:

  • Original Key: [16-byte xxhash128(job_id_string + timeseries_id_string)] + [8-byte fixed_width_unix_timestamp] = 24 bytes

  • New Key: [8-byte composite_int64(job_int_id + timeseries_int_id)] + [8-byte fixed_width_unix_timestamp] = 16 bytes
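
A minimal sketch of the new key (Python; the integer ids are assumed to come from the key map table and to fit in 32 bits):

```python
import struct

def composite_key(job_int_id: int, timeseries_int_id: int, ts: int) -> bytes:
    """[8-byte (job_int_id << 32 | timeseries_int_id)] + [8-byte timestamp] = 16 bytes."""
    series = (job_int_id << 32) | timeseries_int_id
    # Big-endian packing keeps all datapoints of one series contiguous and
    # ordered by time, so [T, T+24h] is a single range scan.
    return struct.pack(">QQ", series, ts)

# Range scan bounds for job 42, timeseries id 7 ("cpu"), one day of data:
start = composite_key(42, 7, 1_753_228_800)
end = composite_key(42, 7, 1_753_228_800 + 24 * 3600)
```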

Combine that with Gorilla time series compression.
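
To give a feel for why that helps: the heart of Gorilla's timestamp compression is delta-of-delta encoding, sketched here in simplified form (the real algorithm then bit-packs these values and XOR-compresses the doubles):

```python
def delta_of_delta(timestamps):
    """Simplified sketch of Gorilla-style timestamp encoding.

    With a regular polling interval, almost every delta-of-delta is 0,
    which the real encoder stores in a single bit instead of 8 bytes.
    """
    if len(timestamps) < 2:
        return timestamps[:], []
    header = [timestamps[0], timestamps[1] - timestamps[0]]  # first ts + first delta
    dods = []
    prev_delta = header[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        dods.append(delta - prev_delta)
        prev_delta = delta
    return header, dods

# 60-second polling with one late sample:
print(delta_of_delta([1000, 1060, 1120, 1181, 1240]))  # ([1000, 60], [0, 1, -2])
```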

Consider a time-series database such as TimescaleDB, a PostgreSQL extension for high-performance real-time analytics on time-series and event data, or ArcticDB, which sits on MongoDB. That saves reinventing the wheel and gives you a suggested format for the data; you don't want to be converting between the format you use and the format your database expects.

Remember not to get caught out by the Y2038 bug: instead of 32-bit timestamps you may want to use signed 64-bit time_t integers.

answered Jul 25 at 5:56
  • Note ArcticDB.io now uses just S3 (no MongoDB dependency). It's a Python data-frame database optimised for storing tables of (primarily) numeric values as time series. Commented Jul 25 at 7:18
  • The traditional way to monitor this in infrastructure would be a metrics store - Prometheus, InfluxDB, etc. Cardinality can be quite painful. If you can group your data in a table, and are happy reading/writing data in Python, ArcticDB would be a trivial tool to try. Demos on the website. colab.research.google.com/github/man-group/ArcticDB/blob/v4.5.1/… Commented Jul 25 at 7:19
  • @James I wanted to suggest "Arctic", but they changed to "ArcticDB" (which is better, but different). I wanted to suggest open source and a simple upgrade to an existing database, rather than a standalone and complex offering: tdengine.com/how-to-choose-the-best-time-series-database - so I'd prefer Prometheus over the expense of Influx. Commented Jul 25 at 9:43
  • Here's a comparison of Prometheus and Influx influxdata.com/comparison/influxdb-vs-prometheus (on Influx's website). Commented Jul 25 at 9:53
  • Great suggestion, thank you. To maintain the key map table, my thinking is to use a MySQL table with a primary key on (job_id, timeseries_id), so I can get atomically incremented IDs. Commented Jul 25 at 16:44
