3

A design book I was reading describes a method for choosing a database shard: take the hash (MD5, SHA-1, whatever) of a user ID (an integer or a UUID) and then, whether or not the hash is encoded, reduce it modulo the number of shards.

So it looks like this: Hash(userid) % number_of_shards.

The output of the hash function includes non-numeric characters, as do encodings like base64.

My question is what method would be used to map the hash output to numeric values so that modulo division can be used?

asked Sep 17, 2021 at 21:25
  • 3
    The hash is a number. Commented Sep 17, 2021 at 21:54
  • 2
    The hash function produces a number, typically 128, 256, or 512 bits large. The hash is typically rendered in hex or base64 notation for human convenience, but that is reversible. UUIDs are also 128-bit numbers, and some UUID versions can be used without extra hashing. Note that your sharding scheme is problematic in practice: you cannot change the number of shards without taking the database offline, because you must first shuffle all the data to the correct shards. In practice, consistent hashing is used instead, as it requires less data to be moved during repartitioning. Commented Sep 17, 2021 at 22:11
  • Shenk, in light of the answers, would you mind explaining whether your question was simply based on the wrong assumption that common hash functions produce strings and not numbers? (Maybe because you confused the popular hexadecimal representation of hash values with base64?) Or do you actually know a hash function which produces strings and not numbers? Until then, I am giving this question a "needs details or clarity" vote. Commented Sep 18, 2021 at 7:04
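
The repartitioning concern raised in the comments can be illustrated with a small experiment. This is only a sketch under stated assumptions: the shard names, the `HashRing` class, the choice of MD5, and the 100 virtual nodes per shard are all made up for illustration, not taken from any particular database. It counts how many keys change shard when a fifth shard is added, first with plain modulo and then with a simple consistent-hash ring.

```python
import bisect
import hashlib


def h(key: str) -> int:
    """Hash a string to a 128-bit integer (MD5 chosen only for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def modulo_shard(key: str, num_shards: int) -> int:
    """Plain Hash(key) % number_of_shards."""
    return h(key) % num_shards


class HashRing:
    """Hypothetical minimal consistent-hash ring with virtual nodes."""

    def __init__(self, shard_names, vnodes=100):
        # Each shard is placed on the ring at `vnodes` pseudo-random points.
        self._points = sorted(
            (h(f"{name}#{i}"), name) for name in shard_names for i in range(vnodes)
        )
        self._hashes = [point for point, _ in self._points]

    def shard(self, key: str) -> str:
        # A key belongs to the first ring point after its hash, wrapping around.
        idx = bisect.bisect_right(self._hashes, h(key)) % len(self._hashes)
        return self._points[idx][1]


keys = [f"user-{i}" for i in range(10_000)]

# Modulo: going from 4 to 5 shards reassigns most keys.
moved_modulo = sum(modulo_shard(k, 4) != modulo_shard(k, 5) for k in keys)

# Ring: only keys that now fall before a point of the new shard move.
old_ring = HashRing(["s0", "s1", "s2", "s3"])
new_ring = HashRing(["s0", "s1", "s2", "s3", "s4"])
moved_ring = [k for k in keys if old_ring.shard(k) != new_ring.shard(k)]

print("moved with modulo:", moved_modulo)
print("moved with ring:  ", len(moved_ring))
```

With the ring, every key that moves lands on the new shard, so existing shards never exchange data with each other; with modulo, keys are reshuffled among all shards.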

2 Answers

2

My question is what method would be used to map the hash output to numeric values so that modulo division can be used?

There are many different ways of converting a string to a number. For example, you could simply interpret the octets that make up the string as the digits of a long number in base 256. However, you need to be careful that your conversion does not destroy the uniform probability distribution of the hash function. My proposal, for example, would not work: a hex or base64 string consists of ASCII characters, so every "digit" would be below 128 and only a handful of values would ever occur.
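
To make the skew concrete, here is a minimal sketch, assuming MD5, 256 shards, and user IDs 0 to 9999 purely for illustration (the function names are hypothetical). It compares treating the hex text as a base-256 number with treating the raw digest bytes as a number.

```python
import hashlib


def naive_text_to_int(text: str) -> int:
    """Interpret the ASCII octets of a string as one big base-256 number."""
    return int.from_bytes(text.encode("ascii"), "big")


def digest_to_int(data: bytes) -> int:
    """Interpret the raw MD5 digest bytes as one big integer."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")


user_ids = [str(i).encode("ascii") for i in range(10_000)]
num_shards = 256

# Shards actually hit when the *hex text* is treated as a number.
naive_shards = {
    naive_text_to_int(hashlib.md5(u).hexdigest()) % num_shards for u in user_ids
}

# Shards actually hit when the *raw digest bytes* are treated as a number.
raw_shards = {digest_to_int(u) % num_shards for u in user_ids}

print("shards used via hex text:  ", len(naive_shards))
print("shards used via raw digest:", len(raw_shards))
```

With 256 shards, the text-based number modulo 256 is just the ASCII code of the last hex character, so at most 16 shards can ever be selected, while the raw digest can reach all of them.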

A way that guarantees preserving the distribution is to simply write down every single possible output of the hash function and number them from 0 to n − 1, where n is the number of possible outputs.

There is a simple way of converting arbitrary data to a uniformly distributed number: a hash function! You could simply hash your hash with a hash function whose output is a number instead of a string.

However, in that case the first hash actually serves no useful purpose at all, so a much better solution would be to choose a different hash function in the first place, one which outputs numbers instead of strings.

Here are a couple of possible hash functions that could be used, whose outputs are guaranteed to be numbers:

  • SHA-3 outputs a fixed-size 224 bit, 256 bit, 384 bit, or 512 bit integer.
  • SHA-2 outputs a fixed-size 224 bit, 256 bit, 384 bit, or 512 bit integer.
  • BLAKE outputs a fixed-size 224 bit, 256 bit, 384 bit, or 512 bit integer.
  • Grøstl outputs a fixed-size integer of between 8 and 512 bits, in 8-bit steps.
  • Skein outputs a fixed-size integer of arbitrary size.
  • Whirlpool outputs a fixed-size 512 bit integer.
  • SHA-1 outputs a fixed-size 160 bit integer.
  • SHA-0 outputs a fixed-size 160 bit integer.
  • MD5 outputs a fixed-size 128 bit integer.
  • GOST outputs a fixed-size 256 bit integer.

Note that all of these are cryptographic hash functions, which might be overkill for your situation. Note also that SHA-1, SHA-0, MD5 (as well as its predecessors, e.g. MD4), and GOST are considered broken as cryptographic hash functions, but again, you might not actually need cryptographic security for your use case.
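
As a quick check of the sizes listed above, the following sketch uses Python's standard `hashlib` module to hash the same input with several of these functions and report the digest width. The input `b"user-42"` and the helper name `digest_bits` are assumptions made for the example.

```python
import hashlib


def digest_bits(algorithm: str, data: bytes = b"user-42") -> int:
    """Return the width in bits of the digest produced by `algorithm`."""
    return len(hashlib.new(algorithm, data).digest()) * 8


for name in ("md5", "sha1", "sha256", "sha3_256", "sha512"):
    digest = hashlib.new(name, b"user-42").digest()
    # The digest is a fixed-size byte string; reading it as an integer is lossless.
    as_int = int.from_bytes(digest, "big")
    print(f"{name}: {digest_bits(name)} bits, value has {as_int.bit_length()} significant bits")
```

The integer may have slightly fewer significant bits than the digest width when the leading bytes happen to be zero, which does not matter for a modulo operation.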

You could also use one of the following functions:

  • MurmurHash outputs a fixed-size 32 bit or 128 bit integer.
  • SipHash outputs a fixed-size 64 bit integer.
  • xxHash outputs a fixed-size 64 bit integer.

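If you don't need all the cryptographic security properties, in particular preimage and second preimage resistance (and you almost certainly don't), xxHash is a good choice. It is extremely fast (its authors describe it as running at RAM speed limits) and distributes keys well, although as a non-cryptographic hash it does not protect against deliberately constructed collisions.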

The output of the hash function includes non-numeric characters, as do encodings like base64.

Base64 is literally a way to turn a number into a string. And it is reversible. So, you can simply decode the Base64 encoded string, and you get a number.
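
A minimal sketch of that decoding step in Python, assuming a SHA-1 digest of the made-up input `b"user-42"` (the helper names are hypothetical). It shows that the Base64 text, the hex text, and the raw bytes all denote the same integer.

```python
import base64
import hashlib


def base64_to_int(text: str) -> int:
    """Decode Base64 text to bytes, then read the bytes as a big-endian integer."""
    return int.from_bytes(base64.b64decode(text), "big")


def hex_to_int(text: str) -> int:
    """Parse hexadecimal text as an integer."""
    return int(text, 16)


digest = hashlib.sha1(b"user-42").digest()
b64_text = base64.b64encode(digest).decode("ascii")
hex_text = digest.hex()

print("base64:", b64_text)
print("hex:   ", hex_text)
print("same number:", base64_to_int(b64_text) == hex_to_int(hex_text))
```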

answered Sep 17, 2021 at 22:17
3

Hashes are numbers. Because they are so large, and have a fixed number of bits, they are usually written in hexadecimal, but most libraries will also output hashes as a byte array. Most standard libraries can convert a byte array into an arbitrary-precision integer type such as BigInteger, which you can then do arithmetic on.
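
Put together, the `Hash(userid) % number_of_shards` scheme from the question might look like the sketch below. Python integers are already arbitrary precision, so no separate BigInteger type is needed. The function name `shard_for`, the choice of SHA-256, and the string encoding of the ID are assumptions for illustration.

```python
import hashlib
import uuid


def shard_for(user_id, number_of_shards: int) -> int:
    """Map an integer or UUID user ID to a shard index in [0, number_of_shards)."""
    # Assumption: the ID is serialized as its string form before hashing.
    key = str(user_id).encode("utf-8")
    digest = hashlib.sha256(key).digest()        # raw bytes, not hex text
    return int.from_bytes(digest, "big") % number_of_shards


print(shard_for(12345, 8))
print(shard_for(uuid.UUID(int=12345), 8))
```

The same ID always maps to the same shard as long as the hash function, the serialization, and the shard count stay fixed.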

answered Sep 17, 2021 at 22:13
