Modulo Division of Hash Output

Question 1

In a design book I was reading, the authors describe a way to pick a database shard: take the hash (MD5, SHA-1, whatever) of a user ID (an integer or a UUID) and then, whether the hash is encoded or not, take it modulo the number of shards.

So it looks like this: Hash(userid) % number_of_shards.

The output of the hash function includes non-numeric characters, as do encodings like Base64.

My question is what method would be used to map the hash output to numeric values so that modulo division can be used?

Question 2

The hash is a number.

Question 3

The hash function produces a number, typically 128, 256, or 512 bits wide. The hash is usually rendered in hex or Base64 notation for human convenience, but that rendering is reversible. UUIDs are also 128-bit numbers, and some UUID versions can be used without extra hashing. Note that this sharding scheme is problematic in practice: you cannot change the number of shards without taking the database offline, because the data must first be reshuffled onto the correct shards. In practice, consistent hashing is used instead, as it requires less data to be moved during repartitioning.
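To make the repartitioning cost concrete, here is a rough sketch that counts how many keys change shards when going from 4 to 5 shards, once with plain modulo and once with a toy consistent-hash ring. The `HashRing` class, the key names, the choice of SHA-256, and the 100 virtual nodes per shard are all assumptions made for illustration, not something the book or this comment prescribes.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """Hash a string to a 256-bit integer (SHA-256 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest(), "big")


def modulo_shard(key: str, shard_count: int) -> int:
    """The scheme from the question: Hash(userid) % number_of_shards."""
    return h(key) % shard_count


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, shards, vnodes=100):
        # Each shard owns `vnodes` pseudo-random points on the ring.
        points = sorted(
            (h(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._hashes = [point for point, _ in points]
        self._shards = [shard for _, shard in points]

    def shard_for(self, key: str) -> str:
        # The first ring point at or after the key's hash owns the key;
        # the modulo wraps around to the start of the ring.
        idx = bisect.bisect_right(self._hashes, h(key)) % len(self._hashes)
        return self._shards[idx]


keys = [f"user-{i}" for i in range(10_000)]

# Fraction of keys whose shard changes when a fifth shard is added.
moved_modulo = sum(modulo_shard(k, 4) != modulo_shard(k, 5) for k in keys) / len(keys)

old_ring = HashRing(["s0", "s1", "s2", "s3"])
new_ring = HashRing(["s0", "s1", "s2", "s3", "s4"])
moved_ring = sum(old_ring.shard_for(k) != new_ring.shard_for(k) for k in keys) / len(keys)

print(f"modulo: {moved_modulo:.0%} of keys move; ring: {moved_ring:.0%} of keys move")
```

With modulo, a key stays put only when its hash gives the same remainder for both shard counts, so most keys move. On the ring, only the keys that fall into the new shard's ranges move.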

Question 4

Shenk, in light of the answers, would you mind explaining whether your question was simply based on the wrong assumption that common hash functions produce strings rather than numbers? (Maybe because you confused the popular hexadecimal representation of hash values with Base64?) Or do you actually know of a hash function that produces strings and not numbers? Until then, I am giving this question a "needs details or clarity" vote.

Question 5

My question is what method would be used to map the hash output to numeric values so that modulo division can be used?

There are many different ways of converting a string to a number. For example, you could simply interpret the octets that make up the string as the digits of a long number in base 256. However, you need to be careful that your conversion does not destroy the uniform probability distribution of the hash function. For example, my proposal would not work on a text rendering of the hash, since text is far more likely to contain "digits" below 128 than above (pure ASCII such as hex or Base64 never goes above 127).
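A small sketch of that pitfall, assuming the hash arrives as a hex string: reading the characters of the text as base-256 digits uses only a handful of octet values, whereas decoding the hex notation recovers the hash value itself. The input `b"user-42"` and the choice of SHA-256 are arbitrary examples.

```python
import hashlib

digest = hashlib.sha256(b"user-42").digest()        # raw 32 bytes
hex_text = hashlib.sha256(b"user-42").hexdigest()   # 64 ASCII characters

# Base-256 reading of the *text*: every octet is the ASCII code of 0-9 or a-f.
text_octets = hex_text.encode("ascii")
as_text_number = int.from_bytes(text_octets, "big")

# Decoding the hex notation first gives back the number the hash produced.
as_hash_number = int(hex_text, 16)

print("distinct octet values in the text:", sorted(set(text_octets)))
print("hex decoding matches raw digest:", as_hash_number == int.from_bytes(digest, "big"))
```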

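A way that guarantees preserving the distribution is to simply write down every single possible output of the hash function and number them from 0 to n − 1, where n is the number of possible outputs. (That is impractical for any real hash function, but it shows the idea.)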

There is a simple way of converting arbitrary data to a uniformly distributed number: a hash function! You could simply hash your hash with a hash function whose output is a number instead of a string.

However, in that case the first hash actually serves no useful purpose at all, so a much better solution would be to choose a different hash function in the first place, one which outputs numbers instead of strings.

Here are some hash functions that could be used, whose outputs are guaranteed to be numbers:

SHA-3 outputs a fixed-size 224 bit, 256 bit, 384 bit, or 512 bit integer.
SHA-2 outputs a fixed-size 224 bit, 256 bit, 384 bit, or 512 bit integer.
BLAKE outputs a fixed-size 224 bit, 256 bit, 384 bit, or 512 bit integer.
Grøstl outputs a fixed-size integer between 8 bit and 512 bit, in 8 bit steps.
Skein outputs a fixed-size integer of arbitrary size.
Whirlpool outputs a fixed-size 512 bit integer.
SHA-1 outputs a fixed-size 160 bit integer.
SHA-0 outputs a fixed-size 160 bit integer.
MD5 outputs a fixed-size 128 bit integer.
GOST outputs a fixed-size 256 bit integer.

Note that all of these are actually Cryptographic Hash Functions, which might be overkill for your situation. Note also that as cryptographic hash functions, SHA-1, SHA-0, MD5 (as well as its predecessors, e.g. MD4), and GOST are considered to be broken, but again, you might not actually need their cryptographically secure properties for your use case.
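As a quick illustration that these functions hand back fixed-size integers, the sketch below uses Python's `hashlib` to read a few digests as numbers. The helper name `as_integer` and the sample input are made up for this example, and big-endian byte order is an arbitrary but conventional choice.

```python
import hashlib


def as_integer(name: str, data: bytes):
    """Return (bit width, integer value) of the named hash over `data`."""
    digest = hashlib.new(name, data).digest()
    return len(digest) * 8, int.from_bytes(digest, "big")


for name in ("sha1", "sha256", "sha512", "sha3_256"):
    bits, value = as_integer(name, b"user-42")
    # Once the digest is an integer, modulo works like on any other integer.
    print(f"{name}: {bits}-bit integer, shard {value % 16} of 16")
```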

You could also use one of the following functions:

MurmurHash outputs a fixed-size 32 bit or 128 bit integer.
SipHash outputs a fixed-size 64 bit integer.
xxHash outputs a fixed-size 32 bit, 64 bit, or 128 bit integer, depending on the variant.

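If you don't need all the cryptographic security properties, in particular preimage and second preimage resistance (and you almost certainly don't), xxHash is a good choice. It is extremely fast (its authors advertise throughput close to RAM speed limits) and distributes ordinary inputs well, with few accidental collisions. It is not designed to resist deliberately constructed collisions, though; if user IDs can be chosen by an attacker, a keyed function such as SipHash is the safer pick.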

The output of the hash function includes non-numeric characters, as do encodings like Base64.

Base64 is literally a way to turn binary data, and therefore a number, into a string. And it is reversible. So, you can simply decode the Base64-encoded string and interpret the resulting bytes as a number.
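A minimal sketch of that round trip, assuming the hash was handed to you as Base64 text. The sample input and the shard count of 8 are arbitrary.

```python
import base64
import hashlib

digest = hashlib.sha256(b"user-42").digest()
b64_text = base64.b64encode(digest).decode("ascii")  # what you might be given

# Reverse the encoding, then read the bytes as one big unsigned integer.
decoded = base64.b64decode(b64_text)
value = int.from_bytes(decoded, "big")

shard = value % 8  # 8 shards is an arbitrary example
print(b64_text, "-> shard", shard)
```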

Question 6

Hashes are numbers. Because they are so large, with a fixed number of bits, they are usually written in hexadecimal, but most libraries will also output hashes as a byte array. In most standard libraries, a byte array can be converted to something like a BigInteger so that you can do arithmetic on it.
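In Python, for instance, the byte array goes straight into the built-in arbitrary-precision `int`. The following is only a sketch: the function name `shard_for`, the use of SHA-256, and the way integer IDs are serialised to bytes are assumptions, not part of the scheme described in the question.

```python
import hashlib
import uuid


def shard_for(user_id, shard_count: int) -> int:
    """Map an integer or UUID user ID to a shard index in [0, shard_count)."""
    if isinstance(user_id, uuid.UUID):
        key = user_id.bytes                   # the UUID's 16 raw bytes
    else:
        key = str(user_id).encode("utf-8")    # e.g. 12345 -> b"12345"
    digest = hashlib.sha256(key).digest()     # byte array from the library
    return int.from_bytes(digest, "big") % shard_count


print(shard_for(12345, 16))
print(shard_for(uuid.UUID("12345678-1234-5678-1234-567812345678"), 16))
```

Whatever serialisation you pick for the ID must stay the same everywhere, since a different byte representation of the same ID hashes to a different shard.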

score 2 · Accepted Answer (the answer above that lists candidate hash functions) · 2021-09-17 22:17:15Z
