Problem I'm trying to solve, apologies in advance for the length:
Given a large number of stored records, each with a unique (String) field S. I'd like to be able to find through an indexed query all records where Hash(S) % N == K for any arbitrary N, K (e.g. given a million strings, find all strings where HashCode(s) % 17 = 5. Is there some way of memoizing this so that we can quickly answer any question of this form without doing the % on every value?
The motivation for this is a system of N distributed nodes, where each record has to be assigned to at least one node. The nodes are numbered 0 - (K-1) , and each node has to load up all of the records that match it's number:
If we have 3 nodes
- Node 0 loads all records where Hash % 3 ==0
- Node 1 loads all records where Hash % 3 ==1
- Node 2 loads all records where Hash % 3 ==2
adding a 4th node, obviously all the assignments have to be recomputed -
- Node 0 loads all records where Hash % 4 ==0
- ...
- etc
I'd like to easily find these records through an indexed query without having to compute the mod individually.
The best I've been able to come up with so far:
If we take the prime factors of N (p1 * p2 * ... )
if N % M == I then p % M == I % p for all of N's prime factors
e.g. 10 nodes :
N % 10 == 6 then
- N % 2 = 0 == 6 %2
- N % 5 = 1 == 6 %5
so storing an array of the "%" of N for the first "reasonable" number of primes for my data set should be helpful. For example in the above example we store the hash and the primes
HASH PRIMES (array of %2, %3, %5, %7, ... ])
16 [0 1 1 2 .. ]
so looking for N%10 == 6 is equivalent to looking for all values where array[1]==1 and array[2] == 1.
However, this breaks at the first prime larger than the highest number I'm storing in the factor table. Is there a better way?
2 Answers 2
How important is it that formula be "Hash % num_machines"? This formula is used for distributed caches, like memcached. It works great until you add/remove nodes. At that point, the advice is to abandon it and use consistent hashing.
-
Thanks for the excellent suggestion, although I didn't see an easy application of consistent hashing in this case. Suggestions for a better distribution than Hash % #machines (i.e. anything that's both fair and stable) would also be welcome.Steve B.– Steve B.2012年08月23日 19:06:01 +00:00Commented Aug 23, 2012 at 19:06
Unless I've misunderstood something, your conjecture is incorrect.
If we take the prime factors of N (p1 * p2 * ... )
if N % M == I then p % M == I % p for all of N's prime factors
How did you come up with that?
Let's say N = 36
and M = 6
The prime factorization of N = 2 * 2 * 3 * 3
. 36 % 6 = 0
According to your statement, the following should hold:
p % 6 = 0 % p = 0
But clearly, this is not the case: 2 % 6 = 2 != 0
N
change its meaning throughout your question? At first, it seemed to be the number of nodes, later, it seems to represent a hash code?