Algorithmic problem - quickly finding all #'s where value %x is some given value

Question 1

Problem I'm trying to solve, apologies in advance for the length:

Given a large number of stored records, each with a unique (String) field S, I'd like to be able to find, through an indexed query, all records where Hash(S) % N == K for any arbitrary N, K (e.g. given a million strings, find all strings where HashCode(s) % 17 == 5). Is there some way of memoizing this so that we can quickly answer any question of this form without doing the % on every value?

The motivation for this is a system of N distributed nodes, where each record has to be assigned to at least one node. The nodes are numbered 0 to N-1, and each node has to load all of the records that match its number:

If we have 3 nodes:

Node 0 loads all records where Hash % 3 == 0
Node 1 loads all records where Hash % 3 == 1
Node 2 loads all records where Hash % 3 == 2

After adding a 4th node, all the assignments have to be recomputed:

Node 0 loads all records where Hash % 4 == 0
...
etc.

I'd like to easily find these records through an indexed query without having to compute the mod individually.
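To make the cost of that recomputation concrete, here is a minimal sketch that counts how many records change node when the node count goes from 3 to 4. The helper names and the use of `zlib.crc32` as a stand-in for Hash(S) are assumptions made for illustration; they are not part of the original setup.

```python
import zlib

def stable_hash(s: str) -> int:
    """Deterministic stand-in for Hash(S); crc32 is an assumption, not the asker's hash."""
    return zlib.crc32(s.encode("utf-8"))

def node_for(h: int, num_nodes: int) -> int:
    """Node number a hash is assigned to under the Hash % num_nodes scheme."""
    return h % num_nodes

def moved_fraction(hashes, old_nodes: int, new_nodes: int) -> float:
    """Fraction of hashes whose node changes when the node count changes."""
    hashes = list(hashes)
    moved = sum(
        1 for h in hashes
        if node_for(h, old_nodes) != node_for(h, new_nodes)
    )
    return moved / len(hashes)

# Hypothetical record names, used only to have something to hash.
records = [f"record-{i}" for i in range(10_000)]
hashes = [stable_hash(s) for s in records]

# With well-spread hashes, expect a value near 0.75: only hashes whose
# remainder mod 12 is 0, 1 or 2 keep the same node under both % 3 and % 4.
print(moved_fraction(hashes, 3, 4))
```

For consecutive integers the fraction is exactly 3/4, which is why the modulo scheme forces most records to be reloaded whenever a node is added.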

The best I've been able to come up with so far:

If we take the prime factors of N (p1 * p2 * ... )

if N % M == I then p % M == I % p for all of N's prime factors

e.g. 10 nodes:

if N % 10 == 6 then

N % 2 == 0 == 6 % 2
N % 5 == 1 == 6 % 5

so storing an array of N's remainders modulo the first "reasonable" number of primes for my data set should be helpful. For the example above, we store the hash together with its remainders:

HASH    REMAINDERS (array of %2, %3, %5, %7, ...)

16      [0 1 1 2 ...]

so looking for N % 10 == 6 is equivalent to looking for all values where array[0] == 0 and array[2] == 1 (0-based indices: array[0] holds %2 and array[2] holds %5).
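The remainder-table idea above can be sketched as follows. This is an illustration under stated assumptions, not a tested design: `PRIMES`, `build_index` and `query` are hypothetical names, the index is an in-memory dictionary rather than a database index, and the lookup only supports a node count that is a product of distinct stored primes. A count such as 4 = 2 * 2 would need remainders modulo 4, which this table does not hold.

```python
from collections import defaultdict

PRIMES = [2, 3, 5, 7]  # assumed table size; a real table would be chosen per data set

def residues(h: int) -> list:
    """Remainders of h modulo each stored prime, e.g. 16 -> [0, 1, 1, 2]."""
    return [h % p for p in PRIMES]

def build_index(hashes):
    """Map (prime, remainder) -> set of hashes with that remainder."""
    index = defaultdict(set)
    for h in hashes:
        for p in PRIMES:
            index[(p, h % p)].add(h)
    return index

def distinct_stored_factors(n: int):
    """Return n's prime factors if n is a product of distinct stored primes, else None."""
    factors = []
    for p in PRIMES:
        if n % p == 0:
            n //= p
            factors.append(p)
    return factors if n == 1 and factors else None

def query(index, n: int, k: int) -> set:
    """All indexed hashes h with h % n == k, using only the stored remainders."""
    factors = distinct_stored_factors(n)
    if factors is None:
        raise ValueError(f"{n} is not a product of distinct primes from {PRIMES}")
    result = None
    for p in factors:
        bucket = index.get((p, k % p), set())
        result = set(bucket) if result is None else result & bucket
    return result

index = build_index(range(100))
print(sorted(query(index, 10, 6)))  # hashes below 100 that end in 6
```

The intersection is valid because the chosen primes are pairwise coprime, so matching every remainder is the same as matching the remainder modulo their product (the Chinese remainder theorem). The `ValueError` path is exactly the failure the question describes for primes beyond the table.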

However, this breaks at the first prime larger than the highest number I'm storing in the factor table. Is there a better way?

Question 2

You may have better luck on math.stackexchange.com. You may need to rephrase this a bit to bridge the gap between CS and Math.

Question 3

Agreed - the fact that it's an "algorithm" doesn't necessarily mean it calls for CS skills :) This is highly mathematical in nature and I second @dasblinkenlight on migrating this to math.SE.

Question 4

Presumably, you're doing this for load balancing of some sort. How often are you re-balancing? How often, and how, do the Strings change? How often does your range of nodes change? If your problem can accept some constraints (such as only allowing the number of nodes N to range between X and Y), then some pragmatic solutions can be presented. They won't necessarily be the mathematically pure solutions, but they will still work.

Question 5

Have I misunderstood something, or did N change its meaning throughout your question? At first it seemed to be the number of nodes; later it seems to represent a hash code.

Question 6

If your data is stored in an RDBMS, then the operation you are trying to avoid is very cheap using a SQL SELECT (assuming HASH(s) is a column in your table).
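As a rough illustration of that suggestion, the sketch below stores a precomputed hash column and lets the database evaluate the modulo. The table name `records`, its columns, and the use of SQLite with `zlib.crc32` are all assumptions for the example. Note that the database still evaluates `%` for each row here; the point is only that it does so inside the engine rather than in application code.

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (s TEXT PRIMARY KEY, hash INTEGER NOT NULL)")

# Hypothetical data: record names with a precomputed, non-negative hash column.
rows = [(f"record-{i}", zlib.crc32(f"record-{i}".encode("utf-8"))) for i in range(1000)]
conn.executemany("INSERT INTO records (s, hash) VALUES (?, ?)", rows)

def records_for_node(conn, num_nodes: int, node: int) -> list:
    """Return the strings whose stored hash satisfies hash % num_nodes == node."""
    cur = conn.execute(
        "SELECT s FROM records WHERE hash % ? = ?",
        (num_nodes, node),
    )
    return [row[0] for row in cur]

print(len(records_for_node(conn, 17, 5)))
```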

score 3 · Answer 1 · 2012-08-23 18:40:40Z

How important is it that formula be "Hash % num_machines"? This formula is used for distributed caches, like memcached. It works great until you add/remove nodes. At that point, the advice is to abandon it and use consistent hashing.
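For readers unfamiliar with the term, here is a minimal sketch of consistent hashing. It is not taken from this answer or from memcached; the class name `HashRing`, the MD5-based ring positions, and the choice of 100 virtual points per node are assumptions for illustration. Each node owns many points on a ring, and a key belongs to the node owning the first point at or after the key's own position.

```python
import bisect
import hashlib

def _point(key: str) -> int:
    """Position on the ring: first 8 bytes of MD5, read as an integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes   # virtual points per node (assumed value)
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place vnodes points for this node on the ring."""
        for i in range(self.vnodes):
            p = _point(f"{node}#{i}")
            self._owner[p] = node
            bisect.insort(self._points, p)

    def node_for(self, key: str) -> str:
        """Owner of the first ring point after the key's position, wrapping around."""
        i = bisect.bisect_right(self._points, _point(key)) % len(self._points)
        return self._owner[self._points[i]]

ring = HashRing(["node-0", "node-1", "node-2"])
keys = [f"record-{i}" for i in range(2000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-3")
moved = [k for k in keys if ring.node_for(k) != before[k]]
print(len(moved) / len(keys))  # share of keys that changed owner
```

Adding a node only inserts new points, so a key either keeps its owner or moves to the new node. That is the stability property the modulo scheme lacks; the trade-off is that finding a node's records is no longer a single `%` comparison.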

answered Aug 23, 2012 at 18:40 by Martin C. Martin

Thanks for the excellent suggestion, although I didn't see an easy application of consistent hashing in this case. Suggestions for a better distribution than Hash % #machines (i.e. anything that's both fair and stable) would also be welcome.

– Steve B., commented Aug 23, 2012 at 19:06

phant0m · Answer 2 · 2012-08-31 10:56:59Z

Unless I've misunderstood something, your conjecture is incorrect.

If we take the prime factors of N (p1 * p2 * ... )

if N % M == I then p % M == I % p for all of N's prime factors

How did you come up with that?

Let's say N = 36 and M = 6

The prime factorization of N is 2 * 2 * 3 * 3, and 36 % 6 = 0, so I = 0.

According to your statement, the following should hold:

p % 6 = 0 % p = 0

But clearly, this is not the case: 2 % 6 = 2 != 0
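The counterexample can be checked mechanically. The sketch below tests the statement exactly as quoted, and also tests one possible reading of what the question may have intended, judging by its 10-node example: if H % M == I, then H % p == I % p for every prime factor p of the modulus M. That second reading is an assumption about the asker's intent, and the function names are hypothetical.

```python
def prime_factors(n: int) -> set:
    """Distinct prime factors of n (n >= 2), by trial division."""
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors

def literal_claim_holds(n: int, m: int) -> bool:
    """The statement as quoted: p % M == I % p for all prime factors p of N."""
    i = n % m
    return all(p % m == i % p for p in prime_factors(n))

def reworded_claim_holds(h: int, m: int) -> bool:
    """Assumed intended reading: H % p == I % p for all prime factors p of M."""
    i = h % m
    return all(h % p == i % p for p in prime_factors(m))

print(literal_claim_holds(36, 6))  # False: 2 % 6 == 2, but 0 % 2 == 0
print(all(reworded_claim_holds(h, m)
          for h in range(500) for m in range(2, 50)))  # True
```

The reworded version always holds because p divides M, so two numbers that agree modulo M also agree modulo p.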
