In my application, I have:
- Collection X: thousands of floating-point numbers in the range [0, 1], not sorted.
- Collection Y: 11 floats in the range [0, 1], sorted.
- The size of X is known. Let it be L.
The goal is to quantize X by mapping it onto Y, so that we get an array of indices into Y, one per element of X. Eventually each value of Y will in turn be mapped to one of 10 discrete items it points to.
An example to make it a bit clearer,
Y = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
X = [0.678, 0.124, ..., 1.0, ..., 0.013, 0.475]
I want the algorithm output to be 0-based indices like these:
H(X[0]) = H(0.678) = 6
H(X[1]) = H(0.124) = 1
H(X[L-2]) = H(0.013) = 0
H(X[L-1]) = H(0.475) = 4
Attempts
Naively, I've tried linear and binary search to position each element of X in Y, so that the element falls between an adjacent pair of elements of Y.
However, the performance is not good enough for my application. This quantization happens in a real-time thread where slow computation is undesirable.
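For reference, this is roughly what the binary-search attempt looks like (a minimal sketch, assuming H(x) is the largest index i with Y[i] <= x, consistent with the example above):

```c
#include <stdio.h>

#define NY 11  /* number of values in Y */

static const double Y[NY] = { 0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
                              0.6, 0.7, 0.8, 0.9, 1.0 };

/* Binary search: largest index i such that Y[i] <= x. */
static int quantize(double x)
{
    int lo = 0, hi = NY - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (Y[mid] <= x)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

int main(void)
{
    double X[] = { 0.678, 0.124, 0.013, 0.475 };
    for (int k = 0; k < 4; ++k)
        printf("H(%.3f) = %d\n", X[k], quantize(X[k]));  /* 6 1 0 4 */
    return 0;
}
```

Per element this is O(log 11) comparisons, and it is this per-sample cost in the real-time thread that I'd like to reduce.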
Question
What is the best way to do this kind of hashing/quantization, given that X is not sorted?
Thanks!
1 Answer
Take x, multiply it by, say, 10,000, and round down to the nearest integer. In plain C: z = (int)(x * 10000.0).
There are 10,000 possible values of z. For most of them, the index can be determined from z alone. So create a table with 10,000 entries: store an index i if, knowing only z, you can prove that x must map to i, and store -1 if you can't.
As a result, you get the correct index directly for probably 9,980 of the 10,000 values, and you use whatever slow algorithm you have for the remaining 1 in 500.
PS. The same table size works for double-precision numbers. Whatever the table size, only a few values of X cannot be mapped directly by this method, maybe 10 or 20. With a table of 10,000 entries, 99.8% or 99.9% of values are mapped correctly and 0.1% or 0.2% need the slow algorithm; exactly the same happens with double. With 1,000 entries, those 10 or 20 failing values would be 1% or 2%.
A nice property is that this method works however the Y values are distributed. Only if the number of Y values grows much larger might you want to increase the table size.
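To make this concrete, here is a minimal sketch under the assumptions of the question (the example Y, H(x) defined as the largest index i with Y[i] <= x, and a table of 10,000 entries); the names build_table, fast_index, and slow_index are mine, not part of any existing API. The table is built once outside the real-time thread, and the fast path falls back to binary search only for the few buckets marked -1:

```c
#include <stdio.h>

#define NY 11             /* number of values in Y          */
#define TABLE_SIZE 10000  /* resolution of the lookup table */

static const double Y[NY] = { 0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
                              0.6, 0.7, 0.8, 0.9, 1.0 };

static int table[TABLE_SIZE];

/* Slow but exact path: largest index i such that Y[i] <= x. */
static int slow_index(double x)
{
    int lo = 0, hi = NY - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (Y[mid] <= x) lo = mid; else hi = mid - 1;
    }
    return lo;
}

/* Build the table once, outside the real-time thread.
   Bucket z covers roughly [z/TABLE_SIZE, (z+1)/TABLE_SIZE].
   If both ends of the bucket map to the same Y index, every x in the
   bucket does too, so store that index; otherwise store -1. */
static void build_table(void)
{
    for (int z = 0; z < TABLE_SIZE; ++z) {
        int ilo = slow_index((double)z / TABLE_SIZE);
        int ihi = slow_index((double)(z + 1) / TABLE_SIZE);
        table[z] = (ilo == ihi) ? ilo : -1;
    }
}

/* Fast path for the real-time thread: one multiply, one table read,
   and only the rare ambiguous buckets fall back to the slow search. */
static int fast_index(double x)
{
    int z = (int)(x * TABLE_SIZE);
    if (z >= TABLE_SIZE) z = TABLE_SIZE - 1;  /* handles x == 1.0 */
    int i = table[z];
    return (i >= 0) ? i : slow_index(x);
}

int main(void)
{
    build_table();
    printf("%d %d %d %d\n",
           fast_index(0.678), fast_index(0.124),
           fast_index(0.013), fast_index(0.475));  /* prints: 6 1 0 4 */
    return 0;
}
```

Marking a bucket -1 whenever its two ends disagree is deliberately conservative, so values near the Y boundaries always go through the exact fallback rather than risk a wrong index.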
- $\begingroup$ Nice. I had a vague idea of this but wasn't sure if it's 100% accurate. So a coarse-to-fine 2-pass approach. Thanks. Will give it a try. $\endgroup$ – kakyo, Jul 15, 2020 at 16:11
- $\begingroup$ One more question: you cited 10,000 because we assume 32-bit floats, correct? For double-precision floats we'd have a bigger table? $\endgroup$ – kakyo, Jul 15, 2020 at 16:13
- $\begingroup$ 11 values: If they are equidistant like the example (or follow a similarly simple pattern), state so in the question. For any given value $x$, is it a requirement that the index $H(y)$ of a lower value $y$ is no higher than $H(x)$? Is it necessary that for a uniform distribution of $X$ values the expected standard deviation of $H(X)$ is low? Is it mandatory that $H(y)$ on one computer/language system is the same as $H(y)$ on a different one? Beware idiosyncrasies of floating point. $\endgroup$
- $\begingroup$ "the performance is not good enough for my application"? (In a different context, you would be urged to add a sketch of the overall task to accomplish, too.) $\endgroup$