In my application, I have:
- Collection X: thousands of floating-point numbers in the range [0, 1], not sorted.
- Collection Y: 11 floats in the range [0, 1], sorted.
- The size of X is known. Let it be L.
The goal is to quantize X by mapping it onto Y, so that we get an array of indices into Y, one per element of X. Eventually each value of Y will in turn be mapped to one of 10 discrete items it points to.
An example to make it a bit clearer,
Y = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
X = [0.678, 0.124, ..., 1.0, ..., 0.013, 0.475]
I want the algorithm output to be 0-based indices like these:
H(X[0]) = H(0.678) = 6
H(X[1]) = H(0.124) = 1
H(X[L-2]) = H(0.013) = 0
H(X[L-1]) = H(0.475) = 4
Attempts
Naively, I've tried linear and binary search to position each element of X in Y, so that the element falls between an adjacent pair of elements of Y.
However, the performance is not good enough for my application. This quantization happens in a real-time thread where slow computation is undesirable.
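For reference, this is roughly what the binary-search attempt looks like (a minimal sketch, assuming H(x) is the largest index i with Y[i] <= x, consistent with the example above):

```c
#include <stdio.h>

#define NY 11  /* number of values in Y */

static const double Y[NY] = { 0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
                              0.6, 0.7, 0.8, 0.9, 1.0 };

/* Binary search: largest index i such that Y[i] <= x. */
static int quantize(double x)
{
    int lo = 0, hi = NY - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (Y[mid] <= x)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

int main(void)
{
    double X[] = { 0.678, 0.124, 0.013, 0.475 };
    for (int k = 0; k < 4; ++k)
        printf("H(%.3f) = %d\n", X[k], quantize(X[k]));  /* 6 1 0 4 */
    return 0;
}
```

Per element this is O(log 11) comparisons, and it is this per-sample cost in the real-time thread that I'd like to reduce.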
Question
What is the best way to do this kind of hashing/quantization, given that X is not sorted?
Thanks!
1 Answer
Take x, multiply it by, say, 10,000, and round down to the nearest integer. In plain C: z = (int)(x * 10000.0).
There are 10,000 possible values of z. For most of them, the index can be determined from z alone. So create a table with 10,000 entries: store an index i if, knowing only z, you can prove that x must map to i, and store -1 if you can't.
As a result, you get the correct index directly for probably 9,980 of the 10,000 values, and you use whatever slow algorithm you have for the remaining 1 in 500.
PS. The same table size works for double-precision numbers. Whatever the table size, only a few values of X cannot be mapped directly by this method, maybe 10 or 20. With a table of 10,000 entries, 99.8% or 99.9% of values are mapped correctly and 0.1% or 0.2% need the slow algorithm; exactly the same happens with double. With 1,000 entries, those 10 or 20 failing values would be 1% or 2%.
A nice property is that this method works however the Y values are distributed. Only if the number of Y values grows much larger might you want to increase the table size.
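To make this concrete, here is a minimal sketch under the assumptions of the question (the example Y, H(x) defined as the largest index i with Y[i] <= x, and a table of 10,000 entries); the names build_table, fast_index, and slow_index are mine, not part of any existing API. The table is built once outside the real-time thread, and the fast path falls back to binary search only for the few buckets marked -1:

```c
#include <stdio.h>

#define NY 11             /* number of values in Y          */
#define TABLE_SIZE 10000  /* resolution of the lookup table */

static const double Y[NY] = { 0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
                              0.6, 0.7, 0.8, 0.9, 1.0 };

static int table[TABLE_SIZE];

/* Slow but exact path: largest index i such that Y[i] <= x. */
static int slow_index(double x)
{
    int lo = 0, hi = NY - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (Y[mid] <= x) lo = mid; else hi = mid - 1;
    }
    return lo;
}

/* Build the table once, outside the real-time thread.
   Bucket z covers roughly [z/TABLE_SIZE, (z+1)/TABLE_SIZE].
   If both ends of the bucket map to the same Y index, every x in the
   bucket does too, so store that index; otherwise store -1. */
static void build_table(void)
{
    for (int z = 0; z < TABLE_SIZE; ++z) {
        int ilo = slow_index((double)z / TABLE_SIZE);
        int ihi = slow_index((double)(z + 1) / TABLE_SIZE);
        table[z] = (ilo == ihi) ? ilo : -1;
    }
}

/* Fast path for the real-time thread: one multiply, one table read,
   and only the rare ambiguous buckets fall back to the slow search. */
static int fast_index(double x)
{
    int z = (int)(x * TABLE_SIZE);
    if (z >= TABLE_SIZE) z = TABLE_SIZE - 1;  /* handles x == 1.0 */
    int i = table[z];
    return (i >= 0) ? i : slow_index(x);
}

int main(void)
{
    build_table();
    printf("%d %d %d %d\n",
           fast_index(0.678), fast_index(0.124),
           fast_index(0.013), fast_index(0.475));  /* prints: 6 1 0 4 */
    return 0;
}
```

Marking a bucket -1 whenever its two ends disagree is deliberately conservative, so values near the Y boundaries always go through the exact fallback rather than risk a wrong index.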
- $\begingroup$ Nice. I had a vague idea of this but wasn't sure if it's 100% accurate. So a coarse-to-fine 2-pass approach. Thanks. Will give it a try. $\endgroup$ – kakyo, Jul 15, 2020 at 16:11
- $\begingroup$ One more question: you cited 10,000 because we assume 32-bit floats, correct? For double-precision floats we'd have a bigger table? $\endgroup$ – kakyo, Jul 15, 2020 at 16:13
- $\begingroup$ 11 values: If they are equidistant like the example (or follow a similarly simple pattern), state so in the question. For any given value $x$, is it a requirement that the index $H(y)$ of a lower value $y$ is no higher than $H(x)$? Is it necessary that for a uniform distribution of $X$ values the expected standard deviation of $H(X)$ is low? Is it mandatory that $H(y)$ on one computer/language system is the same as $H(y)$ on a different one? Beware idiosyncrasies of floating point. $\endgroup$
- $\begingroup$ "the performance is not good enough for my application"? (In a different context, you would be urged to add a sketch of the overall task to accomplish, too.) $\endgroup$