-
Notifications
You must be signed in to change notification settings - Fork 32
Bulk quantiles #26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bulk quantiles #26
Conversation
I started reviewing this PR. This is great work. I created LukeMathWalker#1 with a few changes.
I had a couple of thoughts:
-
I think we should be able to improve on the computational complexity of
get_many_from_sorted_mut_unchecked(which AFAICT currently has an average case complexity ofO(q*n)whereqis the number of quantiles andnis the size of the array).The primary observation is that the current implementation throws away information from within
get_from_sorted_mut. In other words, when we compute the first quantile withget_from_sorted_mut, we're callingpartition_mutat randomly generatedpivot_indexes and shrinking the remaining size each time. When we go to compute the next quantile, the only information we're using from the first quantile's computation is its index; we're discarding all of the previouspartition_mutcalls.We should be able to avoid discarding this information with the following algorithm:
-
Pick a random
pivot_indexand partition the array withlet partition_index = self.partition_mut(pivot_index). -
Partition the search indices by
partition_index.Now, we've split the array and search indices into two pieces: (1) the portion of the array before
partition_indexand the search indices less thanpartition_indexand (2) the portion of the array afterpartition_indexand the search indices abovepartition_index.Any indices that are equal to
partition_indexare finished and should be removed from further recursion. -
For each piece, we have a recursive call (go to (i) for each piece).
(Continue until we run out of indices for which we haven't found the value.)
I think this algorithm has an average case computational complexity of
O((n + q)*log(q)), which is better thanO(q*n)assuminglog(q) < n. AFAICT, the primary disadvantage is that it would be more complex to implement.Does this make sense? Do you think this would be better?
-
-
The
IndexMapreturn types seem inconvenient to use. IMO,get_many_from_sorted_mutshould returnVec<A>(where the order matchesindexes),quantiles_mutshould returnOption<Vec<A>>(where the order matchesqs), andquantiles_axis_mutshould returnOption<Array<A, D>>(where the order along the axis matchesqs).
Edit: It would be good to rebase this branch off the latest master to resolve merge conflicts and incorporate #28.
I have merged your PR - all useful additions/style changes!
I can't confirm the average complexity you estimated for your alternative implementation, but it intuitively looks faster and log(q) < n is true for all relevant cases I'd say. I'll give it a go in a separate branch and then we can run some benchmarks 👍
Re:IndexMap - I think it's a matter of preference, I usually find it error-prone to match input/output indexes, the way NumPy forces you to work sometimes, an IndexMap looks more ergonomic to me . The solid pro I can see in returning an Option<Array<A, D>> is that you can do cross-quantile computation more easily, that is significant.
...e for bulk computation
This has a few advantages: * It's now possible to save the interpolation strategy in a variable and easily re-use it. * We can now freely add more type parameters to the `quantile*` methods as needed without making them more difficult to call. * We now have the flexibility to add more advanced interpolation strategies in the future (e.g. one that wraps a closure). * Calling the `quantile*` methods is now slightly more compact because a turbofish isn't necessary.
This is slightly more versatile because `ArrayBase` allows arbitrary strides.
I finished reviewing this PR and added some more changes to LukeMathWalker#5. The primary additional changes are:
-
The bulk quantiles methods now return an
Arrayinstead of anIndexMap.Option<Array<A, D>>is easier to work with thanOption<IndexMap<N64, Array<A, D::Smaller>>>. It also needs only one heap allocation and should have better performance for most operations. -
I've added the interpolation strategy as an explicit parameter to the quantile methods. See the commit message for a list of the advantages.
-
I've changed
get_many_from_sorted_mutand the bulk quantiles methods to take the indices as an array instead of a slice. This is a little more versatile because arrays can have arbitrary strides. It also seems more consistent with the rest of the API.
With change (1), we can now change qs back to f64 instead of N64 if desired. I think this would probably be a good idea just because most things work with f64 instead of N64.
What do you think?
Improve bulk quantiles
I think it makes sense to take interpolate as a parameter - in a recent conversation with @xd009642 it turned out that it would be nice to expose EquiSpaced as a strategy, but given that it requires some array-independent parameters (e.g. number of bins), it was troublesome to get it to work with the existing arrangement. Nothing to say on the other two changes.
I'd keep N64 - I think that we should leverage the expressiveness of the type system to communicate constraints, if it doesn't add complexity or hinders readability of our API/the code using it.
I have merged master and aligned return types (Result instead of Option).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a few comments. Everything else looks good.
Co-Authored-By: LukeMathWalker <LukeMathWalker@users.noreply.github.com>
I have used InvalidQuantile in the end - let me know what you think @jturner314.
Yay! 🎉 Great job on these PRs @LukeMathWalker!
Thanks for your help @jturner314 - you always manage to make them much better 🙏
Uh oh!
There was an error while loading. Please reload this page.
Using the ordering guarantees we have on the output of
quantile_mut/sorted_get_mut, it provides a method optimized to compute multiple quantiles at once (by scanning an increasingly smaller subset of the original array, thanks to the computation of the previous quantile).Breaking changes:
f64toN64- floats are not hashable, hence they cannot be used as keys inIndexMapand the function panics anyway if that argument isNaN. This change propagates to theInterpolatetypes;sorted_get_muttoget_from_sorted. It plays better with the bulk version,get_many_from_sorted, and I think it's clearer;quantile_axis_mutandquantile_axis_skipnan_mutnow return anOptioninstead of panicking if the axis length is 0.