Bulk quantiles #26

Merged

LukeMathWalker merged 84 commits into rust-ndarray:master from LukeMathWalker:bulk-quantiles

Apr 6, 2019


Conversation

LukeMathWalker (Member) commented Feb 11, 2019 (edited)

Using the ordering guarantees we have on the output of quantile_mut/sorted_get_mut, this PR provides a method optimized to compute multiple quantiles at once: each quantile is found by scanning an increasingly smaller subset of the original array, thanks to the partial ordering left behind by the computation of the previous quantile.
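A minimal sketch of the idea described above, using only the standard library rather than the ndarray-stats methods. It assumes the requested indexes are already sorted in ascending order; `sequential_select` is a hypothetical name, and the standard `select_nth_unstable` stands in for the crate's own selection routine.

```rust
/// Places the k-th smallest element at position k for every k in
/// `sorted_indexes` (assumed sorted in ascending order and in bounds).
///
/// Hypothetical helper for illustration; not the ndarray-stats API.
pub fn sequential_select<T: Ord>(data: &mut [T], sorted_indexes: &[usize]) {
    // Invariant: every element before `start` is <= every element from
    // `start` onwards.
    let mut start = 0;
    for &k in sorted_indexes {
        if k < start {
            // Repeated index: it is already in its final position.
            continue;
        }
        // Only the tail needs scanning: the previous selection guarantees
        // that the k-th smallest element lives in `data[start..]`.
        data[start..].select_nth_unstable(k - start);
        start = k + 1;
    }
}
```

Each call works on a shorter tail than the previous one, which is where the saving over independent selections comes from.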

Breaking changes:

* I have changed the quantile parameter from f64 to N64: floats are not hashable, so they cannot be used as keys in an IndexMap, and the function panics anyway if that argument is NaN. This change propagates to the Interpolate types.
* I have renamed sorted_get_mut to get_from_sorted. It plays better with the bulk version, get_many_from_sorted, and I think it's clearer.
* quantile_axis_mut and quantile_axis_skipnan_mut now return an Option instead of panicking if the axis length is 0.
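To illustrate the first breaking change: `f64` implements neither `Eq` nor `Hash`, so it cannot be a map key, while a wrapper that rules out NaN can. The sketch below uses a hypothetical `NotNan` type and a standard `HashMap` as stand-ins for `N64` and `IndexMap`; it is not how either crate is implemented.

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical NaN-free wrapper, standing in for a type like `N64`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NotNan(f64);

impl NotNan {
    /// Returns `None` for NaN, so invalid quantiles are rejected up front.
    pub fn new(x: f64) -> Option<Self> {
        if x.is_nan() {
            None
        } else {
            Some(NotNan(x))
        }
    }
}

// Without NaN, `==` on the wrapped value is a proper equivalence relation.
impl Eq for NotNan {}

impl Hash for NotNan {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // -0.0 == 0.0, so both must produce the same hash.
        let x = if self.0 == 0.0 { 0.0 } else { self.0 };
        x.to_bits().hash(state);
    }
}

/// Keys each requested quantile to a placeholder result.
/// Returns `None` if any quantile is NaN.
pub fn label_quantiles(qs: &[f64]) -> Option<HashMap<NotNan, String>> {
    qs.iter()
        .map(|&q| NotNan::new(q).map(|key| (key, format!("q = {}", q))))
        .collect()
}
```

The NaN check moves from a runtime panic inside the quantile function to the point where the key is constructed.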

LukeMathWalker added the Breaking changes label on Feb 11, 2019

LukeMathWalker mentioned this pull request on Feb 11, 2019 in Roadmap #1 (open, 17 tasks)

jturner314 (Member) commented Mar 10, 2019 (edited)

I started reviewing this PR. This is great work. I created LukeMathWalker#1 with a few changes.

I had a couple of thoughts:

I think we should be able to improve on the computational complexity of get_many_from_sorted_mut_unchecked (which AFAICT currently has an average case complexity of O(q*n) where q is the number of quantiles and n is the size of the array).

The primary observation is that the current implementation throws away information from within get_from_sorted_mut. In other words, when we compute the first quantile with get_from_sorted_mut, we're calling partition_mut at randomly generated pivot_indexes and shrinking the remaining size each time. When we go to compute the next quantile, the only information we're using from the first quantile's computation is its index; we're discarding all of the previous partition_mut calls.

We should be able to avoid discarding this information with the following algorithm:
1. Pick a random pivot_index and partition the array with let partition_index = self.partition_mut(pivot_index).
2. Partition the search indices by partition_index.

   Now, we've split the array and search indices into two pieces: (1) the portion of the array before partition_index and the search indices less than partition_index and (2) the portion of the array after partition_index and the search indices above partition_index.

   Any indices that are equal to partition_index are finished and should be removed from further recursion.
3. For each piece, make a recursive call (go to step 1 for each piece).

Continue until we run out of indices for which we haven't found the value.

I think this algorithm has an average case computational complexity of O((n + q)*log(q)), which is better than O(q*n) assuming log(q) < n. AFAICT, the primary disadvantage is that it would be more complex to implement.
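The steps above can be sketched on a plain slice as follows. This is an illustration under assumptions, not the code in this PR: `partition`, `select_many` and `multi_select` are hypothetical names, the pivot is the middle element rather than a random one (the standard library has no random number generator), and the result is returned as a `Vec` in the order of the requested indexes.

```rust
/// Lomuto partition around the value at `pivot_index`.
/// Returns the final position of the pivot: everything before it is
/// smaller, everything after it is greater than or equal.
fn partition<T: Ord>(data: &mut [T], pivot_index: usize) -> usize {
    let last = data.len() - 1;
    data.swap(pivot_index, last);
    let mut store = 0;
    for i in 0..last {
        if data[i] < data[last] {
            data.swap(i, store);
            store += 1;
        }
    }
    data.swap(store, last);
    store
}

/// `data` is the sub-slice starting at absolute position `offset`;
/// `indexes` holds sorted, deduplicated absolute positions inside it.
fn select_many<T: Ord>(data: &mut [T], offset: usize, indexes: &[usize]) {
    if indexes.is_empty() || data.len() <= 1 {
        return;
    }
    // Step 1: partition the array (middle pivot instead of a random one).
    let p = partition(data, data.len() / 2);
    let p_abs = offset + p;
    // Step 2: partition the search indexes; an index equal to `p_abs`
    // is finished and falls between `below` and `above`.
    let below = indexes.partition_point(|&i| i < p_abs);
    let above = indexes.partition_point(|&i| i <= p_abs);
    // Step 3: recurse into each piece with its own indexes.
    let (left, rest) = data.split_at_mut(p);
    select_many(left, offset, &indexes[..below]);
    select_many(&mut rest[1..], p_abs + 1, &indexes[above..]);
}

/// Returns the values that would sit at `indexes` if `data` were sorted,
/// in the same order as `indexes`. Panics on an out-of-bounds index.
pub fn multi_select<T: Ord + Clone>(data: &mut [T], indexes: &[usize]) -> Vec<T> {
    let mut sorted = indexes.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    assert!(
        sorted.last().map_or(true, |&i| i < data.len()),
        "index out of bounds"
    );
    select_many(data, 0, &sorted);
    indexes.iter().map(|&i| data[i].clone()).collect()
}
```

Every partition is reused by all the indexes that fall on the same side of the pivot, which is the information the one-quantile-at-a-time approach discards.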

Does this make sense? Do you think this would be better?
The IndexMap return types seem inconvenient to use. IMO, get_many_from_sorted_mut should return Vec<A> (where the order matches indexes), quantiles_mut should return Option<Vec<A>> (where the order matches qs), and quantiles_axis_mut should return Option<Array<A, D>> (where the order along the axis matches qs).

Edit: It would be good to rebase this branch off the latest master to resolve merge conflicts and incorporate #28.

LukeMathWalker (Member, Author) commented Mar 10, 2019

I have merged your PR - all useful additions/style changes!

I can't confirm the average complexity you estimated for your alternative implementation, but it intuitively looks faster and log(q) < n is true for all relevant cases I'd say. I'll give it a go in a separate branch and then we can run some benchmarks 👍

Re: IndexMap - I think it's a matter of preference. I usually find it error-prone to match input/output indexes, the way NumPy sometimes forces you to work, so an IndexMap looks more ergonomic to me. The solid pro I can see in returning an Option<Array<A, D>> is that you can do cross-quantile computation more easily, and that is significant.

LukeMathWalker added 26 commits

March 16, 2019 18:50

* Promoted module to directory (06b82be)
* Moved interpolate to separate file (47c1696)
* Re-implemented quantile_axis_mut to get closer to something we can use for bulk computation (8f1e7cd)
* Use a set instead of a vec to avoid repeating computations (c81f6be)
* Use bulk method for single quantile (7aee452)
* Implement bulk method to get sorted (745e45b)
* Refactored quantiles_axis_mut to use sorted_get_many_mut (74eda81)
* Avoid recomputing index value (93531de)
* Add quantiles_mut to 1d trait (c00620d)
* Return hashmaps from bulk methods (a7111e9)
* Fixed tests (36284d2)
* Use IndexSet to preserve insertion order (fc56ca4)
* Fix indentation (67a4477)
* IndexMap provides a more intuitive behaviour (ac0ca03)
* Remove prints (a4c1508)
* Renamed methods (aa3a157)
* Docs for get_many_from_sorted_mut (2ea9233)
* Added docs for private free function (12a7944)
* Docs for quantiles_mut (ac93a1e)
* Fixed several typos in docs (c408c67)
* More robust test (c471955)
* Added test for quantiles (1411f15)
* Test quantiles_axis_mut (48f2bf0)
* Add comments (c27feb1)
* Return options when the lane we are computing against is empty (00e14f7)
* Fixed docs (846c336)

jturner314 added 5 commits

April 1, 2019 20:08

* Make quantiles_* return Array instead of IndexMap (e965e85)
* Add interpolate parameter to quantile* (cfc408f)

  This has a few advantages:
  * It's now possible to save the interpolation strategy in a variable
    and easily re-use it.
  * We can now freely add more type parameters to the `quantile*`
    methods as needed without making them more difficult to call.
  * We now have the flexibility to add more advanced interpolation
    strategies in the future (e.g. one that wraps a closure).
  * Calling the `quantile*` methods is now slightly more compact because
    a turbofish isn't necessary.
* Make get_many_from_sorted_mut take array of indexes (b5d8a08)

  This is slightly more versatile because `ArrayBase` allows arbitrary
  strides.
* Make quantiles* take array instead of slice (00a21c0)
* Remove unnecessary IndexSet (8f9f0b6)

jturner314 (Member) commented Apr 2, 2019

I finished reviewing this PR and added some more changes to LukeMathWalker#5. The primary additional changes are:

1. The bulk quantiles methods now return an Array instead of an IndexMap. Option<Array<A, D>> is easier to work with than Option<IndexMap<N64, Array<A, D::Smaller>>>. It also needs only one heap allocation and should have better performance for most operations.
2. I've added the interpolation strategy as an explicit parameter to the quantile methods. See the commit message for a list of the advantages.
3. I've changed get_many_from_sorted_mut and the bulk quantiles methods to take the indices as an array instead of a slice. This is a little more versatile because arrays can have arbitrary strides. It also seems more consistent with the rest of the API.
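As a rough illustration of passing the interpolation strategy as a value rather than as a type parameter, here is a simplified sketch on plain slices. The `Interpolate` trait, the `Lower` and `Linear` structs and the `quantile` function below are hypothetical and much simpler than the crate's own; they only show the calling convention.

```rust
/// Hypothetical, simplified interpolation strategy.
pub trait Interpolate {
    /// Combines the two neighbouring sorted values; `frac` in [0, 1) is
    /// the fractional part of the requested position.
    fn interpolate(&self, lower: f64, higher: f64, frac: f64) -> f64;
}

pub struct Lower;
pub struct Linear;

impl Interpolate for Lower {
    fn interpolate(&self, lower: f64, _higher: f64, _frac: f64) -> f64 {
        lower
    }
}

impl Interpolate for Linear {
    fn interpolate(&self, lower: f64, higher: f64, frac: f64) -> f64 {
        lower + (higher - lower) * frac
    }
}

/// Returns `None` for empty input or for `q` outside [0, 1] (NaN included).
/// Sorts a copy for simplicity instead of using a selection algorithm.
pub fn quantile<I: Interpolate>(data: &[f64], q: f64, strategy: &I) -> Option<f64> {
    if data.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    Some(strategy.interpolate(sorted[lo], sorted[hi], pos - pos.floor()))
}
```

With this shape a caller can write `let strategy = Linear;` once and pass `&strategy` to several calls, and the strategy type is inferred from the argument, so no turbofish is needed. A strategy that carries its own parameters would fit the same signature.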

With change (1), we can now change qs back to f64 instead of N64 if desired. I think this would probably be a good idea just because most things work with f64 instead of N64.

What do you think?

* Merge pull request #5 from jturner314/bulk-quantiles (a4e8c5d)

  Improve bulk quantiles

LukeMathWalker (Member, Author) commented Apr 2, 2019

I think it makes sense to take interpolate as a parameter - in a recent conversation with @xd009642 it turned out that it would be nice to expose EquiSpaced as a strategy, but given that it requires some array-independent parameters (e.g. number of bins), it was troublesome to get it to work with the existing arrangement. Nothing to say on the other two changes.

I'd keep N64 - I think that we should leverage the expressiveness of the type system to communicate constraints, as long as it doesn't add complexity or hinder the readability of our API or the code using it.

LukeMathWalker added 6 commits

April 2, 2019 08:27

* Merge master (beec7ae)
* Return EmptyInput instead of None (5ff4430)
* Fix tests (ca9f3db)
* Match output type for argmin/max_skipnan (7ca6b7f)
* Fix tests (22cbfbb)
* Fmt (56906cf)

LukeMathWalker (Member, Author) commented Apr 2, 2019

I have merged master and aligned return types (Result instead of Option).

jturner314 (Member) reviewed Apr 2, 2019

I added a few comments. Everything else looks good.

* src/lib.rs (outdated, resolved)
* src/quantile/mod.rs (outdated, resolved)
* src/quantile/mod.rs (resolved)
* src/quantile/mod.rs (outdated, resolved)

jturner314 and others added 6 commits

April 5, 2019 08:27

* Update src/lib.rs (950cd44)

  Co-Authored-By: LukeMathWalker <LukeMathWalker@users.noreply.github.com>
* Add quantile error (37b3b19)
* Renamed InvalidFraction to InvalidQuantile (1f37d44)
* Return QuantileError (1e9ba18)
* Fix tests (caad47d)
* Fix docs (fab842c)

LukeMathWalker (Member, Author) commented Apr 5, 2019

I have used InvalidQuantile in the end - let me know what you think @jturner314.
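For readers following the move from Option to Result, a hedged sketch of what such an error type can look like. The variant names are taken from the commit messages in this thread; the payload type, the `Display` text and the `quantile_index` helper are assumptions made for illustration and are not the crate's definitions.

```rust
use std::fmt;

/// Hypothetical error type, loosely modelled on the names used in this PR.
#[derive(Debug, Clone, PartialEq)]
pub enum QuantileError {
    /// The array (or lane) has no elements.
    EmptyInput,
    /// The requested quantile is not in [0, 1].
    InvalidQuantile(f64),
}

impl fmt::Display for QuantileError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            QuantileError::EmptyInput => write!(f, "empty input"),
            QuantileError::InvalidQuantile(q) => {
                write!(f, "{} is not between 0 and 1 (inclusive)", q)
            }
        }
    }
}

impl std::error::Error for QuantileError {}

/// Maps a quantile to the nearest index of a sorted sequence of `len`
/// elements, reporting why the request cannot be served instead of
/// returning `None` or panicking.
pub fn quantile_index(len: usize, q: f64) -> Result<usize, QuantileError> {
    if !(0.0..=1.0).contains(&q) {
        return Err(QuantileError::InvalidQuantile(q));
    }
    if len == 0 {
        return Err(QuantileError::EmptyInput);
    }
    Ok((q * (len - 1) as f64).round() as usize)
}
```

Compared with Option, the caller can tell an empty lane apart from a bad quantile argument.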

* Fmt (a32d9a8)

jturner314 reviewed Apr 6, 2019

* src/quantile/mod.rs (outdated, resolved)

* Simplify and deduplicate (a315f70)

LukeMathWalker merged commit ed3f782 into rust-ndarray:master on Apr 6, 2019

LukeMathWalker deleted the bulk-quantiles branch on April 6, 2019 17:38

jturner314 (Member) commented Apr 7, 2019

Yay! 🎉 Great job on these PRs @LukeMathWalker!

LukeMathWalker (Member, Author) commented Apr 9, 2019

Thanks for your help @jturner314 - you always manage to make them much better 🙏

Labels

Breaking changes

2 participants

@LukeMathWalker @jturner314
