-
-
Notifications
You must be signed in to change notification settings - Fork 2.7k
⚡ avoid expensive & unnecessary groupby in px #3761
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for this PR, it looks like a promising optimization!
I certainly like the idea of skipping the grouping when it's all one_group... When plotting, say, a 5000-row dataframe with 20 colors and 5 symbols or something, what is the performance hit here of the nunique call?
In the latest commit I did some minor refactoring (renaming variables) + improved the efficiency of the groupby check that I added (replaced .nunique with the suggestion from here https://stackoverflow.com/a/54405767)
I checked the performance for the following plot; (is this what you intended?)
import pandas as pd import numpy as np import plotly.express as px df = pd.DataFrame(np.random.randn(5_000, 1), columns=["data"]) df["c"] = np.tile(np.arange(20), 5_000 // 20) # 20 colors df["s"] = np.tile(np.arange(5), 5_000 // 5) # 5 symbols fig = px.scatter(df, y="data", color="c", symbol="s")
The checks that I added for grouping (i.e., creating the all_same_group variable) take 0.1-0.2% of the total time of make_figure.
%lprun -f px._core.make_figure px.scatter(df, y="data", color="c", symbol="s")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the optimization where we don't group if the groupers are all one_group but I'm worried about the duplication of logic here... If we just do the optimization when no grouping is requested at all (i.e. leave out the case when we do have e.g. symbol but there's only one value) then this PR can be less invasive/have less duplication. I've done an implementation of that over here #3765 if you want to take a look at it.
I think that most of the value of this PR comes from the "all one_group" case, and that leaving out the case where there happens to only be one group would be OK for now as this is likely not all that common, but I'm not dead set on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the optimization where we don't group if the groupers are all one_group but I'm worried about the duplication of logic here... If we just do the optimization when no grouping is requested at all (i.e. leave out the case when we do have e.g. symbol but there's only one value) then this PR can be less invasive/have less duplication. I've done an implementation of that over here #3765 if you want to take a look at it.
I think that most of the value of this PR comes from the "all one_group" case, and that leaving out the case where there happens to only be one group would be OK for now as this is likely not all that common, but I'm not dead set on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In digging into this a bit more, I see/remember that we actually call .uniques() on all non-one_group groups in the awkward get_orderings() function anyway, so if we did want to do this other optimization, maybe (re-)inlining the code from get_orderings() into make_figure() and interleaving this check would make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean like your implementation in #3765?
Superseded by #3765
Uh oh!
There was an error while loading. Please reload this page.
This PR enables significant speed-ups for
plotly.expresswhen no groupby operation is required. This PR stems from the observations I made in #1743; current behavior always performs at least one groupby, even when this is not necessary (e.g.,px.scatterof a 1D array).Performing such a groupby is a rather expensive operation + the
.get_groupand theget_ordeningsoperations are also rather time-consuming. This PR checks via theall_samevariable whether grouping is required. If no grouping is necessary, no groupby will be performed, resulting in significant performance gains.Illustration of the speed-up
plotly 5.8.0
image
this PR
image
Can you validate that this speed-up is correct?
(FYI: all pytest tests succeed locally)
Code PR checklist
plotly.graph_objects, my modifications concern thecodegenfiles and not generated files.