-
-
Couldn't load subscription status.
- Fork 2.7k
Support dataframe interchange protocol #4244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This allows plotly express to take in any dataframe that supports the dataframe protocol, see: https://data-apis.org/blog/dataframe_protocol_rfc/ https://data-apis.org/dataframe-protocol/latest/index.html Test includes an example with vaex, which should work with vaexio/vaex#1509 (not yet released)
...erchange-protocol
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
a5f792b to
9b2154e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a mock test, "plotly" do not need to have a test for every library that supports this protocol.
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
This looks great, thank you! So just so I'm clear on the user-facing impact here: this should make it possible for PX to accept Vaex and Modin dataframes (which don't have .to_pandas()) and will use this new (better? faster?) pathway for other non-Pandas systems that happen to also have a .to_pandas() like... CuDF and Polars? So hard to keep track of all the flavours 🍨
This looks great, thank you! So just so I'm clear on the user-facing impact here: this should make it possible for PX to accept Vaex and Modin dataframes (which don't have
.to_pandas()) and will use this new (better? faster?) pathway for other non-Pandas systems that happen to also have a.to_pandas()like... CuDF and Polars? So hard to keep track of all the flavours 🍨
You're right. The new path will be used even if the library has to_pandas method. I believe this is the preferred path, as this protocol aims to become a widely accepted standard, which should be accompanied by good support. From a performance point of view, it seems most likely that this way could be more performant (for example, in the protocol itself there is the following mention: Must be zero-copy wherever possible), at least as good as a simple to_pandas call.
@nicolaskruchten CI is green, may I know what do you think about merging it?
I'll endorse this PR but I'll defer to @alexcjohnson and @LiamConnors for timing on when to merge it, how to document it etc :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💃 This seems good to me in principle although I haven't tested it manually and I don't have a strong sense of what the list is of which non-Pandas dataframes will go through the new codepath and which will be zero-copy, so it's not super-clear to me how to document this.
This PR also needs a changelog entry (which should probably contain some of the answers to the questions above)
Hi @alexcjohnson! May I have your opinion on this pull request?
I haven't looked enough into what you do with the DataFrame objects, but I just wanted to note that if df is a polars DataFrame, then:
pd.api.interchange.from_dataframe(df)will probably make some copy (currently, the result doesn't use pyarrow)df.to_pandas()should typically be zero-copy
So it may be more user-friendly to first try to_pandas, and then fallback to the interchange protocl? (cc @ritchie46 in case I got this wrong)
ritchie46
commented
Jun 20, 2023
df.to_pandas() should typically be zero-copy
df.to_pandas(use_pyarrow_extension_array=True) is zero copy yes.
Right, and that's not the default (I thought it was - sorry, I should've checked)
OK, then using the interchange protocol wouldn't be a regression for polars
Maybe a separate PR could be made so that if it is a polars dataframe, then to_pandas with use_pyarrow_extension_array=True should be passed
But that's a separate discussion, sorry everyone for the noise
+1 for proceeding with this!
This looks great, thanks @anmyachev, and thanks @nicolaskruchten for reviewing. I'd love to see a test or two that actually directly use polars and vaex dataframes in px. Compliance with a standard protocol is a great way to achieve this, and in principle if these other packages don't support this it's their problem to solve not ours, but (a) fundamentally what our users care about is that they can use their preferred dataframe manager, and (b) having tests will allow us over time to deepen this support to improve performance.
That and a changelog entry 😎
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
CHANGELOG.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if cudf supports this protocol
...erchange-protocol
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this change, the protocol cannot be used in tests (for example, vaex test).
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
...from_vaex' test Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
This looks great, thanks @anmyachev, and thanks @nicolaskruchten for reviewing. I'd love to see a test or two that actually directly use polars and vaex dataframes in px. Compliance with a standard protocol is a great way to achieve this, and in principle if these other packages don't support this it's their problem to solve not ours, but (a) fundamentally what our users care about is that they can use their preferred dataframe manager, and (b) having tests will allow us over time to deepen this support to improve performance.
That and a changelog entry 😎
@alexcjohnson thanks for the answer! I made the changes you mentioned. Ready for review.
ritchie46
commented
Jun 26, 2023
Could there be a polars test for this as well? It would be great for us to know that plotly always stays working on both sides. :)
Could there be a polars test for this as well? It would be great for us to know that plotly always stays working on both sides. :)
@ritchie46 Sure, I can add. Does polars have a function like from_pandas in vaex? (for example, vaex.from_pandas(iris_pandas) call in the test)
ritchie46
commented
Jun 26, 2023
Yes, there is pl.from_pandas. 👍
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@alexcjohnson friendly ping :)
I tried this out by creating a new venv and installing pandas and polars. When I run an app I get the following traceback asking for pyarrow>=11.0.0
Traceback (most recent call last):
File "app.py", line 19, in <module>
fig = px.scatter(df, x="date", y="float")
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 66, in scatter
return make_figure(args=locals(), constructor=go.Scatter)
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2012, in make_figure
args = build_dataframe(args, constructor)
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1317, in build_dataframe
df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/pandas/core/interchange/from_dataframe.py", line 54, in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 1212, in __dataframe__
raise ImportError(
ImportError: pyarrow>=11.0.0 is required for converting a Polars dataframe to a dataframe interchange object.
Hmm we definitely don't want a hard dependency on pyarrow ... can we check to see if it's present before using this branch if it's routinely-required?
pyarrow isn't strictly required for interchanging to pandas, I think it's just that polars implements the interchange protocol by going via pyarrow's implementation of it
why not swap the branches, and try to_pandas first?
@LiamConnors thanks for the feedback.
Hmm we definitely don't want a hard dependency on pyarrow ... can we check to see if it's present before using this branch if it's routinely-required?
I think right now I can rewrite the logic as follows (so that when the version of plotly is updated, the code does not stop working for users):
try: pandas.api.interchange.from_dataframe(df_not_pandas) except (ImportError, NotImplemetedError): df_not_pandas.to_pandas()
why not swap the branches, and try to_pandas first?
@MarcoGorelli I guess we should try the more standardized and basic way first than specialized. In the case when the dataframe library has both options implemented, then the execution of the code will not reach a potentially better way.
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@nicolaskruchten CI build failed for another reason. Could you manually restart it if possible?
Co-authored-by: Alex Johnson <alex@plot.ly>
@LiamConnors thanks for testing this out - would you try again, both without pyarrow and then with it installed? AFAICT the logic is solid at this point, so if you're happy with how it works I'll be happy to approve and merge. Thanks for the iterations @anmyachev 🙇
@YarShev
YarShev
Jun 30, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you not add modin here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_build_df_using_interchange_protocol_mock test should be enough for modin. On the other hand, I do not want to expand the list of dependencies, there are already enough of them.
I guess this was already there, because I now get this:
Traceback (most recent call last):
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1318, in build_dataframe
df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/pandas/core/interchange/from_dataframe.py", line 54, in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 1212, in __dataframe__
raise ImportError(
ImportError: pyarrow>=11.0.0 is required for converting a Polars dataframe to a dataframe interchange object.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "app.py", line 19, in <module>
fig = px.scatter(df, x="date", y="float")
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 66, in scatter
return make_figure(args=locals(), constructor=go.Scatter)
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2022, in make_figure
args = build_dataframe(args, constructor)
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1327, in build_dataframe
df_pandas = df_not_pandas.to_pandas()
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 2076, in to_pandas
record_batches = self._df.to_pandas()
ModuleNotFoundError: No module named 'pyarrow'
I guess this was already there, because I now get this:
Traceback (most recent call last): File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1318, in build_dataframe df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas) File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/pandas/core/interchange/from_dataframe.py", line 54, in from_dataframe return _from_dataframe(df.__dataframe__(allow_copy=allow_copy)) File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 1212, in __dataframe__ raise ImportError( ImportError: pyarrow>=11.0.0 is required for converting a Polars dataframe to a dataframe interchange object. During handling of the above exception, another exception occurred: Traceback (most recent call last): File "app.py", line 19, in <module> fig = px.scatter(df, x="date", y="float") File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 66, in scatter return make_figure(args=locals(), constructor=go.Scatter) File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2022, in make_figure args = build_dataframe(args, constructor) File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1327, in build_dataframe df_pandas = df_not_pandas.to_pandas() File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 2076, in to_pandas record_batches = self._df.to_pandas() ModuleNotFoundError: No module named 'pyarrow'
Do I understand correctly that this should not stop the merge of this pull request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'll want ModuleNotFoundError, else you'll get the reported error if pyarrow isn't installed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ModuleNotFoundError comes from to_pandas call. The same behavior should already be on the master branch.
Stack from error above:
File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1327, in build_dataframe df_pandas = df_not_pandas.to_pandas() File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 2076, in to_pandas record_batches = self._df.to_pandas() ModuleNotFoundError: No module named 'pyarrow'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah you're right - it doesn't make any difference then, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're OK:
>>> issubclass(ModuleNotFoundError, ImportError) True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested with pyarrow installed and didn't encounter any other issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I understand correctly that this should not stop the merge of this pull request?
Yep, at that point we're out of fallbacks, all we can do is allow that error to propagate, so the user is alerted to install pyarrow if they want to use this feature.
💃 Merging, we'll likely release this with the next plotly.js, which I'm guessing will be in a week or so. Thanks again @anmyachev, and everyone else for your comments.
Thanks for the review!
This pull request continues the work that began in #3387.