Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Support dataframe interchange protocol #4244

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
alexcjohnson merged 17 commits into plotly:master from anmyachev:interchange-protocol
Jun 30, 2023

Conversation

@anmyachev
Copy link
Contributor

@anmyachev anmyachev commented Jun 12, 2023

This pull request continues the work that began in #3387.

alexander-beedie, MarcoGorelli, and lostmygithubaccount reacted with hooray emoji MarcoGorelli, legendof-selda, and lostmygithubaccount reacted with rocket emoji
maartenbreddels and others added 5 commits June 12, 2023 18:10
This allows plotly express to take in any dataframe that supports
the dataframe protocol, see:
https://data-apis.org/blog/dataframe_protocol_rfc/
https://data-apis.org/dataframe-protocol/latest/index.html
Test includes an example with vaex, which should work with
vaexio/vaex#1509
(not yet released)
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
assert_frame_equal(tips.reset_index()[out["data_frame"].columns], out["data_frame"])


def test_build_df_using_interchange_protocol_mock(add_interchange_module_for_old_pandas):
Copy link
Contributor Author

@anmyachev anmyachev Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a mock test, "plotly" do not need to have a test for every library that supports this protocol.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Copy link
Contributor

This looks great, thank you! So just so I'm clear on the user-facing impact here: this should make it possible for PX to accept Vaex and Modin dataframes (which don't have .to_pandas()) and will use this new (better? faster?) pathway for other non-Pandas systems that happen to also have a .to_pandas() like... CuDF and Polars? So hard to keep track of all the flavours 🍨

Copy link
Contributor Author

This looks great, thank you! So just so I'm clear on the user-facing impact here: this should make it possible for PX to accept Vaex and Modin dataframes (which don't have .to_pandas()) and will use this new (better? faster?) pathway for other non-Pandas systems that happen to also have a .to_pandas() like... CuDF and Polars? So hard to keep track of all the flavours 🍨

You're right. The new path will be used even if the library has to_pandas method. I believe this is the preferred path, as this protocol aims to become a widely accepted standard, which should be accompanied by good support. From a performance point of view, it seems most likely that this way could be more performant (for example, in the protocol itself there is the following mention: Must be zero-copy wherever possible), at least as good as a simple to_pandas call.

Copy link
Contributor Author

@nicolaskruchten CI is green, may I know what do you think about merging it?

Copy link
Contributor

I'll endorse this PR but I'll defer to @alexcjohnson and @LiamConnors for timing on when to merge it, how to document it etc :)

anmyachev reacted with thumbs up emoji

Copy link
Contributor

@nicolaskruchten nicolaskruchten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💃 This seems good to me in principle although I haven't tested it manually and I don't have a strong sense of what the list is of which non-Pandas dataframes will go through the new codepath and which will be zero-copy, so it's not super-clear to me how to document this.

This PR also needs a changelog entry (which should probably contain some of the answers to the questions above)

anmyachev reacted with thumbs up emoji
Copy link
Contributor Author

Hi @alexcjohnson! May I have your opinion on this pull request?

Copy link
Contributor

MarcoGorelli commented Jun 20, 2023
edited
Loading

I haven't looked enough into what you do with the DataFrame objects, but I just wanted to note that if df is a polars DataFrame, then:

  • pd.api.interchange.from_dataframe(df) will probably make some copy (currently, the result doesn't use pyarrow)
  • df.to_pandas() should typically be zero-copy

So it may be more user-friendly to first try to_pandas, and then fallback to the interchange protocl? (cc @ritchie46 in case I got this wrong)

Copy link

df.to_pandas() should typically be zero-copy

df.to_pandas(use_pyarrow_extension_array=True) is zero copy yes.

Copy link
Contributor

MarcoGorelli commented Jun 22, 2023
edited
Loading

Right, and that's not the default (I thought it was - sorry, I should've checked)

OK, then using the interchange protocol wouldn't be a regression for polars

Maybe a separate PR could be made so that if it is a polars dataframe, then to_pandas with use_pyarrow_extension_array=True should be passed

But that's a separate discussion, sorry everyone for the noise

+1 for proceeding with this!

Copy link
Collaborator

This looks great, thanks @anmyachev, and thanks @nicolaskruchten for reviewing. I'd love to see a test or two that actually directly use polars and vaex dataframes in px. Compliance with a standard protocol is a great way to achieve this, and in principle if these other packages don't support this it's their problem to solve not ours, but (a) fundamentally what our users care about is that they can use their preferred dataframe manager, and (b) having tests will allow us over time to deepen this support to improve performance.

That and a changelog entry 😎

anmyachev reacted with thumbs up emoji ritchie46 reacted with rocket emoji

anmyachev added 2 commits June 26, 2023 15:20
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
CHANGELOG.md Outdated
- Add `legend.xref` and `legend.yref` to enable container-referenced positioning of legends [[#6589](https://github.com/plotly/plotly.js/pull/6589)], with thanks to [Gamma Technologies](https://www.gtisoft.com/) for sponsoring the related development.
- Add `colorbar.xref` and `colorbar.yref` to enable container-referenced positioning of colorbars [[#6593](https://github.com/plotly/plotly.js/pull/6593)], with thanks to [Gamma Technologies](https://www.gtisoft.com/) for sponsoring the related development.
- `px` methods now accept data-frame-like objects that support a `to_pandas()` method, such as polars, cudf, vaex etc
- `px` methods now accept data-frame-like objects that support a [dataframe interchange protocol](https://data-apis.org/dataframe-protocol/latest/index.html), such as polars, vaex, modin etc
Copy link
Contributor Author

@anmyachev anmyachev Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if cudf supports this protocol

anmyachev added 4 commits June 26, 2023 15:31
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
requests==2.25.1
tenacity==6.2.0
pandas==2.0.1
pandas==2.0.2
Copy link
Contributor Author

@anmyachev anmyachev Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this change, the protocol cannot be used in tests (for example, vaex test).

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
...from_vaex' test
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Copy link
Contributor Author

This looks great, thanks @anmyachev, and thanks @nicolaskruchten for reviewing. I'd love to see a test or two that actually directly use polars and vaex dataframes in px. Compliance with a standard protocol is a great way to achieve this, and in principle if these other packages don't support this it's their problem to solve not ours, but (a) fundamentally what our users care about is that they can use their preferred dataframe manager, and (b) having tests will allow us over time to deepen this support to improve performance.

That and a changelog entry 😎

@alexcjohnson thanks for the answer! I made the changes you mentioned. Ready for review.

Copy link

Could there be a polars test for this as well? It would be great for us to know that plotly always stays working on both sides. :)

Copy link
Contributor Author

anmyachev commented Jun 26, 2023
edited
Loading

Could there be a polars test for this as well? It would be great for us to know that plotly always stays working on both sides. :)

@ritchie46 Sure, I can add. Does polars have a function like from_pandas in vaex? (for example, vaex.from_pandas(iris_pandas) call in the test)

Copy link

Yes, there is pl.from_pandas. 👍

anmyachev reacted with thumbs up emoji

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Copy link
Contributor Author

@alexcjohnson friendly ping :)

Copy link
Member

I tried this out by creating a new venv and installing pandas and polars. When I run an app I get the following traceback asking for pyarrow>=11.0.0

Traceback (most recent call last):
 File "app.py", line 19, in <module>
 fig = px.scatter(df, x="date", y="float")
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 66, in scatter
 return make_figure(args=locals(), constructor=go.Scatter)
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2012, in make_figure
 args = build_dataframe(args, constructor)
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1317, in build_dataframe
 df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/pandas/core/interchange/from_dataframe.py", line 54, in from_dataframe
 return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 1212, in __dataframe__
 raise ImportError(
ImportError: pyarrow>=11.0.0 is required for converting a Polars dataframe to a dataframe interchange object.

Copy link
Contributor

Hmm we definitely don't want a hard dependency on pyarrow ... can we check to see if it's present before using this branch if it's routinely-required?

Copy link
Contributor

pyarrow isn't strictly required for interchanging to pandas, I think it's just that polars implements the interchange protocol by going via pyarrow's implementation of it

why not swap the branches, and try to_pandas first?

Copy link
Contributor Author

@LiamConnors thanks for the feedback.

Hmm we definitely don't want a hard dependency on pyarrow ... can we check to see if it's present before using this branch if it's routinely-required?

I think right now I can rewrite the logic as follows (so that when the version of plotly is updated, the code does not stop working for users):

try:
 pandas.api.interchange.from_dataframe(df_not_pandas)
except (ImportError, NotImplemetedError):
 df_not_pandas.to_pandas()

why not swap the branches, and try to_pandas first?

@MarcoGorelli I guess we should try the more standardized and basic way first than specialized. In the case when the dataframe library has both options implemented, then the execution of the code will not reach a potentially better way.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Copy link
Contributor Author

@nicolaskruchten CI build failed for another reason. Could you manually restart it if possible?

Copy link
Collaborator

@LiamConnors thanks for testing this out - would you try again, both without pyarrow and then with it installed? AFAICT the logic is solid at this point, so if you're happy with how it works I'll be happy to approve and merge. Thanks for the iterations @anmyachev 🙇

scikit-image==0.18.1
psutil==5.7.0
kaleido
vaex
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you not add modin here?

Copy link
Contributor Author

@anmyachev anmyachev Jun 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_build_df_using_interchange_protocol_mock test should be enough for modin. On the other hand, I do not want to expand the list of dependencies, there are already enough of them.

MarcoGorelli reacted with thumbs up emoji
Copy link
Member

I guess this was already there, because I now get this:

Traceback (most recent call last):
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1318, in build_dataframe
 df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/pandas/core/interchange/from_dataframe.py", line 54, in from_dataframe
 return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 1212, in __dataframe__
 raise ImportError(
ImportError: pyarrow>=11.0.0 is required for converting a Polars dataframe to a dataframe interchange object.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
 File "app.py", line 19, in <module>
 fig = px.scatter(df, x="date", y="float")
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 66, in scatter
 return make_figure(args=locals(), constructor=go.Scatter)
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2022, in make_figure
 args = build_dataframe(args, constructor)
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1327, in build_dataframe
 df_pandas = df_not_pandas.to_pandas()
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 2076, in to_pandas
 record_batches = self._df.to_pandas()
ModuleNotFoundError: No module named 'pyarrow'

Copy link
Contributor Author

I guess this was already there, because I now get this:

Traceback (most recent call last):
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1318, in build_dataframe
 df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/pandas/core/interchange/from_dataframe.py", line 54, in from_dataframe
 return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 1212, in __dataframe__
 raise ImportError(
ImportError: pyarrow>=11.0.0 is required for converting a Polars dataframe to a dataframe interchange object.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
 File "app.py", line 19, in <module>
 fig = px.scatter(df, x="date", y="float")
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 66, in scatter
 return make_figure(args=locals(), constructor=go.Scatter)
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2022, in make_figure
 args = build_dataframe(args, constructor)
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1327, in build_dataframe
 df_pandas = df_not_pandas.to_pandas()
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 2076, in to_pandas
 record_batches = self._df.to_pandas()
ModuleNotFoundError: No module named 'pyarrow'

Do I understand correctly that this should not stop the merge of this pull request?

MarcoGorelli reacted with thumbs up emoji

df_not_pandas = args["data_frame"]
try:
df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
except (ImportError, NotImplementedError) as exc:
Copy link
Contributor

@MarcoGorelli MarcoGorelli Jun 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll want ModuleNotFoundError, else you'll get the reported error if pyarrow isn't installed

Copy link
Contributor Author

@anmyachev anmyachev Jun 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ModuleNotFoundError comes from to_pandas call. The same behavior should already be on the master branch.

Stack from error above:

 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1327, in build_dataframe
 df_pandas = df_not_pandas.to_pandas()
 File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 2076, in to_pandas
 record_batches = self._df.to_pandas()
ModuleNotFoundError: No module named 'pyarrow'

Copy link
Contributor

@MarcoGorelli MarcoGorelli Jun 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah you're right - it doesn't make any difference then, thanks

anmyachev reacted with thumbs up emoji
Copy link
Collaborator

@alexcjohnson alexcjohnson Jun 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're OK:

>>> issubclass(ModuleNotFoundError, ImportError)
True

MarcoGorelli reacted with hooray emoji
@LiamConnors LiamConnors self-requested a review June 30, 2023 18:41
Copy link
Member

@LiamConnors LiamConnors left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with pyarrow installed and didn't encounter any other issues.

MarcoGorelli, alexcjohnson, and anmyachev reacted with hooray emoji
Copy link
Collaborator

@alexcjohnson alexcjohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that this should not stop the merge of this pull request?

Yep, at that point we're out of fallbacks, all we can do is allow that error to propagate, so the user is alerted to install pyarrow if they want to use this feature.

💃 Merging, we'll likely release this with the next plotly.js, which I'm guessing will be in a week or so. Thanks again @anmyachev, and everyone else for your comments.

anmyachev reacted with thumbs up emoji
@alexcjohnson alexcjohnson merged commit 6a9bbea into plotly:master Jun 30, 2023
Copy link
Contributor Author

Thanks for the review!

Reviewers

@alexcjohnson alexcjohnson alexcjohnson approved these changes

@LiamConnors LiamConnors LiamConnors approved these changes

+3 more reviewers

@YarShev YarShev YarShev left review comments

@MarcoGorelli MarcoGorelli MarcoGorelli left review comments

@nicolaskruchten nicolaskruchten nicolaskruchten approved these changes

Reviewers whose approvals may not affect merge requirements

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

AltStyle によって変換されたページ (->オリジナル) /