
Contributor

@anmyachev anmyachev commented Jun 12, 2023

This pull request continues the work that began in #3387.

maartenbreddels and others added 5 commits

June 12, 2023 18:10

@maartenbreddels @anmyachev


 support dataframe protocol (tested with Vaex)

3c2d1bf

This allows plotly express to take in any dataframe that supports
the dataframe protocol, see:
https://data-apis.org/blog/dataframe_protocol_rfc/
https://data-apis.org/dataframe-protocol/latest/index.html
Test includes an example with vaex, which should work with
vaexio/vaex#1509
(not yet released)

@maartenbreddels @anmyachev


 use pandas 1.5.0 to consume other dataframes

b75324b


 Merge branch 'master' of https://github.com/plotly/plotly.py into interchange-protocol

7ab8167


 update test

2fec0b2

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>


 add fixture

9b2154e

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

@anmyachev anmyachev force-pushed the interchange-protocol branch from a5f792b to 9b2154e

June 12, 2023 18:46

anmyachev

anmyachev commented

Jun 12, 2023

packages/python/plotly/plotly/tests/test_optional/test_px/test_px_input.py Outdated

assert_frame_equal(tips.reset_index()[out["data_frame"].columns], out["data_frame"])

def test_build_df_using_interchange_protocol_mock(add_interchange_module_for_old_pandas):


Contributor Author

@anmyachev anmyachev Jun 12, 2023


With a mock test, plotly does not need a separate test for every library that supports this protocol.
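To illustrate the idea behind such a mock test, here is a minimal sketch that uses only the standard library. The names `InterchangeOnlyFrame` and `consume_dataframe` are hypothetical and are not plotly's actual test or implementation; the `from_dataframe` argument stands in for `pandas.api.interchange.from_dataframe`.

```python
from unittest import mock


class InterchangeOnlyFrame:
    """Hypothetical stand-in for any library object exposing __dataframe__."""

    def __dataframe__(self, allow_copy=True):
        return self


def consume_dataframe(df, from_dataframe):
    """Hypothetical helper: route protocol-aware objects through a converter."""
    if hasattr(df, "__dataframe__"):
        return from_dataframe(df)
    raise TypeError("object does not support the dataframe interchange protocol")


# The mock replaces the real converter, so no third-party library is needed.
fake_converter = mock.Mock(return_value="pandas-frame-placeholder")
frame = InterchangeOnlyFrame()
result = consume_dataframe(frame, fake_converter)

# The converter was called exactly once, with the protocol-aware object.
fake_converter.assert_called_once_with(frame)
```

Because the check only depends on the presence of `__dataframe__`, one such test can cover every library that implements the protocol, which is the point made in the comment above.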

support dataframe protocol (tested with Vaex) #3387


 refactor using black

41fcb0c

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

@anmyachev anmyachev mentioned this pull request

Jun 12, 2023

Closed

@nicolaskruchten


Contributor

nicolaskruchten commented Jun 12, 2023

This looks great, thank you! So just so I'm clear on the user-facing impact here: this should make it possible for PX to accept Vaex and Modin dataframes (which don't have .to_pandas()) and will use this new (better? faster?) pathway for other non-Pandas systems that happen to also have a .to_pandas() like... CuDF and Polars? So hard to keep track of all the flavours 🍨

PX can accept non-pandas dataframes that can .to_pandas() #3901


Contributor Author

anmyachev commented Jun 13, 2023

This looks great, thank you! So just so I'm clear on the user-facing impact here: this should make it possible for PX to accept Vaex and Modin dataframes (which don't have .to_pandas()) and will use this new (better? faster?) pathway for other non-Pandas systems that happen to also have a .to_pandas() like... CuDF and Polars? So hard to keep track of all the flavours 🍨

You're right. The new path will be used even if the library has a to_pandas method. I believe this is the preferred path, since the protocol aims to become a widely accepted standard and should therefore be well supported. Performance-wise, this path will most likely be at least as fast as a plain to_pandas call, and possibly faster; for example, the protocol specification itself says: "Must be zero-copy wherever possible".

@nicolaskruchten nicolaskruchten mentioned this pull request

Jun 14, 2023

Merged


Contributor Author

anmyachev commented Jun 15, 2023

@nicolaskruchten CI is green. May I ask what you think about merging it?

@nicolaskruchten


Contributor

nicolaskruchten commented Jun 15, 2023

I'll endorse this PR but I'll defer to @alexcjohnson and @LiamConnors for timing on when to merge it, how to document it etc :)

@nicolaskruchten nicolaskruchten self-requested a review

June 15, 2023 20:17

nicolaskruchten

nicolaskruchten approved these changes

Jun 15, 2023


Contributor

@nicolaskruchten nicolaskruchten left a comment


💃 This seems good to me in principle although I haven't tested it manually and I don't have a strong sense of what the list is of which non-Pandas dataframes will go through the new codepath and which will be zero-copy, so it's not super-clear to me how to document this.

This PR also needs a changelog entry (which should probably contain some of the answers to the questions above)


Contributor Author

anmyachev commented Jun 20, 2023

Hi @alexcjohnson! May I have your opinion on this pull request?

@MarcoGorelli


Contributor

MarcoGorelli commented Jun 20, 2023

I haven't looked enough into what you do with the DataFrame objects, but I just wanted to note that if df is a polars DataFrame, then:

pd.api.interchange.from_dataframe(df) will probably make some copy (currently, the result doesn't use pyarrow)
df.to_pandas() should typically be zero-copy

So it may be more user-friendly to try to_pandas first, and then fall back to the interchange protocol? (cc @ritchie46 in case I got this wrong)

@ritchie46


ritchie46 commented Jun 20, 2023

df.to_pandas() should typically be zero-copy

df.to_pandas(use_pyarrow_extension_array=True) is zero-copy, yes.

@MarcoGorelli


Contributor

MarcoGorelli commented Jun 22, 2023

Right, and that's not the default (I thought it was - sorry, I should've checked)

OK, then using the interchange protocol wouldn't be a regression for polars

Maybe a separate PR could make it so that, for a polars dataframe, to_pandas is called with use_pyarrow_extension_array=True

But that's a separate discussion, sorry everyone for the noise

+1 for proceeding with this!

@alexcjohnson


Collaborator

alexcjohnson commented Jun 22, 2023

This looks great, thanks @anmyachev, and thanks @nicolaskruchten for reviewing. I'd love to see a test or two that actually directly use polars and vaex dataframes in px. Compliance with a standard protocol is a great way to achieve this, and in principle if these other packages don't support this it's their problem to solve not ours, but (a) fundamentally what our users care about is that they can use their preferred dataframe manager, and (b) having tests will allow us over time to deepen this support to improve performance.

That and a changelog entry 😎

anmyachev added 2 commits

June 26, 2023 15:20


 return vaex test

23f5127

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>


 add changelog entry

2b2d2e6

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

anmyachev

anmyachev commented

Jun 26, 2023

CHANGELOG.md Outdated

- Add `legend.xref` and `legend.yref` to enable container-referenced positioning of legends [[#6589](https://github.com/plotly/plotly.js/pull/6589)], with thanks to [Gamma Technologies](https://www.gtisoft.com/) for sponsoring the related development.

- Add `colorbar.xref` and `colorbar.yref` to enable container-referenced positioning of colorbars [[#6593](https://github.com/plotly/plotly.js/pull/6593)], with thanks to [Gamma Technologies](https://www.gtisoft.com/) for sponsoring the related development.

- `px` methods now accept data-frame-like objects that support a `to_pandas()` method, such as polars, cudf, vaex etc

- `px` methods now accept data-frame-like objects that support a [dataframe interchange protocol](https://data-apis.org/dataframe-protocol/latest/index.html), such as polars, vaex, modin etc


Contributor Author

@anmyachev anmyachev Jun 26, 2023


Not sure if cudf supports this protocol

anmyachev added 4 commits

June 26, 2023 15:31


 Merge branch 'master' of https://github.com/plotly/plotly.py into interchange-protocol

098da03


 move changelog entry

926dcac

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>


 add vaex into 'requirements_39_pandas_2_optional.txt'

e01de2b

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>


 upgrade pandas from 2.0.1 to 2.0.2

d8096d4

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

anmyachev

anmyachev commented

Jun 26, 2023

packages/python/plotly/test_requirements/requirements_39_pandas_2_optional.txt

requests==2.25.1

tenacity==6.2.0

-pandas==2.0.1

+pandas==2.0.2


Contributor Author

@anmyachev anmyachev Jun 26, 2023


Without this change, the protocol cannot be used in tests (for example, vaex test).


 update changelog entry

0e1dc83

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>


 remove vaex dependency on environments that don't run 'test_build_df_from_vaex' test

91fd0de

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>


Contributor Author

anmyachev commented Jun 26, 2023

This looks great, thanks @anmyachev, and thanks @nicolaskruchten for reviewing. I'd love to see a test or two that actually directly use polars and vaex dataframes in px. Compliance with a standard protocol is a great way to achieve this, and in principle if these other packages don't support this it's their problem to solve not ours, but (a) fundamentally what our users care about is that they can use their preferred dataframe manager, and (b) having tests will allow us over time to deepen this support to improve performance.

That and a changelog entry 😎

@alexcjohnson thanks for the answer! I made the changes you mentioned. Ready for review.

@ritchie46


ritchie46 commented Jun 26, 2023

Could there be a polars test for this as well? It would be great for us to know that plotly always stays working on both sides. :)


Contributor Author

anmyachev commented Jun 26, 2023

Could there be a polars test for this as well? It would be great for us to know that plotly always stays working on both sides. :)

@ritchie46 Sure, I can add one. Does polars have a function like vaex's from_pandas? (For example, the test calls vaex.from_pandas(iris_pandas).)

@ritchie46


ritchie46 commented Jun 26, 2023

Yes, there is pl.from_pandas. 👍


 test polars

5f7bb34

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>


Contributor Author

anmyachev commented Jun 29, 2023

@alexcjohnson friendly ping :)

@LiamConnors


Member

LiamConnors commented Jun 29, 2023

I tried this out by creating a new venv and installing pandas and polars. When I run an app I get the following traceback asking for pyarrow>=11.0.0

Traceback (most recent call last):
  File "app.py", line 19, in <module>
    fig = px.scatter(df, x="date", y="float")
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 66, in scatter
    return make_figure(args=locals(), constructor=go.Scatter)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2012, in make_figure
    args = build_dataframe(args, constructor)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1317, in build_dataframe
    df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/pandas/core/interchange/from_dataframe.py", line 54, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 1212, in __dataframe__
    raise ImportError(
ImportError: pyarrow>=11.0.0 is required for converting a Polars dataframe to a dataframe interchange object.

@nicolaskruchten


Contributor

nicolaskruchten commented Jun 29, 2023

Hmm we definitely don't want a hard dependency on pyarrow ... can we check to see if it's present before using this branch if it's routinely-required?

@MarcoGorelli


Contributor

MarcoGorelli commented Jun 29, 2023

pyarrow isn't strictly required for interchanging to pandas, I think it's just that polars implements the interchange protocol by going via pyarrow's implementation of it

why not swap the branches, and try to_pandas first?


Contributor Author

anmyachev commented Jun 29, 2023

@LiamConnors thanks for the feedback.

Hmm we definitely don't want a hard dependency on pyarrow ... can we check to see if it's present before using this branch if it's routinely-required?

I think right now I can rewrite the logic as follows (so that when the version of plotly is updated, the code does not stop working for users):

try:
    df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
except (ImportError, NotImplementedError):
    df_pandas = df_not_pandas.to_pandas()

why not swap the branches, and try to_pandas first?

@MarcoGorelli I think we should try the more standardized, general path before the specialized one. Otherwise, when a dataframe library implements both options, execution would never reach the potentially better path.
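The ordering argued for here (interchange protocol first, to_pandas as a fallback) can be sketched without any third-party dependency. This is only an illustration under stated assumptions: `to_pandas_with_fallback`, the two stub classes, and `fake_from_dataframe` are hypothetical names, and the stub converter merely stands in for `pandas.api.interchange.from_dataframe`.

```python
def to_pandas_with_fallback(df, from_dataframe):
    """Hypothetical helper: interchange protocol first, to_pandas() second."""
    if hasattr(df, "__dataframe__"):
        try:
            return from_dataframe(df)
        except (ImportError, NotImplementedError):
            # Only fall back when the object offers the specialized method.
            if not hasattr(df, "to_pandas"):
                raise
    if hasattr(df, "to_pandas"):
        return df.to_pandas()
    raise TypeError("unsupported dataframe-like object")


class NeedsMissingDependency:
    """Stub: the protocol path fails, as when an optional package is absent."""

    def __dataframe__(self, allow_copy=True):
        raise ImportError("optional dependency is missing")

    def to_pandas(self):
        return "converted via to_pandas"


class ProtocolOnly:
    """Stub: supports the protocol but has no to_pandas method."""

    def __dataframe__(self, allow_copy=True):
        return "interchange object"


def fake_from_dataframe(df):
    # Stand-in for the real converter, which calls __dataframe__ on its input.
    return df.__dataframe__(allow_copy=True)


first = to_pandas_with_fallback(NeedsMissingDependency(), fake_from_dataframe)
second = to_pandas_with_fallback(ProtocolOnly(), fake_from_dataframe)
```

In this sketch `first` comes from the fallback and `second` from the protocol path, so libraries that implement only one of the two options still work.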


 catch exceptions and try another way

e301999

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>


Contributor Author

anmyachev commented Jun 29, 2023

@nicolaskruchten CI build failed for another reason. Could you manually restart it if possible?

alexcjohnson

alexcjohnson reviewed

Jun 29, 2023

packages/python/plotly/plotly/express/_core.py Outdated

@anmyachev @alexcjohnson


 Update packages/python/plotly/plotly/express/_core.py

c6deacd

Co-authored-by: Alex Johnson <alex@plot.ly>

@alexcjohnson


Collaborator

alexcjohnson commented Jun 29, 2023

@LiamConnors thanks for testing this out - would you try again, both without pyarrow and then with it installed? AFAICT the logic is solid at this point, so if you're happy with how it works I'll be happy to approve and merge. Thanks for the iterations @anmyachev 🙇

YarShev

YarShev reviewed

packages/python/plotly/test_requirements/requirements_39_pandas_2_optional.txt

scikit-image==0.18.1

psutil==5.7.0

kaleido

vaex


@YarShev YarShev Jun 30, 2023


Why do you not add modin here?


Contributor Author

@anmyachev anmyachev Jun 30, 2023


The test_build_df_using_interchange_protocol_mock test should be enough for modin. Besides, I don't want to expand the list of dependencies; there are already enough of them.

@LiamConnors


Member

LiamConnors commented Jun 30, 2023

I guess this was already there, because I now get this:

Traceback (most recent call last):
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1318, in build_dataframe
    df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/pandas/core/interchange/from_dataframe.py", line 54, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 1212, in __dataframe__
    raise ImportError(
ImportError: pyarrow>=11.0.0 is required for converting a Polars dataframe to a dataframe interchange object.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "app.py", line 19, in <module>
    fig = px.scatter(df, x="date", y="float")
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 66, in scatter
    return make_figure(args=locals(), constructor=go.Scatter)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2022, in make_figure
    args = build_dataframe(args, constructor)
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1327, in build_dataframe
    df_pandas = df_not_pandas.to_pandas()
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 2076, in to_pandas
    record_batches = self._df.to_pandas()
ModuleNotFoundError: No module named 'pyarrow'


Contributor Author

anmyachev commented Jun 30, 2023

I guess this was already there, because I now get this: [same traceback as in the previous comment, ending with ModuleNotFoundError: No module named 'pyarrow']

Do I understand correctly that this should not stop the merge of this pull request?

MarcoGorelli

MarcoGorelli reviewed

packages/python/plotly/plotly/express/_core.py

df_not_pandas = args["data_frame"]
try:
    df_pandas = pandas.api.interchange.from_dataframe(df_not_pandas)
except (ImportError, NotImplementedError) as exc:


Contributor

@MarcoGorelli MarcoGorelli Jun 30, 2023


I think you'll want ModuleNotFoundError, else you'll get the reported error if pyarrow isn't installed


Contributor Author

@anmyachev anmyachev Jun 30, 2023


The ModuleNotFoundError comes from the to_pandas call. The same behavior should already exist on the master branch.

Relevant frames from the traceback above:

  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1327, in build_dataframe
    df_pandas = df_not_pandas.to_pandas()
  File "/Users/liamconnors/Desktop/pltest/venv/lib/python3.8/site-packages/polars/dataframe/frame.py", line 2076, in to_pandas
    record_batches = self._df.to_pandas()
ModuleNotFoundError: No module named 'pyarrow'


Contributor

@MarcoGorelli MarcoGorelli Jun 30, 2023


ah you're right - it doesn't make any difference then, thanks


Collaborator

@alexcjohnson alexcjohnson Jun 30, 2023


I think we're OK:

>>> issubclass(ModuleNotFoundError, ImportError)
True
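A small standard-library sketch of what this subclass relationship means in practice: an `except ImportError` clause also catches `ModuleNotFoundError`. The helper `try_import` and the module name below are hypothetical, chosen only so that the import fails.

```python
import importlib


def try_import(name):
    """Return the imported module, or the exception if the import fails."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:  # also catches ModuleNotFoundError
        return exc


# Hypothetical module name, assumed not to be installed anywhere.
outcome = try_import("surely_not_an_installed_module_12345")
caught_subclass = isinstance(outcome, ModuleNotFoundError)
```

So listing `ImportError` in the `except` tuple is sufficient; adding `ModuleNotFoundError` explicitly would be redundant.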

@LiamConnors LiamConnors self-requested a review

June 30, 2023 18:41

LiamConnors

LiamConnors approved these changes


Member

@LiamConnors LiamConnors left a comment


Tested with pyarrow installed and didn't encounter any other issues.

alexcjohnson

alexcjohnson approved these changes


Collaborator

@alexcjohnson alexcjohnson left a comment


Do I understand correctly that this should not stop the merge of this pull request?

Yep, at that point we're out of fallbacks; all we can do is let that error propagate, so the user is alerted to install pyarrow if they want to use this feature.
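For readers wondering why the reported traceback shows two errors, here is a minimal sketch of the propagation described above, using only built-in exceptions. `FrameWithoutPyarrow` and `build` are hypothetical stand-ins, not polars or plotly code; the error messages are copied from the traceback in this thread.

```python
class FrameWithoutPyarrow:
    """Hypothetical stub: both conversion paths fail, as in the report."""

    def __dataframe__(self, allow_copy=True):
        raise ImportError("pyarrow>=11.0.0 is required")

    def to_pandas(self):
        raise ModuleNotFoundError("No module named 'pyarrow'")


def build(df):
    try:
        return df.__dataframe__()
    except (ImportError, NotImplementedError):
        # Last fallback: if this raises too, the error reaches the caller.
        return df.to_pandas()


try:
    build(FrameWithoutPyarrow())
except ModuleNotFoundError as exc:
    final_error = exc
    # Python records the earlier error that was being handled at the time.
    first_error = exc.__context__
```

The second error is raised while the first is being handled, which is why Python prints "During handling of the above exception, another exception occurred" and why the user ultimately sees the message about the missing module.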

💃 Merging, we'll likely release this with the next plotly.js, which I'm guessing will be in a week or so. Thanks again @anmyachev, and everyone else for your comments.

@alexcjohnson alexcjohnson merged commit 6a9bbea into plotly:master