Feature Request: Remove PyArrow requirement for Polars import/export #132

henryharbeck started this conversation in Ideas

Now that support for the Arrow PyCapsule Interface has been implemented in duckdb/duckdb#10716, it's possible to exchange Arrow data with Polars without PyArrow installed. However, this doesn't work in practice because DuckDB checks if the input is a Polars object before checking for existence of the Arrow PyCapsule Interface. So reading Polars data works without pyarrow only if you hide the fact that it's a polars.DataFrame.

Feature request lifted from discussion here: duckdb/duckdb#13827
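
The ordering problem described above can be sketched in a few lines. This is a hypothetical dispatcher, not DuckDB's actual code: `choose_ingest_path`, the two stand-in classes, and the returned labels are all invented for illustration. It only shows the protocol check running before any library-specific type check.

```python
def choose_ingest_path(obj):
    """Return a label for how a consumer might ingest `obj` (illustrative only)."""
    # Protocol check first: any object exposing the Arrow PyCapsule stream
    # method can be consumed without importing PyArrow.
    if hasattr(obj, "__arrow_c_stream__"):
        return "pycapsule"
    # Library-specific fallback, identified by module name so that this
    # sketch does not need Polars installed.
    if type(obj).__module__.split(".")[0] == "polars":
        return "polars-via-pyarrow"
    return "unsupported"


class FakePolarsFrame:
    """Hypothetical stand-in for a Polars object that implements the protocol."""
    __module__ = "polars.dataframe.frame"

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("sketch only")


class FakePolarsObjectWithoutProtocol:
    """Hypothetical stand-in for a Polars object that lacks the method."""
    __module__ = "polars.lazyframe.frame"
```

With this ordering, an object that implements the protocol never reaches the PyArrow-based branch, and the type check only acts as a fallback.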


Replies: 5 comments 6 replies

Comment options

Yah, polars_io.py relies on fetch_arrow_stream in all paths, but, it looks like the method in 10716 still works which is neat:

import duckdb
import polars as pl

class ArrowStream:
    def __init__(self, obj):
        self.obj = obj

    def __arrow_c_stream__(self, requested_schema=None):
        return self.obj.__arrow_c_stream__(requested_schema=requested_schema)

df = pl.DataFrame({"a": [1, 2, 3, 4]})
stream = ArrowStream(df)
con = duckdb.connect()
# Reading from the wrapper
sql = "SELECT * FROM stream"
con.query(sql)

Perhaps that could be used as a fallback for when pyarrow isn't installed?

3 replies
Comment options

It is good that it works, but the above workaround is not really end-user facing or very ergonomic.

> Perhaps that could be used as a fallback for when pyarrow isn't installed?

There is no need to go via PyArrow, even when it is installed.
Not requiring PyArrow is one of the goals of the PyCapsule interface ("...instead of hardcoding support for specific Arrow producers." - i.e., PyArrow).

Comment options

For sure, I meant this as a workaround.

That said, I'm curious what happens if you pass a relation to a relation in this manner.

Comment options

I tested reverting the change in #13798, which was to fix duckdb/duckdb#13793:

On the plus side, using the pycapsule stream was 30% faster and used slightly less (9%) memory.

On the negative side, the same error* from 13793 occurs when the stream is consumed twice: _duckdb.InternalException: INTERNAL Error: ArrowArrayStream was released by another thread/library

This error is the same as #105.

My test case was:

 import duckdb
 df = duckdb.sql("SELECT range AS id, 'value_' || range::VARCHAR AS name FROM range(10000000)").pl()
 result = duckdb.sql("SELECT COUNT(*) as cnt FROM df").fetchone()[0]

---- Edit
See #70. Fixing this would fix the original issue (13793), which would allow 13798 to be reverted.
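
To illustrate why consuming a stream twice fails, here is a toy model in plain Python. It is not DuckDB's or Arrow's implementation: `OneShotStream` and `StreamFactory` are hypothetical names, and the real error is raised from native code. The sketch only contrasts holding a single stream with asking the producer for a fresh stream on every scan.

```python
class OneShotStream:
    """Toy stand-in for an Arrow stream: it can be drained exactly once."""

    def __init__(self, batches):
        self._batches = iter(batches)
        self._released = False

    def read_all(self):
        if self._released:
            raise RuntimeError("stream was already released")
        rows = [row for batch in self._batches for row in batch]
        self._released = True
        return rows


class StreamFactory:
    """Keeps the source data and hands out a fresh stream for every scan."""

    def __init__(self, batches):
        self._batches = batches

    def new_stream(self):
        return OneShotStream(self._batches)


batches = [[1, 2], [3, 4]]

# Holding one stream: the second scan fails.
stream = OneShotStream(batches)
first = stream.read_all()
try:
    stream.read_all()
    second_scan_failed = False
except RuntimeError:
    second_scan_failed = True

# Holding the producer: every scan gets its own stream.
factory = StreamFactory(batches)
scans = [factory.new_stream().read_all() for _ in range(2)]
```

Under this model, a consumer that may scan its input more than once needs to keep a reference to the producer rather than to a single stream.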

Comment options

Oh, and for the reverse, pyrelation.cpp would need a small change to construct the Polars dataframe directly, instead of going through auto arrow = ToArrowTableInternal(batch_size, true);

PolarsDataFrame DuckDBPyRelation::ToPolars(idx_t batch_size, bool lazy) {
	if (!lazy) {
		auto polars_module = pybind11::module_::import("polars");
		return py::cast<PolarsDataFrame>(polars_module.attr("DataFrame")(*this));
	}

This would allow the following without pyarrow installed:

import duckdb
duckdb.execute("select * from range(1)").pl()

Alternatively

Just create the dataframe in Polars from the relation, no pyarrow needed.

import duckdb
import polars as pl
rel = duckdb.sql("select * from range(10)")
df = pl.DataFrame(rel)

* And a similar change to the lazy path, and probably polars_io.py

** happy to open a PR

1 reply
Comment options

I wonder if this would be a better solution to duckdb/duckdb#19356.... Oh wow, it is!

With this change, peak memory in the 19356 test case is cut from 3.5x to 2.5x. Which makes sense, since we're going direct from relation -> polars, instead of relation -> pyarrow -> rechunking -> polars.

~~edit: Altho, it takes twice as long... tanstaafl.~~ - nm

See #137

Comment options

Could this please be picked up, @evertlammerts?
It would be a significant QoL improvement when working between DuckDB and Polars.

Polars support via PyCapsule was reverted shortly after release in duckdb/duckdb#13798.
I believe this PR, duckdb/duckdb#17087, prevents the regression that caused the revert from occurring again.
duckdb/duckdb#13793 can be added as a test case to ensure the regression does not resurface.

FYI @kylebarron @Tishj

0 replies
Comment options

There's a secondary issue here, that duckdb-python uses pyarrow for filter and projection pushdown over Polars.

As an experiment / proof of concept, I implemented full polars roundtrip without pyarrow with native polars pushdown in bareduckdb.

This is just to say: this request is possible... but you'll need to consider the impact on pushdowns.

2 replies
Comment options

Reading through https://github.com/duckdb/duckdb-python/blob/main/duckdb/polars_io.py, it looks like PyArrow is only really used to consume the result as a pyarrow.RecordBatchReader. The actual pushdown happens with the DuckDB relational API.

Maybe this request should only consider DataFrames, and not the LazyFrames. Polars LazyFrames don't implement __arrow_c_stream__, so I'm not too sure how that would work anyway.

If I am missing anything, please let me know. I would also be keen to understand how it works in bareduckdb.

Re getting rid of the need for PyArrow, Polars recently released a batched iterator, but I'm not quite sure if it is ready/stable at this stage (ref: pola-rs/polars#23980). How are you batching the data in bareduckdb?

TL;DR: getting this working for DataFrame and Series and adding the lazy/pushdown support later would definitely be a step in the right direction. I don't think it necessarily has to happen all at once.

Comment options

(this is my non-duckdb core developer understanding, so I am probably wrong somewhere here):

There are (confusingly) two different pushdowns involved here:

  1. pushing SQL expressions into an Arrow Scanner: pyarrow_filter_pushdown.cpp
  2. pushing Polars operations into a DuckDB relation: polars_io.py

Pushing SQL expressions down

For a Polars LazyFrame or DataFrame, if you do:

duckdb.sql("select * from mylazyframe where id=10")

What happens is:

  1. If LazyFrame, collect. If DataFrame, skip to step 2
  2. The LazyFrame/DataFrame is converted to a PyArrow Table, via to_arrow
    auto materialized = entry.attr("collect")();
    auto arrow_dataset = materialized.attr("to_arrow")();
  3. The Table is then converted to a Scanner
    unique_ptr<ArrowArrayStreamWrapper> PythonTableArrowArrayStreamFactory::Produce(uintptr_t factory_ptr,
  4. Pushdowns are applied
    auto filter = PyArrowFilterPushdown::TransformFilter(*filters, parameters.projected_columns.projection_map,
    and implemented in
    py::object PyArrowFilterPushdown::TransformFilter(TableFilterSet &filter_collection,
  5. The Scanner is converted to a RecordBatchReader
    auto record_batches = scanner.attr("to_reader")();

BUT, if you provide a PyCapsule, all of this is bypassed and the data is consumed without the benefit of pushdown.
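
As a rough illustration of what pushdown buys in the steps above, here is a toy scan in plain Python. It does not use PyArrow or DuckDB, and `scan` and the sample rows are hypothetical. It only shows a filter and a projection being applied inside the scan, compared with handing every row and column to the consumer.

```python
def scan(batches, columns=None, predicate=None):
    """Yield batches with an optional row filter and column projection applied."""
    for batch in batches:
        rows = batch
        if predicate is not None:
            # Filter pushdown: drop rows before they leave the scan.
            rows = [row for row in rows if predicate(row)]
        if columns is not None:
            # Projection pushdown: keep only the requested columns.
            rows = [{name: row[name] for name in columns} for row in rows]
        if rows:
            yield rows


batches = [
    [{"id": 9, "name": "a", "price": 50}, {"id": 10, "name": "b", "price": 150}],
    [{"id": 11, "name": "c", "price": 200}],
]

# With pushdown: the scan itself drops rows and columns.
pushed = list(scan(batches, columns=["id", "name"], predicate=lambda r: r["id"] == 10))

# Without pushdown: everything is handed over and filtered afterwards.
unfiltered = list(scan(batches))
rows_moved_with_pushdown = sum(len(batch) for batch in pushed)
rows_moved_without_pushdown = sum(len(batch) for batch in unfiltered)
```

In this toy data set the pushed-down scan hands over one row with two columns, while the plain scan hands over all three rows with every column.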

polars_io.py

polars_io.py translates Polars operations into SQL expressions and pushes them into the DuckDB query when you're using a LazyFrame.

duckdb.sql("SELECT * FROM parquet_file").pl(lazy=True).filter(pl.col("price") > 100)

In this case, polars_io rewrites the query to be "SELECT * FROM parquet_file where price > 100".
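
A minimal sketch of this kind of rewrite, assuming a single numeric comparison. `push_filter` is a hypothetical helper and not the logic in polars_io.py; it wraps the base query in a subquery instead of editing it in place, which differs in form from the rewritten query quoted above.

```python
def push_filter(base_sql, column, op, value):
    """Wrap `base_sql` so that a simple comparison runs inside the query."""
    allowed_ops = {">", ">=", "<", "<=", "=", "!="}
    if op not in allowed_ops:
        raise ValueError(f"unsupported operator: {op}")
    if not column.isidentifier():
        raise ValueError(f"unsupported column name: {column}")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("this sketch only handles numeric literals")
    return f'SELECT * FROM ({base_sql}) WHERE "{column}" {op} {value}'


rewritten = push_filter("SELECT * FROM parquet_file", "price", ">", 100)
```

The validation is deliberately strict because the sketch builds SQL by string formatting; a real translation layer would need to handle quoting, types, and compound expressions.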

Possible Improvements

The two differences between bareduckdb and duckdb-python that I was highlighting:

  1. producing and consuming pyarrow capsules
  2. pushdowns into Polars
    • if given a Polars pycapsule, pushdown is bypassed.
    • I implemented an experimental option in bareduckdb that implements the same type of logic as pyarrow_filter_pushdown, but using Polars primitives. This allows pushdown on both Polars DataFrames and... most importantly... on Polars LazyFrames before collection.

* just to be clear, I'm not suggesting you use bareduckdb. It's purely an experimental project. I'm sharing only to explain possible improvements to duckdb-python. Happy to discuss more at bareduckdb.

Comment options

Hi @evertlammerts, I noticed your struck-through comment on #307:

> Polars DataFrames with __arrow_c_stream__ (v1.4+) now fall through to the unified path instead of going through .to_arrow(). We keep a fallback for Polars < 1.4.
>
> Edit: this resulted in a big performance degradation. Polars doesn't seem to do zero-copy conversion and will re-convert for every new scan. I've reverted for now.

which seems like it would have closed this issue if not for the performance degradation.

Is there a reproducer you can share for that? If it is an issue on the Polars side, I will raise an issue with them.

I am very keen to have first-class support on the DuckDB side to go between Polars and DuckDB without needing PyArrow.

0 replies