Feature Request: Remove PyArrow requirement for Polars import/export · duckdb/duckdb-python · Discussion #132

henryharbeck
Oct 20, 2025

Now that support for the Arrow PyCapsule Interface has been implemented in duckdb/duckdb#10716, it's possible to exchange Arrow data with Polars without PyArrow installed. However, this doesn't work in practice because DuckDB checks if the input is a Polars object before checking for existence of the Arrow PyCapsule Interface. So reading Polars data works without pyarrow only if you hide the fact that it's a polars.DataFrame.

Feature request lifted from discussion here: duckdb/duckdb#13827

Replies: 5 comments 6 replies

paultiq
Oct 20, 2025

Yah, polars_io.py relies on fetch_arrow_stream in all paths, but, it looks like the method in 10716 still works which is neat:

import duckdb
import polars as pl
class ArrowStream:
 def __init__(self, obj):
 self.obj = obj
 def __arrow_c_stream__(self, requested_schema=None):
 return self.obj.__arrow_c_stream__(requested_schema=requested_schema)
df = pl.DataFrame({"a": [1, 2, 3, 4]})
stream = ArrowStream(df)
con = duckdb.connect()
# Reading from the wrapper
sql = "SELECT * FROM stream"
con.query(sql)

Perhaps that could be used as a fallback for when pyarrow isn't installed?

3 replies

@henryharbeck

henryharbeck Oct 21, 2025
Author

It is good it works, but the above workaround is not really end-user facing or very ergonomic

Perhaps that could be used as a fallback for when pyarrow isn't installed?

There is no need to go via PyArrow, even when it is installed.
Not requiring PyArrow is one the the goals of the PyCapsule interface ("...instead of hardcoding support for specific Arrow producers." - i.e., PyArrow)

@paultiq

paultiq Oct 21, 2025

For sure, I meant this as a workaround.

That said, I'm curious what happens if you pass a relation to a relation in this manner.

@paultiq

paultiq Oct 21, 2025

I tested reverting the change in #13798, which was to fix duckdb/duckdb#13793:

On the plus side, using the pycapsule stream was 30% faster and slightly less (9%) memory.

On the negative side, the same error* occurs from 13793 when the stream is consumed twice: _duckdb.InternalException: INTERNAL Error: ArrowArrayStream was released by another thread/library

This error is the same as #105.

My test case was:

 df = duckdb.sql("SELECT range AS id, 'value_' || range::VARCHAR AS name FROM range(10000000)").pl()
 result = duckdb.sql("SELECT COUNT(*) as cnt FROM df").fetchone()[0]

---- Edit
See #70 . Fixing this would fix the original issue (13793), which would allow 13798 to be reverted.

paultiq
Oct 20, 2025

Oh, and for the reverse, pyrelation.cpp would need a small change to construct the Polars dataframe directly, instead of going through auto arrow = ToArrowTableInternal(batch_size, true);

PolarsDataFrame DuckDBPyRelation::ToPolars(idx_t batch_size, bool lazy) {
	if (!lazy) {
		auto polars_module = pybind11::module_::import("polars");
		return py::cast<PolarsDataFrame>(polars_module.attr("DataFrame")(*this));
	}

This would allow the following without pyarrow installed:

import duckdb
duckdb.execute("select * from range(1)").pl();

Alternatively

Just create the dataframe in Polars from the relation, no pyarrow needed.

import duckdb
import polars as pl
rel = duckdb.sql("select * from range(10)")
df = pl.DataFrame(rel)

* And a similar change to the lazy path, and probably polars_io.py

** happy to open a PR

1 reply

@paultiq

paultiq Oct 20, 2025

I wonder if this would be a better solution to duckdb/duckdb#19356.... Oh wow, it is!

With this change, peak memory in the 19356 test case is cut from 3.5x to 2.5x. Which makes sense, since we're going direct from relation -> polars, instead of relation -> pyarrow ->rechunking -> polars.

~~(削除) edit: Altho, it takes twice as long... tanstaafl. (削除ここまで)~~- nm

See #137

henryharbeck
Dec 23, 2025
Author

Please can this be picked up @evertlammerts
It would be a significant QoL improvement when working between DuckDB and Polars

Polars support via PyCapsule was reverted shortly after release in duckdb/duckdb#13798
I believe this PR duckdb/duckdb#17087 prevents the regression causing the revert from occuring again
duckdb/duckdb#13793 can be added as a test case to ensure the regression does not re-surface

FYI @kylebarron @Tishj

0 replies

paultiq
Dec 23, 2025

There's a secondary issue here, that duckdb-python uses pyarrow for filter and projection pushdown over Polars.

As an experiment / proof of concept, I implemented full polars roundtrip without pyarrow with native polars pushdown in bareduckdb.

This is just to say: this request is possible... but you'll need to consider the impact on pushdowns.

2 replies

@henryharbeck

henryharbeck Dec 23, 2025
Author

Reading through https://github.com/duckdb/duckdb-python/blob/main/duckdb/polars_io.py, it looks like PyArrow is only really used to consume the result as a pyarrow.RecordBatchReader. The actual pushdown happens with the DuckDB relational API.

Maybe this request should only consider DataFrames, and not the LazyFrames. Polars LazyFrames don't implement __arrow_c_stream__, so I'm not too sure how that would work anyway.

Please if I am missing anything, let me know. Or would be keen to understand how it is working in bareduckdb.

Re getting rid of the need for PyArrow, Polars recently released a batched iterator, but I'm not quite sure if it is ready/stable as this stage (ref: pola-rs/polars#23980). How are you batching the data in bareduckdb?

TLDR; getting this working for DataFrame and Series and adding the lazy/pushdown support later would definitely be a step in the right direction. I don't think it necessarily has to happen all at once

@paultiq

paultiq Dec 23, 2025

(this is my non-duckdb core developer understanding, so I am probably wrong somewhere here):

There's (confusingly) two different pushdowns involved here:

pushing SQL expressions into an Arrow Scanner: pyarrow_filter_pushdown.cpp
pushing Polars operations into a DuckDB relation: polars_io.py

Pushing SQL expressions down

For a Polars LazyFrame or DataFrame, if you do:

duckdb.sql("select * from mylazyframe where id=10")

What happens is:

If LazyFrame, collect. If DataFrame, skip to step 2
The LazyFrame/DataFrame is converted to a PyArrow Table, via to_arrow

duckdb-python/src/duckdb_py/python_replacement_scan.cpp

Lines 149 to 150 in 694922b

auto materialized = entry.attr("collect")();

auto arrow_dataset = materialized.attr("to_arrow")();
The Table is then converted to a Scanner

duckdb-python/src/duckdb_py/arrow/arrow_array_stream.cpp

Line 61 in 694922b

unique_ptr<ArrowArrayStreamWrapper> PythonTableArrowArrayStreamFactory::Produce(uintptr_t factory_ptr,
Pushdowns are applied

duckdb-python/src/duckdb_py/arrow/arrow_array_stream.cpp

Line 52 in 694922b

auto filter = PyArrowFilterPushdown::TransformFilter(*filters, parameters.projected_columns.projection_map,

and implemented in

duckdb-python/src/duckdb_py/arrow/pyarrow_filter_pushdown.cpp

Line 307 in 694922b

py::object PyArrowFilterPushdown::TransformFilter(TableFilterSet &filter_collection,
The Scanner is converted to a RecordBatchReader

duckdb-python/src/duckdb_py/arrow/arrow_array_stream.cpp

Line 117 in 694922b

auto record_batches = scanner.attr("to_reader")();

BUT, if you provide a PyCapsule, all of this is bypassed and the data is consumed without the benefit of pushdown.

polars_io.py

polars_io.py pushes SQL expressions into the DuckDB query, when you're using a LazyFrame.

duckdb.sql("SELECT * FROM parquet_file").pl(lazy=True).filter(pl.col("price") > 100)

In this case, polars_io rewrites the query to be "SELECT * FROM parquet_file where price > 100".

Possible Improvements

The two differences between bareduckdb and duckdb-python that I was highlighting:

producing and consuming pyarrow capsules

bareduckdb can produce and consume pyarrow capsules without pyarrow.
duckdb can't produce a Polars table right now without pyarrow, but a small change would make it possible as I noted here: Feature Request: Remove PyArrow requirement for Polars import/export #132 (comment)

pushdowns into Polars

if given a Polars pycapsule, pushdown is bypassed.
I implemented an experimental option in bareduckdb that implements the same type of logic as pyarrow_filter_pushdown, but using Polars primitives. This allows pushdown on both Polars DataFrames and... most importantly... on Polars LazyFrames before collection.

* just to be clear, I'm not suggesting you use bareduckdb. It's purely an experimental project. I'm sharing only to explain possible improvements to duckdb-python. Happy to discuss more at bareduckdb.

henryharbeck
Apr 5, 2026
Author

Hi @evertlammerts , I noticed your struck-through comment on #307

Polars DataFrames with arrow_c_stream (v1.4+) now fall through to the unified path instead of going through .to_arrow(). We keep a fallback for Polars < 1.4.

Edit: this resulted in a big performance degradation. Polars doesn't seem to do zero-copy conversion and will re-convert for every new scan. I've reverted for now.

which seems like it would have closed this issue if not for the performance degradation.

Is there a reproducer you can share for that? If it is an issue on Polars side, I will raise an issue with them.

I am very keen to have first-class support on the DuckDB side to go between Polars and DuckDB without needing PyArrow

0 replies

Uh oh!

Feature Request: Remove PyArrow requirement for Polars import/export #132

Uh oh!

Replies: 5 comments · 6 replies

Uh oh!

Uh oh!

henryharbeck Oct 21, 2025 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Alternatively

Uh oh!

Uh oh!

Uh oh!

henryharbeck Dec 23, 2025 Author

Uh oh!

Uh oh!

henryharbeck Dec 23, 2025 Author

Uh oh!

Uh oh!

Pushing SQL expressions down

polars_io.py

Possible Improvements

producing and consuming pyarrow capsules

pushdowns into Polars

Uh oh!

henryharbeck Apr 5, 2026 Author

Replies: 5 comments 6 replies

henryharbeck Oct 21, 2025
Author

henryharbeck
Dec 23, 2025
Author

henryharbeck Dec 23, 2025
Author

henryharbeck
Apr 5, 2026
Author