How do you use DuckDB's Python Client? #205

guillesd started this conversation in Polls

Maybe you come from the SQL world and are just looking to automate some queries against your warehouse...
Maybe you come from the Spark world and want to have a more DataFrame -> func() -> DataFrame kind of workflow...

We want to know which API you use the most (or would love to use more if it had better support)! The options are:

  • DuckDB's SQL API. Think of:

        import duckdb
        con = duckdb.connect()
        arrow_res = con.execute("SELECT ...").arrow()

  • DuckDB's Relational API, which is DuckDB's version of what an API on top of SQL could look like:

        import duckdb
        con = duckdb.connect()
        rel = con.table("my_table")
        rel = rel.aggregate('max(value)')
        rel.show()

  • DuckDB's Spark API (experimental), which tries to mimic the PySpark API:

        from duckdb.experimental.spark.sql import SparkSession as session
        from duckdb.experimental.spark.sql.functions import lit, col
        spark = session.builder.getOrCreate()
        df = spark.createDataFrame(data=[(1,), (2,)], schema=['id'])
        df = df.withColumn('location', lit('Seattle'))
        df.show()

It may also be that you like none of these, in which case we want to know that too!

PS. If you like using projects like ibis or narwhals, let us know why in the comments too!

What is your favourite DuckDB Python API?

  • DuckDB's SQL API: 76%
  • DuckDB's Relational API: 4%
  • DuckDB's Spark API: 4%
  • None of the above, I love Pandas API, Polars API or no data manipulation API makes me happy!: 5%
  • Wrappers on top of DuckDB like ibis or narwhals: 9%

147 votes

Replies: 15 comments 2 replies

Comment options

My current favorite is the SQL API, but I would love to have a great native Spark API. I'm currently using SQLFrame for this.

The Relational API feels to me like a mixture of magic strings (like the SQL API) and some Python-native API.

2 replies
Comment options

+1 for SQLFrame.
In its current state, the DuckDB PySpark API is not usable for real-world workloads.

I think offering a PySpark API or a Spark Connect server implementation would be a huge opportunity for DuckDB.

Comment options

I actually use none, maybe I should.
I blend envsubst in bash with SQL templates to import and merge parquet data, and run queries to export JSONL data for subsequent pipeline ingestion.
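
For readers who want the same envsubst-style workflow without leaving Python, here is a minimal sketch using the standard library's `string.Template`, whose `${VAR}` placeholders resemble envsubst syntax. The template text, the variable names `INPUT_GLOB` and `OUTPUT_PATH`, and the helper `render_sql` are all hypothetical and are not taken from the commenter's setup. The `COPY ... (FORMAT JSON)` statement is assumed to produce newline-delimited JSON.

```python
import os
from string import Template

# Hypothetical SQL template; the ${VAR} placeholders mirror envsubst syntax.
SQL_TEMPLATE = Template(
    "COPY (SELECT * FROM read_parquet('${INPUT_GLOB}')) "
    "TO '${OUTPUT_PATH}' (FORMAT JSON);"
)


def render_sql(template, env=None):
    """Fill the template from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    # Unlike envsubst, substitute() raises KeyError when a variable is missing
    # instead of silently inserting an empty string.
    return template.substitute(env)


# Illustrative values only; in a shell pipeline these would be exported variables.
sql = render_sql(
    SQL_TEMPLATE,
    {"INPUT_GLOB": "data/*.parquet", "OUTPUT_PATH": "out.jsonl"},
)
print(sql)
```

The rendered string could then be piped to the DuckDB CLI or passed to `con.execute(...)`. Plain text substitution like this does no escaping, so it only suits trusted inputs.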

Comment options

To be honest, my vote reflects the fact that I wasn't aware of the other options. The Spark API has my antenna up.

0 replies
Comment options

I prefer Ibis, as it aligns a lot more with the dataframe way of thinking and the code is significantly more compact than SQL.
Plus the same code can be used on any backend.
It would be great if DuckDB could support Ibis more, as that would reduce the need for DuckDB's Relational API.

Any DuckDB feature not supported by Ibis can be called using inline SQL or a UDF.

0 replies
Comment options

We used the DuckDB Python API to run our SQL worker layer for processing Snowflake BI queries without turning on a warehouse. This heavily leverages the SQL API.

0 replies
Comment options

SQL API or Ibis, depending on the actual use case.

0 replies
Comment options

SQL

Mostly with dbt.

Some Ibis.

* Edit: SQLMesh soon.

0 replies
Comment options

Both my colleagues and I have run into difficult-to-track-down bugs using the Relational API, so we try to do everything using explicit SQL, which results in fewer surprises.

In particular, we've had issues when passing DuckDBPyRelation objects into functions:
duckdb/duckdb#17033
moj-analytical-services/uk_address_matcher#91

In summary, I like the idea of the Relational API and use it for simple scripts, but steer clear of it for library/prod code.

0 replies
Comment options

I’m mainly using DuckDB through Ibis.

We have quite a bit of data in BigQuery. Ibis lets me run some operations in either BigQuery or DuckDB without changing much code at all. It's great when I do dev and experimentation with DuckDB and then move to prod in BigQuery.

It's also a personal preference, as I really like R's tidyverse syntax and Ibis is the closest there is in Python.

0 replies
Comment options

I use the relational API pretty heavily, and TBH it feels really messy and inconsistent at the moment. I'm never quite sure if I'm supposed to provide a Python list of columns/expressions, or render them into a comma-separated string, or whatever. I'd really like to be able to just pass lists of expressions around -- Python objects, just like when using Polars for example -- and not have to worry about rendering little quasi-SQL bits.

0 replies
Comment options

Previously I've used the SQL API in conjunction with Pandas and PyODBC within Jupyter Lab to mix and match queries against various databases and to be able to query dataframes in memory.

As I'm exploring Marimo now, I'm starting to rely on DuckDB more directly where DuckDB and Pandas overlap, and Marimo's SQL cells make it much more pleasant to query in almost an IDE with autocomplete and linting instead of simply wrangling text like I'm used to doing in Jupyter Lab.

Love this stuff, can't get enough of the SQL API and trying to encourage Python and CLI usage of DuckDB throughout my company where it makes sense for automation and reporting analytics 💙😎🙌

0 replies
Comment options

The SQL API.

I would prefer to see an official SQLAlchemy dialect/adapter maintained rather than the development of a separate Relational API.

0 replies
Comment options

I would love to use the Spark API, but it lacked too many features. I settled on the Relational API, but it is messy. I would prefer to avoid SQL string manipulation as much as possible.

0 replies
Comment options

The SQL API. I just love SQL... I'm familiar with it, and it's waaaay simpler than having to learn a new library. I use it in Amphi ( https://github.com/amphi-ai/amphi-etl ) to run SQL on a pandas DataFrame, but also as the execution engine for some tools (such as Join or Compare Dataframe).

Best regards,

Simon

0 replies
Comment options

I've started using Marimo for all queries with any degree of complexity. Autocompletion and autoformatting are very helpful.

0 replies
Comment options

I integrated DuckDB’s Spark-compatible API so the AI agent can execute and validate generated PySpark logic locally, catching syntax and semantic issues early while avoiding the overhead of spinning up a real Spark environment.

So my vote will go for DuckDB's Spark API

0 replies