Crashes while handling non-select result set (DataFrame) #428

Open

Description

What happens?

Hi.

Problem

When calling spark.sql("non-select"), where non-select is any SQL statement that is not a SELECT, e.g., USE, INSERT, DROP, CREATE, ..., the sql() function correctly returns an empty DataFrame, which matches the behavior of the pyspark API.

However, that object crashes when any of its APIs is used, because the internal relation object is None. The same applies when trying to create an empty DataFrame without columns.

Fix

I think the best fix would be to change the underlying C++ Relation object in the duckdb C++ library so that it supports an empty relation without columns. A couple of other fixes are needed too, such as allowing the underlying duckdb.struct_type() to have no fields. That would make the low-level API more robust and require less patching in the Python layer.

Then the DuckDBPyConnection::RunQuery function needs to return an empty relation for non-select statements, instead of nullptr. All these fixes felt a bit overwhelming, so I won't submit a patch.
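To illustrate the contract I have in mind, here is a minimal sketch in plain Python. It is not the actual duckdb code: EmptyRelation and this simplified DataFrame wrapper are hypothetical names made up for the example. The point is only that once "no result set" is represented by an empty relation object instead of None (or nullptr), the DataFrame methods need no special cases, whichever layer ends up providing that object.

```python
from typing import Any, List, Optional, Tuple


class EmptyRelation:
    """Hypothetical stand-in for a relation with zero columns and zero rows."""

    columns: List[str] = []

    def fetchall(self) -> List[Tuple[Any, ...]]:
        return []


class DataFrame:
    """Hypothetical, simplified wrapper; not the real duckdb Spark DataFrame."""

    def __init__(self, relation: Optional[Any]) -> None:
        # A non-select statement yields no relation. Substituting an empty
        # relation here keeps every method below safe to call.
        self.relation = relation if relation is not None else EmptyRelation()

    @property
    def columns(self) -> List[str]:
        return list(self.relation.columns)

    def collect(self) -> List[Tuple[Any, ...]]:
        return self.relation.fetchall()


# Simulates wrapping the result of a non-select statement such as USE.
sdf = DataFrame(None)
print(sdf.columns, sdf.collect())
```

Without the substitution in __init__, both columns and collect() would raise AttributeError on None, which is the same kind of failure as the crash described above.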

To Reproduce

Testcase. All this works with Spark.

@pytest.mark.parametrize("mode", ["pandas", "list", "non-select"])
def test_empty_sdf(spark_session_g, mode):
    from pyspark.sql import functions as f
    from pyspark.sql import types as t
    import pandas as pd
    spark = spark_session_g
    if mode == "pandas":
        sdf = spark.createDataFrame(pd.DataFrame(), t.StructType([]))
    elif mode == "list":
        sdf = spark.createDataFrame([], t.StructType([]))
    else:
        curr_db = spark.catalog.currentDatabase()
        sdf = spark.sql(f"USE {curr_db}")  # non-result set query
    assert sdf.schema == t.StructType([])
    assert sdf.columns == []
    assert sdf.collect() == []
    assert sdf.toPandas().empty
    assert sdf.toArrow().shape == (0, 0)
    sdf.createOrReplaceTempView("my_vv1")
    assert spark.sql("SELECT * from my_vv1").toArrow().shape == (0, 0)
    sdf.show()  # no-op, no crash
    assert sdf.withColumn("col1", f.lit(1)).columns == ["col1"]
    assert sdf.withColumns({"col1": f.lit(1)}).columns == ["col1"]
    assert sdf.drop("noop").columns == []

OS:

Any

DuckDB Package Version:

Main branch

Python Version:

3.12

Full Name:

João Eiras

Affiliation:

private

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have
