1,256 questions
0 votes · 0 answers · 45 views
How to BULK INSERT hex strings into a VARBINARY column in Azure SQL (from CSV) without staging?
I am loading data from Parquet into Azure SQL Database using this pipeline:
Parquet → PyArrow → CSV (Azure Blob) → BULK INSERT
One column in the Parquet file is binary (hashed passwords).
PyArrow CSV ...
0 votes · 1 answer · 88 views
DLL load failure when importing both libsumo and pyarrow
I'm creating a new venv (using virtualenv) with Python 3.12.
The only two packages I'm installing are libsumo and pyarrow.
When I run only this line:
import libsumo
or only this line:
import pyarrow
...
1 vote · 0 answers · 61 views
How to efficiently denormalize a SQL DB to produce Parquet files
I'm trying to create a parquet file from a heavily normalized SQL database with a snowflake schema. Some of the dimensions have very long text attributes, so simply running a big set of joins to ...
1 vote · 0 answers · 30 views
Is going from pyarrow chunkedarray to pyarrow table a zero-copy operation? How to check?
If I have
import pyarrow as pa
ca = pa.chunked_array([[1,2,3]])
and then do
t = pa.table({'a': ca})
then was the pa.table operation a zero-copy one?
I would expect it to be, but is there any way to ...
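One way to check (a sketch, assuming the column has a single chunk with no nulls) is to compare the underlying data buffer addresses before and after — if wrapping the ChunkedArray in a table copied the data, the buffer would live at a different address:

```python
import pyarrow as pa

ca = pa.chunked_array([[1, 2, 3]])
t = pa.table({'a': ca})

# buffers() on an int64 array returns [validity, data]; compare the
# data buffer's address in the original vs. the table's column.
src = ca.chunk(0).buffers()[1].address
dst = t.column('a').chunk(0).buffers()[1].address
print(src == dst)  # same address -> the table shares the original buffers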
1 vote · 1 answer · 299 views
PySpark ArrayType usage in transformWithStateInPandas state causes java.lang.IllegalArgumentException
I have the following python code that uses PySpark to mock a fraud detection system for credit cards:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, ...
5 votes · 1 answer · 724 views
import tensorflow statement crashes or hangs on macOS
I have the following Python statement, which I cannot execute in Jupyter Notebook or Python REPL:
import tensorflow
Python 3.11.10 (main, Sep 20 2024, 14:23:57) [Clang 16.0.0 (clang-1600026.3)] on ...
3 votes · 1 answer · 237 views
Lambda container - Pyarrow and numpy
I'm running into this issue: (aws-lambda-python-alpha): Failed to install numpy 2.3.0 with Python 3.11 or lower
My Dockerfile:
FROM public.ecr.aws/lambda/python:3.11
# Install
RUN pip install '...
3 votes · 1 answer · 389 views
Memory Not Released After Each Request Despite Cleanup Attempts
We're running a FastAPI service that fetches data from Trino, processes it using PyArrow and Polars, and uploads the result to AWS S3 in Parquet format. However, we're facing a persistent issue where ...
1 vote · 2 answers · 103 views
Cumulative count per group in PyArrow
Say I have
data = {'a': [1,1,2], 'b': [4,5,6]}
and I'd like to get a cumulative count (1-indexed) per group.
In pandas, I can do:
import pandas as pd
pd.DataFrame(data).groupby('a').cumcount()+1
...
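To my knowledge PyArrow's `group_by` only produces aggregated (one-row-per-group) output, with no grouped cumcount kernel, so one fallback is computing the running count in plain Python and attaching it back to the table afterwards:

```python
from collections import defaultdict

data = {'a': [1, 1, 2], 'b': [4, 5, 6]}

def cumcount(keys):
    # 1-indexed running count per group, in row order
    # (like pandas groupby().cumcount() + 1).
    seen = defaultdict(int)
    out = []
    for k in keys:
        seen[k] += 1
        out.append(seen[k])
    return out

print(cumcount(data['a']))  # [1, 2, 1]
```

The result list can then be appended as a new column with `table.append_column`.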
0 votes · 1 answer · 148 views
How to predict the size of a Parquet file in memory?
I am loading a large Parquet file with pyarrow and then converting it to a pandas DataFrame.
Since this can be very memory-intensive, I need to see if loading the entire file in one go can fit into the ...
1 vote · 0 answers · 196 views
"Out of Memory Error: Failed to allocate block of Bytes" using DuckDB
I'm using DuckDB to process data stored in Parquet files, organized in a Hive-style directory structure partitioned by year, month, day, and hour. Each Parquet file contains around 150 columns, and I ...
0 votes · 2 answers · 256 views
Delta Lake / Arrow Timestamp Precision/Schema Error
I'm experiencing timestamp precision issues when reading Delta tables created by an Azure Data Factory CDC dataflow. The pipeline extracts data from Azure SQL Database (using native CDC enabled on the ...
0 votes · 0 answers · 28 views
Modin: Unable to access .cat.codes after groupby even though dtype is still category
I'm encountering an issue in Modin (v0.32.0) where I can access .cat.codes on a categorical column before a groupby, but not after grouping.
import modin.pandas as pd
df = pd.read_parquet(path="....
0 votes · 0 answers · 56 views
Modin: switch to Pandas because of "Mixed Partitioning columns in Parquet"
I would like to use Modin to read a partitioned parquet. The parquet has a single partition key of type int. When I run it, Modin automatically switches to the default pandas implementation with the ...
0 votes · 1 answer · 59 views
dtypes for pandas read_parquet
When reading a file with the pandas.read_parquet() method, pandas interprets a column with mixed values as type Object. Instead, I want pandas to interpret the column types as they are specified in ...