1,256 questions
0 votes · 0 answers · 45 views
How to BULK INSERT hex strings into a VARBINARY column in Azure SQL (from CSV) without staging?
I am loading data from Parquet into Azure SQL Database using this pipeline:
Parquet → PyArrow → CSV (Azure Blob) → BULK INSERT
One column in the Parquet file is binary (hashed passwords).
PyArrow CSV ...
0 votes · 1 answer · 88 views
DLL load failure when importing both libsumo and pyarrow
I'm creating a new venv (using virtualenv) with Python 3.12.
The only two packages I'm installing are libsumo and pyarrow.
When I run only this line:
import libsumo
or only this line:
import pyarrow
...
1 vote · 0 answers · 61 views
How to efficiently denormalize a SQL DB to produce Parquet files
I'm trying to create a parquet file from a heavily normalized SQL database with a snowflake schema. Some of the dimensions have very long text attributes, so simply running a big set of joins to ...
1 vote · 0 answers · 30 views
Is going from pyarrow chunkedarray to pyarrow table a zero-copy operation? How to check?
If I have
import pyarrow as pa
ca = pa.chunked_array([[1,2,3]])
and then do
t = pa.table({'a': ca})
then was the pa.table operation a zero-copy one?
I would expect it to be, but is there any way to ...
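One way to check (a sketch, assuming the column has a single chunk with no nulls) is to compare the underlying data buffer addresses before and after — if wrapping the ChunkedArray in a table copied the data, the buffer would live at a different address:

```python
import pyarrow as pa

ca = pa.chunked_array([[1, 2, 3]])
t = pa.table({'a': ca})

# buffers() on an int64 array returns [validity, data]; compare the
# data buffer's address in the original vs. the table's column.
src = ca.chunk(0).buffers()[1].address
dst = t.column('a').chunk(0).buffers()[1].address
print(src == dst)  # same address -> the table shares the original buffers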
1 vote · 1 answer · 299 views
PySpark ArrayType usage in transformWithStateInPandas state causes java.lang.IllegalArgumentException
I have the following python code that uses PySpark to mock a fraud detection system for credit cards:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, ...
5 votes · 1 answer · 724 views
import tensorflow statement crashes or hangs on macOS
I have the following Python statement, which I cannot execute in Jupyter Notebook or Python REPL:
import tensorflow
Python 3.11.10 (main, Sep 20 2024, 14:23:57) [Clang 16.0.0 (clang-1600026.3)] on ...
3 votes · 1 answer · 237 views
Lambda container - Pyarrow and numpy
I'm running into this issue: (aws-lambda-python-alpha): Failed to install numpy 2.3.0 with Python 3.11 or lower
My Dockerfile:
FROM public.ecr.aws/lambda/python:3.11
# Install
RUN pip install '...
3 votes · 1 answer · 389 views
Memory Not Released After Each Request Despite Cleanup Attempts
We're running a FastAPI service that fetches data from Trino, processes it using PyArrow and Polars, and uploads the result to AWS S3 in Parquet format. However, we're facing a persistent issue where ...
1 vote · 2 answers · 103 views
Cumulative count per group in PyArrow
Say I have
data = {'a': [1,1,2], 'b': [4,5,6]}
and I'd like to get a cumulative count (1-indexed) per group.
In pandas, I can do:
import pandas as pd
pd.DataFrame(data).groupby('a').cumcount()+1
...
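To my knowledge PyArrow's `group_by` only produces aggregated (one-row-per-group) output, with no grouped cumcount kernel, so one fallback is computing the running count in plain Python and attaching it back to the table afterwards:

```python
from collections import defaultdict

data = {'a': [1, 1, 2], 'b': [4, 5, 6]}

def cumcount(keys):
    # 1-indexed running count per group, in row order
    # (like pandas groupby().cumcount() + 1).
    seen = defaultdict(int)
    out = []
    for k in keys:
        seen[k] += 1
        out.append(seen[k])
    return out

print(cumcount(data['a']))  # [1, 2, 1]
```

The result list can then be appended as a new column with `table.append_column`.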
0 votes · 1 answer · 148 views
How to predict the size of a Parquet file in memory?
I am loading a large Parquet file with pyarrow and then converting it to a pandas DataFrame.
Since this can be very memory-intensive, I need to see if loading the entire file in one go can fit into the ...
1 vote · 0 answers · 196 views
"Out of Memory Error: Failed to allocate block of Bytes" using DuckDB
I'm using DuckDB to process data stored in Parquet files, organized in a Hive-style directory structure partitioned by year, month, day, and hour. Each Parquet file contains around 150 columns, and I ...
0 votes · 2 answers · 256 views
Delta Lake / Arrow Timestamp Precision/Schema Error
I'm experiencing timestamp precision issues when reading Delta tables created by an Azure Data Factory CDC dataflow. The pipeline extracts data from Azure SQL Database (using native CDC enabled on the ...
0 votes · 0 answers · 28 views
Modin: Unable to access .cat.codes after groupby even though dtype is still category
I'm encountering an issue in Modin (v0.32.0) where I can access .cat.codes on a categorical column before a groupby, but not after grouping.
import modin.pandas as pd
df = pd.read_parquet(path="....
0 votes · 0 answers · 56 views
Modin: switch to Pandas because of "Mixed Partitioning columns in Parquet"
I would like to use Modin to read a partitioned parquet. The parquet has a single partition key of type int. When I run it, Modin automatically switches to the default pandas implementation with the ...
0 votes · 1 answer · 59 views
dtypes for pandas read_parquet
When reading a file with the pandas.read_parquet() method, pandas interprets a column with mixed values as type Object. Instead, I want pandas to interpret the column types as they are specified in ...