Stack Overflow
0 votes
0 answers
45 views

I am loading data from Parquet into Azure SQL Database using this pipeline: Parquet → PyArrow → CSV (Azure Blob) → BULK INSERT. One column in the Parquet file is binary (hashed passwords). PyArrow CSV ...
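A minimal sketch of one possible workaround, assuming a hypothetical binary column named "hash": base64-encode the binary values before writing the CSV, so they survive the text round trip and can be decoded back to VARBINARY on the SQL Server side.

    import base64
    import pyarrow as pa
    import pyarrow.csv as pacsv

    # Stand-in for the table read from Parquet; "hash" is a hypothetical binary column.
    table = pa.table({
        "id": [1, 2],
        "hash": pa.array([b"\x01\xff", b"\x02\x00"], type=pa.binary()),
    })

    # Base64-encode the binary column so it is plain ASCII in the CSV.
    encoded = pa.array(
        [None if v is None else base64.b64encode(v).decode("ascii")
         for v in table.column("hash").to_pylist()],
        type=pa.string(),
    )
    table = table.set_column(table.schema.get_field_index("hash"), "hash", encoded)
    pacsv.write_csv(table, "out.csv")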
0 votes
1 answer
88 views

I'm creating a new venv (using virtualenv) with Python 3.12. The only two packages I'm installing are libsumo and pyarrow. When I run only this line: import libsumo, or only this line: import pyarrow ...
1 vote
0 answers
61 views

I'm trying to create a parquet file from a heavily normalized SQL database with a snowflake schema. Some of the dimensions have very long text attributes, so that simply running a big set of joins to ...
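A sketch of one way to keep repeated long dimension text cheap in the denormalized output, using a hypothetical two-column table: dictionary-encode the repetitive column so each distinct string is stored once, both in memory and in the Parquet file.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical denormalized table with a highly repetitive text dimension.
    table = pa.table({
        "fact_id": list(range(6)),
        "dim_text": ["long description A"] * 3 + ["long description B"] * 3,
    })

    # Dictionary-encode the repetitive column; Parquet's own dictionary
    # encoding (use_dictionary) then stores each distinct string once on disk.
    table = table.set_column(
        table.schema.get_field_index("dim_text"),
        "dim_text",
        table.column("dim_text").dictionary_encode(),
    )
    pq.write_table(table, "facts.parquet", use_dictionary=True)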
1 vote
0 answers
30 views

If I have import pyarrow as pa; ca = pa.chunked_array([[1,2,3]]) and then do t = pa.table({'a': ca}), then was the pa.table operation a zero-copy one? I would expect it to be, but is there any way to ...
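One way to check this empirically: if the table merely references the same buffers as the chunked array, the underlying memory addresses match, meaning no data was copied.

    import pyarrow as pa

    ca = pa.chunked_array([[1, 2, 3]])
    t = pa.table({'a': ca})

    # Compare buffer addresses before and after; identical addresses
    # indicate the table shares memory with the original chunked array.
    before = ca.chunk(0).buffers()
    after = t.column('a').chunk(0).buffers()
    print([b.address if b else None for b in before])
    print([b.address if b else None for b in after])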
1 vote
1 answer
299 views

I have the following python code that uses PySpark to mock a fraud detection system for credit cards: from pyspark.sql import SparkSession from pyspark.sql.functions import from_json, col, ...
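A self-contained sketch of the from_json pattern the question's pipeline is built on; the transaction schema and field names here are hypothetical, since the real code is cut off in the excerpt.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("fraud-demo").getOrCreate()

    # Hypothetical transaction schema.
    schema = StructType([
        StructField("card_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Parse a JSON string column (as it would arrive from e.g. Kafka)
    # into typed columns.
    df = spark.createDataFrame([('{"card_id": "c1", "amount": 42.0}',)], ["value"])
    parsed = df.select(from_json(col("value"), schema).alias("txn")).select("txn.*")
    parsed.show()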
5 votes
1 answer
724 views

I have the following Python statement, which I cannot execute in Jupyter Notebook or the Python REPL: import tensorflow. Python 3.11.10 (main, Sep 20 2024, 14:23:57) [Clang 16.0.0 (clang-1600.0.26.3)] on ...
3 votes
1 answer
237 views

I'm having difficulties with this: (aws-lambda-python-alpha): Failed to install numpy 2.3.0 with Python 3.11 or lower. My Dockerfile: FROM public.ecr.aws/lambda/python:3.11 # Install RUN pip install '...
3 votes
1 answer
389 views

We're running a FastAPI service that fetches data from Trino, processes it using PyArrow and Polars, and uploads the result to AWS S3 in Parquet format. However, we're facing a persistent issue where ...
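The excerpt cuts off before the actual symptom, but a common memory-bounding pattern for this kind of pipeline is to stream record batches through a single ParquetWriter instead of materializing the full table. In this sketch, batches() is a stand-in for the Trino fetch loop and the local path stands in for the S3 destination.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Stand-in batch source; replace with your Trino cursor/fetch loop.
    def batches():
        for i in range(3):
            yield pa.RecordBatch.from_pydict({"x": list(range(i * 10, (i + 1) * 10))})

    # Only one batch is held in memory at a time.
    schema = pa.schema([("x", pa.int64())])
    with pq.ParquetWriter("out.parquet", schema) as writer:
        for batch in batches():
            writer.write_batch(batch)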
1 vote
2 answers
103 views

Say I have data = {'a': [1,1,2], 'b': [4,5,6]} and I'd like to get a cumulative count (1-indexed) per group. In pandas, I can do: import pandas as pd pd.DataFrame(data).groupby('a').cumcount()+1 ...
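Assuming the goal is the Polars equivalent (the excerpt is truncated), a sketch using a window expression over the group key:

    import polars as pl

    data = {'a': [1, 1, 2], 'b': [4, 5, 6]}

    # cum_count over the group key yields a 1-indexed running count per group;
    # on older Polars versions the method is spelled cumcount and starts at 0.
    out = pl.DataFrame(data).with_columns(
        pl.col('a').cum_count().over('a').alias('n')
    )
    print(out)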
0 votes
1 answer
148 views

I am loading a large Parquet file with pyarrow and then converting it to a Pandas DataFrame. Since this can be very memory-intensive, I need to see if loading the entire file in one go can fit into the ...
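One way to estimate this without loading anything: inspect the Parquet metadata, whose per-row-group total_byte_size reports uncompressed data size. This is only a rough lower bound on the memory needed, since the pandas conversion adds overhead.

    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table({"x": list(range(100_000))}), "data.parquet")  # demo file

    meta = pq.ParquetFile("data.parquet").metadata
    # Sum the uncompressed size of every row group.
    uncompressed = sum(meta.row_group(i).total_byte_size
                       for i in range(meta.num_row_groups))
    print(f"rows={meta.num_rows}, uncompressed bytes={uncompressed:,}")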
1 vote
0 answers
196 views

I'm using DuckDB to process data stored in Parquet files, organized in a Hive-style directory structure partitioned by year, month, day, and hour. Each Parquet file contains around 150 columns, and I ...
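A sketch of the usual shape of such a query, with hypothetical paths and column names: hive_partitioning exposes the directory keys (year/month/day/hour) as columns for pruning, and projecting only the columns you need lets DuckDB skip the rest of the 150.

    import duckdb

    # Hypothetical layout: data/year=2024/month=01/day=02/hour=03/part-*.parquet
    con = duckdb.connect()
    df = con.sql("""
        SELECT col_a, col_b
        FROM read_parquet('data/**/*.parquet', hive_partitioning = true)
        WHERE year = 2024 AND month = 1
    """).df()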
0 votes
2 answers
256 views

I'm experiencing timestamp precision issues when reading Delta tables created by an Azure Data Factory CDC dataflow. The pipeline extracts data from Azure SQL Database (using native CDC enabled on the ...
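A sketch of one common workaround, assuming a hypothetical column name: cast the timestamps down to microsecond precision with PyArrow before further processing (safe=False permits truncating sub-microsecond digits, which would otherwise raise).

    import pyarrow as pa
    import pyarrow.compute as pc

    # Stand-in table; "modified_at" is a hypothetical column name.
    table = pa.table({
        "modified_at": pa.array([1_700_000_000_123_456_789], type=pa.timestamp("ns")),
    })
    idx = table.schema.get_field_index("modified_at")
    table = table.set_column(
        idx,
        "modified_at",
        # Cast ns -> us; safe=False allows truncating the extra digits.
        pc.cast(table.column("modified_at"), pa.timestamp("us"), safe=False),
    )
    print(table.schema)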
0 votes
0 answers
28 views

I'm encountering an issue in Modin (v0.32.0) where I can access .cat.codes on a categorical column before a groupby, but not after grouping. import modin.pandas as pd df = pd.read_parquet(path="....
0 votes
0 answers
56 views

I would like to use Modin to read a partitioned parquet. The parquet has a single partition key of type int. When I run it, Modin automatically switches to the default pandas implementation with the ...
0 votes
1 answer
59 views

Using the pandas.read_parquet() method to read a file, pandas interprets a column with mixed values as type Object. Instead, I want pandas to interpret the column types as they are specified in ...
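One option worth noting (the excerpt is truncated): with pandas >= 2.0 you can read through the PyArrow dtype backend, which keeps the types declared in the Parquet schema instead of decaying columns to object. The path below is hypothetical.

    import pandas as pd

    # dtype_backend="pyarrow" preserves the Arrow/Parquet schema types.
    df = pd.read_parquet("data.parquet", dtype_backend="pyarrow")
    print(df.dtypes)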
