444 questions
1 vote · 1 answer · 112 views
Unable to use dask-sql due to 'dask_expr.io' module
Aim:
Read data from Parquet files
Register each df as Table
Use dask-sql to join & query from the table
Here are the installation steps:
pip install --force-reinstall --no-cache-dir "dask[...
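A minimal sketch of the intended workflow, assuming the dask_sql Context API; file paths, table names, and columns are hypothetical:

import dask.dataframe as dd
from dask_sql import Context

# Read each Parquet dataset lazily (paths are hypothetical)
orders = dd.read_parquet("orders.parquet")
customers = dd.read_parquet("customers.parquet")

# Register each dataframe as a SQL table
ctx = Context()
ctx.create_table("orders", orders)
ctx.create_table("customers", customers)

# Join and query; the result is itself a lazy Dask dataframe
result = ctx.sql("""
    SELECT o.id, c.name
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
""")
print(result.compute())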
0 votes · 0 answers · 42 views
Python memory error and dask dying while joining on dataframes
I am currently working on processing a huge dataset and my code keeps running into memory errors. I have tried the code in both pandas and dask.
I am not sure if it is because of the logic I am ...
2 votes · 0 answers · 96 views
Down-sampling with Dask - Python
I'm trying to update the dependencies in our repository (running with Python 3.12.8) and stumbled across this phenomenon when updating Dask from dask[complete]==2023.12.1 to dask[complete]==2024.12.1:
...
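For context, a minimal down-sampling sketch, assuming a Dask dataframe with a sorted datetime index (the data here is synthetic):

import pandas as pd
import dask.dataframe as dd

idx = pd.date_range("2024-01-01", periods=1000, freq="1s")
pdf = pd.DataFrame({"value": range(1000)}, index=idx)
ddf = dd.from_pandas(pdf, npartitions=4)

# Down-sample to 1-minute means; resample requires a datetime index
downsampled = ddf.resample("1min").mean()
print(downsampled.compute().head())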
0 votes · 0 answers · 39 views
dask: looping over groupby groups efficiently
Example DataFrame:
import pandas as pd
import dask.dataframe as dd
data = {
'A': [1, 2, 1, 3, 2, 1],
'B': ['x', 'y', 'x', 'y', 'x', 'y'],
'C': [10, 20, 30, 40, 50, 60]
}
pd_df = pd....
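Materializing each Dask group in a Python loop forces a separate computation per group; the usual alternative is groupby(...).apply, which runs the per-group logic inside the task graph. A sketch using the example data above (the C_share column is illustrative):

import pandas as pd
import dask.dataframe as dd

data = {
    'A': [1, 2, 1, 3, 2, 1],
    'B': ['x', 'y', 'x', 'y', 'x', 'y'],
    'C': [10, 20, 30, 40, 50, 60]
}
ddf = dd.from_pandas(pd.DataFrame(data), npartitions=2)

def per_group(pdf):
    # pdf is a plain pandas DataFrame holding one group
    return pdf.assign(C_share=pdf['C'] / pdf['C'].sum())

# meta describes the output schema so Dask can stay lazy
result = ddf.groupby('A').apply(
    per_group,
    meta={'A': 'int64', 'B': 'object', 'C': 'int64', 'C_share': 'float64'},
)
print(result.compute())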
0 votes · 0 answers · 39 views
How to deduplicate index of Dask dataframe?
In the code provided below, I am trying to merge two Dask dataframes
def merge_with_aggregated_4(trans_ddf, agg_ddf):
# First join condition: Adjust based on minutes
trans_ddf["base_hour&...
0 votes · 0 answers · 60 views
Numpy array converts to truncated string when saving Pandas df to Dask
I have a large Pandas dataset, each row of its column "Samples" is a very long numpy array of int16 integers ([-14, -15, -16, -17, -18, -19, -20, -21, -22, ...). Due to my local machine ...
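Truncation like that usually means the arrays were stringified somewhere on the way to disk. One hedged workaround (untested against the question's exact setup) is to convert each array to a plain Python list before handing the frame to Dask, so Parquet can store it as a list column; the data and output path below are synthetic:

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Synthetic stand-in for the frame described in the question
df = pd.DataFrame({
    "Samples": [np.arange(-14, -1014, -1, dtype=np.int16) for _ in range(10)]
})

# Lists serialize to a Parquet list column instead of a string
df["Samples"] = df["Samples"].apply(lambda a: a.tolist())
dd.from_pandas(df, npartitions=2).to_parquet("samples.parquet")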
0 votes · 1 answer · 44 views
Re-partitioning data frame and saving to parquet loses index and divisions
Good morning,
I have a hash-partitioned dataframe from Spark (read in from parquet). I am moving everything to dask. The hash-partitioned DF from Spark, when used in dask, performs terribly for joins. I ...
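A hedged sketch of recovering known divisions after reading Spark-written parquet, assuming the join key is a column named 'id' (paths are hypothetical):

import dask.dataframe as dd

# Ask Dask to derive divisions from the Parquet metadata when possible
ddf = dd.read_parquet("spark_output.parquet", calculate_divisions=True)

# If divisions are still unknown, sort once up front; later joins on
# 'id' can then avoid a full shuffle
if ddf.divisions[0] is None:
    ddf = ddf.set_index("id")

# Writing with Dask keeps min/max statistics, so divisions can be
# recovered on the next read with calculate_divisions=True
ddf.to_parquet("dask_output.parquet")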
0 votes · 1 answer · 66 views
dask row filtering with boolean array
I have quite a large dask dataframe mydataframe and a numpy array mycodes. I want to filter the rows of mydataframe and keep only those where the column CODE is not in mycodes. I reset the index of ...
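Series.isin works lazily on Dask dataframes, so no index reset should be needed for this kind of membership filter. A minimal sketch with synthetic stand-ins for mydataframe and mycodes:

import numpy as np
import pandas as pd
import dask.dataframe as dd

mydataframe = dd.from_pandas(
    pd.DataFrame({"CODE": [1, 2, 3, 4, 5], "value": range(5)}),
    npartitions=2,
)
mycodes = np.array([2, 4])

# Keep only rows whose CODE is not in mycodes; evaluation stays lazy
filtered = mydataframe[~mydataframe["CODE"].isin(mycodes)]
print(filtered.compute())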
1 vote · 1 answer · 50 views
Using Streamz.Dask, matplotlib, and a tkinter window to display graphs and histograms in realtime?
I already have code using a threadpool, tkinter, and matplotlib to process signals which are written to a file by another process. The synchronization between the two processes is by reading ...
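For context, a minimal streamz pipeline; in the question's setup the sink would update a matplotlib figure embedded in the tkinter window instead of printing:

from streamz import Stream

# Emitted values flow through map -> sink as they arrive
source = Stream()
source.map(lambda x: x * 2).sink(print)

for value in range(3):
    source.emit(value)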
0 votes · 0 answers · 28 views
How to see what file Dask is working with at any time for a stateful dataloader
Problem:
I am training an LLM for which my dataloader makes use of Dask to read in data. During LLM training, sometimes something breaks and you need to start again from the last checkpoint. Ideally ...
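Dask's readers don't expose per-file progress directly; one hedged approach is to build the dataframe from per-file delayed tasks that log each filename before reading it (the glob pattern and log file are hypothetical):

import glob
import dask
import dask.dataframe as dd
import pandas as pd

def read_and_log(path):
    # Record which file this task is touching before reading it;
    # on a multi-machine cluster this log is per-worker, not global
    with open("files_seen.log", "a") as f:
        f.write(path + "\n")
    return pd.read_parquet(path)

paths = sorted(glob.glob("data/*.parquet"))
parts = [dask.delayed(read_and_log)(p) for p in paths]
# Passing meta= here would avoid Dask reading a partition to infer the schema
ddf = dd.from_delayed(parts)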
2 votes · 2 answers · 107 views
dask `var` and `std` with ddof in groupby context and other aggregations
Suppose I want to compute variance and/or standard deviation with non-default ddof in a groupby context, I can do:
df.groupby("a")["b"].var(ddof=2)
If I want that to happen ...
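One way to carry a non-default ddof into .agg alongside other aggregations is a custom dd.Aggregation built from per-chunk count, sum, and sum of squares; a sketch for ddof=2 with synthetic data:

import pandas as pd
import dask.dataframe as dd

var_ddof2 = dd.Aggregation(
    "var_ddof2",
    chunk=lambda s: (s.count(), s.sum(), s.apply(lambda x: (x ** 2).sum())),
    agg=lambda n, x, x2: (n.sum(), x.sum(), x2.sum()),
    # var = (sum(x^2) - sum(x)^2 / n) / (n - ddof), here ddof=2
    finalize=lambda n, x, x2: (x2 - x ** 2 / n) / (n - 2),
)

ddf = dd.from_pandas(
    pd.DataFrame({"a": [1, 1, 1, 2, 2, 2], "b": [1.0, 2.0, 4.0, 3.0, 5.0, 9.0]}),
    npartitions=2,
)
# Custom aggregations can be mixed with built-ins in the same call
print(ddf.groupby("a").agg({"b": ["mean", var_ddof2]}).compute())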
3 votes · 1 answer · 34 views
dask groupby without aggregation
I have this pure Pandas statement that works (on a small dataset).
grouped_dfs = {key: group.drop(columns=['country']) for key, group in df.groupby('country')}
Now, to manage very large csv files, I am ...
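For very large inputs, one hedged alternative to building a dict of groups is to have Dask write one Parquet directory per key with partition_on, then read back only the group you need (paths are hypothetical):

import dask.dataframe as dd

ddf = dd.read_csv("big_*.csv")

# One directory per country value, written out of core
ddf.to_parquet("by_country", partition_on="country")

# Later, load a single group without touching the others
france = dd.read_parquet("by_country", filters=[("country", "==", "France")])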
1 vote · 0 answers · 100 views
Dask concat on multiple dataframes with axis=1
I am new to Dask. While attempting to run concat on a list of DataFrames, I noticed it is consuming more time, resources, and tasks than expected. Here are the details of my run:
Scheduler (same as ...
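For reference, axis=1 concatenation in Dask is only cheap when the frames share known, matching divisions; otherwise it falls back to an expensive alignment. A small sketch:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(10)})
left = dd.from_pandas(pdf, npartitions=2)
right = dd.from_pandas(pdf.rename(columns={"x": "y"}), npartitions=2)

# Known, identical divisions -> cheap column-wise concat
assert left.divisions == right.divisions
wide = dd.concat([left, right], axis=1)
print(wide.compute())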
0 votes · 1 answer · 29 views
Dask: how to drop rows with a specific value in a variable while computing lazily
I'm trying to learn to work with dask for my machine learning project.
My dataset is too big for Pandas, so I must stick with lazy loading.
Here is a small sample to show how it is set up:
I ...
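Row filters in Dask are already lazy, so dropping rows with a given value fits lazy loading directly; a minimal sketch assuming a column named 'label' and an unwanted value:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"label": ["keep", "drop", "keep"], "x": [1, 2, 3]}),
    npartitions=1,
)

# Nothing is computed here; the filter is only recorded in the graph
ddf = ddf[ddf["label"] != "drop"]

print(ddf.compute())  # computation happens only now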
0 votes · 0 answers · 117 views
How to fix memory errors merging large dask dataframes?
I am trying to read 23 CSV files into dask dataframes, merge them together using dask, and output to parquet. However, it's failing due to memory issues.
I used to use pandas to join these together ...
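A hedged out-of-core pattern for this kind of job: read the CSVs lazily with a modest blocksize, index both sides on the join key, merge, and stream straight to Parquet so the full result never sits in memory at once (paths and the key name are hypothetical):

import dask.dataframe as dd

ddf = dd.read_csv("input_*.csv", blocksize="64MB")
other = dd.read_csv("lookup.csv", blocksize="64MB")

# Indexing both sides on the join key lets the merge avoid a full shuffle
ddf = ddf.set_index("key")
other = other.set_index("key")

merged = dd.merge(ddf, other, left_index=True, right_index=True)

# to_parquet writes partition by partition
merged.to_parquet("merged.parquet")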