444 questions
1 vote · 1 answer · 112 views
Unable to use dask-sql due to 'dask_expr.io' module
Aim:
Read data from Parquet files
Register each df as Table
Use dask-sql to join & query from the table
Here are the installation steps:
pip install --force-reinstall --no-cache-dir "dask[...
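A minimal sketch of the intended workflow, assuming the dask_sql Context API; file paths, table names, and columns are hypothetical:

import dask.dataframe as dd
from dask_sql import Context

# Read each Parquet dataset lazily (paths are hypothetical)
orders = dd.read_parquet("orders.parquet")
customers = dd.read_parquet("customers.parquet")

# Register each dataframe as a SQL table
ctx = Context()
ctx.create_table("orders", orders)
ctx.create_table("customers", customers)

# Join and query; the result is itself a lazy Dask dataframe
result = ctx.sql("""
    SELECT o.id, c.name
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
""")
print(result.compute())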
0 votes · 0 answers · 42 views
Python memory error and dask dying while joining on dataframes
I am currently working on processing a huge dataset and my code keeps running into memory errors. I have tried the code in both pandas and dask.
I am not sure if it is because of the logic I am ...
2 votes · 0 answers · 96 views
Down-sampling with Dask - Python
I'm trying to update the dependencies in our repository (running with Python 3.12.8) and stumbled across this phenomenon when updating Dask from dask[complete]==2023.12.1 to dask[complete]==2024.12.1:
...
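For context, a minimal down-sampling sketch, assuming a Dask dataframe with a sorted datetime index (the data here is synthetic):

import pandas as pd
import dask.dataframe as dd

idx = pd.date_range("2024-01-01", periods=1000, freq="1s")
pdf = pd.DataFrame({"value": range(1000)}, index=idx)
ddf = dd.from_pandas(pdf, npartitions=4)

# Down-sample to 1-minute means; resample requires a datetime index
downsampled = ddf.resample("1min").mean()
print(downsampled.compute().head())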
0 votes · 0 answers · 39 views
dask: looping over groupby groups efficiently
Example DataFrame:
import pandas as pd
import dask.dataframe as dd
data = {
'A': [1, 2, 1, 3, 2, 1],
'B': ['x', 'y', 'x', 'y', 'x', 'y'],
'C': [10, 20, 30, 40, 50, 60]
}
pd_df = pd....
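Materializing each Dask group in a Python loop forces a separate computation per group; the usual alternative is groupby(...).apply, which runs the per-group logic inside the task graph. A sketch using the example data above (the C_share column is illustrative):

import pandas as pd
import dask.dataframe as dd

data = {
    'A': [1, 2, 1, 3, 2, 1],
    'B': ['x', 'y', 'x', 'y', 'x', 'y'],
    'C': [10, 20, 30, 40, 50, 60]
}
ddf = dd.from_pandas(pd.DataFrame(data), npartitions=2)

def per_group(pdf):
    # pdf is a plain pandas DataFrame holding one group
    return pdf.assign(C_share=pdf['C'] / pdf['C'].sum())

# meta describes the output schema so Dask can stay lazy
result = ddf.groupby('A').apply(
    per_group,
    meta={'A': 'int64', 'B': 'object', 'C': 'int64', 'C_share': 'float64'},
)
print(result.compute())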
0 votes · 0 answers · 39 views
How to deduplicate index of Dask dataframe?
In the code provided below, I am trying to merge two Dask dataframes
def merge_with_aggregated_4(trans_ddf, agg_ddf):
# First join condition: Adjust based on minutes
trans_ddf["base_hour&...
0 votes · 0 answers · 60 views
Numpy array converts to truncated string when saving Pandas df to Dask
I have a large Pandas dataset, each row of its column "Samples" is a very long numpy array of int16 integers ([-14, -15, -16, -17, -18, -19, -20, -21, -22, ...). Due to my local machine ...
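Truncation like that usually means the arrays were stringified somewhere on the way to disk. One hedged workaround (untested against the question's exact setup) is to convert each array to a plain Python list before handing the frame to Dask, so Parquet can store it as a list column; the data and output path below are synthetic:

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Synthetic stand-in for the frame described in the question
df = pd.DataFrame({
    "Samples": [np.arange(-14, -1014, -1, dtype=np.int16) for _ in range(10)]
})

# Lists serialize to a Parquet list column instead of a string
df["Samples"] = df["Samples"].apply(lambda a: a.tolist())
dd.from_pandas(df, npartitions=2).to_parquet("samples.parquet")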
0 votes · 1 answer · 44 views
Re-partitioning data frame and saving to parquet loses index and divisions
Good morning,
I have a hash-partitioned dataframe from Spark (read in from parquet). I am moving everything to dask. The hash-partitioned DF from Spark, when used in dask, performs terribly for joins. I ...
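A hedged sketch of recovering known divisions after reading Spark-written parquet, assuming the join key is a column named 'id' (paths are hypothetical):

import dask.dataframe as dd

# Ask Dask to derive divisions from the Parquet metadata when possible
ddf = dd.read_parquet("spark_output.parquet", calculate_divisions=True)

# If divisions are still unknown, sort once up front; later joins on
# 'id' can then avoid a full shuffle
if ddf.divisions[0] is None:
    ddf = ddf.set_index("id")

# Writing with Dask keeps min/max statistics, so divisions can be
# recovered on the next read with calculate_divisions=True
ddf.to_parquet("dask_output.parquet")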
0 votes · 1 answer · 66 views
dask row filtering with boolean array
I have quite a large dask dataframe mydataframe and a numpy array mycodes. I want to filter the rows of mydataframe and keep only those where the column CODE is not in mycodes. I reset the index of ...
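Series.isin works lazily on Dask dataframes, so no index reset should be needed for this kind of membership filter. A minimal sketch with synthetic stand-ins for mydataframe and mycodes:

import numpy as np
import pandas as pd
import dask.dataframe as dd

mydataframe = dd.from_pandas(
    pd.DataFrame({"CODE": [1, 2, 3, 4, 5], "value": range(5)}),
    npartitions=2,
)
mycodes = np.array([2, 4])

# Keep only rows whose CODE is not in mycodes; evaluation stays lazy
filtered = mydataframe[~mydataframe["CODE"].isin(mycodes)]
print(filtered.compute())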
1 vote · 1 answer · 50 views
Using Streamz.Dask, matplotlib, and a tkinter window to display graphs and histograms in realtime?
I already have code using a threadpool, tkinter, and matplotlib to process signals which are written to a file by another process. The synchronization between the two processes is by reading ...
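For context, a minimal streamz pipeline; in the question's setup the sink would update a matplotlib figure embedded in the tkinter window instead of printing:

from streamz import Stream

# Emitted values flow through map -> sink as they arrive
source = Stream()
source.map(lambda x: x * 2).sink(print)

for value in range(3):
    source.emit(value)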
0 votes · 0 answers · 28 views
How to see what file Dask is working with at any time for a stateful dataloader
Problem:
I am training an LLM for which my dataloader makes use of Dask to read in data. During LLM training, sometimes something breaks and you need to start again from the last checkpoint. Ideally ...
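Dask's readers don't expose per-file progress directly; one hedged approach is to build the dataframe from per-file delayed tasks that log each filename before reading it (the glob pattern and log file are hypothetical):

import glob
import dask
import dask.dataframe as dd
import pandas as pd

def read_and_log(path):
    # Record which file this task is touching before reading it;
    # on a multi-machine cluster this log is per-worker, not global
    with open("files_seen.log", "a") as f:
        f.write(path + "\n")
    return pd.read_parquet(path)

paths = sorted(glob.glob("data/*.parquet"))
parts = [dask.delayed(read_and_log)(p) for p in paths]
# Passing meta= here would avoid Dask reading a partition to infer the schema
ddf = dd.from_delayed(parts)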
2 votes · 2 answers · 107 views
dask `var` and `std` with ddof in groupby context and other aggregations
Suppose I want to compute variance and/or standard deviation with non-default ddof in a groupby context, I can do:
df.groupby("a")["b"].var(ddof=2)
If I want that to happen ...
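One way to carry a non-default ddof into .agg alongside other aggregations is a custom dd.Aggregation built from per-chunk count, sum, and sum of squares; a sketch for ddof=2 with synthetic data:

import pandas as pd
import dask.dataframe as dd

var_ddof2 = dd.Aggregation(
    "var_ddof2",
    chunk=lambda s: (s.count(), s.sum(), s.apply(lambda x: (x ** 2).sum())),
    agg=lambda n, x, x2: (n.sum(), x.sum(), x2.sum()),
    # var = (sum(x^2) - sum(x)^2 / n) / (n - ddof), here ddof=2
    finalize=lambda n, x, x2: (x2 - x ** 2 / n) / (n - 2),
)

ddf = dd.from_pandas(
    pd.DataFrame({"a": [1, 1, 1, 2, 2, 2], "b": [1.0, 2.0, 4.0, 3.0, 5.0, 9.0]}),
    npartitions=2,
)
# Custom aggregations can be mixed with built-ins in the same call
print(ddf.groupby("a").agg({"b": ["mean", var_ddof2]}).compute())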
3 votes · 1 answer · 34 views
dask groupby without aggregation
I have this pure Pandas statement that works (on a small dataset).
grouped_dfs = {key: group.drop(columns=['country']) for key, group in df.groupby('country')}
Now, to manage very large csv files, I am ...
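For very large inputs, one hedged alternative to building a dict of groups is to have Dask write one Parquet directory per key with partition_on, then read back only the group you need (paths are hypothetical):

import dask.dataframe as dd

ddf = dd.read_csv("big_*.csv")

# One directory per country value, written out of core
ddf.to_parquet("by_country", partition_on="country")

# Later, load a single group without touching the others
france = dd.read_parquet("by_country", filters=[("country", "==", "France")])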
1 vote · 0 answers · 100 views
Dask concat on multiple dataframes with axis=1
I am new to Dask. While attempting to run concat on a list of DataFrames, I noticed it is consuming more time, resources, and tasks than expected. Here are the details of my run:
Scheduler (same as ...
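For reference, axis=1 concatenation in Dask is only cheap when the frames share known, matching divisions; otherwise it falls back to an expensive alignment. A small sketch:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(10)})
left = dd.from_pandas(pdf, npartitions=2)
right = dd.from_pandas(pdf.rename(columns={"x": "y"}), npartitions=2)

# Known, identical divisions -> cheap column-wise concat
assert left.divisions == right.divisions
wide = dd.concat([left, right], axis=1)
print(wide.compute())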
0 votes · 1 answer · 29 views
Dask: how to drop rows with a specific value in a variable while computing lazily
I'm trying to learn to work with dask for my machine learning project.
My dataset is too big for Pandas, so I must stick with lazy loading.
Here is a small sample to show how it is set up:
I ...
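Row filters in Dask are already lazy, so dropping rows with a given value fits lazy loading directly; a minimal sketch assuming a column named 'label' and an unwanted value:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"label": ["keep", "drop", "keep"], "x": [1, 2, 3]}),
    npartitions=1,
)

# Nothing is computed here; the filter is only recorded in the graph
ddf = ddf[ddf["label"] != "drop"]

print(ddf.compute())  # computation happens only now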
0 votes · 0 answers · 117 views
How to fix memory errors merging large dask dataframes?
I am trying to read 23 CSV files into dask dataframes, merge them together using dask, and output to parquet. However, it's failing due to memory issues.
I used to use pandas to join these together ...
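A hedged out-of-core pattern for this kind of job: read the CSVs lazily with a modest blocksize, index both sides on the join key, merge, and stream straight to Parquet so the full result never sits in memory at once (paths and the key name are hypothetical):

import dask.dataframe as dd

ddf = dd.read_csv("input_*.csv", blocksize="64MB")
other = dd.read_csv("lookup.csv", blocksize="64MB")

# Indexing both sides on the join key lets the merge avoid a full shuffle
ddf = ddf.set_index("key")
other = other.set_index("key")

merged = dd.merge(ddf, other, left_index=True, right_index=True)

# to_parquet writes partition by partition
merged.to_parquet("merged.parquet")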