Stack Overflow
1 vote
1 answer
112 views

Aim: Read data from Parquet files, register each df as a table, and use dask-sql to join & query the tables. Here are the installation steps: pip install --force-reinstall --no-cache-dir "dask[...
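
A minimal sketch of the workflow described in the question, assuming two hypothetical Parquet files and a shared id column (none of these names come from the excerpt):

    import dask.dataframe as dd
    from dask_sql import Context

    # Read data from Parquet files
    left = dd.read_parquet("left.parquet")
    right = dd.read_parquet("right.parquet")

    # Register each dataframe as a SQL table
    c = Context()
    c.create_table("left_t", left)
    c.create_table("right_t", right)

    # Use dask-sql to join & query; compute() materializes the lazy result
    result = c.sql(
        "SELECT l.id, r.value FROM left_t l JOIN right_t r ON l.id = r.id"
    ).compute()
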
0 votes
0 answers
42 views

I am currently working on processing a huge dataset and my code keeps running into memory errors. I have tried the code in both pandas and Dask. I am not sure if it is because of the logic I am ...
2 votes
0 answers
96 views

I'm trying to update the dependencies in our repository (running with Python 3.12.8) and stumbled across this phenomenon when updating Dask from dask[complete]==2023.12.1 to dask[complete]==2024.12.1: ...
0 votes
0 answers
39 views

Example DataFrame: import pandas as pd import dask.dataframe as dd data = { 'A': [1, 2, 1, 3, 2, 1], 'B': ['x', 'y', 'x', 'y', 'x', 'y'], 'C': [10, 20, 30, 40, 50, 60] } pd_df = pd....
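
The excerpt is cut off mid-statement; a runnable reconstruction of the setup, with dd.from_pandas added to wrap the frame (the npartitions value is an arbitrary choice):

    import pandas as pd
    import dask.dataframe as dd

    data = {
        'A': [1, 2, 1, 3, 2, 1],
        'B': ['x', 'y', 'x', 'y', 'x', 'y'],
        'C': [10, 20, 30, 40, 50, 60]
    }
    pd_df = pd.DataFrame(data)
    ddf = dd.from_pandas(pd_df, npartitions=2)  # arbitrary partition count
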
0 votes
0 answers
39 views

In the code provided below, I am trying to merge two Dask dataframes: def merge_with_aggregated_4(trans_ddf, agg_ddf): # First join condition: Adjust based on minutes trans_ddf["base_hour"...
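
The join conditions are cut off in the excerpt; a hedged sketch of what such a merge can look like, where the "timestamp" and "key" column names are assumptions, not names from the question:

    import dask.dataframe as dd

    def merge_with_aggregated_4(trans_ddf, agg_ddf):
        # First join condition: adjust based on minutes by flooring to the hour
        trans_ddf["base_hour"] = trans_ddf["timestamp"].dt.floor("h")  # "timestamp" is assumed
        # Left-join the per-hour aggregates onto the transactions
        return dd.merge(trans_ddf, agg_ddf, on=["key", "base_hour"], how="left")  # "key" is assumed
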
0 votes
0 answers
60 views

I have a large Pandas dataset; each row of its column "Samples" is a very long numpy array of int16 integers ([-14, -15, -16, -17, -18, -19, -20, -21, -22, ...). Due to my local machine ...
0 votes
1 answer
44 views

Good morning, I have a hash-partitioned dataframe from Spark (read in from parquet). I am moving everything to Dask. The hash-partitioned DF from Spark, when used in Dask, performs terribly for joins. I ...
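
Dask joins are cheapest when both frames have a known, sorted index; Spark's hash partitioning does not translate into Dask divisions, so a common remedy (a sketch under that assumption, with placeholder paths and key names) is a one-off set_index on the join key:

    import dask.dataframe as dd

    left = dd.read_parquet("spark_output/")    # placeholder path
    right = dd.read_parquet("other_table/")    # placeholder path

    # One-off shuffle; afterwards divisions are known and joins avoid a full reshuffle
    left = left.set_index("join_key")          # "join_key" is assumed
    right = right.set_index("join_key")

    joined = left.merge(right, left_index=True, right_index=True)
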
0 votes
1 answer
66 views

I have a quite large Dask dataframe mydataframe and a numpy array mycodes. I want to filter the rows of mydataframe and keep only those where the column CODE is not in mycodes. I reset the index of ...
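
A minimal sketch of that anti-filter; mydataframe, mycodes, and the CODE column come from the question, while the data source and code values are placeholders:

    import numpy as np
    import dask.dataframe as dd

    mydataframe = dd.read_parquet("data.parquet")   # placeholder source
    mycodes = np.array([101, 202, 303])             # placeholder codes

    # Keep only rows whose CODE is not in mycodes
    filtered = mydataframe[~mydataframe["CODE"].isin(list(mycodes))]
    result = filtered.compute()
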
1 vote
1 answer
50 views

I already have code using a thread pool, Tkinter, and matplotlib to process signals which are being written to a file by another process. The synchronization between the two processes is by reading ...
0 votes
0 answers
28 views

Problem: I am training an LLM for which my dataloader makes use of Dask to read in data. During LLM training, sometimes something breaks and you need to start again from the last checkpoint. Ideally ...
2 votes
2 answers
107 views

Suppose I want to compute variance and/or standard deviation with a non-default ddof in a groupby context; I can do: df.groupby("a")["b"].var(ddof=2) If I want that to happen ...
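
For reference, one way to carry a non-default ddof into an .agg call is a lambda (a pandas sketch, not necessarily the only or best spelling; Dask's groupby accepts similar aggregation specs):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1.0, 2.0, 3.0, 5.0]})

    direct = df.groupby("a")["b"].var(ddof=2)                            # as in the question
    via_agg = df.groupby("a").agg(b_var=("b", lambda s: s.var(ddof=2)))  # same result via .agg
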
3 votes
1 answer
34 views

I have this pure Pandas statement that works (on a small dataset): grouped_dfs = {key: group.drop(columns=['country']) for key, group in df.groupby('country')} Now, to manage very large CSV files, I am ...
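
Dask's groupby cannot be iterated the way a pandas groupby can, so the dict-of-frames pattern does not translate directly. One workaround (a sketch, with placeholder file names) is to write the data partitioned by the group key and read groups back individually:

    import dask.dataframe as dd

    ddf = dd.read_csv("big-*.csv")    # placeholder inputs
    ddf.to_parquet("by_country/", partition_on=["country"])

    # Load one group on demand instead of holding all of them in a dict
    france = dd.read_parquet("by_country/", filters=[("country", "==", "FR")])
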
1 vote
0 answers
100 views

I am new to Dask. While attempting to run concat on a list of DataFrames, I noticed it is consuming more time, resources, and tasks than expected. Here are the details of my run: Scheduler (same as ...
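
For reference, a minimal dd.concat; when the inputs' divisions are unknown or overlapping, Dask has to add alignment work, which is one possible source of the extra tasks:

    import pandas as pd
    import dask.dataframe as dd

    parts = [
        dd.from_pandas(pd.DataFrame({"x": range(i, i + 5)}), npartitions=1)
        for i in range(0, 20, 5)
    ]
    combined = dd.concat(parts)
    print(combined.npartitions)  # 4: one partition per input frame
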
0 votes
1 answer
29 views

I'm trying to learn to work with Dask for my machine learning project. My data set is too big to work with in Pandas, so I must stay with lazy loading. Here is a small sample to show how it is set up: I ...
0 votes
0 answers
117 views

I am trying to read 23 CSV files into Dask dataframes, merge them together using Dask, and output to Parquet. However, it's failing due to memory issues. I used to use pandas to join these together ...
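
A hedged sketch of that pipeline (file names, the join key, and blocksize are placeholders): read the CSVs lazily, merge, and stream the result to Parquet so it is never fully materialized in memory:

    import dask.dataframe as dd

    frames = [dd.read_csv(f"input_{i}.csv", blocksize="64MB") for i in range(23)]

    merged = frames[0]
    for f in frames[1:]:
        merged = merged.merge(f, on="id", how="outer")   # "id" is assumed

    merged.to_parquet("merged_output/", write_index=False)
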
