
I'm trying to update the dependencies in our repository (running Python 3.12.8) and stumbled across this phenomenon when updating Dask from dask[complete]==2023.12.1 to dask[complete]==2024.12.1:

I'd like to down-sample a Dask DataFrame (df1) to match the indices of another DataFrame (df2). After merging the two DataFrames, everything looks fine in result.compute(), but the output of result.value.compute() differs depending on the Dask version:

  • With the older Dask version, the indices range from 10 to 105 with a step size of 5.
  • With the new Dask version, the same index range repeats itself three times.

I suspect this has something to do with re-indexing the DataFrame partition-wise (see the code sample). Is there a way to fix the re-indexing and/or the concatenation so that it works as before?

import hvplot.dask
import dask.dataframe as dd
import pandas as pd
import numpy as np

# Function to re-index a partition
def reindex_partition(partition, new_index):
    return partition.reindex(new_index, method='nearest')

# Even numbers from 0 to 98, column 'value'
index1 = np.arange(0, 100, 2)
df1 = dd.from_pandas(pd.DataFrame({
    'value': np.random.rand(len(index1))
}, index=index1), npartitions=3)

# Multiples of 5 from 10 to 105, column 'other_value'
index2 = np.arange(10, 110, 5)
df2 = dd.from_pandas(pd.DataFrame({
    'other_value': np.random.rand(len(index2))
}, index=index2), npartitions=3)

# Target index for down-sampling, taken from df2
target_index = df2.index.compute()
df1_resampled = df1.map_partitions(reindex_partition, target_index)

# Combine the DataFrames
result = dd.concat([df1_resampled, df2], axis=1)
print(result.value.compute())  # <--- this differs depending on the Dask version

result.hvplot.line(  # <--- therefore this only works with the older Dask version
    x="index",
    y=["value", "other_value"],
    value_label="values df1 resampled, df2",
)

With the new Dask version, plotting fails with IndexError: list index out of range, caused by the repeating indices of the 'value' column.
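For reference, the pure-pandas equivalent of the resampling step produces the expected 20-row result, with each target index appearing exactly once (a minimal repro with no Dask involved):

```python
import numpy as np
import pandas as pd

# Same setup as above, but in plain pandas
index1 = np.arange(0, 100, 2)  # even numbers 0..98
df1 = pd.DataFrame({'value': np.random.rand(len(index1))}, index=index1)

target_index = pd.Index(np.arange(10, 110, 5))  # 10, 15, ..., 105
resampled = df1.reindex(target_index, method='nearest')

print(resampled.index.tolist())  # [10, 15, 20, ..., 105] -- 20 unique labels
```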

Any help is appreciated. If you need further details, don't hesitate to ask!

asked Apr 29, 2025 at 0:09
  • It worked normally for Python 3.13, but I got a warning that says: Dask dataframe query planning is disabled because dask-expr is not installed. UPDATE: with dask-expr installed, the problem is there. Commented Apr 29, 2025 at 0:36
  • Printing target_index gives target_index=Index([ 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105], dtype='int64') Commented Apr 29, 2025 at 0:47
  • It seems there is a bug somewhere in Dask. Not sure why it returns this kind of Series, and why the whole computed DataFrame is different. You could probably work around this by using partition_info and keeping only the part of the df2 index that is really needed by the current partition, but it seems like a hack. I recommend opening an issue on GitHub. Commented May 2, 2025 at 15:49
