
I'm trying to update the dependencies in our repository (running Python 3.12.8) and stumbled across this phenomenon when updating Dask from dask[complete]==2023.12.1 to dask[complete]==2024.12.1:

I'd like to down-sample a Dask DataFrame (df1) to match the indices of another DataFrame (df2). After merging the two DataFrames, everything looks fine in result.compute(), but the output of result.value.compute() differs depending on the Dask version:

  • With the older Dask version, the indices range from 10 to 105 with a step size of 5.
  • With the new Dask version, the same index range repeats itself three times.

I suspect this has something to do with re-indexing the DataFrame partition-wise (see the code sample). Is there a way to fix the re-indexing and/or the concatenation so that it works as before?

import hvplot.dask
import dask.dataframe as dd
import pandas as pd
import numpy as np

# Function to re-index a partition
def reindex_partition(partition, new_index):
    return partition.reindex(new_index, method='nearest')

# Even numbers from 0 to 98, column 'value'
index1 = np.arange(0, 100, 2)
df1 = dd.from_pandas(pd.DataFrame({
    'value': np.random.rand(len(index1))
}, index=index1), npartitions=3)

# Multiples of 5 from 10 to 105, column 'other_value'
index2 = np.arange(10, 110, 5)
df2 = dd.from_pandas(pd.DataFrame({
    'other_value': np.random.rand(len(index2))
}, index=index2), npartitions=3)

# Target index for down-sampling, taken from df2
target_index = df2.index.compute()
df1_resampled = df1.map_partitions(reindex_partition, target_index)

# Combine the DataFrames
result = dd.concat([df1_resampled, df2], axis=1)
print(result.value.compute())  # <--- this differs depending on the Dask version

result.hvplot.line(  # <--- therefore this only works with the older Dask version
    x="index",
    y=["value", "other_value"],
    value_label="values df1 resampled, df2",
)

With the new Dask version, plotting fails with IndexError: list index out of range, caused by the repeating indices of the 'value' column.
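For reference, the pure-pandas equivalent of the resampling step produces the expected 20-row result, with each target index appearing exactly once (a minimal repro with no Dask involved):

```python
import numpy as np
import pandas as pd

# Same setup as above, but in plain pandas
index1 = np.arange(0, 100, 2)  # even numbers 0..98
df1 = pd.DataFrame({'value': np.random.rand(len(index1))}, index=index1)

target_index = pd.Index(np.arange(10, 110, 5))  # 10, 15, ..., 105
resampled = df1.reindex(target_index, method='nearest')

print(resampled.index.tolist())  # [10, 15, 20, ..., 105] -- 20 unique labels
```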

Any help is appreciated. If you need further details, don't hesitate to ask!

asked Apr 29, 2025 at 0:09
  • It worked normally for Python 3.13, but I got a warning that says: Dask dataframe query planning is disabled because dask-expr is not installed. UPDATE: with dask-expr installed, the problem is there. Commented Apr 29, 2025 at 0:36
  • Printing target_index gives target_index=Index([ 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105], dtype='int64') Commented Apr 29, 2025 at 0:47
  • It seems there is a bug somewhere in Dask. Not sure why it returns this kind of Series, and why the whole computed DataFrame is different. You could probably work around this by using partition_info and keeping only the part of the df2 index that is really needed by the current partition, but it seems like a hack. I recommend opening an issue on GitHub. Commented May 2, 2025 at 15:49
