I'm trying to update the dependencies in our repository (running Python 3.12.8) and stumbled across this behavior when updating Dask from dask[complete]==2023.12.1 to dask[complete]==2024.12.1:
I'd like to downsample a Dask dataframe (df1) to match the index of another dataframe (df2). After combining the two dataframes with dd.concat, everything looks fine in result.compute(), but the output of result.value.compute() differs depending on the Dask version:
- With the older Dask version, the index ranges from 10 to 105 with a step size of 5.
- With the newer Dask version, the same index range repeats itself 3 times.
I suspect this has something to do with re-indexing the dataframe partition-wise (see the code sample). Is there a way to fix the re-indexing and/or the concatenation so that it works as before?
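To illustrate the suspicion about partition-wise re-indexing, here is a minimal pandas-only sketch. It assumes three row chunks as stand-ins for the Dask partitions (the split points are hypothetical, not the real Dask divisions) and re-indexes each chunk against the full target index. It only shows where a threefold repetition can come from; it does not explain why the older Dask version returned a single range.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question: even numbers 0..98
index1 = np.arange(0, 100, 2)
pdf = pd.DataFrame({"value": np.random.rand(len(index1))}, index=index1)

# Target index: multiples of 5 from 10 to 105
target_index = pd.Index(np.arange(10, 110, 5))

# Emulate three partitions by slicing rows.
# ASSUMPTION: these split points are illustrative, not Dask's actual divisions.
chunks = [pdf.iloc[0:17], pdf.iloc[17:34], pdf.iloc[34:50]]

# Each chunk is re-indexed against the FULL target index, so every chunk
# returns all 20 target labels, filled from its own nearest rows.
reindexed = [chunk.reindex(target_index, method="nearest") for chunk in chunks]
stacked = pd.concat(reindexed)

print(len(target_index))                          # 20
print(len(stacked))                               # 60
print(stacked.index.value_counts().eq(3).all())   # True
```

Every target label shows up once per chunk, which matches the repeated index range seen in result.value.compute() with the newer Dask version.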
import hvplot.dask
import dask.dataframe as dd
import pandas as pd
import numpy as np
# Function to re-index a partition
def reindex_partition(partition, new_index):
    return partition.reindex(new_index, method='nearest')
# Even numbers from 0 to 98, called 'value'
index1 = np.arange(0, 100, 2)
df1 = dd.from_pandas(pd.DataFrame({
    'value': np.random.rand(len(index1))
}, index=index1), npartitions=3)
# Multiples of 5 from 10 to 105, called 'other_value'
index2 = np.arange(10, 110, 5)
df2 = dd.from_pandas(pd.DataFrame({
    'other_value': np.random.rand(len(index2))
}, index=index2), npartitions=3)
# target index for downsampling from df2
target_index = df2.index.compute()
df1_resampled = df1.map_partitions(reindex_partition, target_index)
# Combine the DataFrames
result = dd.concat([df1_resampled, df2], axis=1)
print(result.value.compute()) # <--- this differs depending on the Dask version
result.hvplot.line(  # <--- therefore this only works with the older Dask version
    x="index",
    y=["value", "other_value"],
    value_label="values df1 resampled, df2",
)
With the new Dask version, plotting raises IndexError: list index out of range, which is caused by the repeating indices of the 'value' column.
Any help is appreciated. If you need further details, don't hesitate to ask!
Dask dataframe query planning is disabled because dask-expr is not installed.
UPDATE: With dask-expr the problem is there.
target_index=Index([ 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105], dtype='int64')