
I would like to transform my dataframe into an array of fixed-size chunks taken from each unique segment. Specifically, I would like to transform the df into a list of m arrays, each of shape (1, 100, 4), so that in the end I have an (m, 1, 100, 4) array.

Since I require the chunks to be of fixed size (1, 100, 4), and splitting rarely divides a segment into exactly this size, the last chunk of a segment usually has fewer rows and should be zero-padded.

For this, I start by creating an array of this size filled with zeros, and then gradually fill in its values with df rows. This way, whatever is left at the end of a particular segment remains zero-padded.
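For example, zero-padding a short final chunk up to 100 rows with np.pad looks roughly like this (a minimal sketch; the 37-row leftover array is just a made-up example):

import numpy as np

chunk_size = 100
leftover = np.random.randn(37, 4)   # hypothetical incomplete final chunk of a segment
padded = np.pad(
    leftover,
    [(0, chunk_size - len(leftover)), (0, 0)],  # pad missing rows at the bottom, no column padding
    mode='constant')                            # constant mode pads with zeros by default
padded.shape                                    # (100, 4)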

To do this, I use the function:

def transform(dataframe, chunk_size):

    grouped = dataframe.groupby('id')
    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])
    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:
        inputs = group.loc[:, 'A':'D'].values
        label = group.loc[:, 'class'].values[0]
        # calculate number of splits
        N = (len(inputs)-1) // chunk_size
        if N > 0:
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]
        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)), (0, 0)],
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0)
    return X, y

This function does produce the intended ndarray. However, it is extremely slow: my df has over 21M rows, so the function takes more than 5 hours to complete.

I am looking for a way to refactor this function for optimization.

Steps to reproduce the issue:

Generate a random large df:

import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.randn(3_000_000,4), columns=list('ABCD'))
df['class'] = np.random.randint(0, 5, df.shape[0])
df.shape
(3000000, 5)
df['id'] = df.index // 650 +1
df.head()
 A B C D class id
0 -0.696659 -0.724940 0.494385 1.469749 2 1
1 -0.440400 0.744680 -0.684663 -1.962713 4 1
2 -1.207888 -1.003556 -0.926677 -1.455632 3 1
3 1.575943 -0.453352 -0.106494 0.351674 3 1
4 0.888164 0.675754 0.254067 -0.454150 3 1

Transform df to the required ndarray per unique segment.

start = time.time()
X,y = transform(df, 100)
end = time.time()
print(f"Execution time: {(end - start) / 60}")
Execution time: 6.169370893637339

For this 3M-row df the function takes more than 6 minutes to complete. In my case (>21M rows), it takes hours!

How do I write the function to improve speed? Maybe the notion of creating the accumulator is completely wrong.

asked May 26, 2021 at 12:46
  • The current question title, which states your concerns about the code, applies to too many questions on this site to be useful. The site standard is for the title to simply state the task accomplished by the code. Please see How to Ask for examples, and revise the title accordingly. Commented May 26, 2021 at 13:04
  • Welcome to Code Review. I have rolled back your last edit. Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. Commented May 28, 2021 at 9:41
  • The question title still states your concerns about the code rather than the task accomplished by the code. Please edit it to summarise the purpose; you might want to re-read How to get the best value out of Code Review: Asking Questions for guidance on writing good question titles. Commented May 28, 2021 at 9:55
  • @TobySpeight My edit was because the answer slightly changes how the function works (return value). Commented May 28, 2021 at 11:52
  • I again have rolled back your last edit. The reason is the same. Please stop changing the code in the question; that includes adding code based on an answer. Commented May 28, 2021 at 12:00

1 Answer


Your code appears to be quadratic in the number of groups. Each call to np.concatenate() allocates enough memory to hold the new array and then copies the data. The first group is copied the first time through the loop. Then the first and second groups on the second time. Then the first to third groups on the third time, etc.

To speed this up, keep a list of the groups and then call np.concatenate() just once on the list of groups.
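As a rough illustration of the difference (the arrays here are made up, not taken from the question):

import numpy as np

parts = [np.random.randn(100, 4) for _ in range(1000)]

# quadratic pattern: each iteration reallocates and recopies everything accumulated so far
acc = np.zeros((0, 4))
for p in parts:
    acc = np.concatenate([acc, p], axis=0)

# linear pattern: collect the pieces in a Python list and copy the data exactly once
result = np.concatenate(parts, axis=0)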

Another observation is that the code splits a group into chunks only to reassemble them in the loop. The only differences are that the group ends up padded to a multiple of chunk_size and the array's shape has changed. Both can be achieved without splitting and re-concatenating each group.
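A minimal sketch of that idea, using a hypothetical 650-row segment: pad the group once to a multiple of chunk_size and reshape, with no splitting at all.

import numpy as np

chunk_size = 100
group = np.random.randn(650, 4)                # one hypothetical segment

pad = chunk_size - len(group) % chunk_size
if pad < chunk_size:                           # only pad if the length is not already a multiple
    group = np.concatenate([group, np.zeros((pad, 4))])
chunks = group.reshape(-1, 1, chunk_size, 4)   # shape (7, 1, 100, 4)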

The Pandas documentation recommends using the DataFrame.to_numpy() method rather than the .values attribute.

Revised code:

def transform2(dataframe, chunk_size):

    parts_to_concat = []
    labels = []
    for _, id_group in dataframe.groupby('id'):
        group = id_group.loc[:, 'A':'D'].to_numpy()
        labels.append(id_group.loc[:, 'class'].iat[0])
        parts_to_concat.append(group)

        # add a zero-filled part to the list of parts to
        # effectively pad the group to be a multiple of chunk_size
        pad = chunk_size - len(group) % chunk_size
        if pad < chunk_size:
            parts_to_concat.append(np.zeros((pad, 4)))

    # reassemble the data and change its shape to match
    # the output of the original code
    transformed_data = np.concatenate(parts_to_concat)
    transformed_data = transformed_data.reshape(-1, 1, chunk_size, 4)
    labels = np.array(labels)

    return transformed_data, labels
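
A quick sanity check of the revised function, assuming the sample df generated in the question:

X, y = transform2(df, 100)
print(X.shape)   # (number_of_chunks, 1, 100, 4)
print(y.shape)   # one label per id group in this version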

On my laptop, the original code takes almost 12 minutes to run on the sample dataframe. The new code takes less than one second, about a 700x speedup.

answered May 27, 2021 at 14:51
  • Thank you, but this changes the required array shape from (m, 1, 100, 4) to (m, 4). For example, in the question, X[0].shape is (1, 100, 4) (this is how it is required); in this answer, X[0].shape gives (4,). Commented May 27, 2021 at 17:58
  • @super_ask, Didn't return the reshaped array. Fixed. It now returns an array with shape (m, 1, chunk_size, 4). Commented May 27, 2021 at 19:15
  • Great! This is really fast. One thing I just noticed is that the function in your answer doesn't return chunk labels. I modified your answer but I am not able to get the correct labels, as described in the question edit. Commented May 28, 2021 at 8:55
  • @super_ask, I didn't do the labels because it didn't make sense; they were just random values in the sample data. Code added to duplicate what the original code was doing. Commented May 28, 2021 at 14:21
