
I would like to transform my dataframe into an array of fixed-size chunks taken from each unique segment. Specifically, I would like to transform the df into a list of m arrays, each of shape (1, 100, 4), so that in the end I have an (m, 1, 100, 4) array.

Since I require the chunks to be of fixed size (1, 100, 4), and splitting rarely divides a segment into exactly this size, the last chunk of a segment usually has fewer rows and should be zero-padded.

For this, I start by creating an array of this size filled with zeros, and then gradually fill in its values with df rows. This way, whatever is left at the end of a particular segment remains zero-padded.
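For example, zero-padding a short final chunk up to 100 rows with np.pad looks roughly like this (a minimal sketch; the 37-row leftover array is just a made-up example):

import numpy as np

chunk_size = 100
leftover = np.random.randn(37, 4)   # hypothetical incomplete final chunk of a segment
padded = np.pad(
    leftover,
    [(0, chunk_size - len(leftover)), (0, 0)],  # pad missing rows at the bottom, no column padding
    mode='constant')                            # constant mode pads with zeros by default
padded.shape                                    # (100, 4)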

To do this, I use the function:

def transform(dataframe, chunk_size):

    grouped = dataframe.groupby('id')
    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])
    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:
        inputs = group.loc[:, 'A':'D'].values
        label = group.loc[:, 'class'].values[0]
        # calculate number of splits
        N = (len(inputs)-1) // chunk_size
        if N > 0:
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]
        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)), (0, 0)],
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0)
    return X, y

This function does produce the intended ndarray. However, it is extremely slow: my df has over 21M rows, so the function takes more than 5 hours to complete.

I am looking for a way to refactor this function for optimization.

Steps to reproduce the issue:

Generate a random large df:

import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.randn(3_000_000,4), columns=list('ABCD'))
df['class'] = np.random.randint(0, 5, df.shape[0])
df.shape
(3000000, 5)
df['id'] = df.index // 650 +1
df.head()
 A B C D class id
0 -0.696659 -0.724940 0.494385 1.469749 2 1
1 -0.440400 0.744680 -0.684663 -1.962713 4 1
2 -1.207888 -1.003556 -0.926677 -1.455632 3 1
3 1.575943 -0.453352 -0.106494 0.351674 3 1
4 0.888164 0.675754 0.254067 -0.454150 3 1

Transform df to the required ndarray per unique segment.

start = time.time()
X,y = transform(df, 100)
end = time.time()
print(f"Execution time: {(end - start) / 60}")
Execution time: 6.169370893637339

For this 3M-row df the function takes more than 6 minutes to complete. In my case (>21M rows), it takes hours!

How do I write the function to improve speed? Maybe the notion of creating the accumulator is completely wrong.

asked May 26, 2021 at 12:46
  • The current question title, which states your concerns about the code, applies to too many questions on this site to be useful. The site standard is for the title to simply state the task accomplished by the code. Please see How to Ask for examples, and revise the title accordingly. Commented May 26, 2021 at 13:04
  • Welcome to Code Review. I have rolled back your last edit. Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. Commented May 28, 2021 at 9:41
  • The question title still states your concerns about the code rather than the task accomplished by the code. Please edit it to summarise the purpose; you might want to re-read How to get the best value out of Code Review: Asking Questions for guidance on writing good question titles. Commented May 28, 2021 at 9:55
  • @TobySpeight My edit was because the answer slightly changes how the function works (return value). Commented May 28, 2021 at 11:52
  • I again have rolled back your last edit. The reason is the same. Please stop changing the code in the question; that includes adding code based on an answer. Commented May 28, 2021 at 12:00

1 Answer


Your code appears to be quadratic in the number of groups. Each call to np.concatenate() allocates enough memory to hold the new array and then copies the data. The first group is copied the first time through the loop. Then the first and second groups on the second time. Then the first to third groups on the third time, etc.

To speed this up, keep a list of the groups and then call np.concatenate() just once on the list of groups.
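As a rough illustration of the difference (the arrays here are made up, not taken from the question):

import numpy as np

parts = [np.random.randn(100, 4) for _ in range(1000)]

# quadratic pattern: each iteration reallocates and recopies everything accumulated so far
acc = np.zeros((0, 4))
for p in parts:
    acc = np.concatenate([acc, p], axis=0)

# linear pattern: collect the pieces in a Python list and copy the data exactly once
result = np.concatenate(parts, axis=0)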

Another observation is that the code splits a group into chunks only to reassemble them in the loop. The only differences are that the group ends up padded to a multiple of chunk_size and the array's shape has changed. Both can be achieved without splitting and re-concatenating each group.
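A minimal sketch of that idea, using a hypothetical 650-row segment: pad the group once to a multiple of chunk_size and reshape, with no splitting at all.

import numpy as np

chunk_size = 100
group = np.random.randn(650, 4)                # one hypothetical segment

pad = chunk_size - len(group) % chunk_size
if pad < chunk_size:                           # only pad if the length is not already a multiple
    group = np.concatenate([group, np.zeros((pad, 4))])
chunks = group.reshape(-1, 1, chunk_size, 4)   # shape (7, 1, 100, 4)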

The Pandas documentation recommends using the DataFrame.to_numpy() method rather than the .values attribute.

Revised code:

def transform2(dataframe, chunk_size):

    parts_to_concat = []
    labels = []
    for _, id_group in dataframe.groupby('id'):
        group = id_group.loc[:, 'A':'D'].to_numpy()
        labels.append(id_group.loc[:, 'class'].iat[0])
        parts_to_concat.append(group)

        # add a zero-filled part to the list of parts to
        # effectively pad the group to be a multiple of chunk_size
        pad = chunk_size - len(group) % chunk_size
        if pad < chunk_size:
            parts_to_concat.append(np.zeros((pad, 4)))

    # reassemble the data and change its shape to match
    # the output of the original code
    transformed_data = np.concatenate(parts_to_concat)
    transformed_data = transformed_data.reshape(-1, 1, chunk_size, 4)
    labels = np.array(labels)

    return transformed_data, labels
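
A quick sanity check of the revised function, assuming the sample df generated in the question:

X, y = transform2(df, 100)
print(X.shape)   # (number_of_chunks, 1, 100, 4)
print(y.shape)   # one label per id group in this version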

On my laptop, the original code takes almost 12 minutes to run on the sample dataframe. The new code takes less than one second, about a 700x speedup.

answered May 27, 2021 at 14:51
  • Thank you, but this changes the required array shape from (m, 1, 100, 4) to (m, 4). For example, in the question, X[0].shape is (1, 100, 4) (this is how it is required); in this answer, X[0].shape gives (4,). Commented May 27, 2021 at 17:58
  • @super_ask, Didn't return the reshaped array. Fixed. It now returns an array with shape (m, 1, chunk_size, 4). Commented May 27, 2021 at 19:15
  • Great! This is really fast. One thing I just noticed is that the function in your answer doesn't return chunk labels. I modified your answer but I am not able to get the correct labels, as described in the question edit. Commented May 28, 2021 at 8:55
  • @super_ask, I didn't do the labels because it didn't make sense; they were just random values in the sample data. Code added to duplicate what the original code was doing. Commented May 28, 2021 at 14:21
