I would like to transform my dataframe into an array of fixed-size chunks from each unique segment. Specifically, I would like to transform the df into a list of m arrays, each of shape (1, 100, 4), so that in the end I have one (m, 1, 100, 4) array.
Since the chunks must have the fixed size (1, 100, 4), and splitting a segment rarely produces exactly that size, the last rows of a segment usually form a smaller chunk that should be zero-padded.
To handle this, I start by creating an array of that size filled with zeros, then gradually fill in its values with df rows. Whatever is left unfilled at the end of a particular segment therefore stays zero-padded.
To do this, I use the function:
def transform(dataframe, chunk_size):
    grouped = dataframe.groupby('id')

    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])

    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:
        inputs = group.loc[:, 'A':'D'].values
        label = group.loc[:, 'class'].values[0]

        # calculate number of splits
        N = (len(inputs)-1) // chunk_size

        if N > 0:
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]

        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)), (0, 0)],
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0)

    return X, y
This function does produce the intended ndarray. However, it is extremely slow: my df has over 21M rows, so the function takes more than 5 hours to complete, which is crazy!
I am looking for a way to refactor this function to make it faster.
Steps to reproduce the issue:
Generate a random large df:
import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.randn(3_000_000,4), columns=list('ABCD'))
df['class'] = np.random.randint(0, 5, df.shape[0])
df.shape
(3000000, 5)
df['id'] = df.index // 650 + 1
df.head()
A B C D class id
0 -0.696659 -0.724940 0.494385 1.469749 2 1
1 -0.440400 0.744680 -0.684663 -1.962713 4 1
2 -1.207888 -1.003556 -0.926677 -1.455632 3 1
3 1.575943 -0.453352 -0.106494 0.351674 3 1
4 0.888164 0.675754 0.254067 -0.454150 3 1
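As a quick check of the segment structure (by construction above, each id spans 650 consecutive rows, so splitting into chunks of 100 always leaves a partial chunk per segment that needs zero-padding):
import pandas as pd
import numpy as np

# group sizes per id: 650 rows each (the last id may have fewer),
# so chunk_size=100 leaves a 50-row remainder per segment
df.groupby('id').size().head()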
Transform df to the required ndarray per unique segment.
start = time.time()
X,y = transform(df, 100)
end = time.time()
print(f"Execution time: {(end - start) / 60}")
Execution time: 6.169370893637339
For the 3M-row sample df above, this function takes more than 6 minutes to complete. In my case (>21M rows), it takes hours!
How do I rewrite the function to improve its speed? Maybe the notion of creating the accumulator is completely wrong.
1 Answer
Your code appears to be quadratic in the number of groups. Each call to np.concatenate() allocates enough memory to hold the new array and then copies the data. The first group is copied the first time through the loop, the first and second groups on the second time, the first to third groups on the third time, and so on.
To speed this up, keep a list of the groups and then call np.concatenate() just once on the list of groups.
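As a minimal sketch of that pattern (with dummy arrays, not the actual groups):
import numpy as np

chunks = []
for i in range(1000):
    # pretend each iteration produces one fixed-size chunk
    chunks.append(np.zeros((1, 1, 100, 4)))

# a single concatenate copies every chunk exactly once, instead of
# re-copying the whole growing accumulator on every iteration
X = np.concatenate(chunks, axis=0)    # shape (1000, 1, 100, 4)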
Another observation is that the code splits a group into chunks only to reassemble them in the loop. The only differences are that the group is padded to be a multiple of chunk_size and the shape of the array has changed. But those can be addressed without splitting and concatenating each group.
The Pandas documentation says to use the DataFrame.to_numpy() method rather than .values.
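For a single group, the idea of padding without splitting can be sketched like this (a hypothetical 650-row segment; the revised code below achieves the same padding by appending a zero-filled block to the list of parts instead of calling np.pad):
import numpy as np

chunk_size = 100
group = np.random.randn(650, 4)        # hypothetical segment

# pad the tail with zero rows so the length is a multiple of chunk_size
pad = -len(group) % chunk_size         # 50 rows here
padded = np.pad(group, [(0, pad), (0, 0)], mode='constant')

# reshape straight into fixed-size chunks: shape (7, 1, 100, 4)
chunks = padded.reshape(-1, 1, chunk_size, 4)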
Revised code:
def transform2(dataframe, chunk_size):
    parts_to_concat = []
    labels = []

    for _, id_group in dataframe.groupby('id'):
        group = id_group.loc[:, 'A':'D'].to_numpy()
        labels.append(id_group.loc[:, 'class'].iat[0])

        parts_to_concat.append(group)

        # add a zero-filled part to the list of parts to
        # effectively pad the group to be a multiple of chunk_size
        pad = chunk_size - len(group) % chunk_size
        if pad < chunk_size:
            parts_to_concat.append(np.zeros((pad, 4)))

    # reassemble the data and change its shape to match
    # the output of the original code
    transformed_data = np.concatenate(parts_to_concat)
    transformed_data = transformed_data.reshape(-1, 1, chunk_size, 4)

    labels = np.array(labels)

    return transformed_data, labels
On my laptop, the original code takes almost 12 minutes to run the sample dataframe. The new code takes less than one second--about a 700x speedup.
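A quick sanity check of the revised function on the sample dataframe (the shapes in the comments follow from the construction above; they are expectations, not captured output):
start = time.time()
X, y = transform2(df, 100)
end = time.time()

print(X.shape)    # (m, 1, 100, 4): one entry per fixed-size chunk
print(y.shape)    # one label per id group
print(f"Execution time: {(end - start) / 60}")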
- super_ask (May 27, 2021): Thank you, but this changes the required array shape from (m, 1, 100, 4) to (m, 4). For example, in the question X[0].shape is (1, 100, 4) (this is how it is required); in this answer X[0].shape gives (4,).
- RootTwo (May 27, 2021): @super_ask, I didn't return the reshaped array. Fixed. It now returns an array with shape (m, 1, chunk_size, 4).
- super_ask (May 28, 2021): Great! This is really fast. One thing I just noticed is that the function in your answer doesn't return chunk labels. I modified your answer but I am not able to get the correct labels, as described in the question edit.
- RootTwo (May 28, 2021): @super_ask, I didn't do the labels because it didn't make sense; they were just random values in the sample data. Code added to duplicate what the original code was doing.