I am performing a set of division operations and a reshape on an HDF5 file. As my datapoints number around 4000, it takes around a minute. I will be adding more data, which will further slow down my overall code execution. How can I optimize this code to make it faster?
import h5py
import numpy as np

def pre_proc_data():
    jointMatrix = np.array([], dtype=np.float64).reshape(0, 500 * 17)
    hdf5_file = h5py.File("/home/Data.hdf5")
    for j in range(len(hdf5_file["vector"])):
        # Normalization
        norm_vec = hdf5_file["vector"][j]
        norm_vec[:, 0] = (norm_vec[:, 0] - (-3.059)) / 6.117  # W0 - Left and right
        norm_vec[:, 5] = (norm_vec[:, 5] - (-3.059)) / 6.117
        norm_vec[:, 1] = (norm_vec[:, 1] - (-1.5707)) / 3.6647  # W1
        norm_vec[:, 6] = (norm_vec[:, 6] - (-1.5707)) / 3.6647
        norm_vec[:, 2] = (norm_vec[:, 2] - (-3.059)) / 6.117  # W2
        norm_vec[:, 14] = (norm_vec[:, 14] - (-3.059)) / 6.117
        norm_vec[:, 3] = (norm_vec[:, 3] - (-1.7016)) / 3.4033  # S0
        norm_vec[:, 10] = (norm_vec[:, 10] - (-1.7016)) / 3.4033
        norm_vec[:, 4] = (norm_vec[:, 4] - (-2.147)) / 3.194  # S1
        norm_vec[:, 8] = (norm_vec[:, 8] - (-2.147)) / 3.194
        norm_vec[:, 11] = (norm_vec[:, 11] - (-3.0541)) / 6.1083  # E0
        norm_vec[:, 15] = (norm_vec[:, 15] - (-3.0541)) / 6.1083
        norm_vec[:, 12] = (norm_vec[:, 12] - (-0.05)) / 2.67  # E1
        norm_vec[:, 16] = (norm_vec[:, 16] - (-0.05)) / 2.67
        # reshape the normalized copy, not the raw dataset row
        reshaped_vec = norm_vec.reshape(500 * 17)
        jointMatrix = np.vstack((jointMatrix, reshaped_vec))
    return jointMatrix

jointMatrix = pre_proc_data()
1 Answer
It seems that all your code could be vectorized with the help of NumPy broadcasting.

First, instead of using all those norm_vec[:, ...] = ... assignments, you could create two vectors of length 17 containing the values you use to normalize the data. I assume the normalizing values are a mean and a standard deviation (please tell me if I'm wrong), so I'll call them mean and std correspondingly: mean is a np.ndarray with values [-3.059, -1.5707, ..., -0.05] and std is a np.ndarray with values [6.117, 3.6647, ..., 2.67] (indices ranging from zero to 16).
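Concretely, the per-column arrays can be assembled from the constants in the question's code. Note that the original loop never touches columns 7, 9 and 13, so a sketch like the one below would give them the identity values mean=0 and std=1 to leave them unchanged under broadcasting:

```python
import numpy as np

# Per-column normalization constants taken from the question's code.
# Columns 7, 9 and 13 are never normalized there, so they keep the
# identity values mean=0, std=1 and pass through unchanged.
mean = np.zeros(17)
std = np.ones(17)
for cols, m, s in [
    ((0, 2, 5, 14), -3.059, 6.117),   # W0 / W2
    ((1, 6), -1.5707, 3.6647),        # W1
    ((3, 10), -1.7016, 3.4033),       # S0
    ((4, 8), -2.147, 3.194),          # S1
    ((11, 15), -3.0541, 6.1083),      # E0
    ((12, 16), -0.05, 2.67),          # E1
]:
    for c in cols:
        mean[c] = m
        std[c] = s
```

With these two arrays, (x - mean) / std applied to any (500, 17) block performs exactly the column-by-column operations from the question in one step.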
Using this notation, the for loop could be rewritten:

for j in range(len(hdf5_file["vector"])):
    norm_vec = (hdf5_file["vector"][j] - mean) / std
    reshaped_vec = norm_vec.reshape(500 * 17)
    jointMatrix = np.vstack((jointMatrix, reshaped_vec))
This should give a certain speed-up. However, the code can be optimized further by vectorizing the loop itself. Note that an h5py dataset does not support arithmetic directly, so it has to be read into memory with [:] first. The whole function then looks like this:

def pre_proc_data():
    hdf5_file = h5py.File("/home/Data.hdf5")
    # read the whole dataset into memory, then normalize all rows at once
    norm_vec = (hdf5_file["vector"][:] - mean) / std
    # from 3d to 2d
    return norm_vec.reshape(-1, 500 * 17)

jointMatrix = pre_proc_data()

This also drops the repeated np.vstack calls, which reallocate and copy the whole matrix on every iteration and are a big part of the slowdown on their own.