3
\$\begingroup\$

I am performing a set of division operations and a reshape on data from an HDF5 file. With around 4000 data points it takes about a minute. I will be adding more data, which will slow the overall execution further. How can I optimize this code to make it faster?

def pre_proc_data():
    jointMatrix = np.array([], dtype=np.float64).reshape(0, 500 * 17)
    hdf5_file = h5py.File("/home/Data.hdf5")
    for j in range(len(hdf5_file["vector"])):
        # Normalization
        norm_vec = hdf5_file["vector"][j]
        norm_vec[:, 0] = (norm_vec[:, 0] - (-3.059)) / 6.117  # W0 - Left and right
        norm_vec[:, 5] = (norm_vec[:, 5] - (-3.059)) / 6.117
        norm_vec[:, 1] = (norm_vec[:, 1] - (-1.5707)) / 3.6647  # W1
        norm_vec[:, 6] = (norm_vec[:, 6] - (-1.5707)) / 3.6647
        norm_vec[:, 2] = (norm_vec[:, 2] - (-3.059)) / 6.117  # W2
        norm_vec[:, 14] = (norm_vec[:, 14] - (-3.059)) / 6.117
        norm_vec[:, 3] = (norm_vec[:, 3] - (-1.7016)) / 3.4033  # S0
        norm_vec[:, 10] = (norm_vec[:, 10] - (-1.7016)) / 3.4033
        norm_vec[:, 4] = (norm_vec[:, 4] - (-2.147)) / 3.194  # s1
        norm_vec[:, 8] = (norm_vec[:, 8] - (-2.147)) / 3.194
        norm_vec[:, 11] = (norm_vec[:, 11] - (-3.0541)) / 6.1083  # eo
        norm_vec[:, 15] = (norm_vec[:, 15] - (-3.0541)) / 6.1083
        norm_vec[:, 12] = (norm_vec[:, 12] - (-0.05)) / 2.67  # e1
        norm_vec[:, 16] = (norm_vec[:, 16] - (-0.05)) / 2.67
        reshaped_vec = hdf5_file["vector"][j].reshape(500 * 17)
        jointMatrix = np.vstack((jointMatrix, reshaped_vec))
    return jointMatrix
jointMatrix = pre_proc_data()
200_success
asked Oct 4, 2017 at 17:48
\$\endgroup\$

1 Answer

2
\$\begingroup\$

It seems that all your code could be vectorized with the help of numpy broadcasting.

First, instead of all these norm_vec[:, ...] = ... assignments, you could create two vectors of length 17 containing the values used to normalize the data.

I assume the normalizing values are a mean and a standard deviation (please tell me if I'm wrong), so I'll call them mean and std respectively.

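mean is an np.ndarray with values [-3.059, -1.5707, ..., -0.05] and std is an np.ndarray with values [6.117, 3.6647, ..., 2.67], each ordered by column index (indices 0 to 16). Columns that your code leaves untouched (7, 9 and 13) need a mean of 0 and an std of 1 so that they pass through unchanged.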
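As a minimal sketch, the two vectors could be built from the constants in the question like this. The dictionary name `limits` is hypothetical, and giving columns 7, 9 and 13 an offset of 0 and a scale of 1 is an assumption based on the question's code never touching them.

```python
import numpy as np

# (offset, scale) per column index, copied from the constants in the question.
# `limits` is a hypothetical name used only for this illustration.
limits = {
    0: (-3.059, 6.117), 5: (-3.059, 6.117),       # W0
    1: (-1.5707, 3.6647), 6: (-1.5707, 3.6647),   # W1
    2: (-3.059, 6.117), 14: (-3.059, 6.117),      # W2
    3: (-1.7016, 3.4033), 10: (-1.7016, 3.4033),  # S0
    4: (-2.147, 3.194), 8: (-2.147, 3.194),       # s1
    11: (-3.0541, 6.1083), 15: (-3.0541, 6.1083), # eo
    12: (-0.05, 2.67), 16: (-0.05, 2.67),         # e1
}

# Assumption: columns not listed above (7, 9, 13) stay unchanged,
# which corresponds to subtracting 0 and dividing by 1.
mean = np.zeros(17)
std = np.ones(17)
for col, (offset, scale) in limits.items():
    mean[col] = offset
    std[col] = scale
```

With both vectors of length 17, `(x - mean) / std` broadcasts over the last axis of any array whose last dimension is 17.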

Using this notation, the for loop could be rewritten as:

for j in range(len(hdf5_file["vector"])):
    norm_vec = (hdf5_file["vector"][j] - mean) / std
    reshaped_vec = norm_vec.reshape(500 * 17)
    jointMatrix = np.vstack((jointMatrix, reshaped_vec))

This should give some speed-up. However, np.vstack still copies the whole accumulated matrix on every iteration, so the code could be further optimized by vectorizing the loop itself.

The whole code now looks like this:

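def pre_proc_data():
    hdf5_file = h5py.File("/home/Data.hdf5", "r")
    # Read the whole dataset into a NumPy array first;
    # an h5py dataset object does not support arithmetic directly
    data = hdf5_file["vector"][:]
    norm_vec = (data - mean) / std
    # from 3d to 2d
    return norm_vec.reshape(-1, 500 * 17)
jointMatrix = pre_proc_data()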
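To check that the vectorized form gives the same result as the loop, here is a small sketch on synthetic data. The random array stands in for the contents of the "vector" dataset, and the `mean` and `std` values are placeholders rather than the real constants; only the shapes (samples, 500, 17) follow the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the dataset read into memory: 8 samples of shape (500, 17).
data = rng.uniform(-3.0, 3.0, size=(8, 500, 17))

# Placeholder normalization vectors (not the constants from the question).
mean = rng.uniform(-3.0, 0.0, size=17)
std = rng.uniform(1.0, 6.0, size=17)

# Loop version: normalize and flatten one sample at a time, then stack once.
rows = [((sample - mean) / std).reshape(500 * 17) for sample in data]
looped = np.vstack(rows)

# Vectorized version: broadcasting applies mean/std along the last axis
# of the whole 3D array, then a single reshape goes from 3D to 2D.
vectorized = ((data - mean) / std).reshape(-1, 500 * 17)

print(looped.shape, np.allclose(looped, vectorized))
```

Both versions produce one row of 8500 values per sample; the vectorized one avoids the Python-level loop and the repeated stacking.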
answered Oct 4, 2017 at 21:36
\$\endgroup\$
