I am performing a set of division operations and a reshape on an HDF5 file. As my datapoints number around 4000, it takes around a minute. I will be adding more data, which will further slow down my overall code execution. How can I optimize this code to make it faster?
import h5py
import numpy as np

def pre_proc_data():
    jointMatrix = np.array([], dtype=np.float64).reshape(0, 500 * 17)
    hdf5_file = h5py.File("/home/Data.hdf5")
    for j in range(len(hdf5_file["vector"])):
        # Normalization
        norm_vec = hdf5_file["vector"][j]
        norm_vec[:, 0] = (norm_vec[:, 0] - (-3.059)) / 6.117  # W0 - Left and right
        norm_vec[:, 5] = (norm_vec[:, 5] - (-3.059)) / 6.117
        norm_vec[:, 1] = (norm_vec[:, 1] - (-1.5707)) / 3.6647  # W1
        norm_vec[:, 6] = (norm_vec[:, 6] - (-1.5707)) / 3.6647
        norm_vec[:, 2] = (norm_vec[:, 2] - (-3.059)) / 6.117  # W2
        norm_vec[:, 14] = (norm_vec[:, 14] - (-3.059)) / 6.117
        norm_vec[:, 3] = (norm_vec[:, 3] - (-1.7016)) / 3.4033  # S0
        norm_vec[:, 10] = (norm_vec[:, 10] - (-1.7016)) / 3.4033
        norm_vec[:, 4] = (norm_vec[:, 4] - (-2.147)) / 3.194  # S1
        norm_vec[:, 8] = (norm_vec[:, 8] - (-2.147)) / 3.194
        norm_vec[:, 11] = (norm_vec[:, 11] - (-3.0541)) / 6.1083  # E0
        norm_vec[:, 15] = (norm_vec[:, 15] - (-3.0541)) / 6.1083
        norm_vec[:, 12] = (norm_vec[:, 12] - (-0.05)) / 2.67  # E1
        norm_vec[:, 16] = (norm_vec[:, 16] - (-0.05)) / 2.67
        # reshape the normalized copy, not the raw dataset row
        reshaped_vec = norm_vec.reshape(500 * 17)
        jointMatrix = np.vstack((jointMatrix, reshaped_vec))
    return jointMatrix

jointMatrix = pre_proc_data()
1 Answer
It seems that all your code could be vectorized with the help of NumPy broadcasting.

First, instead of using all those norm_vec[:, ...] = ... assignments, you could create two vectors of length 17 containing the values you use to normalize the data. I assume the normalizing values are a mean and a standard deviation (please tell me if I'm wrong), so I'll call them mean and std correspondingly: mean is a np.ndarray with values [-3.059, -1.5707, ..., -0.05] and std is a np.ndarray with values [6.117, 3.6647, ..., 2.67] (indices ranging from zero to 16).
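Concretely, the per-column arrays can be assembled from the constants in the question's code. Note that the original loop never touches columns 7, 9 and 13, so a sketch like the one below would give them the identity values mean=0 and std=1 to leave them unchanged under broadcasting:

```python
import numpy as np

# Per-column normalization constants taken from the question's code.
# Columns 7, 9 and 13 are never normalized there, so they keep the
# identity values mean=0, std=1 and pass through unchanged.
mean = np.zeros(17)
std = np.ones(17)
for cols, m, s in [
    ((0, 2, 5, 14), -3.059, 6.117),   # W0 / W2
    ((1, 6), -1.5707, 3.6647),        # W1
    ((3, 10), -1.7016, 3.4033),       # S0
    ((4, 8), -2.147, 3.194),          # S1
    ((11, 15), -3.0541, 6.1083),      # E0
    ((12, 16), -0.05, 2.67),          # E1
]:
    for c in cols:
        mean[c] = m
        std[c] = s
```

With these two arrays, (x - mean) / std applied to any (500, 17) block performs exactly the column-by-column operations from the question in one step.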
Using this notation, the for loop could be rewritten:

for j in range(len(hdf5_file["vector"])):
    norm_vec = (hdf5_file["vector"][j] - mean) / std
    reshaped_vec = norm_vec.reshape(500 * 17)
    jointMatrix = np.vstack((jointMatrix, reshaped_vec))
This should give a certain speed-up. However, the code can be optimized further by vectorizing the loop itself. Note that an h5py dataset does not support arithmetic directly, so it has to be read into memory with [:] first. The whole function then looks like this:

def pre_proc_data():
    hdf5_file = h5py.File("/home/Data.hdf5")
    # read the whole dataset into memory, then normalize all rows at once
    norm_vec = (hdf5_file["vector"][:] - mean) / std
    # from 3d to 2d
    return norm_vec.reshape(-1, 500 * 17)

jointMatrix = pre_proc_data()

This also drops the repeated np.vstack calls, which reallocate and copy the whole matrix on every iteration and are a big part of the slowdown on their own.