Subtracting elements of datasets of an HDF5 file

Asked 5 years, 1 month ago

\$\begingroup\$

I am trying to solve the following problem:

Input: Two arrays (N×4, each sorted by its second column) are stored as datasets 1 and 2 in an HDF5 file (input.h5). N is huge (the data originally came from a 10 GB file, which is why it is stored in HDF5).

Output: For each second-column element of dataset 1, subtract every second-column element of dataset 2 for which the difference (delta) lies within ±4000, and save the results in a dataset of a new HDF5 file. I need to refer to this new file back and forth, hence HDF5 rather than a text file.

Concern: I initially used the .append method, but that crashed on the 10 GB input, so I am now using the dset.resize method (and would prefer to stick with it). I am also using binary search, as suggested in one of my previous posts. The script now seems to work for large (10 GB) datasets, but it is quite slow. The subtraction (for/while) loop is probably the culprit. Any suggestions on how I can make this faster? I am after the fastest approach (and ideally the simplest, since I am a beginner).
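To pin down what the output should contain, here is a minimal brute-force sketch that works only on small in-memory arrays. The function name `pairs_bruteforce` and the sample values are made up for illustration, and the half-open bounds (`-w < e2 - e1 <= w`) are an assumption taken from the loop conditions in the script below, not a confirmed definition of "between +/-4000".

```python
import numpy as np

def pairs_bruteforce(col1, col2, w):
    """Return rows [e1, e2, e1 - e2] for every pair with -w < e2 - e1 <= w.

    Builds the full len(col1) x len(col2) difference matrix, so it is only
    usable as a reference on small inputs.
    """
    col1 = np.asarray(col1, dtype=np.float64)
    col2 = np.asarray(col2, dtype=np.float64)
    # diff[j, i] = col2[i] - col1[j]
    diff = col2[None, :] - col1[:, None]
    # Row-major order: j ascending, then i ascending within each j.
    j, i = np.nonzero((diff > -w) & (diff <= w))
    return np.column_stack((col1[j], col2[i], col1[j] - col2[i]))

if __name__ == "__main__":
    # Hypothetical sample data standing in for the second columns.
    print(pairs_bruteforce([0, 5000, 10000], [1000, 4000, 9000, 20000], 4000))
```

A reference like this is useful mainly for checking a faster implementation against a few hundred rows before running it on the full file.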

import numpy as np
import time
import h5py
import sys
import csv

f_r = h5py.File('input.h5', 'r+')
dset1 = f_r.get('dataset_1')
dset2 = f_r.get('dataset_2')
r1, c1 = dset1.shape
r2, c2 = dset2.shape
left, right, count = 0, 0, 0
W = 4000  # Window half-width
n = 1     # Number of rows added per resize

# **********************************************
# HDF5 Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)

for j in range(r1):
    e1 = dset1[j, 1]
    # Advance the left pointer until dset2[left, 1] is within -W of e1
    while left < r2 and dset2[left, 1] - e1 <= -W:
        left += 1
    # Advance the right pointer until dset2[right, 1] is more than +W above e1
    while right < r2 and dset2[right, 1] - e1 <= W:
        right += 1
    # Every row of dataset 2 in [left, right) is inside the window for e1
    for i in range(left, right):
        delta = e1 - dset2[i, 1]
        dset.resize(dset.shape[0] + n, axis=0)
        dset[count, 0:4] = [count, dset1[j, 1], dset2[i, 1], delta]
        count += 1

print("\nFinal shape of dataset created: " + str(dset.shape))
f_w.close()
f_r.close()
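The loop above reads one HDF5 element per comparison and resizes the output once per row, and each of those calls goes through h5py individually. One possible direction, given here as a hedged sketch rather than a tested replacement, is to read whole blocks, find the window bounds with `np.searchsorted` (a binary search), and resize once per block. The function name `subtract_windows` and the `block` parameter are hypothetical. The sketch assumes that the second column of dataset 2 fits in memory as a single array and that the matches for one block of dataset 1 also fit in memory; neither is confirmed for the 10 GB input. The `side="right"` arguments reproduce the `<= -W` and `<= W` conditions of the original loop.

```python
import h5py
import numpy as np

def subtract_windows(dset1, dset2, out, w=4000, block=100_000):
    """Append rows [index, e1, e2, e1 - e2] to `out`; return the row count.

    `dset1` and `dset2` are 2-D datasets sorted by column 1; `out` is a
    resizable dataset with 4 columns.
    """
    # Assumption: one column of dataset 2 fits in memory.
    col2 = dset2[:, 1].astype(np.float64)
    written = 0
    for start in range(0, dset1.shape[0], block):
        # One HDF5 read per block instead of one per element.
        col1 = dset1[start:start + block, 1].astype(np.float64)
        # First index with col2 > e1 - w, and first index with col2 > e1 + w.
        left = np.searchsorted(col2, col1 - w, side="right")
        right = np.searchsorted(col2, col1 + w, side="right")
        counts = right - left
        total = int(counts.sum())
        if total == 0:
            continue
        # Expand each [left, right) range into explicit col2 indices.
        offsets = np.cumsum(counts) - counts
        idx2 = (np.arange(total) - np.repeat(offsets, counts)
                + np.repeat(left, counts))
        e1 = np.repeat(col1, counts)
        e2 = col2[idx2]
        rows = np.column_stack(
            (np.arange(written, written + total), e1, e2, e1 - e2))
        # One resize and one write per block.
        out.resize(written + total, axis=0)
        out[written:written + total, :] = rows
        written += total
    return written

if __name__ == "__main__":
    # File and dataset names follow the script above.
    with h5py.File("input.h5", "r") as f_r, h5py.File("data.h5", "w") as f_w:
        out = f_w.create_dataset("dataset_1", shape=(0, 4),
                                 maxshape=(None, 4), dtype="f8", chunks=True)
        n_rows = subtract_windows(f_r["dataset_1"], f_r["dataset_2"], out)
        print("Final shape of dataset created:", out.shape)
```

If the windows are dense, a single block can produce far more output rows than it has input rows, so `block` may need to be lowered to keep memory use bounded.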

edited Jul 30, 2020 at 22:50

Jamal

asked Jul 30, 2020 at 17:05

nuki

\$\endgroup\$
