6
\$\begingroup\$

I am trying to solve the following problem:

Input: Input is two arrays (Nx4, sorted in column-2) stored in datasets-1 and 2 in HDF5 file (input.h5). N is huge (originally belonging to 10 GB of file, hence stored in HDF5 file).

Output: Subtracting each column-2 element of dataset-2 from dataset-1, such that the difference (delta) is between +/-4000. Eventually saving this info in dset of a new HDF5 file. I need to refer to this new file back-and-forth, hence HDF5 not a text file.

Concern: I initially used .append method but that crashed the execution for 10GBs input. So, I am now using dset.resize method (and would like to stick to it preferably). I am also using binary search as I was told in one of my last posts. So now, although the script seems to be working for large (10 GBs) of datasets, it is quite slow! The subtraction (for/while) loop is possibly the culprit! Any suggestions on how I can make this fast? I aim to use the fastest approach (and possibly the simplest, since I am a beginner).

import numpy as np
import time
import h5py
import sys
import csv
f_r = h5py.File('input.h5', 'r+')
dset1 = f_r.get('dataset_1')
dset2 = f_r.get('dataset_2')
r1,c1 = dset1.shape
r2,c2 = dset2.shape
left, right, count = 0,0,0
W = 4000 # Window half-width
n = 1
# **********************************************
# HDF5 Out Creation 
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)
for j in range(r1):
 e1 = dset1[j,1]
 # move left pointer so that is within -delta of e
 while left < r2 and dset2[left,1] - e1 <= -W:
 left += 1
 # move right pointer so that is outside of +delta
 while right < r2 and dset2[right,1] - e1 <= W:
 right += 1
 for i in range(left, right):
 delta = e1 - dset2[i,1]
 dset.resize(dset.shape[0] + n, axis=0)
 dset[count, 0:4] = [count, dset1[j,1], dset2[i,1], delta]
 count += 1
print("\nFinal shape of dataset created: " + str(dset.shape))
f_w.close()
Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Jul 30, 2020 at 17:05
\$\endgroup\$

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.