
I am currently working with .h5 files. Each file contains several tables, which I have to process (row filtering and other basic operations). Then, as one of the steps, I have to compute an integral for each row, which takes two of the columns as input. A simplified version of the code looks something like this:

import scipy.integrate

# Inside an object method
# Function to apply, which I know is not vectorized yet
def compute_alpha_val(row):
    weight, degree = row["norm_weight"], row["degree"]
    if degree == 1:
        return 1
    func = lambda x: (1 - x) ** (degree - 2)
    alpha = 1 - (degree - 1) * scipy.integrate.quad(func, 0, weight)[0]
    return round_half_up(alpha, 4)

for chunk in table_chunks:  # Generator of pd.DataFrames from the table stored in the h5 file
    # Do some operations
    alphas = chunk.apply(compute_alpha_val, axis=1)          # Works
    alphas = chunk.swifter.apply(compute_alpha_val, axis=1)  # Does not work
    # Do stuff with alphas
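For what it's worth, while trying to vectorize this myself I noticed the integral seems to have a closed form: ∫_0^w (1 - x)^(d-2) dx = (1 - (1 - w)^(d-1)) / (d - 1), which would make alpha = (1 - w)^(d-1); this also yields 1 when d == 1, matching the special case. A sketch of a fully vectorized version (the half-up rounding trick assumes alphas are non-negative, which holds for normalized weights in [0, 1]):

```python
import numpy as np

def compute_alpha_vec(weight, degree):
    # alpha = 1 - (d - 1) * integral  collapses to  (1 - w)**(d - 1)
    alpha = np.power(1.0 - weight, degree - 1)
    # Round half up to 4 decimals: np.round rounds half to even,
    # so use floor(x * 1e4 + 0.5) instead (valid for alpha >= 0).
    return np.floor(alpha * 1e4 + 0.5) / 1e4
```

Called as compute_alpha_vec(chunk["norm_weight"].to_numpy(), chunk["degree"].to_numpy()), this would avoid both apply and swifter entirely, though I have not yet verified that the rounding matches round_half_up in every case.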

The normal apply is fairly slow (about 50 seconds per million rows), but it works; the swifter one does not. Since the function is not vectorized, I know the swifter version would probably be slower anyway, but instead it throws a completely different error:

BlockingIOError: [Errno 11] Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

which looks as if the code is trying to do multiprocessing on the h5 file itself, triggering a file lock error. This should not be the case, since the apply does not touch anything inside the file. Moreover, some print lines that I placed to monitor progress show strange stuttering-like behavior:

# Beginning of the script
STARTING table15
Num rows to process 2018915
Starting to run: compute alphas
Starting chunk
Starting bin1_id
# Here the apply should happen
STARTING table15 # As if the script started from the beginning
STARTING table15
Num rows to process 2018915
Num rows to process 2018915
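The repeated banner lines look like what I would expect if worker processes re-import the main module, which is what happens under the "spawn" start method used by some multiprocessing backends: every top-level statement (prints, opening the .h5 file, ...) runs again in each worker. A minimal sketch of how I understand the entry point should be guarded (function names here are hypothetical, not my real code):

```python
# Sketch: keep all driver code under the __main__ guard so that worker
# processes re-importing this module do not re-run it (no duplicated
# prints, no workers re-opening the locked .h5 file).

def compute(x):          # importable by workers: fine at top level
    return x + 1

def main():
    print("STARTING table15")   # runs once, in the parent only
    # ... open the .h5 file, iterate chunks, apply compute_alpha_val ...
    return [compute(x) for x in range(4)]

if __name__ == "__main__":
    results = main()
```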

I checked that a dataset is being passed and not something else. I also checked that all possible connections to the h5 files are closed, even though that should not matter.

My hypothesis is that h5py somehow sees an open connection to the file and blocks the multiprocessing package underneath swifter in order to avoid file corruption. Any idea how to solve this? I am open to anything, as long as in the end I can vectorize the function and speed up this code: it has to run on more than 10 billion rows, and at current speeds it is too slow.
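One workaround I came across (an assumption on my part that it applies to the pandas/PyTables stack here): HDF5 1.10+ lets you disable its advisory file locking through an environment variable, which should only be done when nothing writes to the file concurrently:

```shell
# Disable HDF5's advisory file locking before launching the script.
# Only safe when no other process writes to the file at the same time.
export HDF5_USE_FILE_LOCKING=FALSE
# python process_tables.py   # hypothetical entry point
```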

asked Aug 27, 2023 at 20:11
  • That error comes from the OS when HDF5 wants to lock the file for read or write access (using flock on Linux). Whatever you think is happening, either you have the file open for writing and another process opens it for reading, or vice versa, or both want to open it for writing. Commented Aug 27, 2023 at 20:34
  • @Homer512 Thank you, I get that and it makes sense, but I do not see how the function called by the apply can start a new process on the file... I tried substituting the function with something super simple (like return row["a"] + row["b"]), and in that case it works, but as soon as the apply becomes more complex it breaks. Maybe something about h5 buffers? Commented Aug 27, 2023 at 21:15
  • If you are on Linux, you can use the command line lsof /directory/filename to see which process has a file open Commented Aug 27, 2023 at 21:46
  • I'm not seeing the usual h5py file access (that I'm used to seeing)? Is this using pandas table's access? What is row - a row of a frame already in memory, or one that's loaded on the fly from the file? Commented Aug 27, 2023 at 22:20
  • If you can't find the location where the file is kept open or is opened, we really need a minimal reproducible example on this one Commented Aug 28, 2023 at 7:15
