4
\$\begingroup\$

I have managed to write this piece of code, but on big data sets it ends up being quite slow. I am pretty sure it could be optimized, but I am very new to coding and don't really know where to start. I think getting rid of the for loop would be one way, but honestly I'm lost. A little bit of help would be greatly appreciated!

Basically, the point is to check whether a row of the 'data' dataframe matches a row of the 'ref' dataframe. I use np.isclose to allow for small differences, since I know my 'data' values can differ slightly from the 'ref' values.
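To make the tolerance behaviour concrete, here is a minimal sketch of how np.isclose treats small differences and NaN values. The two arrays are made-up values for illustration, not taken from the real tables. Note that np.isclose accepts |a - b| <= atol + rtol * |b|, and rtol defaults to 1e-5, so rtol=0 is passed here to make the tolerance exactly atol.

```python
import numpy as np

# Hypothetical values for illustration only.
read_row = np.array([198.0, np.nan])
ref_row = np.array([200.0, np.nan])

# Default equal_nan=False: NaN is never close to anything, not even NaN.
strict = np.isclose(read_row, ref_row, rtol=0, atol=5)

# equal_nan=True: NaNs in the same position count as a match.
lenient = np.isclose(read_row, ref_row, rtol=0, atol=5, equal_nan=True)

print(strict)   # [ True False]
print(lenient)  # [ True  True]
```

With equal_nan=True, a NaN compared against a real number is still False, so a row only matches when its NaN pattern agrees with the reference row.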

Also, because my rows can contain a lot of NaN values, I first use np.isnan to find the index of the last 'real' value in the row and then compare only the 'actual' values. I thought this would speed things up, but I'm not sure it did.
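One thing worth checking about the np.isnan(...).argmax() idiom: argmax returns 0 when the mask contains no True at all, so a row without any NaN looks the same as a row that starts with NaN. The sketch below shows the difference; first_nan_index is a hypothetical helper name, not part of the original code.

```python
import numpy as np

def first_nan_index(values):
    """Return the index of the first NaN, or len(values) if there is none."""
    mask = np.isnan(np.asarray(values, dtype=float))
    return int(mask.argmax()) if mask.any() else len(mask)

partial = [100.0, 200.0, np.nan, np.nan]   # two real values, then NaN
full = [100.0, 200.0, 300.0, 400.0]        # no NaN at all

naive_partial = int(np.isnan(partial).argmax())  # 2, as intended
naive_full = int(np.isnan(full).argmax())        # 0, because the mask is all False

print(naive_partial, naive_full)                             # 2 0
print(first_nan_index(partial), first_nan_index(full))       # 2 4
```

As far as I can tell, in the loop below a row with no NaN would get x == 0, skip both the x == 2 and x > 2 branches, and end up as "not found", which may or may not be what is wanted.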

match = []
checklist = set()
for read in data.itertuples():
    for ref in ref.itertuples():
        x = np.isnan(read[3:]).argmax(axis=0)
        if x == 2:
            if np.isclose(read[4:6],ref[7:9],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:
                if not read[1] in checklist:
                    match.append([read[1], ref[5]])
                    checklist.add(read[1])
        if x > 2:
            read_pos = 3+x-1
            ref_pos = 6+x-1
            if np.isclose(read[4:read_pos],ref[7:ref_pos],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:
                if not read[1] in checklist:
                    match.append([read[1], ref[5]])
                    checklist.add(read[1])
    if read[1] not in checklist:
        match.append([read[1], "not found"])
        checklist.add(read[1])

Thanks in advance !

EDIT:

To download samples of data and ref tables: https://we.tl/RF6lxDZBjt

Short example of the dataframes:

ref = pd.DataFrame({'name':['a-1','a-2','b-1'],
                    'start 1':[100,100,100],
                    'end 1':[200,200,500],
                    'start 2':[300,np.nan,600],
                    'end 2':[400,np.nan, 700]},
                   columns=['name', 'start 1', 'end 1', 'start 2', 'end 2'],
                   dtype='float64')

  name  start 1  end 1  start 2  end 2
0  a-1    100.0  200.0    300.0  400.0
1  a-2    100.0  200.0      NaN    NaN
2  b-1    100.0  500.0    600.0  700.0

data = pd.DataFrame({'name':['read 1','read 2','read 3','read 4', 'read 5'],
                     'start 1':[100,102,100,103,600],
                     'end 1':[198,504,500,200, 702],
                     'start 2':[np.nan,600,650,601, np.nan],
                     'end 2':[np.nan,699, 700,702, np.nan]},
                    columns=['name', 'start 1', 'end 1', 'start 2', 'end 2'],
                    dtype='float64')

     name  start 1  end 1  start 2  end 2
0  read 1    100.0  198.0      NaN    NaN
1  read 2    102.0  504.0    600.0  699.0
2  read 3    100.0  500.0    650.0  700.0
3  read 4    103.0  200.0    601.0  702.0
4  read 5    600.0  702.0      NaN    NaN
Mast
asked Jul 19, 2018 at 20:47
\$\endgroup\$
3
  • \$\begingroup\$ @Graipher Did you happen to have time to look at it ? If not, I totally understand ! If you even have a small idea of things that I could try I'm willing to try it myself ! I tried looking into vectorization but I really don't know where to start.. Thanks anyway for your time, it's greatly appreciated :) \$\endgroup\$ Commented Jul 21, 2018 at 10:05
  • \$\begingroup\$ Not yet, but I will have some time tomorrow. Yes, writing good (in other words vectorised) code in numpy/pandas is a whole new world if you only know vanilla Python. \$\endgroup\$ Commented Jul 21, 2018 at 12:25
  • \$\begingroup\$ Do you think it's possible to remove the outer for loop by using apply instead, and to put the vectorization of the inner loop in a function? \$\endgroup\$ Commented Jul 23, 2018 at 12:09

1 Answer

3
\$\begingroup\$

Invariants

for read in data.itertuples():
    for ref in ref.itertuples():
        x = np.isnan(read[3:]).argmax(axis=0)

x doesn't change in the inner loop, so you can move it out of the inner loop, and not execute it repeatedly.

for read in data.itertuples():
    x = np.isnan(read[3:]).argmax(axis=0)
    for ref in ref.itertuples():

The following two lines are identical, apart from the end-points of the slices:

if np.isclose(read[4: 6 ],ref[7: 9 ],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:
if np.isclose(read[4:read_pos],ref[7:ref_pos],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:

You already have a variable for the end-points. Why not use it for the first line as well, and only have one case?

read_pos = 3+x-1 if x > 2 else 6
ref_pos = 6+x-1 if x > 2 else 9

Once you've found your target, you can't ever add it again ...

if not read[1] in checklist:
    match.append([read[1], ref[5]])
    checklist.add(read[1])

... but you don't break out of your inner search, which is now pointless.
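
As a small illustration of stopping at the first hit, here is a sketch using Python's for/else, where the else branch runs only if the loop finished without break. The function name and the (name, value) pairs are hypothetical and much simpler than the real rows. It also uses a loop variable that is distinct from the thing being iterated: as far as I can tell, `for ref in ref.itertuples()` rebinds `ref` to a row tuple, so the next pass of the outer loop would no longer see the dataframe.

```python
def find_first_match(read_value, ref_rows, atol=5):
    """Return the name of the first reference within atol, else 'not found'."""
    for ref_name, ref_value in ref_rows:   # names differ from the iterable
        if abs(read_value - ref_value) <= atol:
            result = ref_name
            break                          # stop searching after the first hit
    else:
        # Only reached when the loop was never broken out of.
        result = "not found"
    return result

# Hypothetical reference rows for illustration.
refs = [("a", 100), ("b", 200), ("c", 201)]
print(find_first_match(198, refs))  # b
print(find_first_match(50, refs))   # not found
```

With this pattern the separate checklist set is no longer needed to know whether a read was matched.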


If I haven't made any errors, this should be a little faster:

match = []
checklist = set()
for read in data.itertuples():
    x = np.isnan(read[3:]).argmax(axis=0)
    if x >= 2 and read[1] not in checklist:
        read_pos = 3+x-1 if x > 2 else 6
        ref_pos = 6+x-1 if x > 2 else 9
        for ref in ref.itertuples():
            if np.isclose(read[4:read_pos],ref[7:ref_pos],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:
                match.append([read[1], ref[5]])
                checklist.add(read[1])
                break
    if read[1] not in checklist:
        match.append([read[1], "not found"])
        checklist.add(read[1])
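
If the goal is to drop the Python loops entirely, one possible direction is NumPy broadcasting. The sketch below is not a drop-in replacement: it assumes the short example tables from the question (with their four coordinate columns), and it uses a simplified rule in which every coordinate column must agree within atol and the NaN pattern must be identical, which is not exactly the slice logic of the loop above. match_rows and cols are hypothetical names.

```python
import numpy as np
import pandas as pd

cols = ['start 1', 'end 1', 'start 2', 'end 2']

ref = pd.DataFrame({'name': ['a-1', 'a-2', 'b-1'],
                    'start 1': [100, 100, 100],
                    'end 1': [200, 200, 500],
                    'start 2': [300, np.nan, 600],
                    'end 2': [400, np.nan, 700]})
data = pd.DataFrame({'name': ['read 1', 'read 2', 'read 3', 'read 4', 'read 5'],
                     'start 1': [100, 102, 100, 103, 600],
                     'end 1': [198, 504, 500, 200, 702],
                     'start 2': [np.nan, 600, 650, 601, np.nan],
                     'end 2': [np.nan, 699, 700, 702, np.nan]})

def match_rows(data, ref, cols, atol=5):
    """Pair each data row with the first ref row that agrees on all cols."""
    d = data[cols].to_numpy(dtype=float)[:, None, :]   # shape (n, 1, k)
    r = ref[cols].to_numpy(dtype=float)[None, :, :]    # shape (1, m, k)
    # close[i, j] is True when data row i matches ref row j in every column.
    # equal_nan=True makes NaN match NaN; NaN vs a number is still False.
    close = np.isclose(d, r, rtol=0, atol=atol, equal_nan=True).all(axis=2)
    found = close.any(axis=1)        # does row i match anything?
    first = close.argmax(axis=1)     # index of the first match (0 if none)
    names = ref['name'].to_numpy()[first]
    return [[read, name if ok else "not found"]
            for read, name, ok in zip(data['name'], names, found)]

result = match_rows(data, ref, cols)
print(result)
```

The intermediate boolean array has n * m * k elements, so for large tables it would probably need to be built in chunks of data rows to keep memory use reasonable.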
answered Jul 19, 2018 at 22:48
\$\endgroup\$
1
  • \$\begingroup\$ Thanks! It didn't improve the speed that much, but it made me realize some mistakes I had made, and the code is much easier to read and understand, so that's already a nice improvement! \$\endgroup\$ Commented Jul 20, 2018 at 10:05
