I have managed to write this piece of code, but on big data sets it ends up being quite slow. I am pretty sure it can be optimized, but I am very new to coding and don't really know where to start. I think getting rid of the for loop would be one way, but honestly I'm lost. A little bit of help would be greatly appreciated!
Basically, the point is to check whether a row of the 'data' dataframe matches a row of the 'ref' dataframe. I use np.isclose to allow for small differences, as I know my 'data' values can differ slightly from the 'ref' values.
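As a minimal illustration of that tolerance check (the two rows and the variable names are made up for this sketch, they are not taken from the real tables):

```python
import numpy as np

ref_row = np.array([100.0, 200.0, np.nan, np.nan])
read_row = np.array([100.0, 198.0, np.nan, np.nan])

# atol=5 accepts absolute differences up to about 5 (plus a tiny relative term),
# and equal_nan=True treats NaN in the same position as equal.
close = np.isclose(read_row, ref_row, atol=5, equal_nan=True)
row_matches = bool(close.all())

# Without equal_nan, NaN never compares as close, so the same rows would not match.
strict_matches = bool(np.isclose(read_row, ref_row, atol=5).all())
```

Here row_matches is True and strict_matches is False, which is why equal_nan=True matters for NaN-padded rows.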
Also, because my rows can contain a lot of NaN values, I first use np.isnan to get the index of the last 'real' value in the row and then compare only the 'actual' values. I thought it would speed things up, but I'm not sure it did.
match = []
checklist = set()
for read in data.itertuples():
    for ref in ref.itertuples():
        x = np.isnan(read[3:]).argmax(axis=0)
        if x == 2:
            if np.isclose(read[4:6],ref[7:9],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:
                if not read[1] in checklist:
                    match.append([read[1], ref[5]])
                    checklist.add(read[1])
        if x > 2:
            read_pos = 3+x-1
            ref_pos = 6+x-1
            if np.isclose(read[4:read_pos],ref[7:ref_pos],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:
                if not read[1] in checklist:
                    match.append([read[1], ref[5]])
                    checklist.add(read[1])
    if read[1] not in checklist:
        match.append([read[1], "not found"])
        checklist.add(read[1])
Thanks in advance !
EDIT:
To download samples of data and ref tables: https://we.tl/RF6lxDZBjt
Short example of the dataframes:
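ref = pd.DataFrame({'name': ['a-1', 'a-2', 'b-1'],
                    'start 1': [100, 100, 100],
                    'end 1': [200, 200, 500],
                    'start 2': [300, np.nan, 600],
                    'end 2': [400, np.nan, 700]},
                   columns=['name', 'start 1', 'end 1', 'start 2', 'end 2'])
# Cast only the numeric columns to float; 'name' stays a string column.
ref = ref.astype({col: 'float64' for col in ref.columns[1:]})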
  name  start 1  end 1  start 2  end 2
0  a-1    100.0  200.0    300.0  400.0
1  a-2    100.0  200.0      NaN    NaN
2  b-1    100.0  500.0    600.0  700.0
data = pd.DataFrame({'name': ['read 1', 'read 2', 'read 3', 'read 4', 'read 5'],
                     'start 1': [100, 102, 100, 103, 600],
                     'end 1': [198, 504, 500, 200, 702],
                     'start 2': [np.nan, 600, 650, 601, np.nan],
                     'end 2': [np.nan, 699, 700, 702, np.nan]},
                    columns=['name', 'start 1', 'end 1', 'start 2', 'end 2'])
# Cast only the numeric columns to float; 'name' stays a string column.
data = data.astype({col: 'float64' for col in data.columns[1:]})
     name  start 1  end 1  start 2  end 2
0  read 1    100.0  198.0      NaN    NaN
1  read 2    102.0  504.0    600.0  699.0
2  read 3    100.0  500.0    650.0  700.0
3  read 4    103.0  200.0    601.0  702.0
4  read 5    600.0  702.0      NaN    NaN
- @Graipher Did you happen to have time to look at it ? If not, I totally understand ! If you even have a small idea of things that I could try I'm willing to try it myself ! I tried looking into vectorization but I really don't know where to start.. Thanks anyway for your time, it's greatly appreciated :) – Florian Bernard, Jul 21, 2018 at 10:05
- Not yet, but I will have some time tomorrow. Yes, writing good (in other words vectorised) code in numpy/pandas is a whole new world if you only know vanilla Python. – Graipher, Jul 21, 2018 at 12:25
- Would you think it's possible to remove the outer for loop by using apply instead ? and put the vectorization of the inner loop in a function ? – Florian Bernard, Jul 23, 2018 at 12:09
1 Answer
Invariants
for read in data.itertuples():
    for ref in ref.itertuples():
        x = np.isnan(read[3:]).argmax(axis=0)
x doesn't change in the inner loop, so you can move it out of the inner loop and compute it once per read instead of once per (read, ref) pair.
for read in data.itertuples():
    x = np.isnan(read[3:]).argmax(axis=0)
    for ref in ref.itertuples():
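Taking the same idea one step further, the first-NaN index could be computed for every row in a single call rather than once per iteration. This is only a sketch: the array values and the names nan_mask and first_nan are assumptions for illustration, and it presumes the numeric columns are available as one NaN-padded 2-D array.

```python
import numpy as np

# Hypothetical numeric block: one row per read, NaN-padded on the right.
values = np.array([
    [100.0, 198.0, np.nan, np.nan],
    [102.0, 504.0, 600.0, 699.0],
])

nan_mask = np.isnan(values)
# Index of the first NaN in each row, for all rows at once.
first_nan = nan_mask.argmax(axis=1)
# argmax returns 0 for a row that contains no NaN at all, so replace that
# misleading 0 with the row length for such rows.
first_nan = np.where(nan_mask.any(axis=1), first_nan, values.shape[1])
```

The extra np.where step guards against a pitfall that also applies to the per-row version: a row without any NaN yields 0 from argmax, which is indistinguishable from a NaN in the first position.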
The following two lines are identical, apart from the end-points of the slices:
if np.isclose(read[4: 6 ],ref[7: 9 ],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:
if np.isclose(read[4:read_pos],ref[7:ref_pos],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:
You already have a variable for the end-points. Why not use it for the first line as well, and only have one case?
read_pos = 3+x-1 if x > 2 else 6
ref_pos = 6+x-1 if x > 2 else 9
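A quick way to check that the merged expressions reproduce both original branches is to wrap them in a small function. The helper name slice_ends is hypothetical and exists only for this sketch.

```python
def slice_ends(x):
    """Return (read_pos, ref_pos) for a first-NaN index x.

    Mirrors the two conditional expressions above; callers are expected
    to have filtered out x < 2 already.
    """
    read_pos = 3 + x - 1 if x > 2 else 6
    ref_pos = 6 + x - 1 if x > 2 else 9
    return read_pos, ref_pos


# x == 2 gives the constants of the first branch (slices 4:6 and 7:9).
ends_for_2 = slice_ends(2)
# x > 2 gives the computed end-points of the second branch.
ends_for_6 = slice_ends(6)
```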
Once you've found your target, you can't ever add it again ...
if not read[1] in checklist:
    match.append([read[1], ref[5]])
    checklist.add(read[1])
... but you don't break out of your inner search, which is now pointless.
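One way to express "stop at the first hit, otherwise record not found" is Python's for/else, which can also make the checklist set unnecessary. The sketch below uses plain tuples with made-up values instead of dataframes, and the loop variable names are assumptions chosen for illustration.

```python
ref = [("a-1", 100.0), ("b-1", 500.0)]         # (name, start), made-up rows
data = [("read 1", 102.0), ("read 2", 900.0)]  # (name, start), made-up rows

match = []
for read_name, read_start in data:
    # The loop variables are deliberately not called ref: "for ref in ref"
    # rebinds the name ref to a single row, so the next pass of the outer
    # loop no longer sees the full table.
    for ref_name, ref_start in ref:
        if abs(read_start - ref_start) <= 5:
            match.append([read_name, ref_name])
            break  # first hit wins; the remaining rows cannot change the result
    else:
        # Runs only when the inner loop finished without hitting break.
        match.append([read_name, "not found"])
```

With these rows, match ends up as [["read 1", "a-1"], ["read 2", "not found"]].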
If I haven't made any errors, this should be a little faster:
match = []
checklist = set()
for read in data.itertuples():
    x = np.isnan(read[3:]).argmax(axis=0)
    if x >= 2 and read[1] not in checklist:
        read_pos = 3+x-1 if x > 2 else 6
        ref_pos = 6+x-1 if x > 2 else 9
        for ref in ref.itertuples():
            if np.isclose(read[4:read_pos],ref[7:ref_pos],atol=5, equal_nan=True).all() == True and np.isnan(ref[6:]).argmax(axis=0) == x:
                match.append([read[1], ref[5]])
                checklist.add(read[1])
                break
    if read[1] not in checklist:
        match.append([read[1], "not found"])
        checklist.add(read[1])
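If the loops are still too slow, a vectorised comparison may be worth trying. The following is a rough sketch under several assumptions, not a drop-in replacement: it uses the values of the short example from the question as plain NumPy arrays, it compares all numeric columns instead of the slices used above, and it assumes both tables share the same numeric columns in the same order. With dataframes, the arrays could be obtained with DataFrame.to_numpy() on the numeric columns.

```python
import numpy as np

ref_names = ["a-1", "a-2", "b-1"]
ref_vals = np.array([
    [100.0, 200.0, 300.0, 400.0],
    [100.0, 200.0, np.nan, np.nan],
    [100.0, 500.0, 600.0, 700.0],
])
data_names = ["read 1", "read 2", "read 3", "read 4", "read 5"]
data_vals = np.array([
    [100.0, 198.0, np.nan, np.nan],
    [102.0, 504.0, 600.0, 699.0],
    [100.0, 500.0, 650.0, 700.0],
    [103.0, 200.0, 601.0, 702.0],
    [600.0, 702.0, np.nan, np.nan],
])

# Broadcast to shape (n_data, n_ref, n_cols): every read against every ref row.
# equal_nan=True makes NaN match NaN, while NaN against a number is False,
# so the NaN patterns of both rows have to agree as well.
close = np.isclose(data_vals[:, None, :], ref_vals[None, :, :],
                   atol=5, equal_nan=True)
hit = close.all(axis=2)      # (n_data, n_ref): whole row within tolerance
found = hit.any(axis=1)      # does the read match any ref row at all?
first = hit.argmax(axis=1)   # index of the first matching ref row (0 if none)

match = [[name, ref_names[i] if ok else "not found"]
         for name, i, ok in zip(data_names, first, found)]
```

The intermediate boolean array holds n_data * n_ref * n_cols entries, so for large tables the reads would probably have to be processed in chunks to keep memory use bounded.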
- Thanks ! It didn't improve the speed that much but at least it made realize some mistakes I did and it is much more easy to read and understand so that's already a nice improvement ! – Florian Bernard, Jul 20, 2018 at 10:05