For each section between consecutive count=1 rows (and between a count=1 row and the start or end of the file), combine the overlapping positions and output the median of the counts.
Input
chr start stop strand count
chr1 0 13320 - 1
chr1 13320 13321 - 2
chr1 13321 13328 - 1
chr1 13328 13342 - 2
chr1 13342 13343 - 18
chr1 13343 13344 - 36
chr1 13344 13345 - 18
chr1 13345 13346 - 6
chr1 13346 16923 - 1
chr1 16923 16942 - 3
chr1 16942 16943 - 2
Output
chr1 13320 13321 2
chr1 13328 13346 18
chr1 16923 16943 2.5
For the second output row:

- Start 13328: the 4th row in the table has start 13328. This is the row after the second count=1.
- Stop 13346: the 8th row in the table has stop 13346. This is the row before the third count=1.
- Count 18: this is the median of the counts of the 4th through 8th rows inclusive.
Here is my code.
from pathlib import Path
import pandas as pd
file = Path("bed_file.bed")
# load with pandas
df = pd.read_csv(file, sep='\t', header=None)
# set colnames
header = ['chr','start','stop','strand','count']
df.columns = header[:len(df.columns)]
# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]
# create new dataframe
newDF = pd.DataFrame(columns=['chr','start', 'stop', 'count'])
# last position
end = df.index[-1]
# parse dataframe
for idx, elem in enumerate(li):
    if elem != li[-1]:
        next_elem = li[(idx + 1) % len(li)]  # next element where count=1
        start = df.iloc[elem]['stop']  # start position
        stop = df.iloc[next_elem-1]['stop']  # stop position
        if next_elem - (elem+1) == 1:  # cases where only one position and we cannot compute median
            count = df.iloc[elem+1]['count']
            #print(f"start={start}\tstop={stop}\tcount={count}")
        else:
            count = df.iloc[elem+1:next_elem]['count'].median()
            #print(f"start={start}\tstop={stop}\tcount={count}")
        newDF = newDF.append({
            'chr': df.loc[0, 'chr'],
            'start': start,
            'stop': stop,
            'count': count
        }, ignore_index=True)
    else:  # last element in the list
        start = df.iloc[elem]['stop']
        stop = df.iloc[end]['stop']
        count = df.iloc[elem+1:end+1]['count'].median()
        #print(f"start={start}\tstop={stop}\tcount={count}")
        newDF = newDF.append({
            'chr': df.loc[0, 'chr'],
            'start': start,
            'stop': stop,
            'count': count
        }, ignore_index=True)
Is there a better way to do this? Is my code Pythonic?
1 Answer
I'll first offer some critique of your code, and then I'll show you how I would approach the problem.
- Commented-out code should be removed before asking for a code review:
#print(f"start={start}\tstop={stop}\tcount={count}")
- Many of the comments don't add value. For example,
# last position
doesn't mean much on its own. Why do you want the last position? Why doesn't the code do a good enough job of explaining that?
- Generally, an if/else in a loop where one of the branches is taken only once, at either the start or the end, can be removed. You can iterate less and deal with the case explicitly. You can add a sentinel value so you don't have to check whether you are at the end of the iterator. Or you can use the available libraries or built-in functions, which will deal with the case for you.
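To make the sentinel idea concrete, here is a minimal sketch (plain Python with the question's counts hard-coded, so the names here are illustrative, not part of the original code): appending one past-the-end index to the list of count=1 positions lets a single loop body handle every segment, including the last one.

```python
import pandas as pd

counts = [1, 2, 1, 2, 18, 36, 18, 6, 1, 3, 2]
ones = [i for i, n in enumerate(counts) if n == 1]
ones.append(len(counts))  # sentinel: a virtual count=1 row past the end

medians = []
for left, right in zip(ones, ones[1:]):
    segment = counts[left + 1:right]
    if segment:  # adjacent count=1 rows produce an empty segment; skip it
        medians.append(pd.Series(segment).median())

print(medians)  # [2.0, 18.0, 2.5]
```

With the sentinel in place, the else branch for "last element in the list" disappears entirely.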
# load with pandas
df = pd.read_csv(file, sep='\t', header=None)
# set colnames
header = ['chr','start','stop','strand','count']
df.columns = header[:len(df.columns)]
# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]
If the header is cut short (len(df.columns) < len(header)), the first thing to be cut off is the column df['count']. Yet you assume it exists immediately afterwards by using it. Which is it? Will it always exist, or will there sometimes not be enough columns? Erring on the side of it always existing, the code becomes
# load with pandas
df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]
# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]
...
for idx, elem in enumerate(li):
If you are using pandas (or numpy) it is generally not the best to move the data back and forth between the library and Python. You lose most of the efficiency of the library, and the code generally becomes far less readable.
Don't use names like li. It doesn't give any information to the reader. If you have a list of indices, what will you use the list for? That would make a much better name.
Using pandas more, and renaming gives something like
splitting_indices = df.index[df['count'] == 1].tolist()
for idx, elem in enumerate(splitting_indices):
if next_elem - (elem+1) == 1:  # cases where only one position and we cannot compute median
    count = df.iloc[elem+1]['count']
    #print(f"start={start}\tstop={stop}\tcount={count}")
else:
    count = df.iloc[elem+1:next_elem]['count'].median()
Finding this logic in amongst getting the data out from the dataframe is not easy. This is the core logic, and should be treated as such. Put this in a function at the very least.
def extract_median(df, elem, next_elem):
    if next_elem - (elem+1) == 1:  # cases where only one position and we cannot compute median
        count = df.iloc[elem+1]['count']
    else:
        count = df.iloc[elem+1:next_elem]['count'].median()
    return count
Now it should be much more apparent that the comment is bogus. You CAN compute the median of a single-element list. So why are we special-casing this? df.iloc[elem+1:next_elem] works even if next_elem is only one bigger than elem+1.
def extract_median(df, elem, next_elem):
    return df.iloc[elem+1:next_elem]['count'].median()
And now we can see that a function is probably not necessary.
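A quick check confirms the special case is dead weight: Series.median handles a one-element slice without complaint.

```python
import pandas as pd

counts = pd.Series([2, 18, 36, 18, 6])
print(counts.iloc[0:1].median())  # 2.0 -- median of a single element is that element
print(counts.iloc[0:5].median())  # 18.0
```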
The approach I would take to implementing this is to stay in pandas as long as possible. No loops. No tolist(). Since I won't want loops, indices are probably not needed either, so I can limit the usage of iloc and df.index.
First, read in the data
df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
chr start stop strand count
0 chr1 0 13320 - 1
1 chr1 13320 13321 - 2
2 chr1 13321 13328 - 1
3 chr1 13328 13342 - 2
4 chr1 13342 13343 - 18
5 chr1 13343 13344 - 36
6 chr1 13344 13345 - 18
7 chr1 13345 13346 - 6
8 chr1 13346 16923 - 1
9 chr1 16923 16942 - 3
10 chr1 16942 16943 - 2
Then, find every row of interest: everywhere count is not 1.
df['count'] != 1
0 False
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 True
10 True
I want to group all the consecutive rows that are True together. The usual method to group consecutive rows by a column value is
- Keep a running tally.
- Compare each value in the column with the next one.
- If they are the same, don't do anything.
- If they are different, add 1 to a running tally.
- Associate the tally to that value.
- Groupby the tally.
In code
mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
count mask tally
0 1 False 1
1 2 True 2
2 1 False 3
3 2 True 4
4 18 True 4
5 36 True 4
6 18 True 4
7 6 True 4
8 1 False 5
9 3 True 6
10 2 True 6
Grouping then gives
df.groupby(tally).groups
{1: Int64Index([0], dtype='int64'),
2: Int64Index([1], dtype='int64'),
3: Int64Index([2], dtype='int64'),
4: Int64Index([3, 4, 5, 6, 7], dtype='int64'),
5: Int64Index([8], dtype='int64'),
6: Int64Index([9, 10], dtype='int64')}
Since you only want the rows where count is not 1, we can reuse the mask to keep just those rows.
df[mask].groupby(tally).groups
{2: Int64Index([1], dtype='int64'),
4: Int64Index([3, 4, 5, 6, 7], dtype='int64'),
6: Int64Index([9, 10], dtype='int64')}
And finally the median is quick to get from a grouper
df[mask].groupby(tally).median()
start stop count
count
2 13320.0 13321.0 2.0
4 13343.0 13344.0 18.0
6 16932.5 16942.5 2.5
In the end, the code is a lot shorter
df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
df[mask].groupby(tally).median()
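As a self-contained check, the same pipeline can be run against the sample data inlined as a string rather than read from bed_file.bed. One caveat for anyone running this on pandas 2.0 or later: median() no longer silently drops the non-numeric chr column, so the numeric columns are selected explicitly here.

```python
import io
import pandas as pd

data = """chr1\t0\t13320\t-\t1
chr1\t13320\t13321\t-\t2
chr1\t13321\t13328\t-\t1
chr1\t13328\t13342\t-\t2
chr1\t13342\t13343\t-\t18
chr1\t13343\t13344\t-\t36
chr1\t13344\t13345\t-\t18
chr1\t13345\t13346\t-\t6
chr1\t13346\t16923\t-\t1
chr1\t16923\t16942\t-\t3
chr1\t16942\t16943\t-\t2"""

df = pd.read_csv(io.StringIO(data), sep='\t',
                 names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
# select numeric columns explicitly: recent pandas raises on median() over strings
result = df[mask].groupby(tally)[['start', 'stop', 'count']].median()
print(result)
```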
Comment (PIFASTE, Aug 12, 2020): I found the answer using the agg() function!
df[mask].groupby(tally).agg({'start': np.min, 'stop': np.max, 'count': np.median})
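For completeness, a runnable sketch of that agg call (sample data inlined rather than read from a file; string aggregator names stand in for the numpy functions, which avoids the numpy import and a deprecation warning in recent pandas). Taking the min of start and the max of stop gives each segment's full extent, reproducing the question's expected output exactly.

```python
import io
import pandas as pd

data = """chr1\t0\t13320\t-\t1
chr1\t13320\t13321\t-\t2
chr1\t13321\t13328\t-\t1
chr1\t13328\t13342\t-\t2
chr1\t13342\t13343\t-\t18
chr1\t13343\t13344\t-\t36
chr1\t13344\t13345\t-\t18
chr1\t13345\t13346\t-\t6
chr1\t13346\t16923\t-\t1
chr1\t16923\t16942\t-\t3
chr1\t16942\t16943\t-\t2"""

df = pd.read_csv(io.StringIO(data), sep='\t',
                 names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
result = df[mask].groupby(tally).agg({'start': 'min', 'stop': 'max', 'count': 'median'})
print(result)
# segment rows: (13320, 13321, 2.0), (13328, 13346, 18.0), (16923, 16943, 2.5)
```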