For each section between consecutive count=1 rows (and between a count=1 row and the start or end of the file), combine the overlapping positions and output the median of the counts.
Input
chr start stop strand count
chr1 0 13320 - 1
chr1 13320 13321 - 2
chr1 13321 13328 - 1
chr1 13328 13342 - 2
chr1 13342 13343 - 18
chr1 13343 13344 - 36
chr1 13344 13345 - 18
chr1 13345 13346 - 6
chr1 13346 16923 - 1
chr1 16923 16942 - 3
chr1 16942 16943 - 2
Output
chr1 13320 13321 2
chr1 13328 13346 18
chr1 16923 16943 2.5
For the second output row:

- Start 13328: the 4th row in the table has start 13328. This is the row after the second count=1.
- Stop 13346: the 8th row in the table has stop 13346. This is the row before the third count=1.
- Count 18: this is the median of the counts of the 4th through 8th rows inclusive.
Here is my code.
from pathlib import Path
import pandas as pd
file = Path("bed_file.bed")
# load with pandas
df = pd.read_csv(file, sep='\t', header=None)
# set colnames
header = ['chr','start','stop','strand','count']
df.columns = header[:len(df.columns)]
# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]
# create new dataframe
newDF = pd.DataFrame(columns=['chr','start', 'stop', 'count'])
# last position
end = df.index[-1]
# parse dataframe
for idx, elem in enumerate(li):
    if elem != li[-1]:
        next_elem = li[(idx + 1) % len(li)]  # next element where count=1
        start = df.iloc[elem]['stop']  # start position
        stop = df.iloc[next_elem-1]['stop']  # stop position
        if next_elem - (elem+1) == 1:  # cases where only one position and we cannot compute median
            count = df.iloc[elem+1]['count']
            #print(f"start={start}\tstop={stop}\tcount={count}")
        else:
            count = df.iloc[elem+1:next_elem]['count'].median()
            #print(f"start={start}\tstop={stop}\tcount={count}")
        newDF = newDF.append({
            'chr': df.loc[0, 'chr'],
            'start': start,
            'stop': stop,
            'count': count
        }, ignore_index=True)
    else:  # last element in the list
        start = df.iloc[elem]['stop']
        stop = df.iloc[end]['stop']
        count = df.iloc[elem+1:end+1]['count'].median()
        #print(f"start={start}\tstop={stop}\tcount={count}")
        newDF = newDF.append({
            'chr': df.loc[0, 'chr'],
            'start': start,
            'stop': stop,
            'count': count
        }, ignore_index=True)
Is there a better way to do this? Is my code Pythonic?
1 Answer
I'll first offer some critique of your code, and then I'll show you how I would approach the problem.
- Commented-out code should be removed before asking for a code review:
#print(f"start={start}\tstop={stop}\tcount={count}")
- Many of the comments don't add value. For example,
# last position
doesn't mean much on its own. Why do you want the last position? Why doesn't the code do a good enough job of explaining that?
- Generally, an if/else in a loop where one of the branches is taken only once, at either the start or the end, can be removed. You can iterate less and deal with the case explicitly. You can add a sentinel value so you don't have to check whether you are at the end of the iterator. Or you can use the available libraries or built-in functions, which will deal with the case for you.
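To make the sentinel idea concrete, here is a minimal sketch (plain Python with the question's counts hard-coded, so the names here are illustrative, not part of the original code): appending one past-the-end index to the list of count=1 positions lets a single loop body handle every segment, including the last one.

```python
import pandas as pd

counts = [1, 2, 1, 2, 18, 36, 18, 6, 1, 3, 2]
ones = [i for i, n in enumerate(counts) if n == 1]
ones.append(len(counts))  # sentinel: a virtual count=1 row past the end

medians = []
for left, right in zip(ones, ones[1:]):
    segment = counts[left + 1:right]
    if segment:  # adjacent count=1 rows produce an empty segment; skip it
        medians.append(pd.Series(segment).median())

print(medians)  # [2.0, 18.0, 2.5]
```

With the sentinel in place, the else branch for "last element in the list" disappears entirely.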
# load with pandas
df = pd.read_csv(file, sep='\t', header=None)
# set colnames
header = ['chr','start','stop','strand','count']
df.columns = header[:len(df.columns)]
# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]
If the header is cut short (len(df.columns) < len(header)), the first thing to be cut off is the column df['count']. Yet you assume it exists immediately afterwards by using it. Which is it? Will it always exist, or will there sometimes not be enough columns? Erring on the side of it always existing, the code becomes
# load with pandas
df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]
# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]
...
for idx, elem in enumerate(li):
If you are using pandas (or numpy) it is generally not the best to move the data back and forth between the library and Python. You lose most of the efficiency of the library, and the code generally becomes far less readable.
Don't use names like li. It doesn't give any information to the reader. If you have a list of indices, what will you use the list for? That would make a much better name.
Using pandas more, and renaming gives something like
splitting_indices = df.index[df['count'] == 1].tolist()
for idx, elem in enumerate(splitting_indices):
if next_elem - (elem+1) == 1:  # cases where only one position and we cannot compute median
    count = df.iloc[elem+1]['count']
    #print(f"start={start}\tstop={stop}\tcount={count}")
else:
    count = df.iloc[elem+1:next_elem]['count'].median()
Finding this logic in amongst getting the data out from the dataframe is not easy. This is the core logic, and should be treated as such. Put this in a function at the very least.
def extract_median(df, elem, next_elem):
    if next_elem - (elem+1) == 1:  # cases where only one position and we cannot compute median
        count = df.iloc[elem+1]['count']
    else:
        count = df.iloc[elem+1:next_elem]['count'].median()
    return count
Now it should be much more apparent that the comment is bogus. You CAN compute the median of a single-element list. So why are we special-casing this? df.iloc[elem+1:next_elem] works even if next_elem is only one bigger than elem+1.
def extract_median(df, elem, next_elem):
    return df.iloc[elem+1:next_elem]['count'].median()
And now we can see that a function is probably not necessary.
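A quick check confirms the special case is dead weight: Series.median handles a one-element slice without complaint.

```python
import pandas as pd

counts = pd.Series([2, 18, 36, 18, 6])
print(counts.iloc[0:1].median())  # 2.0 -- median of a single element is that element
print(counts.iloc[0:5].median())  # 18.0
```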
The approach I would take to implementing this is to stay in pandas as long as possible. No loops. No tolist(). Since I won't want loops, indices are probably not needed either, so I can limit the usage of iloc and df.index.
First, read in the data
df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
chr start stop strand count
0 chr1 0 13320 - 1
1 chr1 13320 13321 - 2
2 chr1 13321 13328 - 1
3 chr1 13328 13342 - 2
4 chr1 13342 13343 - 18
5 chr1 13343 13344 - 36
6 chr1 13344 13345 - 18
7 chr1 13345 13346 - 6
8 chr1 13346 16923 - 1
9 chr1 16923 16942 - 3
10 chr1 16942 16943 - 2
Then, find every row of interest: everywhere count is not 1.
df['count'] != 1
0 False
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 True
10 True
I want to group all the consecutive rows that are True together. The usual method to group consecutive rows by a column value is
- Keep a running tally.
- Compare each value in the column with the next one.
- If they are the same, don't do anything.
- If they are different, add 1 to a running tally.
- Associate the tally to that value.
- Groupby the tally.
In code
mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
count mask tally
0 1 False 1
1 2 True 2
2 1 False 3
3 2 True 4
4 18 True 4
5 36 True 4
6 18 True 4
7 6 True 4
8 1 False 5
9 3 True 6
10 2 True 6
Grouping then gives
df.groupby(tally).groups
{1: Int64Index([0], dtype='int64'),
2: Int64Index([1], dtype='int64'),
3: Int64Index([2], dtype='int64'),
4: Int64Index([3, 4, 5, 6, 7], dtype='int64'),
5: Int64Index([8], dtype='int64'),
6: Int64Index([9, 10], dtype='int64')}
Since you only want the rows where count is not 1, we can reuse the mask to keep just those rows.
df[mask].groupby(tally).groups
{2: Int64Index([1], dtype='int64'),
4: Int64Index([3, 4, 5, 6, 7], dtype='int64'),
6: Int64Index([9, 10], dtype='int64')}
And finally the median is quick to get from a grouper
df[mask].groupby(tally).median()
start stop count
count
2 13320.0 13321.0 2.0
4 13343.0 13344.0 18.0
6 16932.5 16942.5 2.5
In the end, the code is a lot shorter
df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
df[mask].groupby(tally).median()
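As a self-contained check, the same pipeline can be run against the sample data inlined as a string rather than read from bed_file.bed. One caveat for anyone running this on pandas 2.0 or later: median() no longer silently drops the non-numeric chr column, so the numeric columns are selected explicitly here.

```python
import io
import pandas as pd

data = """chr1\t0\t13320\t-\t1
chr1\t13320\t13321\t-\t2
chr1\t13321\t13328\t-\t1
chr1\t13328\t13342\t-\t2
chr1\t13342\t13343\t-\t18
chr1\t13343\t13344\t-\t36
chr1\t13344\t13345\t-\t18
chr1\t13345\t13346\t-\t6
chr1\t13346\t16923\t-\t1
chr1\t16923\t16942\t-\t3
chr1\t16942\t16943\t-\t2"""

df = pd.read_csv(io.StringIO(data), sep='\t',
                 names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
# select numeric columns explicitly: recent pandas raises on median() over strings
result = df[mask].groupby(tally)[['start', 'stop', 'count']].median()
print(result)
```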
Comment (PIFASTE, Aug 12, 2020): I found the answer using the agg() function!
df[mask].groupby(tally).agg({'start': np.min, 'stop': np.max, 'count': np.median})
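For completeness, a runnable sketch of that agg call (sample data inlined rather than read from a file; string aggregator names stand in for the numpy functions, which avoids the numpy import and a deprecation warning in recent pandas). Taking the min of start and the max of stop gives each segment's full extent, reproducing the question's expected output exactly.

```python
import io
import pandas as pd

data = """chr1\t0\t13320\t-\t1
chr1\t13320\t13321\t-\t2
chr1\t13321\t13328\t-\t1
chr1\t13328\t13342\t-\t2
chr1\t13342\t13343\t-\t18
chr1\t13343\t13344\t-\t36
chr1\t13344\t13345\t-\t18
chr1\t13345\t13346\t-\t6
chr1\t13346\t16923\t-\t1
chr1\t16923\t16942\t-\t3
chr1\t16942\t16943\t-\t2"""

df = pd.read_csv(io.StringIO(data), sep='\t',
                 names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
result = df[mask].groupby(tally).agg({'start': 'min', 'stop': 'max', 'count': 'median'})
print(result)
# segment rows: (13320, 13321, 2.0), (13328, 13346, 18.0), (16923, 16943, 2.5)
```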