
For each section between count=1 rows (including the sections bounded by the start and end of the file), combine the overlapping positions and output the median of the counts.

Input

chr start stop strand count
chr1 0 13320 - 1
chr1 13320 13321 - 2
chr1 13321 13328 - 1
chr1 13328 13342 - 2
chr1 13342 13343 - 18
chr1 13343 13344 - 36
chr1 13344 13345 - 18
chr1 13345 13346 - 6
chr1 13346 16923 - 1
chr1 16923 16942 - 3
chr1 16942 16943 - 2

Output

chr1 13320 13321 2
chr1 13328 13346 18
chr1 16923 16943 2.5

For the second output row:

  • Start 13328 - the 4th row in the table has the start 13328; this is the row after the second count=1.
  • Stop 13346 - the 8th row in the table has the stop 13346; this is the row before the third count=1.
  • Count 18 - the median of the counts from the 4th to the 8th row inclusive (see the worked check below).
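
As a quick sanity check of that last bullet, using only the numbers from the input above, the counts between the two count=1 rows are 2, 18, 36, 18 and 6, whose median is 18:

from statistics import median

# counts of the rows between the second and third count=1 rows (rows 4-8)
section_counts = [2, 18, 36, 18, 6]
print(median(section_counts))  # 18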

Here is my code.

from pathlib import Path
import pandas as pd
file = Path("bed_file.bed")
# load with pandas
df = pd.read_csv(file, sep='\t', header=None)
# set colnames
header = ['chr','start','stop','strand','count']
df.columns = header[:len(df.columns)]
# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]
# create new dataframe
newDF = pd.DataFrame(columns=['chr','start', 'stop', 'count'])
# last position
end = df.index[-1]
# parse dataframe
for idx, elem in enumerate(li):
    if elem != li[-1]:
        next_elem = li[(idx + 1) % len(li)] # next element where count=1
        start = df.iloc[elem]['stop'] # start position
        stop = df.iloc[next_elem-1]['stop'] # stop position
        if next_elem - (elem+1) == 1: # cases where only one position and we cannot compute median
            count = df.iloc[elem+1]['count']
            #print(f"start={start}\tstop={stop}\tcount={count}")
        else:
            count = df.iloc[elem+1:next_elem]['count'].median()
            #print(f"start={start}\tstop={stop}\tcount={count}")
        newDF = newDF.append({
            'chr': df.loc[0,'chr'],
            'start': start,
            'stop': stop,
            'count': count
        }, ignore_index=True)
    else: # last element in the list
        start = df.iloc[elem]['stop']
        stop = df.iloc[end]['stop']
        count = df.iloc[elem+1:end+1]['count'].median()
        #print(f"start={start}\tstop={stop}\tcount={count}")
        newDF = newDF.append({
            'chr': df.loc[0,'chr'],
            'start': start,
            'stop': stop,
            'count': count
        }, ignore_index=True)

Is there a better way to do this? Is my code Pythonic?


1 Answer

I'll first offer some critique of your code, and then I'll show you how I would approach the problem.

  • Commented-out code should be removed before asking for a code review: #print(f"start={start}\tstop={stop}\tcount={count}")
  • Many of the comments don't add value. # last position doesn't mean much on its own. Why do you want the last position? Why doesn't the code do a good enough job explaining that?
  • Generally an if/else in a loop, where one of the branches is only taken once (either at the start or at the end), can be removed. You can iterate over less and deal with that case explicitly, you can add a sentinel value so you don't have to check whether you are at the end of the iterator, or you can use the available libraries or built-in functions, which deal with the case for you. A rough sketch of this follows the list.
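
The sketch below is only an illustration, not a drop-in replacement: it reuses df and the li list of count=1 indices from the question and appends the number of rows as a sentinel boundary, so the last section no longer needs its own branch.

# Sketch: add a sentinel boundary at the end of the data, then every section
# is simply the rows strictly between two consecutive boundaries.
boundaries = li + [len(df)]               # li: indices of the count == 1 rows
for left, right in zip(boundaries, boundaries[1:]):
    section = df.iloc[left + 1:right]
    if section.empty:                     # two count=1 rows next to each other
        continue
    start = df.iloc[left]['stop']
    stop = section['stop'].iloc[-1]
    count = section['count'].median()
    print(start, stop, count)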

# load with pandas
df = pd.read_csv(file, sep='\t', header=None)
# set colnames
header = ['chr','start','stop','strand','count']
df.columns = header[:len(df.columns)]
# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]

If the header is cut short (len(df.columns) < len(header)), the first column to be cut off is df['count']. You then assume it exists straight away by using it on the next line. Which is it? Will it always exist, or will there sometimes not be enough columns? Erring on the side that it always exists, the code becomes

# load with pandas
df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]

# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]
...
for idx, elem in enumerate(li):

If you are using pandas (or numpy) it is generally best not to move the data back and forth between the library and Python. You lose most of the efficiency of the library, and the code generally becomes far less readable.

Don't use names like li. It doesn't give any information to the reader. If you have a list of indices, what will you use the list for? That would make a much better name.

Using pandas more, and renaming gives something like

splitting_indices = df.index[df['count'] == 1].tolist()
for idx, elem in enumerate(splitting_indices):

if next_elem - (elem+1) == 1: # cases where only one position and we cannot compute median
    count = df.iloc[elem+1]['count']
    #print(f"start={start}\tstop={stop}\tcount={count}")
else:
    count = df.iloc[elem+1:next_elem]['count'].median()

Finding this logic in amongst getting the data out from the dataframe is not easy. This is the core logic, and should be treated as such. Put this in a function at the very least.

def extract_median(df, elem, next_elem):
    if next_elem - (elem+1) == 1: # cases where only one position and we cannot compute median
        count = df.iloc[elem+1]['count']
    else:
        count = df.iloc[elem+1:next_elem]['count'].median()
    return count

Now it should be much more apparent that the comment is bogus. You CAN compute the median of a single element list. So why are we special casing this? df.iloc[elem+1:next_elem] works even if next_elem is only one bigger than elem+1.

def extract_median(df, elem, next_elem):
    return df.iloc[elem+1:next_elem]['count'].median()

And now we can see that a function is probably not necessary.


The approach I would take to implementing this is to try to stay in pandas as long as possible. No loops. No tolist. Since I won't want loops, indices are probably not needed either, so I can limit usage of iloc and df.index.

First, read in the data

df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
 chr start stop strand count
0 chr1 0 13320 - 1
1 chr1 13320 13321 - 2
2 chr1 13321 13328 - 1
3 chr1 13328 13342 - 2
4 chr1 13342 13343 - 18
5 chr1 13343 13344 - 36
6 chr1 13344 13345 - 18
7 chr1 13345 13346 - 6
8 chr1 13346 16923 - 1
9 chr1 16923 16942 - 3
10 chr1 16942 16943 - 2

Then, find every row of interest. That would be everywhere count is not 1.

df['count'] != 1
0 False
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 True
10 True

I want to group all the consecutive rows that are True together. The usual method to group consecutive rows by a column value is

  1. Keep a running tally.
  2. Compare each value in the column with the previous one.
  3. If they are the same, don't do anything.
  4. If they are different, add 1 to the running tally.
  5. Associate the tally with that value.
  6. Group by the tally.

In code

mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
 count mask tally
0 1 False 1
1 2 True 2
2 1 False 3
3 2 True 4
4 18 True 4
5 36 True 4
6 18 True 4
7 6 True 4
8 1 False 5
9 3 True 6
10 2 True 6

Grouping then gives

df.groupby(tally).groups
{1: Int64Index([0], dtype='int64'),
 2: Int64Index([1], dtype='int64'),
 3: Int64Index([2], dtype='int64'),
 4: Int64Index([3, 4, 5, 6, 7], dtype='int64'),
 5: Int64Index([8], dtype='int64'),
 6: Int64Index([9, 10], dtype='int64')}

Since you only want the rows where count is not 1, we can reuse the mask to filter out the count=1 groups.

df[mask].groupby(tally).groups
{2: Int64Index([1], dtype='int64'),
 4: Int64Index([3, 4, 5, 6, 7], dtype='int64'),
 6: Int64Index([9, 10], dtype='int64')}

And finally the median is quick to get from a grouper

df[mask].groupby(tally).median()
 start stop count
count 
2 13320.0 13321.0 2.0
4 13343.0 13344.0 18.0
6 16932.5 16942.5 2.5

In the end, the code is a lot shorter

df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
df[mask].groupby(tally).median()
  • I found the answer using the agg() function! df[mask].groupby(tally).agg({'start': np.min, 'stop': np.max, 'count': np.median})
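
For completeness, here is a sketch of how that agg() call slots into the answer's code to reproduce the expected output from the question. The 'first' aggregation for chr is an addition here, and string function names are used in place of the numpy functions from the comment (both are accepted by pandas):

from pathlib import Path
import pandas as pd

file = Path("bed_file.bed")  # same input file as in the question
df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
# min start, max stop and median count for each run of consecutive non-1 rows
result = df[mask].groupby(tally).agg(
    {'chr': 'first', 'start': 'min', 'stop': 'max', 'count': 'median'}
)
print(result.to_string(index=False))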
