I have (tab-separated) input files:
id1 id2 ....
id1 id3 ....
id3 id4 ....
id2 id1 ....
id3 id4 ....
.....
I need to:
- rearrange col1 and col2 by numerical sort. For now I do this in a Python script.
- sort by col1, then col2. For now I am doing this by taking the output of the Python script and using GNU sort.
My question is the following: is there a way to merge step 1 and step 2 (using GNU sort or any other GNU/Linux command-line tools)?
Is there an efficient alternative GNU/Linux command for step 1?
Expected result:
id1 id2 ....
id1 id2 ....
id1 id3 ....
id3 id4 ....
id3 id4 ....
.....
My code actually works; I am looking to improve its speed.
Here's the Python program:
import argparse
import subprocess
import os

parser = argparse.ArgumentParser(description='')
parser.add_argument('-blast', help='input', required=True)
parser.add_argument('-out', help='output', required=True)
args = parser.parse_args()


def get_tmp():
    # return a name for temporary file.
    dir = os.listdir(".")
    cpt = 0
    name = "tmp_{}".format(cpt)
    while name in dir:
        cpt += 1
        name = "tmp_{}".format(cpt)
    return name


# get a temporary name
tmp_name = get_tmp()
# open inputfile in reading and output in writing
with open(args.blast) as input_blast, open(tmp_name, 'w') as tmp_file:
    for line in input_blast:
        spt = line.strip().split()
        tmp_file.write('\t'.join(sorted(spt[0:2]) + spt[2:]) + '\n')
# sort by field one and two
child = subprocess.Popen("sort -k1 -k2 {} > temps_sort && mv temps_sort {}".format(os.path.abspath(tmp_name), args.out),shell=True)
child.wait()
PS: there is no need to create your own temp-file handling. The Python standard library has a tempfile module. – Maarten Fabré, Aug 24, 2017 at 10:24
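A minimal sketch of what this comment suggests, assuming the same whitespace-split columns as the script in the question. The function name rearrange_to_tempfile is hypothetical; tempfile.NamedTemporaryFile picks a unique file name by itself, so a helper like get_tmp() is not needed.

```python
import tempfile


def rearrange_to_tempfile(input_path):
    """Write every line with its first two columns reordered to a temp file.

    Returns the path of the temp file; the caller has to delete it.
    """
    # delete=False keeps the file on disk after the with-block closes it,
    # so an external command such as sort can still read it afterwards.
    with open(input_path) as src, tempfile.NamedTemporaryFile(
            mode='w', suffix='.tsv', delete=False) as tmp:
        for line in src:
            spt = line.strip().split()
            if spt:  # skip blank lines
                # same string-based ordering as the script in the question
                tmp.write('\t'.join(sorted(spt[0:2]) + spt[2:]) + '\n')
    return tmp.name
```

The returned path can then be handed to the existing sort call in place of tmp_name, and removed with os.remove() once sort has finished.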
1 Answer
There is no reason to use an external file and command. Python can sort the lines itself just fine, unless the file is too big to fit in memory.
Something like this should work:
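import re


def rearrange_ids(file):
    # capture the numeric part of identifiers such as "id12"
    pattern = re.compile(r'id(\d+)')
    for line in file:
        spt = line.strip().split()
        if spt:  # skip blank lines
            # order the first two ids numerically, so that 2 comes before 10
            ids = sorted(int(pattern.findall(id_string)[0]) for id_string in spt[0:2])
            yield ids, '\t'.join(['id%i' % i for i in ids] + spt[2:]) + '\n'


# input_file and output_filename are the paths of the input and output files
with open(input_file, 'r') as file, open(output_filename, 'w') as output_file:
    output_lines = sorted(rearrange_ids(file), key=lambda x: x[0])
    output_file.writelines(line for ids, line in output_lines)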
id1 id2 ....
id1 id2 ....
id1 id3 ....
id3 id4 ....
id3 id4 ....
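For the case where the file is too big to sort in memory, one possible approach is to sort fixed-size chunks, write each to a temporary file, and merge them with heapq.merge. This is only a sketch under assumptions that are not in the question: every non-blank line has at least two columns, both of the form id followed by digits, and the number of chunks stays below the open-file limit of the system. The names normalise, line_key, sort_large_file and chunk_size are hypothetical.

```python
import heapq
import itertools
import re
import tempfile

ID_PATTERN = re.compile(r'id(\d+)')


def normalise(line):
    """Return the line with its first two ids in numerical order."""
    spt = line.split()
    ids = sorted(int(ID_PATTERN.match(s).group(1)) for s in spt[0:2])
    return '\t'.join(['id%i' % i for i in ids] + spt[2:]) + '\n'


def line_key(line):
    """Numeric (col1, col2) sort key of an already normalised line."""
    first, second = line.split('\t', 2)[:2]
    return int(first[2:]), int(second[2:])


def sort_large_file(input_path, output_path, chunk_size=100000):
    """Sort chunk_size lines at a time, then merge the sorted chunks."""
    chunks = []
    try:
        with open(input_path) as src:
            while True:
                raw = list(itertools.islice(src, chunk_size))
                if not raw:
                    break
                block = sorted((normalise(l) for l in raw if l.strip()),
                               key=line_key)
                # anonymous temp file, removed automatically when closed
                tmp = tempfile.TemporaryFile(mode='w+')
                tmp.writelines(block)
                tmp.seek(0)
                chunks.append(tmp)
        with open(output_path, 'w') as out:
            # heapq.merge reads the sorted chunks lazily, line by line
            out.writelines(heapq.merge(*chunks, key=line_key))
    finally:
        for tmp in chunks:
            tmp.close()
```

Only one chunk is held in memory at a time, at the cost of writing the data to disk once more. The key argument of heapq.merge requires Python 3.5 or later.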
edit
I changed my original algorithm because it did not keep the numerical order when comparing, and sorted 10 before 2.
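A small illustration of that difference, using made-up ids; the helper name numeric_id is hypothetical:

```python
import re

ids = ['id10', 'id2', 'id1']

# Strings are compared character by character, so "id10" sorts before "id2".
print(sorted(ids))  # ['id1', 'id10', 'id2']


def numeric_id(id_string):
    """Return the integer part of an identifier such as 'id10'."""
    return int(re.match(r'id(\d+)', id_string).group(1))


# Comparing the extracted integers gives the numerical order.
print(sorted(ids, key=numeric_id))  # ['id1', 'id2', 'id10']
```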