\$\begingroup\$

I have (tab-separated) input files:

id1 id2 ....
id1 id3 ....
id3 id4 ....
id2 id1 ....
id3 id4 ....
.....

I need to

  1. Rearrange col1 and col2 within each line so that they are in numerical order. For now I do this in a Python script.
  2. Sort the lines by col1, then by col2. For now I do this by running GNU sort on the output of the Python script.

My question is the following: is there a way to merge step 1 and step 2 (using GNU sort or any other GNU/Linux command-line tools)?

Is there an efficient alternative GNU/Linux command for step 1?

result:
id1 id2 ....
id1 id2 ....
id1 id3 .... 
id3 id4 ....
id3 id4 ....
.....

My code actually works, I am looking to improve its speed.

Here's the Python program:

import argparse
import subprocess
import os

parser = argparse.ArgumentParser(description='')
parser.add_argument('-blast', help='input', required=True)
parser.add_argument('-out', help='output', required=True)
args = parser.parse_args()

def get_tmp():
    # return a name for temporary file.
    dir = os.listdir(".")
    cpt = 0
    name = "tmp_{}".format(cpt)
    while name in dir:
        cpt += 1
        name = "tmp_{}".format(cpt)
    return name

# get a temporary name
tmp_name = get_tmp()
# open inputfile in reading and output in writing
with open(args.blast) as input_blast, open(tmp_name, 'w') as tmp_file:
    for line in input_blast:
        spt = line.strip().split()
        tmp_file.write('\t'.join(sorted(spt[0:2]) + spt[2:]) + '\n')
# sort by field one and two
child = subprocess.Popen("sort -k1 -k2 {} > temps_sort && mv temps_sort {}".format(os.path.abspath(tmp_name), args.out), shell=True)
child.wait()
Toby Speight
asked Aug 24, 2017 at 9:11
\$\endgroup\$
  • \$\begingroup\$ PS: there is no need to write your own temp-file handling. The Python standard library has a tempfile module. \$\endgroup\$ Commented Aug 24, 2017 at 10:24
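
To illustrate the comment above, here is a minimal sketch of how the hand-written get_tmp() could be replaced with tempfile.NamedTemporaryFile. The function name write_normalized and the prefix/suffix values are made up for this example and are not from the original post; the per-line logic mirrors the question's script.

```python
import os
import tempfile


def write_normalized(lines, directory="."):
    """Write each line with its first two fields sorted; return the temp path.

    Hypothetical helper for illustration only.
    """
    # delete=False keeps the file on disk after the with-block, so a later
    # step (for example an external sort) can still read it.
    with tempfile.NamedTemporaryFile(
        mode="w", dir=directory, prefix="tmp_", suffix=".tsv", delete=False
    ) as tmp_file:
        for line in lines:
            fields = line.split()
            if fields:
                tmp_file.write("\t".join(sorted(fields[:2]) + fields[2:]) + "\n")
        return tmp_file.name


path = write_normalized(["id2\tid1\tx\n", "id3\tid4\ty\n"])
with open(path) as handle:
    content = handle.read()
os.remove(path)  # with delete=False the caller has to clean up
```

The module picks a name that does not collide with existing files, which removes the need to scan the directory listing by hand.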

1 Answer

\$\begingroup\$

There is no reason to use an external file and command. Python can sort the lines itself just fine, unless the file is too big to fit in memory.

Something like this should work:

import re

def rearrange_ids(file):
    # capture the numeric part of an ID such as 'id12'
    pattern = re.compile(r'id(\d+)')
    for line in file:
        spt = line.strip().split()
        if spt:  # skip empty lines
            # compare the IDs as integers, so that id2 sorts before id10
            ids = sorted(int(pattern.findall(id_string)[0]) for id_string in spt[0:2])
            yield ids, '\t'.join(['id%i' % i for i in ids] + spt[2:]) + '\n'

# input_file and output_filename are the paths to read from and write to
with open(input_file, 'r') as file, open(output_filename, 'w') as output_file:
    output_lines = sorted(rearrange_ids(file), key=lambda x: x[0])
    output_file.writelines(line for ids, line in output_lines)

Result:

id1 id2 ....
id1 id2 ....
id1 id3 ....
id3 id4 ....
id3 id4 ....
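
For the case where the file really is too big to sort in memory, one possible approach (not part of the original answer) is to sort fixed-size chunks, write each to a temporary file, and merge them lazily with heapq.merge. The sketch below assumes the lines are already rearranged, tab-separated, newline-terminated, and that every ID contains a number; the names sort_key, external_sort and chunk_size are made up for this example.

```python
import heapq
import os
import re
import tempfile

ID_NUMBER = re.compile(r"\d+")


def sort_key(line):
    """Numeric key built from the first two tab-separated fields."""
    first, second = line.split("\t", 2)[:2]
    return (int(ID_NUMBER.search(first).group()),
            int(ID_NUMBER.search(second).group()))


def external_sort(lines, chunk_size=100_000):
    """Yield lines in key order, holding at most chunk_size lines in memory."""
    chunk_paths = []
    chunk = []

    def flush():
        # Sort the current chunk and park it in its own temporary file.
        chunk.sort(key=sort_key)
        with tempfile.NamedTemporaryFile("w", delete=False) as tmp_file:
            tmp_file.writelines(chunk)
        chunk_paths.append(tmp_file.name)
        chunk.clear()

    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    # heapq.merge reads the sorted chunk files lazily, one line at a time.
    handles = [open(path) for path in chunk_paths]
    try:
        yield from heapq.merge(*handles, key=sort_key)
    finally:
        for handle in handles:
            handle.close()
        for path in chunk_paths:
            os.remove(path)
```

This keeps one open file per chunk, so with a very large input and a small chunk_size it could hit the operating system's limit on open files. GNU sort does a similar chunk-and-merge internally, so this is only worth it if staying in pure Python matters.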

edit

I changed my original algorithm because it did not order the IDs numerically when comparing them, and sorted 10 before 2.
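
A small illustration of the difference, using made-up sample IDs: plain string comparison works character by character, so 'id10' sorts before 'id2', while a key that extracts the number gives the numeric order. The helper name id_number is only for this example.

```python
import re

ids = ["id10", "id2", "id1"]

# String comparison: '1' < '2', so "id10" lands before "id2".
as_strings = sorted(ids)


def id_number(identifier):
    """Return the first run of digits in the identifier as an int."""
    return int(re.search(r"\d+", identifier).group())


# Numeric comparison on the extracted number.
as_numbers = sorted(ids, key=id_number)

print(as_strings)
print(as_numbers)
```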

answered Aug 24, 2017 at 10:23
\$\endgroup\$
