Linux: rearrange fields and sort by column

Question 1

I have (tab-separated) input files:

id1 id2 ....
id1 id3 ....
id3 id4 ....
id2 id1 ....
id3 id4 ....
.....

I need to:

1. Rearrange col1 and col2 within each line so that they are in sorted numerical order. For now I do this in a Python script.
2. Sort the lines by col1, then col2. For now I do this by passing the output of the Python script to GNU sort.

My question is the following: is there a way to merge step 1 and step 2 (using GNU sort or any other GNU/Linux command-line tools)?

Is there an efficient alternative GNU/Linux command for step 1?

Expected result:
id1 id2 ....
id1 id2 ....
id1 id3 .... 
id3 id4 ....
id3 id4 ....
.....

My code already works; I am looking to improve its speed.

Here's the Python program:

import argparse
import subprocess
import os

parser = argparse.ArgumentParser(description='')
parser.add_argument('-blast', help='input', required=True)
parser.add_argument('-out', help='output', required=True)
args = parser.parse_args()


def get_tmp():
    # return a name for temporary file.
    dir = os.listdir(".")
    cpt = 0
    name = "tmp_{}".format(cpt)
    while name in dir:
        cpt += 1
        name = "tmp_{}".format(cpt)
    return name


# get a temporary name
tmp_name = get_tmp()
# open inputfile in reading and output in writing
with open(args.blast) as input_blast, open(tmp_name, 'w') as tmp_file:
    for line in input_blast:
        spt = line.strip().split()
        tmp_file.write('\t'.join(sorted(spt[0:2]) + spt[2:]) + '\n')
# sort by field one and two
child = subprocess.Popen("sort -k1 -k2 {} > temps_sort && mv temps_sort {}".format(os.path.abspath(tmp_name), args.out), shell=True)
child.wait()

Question 2

PS: there is no need to write your own temp-file handling. The Python standard library has a tempfile module.
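As a minimal sketch of that suggestion, the hand-written get_tmp() could be replaced by tempfile.NamedTemporaryFile, which picks a unique name itself. The function name write_rearranged and the sample rows are hypothetical, and the column swap is the same plain string sort as in the question's script.

```python
import os
import tempfile


def write_rearranged(lines):
    """Write rows with their first two columns ordered to a temp file.

    Returns the path of the temporary file; the caller must delete it.
    """
    # delete=False keeps the file on disk after the with-block closes it,
    # so that another program (e.g. sort) could read it by name.
    with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
        for line in lines:
            spt = line.split()
            if spt:
                tmp.write("\t".join(sorted(spt[:2]) + spt[2:]) + "\n")
        return tmp.name


# Hypothetical sample rows, only for illustration.
path = write_rearranged(["id2\tid1\tx\n", "id3\tid4\ty\n"])
with open(path) as handle:
    content = handle.read()
os.remove(path)  # clean up explicitly because delete=False was used
print(content, end="")
```

If the file never has to be opened by name from another process, tempfile.TemporaryFile would probably be enough and is removed automatically when closed.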

Question 3

There is no reason to use an external file and command. Python can sort strings just fine, unless the file is too big to fit in memory.

Something like this should work:

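import re


def rearrange_ids(file):
    # IDs are assumed to look like "id<number>"
    pattern = re.compile(r'id(\d+)')
    for line in file:
        spt = line.strip().split()
        if spt:
            # order the first two IDs numerically within the line
            ids = sorted(int(pattern.findall(id_string)[0]) for id_string in spt[0:2])
            yield ids, '\t'.join(['id%i' % i for i in ids] + spt[2:]) + '\n'


# input_file and output_filename are the paths of the input and output files
with open(input_file, 'r') as file, open(output_filename, 'w') as output_file:
    # sort all lines by their (col1, col2) ID pair
    output_lines = sorted(rearrange_ids(file), key=lambda x: x[0])
    output_file.writelines(line for ids, line in output_lines)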

id1 id2 ....
id1 id2 ....
id1 id3 ....
id3 id4 ....
id3 id4 ....
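For the case where the file really is too big to sort in memory, one possible approach (my assumption, not part of the code above) is to sort fixed-size chunks, spill each one to a temporary file, and merge the sorted chunks lazily with heapq.merge. The names rearrange, numeric_key, sort_in_chunks and chunk_size are hypothetical, and the sketch assumes every non-empty line has at least two columns of the form id<digits>.

```python
import heapq
import re
import tempfile
from itertools import islice

ID_PATTERN = re.compile(r"id(\d+)")  # assumes IDs look like "id<digits>"


def rearrange(line):
    """Return the row with its first two IDs in ascending numeric order."""
    spt = line.split()
    ids = sorted(spt[:2], key=lambda s: int(ID_PATTERN.match(s).group(1)))
    return "\t".join(ids + spt[2:]) + "\n"


def numeric_key(line):
    """Sort key: the numeric parts of the first two columns."""
    first, second = line.split()[:2]
    return (int(ID_PATTERN.match(first).group(1)),
            int(ID_PATTERN.match(second).group(1)))


def sort_in_chunks(lines, chunk_size=100_000):
    """Yield rearranged rows in sorted order, sorting one chunk at a time."""
    rows = (rearrange(line) for line in lines if line.strip())
    chunks = []
    while True:
        chunk = sorted(islice(rows, chunk_size), key=numeric_key)
        if not chunk:
            break
        spill = tempfile.TemporaryFile("w+")  # removed automatically on close
        spill.writelines(chunk)
        spill.seek(0)
        chunks.append(spill)
    try:
        # heapq.merge reads the already-sorted chunk files lazily.
        yield from heapq.merge(*chunks, key=numeric_key)
    finally:
        for spill in chunks:
            spill.close()


# Hypothetical sample rows; chunk_size=2 forces two chunks.
sample = ["id10\tid2\ta\n", "id3\tid1\tb\n", "id2\tid1\tc\n"]
result = list(sort_in_chunks(sample, chunk_size=2))
print("".join(result), end="")
```

Each chunk keeps one file handle open until the merge finishes, so chunk_size should be large enough that the number of chunks stays below the operating system's open-file limit.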

Edit

I changed my original algorithm because it did not compare the IDs numerically, so it sorted 10 before 2.
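To illustrate that ordering problem, here is a small sketch with made-up IDs. Compared as strings, "id10" sorts before "id2" because the comparison goes character by character and "1" is less than "2"; comparing the extracted integers gives the intended order.

```python
import re

ids = ["id10", "id2", "id1"]  # made-up sample IDs

# Plain string comparison works character by character.
as_strings = sorted(ids)

# Comparing the numeric part gives the intended order.
as_numbers = sorted(ids, key=lambda s: int(re.search(r"\d+", s).group()))

print(as_strings)  # ['id1', 'id10', 'id2']
print(as_numbers)  # ['id1', 'id2', 'id10']
```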

Maarten Fabré · Accepted Answer · 2017-08-24 10:23:22Z

