Compare sequence & maps headers in fasta file

Question 1

This is the perl code which compares the sequence in fasta file & maps the header. Though the code is working well, I still would like to make it more efficient. Since the files I compare has>100000 sequences it is taking huge time despite using hashes. Can you please suggest more efficient way ?

Example:

File1:
>Seq1
ABCDEFGH
>Seq2
RNADIMSEQ
>Seq6
XYZ
File2:
>Seq3
ABCDEFGH
>Seq4
RNADIMSEQ
Output:
>Seq1 >Seq3
>Seq2 >Seq4
>Seq6 Not found

The code: my $start_run = time();

%hash=();
open(out, ">Output.txt");
open(sbjt, "File1.fasta") or die "File not found"; #Bigger file
$count =0;
while(<sbjt>)
 {
 chomp;
 if($_ =~ m/^\w+/) 
 {
 $hash{$previous} = $_ ;
 #print "$previous\n";
 }
 else
 {
 $previous = $_;
 }
 $count++ ; 
 }
 close sbjt;
 #print "$hash{$previous}";
%dash=();
$previous = undef;
open(query, "File2.fasta") or die "File not found"; #smaller
while(<query>)
 {
 chomp;
 if($_ =~ m/^\w+/) 
 {
 $dash{$previous} = $_ ;
 #print "$previous\n";
 }
 else
 {
 $previous = $_;
 }
 }
 close query;
foreach $key (keys %dash)
{
 foreach $temp (keys %hash)
 {
 if($hash{$temp} eq $dash{$key})
 {
 push(@new_array, $temp);
 }
 next; #added
 }
 if(scalar @new_array>0)
 { 
 print out "$key\t@new_array\n";
 }
 else
 {
 print out "$key\tNot found\n";
 }
 @new_array=();
 }
 my $end_run = time();
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";

Question 2

Obvious things: use strict and warnings. Use three argument form of open, use lexical file handles, check the success of open or use autodie.

When I run the program, Seq6 is not reported in the output. That's because the outer loop exhausts File2, not File1.

#!/usr/bin/perl
use warnings;
use strict;
my $start_run = time;
my %hash;
my $previous;
open my $SBJ, '<', 'File1.fasta' or die 'File not found';
while (<$SBJ>) {
 chomp;
 if (/^\w+/) {
 $hash{$previous} = $_;
 } else {
 $previous = $_;
 }
}
close $SBJ;
my %dash;
undef $previous;
open my $QUERY, '<', 'File2.fasta' or die 'File not found';
while (<$QUERY>) {
 chomp;
 if (/^\w+/) {
 $dash{$previous} = $_;
 } else {
 $previous = $_;
 }
}
close $QUERY;
open my $OUT, '>', 'Output.txt' or die $!;
my @new_array;
for my $temp (keys %hash) {
 for my $key (keys %dash) {
 if($hash{$temp} eq $dash{$key}) {
 push @new_array, $key;
 }
 }
 if (@new_array > 0) {
 print {$OUT} "$temp\t@new_array\n";
 } else {
 print {$OUT} "$temp\tNot found\n";
 }
 @new_array = ();
}
close $OUT or die $!;
my $end_run = time;
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";

If performance is a problem, you can try trading memory for speed: Hash by the sequences, not by their names:

#!/usr/bin/perl
use warnings;
use strict;
my $start_run = time;
my %hash;
open my $SBJ, '<', 'File1.fasta' or die 'File not found';
my $id;
while (<$SBJ>) {
 chomp;
 if (/^>/) {
 $id = $_;
 } else {
 $hash{$_} = $id;
 }
}
close $SBJ;
my %dash;
undef $id;
open my $QUERY, '<', 'File2.fasta' or die 'File not found';
while (<$QUERY>) {
 chomp;
 if (/^>/) {
 $id = $_;
 } else {
 $dash{$_} = $id;
 }
}
close $QUERY;
open my $OUT, '>', 'Output.txt' or die $!;
for my $seq1 (keys %hash) {
 if (exists $dash{$seq1}) {
 print {$OUT} "$hash{$seq1}\t$dash{$seq1}\n";
 } else {
 print {$OUT} "$hash{$seq1}\tNot found\n";
 }
}
close $OUT or die $!;
my $end_run = time;
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";

Reading the files is exactly the same code with just the hash name different. I'd refactor it into a subroutine to keep the code DRY (Don't Repeat Yourself):

sub read_fasta {
 my $filename = shift;
 my ($id, %hash);
 open my $FH, '<', $filename or die "Can't open $filename: $!";
 while (<$FH>) {
 chomp;
 if (/^>/) {
 $id = $_;
 } else {
 $hash{$_} = $id;
 }
 }
 return %hash
}
my %hash = read_fasta('File1.fasta');
my %dash = read_fasta('File2.fasta');

Question 3

Wow Great! thanks. Now it took only 7 seconds to parse files with 867690 & 938976 sequences respectively !!

choroba choroba 1,4139 silver badges11 bronze badges · Answer 1 · 2016-02-12 10:44:18Z

Obvious things: use strict and warnings. Use three argument form of open, use lexical file handles, check the success of open or use autodie.

When I run the program, Seq6 is not reported in the output. That's because the outer loop exhausts File2, not File1.

#!/usr/bin/perl
use warnings;
use strict;
my $start_run = time;
my %hash;
my $previous;
open my $SBJ, '<', 'File1.fasta' or die 'File not found';
while (<$SBJ>) {
 chomp;
 if (/^\w+/) {
 $hash{$previous} = $_;
 } else {
 $previous = $_;
 }
}
close $SBJ;
my %dash;
undef $previous;
open my $QUERY, '<', 'File2.fasta' or die 'File not found';
while (<$QUERY>) {
 chomp;
 if (/^\w+/) {
 $dash{$previous} = $_;
 } else {
 $previous = $_;
 }
}
close $QUERY;
open my $OUT, '>', 'Output.txt' or die $!;
my @new_array;
for my $temp (keys %hash) {
 for my $key (keys %dash) {
 if($hash{$temp} eq $dash{$key}) {
 push @new_array, $key;
 }
 }
 if (@new_array > 0) {
 print {$OUT} "$temp\t@new_array\n";
 } else {
 print {$OUT} "$temp\tNot found\n";
 }
 @new_array = ();
}
close $OUT or die $!;
my $end_run = time;
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";

If performance is a problem, you can try trading memory for speed: Hash by the sequences, not by their names:

#!/usr/bin/perl
use warnings;
use strict;
my $start_run = time;
my %hash;
open my $SBJ, '<', 'File1.fasta' or die 'File not found';
my $id;
while (<$SBJ>) {
 chomp;
 if (/^>/) {
 $id = $_;
 } else {
 $hash{$_} = $id;
 }
}
close $SBJ;
my %dash;
undef $id;
open my $QUERY, '<', 'File2.fasta' or die 'File not found';
while (<$QUERY>) {
 chomp;
 if (/^>/) {
 $id = $_;
 } else {
 $dash{$_} = $id;
 }
}
close $QUERY;
open my $OUT, '>', 'Output.txt' or die $!;
for my $seq1 (keys %hash) {
 if (exists $dash{$seq1}) {
 print {$OUT} "$hash{$seq1}\t$dash{$seq1}\n";
 } else {
 print {$OUT} "$hash{$seq1}\tNot found\n";
 }
}
close $OUT or die $!;
my $end_run = time;
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";

Reading the files is exactly the same code with just the hash name different. I'd refactor it into a subroutine to keep the code DRY (Don't Repeat Yourself):

sub read_fasta {
 my $filename = shift;
 my ($id, %hash);
 open my $FH, '<', $filename or die "Can't open $filename: $!";
 while (<$FH>) {
 chomp;
 if (/^>/) {
 $id = $_;
 } else {
 $hash{$_} = $id;
 }
 }
 return %hash
}
my %hash = read_fasta('File1.fasta');
my %dash = read_fasta('File2.fasta');

Wow Great! thanks. Now it took only 7 seconds to parse files with 867690 & 938976 sequences respectively !!

Stack Exchange Network

Compare sequence & maps headers in fasta file

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Compare sequence & maps headers in fasta file

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions