Compare a sequence with the reference frequency of hexamers

Question 1

I have written this function (and others similar to that one) But I am not sure I am using references on their full power.

My currently concerns is if I make a huge use of memory. The subroutine recieve a reference to two files, the subroutines, like these one return a hash (except &log_time, which returns a scalar). The subroutines expect a reference, thats why I use my %current_seq = ($id_name[0] => $seqs[$j]); my %freq_seq = &hexamer_freq(\%current_seq); on the subroutine, I don't think this is a good way to do it, but I can't imagine a better way to do it.

sub comparer{
 # Compare a sequence with the reference frequency of hexamers
 # First argument the file of to analyse, second argument the file of reference
 my %score;
 #Reading arguments
 my $seq = shift;
 my $ref_seq = shift;
 # Calculating the reference log2
 my %ref_seq = &read_fasta($$ref_seq);
 my %freq_ref = &hexamer_freq(\%ref_seq);
 # Counting hexamers and frequencies for each sequence
 my %seqs = &read_fasta($seq);
 while( my ($id, $sequen) = each %seqs){
 my @id_seq = split(/\s+/, $id);
 my @id_name = split(/\./, $id_seq[0]);
 my $max = 0;
 my $min = 999;
 for (my $i = 0; $i < 3; $i++){
 last if length $sequen <= $i;
 my $sequ = substr($sequen, $i);
 next unless (defined $sequ);
 my $rev_sequ = reverse($sequ);
 my @seqs = ($sequ, $rev_sequ);
 for (my $j = 0; $j < scalar(@seqs); $j++){
 my %current_seq = ($id_name[0] => $seqs[$j]);
 my %freq_seq = &hexamer_freq(\%current_seq);
 # Handle the sequences that are too short to contain an hexamer
 if (scalar keys %freq_seq == 0){
 print STDERR &log_time(), "Unable to calculate the Hexamer score of $id_name[0]\n";
 next;
 };
 # Calculate the hexamer score
 my $score = 0;
 my $n_hexamers = scalar keys %freq_seq;
 foreach my $hex (keys %freq_seq){
 if (defined $freq_ref{$hex}){
 $score += log2($freq_seq{$hex}/$freq_ref{$hex});
 }
 };
 # Store the two possible candidates of "best score"
 if ($score/$n_hexamers > $max){
 $max = $score/$n_hexamers;
 };
 unless ($score/$n_hexamers > $min) {
 $min = $score/$n_hexamers
 }
 # Store the data for each sequence
 my $key = $id_name[0] . " frame: $i";
 $key .= " FWD" if $j == 0; # The fwd + or - have the same hexamers
 $key .= " REV" if $j == 1;
 $score{$key} = [$max, $min];
 };
 };
 }
 return %score;
};

Besides, I would like to improve the way the $min is calculated. Now the 999 is an arbitrary number, I expect it won't be reached, because the $freq_ are numbers between 0 and 1, and it is unlikely to have so big numbers (but it may happen).

Question 2

I removed your comments and inserted mine.

# Documentation goes to POD.
=item comparer
Compare a sequence with the reference frequency of hexamers.
First argument the file of to analyse, second argument the file of reference
=cut
sub comparer {
 my ($seq, $ref_seq) = @_; # I like arguments being processed as the first step in the sub. No need to shift twice.
 my %score;
 my %freq_ref = hexamer_freq({ read_fasta($$ref_seq) });
 my %seqs = read_fasta($seq);
 while (my ($id, $sequen) = each %seqs) {
 my @id_seq = split ' ', $id;
 my @id_name = split /\./, $id_seq[0];
 my ($max, $min);
 for my $i (0 .. 2) { # No need for a C-style for.
 last if length $sequen <= $i;
 my $sequ = substr $sequen, $i;
 next unless defined $sequ;
 my $rev_sequ = reverse $sequ;
 my @seqs = ($sequ, $rev_sequ);
 for my $j (0 .. $#seqs) { # C-style eliminated again.
 my %freq_seq = hexamer_freq({ $id_name[0] => $seqs[$j] }); # Anonymous hash.
 if (keys %freq_seq == 0) {
 print STDERR log_time(), "Unable to calculate the Hexamer score of $id_name[0]\n";
 next
 }
 my $score = 0;
 my $n_hexamers = keys %freq_seq; # "scalar" not needed in scalar context.
 for my $hex (keys %freq_seq){
 if (defined $freq_ref{$hex}){
 $score += log2($freq_seq{$hex} / $freq_ref{$hex});
 }
 }
 if (! defined $max || $score / $n_hexamers > $max) {
 $max = $score / $n_hexamers;
 }
 if (!defined $min || $score / $n_hexamers <= $min) {
 $min = $score / $n_hexamers
 }
 my $key = $id_name[0] . " frame: $i";
 $key .= (' FWD', ' REV')[$j] if $j < 2; # Poor man's "switch".
 $score{$key} = [$max, $min];
 }
 }
 }
 return %score
}

There's no need to call subroutines with the & prepended.
Semicolons after blocks are not needed.
To avoid guessing the minimum, just use undef and check for that in the condition. You could also try 'INF', but it's not portable.
Why is the subroutine's second parameter a scalar reference? Is it a very long string?

Question 3

I know I don't need & but it is easier for me to identify them (it's my second month using perl :D ) The second parameter is not a long string (~20 characters), but I thought it would be better to use reference than passing it directly. I didn't know it is C-style the way I do the loops... So in general rather than using references if I should rather use anonymous hash/arrays? Many thanks

Question 4

Anonymous array is a reference to an array without a variable keeping the array. If you want to pass it to a subroutine, you must use a reference in the sub.

Question 5

& for subroutine calls is a holdover from Perl 4, no longer needed in Perl 5. I also recommend dropping it, as Perl is already a punctuation-heavy language.

Question 6

@200_success Ok, I'll try to avoid it. Thanks for the edit btw.

choroba chorobachoroba 1,4139 silver badges11 bronze badges · Accepted Answer · 2015-11-12 09:00:58Z

I removed your comments and inserted mine.

# Documentation goes to POD.
=item comparer
Compare a sequence with the reference frequency of hexamers.
First argument the file of to analyse, second argument the file of reference
=cut
sub comparer {
 my ($seq, $ref_seq) = @_; # I like arguments being processed as the first step in the sub. No need to shift twice.
 my %score;
 my %freq_ref = hexamer_freq({ read_fasta($$ref_seq) });
 my %seqs = read_fasta($seq);
 while (my ($id, $sequen) = each %seqs) {
 my @id_seq = split ' ', $id;
 my @id_name = split /\./, $id_seq[0];
 my ($max, $min);
 for my $i (0 .. 2) { # No need for a C-style for.
 last if length $sequen <= $i;
 my $sequ = substr $sequen, $i;
 next unless defined $sequ;
 my $rev_sequ = reverse $sequ;
 my @seqs = ($sequ, $rev_sequ);
 for my $j (0 .. $#seqs) { # C-style eliminated again.
 my %freq_seq = hexamer_freq({ $id_name[0] => $seqs[$j] }); # Anonymous hash.
 if (keys %freq_seq == 0) {
 print STDERR log_time(), "Unable to calculate the Hexamer score of $id_name[0]\n";
 next
 }
 my $score = 0;
 my $n_hexamers = keys %freq_seq; # "scalar" not needed in scalar context.
 for my $hex (keys %freq_seq){
 if (defined $freq_ref{$hex}){
 $score += log2($freq_seq{$hex} / $freq_ref{$hex});
 }
 }
 if (! defined $max || $score / $n_hexamers > $max) {
 $max = $score / $n_hexamers;
 }
 if (!defined $min || $score / $n_hexamers <= $min) {
 $min = $score / $n_hexamers
 }
 my $key = $id_name[0] . " frame: $i";
 $key .= (' FWD', ' REV')[$j] if $j < 2; # Poor man's "switch".
 $score{$key} = [$max, $min];
 }
 }
 }
 return %score
}

There's no need to call subroutines with the & prepended.
Semicolons after blocks are not needed.
To avoid guessing the minimum, just use undef and check for that in the condition. You could also try 'INF', but it's not portable.
Why is the subroutine's second parameter a scalar reference? Is it a very long string?

I know I don't need & but it is easier for me to identify them (it's my second month using perl :D ) The second parameter is not a long string (~20 characters), but I thought it would be better to use reference than passing it directly. I didn't know it is C-style the way I do the loops... So in general rather than using references if I should rather use anonymous hash/arrays? Many thanks
Anonymous array is a reference to an array without a variable keeping the array. If you want to pass it to a subroutine, you must use a reference in the sub.
& for subroutine calls is a holdover from Perl 4, no longer needed in Perl 5. I also recommend dropping it, as Perl is already a punctuation-heavy language.
@200_success Ok, I'll try to avoid it. Thanks for the edit btw.

Stack Exchange Network

Compare a sequence with the reference frequency of hexamers

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Compare a sequence with the reference frequency of hexamers

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions