2
\$\begingroup\$

I have written this function (and others similar to that one) But I am not sure I am using references on their full power.

My currently concerns is if I make a huge use of memory. The subroutine recieve a reference to two files, the subroutines, like these one return a hash (except &log_time, which returns a scalar). The subroutines expect a reference, thats why I use my %current_seq = ($id_name[0] => $seqs[$j]); my %freq_seq = &hexamer_freq(\%current_seq); on the subroutine, I don't think this is a good way to do it, but I can't imagine a better way to do it.

sub comparer{
 # Compare a sequence with the reference frequency of hexamers
 # First argument the file of to analyse, second argument the file of reference
 my %score;
 #Reading arguments
 my $seq = shift;
 my $ref_seq = shift;
 # Calculating the reference log2
 my %ref_seq = &read_fasta($$ref_seq);
 my %freq_ref = &hexamer_freq(\%ref_seq);
 # Counting hexamers and frequencies for each sequence
 my %seqs = &read_fasta($seq);
 while( my ($id, $sequen) = each %seqs){
 my @id_seq = split(/\s+/, $id);
 my @id_name = split(/\./, $id_seq[0]);
 my $max = 0;
 my $min = 999;
 for (my $i = 0; $i < 3; $i++){
 last if length $sequen <= $i;
 my $sequ = substr($sequen, $i);
 next unless (defined $sequ);
 my $rev_sequ = reverse($sequ);
 my @seqs = ($sequ, $rev_sequ);
 for (my $j = 0; $j < scalar(@seqs); $j++){
 my %current_seq = ($id_name[0] => $seqs[$j]);
 my %freq_seq = &hexamer_freq(\%current_seq);
 # Handle the sequences that are too short to contain an hexamer
 if (scalar keys %freq_seq == 0){
 print STDERR &log_time(), "Unable to calculate the Hexamer score of $id_name[0]\n";
 next;
 };
 # Calculate the hexamer score
 my $score = 0;
 my $n_hexamers = scalar keys %freq_seq;
 foreach my $hex (keys %freq_seq){
 if (defined $freq_ref{$hex}){
 $score += log2($freq_seq{$hex}/$freq_ref{$hex});
 }
 };
 # Store the two possible candidates of "best score"
 if ($score/$n_hexamers > $max){
 $max = $score/$n_hexamers;
 };
 unless ($score/$n_hexamers > $min) {
 $min = $score/$n_hexamers
 }
 # Store the data for each sequence
 my $key = $id_name[0] . " frame: $i";
 $key .= " FWD" if $j == 0; # The fwd + or - have the same hexamers
 $key .= " REV" if $j == 1;
 $score{$key} = [$max, $min];
 };
 };
 }
 return %score;
};

Besides, I would like to improve the way the $min is calculated. Now the 999 is an arbitrary number, I expect it won't be reached, because the $freq_ are numbers between 0 and 1, and it is unlikely to have so big numbers (but it may happen).

200_success
145k22 gold badges190 silver badges478 bronze badges
asked Nov 12, 2015 at 8:15
\$\endgroup\$

1 Answer 1

3
\$\begingroup\$

I removed your comments and inserted mine.

# Documentation goes to POD.
=item comparer
Compare a sequence with the reference frequency of hexamers.
First argument the file of to analyse, second argument the file of reference
=cut
sub comparer {
 my ($seq, $ref_seq) = @_; # I like arguments being processed as the first step in the sub. No need to shift twice.
 my %score;
 my %freq_ref = hexamer_freq({ read_fasta($$ref_seq) });
 my %seqs = read_fasta($seq);
 while (my ($id, $sequen) = each %seqs) {
 my @id_seq = split ' ', $id;
 my @id_name = split /\./, $id_seq[0];
 my ($max, $min);
 for my $i (0 .. 2) { # No need for a C-style for.
 last if length $sequen <= $i;
 my $sequ = substr $sequen, $i;
 next unless defined $sequ;
 my $rev_sequ = reverse $sequ;
 my @seqs = ($sequ, $rev_sequ);
 for my $j (0 .. $#seqs) { # C-style eliminated again.
 my %freq_seq = hexamer_freq({ $id_name[0] => $seqs[$j] }); # Anonymous hash.
 if (keys %freq_seq == 0) {
 print STDERR log_time(), "Unable to calculate the Hexamer score of $id_name[0]\n";
 next
 }
 my $score = 0;
 my $n_hexamers = keys %freq_seq; # "scalar" not needed in scalar context.
 for my $hex (keys %freq_seq){
 if (defined $freq_ref{$hex}){
 $score += log2($freq_seq{$hex} / $freq_ref{$hex});
 }
 }
 if (! defined $max || $score / $n_hexamers > $max) {
 $max = $score / $n_hexamers;
 }
 if (!defined $min || $score / $n_hexamers <= $min) {
 $min = $score / $n_hexamers
 }
 my $key = $id_name[0] . " frame: $i";
 $key .= (' FWD', ' REV')[$j] if $j < 2; # Poor man's "switch".
 $score{$key} = [$max, $min];
 }
 }
 }
 return %score
}
  • There's no need to call subroutines with the & prepended.
  • Semicolons after blocks are not needed.
  • To avoid guessing the minimum, just use undef and check for that in the condition. You could also try 'INF', but it's not portable.
  • Why is the subroutine's second parameter a scalar reference? Is it a very long string?
answered Nov 12, 2015 at 9:00
\$\endgroup\$
4
  • \$\begingroup\$ I know I don't need & but it is easier for me to identify them (it's my second month using perl :D ) The second parameter is not a long string (~20 characters), but I thought it would be better to use reference than passing it directly. I didn't know it is C-style the way I do the loops... So in general rather than using references if I should rather use anonymous hash/arrays? Many thanks \$\endgroup\$ Commented Nov 12, 2015 at 9:21
  • 1
    \$\begingroup\$ Anonymous array is a reference to an array without a variable keeping the array. If you want to pass it to a subroutine, you must use a reference in the sub. \$\endgroup\$ Commented Nov 12, 2015 at 9:29
  • \$\begingroup\$ & for subroutine calls is a holdover from Perl 4, no longer needed in Perl 5. I also recommend dropping it, as Perl is already a punctuation-heavy language. \$\endgroup\$ Commented Nov 12, 2015 at 9:44
  • \$\begingroup\$ @200_success Ok, I'll try to avoid it. Thanks for the edit btw. \$\endgroup\$ Commented Nov 12, 2015 at 9:55

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.