I have written this function (and others similar to that one) But I am not sure I am using references on their full power.
My currently concerns is if I make a huge use of memory. The subroutine recieve a reference to two files, the subroutines, like these one return a hash (except &log_time
, which returns a scalar). The subroutines expect a reference, thats why I use my %current_seq = ($id_name[0] => $seqs[$j]); my %freq_seq = &hexamer_freq(\%current_seq);
on the subroutine, I don't think this is a good way to do it, but I can't imagine a better way to do it.
sub comparer{
# Compare a sequence with the reference frequency of hexamers
# First argument the file of to analyse, second argument the file of reference
my %score;
#Reading arguments
my $seq = shift;
my $ref_seq = shift;
# Calculating the reference log2
my %ref_seq = &read_fasta($$ref_seq);
my %freq_ref = &hexamer_freq(\%ref_seq);
# Counting hexamers and frequencies for each sequence
my %seqs = &read_fasta($seq);
while( my ($id, $sequen) = each %seqs){
my @id_seq = split(/\s+/, $id);
my @id_name = split(/\./, $id_seq[0]);
my $max = 0;
my $min = 999;
for (my $i = 0; $i < 3; $i++){
last if length $sequen <= $i;
my $sequ = substr($sequen, $i);
next unless (defined $sequ);
my $rev_sequ = reverse($sequ);
my @seqs = ($sequ, $rev_sequ);
for (my $j = 0; $j < scalar(@seqs); $j++){
my %current_seq = ($id_name[0] => $seqs[$j]);
my %freq_seq = &hexamer_freq(\%current_seq);
# Handle the sequences that are too short to contain an hexamer
if (scalar keys %freq_seq == 0){
print STDERR &log_time(), "Unable to calculate the Hexamer score of $id_name[0]\n";
next;
};
# Calculate the hexamer score
my $score = 0;
my $n_hexamers = scalar keys %freq_seq;
foreach my $hex (keys %freq_seq){
if (defined $freq_ref{$hex}){
$score += log2($freq_seq{$hex}/$freq_ref{$hex});
}
};
# Store the two possible candidates of "best score"
if ($score/$n_hexamers > $max){
$max = $score/$n_hexamers;
};
unless ($score/$n_hexamers > $min) {
$min = $score/$n_hexamers
}
# Store the data for each sequence
my $key = $id_name[0] . " frame: $i";
$key .= " FWD" if $j == 0; # The fwd + or - have the same hexamers
$key .= " REV" if $j == 1;
$score{$key} = [$max, $min];
};
};
}
return %score;
};
Besides, I would like to improve the way the $min
is calculated. Now the 999 is an arbitrary number, I expect it won't be reached, because the $freq_
are numbers between 0 and 1, and it is unlikely to have so big numbers (but it may happen).
1 Answer 1
I removed your comments and inserted mine.
# Documentation goes to POD.
=item comparer
Compare a sequence with the reference frequency of hexamers.
First argument the file of to analyse, second argument the file of reference
=cut
sub comparer {
my ($seq, $ref_seq) = @_; # I like arguments being processed as the first step in the sub. No need to shift twice.
my %score;
my %freq_ref = hexamer_freq({ read_fasta($$ref_seq) });
my %seqs = read_fasta($seq);
while (my ($id, $sequen) = each %seqs) {
my @id_seq = split ' ', $id;
my @id_name = split /\./, $id_seq[0];
my ($max, $min);
for my $i (0 .. 2) { # No need for a C-style for.
last if length $sequen <= $i;
my $sequ = substr $sequen, $i;
next unless defined $sequ;
my $rev_sequ = reverse $sequ;
my @seqs = ($sequ, $rev_sequ);
for my $j (0 .. $#seqs) { # C-style eliminated again.
my %freq_seq = hexamer_freq({ $id_name[0] => $seqs[$j] }); # Anonymous hash.
if (keys %freq_seq == 0) {
print STDERR log_time(), "Unable to calculate the Hexamer score of $id_name[0]\n";
next
}
my $score = 0;
my $n_hexamers = keys %freq_seq; # "scalar" not needed in scalar context.
for my $hex (keys %freq_seq){
if (defined $freq_ref{$hex}){
$score += log2($freq_seq{$hex} / $freq_ref{$hex});
}
}
if (! defined $max || $score / $n_hexamers > $max) {
$max = $score / $n_hexamers;
}
if (!defined $min || $score / $n_hexamers <= $min) {
$min = $score / $n_hexamers
}
my $key = $id_name[0] . " frame: $i";
$key .= (' FWD', ' REV')[$j] if $j < 2; # Poor man's "switch".
$score{$key} = [$max, $min];
}
}
}
return %score
}
- There's no need to call subroutines with the
&
prepended. - Semicolons after blocks are not needed.
- To avoid guessing the minimum, just use
undef
and check for that in the condition. You could also try'INF'
, but it's not portable. - Why is the subroutine's second parameter a scalar reference? Is it a very long string?
-
\$\begingroup\$ I know I don't need & but it is easier for me to identify them (it's my second month using perl :D ) The second parameter is not a long string (~20 characters), but I thought it would be better to use reference than passing it directly. I didn't know it is C-style the way I do the loops... So in general rather than using references if I should rather use anonymous hash/arrays? Many thanks \$\endgroup\$llrs– llrs2015年11月12日 09:21:47 +00:00Commented Nov 12, 2015 at 9:21
-
1\$\begingroup\$ Anonymous array is a reference to an array without a variable keeping the array. If you want to pass it to a subroutine, you must use a reference in the sub. \$\endgroup\$choroba– choroba2015年11月12日 09:29:24 +00:00Commented Nov 12, 2015 at 9:29
-
\$\begingroup\$
&
for subroutine calls is a holdover from Perl 4, no longer needed in Perl 5. I also recommend dropping it, as Perl is already a punctuation-heavy language. \$\endgroup\$200_success– 200_success2015年11月12日 09:44:34 +00:00Commented Nov 12, 2015 at 9:44 -
\$\begingroup\$ @200_success Ok, I'll try to avoid it. Thanks for the edit btw. \$\endgroup\$llrs– llrs2015年11月12日 09:55:44 +00:00Commented Nov 12, 2015 at 9:55
Explore related questions
See similar questions with these tags.