Re: Count the sequence length of each entry in the file
by kcott (Archbishop) on Oct 01, 2020 at 23:25 UTC
|
G'day davi54,
"I want to count the number of alphabets in each entry, length of each entry, etc."
That's confusing as you combine the counts for all entries in @count.
Other variables, outside the while loop, appear misplaced as they look like they're associated with individual entries:
I'd expect them to be inside the while loop.
The following code collects all the data that I believe you want.
You can combine values for all entries if necessary.
#!/usr/bin/env perl
use strict;
use warnings;
my %results;
{
local $/ = '';
while (my $record = <DATA>) {
$record =~ s/\A>(.+?)$//m;
my $entry = $1;
$record =~ s/\s//gm;
$results{$entry}{len} = length $record;
for (0 .. $results{$entry}{len} - 1) {
++$results{$entry}{count}{substr $record, $_, 1};
}
}
}
use Data::Dump;
dd \%results;
__DATA__
>sp_0005
VQLQESGGGLVQAGGSLRLSCAASGRAVSMYNMGWFRQAPGQERELVAAISRGGSIYYA
DSVKGRFTISRDNAKNTLYLQMNNLKPEDTGVYQCRQGSTLGQGTQVTVSS
>sp_0017
HVQLVESGGGSVQAGGSLRLTCAASGFTFSNYYMSWVRQAPGKGLEWVSSIYSVGSNGYY
ADSVKGRSTISRDNAKNTLYLQMNSLKPEDTAVYYCAAEPGGSWWDAYSYWGQGTQVTVS S
Extract of output:
{
sp_0005 => {
count => {
A => 9,
C => 2,
...,
W => 1,
Y => 5,
},
len => 110,
},
sp_0017 => {
count => {
A => 10,
C => 2,
...,
W => 5,
Y => 10,
},
len => 121,
},
}
Notes:
-
You posted sample data but did so using paragraph text.
Please use <code> tags for this sort of thing.
I've made a best guess at what the original looked like:
I was unsure about the embedded space at the end of the second entry
(cf. 'QVTVSS' in 1st entry and 'QVTVS S' in 2nd entry)
so left it as written.
-
I've truncated the names to avoid excessive text wrapping.
They're still unique but now just look like sp_NNNN.
-
When you modify special variables, such as $/ here,
you should localise the change in the smallest scope possible
(see my code for an example of this).
-
I've used substr instead of a regex to get the counts.
Perl's string-handling functions are typically faster than regexes that achieve the same outcome.
-
Do note that this code is only intended to show a technique.
It does not contain any I/O or formatted reporting.
| [reply] [d/l] [select] |
|
|
Thank you for your help. Actually, the input file is formatted to have only 60 characters in each line and then it moves to the new line. So, that's the input file format when you see just a single S on the last line of the second entry. On a different note, for the first entry, the sequence length value you get in your output (110) is one less than the actual sequence length which is 111. However, my output gives me a sequence length of 115, which is even worse. Do you know where the error might be?
| [reply] |
|
|
"for the first entry, the sequence length value you get in your output (110) is one less than the actual sequence length which is 111."
$ perl -E 'say length "VQLQESGGGLVQAGGSLRLSCAASGRAVSMYNMGWFRQAPGQERELV
+AAISRGGSIYYA"'
59
$ perl -E 'say length "DSVKGRFTISRDNAKNTLYLQMNNLKPEDTGVYQCRQGSTLGQGTQV
+TVSS"'
51
$ perl -E 'say 59+51'
110
If you add the newline between those two strings you'll get 111 for \n or 112 for \r\n.
There's also whitespace after those strings which will further increase the length of the line.
As I already stated, you posted your data as paragraph text: I can't tell what the original data was.
I removed all whitespace in my code:
$record =~ s/\s//gm;
The correct length, after removing white space, is 110.
| [reply] [d/l] [select] |
Re: Count the sequence length of each entry in the file
by GrandFather (Saint) on Oct 01, 2020 at 20:05 UTC
|
I tried to make your script a self contained example as follows, but it doesn't like the "data" I gave it. Would you like to improve the sample data?
#!/usr/bin/perl
use strict;
use warnings;
$/ = ''; # Set paragraph mode
my @count;
my %absent;
my $name;
while ( my $para = <DATA> ) {
# Remove fasta header line
if ( $para =~ s/^>(.*)//m ){
$name = $1;
};
# Remove comment line(s)
$para =~ s/^\s*#.*//mg;
my %prot;
$para =~ s/([ACDEFGHIKLMNPQRSTVWY])/ ++$prot{ $1 } /eg;
my $len = length($para);
my $num = scalar keys %prot;
push @count,[$num,$name];
printf "Counted %d for %s ..\n",$num,substr($name,0,50);
print "$name\n";
print join( ' ', map "$_=$prot{$_}", sort keys %prot ), "\n";
printf "Amino acid alphabet = %d\n\n",$num ;
print "Sequence length = ", $len;
# count absent
for ('A'..'Z'){
++$absent{$_} unless exists $prot{$_};
};
};
# sort names by count in ascending order to get lowest
my @sorted = sort { $a->[0] <=> $b->[0] } @count;
my $lowest = $sorted[0]->[0];
# maybe more than 1 lowest
printf "Least number of amino acids is %d in these entries\n",$lowest;
my @lowest = grep { $_->[0] == $lowest } @sorted;
print "$_->[1]\n" for @lowest;
# show all results
print "\nAll results in ascending count\n";
for (@sorted){
printf "%d %s\n", @$_;
};
print "\nExclusion of various amino acids is as follows\n";
for (sort keys %absent){
printf "%s=%d\n",$_,$absent{$_};
};
__DATA__
print 'Please enter protein sequence filename: ';
chomp( my $prot_filename = <STDIN> );
open my $PROTFILE, '<', $prot_filename
or die "Cannot open '$prot_filename' because: $!";
my $report_name = $prot_filename.'_report';
open my $out_file, '>', $report_name
or die "Cannot open '$report_name' because: $!";
close $out_file;
print "Results are printed in $report_name\n";
# print absent counts
print "\nExclusion of various amino acids in $prot_filename is as foll
+ows\n";
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
| [reply] [d/l] |
|
|
| [reply] |
|
|
#!/usr/bin/perl
use strict;
use warnings;
$/ = ''; # Set paragraph mode
my @count;
my %absent;
my $name;
while ( my $para = <DATA> ) {
# Remove fasta header line
if ( $para =~ s/^>(.*)//m ){
$name = $1;
};
# Remove comment line(s)
$para =~ s/^\s*#.*//mg;
my %prot;
$para =~ s/([ACDEFGHIKLMNPQRSTVWY])/ ++$prot{ $1 } /eg;
my $len = length($para);
my $num = scalar keys %prot;
push @count,[$num,$name];
printf "Counted %d for %s ..\n",$num,substr($name,0,50);
print "$name\n";
print join( ' ', map "$_=$prot{$_}", sort keys %prot ), "\n";
printf "Amino acid alphabet = %d\n\n",$num ;
print "Sequence length = $len\n";
# count absent
for ('A'..'Z'){
++$absent{$_} unless exists $prot{$_};
};
};
# sort names by count in ascending order to get lowest
my @sorted = sort { $a->[0] <=> $b->[0] } @count;
my $lowest = $sorted[0]->[0];
# maybe more than 1 lowest
printf "Least number of amino acids is %d in these entries\n",$lowest;
my @lowest = grep { $_->[0] == $lowest } @sorted;
print "$_->[1]\n" for @lowest;
# show all results
print "\nAll results in ascending count\n";
for (@sorted){
printf "%d %s\n", @$_;
};
print "\nExclusion of various amino acids is as follows\n";
for (sort keys %absent){
printf "%s=%d\n",$_,$absent{$_};
};
__DATA__
>sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcripti
+on-activation domain
VQLQESGGGLVQAGGSLRLSCAASGRAVSMYNMGWFRQAPGQERELVAAISRGGSIYYA
DSVKGRFTISRDNAKNTLYLQMNNLKPEDTGVYQCRQGSTLGQGTQVTVSS
>sp_0017_CaCamelidSorghum bicolor multidrug and toxic compound extrusi
+on sbmate
HVQLVESGGGSVQAGGSLRLTCAASGFTFSNYYMSWVRQAPGKGLEWVSSIYSVGSNGYY
ADSVKGRSTISRDNAKNTLYLQMNSLKPEDTAVYYCAAEPGGSWWDAYSYWGQGTQVTVS S
which prints:
Counted 19 for sp_0005_SySynthetic ConstructTumor protein p53 N-t ..
sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcriptio
+n-activation domain
A=9 C=2 D=3 E=4 F=2 G=15 I=3 K=3 L=9 M=3 N=5 P=2 Q=10 R=8 S=12 T=6 V=8
+ W=1 Y=5
Amino acid alphabet = 19
Sequence length = 124
Counted 20 for sp_0017_CaCamelidSorghum bicolor multidrug and tox ..
sp_0017_CaCamelidSorghum bicolor multidrug and toxic compound extrusio
+n sbmate
A=10 C=2 D=4 E=4 F=2 G=15 H=1 I=2 K=4 L=7 M=2 N=5 P=3 Q=6 R=4 S=18 T=7
+ V=10 W=5 Y=10
Amino acid alphabet = 20
Sequence length = 143
Least number of amino acids is 19 in these entries
sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcriptio
+n-activation domain
All results in ascending count
19 sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcri
+ption-activation domain
20 sp_0017_CaCamelidSorghum bicolor multidrug and toxic compound extr
+usion sbmate
Exclusion of various amino acids is as follows
B=2
H=1
J=2
O=2
U=2
X=2
Z=2
including Sequence length = 124 and Sequence length = 143. Is there a problem with your input data such as line endings that aren't as you expect?
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
| [reply] [d/l] [select] |
|
|
|
|
Actually, the input file is ... [update: quoted from
this post]
Rather than saying (update: in a different
post!) what your example data actually is,
why not post that data as it actually is, in a form
immediately usable by other monks (who are trying to help you). That
is, use <code> ... </code> tags to post your data, maybe
something like:
<code>
>sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcripti
+on-activation domain
VQLQESGGGLVQAGGSLRLSCAASGRAVSMYNMGWFRQAPGQERELVAAISRGGSIYYA
DSVKGRFTISRDNAKNTLYLQMNNLKPEDTGVYQCRQGSTLGQGTQVTVSS
>sp_0017_CaCamelidSorghum bicolor multidrug and toxic compound extrusi
+on sbmate
HVQLVESGGGSVQAGGSLRLTCAASGFTFSNYYMSWVRQAPGKGLEWVSSIYSVGSNGYY
ADSVKGRSTISRDNAKNTLYLQMNSLKPEDTAVYYCAAEPGGSWWDAYSYWGQGTQVTVS
S
</code>
In addition to being your exact example data, this
data is easily [download]-able by anyone who might want to help with
this problem. You might want to add this <code>-tagged
example data as an update (see How do I change/delete my post?) to your
original post in addition to finally updating
this post, or at the very least update
this post and add to the OP a note saying that
example data can be found there.
Please see Markup in the Monastery, Writeup Formatting Tips and What shortcuts can I use for linking to other information?, and perhaps
some of the articles in the Understanding and Using PerlMonks section of the Monastery's
Tutorials.
Update: Some trivial wording changes; added links to
actual "Actually..." post.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: Count the sequence length of each entry in the file
by Anonymous Monk on Oct 01, 2020 at 20:21 UTC
|
my %prot;
$para =~ s/([ACDEFGHIKLMNPQRSTVWY])/ ++$prot{ $1 } /eg;
$len = length($para);
You are getting the length after you modify the variable!
| [reply] [d/l] |
|
|
Ohh, I see. I instead placed it like this:
$para =~ s/^\s*#.*//mg;
$len = length($para);
and got the output. However, the count is incorrect. For, example, if you look at the first entry, after removing the header, the sequence length (alphabets in uppercase) is 111, whereas the script gives me 115 as the output for sequence length. I don't know what I'm doing wrong, why is the script returning me wrong value? | [reply] [d/l] |
|
|
When you strip off the header the line endings remain. Try deleting "\n" characters. If on windows also delete "\r" characters.
# Remove fasta header line
if ( $para =~ s/^>(.*)//m ){
$name = $1;
};
# Remove comment line(s)
$para =~ s/^\s*#.*//mg;
$para =~ tr/\r\n//d;
| [reply] [d/l] |
|
|
|
|