I expect that's supposed to be a response to me: you've actually responded to yourself.
Here's the sort of thing I think you want:
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use autodie;
my $header = '';
my (@headers_found, %header_data);
open my $fasta_fh, '<', $ARGV[0];
while (<$fasta_fh>) {
chomp;
next if /^\s*($|#)/;
if (/^>(.*)$/) {
push @headers_found, ($header = $1);
}
else {
die 'Sequence data found without a header!' unless $header;
for (split '', "\U$_") {
/(?<a>[DNEQ])|(?<b>[KRH])|(?<n>.)/;
++$header_data{$header}{(keys %+)[0]};
}
}
}
close $fasta_fh;
for (@headers_found) {
say "Header: $_";
say "\tAcidic: $header_data{$_}{a}";
say "\tBasic: $header_data{$_}{b}";
say "\tNeutral: $header_data{$_}{n}";
}
I made a slight modification to ++mtmcc's sample data: just added some comments with and without leading whitespace for testing purposes.
Do note that's what sample data should look like! It's not the command you type.
Here's a sample run:
$ pm_fasta_amino_charge_calc.pl pm_mtmccs_data_modified.fasta
Header: DROTME_HH_Q02936
Acidic: 14
Basic: 18
Neutral: 67
Header: DROME_HH_Q02937
Acidic: 33
Basic: 35
Neutral: 136
Header: DROME_HH_Q02938
Acidic: 18
Basic: 17
Neutral: 69
Notes:
-
The way you're identifying neutral amino acids looks wrong.
I note ++Cristoforo has already addressed this issue.
My code takes much the same approach: if it's not acidic or basic, then it's neutral.
-
You're splitting the sequence into individal characters then attempting to match a pattern against a single character: there's no possibility of a repeat match so the 'g' modifier is redundant.
(I see Cristoforo has discussed this also.)
-
You're attempting a match on every charcter three times. This is not only pointless when the matches are mutually exclusive but also wastes processing and slows down your program.
I've used an alternation in my code; an if/elsif/else construct would also be more efficient.
-
Doing a case-insensitive match ('i' modifier) is slower than a case-sensitive match.
You're doing this three times on every character individually: I would expect a noticeable slowness with the amount of data you indicate (~700,000 characters).
Change the case of the entire string once before splitting and matching.
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
| |
For: |
|
Use: |
| & | | & |
| < | | < |
| > | | > |
| [ | | [ |
| ] | | ] |
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.