It's possible because my regex didn't work with your reformatted FASTA records. :) aaron_baugher's suggestion to repost your records using <code> or <pre> was spot on, and helped with crafting the following new-and-improved solution--after your re-posting:
use strict;
use warnings;

my %FASTAhash;
{
    local $/ = '>';    # read one FASTA record at a time
    open my $file, '<', 'FASTA.txt' or die $!;
    while (<$file>) {
        next if !/(.*?)\n/;    # $1 = ID line, $' = rest of the record

        # keep the longest sequence seen for each ID
        chomp( $FASTAhash{$1} = $' )
          if !$FASTAhash{$1}
          or length $' > length $FASTAhash{$1};
    }
}

print ">$_\n$FASTAhash{$_}" for keys %FASTAhash;
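Within a block, we start by letting perl know that '>' is the new record separator, instead of the default "\n" (so we read the file a FASTA record at a time, instead of a line at a time). The regex is also tweaked a bit to grab the ID: the first line of each record lands in $1, and the rest of the record (the sequence) is left in $'. A record is stored only if its ID hasn't been seen yet, or if its sequence is longer than the one already stored for that ID.
You'll note that we don't use close $file; when we're done, since the file's automatically closed when my $file falls out of scope (when the block ends).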
Here's the output:
>ENSG00000147724
MSEIQGTVEFSVELHKFYNVDLFQRGYYQIRVTLKVSSRIPHRLSASIAGQTESSSLHSA
CVHDSTVHSRVFQILYRNEEVPINDAVVFRVHLLLGGERMEDALSEVDFQLKVDLHFTDS
EQQLRDVAGAPMVSSRTLGLHFHPRNGLHHQVP
>ENSG00000067082
Sequence unavailable
>ENSG00000010072
MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVETLLHEMIHAYLFV
TNNDKDREGHGPEFCKHMHRINSLTGANITVYHTFHDEVDEYRRHWWRCNGPCQHRPPYY
GYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGKGKAKLGKEPVLAAENKD
KPNRGEAQLVIPFSGKGYVLGETSNLPSPGKLITSHAINKTQDLLNQNHSANAVRPNSKI
KVKFEQNGSSKNSHLVSPAVSNSHQNVLSNYFPRVSFANQKAFRGVNGSPRISVTVGNIP
KNSVSSSSQRRVSSSKISLRNSSKVTESASVMPSQDVSGSEDTFPNKRPRLEDKTVFDNF
FIKKEQIKSSGNDPKYSTTTAQNSSSSSSQSKMVNCPVCQNEVLESQINEHLDWCLEGDS
IKVKSEESL*
Hope this version's helpful!
Update: After posting the above, I just noticed aaron_baugher's solution using $/ = '>' and I think this makes good sense, since this is the FASTA record delimiter.
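For anyone who'd rather avoid $' (it's terse, and older perls pay a speed penalty for it), here's a rough sketch of the same record-at-a-time idea with explicit captures. The sub name longest_by_id and the sample data are made up for illustration; they aren't from the thread, and the sample is read through an in-memory filehandle so the snippet runs without FASTA.txt.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): reads FASTA records from
# an open filehandle and keeps the longest body seen for each ID.
sub longest_by_id {
    my ($fh) = @_;
    my %longest;
    local $/ = '>';    # one FASTA record per read, as in the script above
    while ( my $record = <$fh> ) {
        chomp $record;    # with $/ set to '>', chomp strips a trailing '>'

        # First line is the ID; everything after it is the sequence body.
        # The empty chunk before the first '>' fails to match and is skipped.
        my ( $id, $body ) = $record =~ /\A(.*?)\n(.*)\z/s or next;

        $longest{$id} = $body
          if !exists $longest{$id}
          or length($body) > length( $longest{$id} );
    }
    return \%longest;
}

# Made-up sample data: ID1 appears twice, once without a real sequence.
my $sample = ">ID1\nSequence unavailable\n"
           . ">ID1\nMSEIQGTVEFSVELHKFYNVDLFQRGYYQ\n"
           . ">ID2\nACGT\n";

open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
my $longest = longest_by_id($fh);
close $fh;

print ">$_\n$longest->{$_}" for sort keys %$longest;
```

To run it against a real file, open FASTA.txt in place of \$sample. The sort is only there to make the output order predictable; drop it if order doesn't matter.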