splitting data advice requested

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

I have a file that looks like this

>SEQ1
-----I--RL--AAIDVDG-NLT----------D--R--D-RL-ISTKA-IESIRS--A-
-E-K--------K-GLT-VSL----LS------GN-V----I-PVV---YA-L------K
IF---------------L-----GINGPVF------------------------------
>SEQ2
-MKI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
-E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
IL---------------I-----G----TS-----------------GP-VV--------
>SEQ3
--KI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
-E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
IL---------------I-----G----TS-----------------GP-VV--------
---AE--D------GG---A---------------------------------------I

The lines starting with '>' are the headers, and the series of letters and dashes underneath each header is the 'sequence' associated with that header. I would like to split this data so that the header is one element in the array and the sequence data underneath it is in another. For example:

element[0] = >SEQ1
element[1] = -----I--RL--AAIDVDG-NLT----------D--R--D-RL-ISTKA-IESIRS--A-
-E-K--------K-GLT-VSL----LS------GN-V----I-PVV---YA-L------K
IF---------------L-----GINGPVF------------------------------

At first sight it seems a fairly easy split operation, and I initially thought I could just split on newline, but I can't, because the sequence data runs over several lines in the file. Any advice or thoughts on how I might do this would be much appreciated. Thanks in advance.


Replies are listed 'Best First'.
Re: splitting data advice requested by kennethk (Abbot) on May 13, 2009 at 14:44 UTC
Combining Bloodnok's suggestion of a hash with almut's approach to splitting, plus split's limit argument and a positive look-ahead assertion, I give you:

use strict;
use warnings;

local $/;
my %hash = map split(/\n/, $_, 2), split /\n(?=>)/, <DATA>;
s/\n//g foreach values %hash;

__DATA__
>SEQ1
-----I--RL--AAIDVDG-NLT----------D--R--D-RL-ISTKA-IESIRS--A-
-E-K--------K-GLT-VSL----LS------GN-V----I-PVV---YA-L------K
IF---------------L-----GINGPVF------------------------------
>SEQ2
-MKI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
-E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
IL---------------I-----G----TS-----------------GP-VV--------
>SEQ3
--KI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
-E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
IL---------------I-----G----TS-----------------GP-VV--------
---AE--D------GG---A---------------------------------------I
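For anyone unsure what the limit argument and the look-ahead are each doing in the reply above, here is a minimal sketch on a toy string. The helper name `split_records` and the sample data are illustrative assumptions, not part of the original reply, and it returns ordered pairs rather than a hash.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split FASTA-like text into
# [header, sequence] pairs using split's look-ahead and limit features.
sub split_records {
    my ($text) = @_;

    # /\n(?=>)/ matches a newline only when the next character is '>'.
    # The look-ahead consumes nothing, so every record keeps its '>'.
    my @records = split /\n(?=>)/, $text;

    return map {
        # A limit of 2 splits at the first newline only:
        # field 1 is the header, field 2 is everything after it.
        my ($header, $sequence) = split /\n/, $_, 2;
        $sequence = '' unless defined $sequence;
        $sequence =~ s/\n//g;    # join the wrapped sequence lines
        [ $header, $sequence ];
    } @records;
}

# Toy input (assumed for illustration only).
my $text = ">SEQ1\nAB-\n-CD\n>SEQ2\nEF-\n";
printf "%s => %s\n", @$_ for split_records($text);
# prints:
# >SEQ1 => AB--CD
# >SEQ2 => EF-
```

Without the limit, the inner split would break the sequence into one field per line; without the look-ahead, the '>' would be eaten as part of the separator.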
Re^2: splitting data advice requested by johngg (Canon) on May 13, 2009 at 15:45 UTC
Another way of getting rid of the newlines would be to lose all of them with the second split, passing the fields out in an anonymous array, and then map out the key and sequence using shift and join.

use strict;
use warnings;

use Data::Dumper;

my %hash =
    map { ( shift @$_, join q{}, @$_ ) }
    map { [ split m{\n} ] }
    map { split m{\n(?=>)} }
    do  { local $/; <DATA> };

print Data::Dumper->Dumpxs( [ \ %hash ], [ qw{ *hash } ] );

__DATA__
>SEQ1
-----I--RL--AAIDVDG-NLT----------D--R--D-RL-ISTKA-IESIRS--A-
-E-K--------K-GLT-VSL----LS------GN-V----I-PVV---YA-L------K
IF---------------L-----GINGPVF------------------------------
>SEQ2
-MKI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
-E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
IL---------------I-----G----TS-----------------GP-VV--------
>SEQ3
--KI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
-E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
IL---------------I-----G----TS-----------------GP-VV--------
---AE--D------GG---A---------------------------------------I

The output.

%hash = (
          '>SEQ1' => '-----I--RL--AAIDVDG-NLT----------D--R--D-RL-ISTKA-IESIRS--A--E-K--------K-GLT-VSL----LS------GN-V----I-PVV---YA-L------KIF---------------L-----GINGPVF------------------------------',
          '>SEQ3' => '--KI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A--E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------SIL---------------I-----G----TS-----------------GP-VV-----------AE--D------GG---A---------------------------------------I',
          '>SEQ2' => '-MKI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A--E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------SIL---------------I-----G----TS-----------------GP-VV--------'
        );

I hope this is of interest.

Cheers,

JohnGG
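Note that the Dumper output above lists '>SEQ3' before '>SEQ2': Perl hashes do not preserve insertion order. If the order of records in the file matters, one possible approach (a sketch only, not johngg's method) is to keep the headers in an array alongside the hash. The name `parse_ordered` and the toy input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a reference to the headers in file order
# and a reference to a hash mapping each header to its joined sequence.
sub parse_ordered {
    my ($text) = @_;
    my ( @order, %seq );

    for my $record ( split /\n(?=>)/, $text ) {
        # First line is the header, the rest are wrapped sequence lines.
        my ( $header, @lines ) = split /\n/, $record;
        next unless defined $header;
        push @order, $header;
        $seq{$header} = join q{}, @lines;
    }
    return ( \@order, \%seq );
}

# Toy input, deliberately not in sorted order.
my ( $order, $seq ) = parse_ordered(">SEQ2\nAB-\n-CD\n>SEQ1\nEF-\n");
print "$_ => $seq->{$_}\n" for @$order;
# prints:
# >SEQ2 => AB--CD
# >SEQ1 => EF-
```

Iterating over the array rather than `keys %$seq` gives the records back in the order they were read.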
Re: splitting data advice requested by VinsWorldcom (Prior) on May 13, 2009 at 14:13 UTC
#!/usr/bin/perl

use strict;
use Data::Dumper;

open (IN, "in.txt");

my @element;
my $data;

while (<IN>) {
    chomp ($_);
    if ($_ =~ /^>/) {
        if ($data) {
            push @element, $data;
            $data = '';
        }
        push @element, $_;
    } else {
        $data .= $_ . "\n"
    }
}
push @element, $data;

print Dumper \@element;
Re: splitting data advice requested by bichonfrise74 (Vicar) on May 13, 2009 at 21:00 UTC
How about this?

#!/usr/bin/perl

use strict;
use Data::Dumper;

local $/ = ">";
my %hash;

while (<DATA>) {
    s/>//;
    s/(^SEQ\d)//;
    $hash{">". $1} = $_ if ( defined( $1 ));
}

print Dumper(\%hash);

__DATA__
>SEQ1
-----I--RL--AAIDVDG-NLT----------D--R--D-RL-ISTKA-IESIRS--A-
-E-K--------K-GLT-VSL----LS------GN-V----I-PVV---YA-L------K
IF---------------L-----GINGPVF------------------------------
>SEQ2
-MKI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
-E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
IL---------------I-----G----TS-----------------GP-VV--------
>SEQ3
--KI----KA--ISIDIDG-TIT------YPN-R-------MIHEK--A-LEAIRR--A-
-E-S--------L-GIP-IML----VT------GN-T----V-QFA---EA-A------S
IL---------------I-----G----TS-----------------GP-VV--------
---AE--D------GG---A---------------------------------------I
Re^2: splitting data advice requested by oxone (Friar) on May 13, 2009 at 22:19 UTC
I like your approach, because it avoids the problem with the "full file slurping" options suggested above (i.e. they won't scale to very large input files), and it recognises that '>' is a handy delimiter here. A small suggested improvement, so that it removes the line breaks as per the OP and is a bit shorter:

...
while (<DATA>) {
    $hash{">$1"} = $2 if s/[>\n]//g && /^(SEQ\d+)(.*)/;
}
...
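To make the record-at-a-time idea above concrete, here is a self-contained sketch that sets `$/` to '>' and reads from any filehandle. The helper name `read_sequences`, the toy input and the in-memory filehandle are assumptions for illustration. Unlike the snippets above it does not require headers to match `SEQ\d+`, but it does assume that '>' never appears inside a header or sequence line.

```perl
use strict;
use warnings;

# Hypothetical helper: read one '>'-terminated record at a time from $fh
# and return a hash reference of header => joined sequence.
sub read_sequences {
    my ($fh) = @_;
    my %hash;

    local $/ = '>';    # each read now stops after the next '>'
    while ( my $record = <$fh> ) {
        chomp $record;                  # chomp removes a trailing '>' (the current $/)
        next unless length $record;     # skip the empty chunk before the first '>'

        # First line is the header text, the rest are sequence lines.
        my ( $header, @lines ) = split /\n/, $record;
        $hash{">$header"} = join q{}, @lines;
    }
    return \%hash;
}

# Toy input read through an in-memory filehandle (assumed for illustration).
my $text = ">SEQ1\nAB-\n-CD\n>SEQ2\nEF-\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $seqs = read_sequences($fh);
print "$_ => $seqs->{$_}\n" for sort keys %$seqs;
# prints:
# >SEQ1 => AB--CD
# >SEQ2 => EF-
```

Only the reading is incremental here: the hash still ends up holding every sequence, so for truly huge files you would process each record inside the loop instead of storing it.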
Re: splitting data advice requested by Bloodnok (Vicar) on May 13, 2009 at 14:13 UTC
See Re: splitting data advice required...
Update: Mended broken link - TFT kennethk
A user level that continues to overstate my experience :-))