how to split

heidi has asked for the wisdom of the Perl Monks concerning the following question:

I have output like this, containing 2 sets of alignments. I want the first 4 lines of the 1st set of alignments and the first 4 lines of the 2nd set in one array. In the same way, I want the alignment part of the 1st and the 2nd set in another array. The two sets of alignments are separated by a '>' symbol.

>1gmz_B mol:protein length:122  PHOSPHOLIPASE A2
          Length = 122

 Score =  103 bits (233), Expect = 9e-23
 Identities = 56/124 (45%), Positives = 67/124 (54%), Gaps = 12/124 (9%)

Query: 2   LWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPVDDLDRCCQTHDNCYKQAKKLDSC 61
           LWQF  MI  K     P   +  YGCYCG+GG G P D  DRCC  HD CY    KL SC
Sbjct: 2   LWQFGKMI-LKETGKLPFPYYVTYGCYCGVGGRGGPKDATDRCCFVHDCCY---GKLTSC 57

Query: 62  KVLVDNPYTNNYSYSCSNNEITCSSENNACEAFICNCDRNAAICFSK--VPYNKEHKNLD 119
           K     P T+ YSYS  +  I C  EN+ C   IC CD+ AA+CF +    YNK++ +
Sbjct: 58  K-----PKTDRYSYSRKDGTIVC-GENDPCRKEICECDKAAAVCFRENLDTYNKKYMSYL 111

Query: 120 KKNC 123
           K  C
Sbjct: 112 KSLC 115

>1b4w_A mol:protein length:122  PROTEIN (PHOSPHOLIPASE A2)
          Length = 122

 Score = 95.7 bits (215), Expect = 2e-20
 Identities = 46/105 (43%), Positives = 61/105 (58%), Gaps = 10/105 (9%)

Query: 2   LWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPVDDLDRCCQTHDNCYKQAKKLDSC 61
           L QF  MIK K+   EP++ +  YGCYCG GG G P D  DRCC  HD CY   +K+  C
Sbjct: 2   LLQFRKMIK-KMTGKEPVVSYAFYGCYCGSGGRGKPKDATDRCCFVHDCCY---EKVTGC 57

Query: 62  KVLVDNPYTNNYSYSCSNNEITCSSENNACEAFICNCDRNAAICF 106
                +P  ++Y+YS  N  I C  + + C+  +C CD+ AAICF
Sbjct: 58  -----DPKWDDYTYSWKNGTIVCGGD-DPCKKEVCECDKAAAICF 96

So far I have done this:

#!/usr/bin/perl
open FILE, "/home/guest/align.txt";
my @arr123;
local $/ = '';
@arr123 = <FILE>;    #output file.
print "<pre>";
#print @arr123;
$array=join('',@arr123);
@arr=split(/\n/,$array);
foreach $a(@arr){
    if($a=~/>/ || $a=~ /Length/ || $a=~ /Score/ || $a=~ /Identities/){push (@header,$a);}    #header information.
    if($a=~/^Query/||$a=~/           /||$a=~/^Sbjct/){push(@lin,$a);}
}
foreach(@header){print $_,"\n";}
foreach(@lin){print $_,"\n";}

I'm sure there are better ways to do it ... and also tell me which is the best website to learn Perl?

Re: how to split by citromatik (Curate) on Dec 18, 2007 at 11:52 UTC
For those who are not familiar with this format, you are trying to parse a BLAST report. I don't know if you are familiar with BioPerl (especially SearchIO), but if you are unsure about writing your own code, maybe you should take a look at it. You can find a nice tutorial that will let you work with BLAST reports.

citromatik
Re: how to split by johngg (Canon) on Dec 18, 2007 at 10:27 UTC
An alternative method would be to maintain a state variable, reading the file line by line and detecting when you move between header and result sections. I also maintain an array reference which points to either the headers or the lines array, so that we `push` onto the correct one.

use strict;
use warnings;

use Data::Dumper;

open my $inFH, q{<}, \ <<'END_OF_FILE' or die qq{open: $!\n};
>1gmz_B mol:protein length:122  PHOSPHOLIPASE A2
          Length = 122

 Score =  103 bits (233), Expect = 9e-23
 Identities = 56/124 (45%), Positives = 67/124 (54%), Gaps = 12/124 (9%)

Query: 2   LWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPVDDLDRCCQTHDNCYKQAKKLDSC 61
           LWQF  MI  K     P   +  YGCYCG+GG G P D  DRCC  HD CY    KL SC
Sbjct: 2   LWQFGKMI-LKETGKLPFPYYVTYGCYCGVGGRGGPKDATDRCCFVHDCCY---GKLTSC 57

Query: 62  KVLVDNPYTNNYSYSCSNNEITCSSENNACEAFICNCDRNAAICFSK--VPYNKEHKNLD 119
           K     P T+ YSYS  +  I C  EN+ C   IC CD+ AA+CF +    YNK++ +
Sbjct: 58  K-----PKTDRYSYSRKDGTIVC-GENDPCRKEICECDKAAAVCFRENLDTYNKKYMSYL 111

Query: 120 KKNC 123
           K  C
Sbjct: 112 KSLC 115

>1b4w_A mol:protein length:122  PROTEIN (PHOSPHOLIPASE A2)
          Length = 122

 Score = 95.7 bits (215), Expect = 2e-20
 Identities = 46/105 (43%), Positives = 61/105 (58%), Gaps = 10/105 (9%)

Query: 2   LWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPVDDLDRCCQTHDNCYKQAKKLDSC 61
           L QF  MIK K+   EP++ +  YGCYCG GG G P D  DRCC  HD CY   +K+  C
Sbjct: 2   LLQFRKMIK-KMTGKEPVVSYAFYGCYCGSGGRGKPKDATDRCCFVHDCCY---EKVTGC 57

Query: 62  KVLVDNPYTNNYSYSCSNNEITCSSENNACEAFICNCDRNAAICF 106
                +P  ++Y+YS  N  I C  + + C+  +C CD+ AAICF
Sbjct: 58  -----DPKWDDYTYSWKNGTIVCGGD-DPCKKEVCECDKAAAICF 96
END_OF_FILE

my $inHeader;
my $pushTo;
my @headers = ();
my @lines   = ();

while ( <$inFH> )
{
    chomp;
    if ( not $inHeader and m{^>} )
    {
        $inHeader = 1;
        $pushTo   = \ @headers;
    }
    if ( $inHeader and m{^Query:} )
    {
        $inHeader = 0;
        $pushTo   = \ @lines;
    }
    push @$pushTo, $_;
}

close $inFH or die qq{close: $!\n};

print Data::Dumper->Dumpxs(
    [ \ @headers, \ @lines ],
    [ qw{ headers lines } ],
    );

Some good habits to get into:

- Always put `use strict;` and `use warnings;` at the top of your scripts.
- Always test for success when `open`'ing and `close`'ing files.
- Avoid using the scalars `$a` and `$b`, as they are special and reserved for use with `sort`.
- Indent your code in a meaningful way to make the logic clearer, and do so consistently.
- Use meaningful variable names (without making the names too long).

Regarding the best website to learn Perl, I think you've already found it!

Cheers,

JohnGG

Update: The array ref. is not necessary, as you can use a ternary to determine which array to `push` onto. Note that the ternary has to yield array references for the dereference to work.

`push @{ $inHeader ? \ @headers : \ @lines }, $_;`
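For readers who want to try the ternary variant from the update on its own, here is a minimal sketch. It assumes a shortened, hypothetical sample (the alignment rows are truncated and the variable names `$report`, `$in_header` and `$in_fh` are made up for illustration), and it skips blank lines, which the code above does not do.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample: two hits with truncated alignment rows.
my $report = <<'END_OF_REPORT';
>1gmz_B mol:protein length:122  PHOSPHOLIPASE A2
          Length = 122

 Score =  103 bits (233), Expect = 9e-23
 Identities = 56/124 (45%), Positives = 67/124 (54%), Gaps = 12/124 (9%)

Query: 2   LWQFNGMI
           LWQF  MI
Sbjct: 2   LWQFGKMI

>1b4w_A mol:protein length:122  PROTEIN (PHOSPHOLIPASE A2)
          Length = 122

 Score = 95.7 bits (215), Expect = 2e-20
 Identities = 46/105 (43%), Positives = 61/105 (58%), Gaps = 10/105 (9%)

Query: 2   LWQFNGMIK
           L QF  MIK
Sbjct: 2   LLQFRKMIK
END_OF_REPORT

# Read from the string as if it were a file.
open my $in_fh, '<', \$report or die "open: $!\n";

my $in_header = 0;
my ( @headers, @lines );

while ( my $line = <$in_fh> ) {
    chomp $line;
    next if $line =~ m{^\s*$};    # skip blank separator lines

    # A '>' line starts a header block; the first 'Query:' line ends it.
    if    ( $line =~ m{^>} )      { $in_header = 1 }
    elsif ( $line =~ m{^Query:} ) { $in_header = 0 }

    # The ternary yields an array reference, which @{ ... } dereferences.
    push @{ $in_header ? \@headers : \@lines }, $line;
}

close $in_fh or die "close: $!\n";

print "$_\n" for @headers, '--', @lines;
```

With this sample, `@headers` collects the four header lines of each hit and `@lines` collects the Query, match and Sbjct rows, in file order.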
Re: how to split by reasonablekeith (Deacon) on Dec 18, 2007 at 09:54 UTC
If you can guarantee that the greater-than sign won't appear in your data, then you might as well use that as your end-of-line marker (your example shows confused usage of this, so you might want to look up the 'slurp' idiom).

Each read from the file gets one set of results (the first read is blank, because we've set '>' as the end-of-line marker, but we're checking our data so it's okay). Each set can then be split into header and line data and pushed onto the arrays.

Note that the whole thing is in curly brackets so that our value for $/ gets restored for any subsequent code (really, look up the slurp idiom).

use strict;
use warnings;
use Data::Dumper;

my (@header, @lin);
{
    open FILE, "/home/guest/align.txt" or die("Error reading file: $!");
    local $/ = '>';
    while (my $result_set = <FILE>) {
        $result_set =~ s/>$//; # trim trailing EOL marker
        if ($result_set =~ m/^(.*?)(Query:.*)$/s) {
            push @header, $1;
            push @lin, $2;
        }
    }
};

print Dumper(\@header);
print Dumper(\@lin);

---
my name's not Keith, and I'm not reasonable.
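The record-separator idea can also give one array of header lines and one array of alignment lines per hit, which seems close to what the original question asked for. The following is only a sketch under assumptions: the sample is shortened and hypothetical, the names `$report`, `@headers` and `@alignments` are made up, and the '>' consumed by the read is put back onto the header by hand.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample: two hits with truncated alignment rows.
my $report = <<'END_OF_REPORT';
>1gmz_B mol:protein length:122  PHOSPHOLIPASE A2
          Length = 122

 Score =  103 bits (233), Expect = 9e-23
 Identities = 56/124 (45%), Positives = 67/124 (54%), Gaps = 12/124 (9%)

Query: 2   LWQFNGMI
           LWQF  MI
Sbjct: 2   LWQFGKMI

>1b4w_A mol:protein length:122  PROTEIN (PHOSPHOLIPASE A2)
          Length = 122

 Score = 95.7 bits (215), Expect = 2e-20
 Identities = 46/105 (43%), Positives = 61/105 (58%), Gaps = 10/105 (9%)

Query: 2   LWQFNGMIK
           L QF  MIK
Sbjct: 2   LLQFRKMIK
END_OF_REPORT

open my $in_fh, '<', \$report or die "open: $!\n";

my ( @headers, @alignments );
{
    local $/ = '>';    # each read now ends at the next '>'
    while ( my $record = <$in_fh> ) {
        chomp $record;                   # chomp strips a trailing $/, here '>'
        next unless $record =~ m/\S/;    # the first read is empty

        # Non-greedy header up to the first 'Query:'; /s lets '.' match "\n".
        if ( $record =~ m/^(.*?)(Query:.*)$/s ) {
            my ( $header, $alignment ) = ( $1, $2 );

            # One array ref of non-blank lines per hit.
            push @headers,    [ grep { /\S/ } split /\n/, ">$header" ];
            push @alignments, [ grep { /\S/ } split /\n/, $alignment ];
        }
    }
}
close $in_fh or die "close: $!\n";

for my $i ( 0 .. $#headers ) {
    printf "hit %d: %d header lines, %d alignment lines\n",
        $i + 1, scalar @{ $headers[$i] }, scalar @{ $alignments[$i] };
}
```

Keeping one array reference per hit means `$headers[0]` and `$alignments[0]` always belong to the same hit, which a single flat array would not guarantee.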
Re: how to split by naChoZ (Curate) on Dec 18, 2007 at 14:51 UTC
"...and also tell me which is the best website to learn Perl?"

You're already here, but for the specifics, woolfy wrote the best node for this question: Where and how to start learning Perl

--
naChoZ

Therapy is expensive. Popping bubble wrap is cheap. You choose.
Re: how to split by apl (Monsignor) on Dec 18, 2007 at 10:47 UTC
Slightly off-topic... If your institute of higher learning has a Computer Center (I'm a dinosaur; they probably don't exist anymore) or Computer Science Department, you might consider contacting them and seeing if they have anyone Perl-savvy who'd be willing to write your software for you. It's something an undergraduate programmer can put on his resume; your department might be able to provide a Work Study job for the guy, etc.

By the way, you're learning programming far better than I did bio-chemistry. 8-)

