how to split

by heidi (Sexton)
on Dec 18, 2007 at 08:51 UTC ( [id://657622]=perlquestion )

heidi has asked for the wisdom of the Perl Monks concerning the following question:

I have output like this, containing two sets of alignments. I want the first 4 lines of the 1st alignment and the first 4 lines of the 2nd alignment in one array and, in the same way, the alignment part of the 1st and 2nd sets in another array. The two sets of alignments are separated by a '>' symbol.
>1gmz_B mol:protein length:122 PHOSPHOLIPASE A2
Length = 122
Score = 103 bits (233), Expect = 9e-23
Identities = 56/124 (45%), Positives = 67/124 (54%), Gaps = 12/124 (9%)
Query: 2 LWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPVDDLDRCCQTHDNCYKQAKKLDSC 61
         LWQF MI K P + YGCYCG+GG G P D DRCC HD CY KL SC
Sbjct: 2 LWQFGKMI-LKETGKLPFPYYVTYGCYCGVGGRGGPKDATDRCCFVHDCCY---GKLTSC 57
Query: 62 KVLVDNPYTNNYSYSCSNNEITCSSENNACEAFICNCDRNAAICFSK--VPYNKEHKNLD 119
          K P T+ YSYS + I C EN+ C IC CD+ AA+CF + YNK++ +
Sbjct: 58 K-----PKTDRYSYSRKDGTIVC-GENDPCRKEICECDKAAAVCFRENLDTYNKKYMSYL 111
Query: 120 KKNC 123
           K C
Sbjct: 112 KSLC 115
>1b4w_A mol:protein length:122 PROTEIN (PHOSPHOLIPASE A2)
Length = 122
Score = 95.7 bits (215), Expect = 2e-20
Identities = 46/105 (43%), Positives = 61/105 (58%), Gaps = 10/105 (9%)
Query: 2 LWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPVDDLDRCCQTHDNCYKQAKKLDSC 61
         L QF MIK K+ EP++ + YGCYCG GG G P D DRCC HD CY +K+ C
Sbjct: 2 LLQFRKMIK-KMTGKEPVVSYAFYGCYCGSGGRGKPKDATDRCCFVHDCCY---EKVTGC 57
Query: 62 KVLVDNPYTNNYSYSCSNNEITCSSENNACEAFICNCDRNAAICF 106
          +P ++Y+YS N I C + + C+ +C CD+ AAICF
Sbjct: 58 -----DPKWDDYTYSWKNGTIVCGGD-DPCKKEVCECDKAAAICF 96
I have now done this:
#!/usr/bin/perl
open FILE, "/home/guest/align.txt";
my @arr123;
local $/ = '';
@arr123 = <FILE>; #output file.
print "<pre>";
#print @arr123;
$array=join('',@arr123);
@arr=split(/\n/,$array);
foreach $a(@arr){
    if($a=~/>/ || $a=~ /Length/ || $a=~ /Score/ || $a=~ /Identities/){push (@header,$a);} #header information.
    if($a=~/^Query/||$a=~/ /||$a=~/^Sbjct/){push(@lin,$a);}
}
foreach(@header){print $_,"\n";}
foreach(@lin){print $_,"\n";}
I'm sure there are better ways to do it... and also tell me which is the best website to learn perl?

Replies are listed 'Best First'.
Re: how to split
by citromatik (Curate) on Dec 18, 2007 at 11:52 UTC

    For those who are not familiar with this format, you are trying to parse a BLAST report. I don't know if you are familiar with BioPerl (especially SearchIO), but if you are unsure about writing your own code, maybe you should take a look at it.

    You can find here a nice tutorial that will help you work with BLAST reports.

    citromatik
Re: how to split
by johngg (Canon) on Dec 18, 2007 at 10:27 UTC
    An alternative method would be to maintain a state variable, reading the file line by line and detecting when you move between header and result sections. I also maintain an array reference which points to either the headers or lines arrays so we push onto the correct array.

    use strict;
    use warnings;

    use Data::Dumper;

    open my $inFH, q{<}, \ <<'END_OF_FILE' or die qq{open: $!\n};
    >1gmz_B mol:protein length:122 PHOSPHOLIPASE A2
    Length = 122
    Score = 103 bits (233), Expect = 9e-23
    Identities = 56/124 (45%), Positives = 67/124 (54%), Gaps = 12/124 (9%)
    Query: 2 LWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPVDDLDRCCQTHDNCYKQAKKLDSC 61
             LWQF MI K P + YGCYCG+GG G P D DRCC HD CY KL SC
    Sbjct: 2 LWQFGKMI-LKETGKLPFPYYVTYGCYCGVGGRGGPKDATDRCCFVHDCCY---GKLTSC 57
    Query: 62 KVLVDNPYTNNYSYSCSNNEITCSSENNACEAFICNCDRNAAICFSK--VPYNKEHKNLD 119
              K P T+ YSYS + I C EN+ C IC CD+ AA+CF + YNK++ +
    Sbjct: 58 K-----PKTDRYSYSRKDGTIVC-GENDPCRKEICECDKAAAVCFRENLDTYNKKYMSYL 111
    Query: 120 KKNC 123
               K C
    Sbjct: 112 KSLC 115
    >1b4w_A mol:protein length:122 PROTEIN (PHOSPHOLIPASE A2)
    Length = 122
    Score = 95.7 bits (215), Expect = 2e-20
    Identities = 46/105 (43%), Positives = 61/105 (58%), Gaps = 10/105 (9%)
    Query: 2 LWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPVDDLDRCCQTHDNCYKQAKKLDSC 61
             L QF MIK K+ EP++ + YGCYCG GG G P D DRCC HD CY +K+ C
    Sbjct: 2 LLQFRKMIK-KMTGKEPVVSYAFYGCYCGSGGRGKPKDATDRCCFVHDCCY---EKVTGC 57
    Query: 62 KVLVDNPYTNNYSYSCSNNEITCSSENNACEAFICNCDRNAAICF 106
              +P ++Y+YS N I C + + C+ +C CD+ AAICF
    Sbjct: 58 -----DPKWDDYTYSWKNGTIVCGGD-DPCKKEVCECDKAAAICF 96
    END_OF_FILE

    my $inHeader;
    my $pushTo;
    my @headers = ();
    my @lines   = ();

    while ( <$inFH> )
    {
        chomp;
        if ( not $inHeader and m{^>} )
        {
            $inHeader = 1;
            $pushTo   = \ @headers;
        }
        if ( $inHeader and m{^Query:} )
        {
            $inHeader = 0;
            $pushTo   = \ @lines;
        }
        push @$pushTo, $_;
    }

    close $inFH or die qq{close: $!\n};

    print Data::Dumper->Dumpxs(
        [ \ @headers, \ @lines ],
        [ qw{ *headers *lines } ],
    );

    Here's the output.

    Some good habits to get into

    • Always put use strict; and use warnings; at the top of your scripts.

    • Always test for success when open'ing and close'ing files.

    • Avoid using scalars $a and $b as they are special and reserved for use with sort.

    • Indent your code in a meaningful way to make the logic clearer and use it consistently.

    • Use meaningful variable names (without making the names too long).
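As a small illustration of the point about $a and $b, here is a minimal sketch. The @scores values are hypothetical and only loosely modelled on the bit scores in the report; they are not taken from any real run. It shows sort relying on the package variables $a and $b, while the loop uses a named lexical variable.

```perl
use strict;
use warnings;

# Hypothetical bit scores, used only for illustration.
my @scores = ( 103, 95.7, 210 );

# sort sets the package variables $a and $b for every comparison,
# which is why they should not be reused as ordinary variables.
my @descending = sort { $b <=> $a } @scores;

# A lexical loop variable with a descriptive name leaves $a and $b alone.
for my $score (@descending) {
    print "$score\n";
}
```

Declaring `my $a` in the same scope would hide the variable that sort fills in, so the comparison block would no longer see the values being sorted.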

    Regarding the best website to learn Perl, I think you've already found it!

    Cheers,

    JohnGG

    Update: The array ref. is not necessary as you can use a ternary to determine which array to push onto.

    push @{ $inHeader ? \ @headers : \ @lines }, $_;
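The ternary form of the push can be tried on its own with a few lines of data. The following is a sketch only: the @input list is a hypothetical, heavily shortened stand-in for the report, and the two state updates are written as statement modifiers rather than the if blocks used above. Each branch of the ternary has to yield an array reference, which @{ ... } then dereferences.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample lines standing in for the BLAST report.
my @input = (
    '>1gmz_B mol:protein length:122 PHOSPHOLIPASE A2',
    'Score = 103 bits (233), Expect = 9e-23',
    'Query: 2 LWQFNGMIKC 11',
    'Sbjct: 2 LWQFGKMI-L 10',
    '>1b4w_A mol:protein length:122 PROTEIN (PHOSPHOLIPASE A2)',
    'Score = 95.7 bits (215), Expect = 2e-20',
    'Query: 2 LWQFNGMIKC 11',
    'Sbjct: 2 LLQFRKMIK- 10',
);

my $inHeader = 0;
my ( @headers, @lines );

for my $line (@input) {
    $inHeader = 1 if $line =~ m{^>};         # a '>' line starts a header
    $inHeader = 0 if $line =~ m{^Query:};    # the first Query: line ends it

    # Both branches are array references; @{ ... } dereferences the chosen one.
    push @{ $inHeader ? \@headers : \@lines }, $line;
}

print 'headers: ', scalar @headers, ', lines: ', scalar @lines, "\n";
```

Without the backslashes the ternary would be evaluated in scalar context and return an element count, which cannot be dereferenced under use strict.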

Re: how to split
by reasonablekeith (Deacon) on Dec 18, 2007 at 09:54 UTC
    If you can guarantee that the greater-than sign won't appear in your data, then you might as well use that as your end-of-line marker (your example shows confused usage of this, so you might want to look up the 'slurp' idiom). Each read from the file gets one set of results (the first read is blank, because we've set '>' as an end of line marker, but we're checking our data so it's okay). This can then be split into header and line data and pushed onto the arrays.

    Note that the whole thing is in curly brackets so that our value for $/ gets restored for any subsequent code (really, look up the slurp idiom.)

    use strict;
    use warnings;
    use Data::Dumper;

    my (@header, @lin);
    {
        open FILE, "/home/guest/align.txt" or die("Error reading file: $!");
        local $/ = '>';
        while (my $result_set = <FILE>) {
            $result_set =~ s/>$//; # trim trailing EOL marker
            if ($result_set =~ m/^(.*?)(Query:.*)$/s) {
                push @header, $1;
                push @lin, $2;
            }
        }
    };

    print Dumper(\@header);
    print Dumper(\@lin);
    ---
    my name's not Keith, and I'm not reasonable.
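For anyone who wants to try the record-separator approach without the file on disk, here is a self-contained sketch of the same idea. It makes a few assumptions that are not in the reply above: the report is a hypothetical, shortened string read through an in-memory filehandle, chomp is used to strip the '>' terminator, and the '>' is put back at the front of each header, since it is consumed as the end of the previous record.

```perl
use strict;
use warnings;

# Hypothetical, shortened report kept in a string so the sketch runs as-is.
my $report = <<'END_OF_REPORT';
>1gmz_B mol:protein length:122 PHOSPHOLIPASE A2
Score = 103 bits (233), Expect = 9e-23
Query: 2 LWQFNGMIKC 11
Sbjct: 2 LWQFGKMI-L 10
>1b4w_A mol:protein length:122 PROTEIN (PHOSPHOLIPASE A2)
Score = 95.7 bits (215), Expect = 2e-20
Query: 2 LWQFNGMIKC 11
Sbjct: 2 LLQFRKMIK- 10
END_OF_REPORT

my ( @header, @lin );
{
    open my $fh, '<', \$report or die "Cannot open in-memory report: $!";
    local $/ = '>';    # records end at '>' inside this block only

    while ( my $record = <$fh> ) {
        chomp $record;                 # chomp strips the current $/, here '>'
        next unless length $record;    # the first read is only the leading '>'

        # Everything before the first "Query:" is header, the rest is alignment.
        if ( $record =~ m/^(.*?)(Query:.*)$/s ) {
            push @header, ">$1";       # restore the '>' used as the separator
            push @lin,    $2;
        }
    }
    close $fh or die "Cannot close in-memory report: $!";
}

# Outside the block $/ is a newline again, so later reads behave normally.
print scalar(@header), " headers, ", scalar(@lin), " alignment blocks\n";
```

Each element of @header and @lin here holds one whole multi-line chunk per alignment; split /\n/ on an element gives the individual lines if they are needed separately.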
Re: how to split
by naChoZ (Curate) on Dec 18, 2007 at 14:51 UTC

    ...and also tell me which is the best website to learn perl?

    You're already here, but for the specifics, woolfy wrote the best node for this question: Where and how to start learning Perl

    --
    naChoZ

    Therapy is expensive. Popping bubble wrap is cheap. You choose.

Re: how to split
by apl (Monsignor) on Dec 18, 2007 at 10:47 UTC
    Slightly off-topic... If your institute of higher learning has a Computer Center (I'm a dinosaur; they probably don't exist anymore) or Computer Science Department, you might consider contacting them and seeing if they have anyone Perl-savvy who'd be willing to write your software for you. It's something an undergraduate programmer can put on his resume; your department might be able to provide a Work Study job for the guy, etc.

    By the way, you're learning programming far better than I did bio-chemistry. 8-)

Node Type: perlquestion [id://657622]
Approved by andreas1234567