Perl Multiline Regex

joomanji has asked for the wisdom of the Perl Monks concerning the following question:

Dear all.. I'm very new to perl script, I'm able to read input from a file, and split them into column. The problem is I've no idea how compare line 1 and line 2 variable. For example :

scf7180000168636        1       40919   40986   Pseudo  TGT     0     
+  0       25.93
scf7180000168636        2       28112   28037   Asn     GTT     0     
+  0       86.57
scf7180000168636        3       27372   27295   Ile     GAT     0     
+  0       54.85
scf7180000168636        4       27287   27213   Gly     TCC     0     
+  0       45.71
scf7180000168636        5       839     753     Leu     CAG     0     
+  0       72.78
scf7180000168667        1       1672    1598    Arg     CCT     0     
+  0       72.32
scf7180000168683        1       509     584     Met     CAT     0     
+  0       90.27
scf7180000168700        1       1405    1329    Val     GAC     0     
+  0       91.08
scf7180000168709        1       11151   11067   Leu     CAA     0     
+  0       75.09
scf7180000168716        1       481     406     Thr     TGT     0     
+  0       81.30
scf7180000168723        1       999     923     Val     GAC     0     
+  0       90.22
scf7180000168736        1       813     888     Val     TAC     0     
+  0       93.03
scf7180000168736        2       912     988     Asp     GTC     0     
+  0       95.34
scf7180000168827        1       5375    5451    Val     GAC     0     
+  0       91.08
[download]

I know how to compare the value for each of the variable horizontally, but what if i want to compare something from line 1 and line 2? My current code :

#!/usr/local/bin/perl -w
#tRANSCAN-SE_extract
die "usage: $0 tRNASCAN-SE.list\n" unless @ARGV ==1;
while (<>) {
    my ($seq, $tRNA_no, $p1, $p2, $amino, $anticodon, $stat1, $stat2, 
+$percent) = split;
    $beg = $p1 if $p2 > $p1; 
    $beg = $p2 if $p2 < $p1;
    $end = $p1 if $p1 > $p2; 
    $end = $p2 if $p1 < $p2;
        system ("extractseq -sequence=$seq.seq -auto -stdout -separate
+ -reg=$beg..$end");
    }
[download]

I can print out line by line. By how can I print out only $p1 and $p2 if the $seq is the same? Something like "$seq $beg..$end,$beg..$end..." ? Thanks in advanced for helping me...tq!

Comment on Perl Multiline Regex Select or Download Code

Replies are listed 'Best First'.
Re: Perl Multiline Regex by ikegami (Patriarch) on May 29, 2009 at 17:07 UTC
You need to keep track of previously viewed rows. Assuming the data is sorted by sequence id, `my $last_seq; while (<>) { my ($seq, $p1, $p2) = (split)[0, 2, 3]; ($p2, $p1) = ($p1, $p2) if $p2 < $p1; if (defined($last_seq)) { if ($seq eq $last_seq) { print(","); } else { print("\n$seq "); } } print("$p1..$p2"); $last_seq = $seq; } print("\n") if defined($last_seq);` [download] Update: More like your code and less like your explanation: sub extractseq { my ($seq, $ranges) = @_; system(extractseq => "-sequence=$seq.seq", "-auto", "-stdout", "-separate", "-reg=" . join(',', map { "$_->[0]..$_->[1]" } @$ranges), ) and die("system: $?/$!\n"); } my $last_seq; my @ranges; while (<>) { my ($seq, $p1, $p2) = (split)[0, 2, 3]; ($p2, $p1) = ($p1, $p2) if $p2 < $p1; if (defined($last_seq) && $seq ne $last_seq) { extractseq($last_seq, \@ranges); @ranges = (); } $last_seq = $seq; push @ranges, [ $p1, $p2 ]; } extractseq($last_seq, \@ranges) if defined($last_seq); [download] Update: Finally, if the input isn't sorted or if you prefer something simpler (at the cost of using more memory), `my %ranges_by_seq; while (<>) { my ($seq, $p1, $p2) = (split)[0, 2, 3]; ($p2, $p1) = ($p1, $p2) if $p2 < $p1; push @{ $ranges_by_seq{$seq} }, [ $p1, $p2 ]; } for my $seq (keys(%ranges_by_seq)) { my $ranges = $ranges_by_seq{$seq}; system(extractseq => "-sequence=$seq.seq", "-auto", "-stdout", "-separate", "-reg=" . join(',', map { "$_->[0]..$_->[1]" } @$ranges), ) and die("system: $?/$!\n"); }` [download]	[reply] [d/l] [select]
Re^2: Perl Multiline Regex by joomanji (Acolyte) on May 29, 2009 at 18:24 UTC
Dear Ikegami, Thank you very much for your effort! I really did not expect some one to reply in such a short time and the solution actually worked! I just hit the F5 button and somebody just replied to my question and it was you! Before I could reply you came up with another update! I will definitely learn from the example you gave me and understand it thoroughly and applied on other script as well.I've modified the script and applied on other scripts!! Thank you! But the script you gave me was hang after the first input. Giving me the error message of "system: 0/Bad file descriptor". But when i commented out the line " or die("system: $?/$!\n"); it works just fine! Do you mind to explain this more?	[reply]
Re^3: Perl Multiline Regex by ikegami (Patriarch) on May 29, 2009 at 18:58 UTC
Oops, that should be "`and die`" instead of "`or die`". `system` is unusual in its return value. Fixed.	[reply] [d/l] [select]
Re^4: Perl Multiline Regex by joomanji (Acolyte) on May 29, 2009 at 20:13 UTC
Re: Perl Multiline Regex by MadraghRua (Vicar) on May 29, 2009 at 22:42 UTC
You might also check out BioPerl's tRNAScan module: `use Bio::Tools::tRNAscanSE; my $parser = Bio::Tools::tRNAscanSE->new(-file => 'result.tRNAscanS +E'); # parse the results while( my $gene = $parser->next_prediction ) { @exon_arr = $gene->get_seqFeatures(); }` [download] Ikegami's excellent posts can help you work with the data itself to do the item by item analysis. Learning BioPerl can be a very good thing... MadraghRua yet another biologist hacking perl....	[reply] [d/l]
Re: Perl Multiline Regex by locked_user sundialsvc4 (Abbot) on Jun 02, 2009 at 01:21 UTC
Categorically speaking, one of the most-important techniques in all of data processing is the notion of working with known-sorted files, in what is called “unit record processing.” It is literally how data processing was done before computers were invented. All you need to do is to remember what the (one...) previously-read record was. You can do a tremendous number of things, extremely quickly, just by comparing this record to the previous one.