joomanji has asked for the wisdom of the Perl Monks concerning the following question:

Dear all.. I'm very new to perl script, I'm able to read input from a file, and split them into column. The problem is I've no idea how compare line 1 and line 2 variable. For example :
scf7180000168636 1 40919 40986 Pseudo TGT 0 + 0 25.93 scf7180000168636 2 28112 28037 Asn GTT 0 + 0 86.57 scf7180000168636 3 27372 27295 Ile GAT 0 + 0 54.85 scf7180000168636 4 27287 27213 Gly TCC 0 + 0 45.71 scf7180000168636 5 839 753 Leu CAG 0 + 0 72.78 scf7180000168667 1 1672 1598 Arg CCT 0 + 0 72.32 scf7180000168683 1 509 584 Met CAT 0 + 0 90.27 scf7180000168700 1 1405 1329 Val GAC 0 + 0 91.08 scf7180000168709 1 11151 11067 Leu CAA 0 + 0 75.09 scf7180000168716 1 481 406 Thr TGT 0 + 0 81.30 scf7180000168723 1 999 923 Val GAC 0 + 0 90.22 scf7180000168736 1 813 888 Val TAC 0 + 0 93.03 scf7180000168736 2 912 988 Asp GTC 0 + 0 95.34 scf7180000168827 1 5375 5451 Val GAC 0 + 0 91.08
I know how to compare the value for each of the variable horizontally, but what if i want to compare something from line 1 and line 2? My current code :
#!/usr/local/bin/perl -w #tRANSCAN-SE_extract die "usage: $0 tRNASCAN-SE.list\n" unless @ARGV ==1; while (<>) { my ($seq, $tRNA_no, $p1, $p2, $amino, $anticodon, $stat1, $stat2, +$percent) = split; $beg = $p1 if $p2 > $p1; $beg = $p2 if $p2 < $p1; $end = $p1 if $p1 > $p2; $end = $p2 if $p1 < $p2; system ("extractseq -sequence=$seq.seq -auto -stdout -separate + -reg=$beg..$end"); }
I can print out line by line. By how can I print out only $p1 and $p2 if the $seq is the same? Something like "$seq $beg..$end,$beg..$end..." ? Thanks in advanced for helping me...tq!

Replies are listed 'Best First'.
Re: Perl Multiline Regex
by ikegami (Patriarch) on May 29, 2009 at 17:07 UTC

    You need to keep track of previously viewed rows. Assuming the data is sorted by sequence id,

    my $last_seq; while (<>) { my ($seq, $p1, $p2) = (split)[0, 2, 3]; ($p2, $p1) = ($p1, $p2) if $p2 < $p1; if (defined($last_seq)) { if ($seq eq $last_seq) { print(","); } else { print("\n$seq "); } } print("$p1..$p2"); $last_seq = $seq; } print("\n") if defined($last_seq);

    Update: More like your code and less like your explanation:

    sub extractseq { my ($seq, $ranges) = @_; system(extractseq => "-sequence=$seq.seq", "-auto", "-stdout", "-separate", "-reg=" . join(',', map { "$_->[0]..$_->[1]" } @$ranges), ) and die("system: $?/$!\n"); } my $last_seq; my @ranges; while (<>) { my ($seq, $p1, $p2) = (split)[0, 2, 3]; ($p2, $p1) = ($p1, $p2) if $p2 < $p1; if (defined($last_seq) && $seq ne $last_seq) { extractseq($last_seq, \@ranges); @ranges = (); } $last_seq = $seq; push @ranges, [ $p1, $p2 ]; } extractseq($last_seq, \@ranges) if defined($last_seq);

    Update: Finally, if the input isn't sorted or if you prefer something simpler (at the cost of using more memory),

    my %ranges_by_seq; while (<>) { my ($seq, $p1, $p2) = (split)[0, 2, 3]; ($p2, $p1) = ($p1, $p2) if $p2 < $p1; push @{ $ranges_by_seq{$seq} }, [ $p1, $p2 ]; } for my $seq (keys(%ranges_by_seq)) { my $ranges = $ranges_by_seq{$seq}; system(extractseq => "-sequence=$seq.seq", "-auto", "-stdout", "-separate", "-reg=" . join(',', map { "$_->[0]..$_->[1]" } @$ranges), ) and die("system: $?/$!\n"); }
      Dear Ikegami,

      Thank you very much for your effort! I really did not expect some one to reply in such a short time and the solution actually worked! I just hit the F5 button and somebody just replied to my question and it was you! Before I could reply you came up with another update!

      I will definitely learn from the example you gave me and understand it thoroughly and applied on other script as well.I've modified the script and applied on other scripts!! Thank you! But the script you gave me was hang after the first input. Giving me the error message of "system: 0/Bad file descriptor". But when i commented out the line " or die("system: $?/$!\n"); it works just fine!

      Do you mind to explain this more?
        Oops, that should be "and die" instead of "or die". system is unusual in its return value. Fixed.
Re: Perl Multiline Regex
by MadraghRua (Vicar) on May 29, 2009 at 22:42 UTC
    You might also check out BioPerl's tRNAScan module:

    use Bio::Tools::tRNAscanSE; my $parser = Bio::Tools::tRNAscanSE->new(-file => 'result.tRNAscanS +E'); # parse the results while( my $gene = $parser->next_prediction ) { @exon_arr = $gene->get_seqFeatures(); }
    Ikegami's excellent posts can help you work with the data itself to do the item by item analysis. Learning BioPerl can be a very good thing...

    MadraghRua
    yet another biologist hacking perl....

Re: Perl Multiline Regex
by locked_user sundialsvc4 (Abbot) on Jun 02, 2009 at 01:21 UTC

    Categorically speaking, one of the most-important techniques in all of data processing is the notion of working with known-sorted files, in what is called “unit record processing.” It is literally how data processing was done before computers were invented.

    All you need to do is to remember what the (one...) previously-read record was. You can do a tremendous number of things, extremely quickly, just by comparing this record to the previous one.