RegEx misbehaving?

pdotcdot has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks, I’m a newbie to the monastery and have what appears on the surface to be a very easy question, but it has had me flummoxed for over a day, so I decided to come to the light side to seek wisdom! I am trying to parse the results from a program that generates an output file like the one below.

Results:1582 1640 6 9.8 6 90 0 69 55 16 13 13 1.68 GACAAT GACAATGACAAT
+GACAATGACAATGACAGAGACAGTAACAATAACAATAACAATAACAA
"Results:5184 5214 6 5.2 6 96 0 55 16 0 45 38 1.47 TGGTGA TGGTGATGGTGA
+TGGTGATGGTGATGTTGAT";

The problem is that the code I have written to pick out the lines beginning with "R" and place the results into arrays seems to skip 1, 2, or 3 Results lines depending on how it feels! Therefore, out of 137 Results lines, it only ever picks out 69 or 74. Below is a section of the code, taken from a larger program that I wrote to do the job, hence the commented-out sections.

$TRID=0;
$SEQID=0;
#$PID=0;
$i=0;
#$line=<TR_INFILE>;
chomp $line;


while ($line =<TR_INFILE>) {
  
    if ($line =~/^R.*/) {
      $line=~s/^Results://g;
      
      #print "making TR arrays\n";
      print OUTFILE3 "$line";
      
      $trstart[$i] =  (split(/\s*/,$line))[0];
      $trend[$i] =    (split(/\s*/,$line))[1];
      $period[$i] =   (split(/\s*/,$line))[2];
      $copy[$i] =     (split(/\s*/,$line))[3];
      $consize[$i] =  (split(/\s*/,$line))[4];
      $matches[$i] =  (split(/\s*/,$line))[5];
      $indels[$i] =   (split(/\s*/,$line))[6];
      $score[$i] =    (split(/\s*/,$line))[7];
      $numa[$i] =     (split(/\s*/,$line))[8];
      $numc[$i] =     (split(/\s*/,$line))[9];
      $numg[$i] =     (split(/\s*/,$line))[10];
      $numt[$i] =     (split(/\s*/,$line))[11];
      $entropy[$i] =  (split(/\s*/,$line))[12];
      #$TR_consensus[$i]= (split(/\s*/,$line))[13];
      #$TR_sequence[$i]=  (split(/\s*/,$line))[14];
      $TRID++;
   }
   # elsif ($line =~/^P.*/){ 
   # print "Making Parameter  arrays\n";
   # $line =~s/\s/\./g;
   # $line =~s/^Parameters:\.//g;
   # $trparameters[$i] = ($line)[0];
   # $PID++;
  # } 
    elsif ($line =~ /^S.*/) {
     # print "Making seqeunce arrays \n";
      $line =~s/^Sequence:\s*//;
      $TR_Accession[$i] =  ($line)[0];
      $SEQID++;
      }
    else {
      }

    $i++;
    $line=<TR_INFILE>;
    chomp $line;
}
close TR_INFILE;

I will be grateful for all advice! I am sure it has something to do with the RegEx. Apologies for the bad layout. Thank you in advance, PC.


Replies are listed 'Best First'.
Re: RegEx misbehaving? (Not the regex) by BrowserUk (Patriarch) on Jul 18, 2003 at 10:11 UTC
Without more data to go on this is somewhat of a guess, but try commenting out the last two lines of the while loop:

    $line=<TR_INFILE>;
    chomp $line;

Each time around the loop, you read a line at the top of the while loop and process it. Then, when you reach these two lines, you read another line from the file, chomp it, and loop back to the top to read yet another line. The line you read and chomped at the bottom is therefore just discarded and never processed.
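To make the effect of that second read concrete, here is a minimal sketch. It uses an in-memory filehandle and six made-up lines (the names `$data` and `count_processed` are illustrative assumptions, not part of the original program) and counts how many lines the loop body actually gets to process with and without the extra read.

```perl
use strict;
use warnings;

# Hypothetical sample data: six lines standing in for the real results file.
my $data = join '', map { "Results:line$_\n" } 1 .. 6;

# Count the lines a loop body processes, with or without the extra read
# at the bottom of the loop.
sub count_processed {
    my ($extra_read) = @_;
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    my $seen = 0;
    while ( my $line = <$fh> ) {
        $seen++;                        # this line is actually processed
        $line = <$fh> if $extra_read;   # read again; overwritten at the top of the next pass
    }
    close $fh;
    return $seen;
}

print "with extra read:    ", count_processed(1), "\n";   # prints 3
print "without extra read: ", count_processed(0), "\n";   # prints 6
```

With the extra read every second line is thrown away, which is consistent with roughly half of the 137 Results lines going missing.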
Re: RegEx misbehaving? by broquaint (Abbot) on Jul 18, 2003 at 10:13 UTC
You're reading the file twice in your loop, first in the condition and then again before the end of the loop. Is this intentional?

Also a few other suggestions. Your match regex `^R.*` can drop the `.*`, as it achieves nothing: it says match 'R' at the beginning of the string, optionally followed by 0 or more of anything. In fact, a simple `index` would suit the job better. You could also drop the indexing using `$i` and replace it with the `push` function, e.g.

    while ( $line = <TR_INFILE> ) {
        if ( index( $line, 'R' ) == 0 ) {
            ## removed /g as it's unnecessary
            $line =~ s/^Results://;
            print OUTFILE3 "$line";

            ## the ' ' is special, see perldoc -f split
            my @chunks = split ' ', $line;
            push @trstart, shift @chunks;
            push @trend,   shift @chunks;
            push @period,  shift @chunks;
            push @copy,    shift @chunks;
            push @consize, shift @chunks;
            push @matches, shift @chunks;
            push @indels,  shift @chunks;
            push @score,   shift @chunks;
            push @numa,    shift @chunks;
            push @numc,    shift @chunks;
            push @numg,    shift @chunks;
            push @numt,    shift @chunks;
            push @entropy, shift @chunks;
            $TRID++;
        }
        elsif ( index($line, 'S') == 0 ) {
            $line =~ s/^Sequence:\s*//;
            push @TR_Accession, $line;
            $SEQID++;
        }
    }

Some of that code massaging is style, but it will also be much faster than your current code, as most of the fiddly stuff is now done by `perl`, and it also saves a lot of hard-coding. See the documentation for `push`, `index`, and `shift` for more info on the functions used above.

HTH
broquaint
Re: Re: RegEx misbehaving? by Bilbo (Pilgrim) on Jul 18, 2003 at 10:45 UTC
Dropping the indexing using $i and replacing it with push wouldn't be exactly the same, because at the moment $i is incremented on every line of the file, rather than just on those which begin with 'R'. In the original version the ith element of each array contained information about the ith line of the file, whereas using push would mean that the ith element of each array contained information about the ith line to begin with 'R'.
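The difference can be seen in a small sketch. The four input lines and the array names `@by_index` and `@by_push` are made up for illustration; only the first field of each Results line is stored, to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical input: Results lines mixed with other lines.
my @lines = ( "Sequence:abc\n", "Results:1 2\n", "+ACGT\n", "Results:3 4\n" );

my ( @by_index, @by_push );
my $i = 0;
for my $line (@lines) {
    if ( index( $line, 'R' ) == 0 ) {
        ( my $rest = $line ) =~ s/^Results://;
        my ($start) = split ' ', $rest;
        $by_index[$i] = $start;     # slot number = line number in the file
        push @by_push, $start;      # slot number = count of Results lines so far
    }
    $i++;                           # advanced on every line, as in the original
}

print "by index: ", join( ',', map { defined $_ ? $_ : 'undef' } @by_index ), "\n";   # undef,1,undef,3
print "by push:  ", join( ',', @by_push ), "\n";                                      # 1,3
```

Which behaviour is wanted depends on whether later code relies on the arrays lining up with file line numbers.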
Re: RegEx misbehaving? by sgifford (Prior) on Jul 18, 2003 at 16:29 UTC
Your split statement is wrong. Look:

    $ perl -e 'print join(" * ",split(/\s*/,"hi there")),"\n";'
    h * i * t * h * e * r * e

The problem is that `\s*` matches zero or more spaces. That matches between every character, and so splits the results into single characters. What you want to split on is either `/\s+/` or " ":

    $ perl -e 'print join(" * ",split(" ","hi there")),"\n";'
    hi * there

Unrelated, you could simplify this code greatly if you used a different data structure, such as this array of hash references `@seq`:

    @{$seq[$i]}{qw(trstart trend period copy consize matches indels score numa numc numg numt entropy)} = split(' ',$line);

And as somebody has pointed out, your use of the diamond operator `<TR_INFILE>` is odd. You read a line at the end of the loop, then at the beginning of the loop you immediately discard this line and read a new one. You chomp the line at the end of the loop, but not the one at the beginning. This is probably not what you mean to do.
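As a rough sketch of how the array-of-hashes idea might look on the first sample line from the question: the variable names `@fields` and `%record` are assumptions for illustration, and the two trailing sequence columns are simply left unassigned here.

```perl
use strict;
use warnings;

# Column names taken from the arrays used in the question.
my @fields = qw(trstart trend period copy consize matches indels
                score numa numc numg numt entropy);

# First sample line from the question, with its prefix removed.
my $line = "Results:1582 1640 6 9.8 6 90 0 69 55 16 13 13 1.68 GACAAT GACAATGACAAT\n";
$line =~ s/^Results://;

# For comparison: /\s*/ can match the empty string, so it splits between every character.
my @bad = split /\s*/, "hi there";
print scalar(@bad), " pieces from /\\s*/\n";    # prints 7 pieces

# Hash slice: one assignment fills all thirteen named fields.
# The list has fifteen values; the two extra ones are ignored.
my @seq;
my %record;
@record{@fields} = split ' ', $line;
push @seq, \%record;

print "start=$seq[0]{trstart} end=$seq[0]{trend} entropy=$seq[0]{entropy}\n";
# prints: start=1582 end=1640 entropy=1.68
```

Each record then travels as one unit, so the thirteen parallel arrays and their shared index are no longer needed.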
Re: RegEx misbehaving? by pdotcdot (Acolyte) on Jul 18, 2003 at 11:24 UTC
Thanks for all your comments, I truly have been enlightened! I'll give the ideas a go as soon as I am off my windoze machine, and back to the penguin one with all the files on.
Re: RegEx misbehaving? by hossman (Prior) on Jul 19, 2003 at 01:09 UTC
In addition to the other numerous things that have been pointed out, I can't help but notice that of the two sample lines of input you provide, one of them doesn't actually start with an 'R'...

    Results:1582 1640 6 9.8 6 90 0 69 55 16 13 13 1.68 GACAAT GACAATGACAAT
    +GACAATGACAATGACAGAGACAGTAACAATAACAATAACAATAACAA
    "Results:5184 5214 6 5.2 6 96 0 55 16 0 45 38 1.47 TGGTGA TGGTGATGGTGA
    +TGGTGATGGTGATGTTGAT";

That second Results line starts with the '"' character, which means this regexp...

    if ($line =~/^R.*/) {

...will ignore that line, because the '^' says the line must start with the 'R'. Maybe this was just a typo in your post ... or maybe a bunch of lines of your data actually look like that, and are causing problems.
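If some lines in the real file do carry that leading quote, one possible workaround is to allow for it in the pattern. The sketch below is only an illustration under that assumption: the sample lines are shortened versions of the ones in the question, and `@strict` and `@lenient` are made-up names.

```perl
use strict;
use warnings;

# Shortened versions of the sample lines from the question.
my @lines = (
    qq{Results:1582 1640 6 9.8\n},
    qq{"Results:5184 5214 6 5.2\n},
    qq{+TGGTGATGGTGATGTTGAT";\n},
);

# The original test: the line must begin with 'R'.
my @strict = grep { /^R/ } @lines;

# A more tolerant test: an optional leading double quote, then 'Results:'.
my @lenient = grep { /^"?Results:/ } @lines;

print scalar(@strict), " strict, ", scalar(@lenient), " lenient\n";   # prints: 1 strict, 2 lenient
```

The later substitution would need the same allowance, for example `s/^"?Results://`, so that the quote is stripped along with the prefix.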
Re: RegEx misbehaving? by pdotcdot (Acolyte) on Jul 21, 2003 at 13:03 UTC
Thank you all for the above comments. This morning I got two different versions of the program working, combining all the suggestions above. Thanks again, PC