pdotcdot has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks, I’m a newbie to the monastery and have what on the surface appears to be a very easy question, which has had me flummoxed for over a day so I decided to come to the light side to seek wisdom! I am trying to parse out results from a program that generates an output file like the one below.
Results:1582 1640 6 9.8 6 90 0 69 55 16 13 13 1.68 GACAAT GACAATGACAATGACAATGACAATGACAGAGACAGTAACAATAACAATAACAATAACAA
"Results:5184 5214 6 5.2 6 96 0 55 16 0 45 38 1.47 TGGTGA TGGTGATGGTGATGGTGATGGTGATGTTGAT";
The problem is that the code I have written to take out lines beginning with “R” and place the results into arrays seems to skip either 1, 2, or 3 Results lines depending on how it feels! Therefore, out of 137 Results lines it only ever picks out 69 or 74. Below is a section of the code, taken from a larger program that I wrote to do the job, hence the commented-out sections.

$TRID=0;
$SEQID=0;
#$PID=0;
$i=0;
#$line=<TR_INFILE>;
chomp $line;
while ($line =<TR_INFILE>) {
    if ($line =~/^R.*/) {
        $line=~s/^Results://g;
        #print "making TR arrays\n";
        print OUTFILE3 "$line";
        $trstart[$i] = (split(/\s*/,$line))[0];
        $trend[$i] = (split(/\s*/,$line))[1];
        $period[$i] = (split(/\s*/,$line))[2];
        $copy[$i] = (split(/\s*/,$line))[3];
        $consize[$i] = (split(/\s*/,$line))[4];
        $matches[$i] = (split(/\s*/,$line))[5];
        $indels[$i] = (split(/\s*/,$line))[6];
        $score[$i] = (split(/\s*/,$line))[7];
        $numa[$i] = (split(/\s*/,$line))[8];
        $numc[$i] = (split(/\s*/,$line))[9];
        $numg[$i] = (split(/\s*/,$line))[10];
        $numt[$i] = (split(/\s*/,$line))[11];
        $entropy[$i] = (split(/\s*/,$line))[12];
        #$TR_consensus[$i]= (split(/\s*/,$line))[13];
        #$TR_sequence[$i]= (split(/\s*/,$line))[14];
        $TRID++;
    }
    # elsif ($line =~/^P.*/){
    #     print "Making Parameter arrays\n";
    #     $line =~s/\s/\./g;
    #     $line =~s/^Parameters:\.//g;
    #     $trparameters[$i] = ($line)[0];
    #     $PID++;
    # }
    elsif ($line =~ /^S.*/) {
        # print "Making seqeunce arrays \n";
        $line =~s/^Sequence:\s*//;
        $TR_Accession[$i] = ($line)[0];
        $SEQID++;
    }
    else {
    }
    $i++;
    $line=<TR_INFILE>;
    chomp $line;
}
close TR_INFILE;

I will be grateful for all advice! I am sure it has something to do with the RegEx. Apologies for the bad layout. Thank you in advance, PC.

Re: RegEx misbehaving? (Not the regex)
by BrowserUk (Patriarch) on Jul 18, 2003 at 10:11 UTC

    Without more data to go on this is somewhat of a guess, but try commenting out the last two lines of the while loop

    $line=<TR_INFILE>;
    chomp $line;

    Each time around the loop, you read a line at the top of the while loop and process it. Then, when you reach these lines, you read another line from the file, chomp it, and loop back to the top to read yet another line. The line you read and chomped is therefore just discarded and never processed.
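
To see the effect in isolation, here is a minimal sketch. It reads from an in-memory string rather than TR_INFILE, the six input lines are made up, and count_processed and its $extra_read switch are hypothetical names used only for this illustration.

```perl
use strict;
use warnings;

# Count how many lines the loop body actually gets to process.
# $extra_read switches the second read at the bottom of the loop on or off.
sub count_processed {
    my ($text, $extra_read) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my $seen = 0;
    while (my $line = <$fh>) {
        $seen++;                        # the body processes this line
        $line = <$fh> if $extra_read;   # this line is read, then thrown away
    }
    close $fh;
    return $seen;
}

# Six made-up input lines, for illustration only.
my $data = join '', map { "Results:$_\n" } 1 .. 6;

print count_processed($data, 1), "\n";   # prints 3
print count_processed($data, 0), "\n";   # prints 6
```

With the extra read in place, every second line is consumed without ever reaching the if/elsif tests, so which Results lines go missing depends on where they fall in the file.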


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Re: RegEx misbehaving?
by broquaint (Abbot) on Jul 18, 2003 at 10:13 UTC
    You're reading from the file twice in your loop: first in the while condition, then again just before the end of the loop. Is this intentional? A few other suggestions: your match regex ^R.* can drop the .*, which achieves nothing, because the pattern says match 'R' at the beginning of the string followed by zero or more of anything. In fact, a simple index would be better suited. You could also drop the indexing using $i and replace it with the push function, e.g.

    while ( $line = <TR_INFILE> ) {
        if ( index( $line, 'R' ) == 0 ) {
            ## removed /g as it's unnecessary
            $line =~ s/^Results://;
            print OUTFILE3 "$line";
            ## the ' ' is special, see. perldoc -f split
            my @chunks = split ' ', $line;
            push @trstart, shift @chunks;
            push @trend, shift @chunks;
            push @period, shift @chunks;
            push @copy, shift @chunks;
            push @consize, shift @chunks;
            push @matches, shift @chunks;
            push @indels, shift @chunks;
            push @score, shift @chunks;
            push @numa, shift @chunks;
            push @numc, shift @chunks;
            push @numg, shift @chunks;
            push @numt, shift @chunks;
            push @entropy, shift @chunks;
            $TRID++;
        }
        elsif ( index($line, 'S') == 0 ) {
            $line =~ s/^Sequence:\s*//;
            push @TR_Accession, $line;
            $SEQID++;
        }
    }

    Some of that code massaging is style, but it will also be much faster than your current code, as most of the fiddly stuff is now done by perl, and it saves a lot of hard-coding. See push, index and shift for more info on the functions used above.
    HTH

    _________
    broquaint

      Dropping the indexing using $i and replacing it with push wouldn't be exactly the same, because at the moment $i is incremented on every line of the file, rather than just on those which begin with 'R'. In the original version the ith element of each array contained information about the ith line of the file, whereas using push would mean that the ith element of each array contained information about the ith line to begin with 'R'.
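
A small sketch of that difference, using four made-up lines in which only the second and fourth start with 'R'. The array names @by_index and @by_push are hypothetical, chosen for this illustration.

```perl
use strict;
use warnings;

# Made-up input: only the 2nd and 4th lines start with 'R'.
my @lines = (
    "Sequence: X\n",
    "Results:10 20\n",
    "Parameters: p\n",
    "Results:30 40\n",
);

my (@by_index, @by_push);
my $i = 0;
for my $line (@lines) {
    if (index($line, 'R') == 0) {
        (my $rest = $line) =~ s/^Results://;
        my ($start) = split ' ', $rest;
        $by_index[$i] = $start;   # slot follows the line's position in the file
        push @by_push, $start;    # slot follows the count of 'R' lines so far
    }
    $i++;                         # incremented on every line, as in the original
}

print scalar(@by_index), "\n";    # prints 4 (slots 0 and 2 are undef)
print scalar(@by_push), "\n";     # prints 2
```

Which behaviour is wanted depends on whether later code relies on the arrays lining up with file line numbers. If it does not, the push version avoids the undef gaps.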

Re: RegEx misbehaving?
by sgifford (Prior) on Jul 18, 2003 at 16:29 UTC
    Your split statement is wrong. Look:
    $ perl -e 'print join(" * ",split(/\s*/,"hi there")),"\n";'
    h * i * t * h * e * r * e
    
    The problem is that \s* matches zero or more spaces. That matches between every character, and so splits the results into single characters. What you want to split on is either /\s+/ or " ":
    $ perl -e 'print join(" * ",split(" ","hi there")),"\n";' 
    hi * there
    

    Unrelated, you could simplify this code greatly if you used a different data structure, such as this array of hash references @seq:

    @{$seq[$i]}{qw(trstart trend period copy consize matches indels score numa numc numg numt entropy)} = split(' ',$line);
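
As a slightly fuller sketch of the same idea, the hash-slice assignment can be wrapped in a small helper. The name parse_results is hypothetical, the field names are the ones from the line above, and the sample line is the first Results line from the original post with its wrapped sequence joined back together.

```perl
use strict;
use warnings;

my @fields = qw(trstart trend period copy consize matches indels
                score numa numc numg numt entropy);

# Parse one "Results:" line into a hash reference.
# Returns nothing for lines that do not start with "Results:".
sub parse_results {
    my ($line) = @_;
    return unless $line =~ s/^Results://;
    my %rec;
    @rec{@fields} = split ' ', $line;   # columns beyond the 13 named fields are dropped
    return \%rec;
}

my @seq;
my $sample = "Results:1582 1640 6 9.8 6 90 0 69 55 16 13 13 1.68 GACAAT "
           . "GACAATGACAATGACAATGACAATGACAGAGACAGTAACAATAACAATAACAATAACAA\n";

my $rec = parse_results($sample);
push @seq, $rec if $rec;

print "$seq[0]{trstart} $seq[0]{entropy}\n";   # prints "1582 1.68"
```

Each record then carries its own field names, so a lookup such as $seq[0]{score} replaces thirteen parallel arrays that all have to stay in step.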

    And as somebody has pointed out, your use of the diamond operator <TR_INFILE> is odd. You read a line at the end of the loop, then at the beginning of the loop you immediately discard this line and read a new one. You chomp the line at the end of the loop, but not the one at the beginning. This is probably not what you mean to do.

Re: RegEx misbehaving?
by pdotcdot (Acolyte) on Jul 18, 2003 at 11:24 UTC
    Thanks for all your comments, I truly have been enlightened! I'll give the ideas a go as soon as I am off my windoze machine and back on the penguin one with all the files on it.
Re: RegEx misbehaving?
by hossman (Prior) on Jul 19, 2003 at 01:09 UTC

    In addition to the numerous other things that have been pointed out, I can't help but notice that of the two sample lines of input you provide, one doesn't actually start with an 'R'...

    Results:1582 1640 6 9.8 6 90 0 69 55 16 13 13 1.68 GACAAT GACAATGACAATGACAATGACAATGACAGAGACAGTAACAATAACAATAACAATAACAA
    "Results:5184 5214 6 5.2 6 96 0 55 16 0 45 38 1.47 TGGTGA TGGTGATGGTGATGGTGATGGTGATGTTGAT";

    That second line starts with the '"' character, which means this regexp...

    if ($line =~/^R.*/) {

    ...will ignore that line, because the '^' says the line must start with the 'R'.

    Maybe this was just a typo in your post ... or maybe a bunch of lines of your data actually look like that, and are causing problems.
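
If some lines in the real data do turn out to carry those quote characters (an assumption, since it may only be a typo in the post), one way to tolerate them is to make the leading quote optional in the pattern. The helper name results_fields is hypothetical.

```perl
use strict;
use warnings;

# Strip an optional leading double quote plus the "Results:" tag, and an
# optional trailing quote-and-semicolon, then split on whitespace.
# Returns an empty list if the line is not a Results line.
sub results_fields {
    my ($line) = @_;
    return unless $line =~ s/^"?Results://;
    $line =~ s/";?\s*\z//;            # drop a trailing "; if one is present
    return split ' ', $line;
}

my @plain  = results_fields("Results:1582 1640 6 9.8\n");
my @quoted = results_fields(
    qq{"Results:5184 5214 6 5.2 6 96 0 55 16 0 45 38 1.47 TGGTGA TGGTGATGGTGATGGTGATGGTGATGTTGAT";\n}
);

print scalar(@plain), " $plain[0]\n";     # prints "4 1582"
print scalar(@quoted), " $quoted[0]\n";   # prints "15 5184"
```

Printing any line that matches neither the Results nor the Sequence test is a cheap way to find out whether such lines exist in the file at all.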

Re: RegEx misbehaving?
by pdotcdot (Acolyte) on Jul 21, 2003 at 13:03 UTC
    Thank you all for the above comments. This morning I got two different versions of the programs working together, combining all the suggestions above. Thanks again, PC