venkatr_n has asked for the wisdom of the Perl Monks concerning the following question:

I have output from a program that looks like this
    # Output from 'compseq'
    #
    # The Expected frequencies are calculated on the (false) assumption that every
    # word has equal frequency.
    #
    # The input sequences are:
    # 2L
    Word size 1
    Total count 48795086
    #
    # Word  Obs Count  Obs Frequency  Exp Frequency  Obs/Exp Frequency
    #
    A       13340410   0.2733966      0.2500000      1.0935864
    C       10686861   0.2190151      0.2500000      0.8760604
    G       10692025   0.2191209      0.2500000      0.8764838
    T       13352116   0.2736365      0.2500000      1.0945460
    Other   723674     0.0148309      0.0000000      10000000000.0000000
Sometimes the Other line is missing, and the output ends with the T line. I'm trying to extract four values from this:
(1) The value next to Total count
(2) The first column next to G
(3) The first column next to C
(4) The first column next to Other, if it exists.
My regular expression looks like this:
$compseqOutput =~ m/Total\scount\s+(\d+).+? C\s+(\d+).+? G\s+(\d+).+? (?:Other\s+(\d+))?/sx
, where I'm trying to allow for the fact the Other... line might not be found in $compseqOutput. For some reason, it fails when I have this -- it works perfectly when the expression is
$compseqOutput =~ m/Total\scount\s+(\d+).+? C\s+(\d+).+? G\s+(\d+).+? Other\s+(\d+)/sx
, but obviously that doesn't do what I want. I know I can get around this in many ways, but why does the version with the optional group not work?

Re: Regular Expn Problem
by Roy Johnson (Monsignor) on May 10, 2004 at 03:35 UTC
    There's no reason to do this with one expression.
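    my ($tot, $gcount, $ccount, $ocount);
    while (<>) {
        # Each test is anchored at the start of a line, so order does not matter.
        $tot    = $1 if /^Total count\s+(\d+)/;
        $gcount = $1 if /^G\s+(\d+)/;
        $ccount = $1 if /^C\s+(\d+)/;
        $ocount = $1 if /^Other\s+(\d+)/;   # stays undef when there is no Other line
    }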
    You could do a series of if..elsif..elsifs, if you preferred.
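For what it's worth, here is a minimal sketch of the if..elsif form. The sub name parse_counts, the %count hash, and the abbreviated sample text are made up for illustration, and the text is read through an in-memory filehandle rather than from <>.

```perl
use strict;
use warnings;

# Abbreviated sample modeled on the output in the question (no Other line).
my $report = <<'END';
Total count 48795086
A 13340410 0.2733966
C 10686861 0.2190151
G 10692025 0.2191209
T 13352116 0.2736365
END

# Hypothetical helper: scan the text line by line and collect the counts.
sub parse_counts {
    my ($text) = @_;
    my %count;
    open my $fh, '<', \$text or die "cannot read in-memory text: $!";
    while (my $line = <$fh>) {
        if    ($line =~ /^Total count\s+(\d+)/) { $count{total} = $1 }
        elsif ($line =~ /^G\s+(\d+)/)           { $count{G}     = $1 }
        elsif ($line =~ /^C\s+(\d+)/)           { $count{C}     = $1 }
        elsif ($line =~ /^Other\s+(\d+)/)       { $count{Other} = $1 }
    }
    close $fh;
    return \%count;
}

my $count = parse_counts($report);
print "total=$count->{total} G=$count->{G} C=$count->{C}\n";
print "Other line ", (exists $count->{Other} ? "present" : "absent"), "\n";
```

Because a missing line simply never sets its key, the optional Other line needs no special handling here.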

    The PerlMonk tr/// Advocate
      I realize I can get around the issue, but I wanted to know why this didn't work.
Re: Regular Expn Problem
by BrowserUk (Patriarch) on May 10, 2004 at 03:36 UTC

    The problem is that with (?:Other\s+(\d+))? all being optional, and preceded by .+?, the regex doesn't need to match the conditional last element as the preceding element happily matches to the end of the string.

    One way to ensure that the last element is matched if it exists is to force the preceding element .+? to stop just before it when it does.

    $text =~ m[ Total\scount\s+(\d+).+? C\s+(\d+).+? G\s+(\d+) .+?(?=Other|$) (?:Other\s+(\d+))? ]sx;

    The alternation in the lookahead ensures that if the "Other" line exists, the final element of the regex is forced to match it; otherwise .+? runs on to the end of the string.

    You'll still need to check the last capture for undef to decide whether the "other" line was present or not.
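As a rough sketch of that undef check, the lookahead pattern above can be wrapped in a sub and run against text with and without the Other line. The sub name extract_counts and the abbreviated sample text are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Abbreviated sample modeled on the output in the question.
my $with_other = <<'END';
Total count 48795086
# Word Obs Count Obs Frequency
A 13340410 0.2733966
C 10686861 0.2190151
G 10692025 0.2191209
T 13352116 0.2736365
Other 723674 0.0148309
END

# The same text with the optional Other line removed.
(my $without_other = $with_other) =~ s/^Other.*\n//m;

# Hypothetical helper: returns (total, C, G, Other); Other is undef when absent.
sub extract_counts {
    my ($text) = @_;
    return unless $text =~ m[
        Total\scount\s+(\d+) .+?
        C\s+(\d+) .+?
        G\s+(\d+)
        .+?(?=Other|$)        # stop at Other if present, else at the end
        (?:Other\s+(\d+))?    # optional; capture 4 stays undefined when absent
    ]sx;
    return ($1, $2, $3, $4);
}

for my $text ($with_other, $without_other) {
    my ($total, $c, $g, $other) = extract_counts($text);
    printf "total=%s C=%s G=%s Other=%s\n",
        $total, $c, $g, defined $other ? $other : '(missing)';
}
```

Testing with defined rather than truth matters here, since an Other count of 0 would be a valid value.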


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      ... the preceding element [.+?] happily matches to the end of the string.

      No, it matches one character. Then (?:Other\s+(\d+))? matches the empty string, and then we reach the end of the pattern.
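A small sketch that makes this visible: wrapping the lazy .+? in an extra pair of capturing parentheses (an addition for illustration, not part of the original pattern) shows how much it consumed. The sub name probe and the shortened text are also made up.

```perl
use strict;
use warnings;

my $text = "G 10692025 0.2191209\nT 13352116 0.2736365\nOther 723674 0.0148309\n";

# Hypothetical helper: returns the length of what .+? matched and the Other capture.
sub probe {
    my ($str) = @_;
    return unless $str =~ m/G\s+(\d+) (.+?) (?:Other\s+(\d+))?/sx;
    return (length($2), $3);
}

my ($lazy_len, $other) = probe($text);
print "lazy part matched $lazy_len character(s)\n";
print "Other count: ", (defined $other ? $other : 'not captured'), "\n";
```

The overall match succeeds on the first attempt, so the engine never has a reason to go back and let .+? grow toward the Other line.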

      I was under the impression that the ? at the end of .+? makes it be "not greedy" and allows the conditional at the end to be matched, but now I'm confused. I can use the exact expression without the conditional at the end, and with .+ instead of .+? and the regexp works correctly. So what is the ? in .+? doing?

        You're right, the ? does make .+? non-greedy, but the (?:...)? says that you don't mind if the contained expression is missing, so as the .+? can match to the end (of any string), then no attempt is made to match the optional expression that follows it.

        Hmm. Maybe this makes more sense? The earlier expression does match to the end of string, and the later (rightmost) expression is optional, so no attempt is made to match the latter.


Re: Regular Expn Problem
by graff (Chancellor) on May 10, 2004 at 03:41 UTC
    I'm not sure I can explain why this is, but the "?" at the end of the regex is unnecessary -- in fact, it is redundant, because it's doing the same thing as the "(?:...)" operator. Get rid of the final "?" and I believe it will work. (It seemed to make a difference on a small test string that I tried.)

    update: Here's a simple demonstration -- compare the outputs of these two commands:

    perl -e '$_ = "one two three four"; print $1,$2,$/ if (/(one) .+?(?: three (four))/)'
    perl -e '$_ = "one two three four"; print $1,$2,$/ if (/(one) .+?(?: three (four))?/)'
    For me, the first one prints "onefour", and the second just prints "one". (Anyway, I think I prefer Roy Johnson's approach.)
      (?:) simply means the parenthesis is non-capturing. It is not a conditional.
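To illustrate with the same test string, here is a short sketch that runs both patterns in list context. The variable names and the show helper are made up for the example.

```perl
use strict;
use warnings;

my $s = "one two three four";

# (?:...) only groups; the match still requires " three four" to be present.
my @grouped  = $s =~ /(one) .+?(?: three (four))/;

# The trailing ? is what makes the group optional. The lazy .+? then stops
# after one character and the group matches nothing.
my @optional = $s =~ /(one) .+?(?: three (four))?/;

# Hypothetical helper: join captures, printing undef for unset ones.
sub show { return join ",", map { defined $_ ? $_ : "undef" } @_ }

print "grouped:  ", show(@grouped),  "\n";
print "optional: ", show(@optional), "\n";
```

Both matches return exactly two captures, which shows that (?:...) adds no capture group of its own.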
Re: Regular Expn Problem
by TilRMan (Friar) on May 10, 2004 at 08:07 UTC
    $compseqOutput =~ m/Total\scount\s+(\d+).+? C\s+(\d+).+? G\s+(\d+).+? (?:Other\s+(\d+).+)?$/sx

    You have to anchor the end of the string, or else the non-greedy match will happily ignore everything after "10692025", including the "Other" line.

Re: Regular Expn Problem
by Roy Johnson (Monsignor) on May 10, 2004 at 11:44 UTC
    You should include the last .+ in the grouping with Other:
    ... G\s+(\d+)(?:.+ Other\s(\d+))?/sx;
    Update: This problem is a good illustration of the difference between lazy and greedy. Perl regexps are both, and you can't fix a problem with one by addressing the other.

    Greediness is about gobbling up characters. Laziness is about backtracking. Perl will not backtrack unless it fails to match. So .+ is going to match clear to the end of the text (or line) unless there is something mandatory after it. .+? is going to match exactly one character unless there is something mandatory after it.

    Perl will not backtrack outside of an optional subexpression to try to match it, so any backtracking you want it to do should be inside the optional subexpression.
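A small sketch of the three cases on a made-up string (the string and variable names are illustrative only):

```perl
use strict;
use warnings;

my $s = "G 10 x Other 7 y";

# Lazy .+? followed by the optional group: the group is tried once, fails, and is skipped.
my ($lazy_gap, $lazy_other) = $s =~ /G\s+\d+(.+?)(?:Other\s+(\d+))?/;

# Greedy .+ followed by the optional group: .+ runs to the end, leaving nothing for the group.
my ($greedy_gap, $greedy_other) = $s =~ /G\s+\d+(.+)(?:Other\s+(\d+))?/;

# .+ moved inside the optional group: backtracking now happens within the group.
my ($g, $inside_other) = $s =~ /G\s+(\d+)(?:.+Other\s(\d+))?/;

printf "lazy:   gap of %d char(s), Other=%s\n",
    length($lazy_gap), defined $lazy_other ? $lazy_other : 'undef';
printf "greedy: gap of %d char(s), Other=%s\n",
    length($greedy_gap), defined $greedy_other ? $greedy_other : 'undef';
printf "inside: G=%s, Other=%s\n",
    $g, defined $inside_other ? $inside_other : 'undef';
```

Only the third pattern captures the Other count, because giving up characters from .+ is the only way for the group itself to succeed.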

