Hena has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I'm having a slight failure to understand regex. The answer I'm getting is not exactly what I think it should be. So here's the thing:
#!/usr/bin/perl
#
# pattern match testing
#
use warnings;
use strict;
# use re 'debug';
my $str = 'triple hybrid: dehyd: triple hybrid: dehydrofolate reductase reconstruction ;MI:0111';
$str =~ s/(.+): \1/$1/;
print "'$1'\n$str\n";
Now that code prints out what I expect it to:
'triple hybrid: dehyd'
triple hybrid: dehydrofolate reductase reconstruction ;MI:0111

But with s/([()\w\s-]+): \1/$1/; the output somehow is:
'd'
triple hybridehyd: triple hybrid: dehydrofolate reductase reconstruction ;MI:0111

Why does it match only 'd' with a given character class, while . matches everything from the beginning? If I add the anchor '^', then the second one fails.

PS. I'm leaving for a week-long holiday in about 2-3 hours, so I won't be able to comment after that. But I will read/comment after next week :).

Replies are listed 'Best First'.
Re: Understanding regex
by hv (Prior) on Apr 29, 2005 at 11:07 UTC

    The character class in your second example doesn't include ':' among the permitted characters, and this stops it being able to match the original duplicated substring. As a result it walks through the string looking for a duplicate that it can match, and 'd: d' is the first one it finds.

    To match the initial duplicate with the character class, add the colon to the character class:

    s/([:()\w\s-]+): \1/$1/;
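A minimal sketch comparing the patterns from this thread on the original string. The helper name try_pattern and the labels are made up for illustration; the anchored variant is the asker's '^' case.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The test string from the original post.
my $str = 'triple hybrid: dehyd: triple hybrid: dehydrofolate reductase reconstruction ;MI:0111';

# try_pattern() is a hypothetical helper, not from the thread: it runs one
# substitution on a copy of the text and returns ($1, modified text),
# with $1 undefined when the pattern does not match.
sub try_pattern {
    my ($re, $text) = @_;
    my $captured = ($text =~ s/$re/$1/) ? $1 : undef;
    return ($captured, $text);
}

my @patterns = (
    [ 'dot'                => qr/(.+): \1/ ],
    [ 'class, no colon'    => qr/([()\w\s-]+): \1/ ],
    [ 'anchored, no colon' => qr/^([()\w\s-]+): \1/ ],
    [ 'class with colon'   => qr/([:()\w\s-]+): \1/ ],
);

for my $p (@patterns) {
    my ($label, $re) = @$p;
    my ($captured, $result) = try_pattern($re, $str);
    my $shown = defined $captured ? "'$captured'" : 'no match';
    print "$label: $shown\n$result\n\n";
}
```

On this string the first and last patterns should capture 'triple hybrid: dehyd', the class without the colon captures 'd', and the anchored variant does not match at all, because a capture starting at the beginning can never reach past the first ':'.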

    Hugo

      You know, sometimes one just feels like an idiot :D. Thanks.

        Install YAPE::Regex::Explain if you don't already have it, as it can break out a regex into a more verbose format that can help you spot things like that.

        $ perl -MYAPE::Regex::Explain -le 'print YAPE::Regex::Explain->new(qr/([()\w\s-]+): \1/)->explain'
        The regular expression:

        (?-imsx:([()\w\s-]+): \1)

        matches as follows:

        NODE                     EXPLANATION
        ----------------------------------------------------------------------
        (?-imsx:                 group, but do not capture (case-sensitive)
                                 (with ^ and $ matching normally) (with . not
                                 matching \n) (matching whitespace and #
                                 normally):
        ----------------------------------------------------------------------
          (                        group and capture to \1:
        ----------------------------------------------------------------------
            [()\w\s-]+               any character of: '(', ')', word
                                     characters (a-z, A-Z, 0-9, _),
                                     whitespace (\n, \r, \t, \f, and " "),
                                     '-' (1 or more times (matching the most
                                     amount possible))
        ----------------------------------------------------------------------
          )                        end of \1
        ----------------------------------------------------------------------
          :                        ': '
        ----------------------------------------------------------------------
          \1                       what was matched by capture \1
        ----------------------------------------------------------------------
        )                        end of grouping
        ----------------------------------------------------------------------
Re: Understanding regex
by deibyz (Hermit) on Apr 29, 2005 at 11:09 UTC
    As you can see in the first example, ":" is part of the matched string, and you're excluding it in the second. That's why it only matches "d" on:

    triple hybrid: dehydrofolate
                ^--Here
    And it replaces the full "d: d" with just "d". Try adding ":" to your character class (s/([()\w\s:-]+): \1/$1/).
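To see where the class without the colon actually matches, one option is to read the match offsets from Perl's built-in @- and @+ arrays. This is only a sketch; the helper name locate is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The test string from the original post.
my $str = 'triple hybrid: dehyd: triple hybrid: dehydrofolate reductase reconstruction ;MI:0111';

# locate() is a hypothetical helper: it returns the start offset and the
# text of the whole match, or an empty list when the pattern does not match.
sub locate {
    my ($re, $text) = @_;
    return unless $text =~ $re;
    # $-[0] and $+[0] are the start and end offsets of the entire match.
    return ($-[0], substr($text, $-[0], $+[0] - $-[0]));
}

my ($offset, $matched) = locate(qr/([()\w\s-]+): \1/, $str);

# Print the string with carets under the matched span.
print "$str\n";
print ' ' x $offset, '^' x length($matched), " '$matched' at offset $offset\n";
```

The carets should land under the "d: d" that joins "hybrid" and "dehyd", which is the span the substitution collapses to a single "d".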
Re: Understanding regex
by Animator (Hermit) on Apr 29, 2005 at 11:44 UTC

    I wonder... do you know what .+ means? (I'm talking about greedy vs non-greedy)...
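For reference, a small sketch of greedy versus non-greedy capture with a backreference. On the string from the original post both forms seem to capture the same text, because only one repeat is possible from the start of the string. The second string is invented here purely to show a case where they differ, and the helper name capture is also made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# capture() is a hypothetical helper: it returns $1 for the first match,
# or undef when the pattern does not match.
sub capture {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

# The test string from the original post.
my $str = 'triple hybrid: dehyd: triple hybrid: dehydrofolate reductase reconstruction ;MI:0111';

# An invented string (not from the thread) with several possible repeats.
my $toy = 'ab: ab: ab: ab';

for my $text ($str, $toy) {
    my $greedy = capture(qr/(.+): \1/,  $text);   # longest capture that still works
    my $lazy   = capture(qr/(.+?): \1/, $text);   # shortest capture that still works
    print "greedy: '$greedy'\nlazy:   '$lazy'\n\n";
}
```

Greediness only decides which capture wins when more than one is possible at the same starting position; it does not let a character class match a character it excludes.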

    And why do you want to use another regex in the first place, if your first one prints what you expect?

    And why do you even bother posting a message before leaving?
    I doubt that anyone will still be looking at the node when you come back from your vacation... which means that if you have other questions about it you need to repost it... and/or that others have to re-look at the problem. IMHO both of those things are bad.

    Update: to clear up my re-posting point: if you still had trouble (when you came back) with the same regex/the same code (or wanted to reply to a comment), then you would have to re-post the question, since almost no one (or at least that's what I believe) looks back at nodes that are over a week old... But of course all this is irrelevant now.

      I was helping out another person with his regex and he can check it out. Besides, I got the answer before I left, so... Also, even if I had got an answer after I had left, I can come back to this node and see what people posted later that day. I don't need to repost it later.

      Using dot would've ruined another line that would have come along later (the pattern is applied to an input file line by line). I do have an ability to make insignificant mistakes, which tend to take hours to figure out, and the result always seems to be someone else saying 'hey, you're missing a , in there' or something similar :).