jaco has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to create a regex based on running another set of regexes on a given string, and basically dynamically come up with a shortened regex that will match the original string.

The problem is I'm seeing some odd behavior.

#!/usr/bin/perl -w
my $string = '/path/file.htm?849578345908543095364b892a898374aff8490001289384add90e93448a839457582993cde90239a90459c3849aa8374f783477346723f38923487dd8923847a892837f783746543ff89283a439823cd948399134452&access_rights=1&mn_ord=yes&session_id=84';
$string =~ s/[^ a-zA-Z0-9_=&\-]/sprintf("\\%s", $&)/eg; # escape chars
print "$string\n"; #debug
$string =~ s/([A-Za-z0-9][^&\/\.]){10,}/sprintf(".*", $&)/eg;
$string =~ s/([0-9]){2,}/sprintf("\\d+", $&)/eg;
print "\n$string\n";#debug


The problem is it returns
\/path\/file\.htm\?.*2&access_rights=1&mn_ord=yes&session_id=\d+
I'm having difficulty understanding why it wouldn't return
\/path\/file\.htm\?.*&access_rights=1&mn_ord=yes&session_id=\d+

Replies are listed 'Best First'.
Re: help in understanding odd regex match
by jweed (Chaplain) on Feb 20, 2004 at 06:34 UTC
    There are a few things that are a bit screwy with this:
    • You can easily avoid using $& here, and since it is really taxing on all other regexen, just don't. I'll point out how to do it along the way.
    • Substitution 1: s/([^ a-zA-Z0-9_=&\-])/\\$1/g;. Benefits: No $& to deal with (added parens around the character class to compensate), as well as avoiding a useless sprintf call.
    • Substitution 2: s/([A-Za-z0-9][^&\/\.]){10,}/.*/g;. Benefits: Stops the useless and wrong sprintf call (the arguments that you pass are screwy). But see below.
    • Substitution 3: $string =~ s/([0-9]){2,}/\\d+/g;. Benefits: Again, no screwy sprintf.
    There's a question about your second s/// though: Currently, you look for 10 or more pairs of one alphanumeric character and one non-special character. The reason why you have ?.*2 in your actual string is because you have an ODD number of characters. Your "pair" semantics leave one behind, therefore. We'll need more info before we can decide what you actually want here.
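    The pair behaviour is easier to see on a short string. The following is only a sketch: the inputs are made-up hex-like strings (21 and 20 characters), and the last substitution assumes that what is wanted is "collapse any run of 20 or more alphanumerics", which the original post does not confirm.

```perl
use strict;
use warnings;

# Hypothetical short inputs standing in for the long hex blob.
my $odd  = 'abc123abc123abc123abc';    # 21 characters (odd)
my $even = 'abc123abc123abc123ab';     # 20 characters (even)

# The original pattern consumes characters two at a time.
(my $odd_pairs  = "?$odd&x=1")  =~ s/([A-Za-z0-9][^&\/\.]){10,}/.*/g;
(my $even_pairs = "?$even&x=1") =~ s/([A-Za-z0-9][^&\/\.]){10,}/.*/g;

# Assumed alternative: match the run one character at a time.
(my $odd_run = "?$odd&x=1") =~ s/[A-Za-z0-9]{20,}/.*/g;

print "$odd_pairs\n";     # ?.*c&x=1  (last character left behind)
print "$even_pairs\n";    # ?.*&x=1
print "$odd_run\n";       # ?.*&x=1
```

    With an odd-length run the final character cannot start a new pair, because the character after it is the excluded "&", so it stays in the output.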

    HTH.



    Code is (almost) always untested.
    http://www.justicepoetic.net/
      Thank you,

      I see the mistake I'm making now in the second s///. All better.

      Also, I was unaware that using $& was any worse than $1, $2.
      Thanks for pointing that out as well.

      I do appreciate the help.

        $& slows down every regular expression in your program, whereas $1 and friends only slow down the regular expressions that use them. The same goes for $` and $'. This is documented (albeit sparsely) in perlvar.
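        For anyone reading this later, here is a small sketch of two ways to get at the matched text without $&. The sample URL is invented. The /p flag and ${^MATCH} need Perl 5.10 or newer, and perlvar for recent Perls says the $& penalty itself was removed in 5.20 by copy-on-write, so this matters mostly on older installations.

```perl
use strict;
use warnings;

my $url = '/path/file.htm?id=84';    # hypothetical input

# Capture group instead of $&: only this regex pays for the capture.
(my $escaped = $url) =~ s/([^ a-zA-Z0-9_=&\-])/\\$1/g;

# /p plus ${^MATCH} exposes the whole match without using $&.
my $digits = '';
if ($url =~ /[0-9]{2,}/p) {
    $digits = ${^MATCH};
}

print "$escaped\n";    # \/path\/file\.htm\?id=84
print "$digits\n";     # 84
```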

Re: help in understanding odd regex match
by ysth (Canon) on Feb 20, 2004 at 06:02 UTC
    I don't quite follow what you are trying to do, but I note that two of your calls to sprintf pass a second parameter but don't have a "%s" or other formatter that will use that second parameter.
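    A minimal illustration of the point, with made-up argument values: when the format has no conversion such as %s, the extra argument simply does not appear in the result. The warnings are switched off around that call because newer Perls may complain about the redundant argument.

```perl
use strict;
use warnings;

# Format contains %s: the argument is substituted into the result.
my $with_format = sprintf("\\%s", "?");

my $without_format;
{
    no warnings;    # newer Perls may warn about the unused argument
    # No conversion in the format: the second argument is dropped.
    $without_format = sprintf(".*", "84957834");
}

print "$with_format\n";       # \?
print "$without_format\n";    # .*
```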
      Thanks for pointing that out. It was just laziness on my part.

      Say I feed the script the $string, and in the end it outputs a regex. At that point I can store the regex and run it against other strings to find matches that are similar. The regex produced gives me enough unique info to make an exact path match, rather than slightly similar ones. Basically, leave in the static elements and .* out the dynamic elements.
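      Roughly, the second half of that workflow might look like the sketch below. The pattern text is the output I am hoping for, the candidate strings are invented, and anchoring with \A and \z is an assumption about what "exact path match" should mean.

```perl
use strict;
use warnings;

# The stored pattern text, as the generator is hoped to emit it.
my $pattern = '\/path\/file\.htm\?.*&access_rights=1&mn_ord=yes&session_id=\d+';

# Compile once; the \A ... \z anchors are an assumption.
my $re = qr/\A$pattern\z/;

# Hypothetical strings to sort through.
my @candidates = (
    '/path/file.htm?deadbeef0123&access_rights=1&mn_ord=yes&session_id=17',
    '/path/other.htm?deadbeef0123&access_rights=1&mn_ord=yes&session_id=17',
    '/path/file.htm?cafe&access_rights=0&mn_ord=yes&session_id=17',
);

# Keep only the strings that share the static parts of the original.
my @similar = grep { $_ =~ $re } @candidates;
print "$_\n" for @similar;
```

      Only the first candidate shares all of the static parts, so it is the only one kept.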

      I'm well aware that it's a kludge, but I'm stuck with a large listing of these and I'm trying to make sorting (and sorting in the future) a little easier on me.

      I just have no idea why it won't match that last digit. Perhaps I've been looking at it too long.