help in understanding odd regex match

jaco has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to create a regex based on the results of running another set of regexes over a given string,
and basically come up dynamically with a shortened regex that will match the original string.

The problem is I'm seeing some odd behavior.

#!/usr/bin/perl -w

my $string = '/path/file.htm?849578345908543095364b892a898374aff8490001289384add90e93448a839457582993cde90239a90459c3849aa8374f783477346723f38923487dd8923847a892837f783746543ff89283a439823cd948399134452&access_rights=1&mn_ord=yes&session_id=84';

$string =~ s/[^ a-zA-Z0-9_=&\-]/sprintf("\\%s", $&)/eg; # escape chars
print "$string\n"; #debug
$string =~ s/([A-Za-z0-9][^&\/\.]){10,}/sprintf(".*", $&)/eg;
$string =~ s/([0-9]){2,}/sprintf("\\d+", $&)/eg;
print "\n$string\n";#debug

The problem is it returns

\/path\/file\.htm\?.*2&access_rights=1&mn_ord=yes&session_id=\d+

I'm having difficulty understanding why it wouldn't return

\/path\/file\.htm\?.*&access_rights=1&mn_ord=yes&session_id=\d+


Replies are listed 'Best First'.
Re: help in understanding odd regex match by jweed (Chaplain) on Feb 20, 2004 at 06:34 UTC
There are a few things that are a bit screwy with this. You can easily avoid using `$&` here, and since it is really taxing on all other regexen, just don't. I'll point out how to do it along the way.

Substitution 1: `s/([^ a-zA-Z0-9_=&\-])/\\$1/g;`. Benefits: no `$&` to deal with (I added parens around the character class to compensate), and it drops a useless sprintf call.

Substitution 2: `s/([A-Za-z0-9][^&\/\.]){10,}/.*/g;`. Benefits: drops the useless and wrong sprintf call (the arguments you pass are screwy). But see below.

Substitution 3: `$string =~ s/([0-9]){2,}/\\d+/g;`. Benefits: again, no screwy sprintf.

There's a question about your second s/// though. Currently, you look for 10 or more pairs of one alphanumeric character and one non-special character. The reason you have `?.*2` in your actual string is that the run has an ODD number of characters, so your "pair" semantics leave one behind. We'll need more info before we can decide what you actually want here.

HTH.

Code is (almost) always untested.
http://www.justicepoetic.net/
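The pair behaviour described above can be reproduced on a much shorter input. This is only a sketch: the 21- and 22-character test strings are made up for illustration, and the single-class pattern `[A-Za-z0-9]{20,}` is an assumption about what might have been intended (a run of at least 20 alphanumerics), not something the original poster confirmed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical short inputs: 21 characters (odd) and 22 characters (even).
my $odd  = 'a' x 21;
my $even = 'a' x 22;

# Original pattern: every repetition of the group consumes TWO characters,
# so a match can only ever cover an even number of characters.
(my $odd_pairs  = $odd)  =~ s/([A-Za-z0-9][^&\/\.]){10,}/.*/g;
(my $even_pairs = $even) =~ s/([A-Za-z0-9][^&\/\.]){10,}/.*/g;

# Assumed alternative: a single character class, so every repetition
# consumes ONE character. {20,} keeps the same minimum length as 10 pairs.
(my $odd_single = $odd) =~ s/[A-Za-z0-9]{20,}/.*/g;

print "$odd_pairs\n";   # .*a  (one character left behind)
print "$even_pairs\n";  # .*
print "$odd_single\n";  # .*
```

Whether the single-class version is the right fix depends on what the second class `[^&\/\.]` was meant to allow, which the thread leaves open.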
Re: Re: help in understanding odd regex match by jaco (Pilgrim) on Feb 20, 2004 at 07:08 UTC
Thank you, I see the mistake I'm making now in the second s///. All better. Also, I was unaware that using $& was any worse than $1, $2. Thanks for pointing that out as well. I do appreciate the help.
Re: Re: Re: help in understanding odd regex match by duff (Parson) on Feb 20, 2004 at 21:39 UTC
`$&` slows down every regular expression in your program, whereas `$1` and friends slow down only the regular expressions that use them. The same goes for $` and $'. This is documented (albeit sparsely) in perlvar.

duff
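As a small illustration of avoiding `$&`, here are two equivalent ways to write the escaping step. The first uses a capture group, as suggested earlier in the thread. The second uses the `/p` flag with `${^MATCH}`, which arrived in Perl 5.10 and therefore postdates this thread; it is shown only as an option on newer perls. The short `$path` value is a made-up example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical short input, standing in for the long URL in the question.
my $path = '/path/file.htm?x=1';

# Capture group: only this regex pays for the capture.
(my $captured = $path) =~ s/([^ a-zA-Z0-9_=&\-])/\\$1/g;

# /p with ${^MATCH} (Perl 5.10 or later): the whole match without $&.
(my $p_flag = $path) =~ s/[^ a-zA-Z0-9_=&\-]/'\\' . ${^MATCH}/gpe;

print "$captured\n";  # \/path\/file\.htm\?x=1
print "$p_flag\n";    # \/path\/file\.htm\?x=1
```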
Re: help in understanding odd regex match by ysth (Canon) on Feb 20, 2004 at 06:02 UTC
I don't quite follow what you are trying to do, but I note that two of your calls to sprintf pass a second parameter but don't have a "%s" or other formatter that will use that second parameter.
Re: Re: help in understanding odd regex match by jaco (Pilgrim) on Feb 20, 2004 at 06:17 UTC
Thanks for pointing that out. It was just laziness on my part. Say I feed the script the $string, and in the end it outputs a regex. At that point I can store that regex and run it against other strings to find matches which are similar. The regex produced gives me enough unique info to make an exact path match, rather than slightly similar ones. Basically, leave in the static elements and .* out the dynamic elements. I'm well aware that it's a kludge, but I'm stuck with a large listing of these and I'm trying to make sorting (and sorting in the future) a little easier on me. I just have no idea why it won't match that last digit. Perhaps I've been looking at it too long.
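The workflow described here, storing the generated regex and running it against other strings, might look roughly like the sketch below. The stored pattern is the output the question was hoping for; the candidate strings are invented for illustration, and anchoring with `\A` and `\z` is an assumption about wanting whole-string matches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern the question was aiming for, kept as plain text (as it
# might be read back from a file).
my $stored = '\/path\/file\.htm\?.*&access_rights=1&mn_ord=yes&session_id=\d+';

# Assumption: anchor both ends so only whole-string matches count.
my $re = qr/\A$stored\z/;

# Hypothetical candidates: same shape, different file, different rights.
my @candidates = (
    '/path/file.htm?deadbeef0123456789abcdef&access_rights=1&mn_ord=yes&session_id=17',
    '/path/other.htm?deadbeef0123456789abcdef&access_rights=1&mn_ord=yes&session_id=17',
    '/path/file.htm?deadbeef&access_rights=2&mn_ord=yes&session_id=17',
);

# Keep only the strings the stored pattern matches.
my @matched = grep { $_ =~ $re } @candidates;
print "$_\n" for @matched;  # prints only the first candidate
```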