mr_dont has asked for the wisdom of the Perl Monks concerning the following question:

O.K. Monks, I wrote a script that uses File::Find to recursively walk a directory and change parts of the files it finds using a regex. I got some strange (at least to me) results from one particular substitution:

The guts of this substitution look like this:

$matchtext = "http://www.foo.bar/somepage.pl?";
$substitutetext = "http://www.new.site/somenewpage?";
s/$matchtext/$substitutetext/;

This code was giving me this result:
http://www.foo.bar/somepage.pl? was becoming
http://www.new.site/somenewpage.pl?? #note: 2 "?"'s

Oh yeah, I thought, I forgot about the "?" quantifier in regexes. So I reran my script with this...

$matchtext = "http://www.new.site/somenewpage.pl\?\?";
$substitutetext = "http://www.new.site/somenewpage.pl\?";
s/$matchtext/$substitutetext/;

...thinking that I would simply escape out the "?" character.

After running this script, the original string:
http://www.new.site/somenewpage.pl?
turned into: http://www.new.site/somenewpage.pl??l??

So, I am not sure why I am getting foo.pl??l?? after the regex substitution. Also, any suggestions on how to change my blah.pl??l?? back into blah.pl?

Thanks Monks!

Replies are listed 'Best First'.
Re: Regular Expressions and Question Marks
by rbc (Curate) on Mar 13, 2002 at 23:44 UTC
    Why not just leave the ?'s out?

    Will this work for you?

    $matchtext = "http://www.foo.bar/somepage.pl";
    $substitutetext = "http://www.new.site/somenewpage";
    s/$matchtext/$substitutetext/;
      Actually, yes, I should have just left them out!
Re: Regular Expressions and Question Marks
by tadman (Prior) on Mar 13, 2002 at 23:40 UTC
    Like artist says, to handle weirdies like ?, you should use the \Q and \E features of the regular expression system.

    Also, keep in mind that with double quotes, you are not really escaping your question marks at all. To see this, try printing your $matchtext. It's not what you intended! Compare the differences between the following two things:
    $matchtext1 = "http://www.new.site/somenewpage.pl\?";
    print "$matchtext1\n";
    $matchtext2 = 'http://www.new.site/somenewpage.pl\?';
    print "$matchtext2\n";
    Single quotes preserve your backslash; double quotes consume it, putting, as you asked, a literal question mark in your string. The single-quoted version instead contains a backslash followed by a question mark.

    Maybe that explains some of the mystery. For the rest, remember what ? does within a regex. You are in effect saying that the 'l' in "somepage.pl" is optional. Then you replace this text with something with an additional question mark. So, you get this:
    s/somenewpage.pl??/somenewpage.pl?/
    So, "somenewpage.p" turns into "somenewpage.pl?" and you still have the "l?" leftover. Hence, "somenewpage.pl?l?". Why does it do this? With a single question mark, it would be greedy and grab the 'l', since there is one there. The second question mark tells it to be conservative and use the shortest possible match, which in this case is the "0" part of "0 or 1 instances of" meaning of "?".
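The lazy match described above can be checked directly. This is only a minimal sketch: the variables $url, $matched, and $fixed are hypothetical names added for illustration, and the input is assumed to be the single-question-mark URL used in the explanation.

```perl
use strict;
use warnings;

# Double quotes drop the backslashes, so the pattern really ends in "l??".
my $matchtext      = "http://www.new.site/somenewpage.pl\?\?";
my $substitutetext = "http://www.new.site/somenewpage.pl\?";

my $url = 'http://www.new.site/somenewpage.pl?';   # assumed input

# Capture what the pattern actually matches: the lazy "l??" takes zero l's.
my ($matched) = $url =~ /($matchtext)/;
print "matched: $matched\n";

# Copy the input, then substitute; the unmatched "l?" is left behind.
(my $fixed = $url) =~ s/$matchtext/$substitutetext/;
print "result:  $fixed\n";
```

Printing the matched text first makes it easier to see that the substitution stops at "somenewpage.p" before the replacement is spliced in.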

    End result? Try this:
    $matchtext = "http://www.foo.bar/somepage.pl?";
    $substitutetext = "http://www.new.site/somenewpage?";
    s/\Q$matchtext\E/$substitutetext/;
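To see what \Q...\E does to the pattern, it may help to print the escaped form with quotemeta, the builtin that \Q is documented to apply. This is a sketch under assumptions: $line and $fixed are hypothetical names, and the "id=1" query string is made up for the example.

```perl
use strict;
use warnings;

my $matchtext      = 'http://www.foo.bar/somepage.pl?';
my $substitutetext = 'http://www.new.site/somenewpage?';

# quotemeta backslashes every non-word character, as \Q does inside a regex.
my $escaped = quotemeta($matchtext);
print "$escaped\n";

# Hypothetical input line containing the old URL plus a query string.
my $line = 'see http://www.foo.bar/somepage.pl?id=1';

# With \Q...\E the "." and "?" in $matchtext are matched literally.
(my $fixed = $line) =~ s/\Q$matchtext\E/$substitutetext/;
print "$fixed\n";
```

Only the pattern side needs \Q...\E; the replacement side is an ordinary interpolated string, so the "?" in $substitutetext is already literal.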
Re: Regular Expressions and Question Marks
by artist (Parson) on Mar 13, 2002 at 23:18 UTC
    Try this
    $matchtext = q(http://www.foo.bar/somepage.pl?);
    $substitutetext = "http://www.new.site/somenewpage\?";
    s/\Q$matchtext\E/$substitutetext/;
    Artist
      According to perlre:
      \E  end case modification (think vi)
      \Q  quote (disable) pattern metacharacters till \E

      Depending on what sort of text you want entered in $matchtext, this could be great or a disaster. If there are a few lines of code between the initialization and the regex, it would be pretty easy to forget that metacharacters are disabled. So if you take this route, make sure to leave a note for yourself where you set the variable. Running the wrong regex to diddle a bunch of files can ruin your day. (How good are your backups?)


      TGI says moo

Re: Regular Expressions and Question Marks
by TGI (Parson) on Mar 13, 2002 at 23:52 UTC

    This all boils down to an interpolation problem. In your second run, your '\?' substrings are interpolated into mere '?'.

    Try perl -e 'print "WTF\?\?\n"'. You should wind up with an output of WTF??.

    So WTF?? does this have to do with the problem above? Well, $matchtext = "http://www.new.site/somenewpage.pl\?\?" interpolates to http://www.new.site/somenewpage.pl??. So you are running this regex: s#http://www.new.site/somenewpage.pl??#http://www.new.site/somenewpage.pl?#, which isn't what you intended at all.

    Use a non-interpolating quoting mechanism like q// or '' when you specify your $matchtext and your $substitutetext. Also note that characters like ? . * are safe in the second half of a substitution. Of course, if you write the replacement directly in the s///, you'll still need to escape \ and whatever character you are using as a delimiter (typically /).
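As a rough illustration of the non-interpolating approach, the sketch below writes the escapes by hand inside q(), then uses \Q...\E to repair a string already damaged by the earlier runs. The names $link, $updated, $damaged, and $repaired are hypothetical, and the "id=1" query string is invented for the example.

```perl
use strict;
use warnings;

# q() leaves "\." and "\?" alone, so the backslashes reach the regex engine.
my $matchtext      = q(http://www\.foo\.bar/somepage\.pl\?);
my $substitutetext = q(http://www.new.site/somenewpage?);

my $link = 'http://www.foo.bar/somepage.pl?id=1';   # hypothetical input
(my $updated = $link) =~ s/$matchtext/$substitutetext/;
print "$updated\n";

# Undo the earlier damage: match the broken suffix literally with \Q...\E.
my $damaged  = q(.pl??l??);
my $repaired = 'http://www.new.site/somenewpage.pl??l??';
$repaired =~ s/\Q$damaged\E/.pl?/;
print "$repaired\n";
```

Hand-written escapes and \Q...\E should not be combined on the same variable, since \Q would then escape the backslashes themselves.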


    TGI says moo