cosmicperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
  I've written a new cgi configuration module that generates the HTML config update forms, verifies the input and saves the data as perl variables.
  For the verification I use regexps (obviously). Regexps have never been my strong point, although recently they have been making a lot more sense. I understand $1, $`, $&, $', etc., but I'm not sure if I'm doing some things right. For example, I have a regexp that checks a path for illegal characters:-
unless ($input->{$key} =~ /^[a-zA-Z]?:?[^\<\>\:\"\|\?\*]+$/) { die "Error"; }#unless
If I want to show the user what didn't match to cause the error I'm having to do another regexp:-
unless ($input->{$key} =~ /^[a-zA-Z]?:?[^\<\>\:\"\|\?\*]+$/) { $input->{$key} =~ /[\<\>\:\"\|\?\*]/; die "Error, found $&"; }#unless
Am I doing it right? The ^[a-zA-Z]?:? is at the beginning as paths may be full Windows paths. Although I get the feeling I probably mean ^([a-zA-Z]:)? I'll have to test to be sure. Thinking about it, I need yet another regexp so the error doesn't complain about the : in a c:/...
unless ($input->{$key} =~ /^[a-zA-Z]?:?[^\<\>\:\"\|\?\*]+$/) { $input->{$key} =~ s/^[a-zA-Z]://; $input->{$key} =~ /[\<\>\:\"\|\?\*]/; die "Error, found $&"; }#unless
Surely there is an easier way to do it. Hoping one of you regexp gurus will help.

Lyle

Re: Showing why a Regexp didn't match
by pc88mxer (Vicar) on Apr 12, 2008 at 00:52 UTC
    Generally you should avoid using $`, $& and $'. Once they appear anywhere in your program they slow down all regular expression matches. You can achieve the same functionality by specifying your own captures, i.e.
    $input->{$key} =~ /[\<\>\:\"\|\?\*]/; die "Error, found $&";
    may be re-written as:
    $input->{$key} =~ /([\<\>\:\"\|\?\*])/; die "Error, found $1";
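As a rough sketch of the capture approach (the helper name first_illegal_char is made up for illustration, and the colon of a drive prefix is not special-cased here), matching in list context returns the captures directly, so $1 is never read after a failed match:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first illegal character in $path,
# or undef if none is found.
sub first_illegal_char {
    my ($path) = @_;
    # In list context a match returns its captures; a failed match
    # returns an empty list, which leaves $bad undefined.
    my ($bad) = $path =~ /([<>:"|?*])/;
    return $bad;
}

for my $path ('docs/readme.txt', 'docs/what?.txt') {
    my $bad = first_illegal_char($path);
    print defined $bad ? "Error, found $bad in $path\n" : "OK: $path\n";
}
```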
    Am I doing it right? The ^[a-zA-Z]?:? is at the beginning as paths may be full Windows paths. Although I get the feeling I probably mean ^([a-zA-Z]:)?
    I think you'll want ^([a-zA-Z]:)?. It depends on whether :\some\path is a valid path (i.e. a leading colon without a drive letter).

    The final method you've come up with is about the best you can do. For each error message that you can emit you have to develop a test for it. The only problem I see is that another way your validation test can fail is if there are no valid characters after the "drive:" prefix, e.g. "C:". In that case you'll want a different error message.
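One way the "one test per error message" idea might look is sketched below. The function name path_error and the message wording are assumptions for illustration, not part of the original module. The drive prefix is captured rather than stripped with s///, so the input is left unmodified:

```perl
use strict;
use warnings;

# Hypothetical validator: returns an error string, or undef when the
# path passes. The illegal characters mirror the class used above.
sub path_error {
    my ($path) = @_;

    # Split off an optional drive prefix such as "C:" so that its colon
    # is not reported as illegal. This pattern always matches.
    my ($drive, $rest) = $path =~ /^([a-zA-Z]:)?(.*)$/s;

    return "nothing follows the drive prefix '$drive'"
        if defined $drive && $rest eq '';
    return 'path is empty' if $rest eq '';

    # One extra test per distinct message we want to emit.
    if ($rest =~ /([<>:"|?*])/) {
        return "illegal character '$1'";
    }
    return;    # no error
}

for my $path ('C:/temp/file.txt', 'C:', 'C:/temp/a*b') {
    my $err = path_error($path);
    print defined $err ? "Invalid path $path: $err\n" : "OK: $path\n";
}
```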

    Honestly, I don't see what's wrong with just saying:

    die "Invalid path: $input->{$key}\n";
    By including the offending path in the output it should be easy for the user to tell what the problem is.

      Generally you should avoid using $`, $& and $'. Once they appear in your program they slow down all regular expression calls.
      This is true and good advice, but if you are using perl 5.10.0 or later there is the option of using the /p modifier and the variables ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH}, where the cost is confined to only the regular expressions that use them.

      In this case, though, there doesn't appear to be any advantage to using /p.
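For reference, a minimal sketch of the /p form, assuming perl 5.10.0 or later (the helper name describe_match and the sample path are made up for illustration):

```perl
use strict;
use warnings;
use 5.010;    # the /p modifier and ${^MATCH} need perl 5.10.0 or later

# Hypothetical helper: describes the first illegal character and the
# text that precedes it, or returns nothing if the path is clean.
sub describe_match {
    my ($path) = @_;
    return unless $path =~ /[<>:"|?*]/p;
    # ${^PREMATCH} and ${^MATCH} are only populated for matches that
    # were run with /p.
    return "found '${^MATCH}' after '${^PREMATCH}'";
}

my $msg = describe_match('temp/a|b.txt');
print "Error, $msg\n" if defined $msg;
```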

Re: Showing why a Regexp didn't match
by rhesa (Vicar) on Apr 12, 2008 at 01:32 UTC
    pc88mxer has already answered your specific questions. I thought I'd give you some tips to make using regexps a little easier on the eyes.

    1. You can create regexp objects with the qr operator. This allows you to build more complex regexps out of simple pieces, and to give the individual pieces names.
    2. Inside character classes, only ^, -, ] and \ are special. There's no need to escape the rest, which removes a lot of your backslashes.

    Example:

    my $drivespec = qr{[a-zA-Z]:};
    my $dir       = qr{[^<>":|?*]*};
    my $path      = qr{^ $drivespec? $dir $}x;

    unless( $input->{$key} =~ $path ) {
        ...
    }
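Building on the same qr pieces, here is one possible sketch of reporting why a path failed without a second character-class scan: match the longest valid prefix, then look at the character where the match stopped. The helper name first_rejected is hypothetical, and this is only one of several ways to do it.

```perl
use strict;
use warnings;

my $drivespec = qr{[a-zA-Z]:};
my $dir       = qr{[^<>":|?*]*};

# Hypothetical helper: returns (offset, character) for the first
# character the composed pattern cannot consume, or an empty list if
# the whole path is accepted.
sub first_rejected {
    my ($path) = @_;
    # Unanchored at the end, so this always matches: both pieces are
    # allowed to match the empty string.
    $path =~ m{^ $drivespec? $dir }x;
    my $offset = $+[0];    # end offset of the match just performed
    return if $offset == length($path);
    return ($offset, substr($path, $offset, 1));
}

for my $path ('C:/temp/file.txt', 'C:/temp/a*b') {
    my ($offset, $char) = first_rejected($path);
    print defined $char
        ? "Invalid path $path: '$char' at offset $offset\n"
        : "OK: $path\n";
}
```

Because the drive prefix is consumed by $drivespec, its colon is never reported, while a colon anywhere else still is.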

    This barely scratches the surface, of course. More reading at perlrequick, perlretut, perlre, and here in the Tutorials section.