Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have some code that accepts a string from a user to use as a regular expression in a filter. (E.g if they want the script to output a list of items beginning abc they would enter ^abc .)

This bit works well when they enter a valid regular expression. However I'm struggling with a way of validating the regexp prior to using it (ie make sure they've entered a valid regexp, not worried at this stage what it matches just that it's a valid string). The best I've come up with it

use strict; use warnings; print validate('a*') . "\n"; print validate('*') . "\n"; sub validate{ my $pat = shift; if (eval{qr/$pat/}){ return 1; }else{ return 0; } }

This works but feels like a rather ugly way to do this. Can anyone think of a better way?

I can't just use \Q and \E, I want them to be able to enter meta characters.

Thanks

Graham

Replies are listed 'Best First'.
Re: Validating Regular Expression
by davido (Cardinal) on Feb 15, 2011 at 16:29 UTC

    Eval is great for getting perl to tell you if something is technically legal or not. But I have a concern.

    The task is a little more complex than that. You will have to prevent the (?{code}) construct from finding its way into your user's input, lest they manage to inject some unsavory tidbit into your system. That's just the start. Will your users have the ability to input RE's that are expressed in /x terms? (In other words, where whitespace is not relevant.) They could always use (?x:.....), or any of the other legal options to sort of control the RE's destiny, possibly in a way that you weren't intending. I think that the greater task is not in validating the RE's viability, so much as evaluating its security risk.

    As for validating its viability, eval isn't such a bad tool. You would be able to let perl tell you if there's a problem. But perl cannot know if there's a security breech.

    I don't know all the possible risks that you should test for. But at minimum disallowing code blocks would be a start.


    Dave

      As mentioned, patterns that execute arbitrary code are disabled by default when interpolating. There is one thing to keep in mind though: It's easy to construct a pattern that will take longer than the life of the universe to (fail to) match.
      I could just be timid, but I would not trust myself to filter this type of input from an untrusted source. Nefarious people are sometimes smarter than me and always have more time and motivation, so I would not put this tool anywhere they could use it to elevate permissions. Sure you could swap my routine above to
      sub validate{ my $pat = shift; return if $pat =~ /\(\?{1,2}\{/; return eval{qr/$pat/}; }
      to filter out code evaluation and pattern code expressions (A bit of magic: executing Perl code in a regular expression), but I'm sure I'd miss some clever escape or exploit of the regex engine. This is the sort of thing where the feature is not worth the effort to secure it properly.

        Hi,

        I'm not too worried about them trying to attack using this. It's their own PC it's running on! Would be a pain if it was a web application or similar but this is just a local Tk app.

        If they can escalate privileges in any way then that's a fault in the OS or the IT setup. Either way, Not My Problem (Tm)!!

        Graham

      You will have to prevent the (?{code}) construct from finding its way into your user's input, lest they manage to inject some unsavory tidbit into your system.
      That's not more complicated than not typing use re 'eval';. That is, by default, Perl doesn't honour (?{code}) constructs in interpolated code.

      Hi,

      Thanks for the answers. I'm not too worried about security, it's a real application running on the users on PC (Tk GUI). Only thing the user's going to be able to break is their own PC, that's between them and their IT guys!! If it was for a website or similar public access I'd be more concerned.

      Graham

Re: Validating Regular Expression
by kennethk (Abbot) on Feb 15, 2011 at 16:25 UTC
    Assuming you don't want to implement a regular expression interpreter, then I'd bet your code is fairly close to optimal. You could simplify your test to
    use strict; use warnings; for ('a*', '*') { print validate($_) ? "Pass: '$_'\n" : "Fail: '$_'\n"; } sub validate{ my $pat = shift; return eval{qr/$pat/}; }
    where it returns the compiled regular expression if it succeeds (TRUE) or undef if it fails (FALSE). (In general IMHO, it's a good idea to just return; on failure, since return 0; will evaluate to TRUE in list context. will return a non-empty list if the sub is evaluated in list context)

    Blah blah, security, blah blah, trusted source, blah blah, injection attack, blah blah, perlsec.

    Update: In deference to ikegami's comment, I have revised the above; however I would point out that Perl Best Practices agrees with me (point 124 - Use a bare return to return failure). I think we can both agree that the interface you specify should depend on how you are going to use it.

      In general, it's a good idea to just return; on failure, since return 0; will evaluate to TRUE in list context.

      First, that's not true. You can't test the truthfulness of a something evaluated in list context, so how can it evaluate to true? I think you are talking about list assignment in scalar context.

      sub is_valid { 0 } is_valid() or die "invalid"; # Ok (dies) my $invalid = is_valid() or die "invalid"; # Ok (dies) my ($invalid) = is_valid() or die "invalid"; # XXX (Doesn't die!)

      The last is purely the coder doing something wrong.

      Secondly, I disagree. A boolean is expected, so return that. Otherwise, you create a function that needlessly returns differently when evaluated in scalar and list contexts. You create a function that returns something differently than expected (and documented?). You rely on the caller to create a boolean from the value you do return. For example, you are forcing people to write

      %data = ( foo => !!is_foo(), bar => !!is_bar(), );

      instead of

      %data = ( foo => is_foo(), bar => is_bar(), );

      Returning an empty list would cause the buggy assignment in scalar context to DWIM, but it breaks valid code.

      Update: Improved list assignment in scalar context example.

      I think we can both agree that the interface you specify should depend on how you are going to use it.

      That's entirely my point. One would use the result of validate as a boolean, so it should return a boolean. You suggested that it should return something that needs to be converted into a boolean first.

      Use a bare return to return failure

      Return an empty list can be useful to signal failure where a false value could be returned.

      sub get { $engine->connect() or return (); return $engine->get(); } my $ok = ($data) = get();

      That's not relevant here.

        It only needs to be converted to a boolean first if it is used in a list assignment. Any usage of the sub in scalar context will behave as expected and as the original spec. We are both making assumptions about how "[o]ne would use the result", and any conclusions are subjective and academic.