scottb has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm using very large and complex regular expressions on very large and complex pieces of data. The regexes work great and are quite fast and efficient when they match. However, when they don't match, they eat up all the free cycles on my high-end server until Apache kills them (this is a CGI application). This seems to be directly related to the size of the data and the complexity of the regular expression. It's quite plausible that there's something wrong with the regular expressions being used, but because in this particular circumstance I am allowing users to enter the regular expressions, I would like to limit the time and/or cycles of the regex processing, no matter the regex.

I've tried to use an alarm() call wrapped inside an eval{} to act as a timeout, but it seems that because of the problems with the regex itself, the alarm is never received and/or handled properly. To summarize, it's used like this:

    eval {
        local $SIG{ALRM} = sub { die('ALARM'); };
        alarm($REGEX_TIMEOUT);
        @return = ($raw =~ /$regex/msgx);
        alarm(0);
    };
    if ($@ =~ /ALARM/) { ... }

So the questions here are obvious: what am I doing wrong that I am creating these conditions for a 'runaway regex'? How can I contain it and limit them to a maximum amount of time or processing? Why doesn't the alarm ever call die()?

TIA,
Scott

Replies are listed 'Best First'.
Re: Losing control of large regular expressions
by BUU (Prior) on Jan 12, 2005 at 00:18 UTC
    Owing to changes in recent perls (5.8+ I believe), signals no longer interrupt a single opcode's execution. A regex is a single opcode, so the alarm never interrupts it. One solution, as mentioned above, is to use unsafe signals, although I am unsure if it is merely an ENV variable or a compile option. As the name says, these are potentially unsafe as a signal may interrupt an opcode that isn't interruptible and thus crash perl, but this is a very rare case.

    Your other option involves using Regexp::Parser to create a new regex that has embedded time-checking functionality (by inserting (?{}) blocks), or forking a separate process and using various means (rlimits, etc.) to control the length of execution of the process.

    Note that running user defined regexes is HORRIBLY UNSAFE as the user may embed any perl code he wishes in the regex.
      Note that running user defined regexes is HORRIBLY UNSAFE as the user may embed any perl code he wishes in the regex.

      Not true, at least by default. Perl won't let you do that unless you explicitly use re 'eval'. Think of it as tainting for regexps.

      The following script shows this:

          #! /usr/local/bin/perl -w

          use strict;

          my $re = shift || '.';
          $re = qr/$re/;

          while( <DATA> ) {
              print if /$re/;
          }

          __DATA__
          Owing to changes in recent perls (5.8+ I believe), signals no longer
          interrupt a single opcode's execution. A regex is a single opcode, so
          the alarm never interrupts it. One solution, as mentioned above,
          is to use unsafe signals, although I am unsure if it is merely an
          ENV variable or a compile option. As the name says, these are
          potentially unsafe as a signal may interrupt an opcode that isn't
          interruptible and thus crash perl, but this is a very rare case.

      When run, the above produces the following output:

          % ./extreg '\bs.*ls\b'
          Owing to changes in recent perls (5.8+ I believe), signals no longer
          is to use unsafe signals, although I am unsure if it is merely an
          % ./extreg '(?{system "rm -rf *"})'
          Eval-group not allowed at runtime, use re 'eval' in regex m/(?{system "rm -rf *"})/ at ./extreg line 6.

      Perl may be crazy at times, but it is not insane. But yeah, you are right though, it does make me nervous.

      - another intruder with the mooring in the heart of the Perl

        It is true that Perl protects you by default against arbitrary code execution in regular expressions. However, it does not protect you against denial of service, because a regular expression may be crafted not to finish before the heat death of the universe. To give a simple example, based on perlre, the following takes over 1 min on my machine, and the execution time increases exponentially with string length:

            perl -le 'print scalar "12345678901234" =~ /((.{0,5}){0,5}){0,5}[\0]/'

      As per my response to borisz, merely setting the ENV variable did not work. If it's not going to be easily portable, I'm on the hunt for better options.

      The second option sounds interesting, but it would definitely take some work to determine an algorithm for placing the time checks within unknown regexes. The rlimits approach is a totally new one to me and, based on a little searching, seems a complex approach... but another thing to try before giving up.

      Rest assured I am aware of the risks of running user defined regexes and am testing for ?{}. It's also not being used in a 'hostile' environment.

      Thanks
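
The check for ?{} mentioned above could look something like the following. This is only a minimal sketch: compile_user_regex is a hypothetical helper name, not something from the thread, and the /msx flags are an assumption borrowed from the original question. It rejects embedded code blocks explicitly and also relies on Perl's default refusal to compile interpolated eval groups.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a compiled regex,
# or undef if the pattern is rejected or does not compile.
sub compile_user_regex {
    my ($pattern) = @_;

    # Refuse embedded code blocks outright: (?{ ... }) and (??{ ... }).
    # This check is deliberately conservative.
    return undef if $pattern =~ /\(\?\??\{/;

    # "use re 'eval'" is NOT in effect here, so Perl itself would also
    # refuse an interpolated eval group. Syntax errors are trapped too.
    my $re = eval { qr/$pattern/msx };
    return $re;    # undef if compilation died
}

my $re = compile_user_regex('b+');
print "accepted\n" if defined $re;
```

Note that this only guards against code execution and syntax errors. It does nothing about patterns that backtrack for a very long time.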

        scottb,
        The ability to change the signal behavior using an environment variable depends on the version of Perl. In >= 5.8.1 it works. If you have a Perl that meets that criterion and it is not working, then the cause is likely something else. You will also want to make sure the variable is exported.

        From perldoc perlipc
        If you want the old signal behaviour back regardless of possible memory corruption, set the environment variable "PERL_SIGNALS" to "unsafe" (a new feature since Perl 5.8.1).

        Cheers - L~R
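
As an alternative to switching the whole process to unsafe signals, a single handler can be installed through POSIX::sigaction. As far as I understand the POSIX documentation, a handler installed this way is delivered immediately unless the safe flag is set on the POSIX::SigAction object, so it carries the same memory-corruption caveat as the perlipc passage quoted above. The sketch below assumes that behaviour; with_timeout is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;
use POSIX qw(SIGALRM);

# Hypothetical helper (not from the thread). Runs $work->() with an alarm
# and returns ('ok', @results) or ('timeout').
sub with_timeout {
    my ($seconds, $work) = @_;

    # Install the handler via sigaction rather than %SIG, and keep the
    # previous action so it can be restored afterwards.
    my $old = POSIX::SigAction->new;
    my $new = POSIX::SigAction->new(sub { die "ALARM\n" });
    POSIX::sigaction(SIGALRM, $new, $old) or die "sigaction failed: $!";

    my @results;
    my $ok = eval {
        alarm($seconds);
        @results = $work->();
        alarm(0);
        1;
    };
    my $error = $@;

    alarm(0);                            # make sure no alarm is left pending
    POSIX::sigaction(SIGALRM, $old);     # restore the previous action

    return ('ok', @results) if $ok;
    return ('timeout')      if $error eq "ALARM\n";
    die $error;                          # some other failure: re-throw
}

my ($status, @found) = with_timeout(5, sub { "abc" =~ /(b)/ });
print "$status: @found\n";
```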

      Note that running user defined regexes is HORRIBLY UNSAFE as the user may embed any perl code he wishes in the regex.
      Nope. Not true. That's what Ilya first wanted when he introduced /(?{ })/, but that was quickly shot down by p5p because of its security hazards. Arbitrary code is only executed if either of the following cases is true:
      • /(?{ })/ or /(??{ })/ appears in the source code itself (thus not because of interpolation).
      • use re 'eval'; is in effect.
      Watch:
          $ perl -wle '"" =~ /(?{print "Fooled you!"})/'
          Fooled you!
          $ perl -wle 'use re "eval"; my $re = shift; "" =~ /$re/' '(?{print "Fooled you!"})'
          Fooled you!
          $ perl -wle 'my $re = shift; "" =~ /$re/' '(?{print "Fooled you!"})'
          Eval-group not allowed at runtime, use re 'eval' in regex m/(?{print "Fooled you!"})/ at -e line 1.
Re: Losing control of large regular expressions
by sleepingsquirrel (Chaplain) on Jan 12, 2005 at 00:49 UTC
    You can run the regex in another process...
        #!/usr/bin/perl

        eval {
            local $SIG{ALRM} = sub { die('ALARM'); };
            alarm(5);

            #use forking open to start another process
            unless ($pid = open REGEX, "-|") {
                print "starting long regex match...\n";
                #This string will take close to forever to match
                $_ = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
                print "matched\n" if /a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*[b]/;
                exit;
            }
            while (<REGEX>) #read results from long running process
            {
                print "$_";
            }
            alarm(0);
        };
        if ($@ =~ /ALARM/) {
            print "got alarm\n";
            kill 9, $pid;
        }
        else {
            print "no alarm\n"
        }


    -- All code is 100% tested and functional unless otherwise noted.
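
A variant of the forking approach, sketched here under a few assumptions, puts the time limit in the child instead of the parent. If the child has no Perl-level handler for SIGALRM, the signal keeps its default action and the operating system terminates the process directly, so deferred signal delivery does not come into it. run_in_child is a hypothetical helper name, and passing results back one per line is a simplification that breaks if a match contains a newline.

```perl
use strict;
use warnings;
use POSIX qw(SIGALRM);

# Hypothetical helper (not from the thread). Runs $work->() in a forked
# child and returns ('ok', @lines), ('timeout'), or ('failed').
sub run_in_child {
    my ($seconds, $work) = @_;

    # Forking open: the child's STDOUT becomes the parent's read handle.
    my $pid = open(my $from_child, '-|');
    die "cannot fork: $!" unless defined $pid;

    if ($pid == 0) {
        # Child. With the default action for SIGALRM, the OS kills the
        # process when the alarm fires, even in the middle of a match.
        $SIG{ALRM} = 'DEFAULT';
        alarm($seconds);
        print "$_\n" for $work->();
        exit 0;
    }

    # Parent. Reading ends when the child exits or is killed.
    chomp(my @lines = <$from_child>);
    close($from_child);                 # reaps the child and sets $?

    return ('timeout') if ($? & 127) == SIGALRM;
    return ('failed')  if $? != 0;
    return ('ok', @lines);
}

my ($status, @found) = run_in_child(5, sub { "foo=1 bar=2" =~ /(\w+)=(\d+)/g });
print "$status: @found\n";
```

The parent needs no alarm or kill of its own here, because the read loop ends as soon as the child is gone.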
Re: Losing control of large regular expressions
by Anonymous Monk on Jan 12, 2005 at 10:45 UTC
    Setting the environment variable PERL_SIGNALS to "unsafe" works, but it must be set before the script is started. What you could do is put something like this at the top of your script:
        unless ($ENV {PERL_SIGNALS} && $ENV {PERL_SIGNALS} eq "unsafe") {
            $ENV {PERL_SIGNALS} = "unsafe";
            exec $0, @ARGV;
        }
    This will look at your environment, and if the environment variable isn't set to the value you want, it sets it and execs itself. Now your alarm ought to work (assuming your Perl is 5.8.1 or newer).
      Thanks for this bit of info, it was the final key, I now have it working the way I want, with timeouts. I am going to do more reading on regexes to ensure I know everything I can before I ask any regex specific questions (re: the regexes that are causing this problem).

      Thank you everyone.

      - Scott

Re: Losing control of large regular expressions
by borisz (Canon) on Jan 11, 2005 at 23:52 UTC
    Try setting the environment variable PERL_SIGNALS: export PERL_SIGNALS="unsafe". This may help.
    Boris
      I assume you mean:

          $ENV{PERL_SIGNALS} = "unsafe";

      Thanks for the idea, but no luck, I am afraid.

        I believe that you have to set the environment variable before invoking Perl, not within Perl. My reading of perl.c suggests that Perl sets the appropriate internal flag before parsing the script.

Re: Losing control of large regular expressions
by BrowserUk (Patriarch) on Jan 12, 2005 at 01:07 UTC

    Got any samples of the type of regex that is causing the problem?

    Also, which OS are you running?


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.