Allowing regex entries in web form to search database: Risks or gotchas?

by Polyglot (Hermit)
on Aug 08, 2022 at 17:10 UTC (#11146028)

Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

I have a research-oriented database, online, accessible via my own web interface and open to public use. The application is set up to allow read-only access to the database, the CGI script is hosted on a linux server, and the script is definitely not set as setuid. I am not allowing any use of nested executable code inside the regex, via the following sort of rules during the parsing of the query:

return "ERROR: For security and bandwidth reasons, query may not contain pure wildcards." if $SR_query =~ m/^[( ]*\.\s*(?:(?:\{\s*\d+\s*,?\s*\d*\s*})?|[*+?]*)[) ]*$/;
return "ERROR: Regex containing code disallowed." if $SR_query =~ m[\(\?\??\{];

Beyond these fundamental/basic protections against potential malicious actors, is there anything I might be blindly walking into by unleashing this capability in my website?

I have had to run a rather complicated subroutine on the query itself to prevent taint from objecting to it--even though the code is never "executed" other than being inserted into a m// to run against text drawn from the database prior to formatting the results for return to the browser. But this is a small price to pay for the very useful functionality of having regex-capable searches on the database.
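One point that bears on the code-block rule: perl itself refuses `(?{ ... })` and `(??{ ... })` in a pattern interpolated at run time unless `use re 'eval'` is in scope. Compiling the query inside an `eval` therefore gives a second line of defence and also catches plain syntax errors. A minimal sketch, where `compile_user_pattern` is a hypothetical helper name and not part of the script described above:

```perl
use strict;
use warnings;

# Hypothetical helper: compile a user-supplied pattern and report failure
# instead of dying. No "use re 'eval'" is in scope here, so perl rejects
# embedded code blocks when the pattern is interpolated.
sub compile_user_pattern {
    my ($query) = @_;
    my $re = eval { qr/$query/ };
    return $re;    # undef when the pattern did not compile
}
```

A caller would return an error page when the result is undefined and use the compiled object in the match otherwise. This only covers compilation; it says nothing about how long a compiled pattern takes to run.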




Replies are listed 'Best First'.
Re: Allowing regex entries in web form to search database: Risks or gotchas?
by dave_the_m (Monsignor) on Aug 08, 2022 at 21:39 UTC
    Perl's regex engine has evolved over 30+ years; it's huge and crusty, with large chunks nobody quite understands any more. There are many ways of writing regexes that will consume effectively infinite CPU unless you kill it off. Until recent perl releases, there were many bugs in the regex compiler that would overflow integers and do strange things, e.g. in patterns like /((((foo){2000}){2000}){2000})/. And that's just the bugs we know about.

    So I wouldn't want to allow the general public the ability to supply arbitrary patterns to a web server.

    Not all is lost however. Perl allows other regex engines to be plugged in. In particular the module re::engine::RE2 allows perl to use Google's RE2 regex engine. This doesn't support as many features as the perl engine, but in this case that's a plus.
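For the runaway-CPU case described above, one option besides swapping engines is to run each match in a child process and kill it when it overruns. The sketch below assumes a Unix-like host where `fork` is available; `match_with_timeout` is a hypothetical helper, and a child process is used because a plain `alarm` handler is normally deferred until the running match has finished.

```perl
use strict;
use warnings;
use IO::Select;
use POSIX ();

# Hypothetical helper: match $pattern against $text in a child process.
# Returns 1 (match), 0 (no match), or undef (timed out or bad pattern).
sub match_with_timeout {
    my ($pattern, $text, $timeout) = @_;

    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: compile and match, then report a single status byte.
        close $reader;
        my $re = eval { qr/$pattern/ };
        my $answer = !defined $re ? 'E' : $text =~ $re ? '1' : '0';
        syswrite $writer, $answer;
        close $writer;
        POSIX::_exit(0);    # skip END blocks and buffered output
    }

    # Parent: wait for the status byte, but no longer than $timeout seconds.
    close $writer;
    my $result;
    if (IO::Select->new($reader)->can_read($timeout)) {
        my $answer = '';
        sysread $reader, $answer, 1;
        $result = $answer eq '1' ? 1 : $answer eq '0' ? 0 : undef;
    }
    else {
        kill 'KILL', $pid;    # the match overran its budget
    }
    close $reader;
    waitpid $pid, 0;
    return $result;
}
```

Forking per match is expensive, so in practice the whole search for one request would more likely run in a single child; the sketch only shows the mechanism.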



      How much effect would limiting nested parentheses to two and {##} numbers to two digits have on that CPU resource hogging? Would there be an effective way of mitigating this?
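The limits asked about here could be checked before the query reaches the regex engine. The following is a rough sketch only: `structure_problem` is a hypothetical name, the scanner is deliberately simple, and passing it does not bound run time, since a pattern as shallow as `(a+)+` can still backtrack heavily.

```perl
use strict;
use warnings;

# Hypothetical pre-check: returns a reason string when the pattern exceeds
# the given nesting depth or quantifier size, and undef otherwise.
sub structure_problem {
    my ($pattern, $max_depth, $max_digits) = @_;

    (my $bare = $pattern) =~ s/\\.//gs;    # drop escaped characters such as \( and \{
    $bare =~ s/\[\^?\]?[^\]]*\]//g;        # drop bracketed character classes (rough)

    # Track how deeply the remaining parentheses nest.
    my ($depth, $deepest) = (0, 0);
    for my $char (split //, $bare) {
        $depth++ if $char eq '(';
        $depth-- if $char eq ')';
        $deepest = $depth if $depth > $deepest;
    }
    return "parentheses nested deeper than $max_depth" if $deepest > $max_depth;

    # Look at every {n}, {n,} and {n,m} quantifier.
    while ($bare =~ /\{\s*(\d*)\s*,?\s*(\d*)\s*\}/g) {
        return "quantifier count longer than $max_digits digits"
            if length($1) > $max_digits || length($2) > $max_digits;
    }
    return;    # no structural objection
}
```

With limits of two and two, `((a))` and `a{10,20}` pass while `(((a)))` and `a{100}` are reported.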

      This is the sort of helpful tip I'm looking for. It does little good to say ever so meaningfully: "You would be ill-advised to do this...." I'm looking for rational support to such a statement; as in, why is it inadvisable.

      Once potential pitfalls are identified, only then can one hope to address them. And I do hope to make things safer, albeit, not completely foolproof.

      I'm reminded of a setting provided to server administrators in shorewall's firewall management tools... something like "ADMIN_IS_ABSENT_MINDED = 1". Hah! It was supposed to keep the current connection open in case of a firewall restart with ill-advised settings that might have inadvertently locked even the admin out! It's simply never possible to make something completely foolproof, and I don't intend to try. But I do want to make it, at the very least, secure from hacker penetration. CPU resources are one thing. Gaining server admin privileges through a security hole is another.



Re: Allowing regex entries in web form to search database: Risks or gotchas?
by choroba (Archbishop) on Aug 08, 2022 at 18:27 UTC
    A user can still sneak a wildcard in, e.g.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      Yes, you're right. I'm thinking about whether to shore that up a little more or not. I actually needed a wildcard myself today, and snuck one in in my own way with:


      The main reason to guard against the wildcard privilege is simply to conserve server resources and to protect the client against a browser overload. Because there are multiple search fields that are interrelated (think of table joins), having a wildcard in one may actually be desirable so long as one of the correlated fields is sufficiently limiting. I could also address the issue, I suppose, by simply establishing a quota for max rows returned. I haven't really wanted to do that for several reasons, but, if server loading becomes problematic, that should solve it.

      Note, too, that those two rules are not the only ones in my list. I just picked them out for examples here.

      My main concern is server security. I would hate to inadvertently open myself up as a target for phishing websites, DoS attacks, etc. I've had to deal with such things in the past, but never yet from a perl script--and I'd hate to see this change. PHP, now, I haven't much good to say about its security.



Re: Allowing regex entries in web form to search database: Risks or gotchas?
by Jenda (Abbot) on Aug 08, 2022 at 19:12 UTC

    What's the underlying engine and who evaluates the regexps? The engine? If so you need to ask what's safe/unsafe for that engine, not for Perl.

    You should not look for dangerous stuff, you should check you only got safe stuff!

    1984 was supposed to be a warning,
    not a manual!


      I'm not entirely sure what you mean by "underlying engine." My script does the evaluation--I'm not depending on any third-party tools. This has as much to do with the fact that I can rarely understand how to implement others' modules as anything. (Object-oriented code baffles me.)

      The regex evaluation is fairly simple, and meant to allow virtually any arbitrary expression, with a few important exceptions such as not allowing the user to insert executable code into it. Giving the user freedom to enter his or her own regular expression is what makes the feature so attractive and powerful. There is no other way to properly find certain things without a good regex, and it would be impossible to pre-supply all potential regex forms that might be needed.

      Users have several simple options at their disposal that do not require the evaluation of a regular expression. For example, they may select for case sensitivity, the matching of whole words (i.e. \bWord\b), or to enter their own word/text delimiters. But these options will be ignored if the user chooses to use his or her own regular expression--in which case the matching of whole words, etc., would be left entirely to the user's own regex.

      As for "You should not look for dangerous stuff, you should check you only got safe stuff!", how would you propose to divide between these two? What defines "safe"? As with anything on this planet, even the safest of things can be made to be harmful when placed in the wrong hands. Because people could drown in water is no reason to withhold it and cause them to die of thirst!



        You wrote "database" so I assumed there's a database engine, say PostgreSQL, and that's where you store the data. If that were so, you could either use the regexps provided by that database engine, use Perl within that engine, or fetch all the data to be searched and evaluate the expressions within the script.

        It's you who defines safe and you need to decide what's safe for each individual use. The point is that instead of

        if ($input =~ /something I already know is dangerous/) { die 'I refuse to handle this!'; }
        you should always write
        if ($input !~ /^only stuff I know is fine$/) { die 'I refuse to handle this!'; }

        I can't give you a generic "this is unsafe" or a generic "this is safe" not knowing what happens to the $input afterwards. It's something you have to do. The thing is that it's much easier to forget to list something that's dangerous, than it is to accidentally allow something that's dangerous.
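To make the allow-list idea concrete for a regex query, one possibility is to accept only a small set of known tokens and reject everything else. The token set and the name `is_allowed_pattern` below are illustrative assumptions, not a recommendation of what is safe for any particular application.

```perl
use strict;
use warnings;

# Illustrative allow-list: every piece of the query must be one of these tokens.
my $TOKEN = qr{
      [A-Za-z0-9 ]          # literal letters, digits and spaces
    | \\ [bdswBDSW]         # a few escape classes and the word boundary
    | \\ [.*+?|()\[\]\\]    # escaped metacharacters
    | \( (?! [?*] )         # group opener, but not the extended or verb forms
    | [.|)]                 # any character, alternation, group close
    | [*+?]                 # simple quantifiers
}x;

# Hypothetical helper: returns 1 when the query consists only of allowed
# tokens (between 1 and 100 of them), and 0 otherwise.
sub is_allowed_pattern {
    my ($query) = @_;
    return $query =~ /\A(?:$TOKEN){1,100}\z/ ? 1 : 0;
}
```

Under this list `\bfoo(bar|baz)+\b` is accepted, while `(?{ ... })`, `(?R)` and `a{2000}` are rejected simply because nobody listed them, which is the asymmetry described above. It does not address run time: `(a+)+` still passes.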

        1984 was supposed to be a warning,
        not a manual!

Re: Allowing regex entries in web form to search database: Risks or gotchas?
by LanX (Sage) on Aug 08, 2022 at 18:59 UTC

      A few things jumped out at me in looking at the CPAN page you linked.

      1. The regex rules provided are not the same as those of Perl (e.g. swapping ?/. and |/,);
      2. The full ruleset is either not attested or is extremely limited/restricted;
      3. The module seems a little old (maybe predates some of the newer regex features);
      4. The word "separator" was consistently misspelled.

      Seeing as the syntactical changes are unnecessary, I would prefer to stick with a pure PCRE. Why use commas in place of pipes and question marks in place of periods when the functionality for these remains exactly the same? And what happened to the zero-or-one match capability provided by the question mark? What replaces that?

      Back to doing it my own way...TMTOWTDI doesn't mean I will like every other way of doing it.



        > The word "separator" was consistently misspelled.

        Schocking! 🧐

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery
