Allowing regex entries in web form to search database: Risks or gotchas?

Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

Re: Allowing regex entries in web form to search database: Risks or gotchas? by dave_the_m (Monsignor) on Aug 08, 2022 at 21:39 UTC
Perl's regex engine has evolved over 30+ years; it's huge and crusty, with large chunks nobody quite understands any more. There are many ways of writing a regex that will consume effectively infinite CPU unless the process is killed. Until recent Perl releases, there were many bugs in the regex compiler that would overflow integers and do strange things, e.g. in patterns like /((((foo){2000}){2000}){2000})/. And those are just the bugs we know about. So I wouldn't want to give the general public the ability to supply arbitrary patterns to a web server. Not all is lost, however. Perl allows other regex engines to be plugged in. In particular, the module re::engine::RE2 allows Perl to use Google's RE2 regex engine. This doesn't support as many features as the Perl engine, but in this case that's a plus. Dave.
Re^2: Allowing regex entries in web form to search database: Risks or gotchas? by Polyglot (Chaplain) on Aug 09, 2022 at 00:44 UTC
Dave, how much effect would limiting nested parentheses to two levels and {##} counts to two digits have on that CPU resource hogging? Would there be an effective way of mitigating this? This is the sort of helpful tip I'm looking for. It does little good to say ever so meaningfully: "You would be ill-advised to do this...." I'm looking for rational support for such a statement; as in, why is it inadvisable? Only once potential pitfalls are identified can one hope to address them. And I do hope to make things safer, albeit not completely foolproof. I'm reminded of a setting provided to server administrators in shorewall's firewall management tools... something like "ADMIN_IS_ABSENT_MINDED = 1". Hah! It was supposed to keep the current connection open in case of a firewall restart with ill-advised settings that might have inadvertently locked even the admin out! It's simply never possible to make something completely foolproof, and I don't intend to try. But I do want to make it, at the very least, secure from hacker penetration. CPU resources are one thing. Gaining server admin privileges through a security hole is another. Blessings, ~Polyglot~
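For concreteness, the two limits asked about above (at most two levels of nested parentheses, at most two digits in a {n,m} count) could be checked with something like the sketch below. The helper name `within_proposed_limits` is hypothetical and this is not code from the thread. It is deliberately crude: a parenthesis inside a character class such as `[(]` is counted as a group, so it errs on the strict side.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the two proposed limits to a user-supplied
# pattern string before it is ever compiled.
sub within_proposed_limits {
    my ($pattern) = @_;

    # Drop escaped characters so that \( and \{ are not counted.
    (my $bare = $pattern) =~ s/\\.//gs;

    # Limit 1: no more than two levels of nested parentheses.
    my ($depth, $max_depth) = (0, 0);
    for my $char (split //, $bare) {
        if ($char eq '(') {
            $depth++;
            $max_depth = $depth if $depth > $max_depth;
        }
        elsif ($char eq ')') {
            $depth-- if $depth > 0;
        }
    }
    return 0 if $max_depth > 2;

    # Limit 2: every count in {n}, {n,} or {n,m} has at most two digits.
    while ($bare =~ /\{(\d*)(?:,(\d*))?\}/g) {
        for my $count (grep { defined $_ } ($1, $2)) {
            return 0 if length($count) > 2;
        }
    }
    return 1;
}

for my $pattern ('((((foo){2000}){2000}){2000})', 'a{99}', '(a+)+$') {
    printf "%-32s %s\n", $pattern,
        within_proposed_limits($pattern) ? 'accepted' : 'rejected';
}
```

Note that `(a+)+$` is accepted: it has one level of nesting and no counted quantifier, yet nested quantifiers of this kind are the classic shape for heavy backtracking. So limits like these may reduce the problem but probably do not remove it.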
Re: Allowing regex entries in web form to search database: Risks or gotchas? by choroba (Cardinal) on Aug 08, 2022 at 18:27 UTC
A user can still sneak a wildcard in, e.g. `(?:.)` `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
Re^2: Allowing regex entries in web form to search database: Risks or gotchas? by Polyglot (Chaplain) on Aug 08, 2022 at 18:42 UTC
Yes, you're right. I'm thinking about whether to shore that up a little more or not. I actually needed a wildcard myself today, and snuck one in in my own way with: `\b.\b` The main reasons to guard against wildcards are simply to conserve server resources and to protect the client against a browser overload. Because there are multiple search fields that are interrelated (think of table joins), having a wildcard in one may actually be desirable so long as one of the correlated fields is sufficiently limiting. I could also address the issue, I suppose, by simply establishing a quota for max rows returned. I haven't really wanted to do that for several reasons, but, if server loading becomes problematic, that should solve it. Note, too, that those two rules are not the only ones in my list. I just picked them out as examples here. My main concern is server security. I would hate to inadvertently open myself up as a target for phishing websites, DoS attacks, etc. I've had to deal with such things in the past, but never yet from a Perl script--and I'd hate to see this change. PHP, now, I haven't much good to say about its security. Blessings, ~Polyglot~
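The "quota for max rows returned" idea mentioned above might look roughly like the following sketch. It assumes the rows are already in memory as plain strings; `search_with_quota` and its parameters are invented for illustration and are not Polyglot's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: collect at most $max_rows matching rows and report
# whether more matches existed beyond the quota.
sub search_with_quota {
    my ($rows, $regex, $max_rows) = @_;
    my @hits;
    for my $row (@$rows) {
        next unless $row =~ $regex;
        # Quota already filled and another row matches: stop scanning.
        return (\@hits, 1) if @hits >= $max_rows;
        push @hits, $row;
    }
    return (\@hits, 0);
}

my @rows = ('alpha', 'beta', 'gamma', 'delta');
my ($hits, $truncated) = search_with_quota(\@rows, qr/a/, 2);
print scalar(@$hits), " rows", ($truncated ? " (truncated)" : ""), "\n";
```

Stopping the scan as soon as the quota is exceeded caps the work done per request as well as the size of the page sent to the browser, which addresses both concerns at once.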
Re: Allowing regex entries in web form to search database: Risks or gotchas? by Jenda (Abbot) on Aug 08, 2022 at 19:12 UTC
What's the underlying engine and who evaluates the regexps? The engine? If so, you need to ask what's safe/unsafe for that engine, not for Perl. You should not look for dangerous stuff, you should check that you only got safe stuff! Jenda 1984 was supposed to be a warning, not a manual!
Re^2: Allowing regex entries in web form to search database: Risks or gotchas? by Polyglot (Chaplain) on Aug 09, 2022 at 00:20 UTC
Jenda, I'm not entirely sure what you mean by "underlying engine." My script does the evaluation--I'm not depending on any third-party tools. This has as much to do with the fact that I can rarely understand how to implement others' modules as anything. (Object-oriented code baffles me.) The regex evaluation is fairly simple, and meant to allow virtually any arbitrary expression, with a few important exceptions such as not allowing the user to insert executable code into it. Giving the user freedom to enter his or her own regular expression is what makes the feature so attractive and powerful. There is no other way to properly find certain things without a good regex, and it would be impossible to pre-supply all potential regex forms that might be needed. Users have several simple options at their disposal that do not require the evaluation of a regular expression. For example, they may select for case sensitivity, the matching of whole words (i.e. \bWord\b), or to enter their own word/text delimiters. But these options will be ignored if the user chooses to use his or her own regular expression--in which case the matching of whole words, etc., would be left entirely to the user's own regex. As for "You should not look for dangerous stuff, you should check you only got safe stuff!", how would you propose to divide between these two? What defines "safe"? As with anything on this planet, even the safest of things can be made to be harmful when placed in the wrong hands. Because people could drown in water is no reason to withhold it and cause them to die of thirst! Blessings, ~Polyglot~
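On the point about executable code: by default Perl refuses `(?{ ... })` and `(??{ ... })` code blocks in a pattern built from an interpolated variable, unless `use re 'eval'` is in scope. A minimal sketch of compiling user input with that default left in place follows. The helper name `compile_user_pattern` is hypothetical, and this only covers embedded code and syntax errors, not runaway CPU use.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a user-supplied pattern. Returns the compiled
# regex and undef on success, or undef and the error text on failure.
sub compile_user_pattern {
    my ($input) = @_;
    # No "use re 'eval'" here, so code blocks in $input make qr// die,
    # and the eval block turns that (and any syntax error) into a soft failure.
    my $regex = eval { qr/$input/ };
    return ($regex, undef) if defined $regex;
    return (undef, $@);
}

for my $input ('\bWord\b', '(unbalanced', '(?{ system("id") })') {
    my ($regex, $error) = compile_user_pattern($input);
    print "$input: ", (defined $regex ? "compiled" : "refused"), "\n";
}
```

Keeping the compile step inside `eval { }` also means a malformed pattern produces an error message for the user instead of killing the script.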
Re^3: Allowing regex entries in web form to search database: Risks or gotchas? by Jenda (Abbot) on Aug 10, 2022 at 20:32 UTC
You wrote "database" so I assumed there's a database engine, say PostgreSQL, and that's where you store the data. If that were so, you could either use the regexps provided by that database engine, use Perl within that engine, or fetch all the data to be searched and evaluate the expressions within the script. It's you who defines safe and you need to decide what's safe for each individual use. The point is that instead of `if ($input =~ /something I already know is dangerous/) { die 'I refuse to handle this!'; }` you should always write `if ($input !~ /^only stuff I know is fine$/) { die 'I refuse to handle this!'; }` I can't give you a generic "this is unsafe" or a generic "this is safe" not knowing what happens to the $input afterwards. It's something you have to do. The thing is that it's much easier to forget to list something that's dangerous than it is to accidentally allow something that's dangerous. Jenda 1984 was supposed to be a warning, not a manual!
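To make the allowlist idea above concrete, here is one possible sketch. The particular set of permitted characters and the 100-character cap are assumptions chosen purely for illustration; as Jenda says, what counts as safe depends on what happens to the input afterwards. The names `$ALLOWED` and `is_acceptable` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical allowlist: word characters, whitespace and a few regex
# metacharacters, 1 to 100 characters long. Parentheses and braces are not
# on the list, so groups, counted quantifiers and (?{ ... }) are all refused.
my $ALLOWED = qr/\A[\w\s.\\|?*+^\$\[\]-]{1,100}\z/;

sub is_acceptable {
    my ($input) = @_;
    return $input =~ $ALLOWED ? 1 : 0;
}

for my $input ('foo|bar', '\bWord\b', 'a{2000}', '(?{ die })') {
    print "$input: ", (is_acceptable($input) ? "accepted" : "refused"), "\n";
}
```

Even this list is not a guarantee: `*` and `+` are still permitted, so patterns that backtrack heavily may get through. The advantage of the approach is only that anything not explicitly thought about is refused by default.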
Re^4: Allowing regex entries in web form to search database: Risks or gotchas? by LanX (Saint) on Aug 10, 2022 at 22:48 UTC
Re^5: Allowing regex entries in web form to search database: Risks or gotchas? by Anonymous Monk on Aug 11, 2022 at 09:00 UTC
Re^4: Allowing regex entries in web form to search database: Risks or gotchas? by Polyglot (Chaplain) on Aug 11, 2022 at 02:39 UTC
Re: Allowing regex entries in web form to search database: Risks or gotchas? by LanX (Saint) on Aug 08, 2022 at 18:59 UTC
We use Text::Glob as the user API. Cheers Rolf (addicted to the Perl Programming Language :) Wikisyntax for the Monastery
Re^2: Allowing regex entries in web form to search database: Risks or gotchas? by Polyglot (Chaplain) on Aug 09, 2022 at 01:41 UTC
A few things jumped out at me in looking at the CPAN page you linked. The regex rules provided are not the same as those of Perl (e.g. swapping ? for . and , for |); the full ruleset is either not attested or is extremely limited/restricted; the module seems a little old (maybe predates some of the newer regex features); the word "separator" was consistently misspelled. Seeing as the syntactical changes are unnecessary, I would prefer to stick with a pure PCRE. Why use commas in place of pipes and question marks in place of periods when the functionality for these remains exactly the same? And what happened to the zero-or-one match capability provided by the question mark? What replaces that? Back to doing it my own way... TMTOWTDI doesn't mean I will like every other way of doing it. Blessings, ~Polyglot~
Re^3: Allowing regex entries in web form to search database: Risks or gotchas? by LanX (Saint) on Aug 09, 2022 at 10:10 UTC
> The word "separator" was consistently misspelled. Schocking! 🧐 Cheers Rolf (addicted to the Perl Programming Language :) Wikisyntax for the Monastery

