Regexp and metacharacters

Largins has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regexp and metacharacters by wrog (Friar) on Dec 27, 2011 at 00:06 UTC
If the point is to quote a string for use within a regexp so that metacharacters within the string are interpreted literally, you probably want `\Q` and `\E`, which automatically take care of everything that is treated specially in regexps.

However, it also looks like you're trying to simultaneously parse a string from some kind of quoted format and then make a regexp out of it, which is going to get massively confusing if, as usually happens in quoted string formats, some characters start out backslash-escaped. For example, in most (but not all! cf. DOS command lines or Visual Basic...) double-quoted string formats, the double-quote (") character itself will be escaped, in which case you don't want to be escaping it again.

The best way to preserve your sanity is to treat the unescaping from double-quoted format and the re-escaping for use within a regexp as two separate operations. I.e., strip the outer quotes and unescape everything inside that needs to be unescaped (this gets you to the actual raw string), then worry about getting it into whatever regexp you're using to match with. (Yes, this is slightly less efficient; but get it right first, then worry about optimizing...)

For the unescaping operation you want something like `die "not quoted?" unless $ptext =~ m/\A(["'])(.*)\1\z/s; $praw = $2; $praw =~ s/\\(.)/$1/sg` (the trailing `s` so that newlines won't be given special treatment, if these are strings that can contain newlines). Again, this depends heavily on which quoted format you're parsing from: if C-style escapes like `\n` or `\t` are allowed, or if you have quote-doubling as in DOS/VB strings, then things get More Interesting.

Once you've got the actual `$praw` you can do `$pregexp = qr/\Q$praw\E/;` plus whatever other junk you want in the regexp.
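The two separate operations described above can be sketched roughly as follows. This is only an illustration under the same assumption as the post (a backslash simply makes the next character literal); the sub names `unquote_string` and `literal_regexp` and the sample strings are made up for the example.

```perl
use strict;
use warnings;

# Step 1: strip matching outer quotes and undo backslash escapes.
# Assumes a simple format where a backslash makes the next character
# literal; C-style escapes (\n, \t) and quote-doubling are NOT handled.
sub unquote_string {
    my ($text) = @_;
    $text =~ m/\A(["'])(.*)\1\z/s
        or die "not quoted?";
    my $raw = $2;
    $raw =~ s/\\(.)/$1/sg;
    return $raw;
}

# Step 2: build a regexp that matches the raw string literally.
sub literal_regexp {
    my ($raw) = @_;
    return qr/\Q$raw\E/;
}

# Hypothetical sample input: a double-quoted string with escaped quotes.
my $raw = unquote_string(q{"say \"a+b\" {twice}"});
my $re  = literal_regexp($raw);
print "raw: $raw\n";
print "matched\n" if 'he will say "a+b" {twice} now' =~ $re;
```

Keeping the steps apart means the unquoting sub can be swapped for one that understands a different quoted format without touching the regexp-building step.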
Re^2: Regexp and metacharacters by Largins (Acolyte) on Dec 27, 2011 at 01:14 UTC
Hello, I have random data coming in from many XML files, which are being parsed and stuffed into a hash. After an entire document has been read in, the desired items are extracted from the hash by key and then inserted into a database. Since I don't know what I'm getting beforehand, I want to make sure Perl doesn't interpret metacharacters in any way other than as literals. I was using dbh_quote with good results, but got random burps, sometimes after 3000 files, so I went to a regexp instead of dbh_quote. My reasoning, perhaps flawed, for wanting a single line of code (or as few as possible) was that it would run faster, with maybe fewer passes per item. Largins
Re^3: Regexp and metacharacters by wrog (Friar) on Dec 27, 2011 at 03:11 UTC
"I have random data coming in from many XML files which are being parsed and stuffed into a hash"

This vastly narrows things down, thanks (and also makes clear that `quotemeta`/`\Q...\E` is not what you should be doing...).

Much still depends on where your quoted strings are really coming from, and, at this point, I see two possibilities (yell if it's not one of these):

(1) Element body, e.g., `<foo ... >"Hi mom!"</foo>`, which seems really unlikely to me, since in this context there's really no need to be quoting the string at all; there are already opening/closing tags (i.e., `<foo>` and `</foo>`) to bound things. But, on the off-chance that this is indeed the case for the XML documents you're getting, you would then need to consult the documentation for the particular XML Schema being used to find out what the actual format is for the text in `<foo>` bodies, so that you know what sorts of escapes can possibly occur, because if you don't handle them correctly, you will be corrupting your data. And since XML allows people to do anything they want in element bodies, there isn't any alternative here; you have to find out what the schema expects/allows.

(2) Attribute values, e.g., `<foo attr="Hi mom!" .../>`. In this case, your XML parser should already be providing the strings to you in unquoted form with all escapes/character entities properly resolved, meaning that if the incoming XML file has, e.g., `<foo attr="Jack &amp; Jill" ... />`, the parser should be returning this string to you as `"Jack & Jill"`, by which I mean an eleven-character string whose first character is `'J'`, whose sixth character is an actual honest-to-god ASCII-38 ampersand, and whose last character is `'l'`, with no double-quotes anywhere in sight. If your XML parser is not doing this for you, then your XML parser sucks and needs to be replaced; there are lots of choices out there.
(...and sorry if you were rolling your own; this is fine for learning how XML works, but if you just want things to work and not have to deal with all of the weird, stupid crap that can come up in XML files (do you handle CDATA properly? what about wonky character encodings?), you need something that's been combat-tested...). CPAN has lots of good XML parsers with pretty much every parsing interface you could ever want; download one and save yourself a lot of grief.

As for getting stuff into databases, you were right the first time: `$dbh->quote` is the right way to insert an arbitrary string value into an SQL statement. That regexp quoting happened to work for you is more a matter of luck, in that both Perl regexps and (your database's version of) SQL (apparently) use backslashes in the same way (most of the time, except for those cases where they don't, which you won't find out about until stuff breaks...).

But actually a better way to do this is to use parameterized queries, if you can. For example, instead of `$dbh->do('INSERT INTO mytab VALUES('. $dbh->quote($value) .', ...);');` do `$dbh->do('INSERT INTO mytab VALUES (?, ...);', {}, $value, ...);` Granted, you'll need to check what format for parameter placeholders your driver will accept ("?" is supposed to be universal, except where it isn't. I believe MSFT uses something else, but I forget...). And if your driver does not grok parameters, you may be able to choose another one that does (e.g., there may be an ODBC-based driver for your database...).

While you didn't say what sorts of things your current setup was burping on, and while my current bet would be on the homegrown XML parser screwing up character entities or CDATA stuff, there is also the (perhaps remote, but maybe not) possibility that `$dbh->quote` isn't doing quite the right thing for your database's version of SQL. 
The point being that parameterized queries leave it up to the database driver to implement the quoting, and since the driver is specific to your database, it's a lot more likely to get the quoting right (i.e., if there are any dangerous corners in your database's version of SQL that DBI.pm doesn't know about). Bottom line here being that if you (1) use a proper XML parser and (2) have a reasonable database driver, you should not be having to do any quote-stripping or escaping/unescaping at all.
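A minimal sketch of the parameterized-query approach described above, using DBI's `prepare` and `execute`. The table name `mytab` comes from the post; the column names `name` and `value`, the sub name `insert_items`, and the idea of inserting one row per hash key are assumptions made only for illustration.

```perl
use strict;
use warnings;

# Insert every key/value pair of a hash via one parameterized statement.
# $dbh is assumed to be an already connected DBI database handle.
# Column names "name" and "value" are hypothetical.
sub insert_items {
    my ($dbh, $items) = @_;
    my $sth = $dbh->prepare('INSERT INTO mytab (name, value) VALUES (?, ?)');
    for my $key (sort keys %$items) {
        # Values travel separately from the SQL text, so quotes,
        # backslashes and braces in the data need no escaping here.
        $sth->execute($key, $items->{$key});
    }
    return scalar keys %$items;
}
```

With a handle obtained from something like `DBI->connect($dsn, $user, $password, { RaiseError => 1 })`, this would be called as `insert_items($dbh, \%parsed)`, where `%parsed` stands for the hash filled by the XML parser. Preparing once and executing per row also avoids rebuilding the SQL string for every item.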
Re: Regexp and metacharacters by AnomalousMonk (Archbishop) on Dec 26, 2011 at 23:15 UTC
Don't know why a single statement is preferable, but this may do the trick: `>perl -wMstrict -le "my $s = q{' * + [] {} '}; print qq{[[$s]]}; ;; $s =~ s{ (?: \A ['\"])? (.?) (?: [\"'] \z)? }{\Q$1}xmsg; print qq{[[$s]]}; "` Output: `[[' * + [] {} ']] [[\ \*\ \+\ \[\]\ \{\}\ ]]` Update: Got rid of the /e regex modifier and `quotemeta` call, used the `\Q` string modifier instead; removed some runaway backslashes. Update: `s{ \A ['"]? (.*?) ["']? \z }{\Q$1}xms` is simpler and makes the /g modifier unnecessary (after jwkrahn).
Re: Regexp and metacharacters by TJPride (Pilgrim) on Dec 27, 2011 at 00:23 UTC
Well, you could do something like this: `use strict; use warnings; while (<DATA>) { chomp; s/^['"]+|['"]+$//g; # Remove starting and ending quotes s/(['"{}])/\\$1/g; # Escape appropriate characters print "$_\n"; } __DATA__ "text with quotes" If it ain't broke, don't fix it. We all love {curly brackets}` However, you haven't defined what you mean by "metacharacters" or why exactly you're trying to do this, so I can't really supply a comprehensive solution. The "why" is especially important, since what people ask for often isn't what's actually needed.
Re^2: Regexp and metacharacters by Largins (Acolyte) on Dec 27, 2011 at 01:17 UTC
I like the simplicity of this, and will give it a go.
Re: Regexp and metacharacters by johngg (Canon) on Dec 27, 2011 at 00:26 UTC
Multiple stages might be easier to understand, but you could reduce the steps using character classes and `quotemeta`. The following uses one statement, but the `map` inside the quote construct really means there are still two stages being used. `use strict; use warnings; use 5.010; my $ptext = q{"dasj{ah'h'w*jh}wcv'}; say $ptext; $ptext = quotemeta qq{@{ [ map { s{(?x) (?: ^ ["'] | ["'] $ ) }{}g; $_; } $ptext ] }}; say $ptext;` The output: `"dasj{ah'h'w*jh}wcv' dasj\{ah\'h\'w\*jh\}wcv` I hope this is helpful. Cheers, JohnGG
Re^2: Regexp and metacharacters by Largins (Acolyte) on Dec 27, 2011 at 01:21 UTC
I like the looks of this, and of the previous response as well. I will try both and report back after attempting to run, say, 10000 files or so.
Re: Regexp and metacharacters by ikegami (Patriarch) on Dec 27, 2011 at 01:35 UTC
Turning your code into one statement is such a weird requirement, but it's easy to do: `my $fixed = $ptext =~ s/^\"//r =~ s/\"$//r =~ s/^\'//r =~ s/\'$//r =~ s/\'/\\\'/gr =~ s/\"/\\\"/gr =~ s/\{/\\\{/gr;` It's not how I'd do it, though.
Re^2: Regexp and metacharacters by Largins (Acolyte) on Dec 27, 2011 at 03:32 UTC
Ok, so how would you do it?
Re^3: Regexp and metacharacters by ikegami (Patriarch) on Dec 27, 2011 at 06:28 UTC
I would go for readability instead. `my $fixed = $ptext; $fixed =~ s/^['"]//; $fixed =~ s/['"]\z//; $fixed = quotemeta($fixed);` That's probably useless, since your list of metacharacters differs from `quotemeta`'s. If so, then you can use: `my $fixed = $ptext; $fixed =~ s/^['"]//; $fixed =~ s/['"]\z//; $fixed =~ s/([\\'"...])/\\$1/g;`
Re: Regexp and metacharacters by jwkrahn (Abbot) on Dec 27, 2011 at 03:22 UTC
`$ptext =~ s/^(['"]?)(.*)\1$/\Q$2/s;`
Re: Regexp and metacharacters by Anonymous Monk on Dec 27, 2011 at 04:10 UTC
Unless there were some massive justification for doing it all in one statement, I would not. The odds are simply too great that, in the not so very distant future, you are going to need to change some small thing, and you may rue your cleverness then.
Re^2: Regexp and metacharacters by Sewi (Friar) on Dec 27, 2011 at 21:09 UTC
Complex single-line regexps are often slower than several short, simple ones. The CPAN Benchmark module, or any benchmark of your own, should be able to tell you the situation for your case.
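As a rough sketch of such a comparison, the Benchmark module's `cmpthese` can time a one-statement variant against a multi-statement one. The two subs below are adapted from jwkrahn's and ikegami's posts in this thread; the sub names, the sample string, and the iteration count are arbitrary choices for the example, and real timings depend on your data and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Variant A: a single substitution (after jwkrahn's one-liner).
sub one_statement {
    my ($text) = @_;
    $text =~ s/^(['"]?)(.*)\1$/\Q$2/s;
    return $text;
}

# Variant B: separate, simpler steps (after ikegami's readable version).
sub several_statements {
    my ($text) = @_;
    $text =~ s/^['"]//;
    $text =~ s/['"]\z//;
    return quotemeta($text);
}

# Hypothetical sample input; use strings from your own files instead.
my $sample = q{"text with {curly} brackets"};

# Run each variant a fixed number of times and print a comparison table.
cmpthese(500_000, {
    one_statement      => sub { one_statement($sample) },
    several_statements => sub { several_statements($sample) },
});
```

Note that the two variants only agree when the outer quotes match; for input with mismatched quotes they behave differently, so check correctness on real data before comparing speed.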