postgres reg expression quoting (again)

zdzieblo has asked for the wisdom of the Perl Monks concerning the following question:

in my previous node

we decided that for the queries like this:

my $sql = "SELECT a FROM table WHERE (a ~* (E\'.*\' || ?))";

quotemeta is enough to stop users sneaking in reg expression into the queries. and it probably is.

unfortunately quotemeta also quotes unicode characters what makes the situation unacceptable in majority cases:

DBD::Pg::st execute failed: ERROR:  invalid byte sequence for encoding
+ "UTF8": 0xd05c
HINT:  This error can also happen if the byte sequence does not match 
+the encoding expected by the server, which is controlled by "client_e
+ncoding". at
[download]

thus i think we need to come up with a better resolution.. ;-)

thanks

Comment on postgres reg expression quoting (again) Select or Download Code

Replies are listed 'Best First'.
Re: postgres reg expression quoting (again) by almut (Canon) on Apr 06, 2009 at 19:45 UTC
Are you sure your strings are text strings when you apply `quotemeta`? use Devel::Peek; use Encode '_utf8_off'; my $u = "abc\x{5555}abc"; my $qu = quotemeta $u; # text (UTF-8) Dump $u; Dump $qu; my $s = $u; _utf8_off($s); my $qs = quotemeta $s; # bytes Dump $s; Dump $qs; __END__ SV = PV(0x63d150) at 0x65eb90 REFCNT = 1 FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8) PV = 0x6550f0 "abc\345\225\225abc"\0 [UTF8 "abc\x{5555}abc"] CUR = 9 LEN = 16 SV = PV(0x63d120) at 0x65eb30 REFCNT = 1 FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8) PV = 0x662e60 "abc\345\225\225abc"\0 [UTF8 "abc\x{5555}abc"] CUR = 9 LEN = 16 SV = PVMG(0x697c58) at 0x65eb10 REFCNT = 1 FLAGS = (PADBUSY,PADMY,POK,pPOK) IV = 0 NV = 0 PV = 0x6aa700 "abc\345\225\225abc"\0 CUR = 9 LEN = 16 SV = PV(0x63d120) at 0x65eb00 REFCNT = 1 FLAGS = (PADBUSY,PADMY,POK,pPOK) PV = 0x66a030 "abc\\\345\\\225\\\225abc"\0 CUR = 12 LEN = 16 [download] As you can see, no quoting is being done except when the string is treated as bytes.	[reply] [d/l] [select]
Re: postgres reg expression quoting (again) by ikegami (Patriarch) on Apr 06, 2009 at 19:50 UTC
Just like you were treating plain text as a regexp pattern, you are treating binary data (encoded text) as text. Decode it first. You can use utf8::decode for UTF-8 or the more general Encode::decode. utf8 Encode Update: For example, `use Data::Dumper; $Data::Dumper::Useqq = 1; $Data::Dumper::Terse = 1; $Data::Dumper::Indent = 0; my $s = "\x{E2}\x{99}\x{A0}"; # These are assumed to be print(Dumper(quotemeta($s)), "\n"); # iso-latin-1 by Perl utf8::decode($s); # ...until you decode them print(Dumper(quotemeta($s)), "\n");` [download] `"\\\342\\\231\\\240" "\x{2660}"` [download]	[reply] [d/l] [select]
Re^2: postgres reg expression quoting (again) by zdzieblo (Acolyte) on Apr 06, 2009 at 23:45 UTC
right, in that case it looks like the best bet is to use Encode::decode_utf8() over ~~utf8::decode()~~ as it will also put the 0xFFFD in place of the malformed characters eliminating peril of being fed with random data. thanks!	[reply]