zdzieblo has asked for the wisdom of the Perl Monks concerning the following question:

in my previous node

we decided that for the queries like this:

my $sql = "SELECT a FROM table WHERE (a ~* (E\'.*\' || ?))";

quotemeta is enough to stop users sneaking in reg expression into the queries. and it probably is.

unfortunately quotemeta also quotes unicode characters what makes the situation unacceptable in majority cases:

DBD::Pg::st execute failed: ERROR: invalid byte sequence for encoding + "UTF8": 0xd05c HINT: This error can also happen if the byte sequence does not match +the encoding expected by the server, which is controlled by "client_e +ncoding". at

thus i think we need to come up with a better resolution.. ;-)

thanks

Replies are listed 'Best First'.
Re: postgres reg expression quoting (again)
by almut (Canon) on Apr 06, 2009 at 19:45 UTC

    Are you sure your strings are text strings when you apply quotemeta?

    use Devel::Peek; use Encode '_utf8_off'; my $u = "abc\x{5555}abc"; my $qu = quotemeta $u; # text (UTF-8) Dump $u; Dump $qu; my $s = $u; _utf8_off($s); my $qs = quotemeta $s; # bytes Dump $s; Dump $qs; __END__ SV = PV(0x63d150) at 0x65eb90 REFCNT = 1 FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8) PV = 0x6550f0 "abc\345\225\225abc"\0 [UTF8 "abc\x{5555}abc"] CUR = 9 LEN = 16 SV = PV(0x63d120) at 0x65eb30 REFCNT = 1 FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8) PV = 0x662e60 "abc\345\225\225abc"\0 [UTF8 "abc\x{5555}abc"] CUR = 9 LEN = 16 SV = PVMG(0x697c58) at 0x65eb10 REFCNT = 1 FLAGS = (PADBUSY,PADMY,POK,pPOK) IV = 0 NV = 0 PV = 0x6aa700 "abc\345\225\225abc"\0 CUR = 9 LEN = 16 SV = PV(0x63d120) at 0x65eb00 REFCNT = 1 FLAGS = (PADBUSY,PADMY,POK,pPOK) PV = 0x66a030 "abc\\\345\\\225\\\225abc"\0 CUR = 12 LEN = 16

    As you can see, no quoting is being done except when the string is treated as bytes.

Re: postgres reg expression quoting (again)
by ikegami (Patriarch) on Apr 06, 2009 at 19:50 UTC
    Just like you were treating plain text as a regexp pattern, you are treating binary data (encoded text) as text. Decode it first. You can use utf8::decode for UTF-8 or the more general Encode::decode.

    utf8
    Encode

    Update: For example,

    use Data::Dumper; $Data::Dumper::Useqq = 1; $Data::Dumper::Terse = 1; $Data::Dumper::Indent = 0; my $s = "\x{E2}\x{99}\x{A0}"; # These are assumed to be print(Dumper(quotemeta($s)), "\n"); # iso-latin-1 by Perl utf8::decode($s); # ...until you decode them print(Dumper(quotemeta($s)), "\n");
    "\\\342\\\231\\\240" "\x{2660}"

      right, in that case it looks like the best bet is to use Encode::decode_utf8() over utf8::decode() as it will also put the 0xFFFD in place of the malformed characters eliminating peril of being fed with random data.

      thanks!