Re: utf8, locale and regexp

Re^2: utf8, locale and regexp by Anonymous Monk on Apr 10, 2007 at 15:12 UTC
The 'É' characters in my script are UTF-8 encoded.
Re^3: utf8, locale and regexp by ikegami (Patriarch) on Apr 10, 2007 at 15:32 UTC
There's indeed a problem.

    #!/usr/bin/perl
    use strict;
    use warnings;

    use encoding 'UTF-8';
    #use encoding 'utf8';
    #use utf8;

    use Devel::Peek qw( Dump );

    my $word = "État";  # UTF-8 encoding of "État"
    my $char = "É";     # UTF-8 encoding of "É"

    Dump($word);
    print("String Length: ", length($word), "\n");
    print("\n");

    Dump($char);
    print("String Length: ", length($char), "\n");
    print("\n");

    if ($word =~ /$char/) {
        print "Matches\n";
    } else {
        print "Does not match\n";
    }

    if ($word =~ /\Q$char/) {
        print "Matches\n";
    } else {
        print "Does not match\n";
    }

    if (substr($word, 0, 1) eq $char) {
        print "Equal\n";
    } else {
        print "Not equal\n";
    }

Output:

    SV = PV(0x22608c) at 0x225f9c
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)               <-- That's good
      PV = 0x1822634 "\303\211tat"\0 [UTF8 "\x{c9}tat"]   <-- That's good
      CUR = 5
      LEN = 8
    String Length: 4                                      <-- That's good

    SV = PV(0x2260a4) at 0x225f3c
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)               <-- That's good
      PV = 0x22f43c "\303\211"\0 [UTF8 "\x{c9}"]          <-- That's good
      CUR = 2
      LEN = 4
    String Length: 1                                      <-- That's good

    Does not match                                        <-- WTF?
    Does not match                                        <-- WTF?
    Equal                                                 <-- That's good

Replacing `use encoding 'UTF-8';` with `use encoding 'utf8';` yields the same results. Replacing `use encoding 'UTF-8';` with `use utf8;` produces the same dumps, but the matches succeed.

My suggestion: Use `use utf8;` to treat the source as UTF-8. Use `binmode(STDOUT, ":utf8");` to output UTF-8.
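For reference, the suggestion can be sketched as a standalone script. This is only a minimal sketch, assuming the file is saved as UTF-8. The `@results` array is introduced here for illustration; it simply collects the same three checks as the test script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                     # decode string literals in this file as UTF-8

binmode(STDOUT, ':utf8');     # encode characters as UTF-8 on output

my $word = "État";
my $char = "É";

# Same three checks as the test script, collected into a list
# (@results is a name made up for this sketch).
my @results = (
    ($word =~ /$char/)             ? 'Matches' : 'Does not match',
    ($word =~ /\Q$char/)           ? 'Matches' : 'Does not match',
    (substr($word, 0, 1) eq $char) ? 'Equal'   : 'Not equal',
);

print "$_\n" for @results;
print "String Length: ", length($word), "\n";
```

With `use utf8;` the literals are character strings from the start, so the regexp engine and `substr` see the same single character "É".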
Re^4: utf8, locale and regexp by Krambambuli (Curate) on Apr 11, 2007 at 08:41 UTC
Hi ikegami, your code seems to bring to light, rather clearly, a bug in the 'encoding' pragma. Have you tried submitting it to the author of the 'encoding.pm' module? That seems to be Dan Kogai; an e-mail address (I don't know whether it is still valid) is available on CPAN.
Re^4: utf8, locale and regexp by Anonymous Monk on Apr 10, 2007 at 16:47 UTC
Well... the thing is that I must keep `use encoding 'UTF-8'` with CGI and CGI::FastTemplate in order for my UTF-8 templates to be handled properly. Additionally, I need `use locale;` in order to use my locale's collation. For the UTF-8 templates I tried the following, without success: `use open ":utf8"; binmode(STDIN,":utf8"); binmode(STDOUT, ":utf8");` I suspect that the "automatic" string upgrade on concatenation, which is enabled by `use encoding 'utf8'`, is mandatory for the UTF-8 templates to work (I haven't checked its code, though).
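To illustrate what the upgrade on concatenation means without the pragma, here is a small sketch. It makes no claim about what CGI::FastTemplate or encoding.pm do internally. `$template_bytes` is a hypothetical stand-in for template text read without any decoding layer, and the explicit `decode` call is one possible alternative, not something tried in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical stand-in for undecoded template data:
# the two UTF-8 bytes of "É" followed by "tat".
my $template_bytes = "\xC3\x89tat";

# A character string; "\x{263A}" is above 0xFF, so Perl stores it
# internally as UTF-8 with the UTF8 flag set.
my $chars = "\x{263A} ";

# Without the encoding pragma, concatenation upgrades the byte string
# as if it were Latin-1: each of the two bytes becomes its own character.
my $mixed = $chars . $template_bytes;

# Decoding explicitly first keeps "É" as a single character.
my $clean = $chars . decode('UTF-8', $template_bytes);

printf "mixed length: %d\n", length($mixed);   # 2 + 5 characters
printf "clean length: %d\n", length($clean);   # 2 + 4 characters
```

The difference in length shows why mixing undecoded template bytes with character strings produces double-encoded output unless something decodes the bytes first.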
Re^5: utf8, locale and regexp by Joost (Canon) on Apr 10, 2007 at 19:48 UTC
Re^5: utf8, locale and regexp by ikegami (Patriarch) on Apr 10, 2007 at 17:08 UTC