Re^3: utf8, locale and regexp

There's indeed a problem.

#!/usr/bin/perl

use strict;
use warnings;

use encoding 'UTF-8';
#use encoding 'utf8';
#use utf8;

use Devel::Peek qw( Dump );

my $word = "État";  # UTF-8 encoding of "État"
my $char = "É";     # UTF-8 encoding of "É"

Dump($word);
print("String Length: ", length($word), "\n");
print("\n");

Dump($char);
print("String Length: ", length($char), "\n");
print("\n");

if ($word =~ /$char/) {
   print "Matches\n";
} else {
   print "Does not match\n";
}

if ($word =~ /\Q$char/) {
   print "Matches\n";
} else {
   print "Does not match\n";
}

if (substr($word, 0, 1) eq $char) {
   print "Equal\n";
} else {
   print "Not equal\n";
}

SV = PV(0x22608c) at 0x225f9c
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)               <-- That's good
  PV = 0x1822634 "\303\211tat"\0 [UTF8 "\x{c9}tat"]   <-- That's good
  CUR = 5
  LEN = 8
String Length: 4                                      <-- That's good

SV = PV(0x2260a4) at 0x225f3c
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)               <-- That's good
  PV = 0x22f43c "\303\211"\0 [UTF8 "\x{c9}"]          <-- That's good
  CUR = 2
  LEN = 4
String Length: 1                                      <-- That's good

Does not match                                        <-- WTF?
Does not match                                        <-- WTF?
Equal                                                 <-- That's good

Replacing `use encoding 'UTF-8';` with `use encoding 'utf8';` yields the same results.

Replacing `use encoding 'UTF-8';` with `use utf8;` produces the same dumps, but the matches succeed.

My suggestion:

Use `use utf8;` to treat the source as UTF-8.
Use `binmode(STDOUT, ":utf8");` to output UTF-8.
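The two suggestions above can be combined into a small script. This is only a minimal sketch: it assumes the file itself is saved as UTF-8, and `contains_char` is a hypothetical helper introduced here for illustration, not something from the original test script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # string literals in this file are decoded as UTF-8

binmode(STDOUT, ":utf8");      # characters are encoded as UTF-8 on output

# Hypothetical helper (not from the original post):
# returns 1 when $word contains $char as a literal substring, 0 otherwise.
sub contains_char {
    my ($word, $char) = @_;
    return $word =~ /\Q$char/ ? 1 : 0;
}

my $word = "État";
my $char = "É";

print "String Length: ", length($word), "\n";
print contains_char($word, $char) ? "Matches\n" : "Does not match\n";
```

The `\Q` in the pattern only guards against `$char` containing regex metacharacters; it is not what makes the match succeed.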

Re^4: utf8, locale and regexp by Krambambuli (Curate) on Apr 11, 2007 at 08:41 UTC
Hi Ikegami, your code seems to bring a bug in the 'encoding' pragma to light rather clearly. Have you tried reporting it to the author of the 'encoding.pm' module? That seems to be Dan Kogai; an e-mail address (I don't know whether it is valid) is available on CPAN.
Re^4: utf8, locale and regexp by Anonymous Monk on Apr 10, 2007 at 16:47 UTC
Well... the thing is that I must keep `use encoding 'UTF-8'` with CGI and CGI::FastTemplate so that my UTF-8 templates are handled properly. Additionally, I need `use locale;` in order to use my locale's collation. For the UTF-8 templates I tried the following, without success: `use open ":utf8"; binmode(STDIN,":utf8"); binmode(STDOUT, ":utf8");` I suspect that the "automatic" string upgrade on concatenation, which is enabled by `use encoding 'utf8'`, is mandatory for UTF-8 templates to work (I haven't checked its code, though).
Re^5: utf8, locale and regexp by Joost (Canon) on Apr 10, 2007 at 19:48 UTC
The only reason you need `use encoding` is that CGI::FastTemplate doesn't do what it should with the UTF-8 encoded templates. If you want to use multi-byte encoded text files (and this includes templates), the files should be read through the correct IO layer, which you can set via binmode.

The only reason `use utf8` / `use encoding "utf8"` fixes the problem is that eval STRING will then assume any string passed to it is UTF-8 encoded, even if the string being eval()d isn't marked as UTF-8. That is arguably not even correct behaviour; I would definitely argue it's a bug.

#!/usr/bin/perl -w
use strict;
use utf8;

my $str1 = <DATA>;                  # not utf8
my $str2 = eval "'".<DATA>."'";     # utf8

no utf8;

my $str3 = <DATA>;                  # not utf8
my $str4 = eval "'".<DATA>."'";     # not utf8

binmode(DATA,":utf8");              # THIS is what you should do.
my $str5 = <DATA>;                  # utf8

print "str1 is",utf8::is_utf8($str1) ? "" :" not"," utf8\n";
print "str2 is",utf8::is_utf8($str2) ? "" :" not"," utf8\n";
print "str3 is",utf8::is_utf8($str3) ? "" :" not"," utf8\n";
print "str4 is",utf8::is_utf8($str4) ? "" :" not"," utf8\n";
print "str5 is",utf8::is_utf8($str5) ? "" :" not"," utf8\n";

__DATA__
État
État
État
État
État

Update: all of the above is interesting, but not correct, since the CGI::FastTemplate documentation claims it doesn't use eval().

Update 2: I still think there's NEVER any reason to use both `use utf8` and `use encoding "utf8"` at the same time. They are more or less equivalent anyway. Please try using one or the other and see if that fixes your problem. Using both does cause problems, as you can see.

Also, if at all possible, the best solution would be to patch CGI::FastTemplate to open the template files with the correct IO layer instead of relying on this ugly hack. Maybe you can open the template files yourself, set the binmode and pass the filehandle or file content to CGI::FastTemplate yourself? That might also fix the issue.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
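The idea of opening the template yourself with the right IO layer could look roughly like the sketch below. `slurp_utf8` and the file name in the usage comment are hypothetical; how the decoded text is then handed to CGI::FastTemplate is left open, since this thread does not show which of its methods accept content directly.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file and decode it from UTF-8,
# so the returned string holds characters rather than raw bytes.
sub slurp_utf8 {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!";
    local $/;                  # slurp mode: read the file in one go
    my $content = <$fh>;
    close($fh);
    return $content;
}

# Usage (the file name is an assumption for illustration):
# my $template_text = slurp_utf8('page.tpl');
```

Setting the layer in the `open` call has the same effect as calling `binmode($fh, ':encoding(UTF-8)')` right after a plain open.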
Re^5: utf8, locale and regexp by ikegami (Patriarch) on Apr 10, 2007 at 17:08 UTC
> I tried unsuccessfully the following: `use open ":utf8"; binmode(STDIN,":utf8"); binmode(STDOUT, ":utf8");`

Well, that's another problem. Maybe you should start a new thread instead of leaving it buried this deep in this thread?

> I suspect that the "automatic" string upgrade when concatenation occurs which is enabled with `use encoding 'utf8'` is mandatory for utf8 templates to work (haven't checked its code though).

"which is enabled with `use encoding 'utf8'`" is untrue. `use encoding` has nothing to do with it.

use Encode qw( is_utf8 );

my $s1 = "abc";
my $s2 = chr(0x2660);
my $s  = $s1 . $s2;

print("s1: ", is_utf8($s1)?1:0, "\n");  # 0
print("s2: ", is_utf8($s2)?1:0, "\n");  # 1
print("s:  ", is_utf8($s )?1:0, "\n");  # 1

> additionally I need `use locale;` in order to use my locale collation.

I only removed it because it was irrelevant to the question I was addressing. Feel free to re-add it.