Re^2: Bug in Template?

Re^3: Bug in Template? by remiah (Hermit) on Mar 22, 2012 at 03:54 UTC
This does not seem to be a problem with Template. I would also like advice on this.

The é in “Séan” is probably U+00E9 in the Unicode table at http://www.utf8-chartable.de/unicode-utf8-table.pl. I thought that decoding it to Perl's internal UTF-8 and passing it to Template, with the output encoded as UTF-8, would work. But it does not work. Even without Template, there is strange behavior.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(is_utf8 encode decode);
    use Template;

    my (@raw, @decoded_internal_utf8, @encoded_raw_utf8, @encoded_internal_utf8);
    my @chars = hex('00C0') .. hex('00F0');    # target characters
    #my @chars = hex('3041') .. hex('3096');   # hiragana

    foreach my $code (@chars) {
        my ($raw, $chr);
        $raw = chr($code);
        if ( is_utf8($raw) ) {
            $chr = $raw;
        }
        else {
            $chr = decode('utf8', $raw);
        }
        push @raw,                   $raw;
        push @decoded_internal_utf8, $chr;
        push @encoded_raw_utf8,      encode('utf8', $raw);
        push @encoded_internal_utf8, encode('utf8', $chr);
    }

    print "======================\n";
    print "perl=$^X : version=$]\n";
    print "1.###raw\n";
    print "#$_#\n" for @raw;
    print "2.###decoded_internal_utf8\n";
    #print "#$_#\n" for @decoded_internal_utf8;
    print "3.###encoded_raw_utf8\n";
    print "#$_#\n" for @encoded_raw_utf8;
    print "4.###encoded_internal_utf8\n";
    print "#$_#\n" for @encoded_internal_utf8;

Strangely, only No. 3 works in this case. I usually print characters with No. 4. Japanese characters such as hiragana seem to have no problem (for example, '3041' .. '3096').

I saw a similar problem in "Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma?". At that time I didn't understand it well and thought a newer version would have no problem... Is this the same trouble? I tried with 5.012002 and 5.014002. They print exactly the same output except for the version number.
Re^4: Bug in Template? by Anonymous Monk on Mar 22, 2012 at 08:27 UTC
I'm confused by your code. What is it supposed to demonstrate? perlunitut: Unicode in Perl warns against using is_utf8, so I wouldn't use it.

Consider

    $ perl -le " print chr hex q/C0/ " | od -tx1
    0000000 c0 0d 0a
    0000003

When viewed as Windows-1252 it is À

And this

    $ perl -le " binmode STDOUT , q/:utf8/; print chr hex q/C0/ " | od -tx1
    0000000 c3 80 0d 0a
    0000004

When viewed as Windows-1252 it is Ã€ but viewed as UTF-8 it is À

And this

    $ perl -MEncode -le " print decode(q/utf8/, chr hex q/C0/ )" | od -tx1
    Wide character in print at -e line 1.
    0000000 ef bf bd 0d 0a
    0000005

When viewed as Windows-1252 it is ï¿½ but viewed as UTF-8 it is �

If you search for ef bf bd you'll see lots of questions about this erroneous conversion.

So if you want to treat chr 192 ( `perl -le " print hex q/C0/ "` ) as Unicode you have to encode it. Characters 0 to 255 are also valid Latin-1, and Perl prints them as single Latin-1 bytes; they are not UTF-8.

    $ perl -le " print chr hex q/C0/ " | od -tx1
    0000000 c0 0d 0a
    0000003
    $ perl -le " print chr 255 " | od -tx1
    0000000 ff 0d 0a
    0000003
    $ perl -le " print chr 256 " | od -tx1
    Wide character in print at -e line 1.
    0000000 c4 80 0d 0a
    0000004

Or, if you want chr 192 to return Unicode, use the encoding pragma (the utf8 pragma doesn't affect chr)

    $ perl -le " use encoding q/utf8/; print chr 192 " | od -tx1
    0000000 c3 80 0a
    0000003
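The byte dumps above can also be reproduced inside a script, without od. This is only a minimal sketch, assuming nothing beyond the core Encode module; the helper name `hex_bytes` is made up for illustration and is not part of any library.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show each byte of a byte string as two hex digits.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', unpack('(H2)*', $bytes);
}

my $char = chr(0xC0);    # one character, code point U+00C0

# Encoding the character string yields the two-byte UTF-8 sequence.
my $utf8 = encode('UTF-8', $char);
print hex_bytes($utf8), "\n";                        # c3 80

# Decoding the same string as if it were UTF-8 bytes goes wrong:
# a lone byte 0xC0 is not valid UTF-8, so Encode substitutes U+FFFD.
my $wrong = decode('UTF-8', $char);
printf "U+%04X\n", ord $wrong;                       # U+FFFD
print hex_bytes(encode('UTF-8', $wrong)), "\n";      # ef bf bd
```

The last line is the same ef bf bd sequence seen in the od output: the damage happens at the decode step, and the later encode just writes the replacement character out faithfully.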
Re^5: Bug in Template? by Anonymous Monk on Mar 22, 2012 at 08:33 UTC
You could also use utf8::encode

    $ perl -le " $f = chr hex q/C0/; utf8::encode( $f ); print $f " | od -tx1
    0000000 c3 80 0d 0a
    0000004

As http://www.utf8-chartable.de/unicode-utf8-table.pl?start=192 shows, Unicode code point U+00C0 encoded as UTF-8 is c3 80
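One difference worth keeping in mind is that utf8::encode changes its argument in place, while Encode's encode returns a new string. A small sketch of that, assuming a stock Perl with the core Encode module; the variable names are only illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $in_place = chr(0xC0);    # character string, length 1
utf8::encode($in_place);     # now a byte string; the variable itself changed

my $copy = encode('UTF-8', chr(0xC0));    # Encode::encode returns a new byte string

printf "%d bytes: %s\n", length($in_place), join ' ', unpack('(H2)*', $in_place);
print $in_place eq $copy ? "same bytes\n" : "different bytes\n";

# utf8::decode reverses the step in place and returns true on success.
my $ok = utf8::decode($in_place);
printf "decoded ok=%d, U+%04X\n", $ok ? 1 : 0, ord $in_place;
```

Either route gives the same two bytes, so the choice is mostly about whether you want to keep the original character string around.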
Re^5: Bug in Template? by remiah (Hermit) on Mar 22, 2012 at 10:58 UTC
Thanks for the reply. I will read perlunitut. When I googled "ef bf bd" I also found sites that explain Unicode in Perl precisely; I am printing them now...

When a character comes from outside Perl, we have to decode the bytes to Perl's internal UTF-8, as perlunitut says, especially when you want to know the length in characters. For example, CGI's param() returns bytes, and when I want to know the length of a word, I decode it.

My question in short: here come two characters, '00E9' and '3041'. They must be two characters in UTF-8. How do you take the second character with substr and print it?

I agree my example is clumsy. Is this clear? I guess this is the OP's problem.
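One possible way to do what the question asks, following the decode-on-input, encode-on-output advice from perlunitut. This is a sketch under assumptions: the input is simulated with a hard-coded UTF-8 byte string standing in for something like a CGI parameter, and the output is assumed to go to a UTF-8 terminal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Simulated input from outside Perl: U+00E9 then U+3041, as UTF-8 bytes.
my $bytes = "\xC3\xA9\xE3\x81\x81";

# Decode once at the boundary; after this, length and substr count characters.
my $text = decode('UTF-8', $bytes);
printf "bytes=%d characters=%d\n", length($bytes), length($text);

# The second character, not the second byte.
my $second = substr($text, 1, 1);
printf "U+%04X\n", ord $second;

# Encode once on the way out by putting an encoding layer on the handle.
binmode STDOUT, ':encoding(UTF-8)';
print $second, "\n";
```

Calling substr on `$bytes` instead would return the single byte 0xA9, which is only the second half of é, so the decode step is what makes the character offsets meaningful.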