in reply to Re^3: Bug in Template?
in thread Bug in Template?

I'm confused by your code; what is it supposed to demonstrate? perlunitut (Unicode in Perl) warns against using is_utf8, so I wouldn't use it.

Consider

$ perl -le " print chr hex q/C0/ " | od -tx1
0000000 c0 0d 0a
0000003
when viewed as Windows-1252 it is À

And this

$ perl -le " binmode STDOUT , q/:utf8/; print chr hex q/C0/ " | od -tx1
0000000 c3 80 0d 0a
0000004
when viewed as Windows-1252 it is Ã€ but viewed as UTF-8 it is À

And this

$ perl -MEncode -le " print decode(q/utf8/, chr hex q/C0/ )" | od -tx1
Wide character in print at -e line 1.
0000000 ef bf bd 0d 0a
0000005
when viewed as Windows-1252 it is ï¿½ but viewed as UTF-8 it is � (U+FFFD, the replacement character)

If you search for "ef bf bd" you'll see lots of questions about this erroneous conversion.
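
For anyone who wants to see the three cases side by side without piping through od, here is a minimal sketch. It uses Encode to produce the bytes in-process instead of relying on output layers. The helper name hex_bytes is my own, and encoding to iso-8859-1 stands in for "print with no layer", which is an assumption that holds only for code points below 256.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Format a byte string roughly the way od -tx1 shows it.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', unpack '(H2)*', $bytes;
}

my $char = chr hex 'C0';    # one character, code point U+00C0

# Case 1: no output layer, so the character leaves as one Latin-1 byte.
my $raw = encode('iso-8859-1', $char);

# Case 2: the bytes a :utf8 output layer writes for the same character.
my $utf8 = encode('UTF-8', $char);

# Case 3: the lone byte c0 is not valid UTF-8, so decode substitutes
# U+FFFD, which is then written out as three bytes.
my $lone_byte = "\xC0";
my $bad = encode('UTF-8', decode('utf8', $lone_byte));

print hex_bytes($raw),  "\n";    # c0
print hex_bytes($utf8), "\n";    # c3 80
print hex_bytes($bad),  "\n";    # ef bf bd
```

Only the second case turns the character À into valid UTF-8; the third is the erroneous conversion from above.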

So if you want chr 192 ( perl -le " print hex q/C0/ " prints 192 ) to come out as UTF-8, you have to encode it. Characters 0 to 255 are also valid Latin-1, so without an encoding step they are printed as single bytes, not as utf8.

$ perl -le " print chr hex q/C0/ " |od -tx1
0000000 c0 0d 0a
0000003
$ perl -le " print chr 255 " |od -tx1
0000000 ff 0d 0a
0000003
$ perl -le " print chr 256 " |od -tx1
Wide character in print at -e line 1.
0000000 c4 80 0d 0a
0000004
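
A small sketch of encoding explicitly, so that 192, 255 and 256 are all treated the same way and no "Wide character" warning appears. The hash name %hex_for is only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my %hex_for;
for my $code (192, 255, 256) {
    my $char  = chr $code;                 # a character, not bytes
    my $bytes = encode('UTF-8', $char);    # explicit character-to-bytes step
    $hex_for{$code} = join ' ', unpack '(H2)*', $bytes;
    print "chr $code -> $hex_for{$code}\n";
}
# chr 192 -> c3 80
# chr 255 -> c3 bf
# chr 256 -> c4 80
```

Once the encoding step is explicit, the 255/256 boundary no longer changes how the output is produced.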

Or, if you want chr 192 to return unicode, use the encoding pragma (the utf8 pragma doesn't affect chr)

$ perl -le " use encoding q/utf8/; print chr 192 " |od -tx1
0000000 c3 80 0a
0000003

Re^5: Bug in Template?
by Anonymous Monk on Mar 22, 2012 at 08:33 UTC
Re^5: Bug in Template?
by remiah (Hermit) on Mar 22, 2012 at 10:58 UTC

    Thanks for the reply. I will read perlunitut, and I found sites that explain Unicode in Perl precisely when I googled "ef bf bd". I am printing them now...

    When a character comes from outside of Perl, we have to decode the bytes into Perl's internal utf8, as perlunitut says, especially when we want to know the length in characters. For example, CGI's param() returns bytes, so when I want to know the length of a word, I decode it.
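
    A minimal sketch of what I mean by decoding before taking the length. The byte string is a made-up stand-in for what a CGI parameter might deliver (the UTF-8 bytes for U+00E9 followed by U+3041), not real param() output.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input bytes: c3 a9 is U+00E9, e3 81 81 is U+3041.
my $bytes = "\xC3\xA9\xE3\x81\x81";

# Decode the UTF-8 bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

print length($bytes), "\n";    # 5, counted in bytes
print length($chars), "\n";    # 2, counted in characters
```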

    My question, in short: here come two characters, '00E9' and '3041'. They must be two characters in utf8. How do you take the second character with substr and print it?

    I agree my example was clumsy. Is this clearer? I guess this is the OP's problem.