PerlMonks  

Re: Why does perl's internal utf8 seem to allow single-byte latin1?

by ikegami (Patriarch)
on Mar 24, 2010 at 03:30 UTC ( [id://830440] )


in reply to Why does perl's internal utf8 seem to allow single-byte latin1?

What I expected from the script was the two byte sequence in all cases.

Your expectations are wrong for ustring1. There's nothing that caused it to be changed to the less efficient storage format.

utf8::is_utf8 pointed this out, and pointed out your expectations were accurate for ustring2 and ustring3.
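To illustrate what does cause a switch to the other storage format, here is a minimal sketch. The variable names are made up for the example, and it assumes the literal is written with an escape so the source file encoding doesn't matter. A plain literal stays in the one-byte-per-character format; appending a character above 255, or decoding UTF-8 bytes, gives a string with the UTF8 flag on.

```perl
use strict;
use warnings;

# A literal containing only characters below 256 is not upgraded.
my $s      = "Ein \xD6konomisches Modell";
my $before = utf8::is_utf8($s) ? 1 : 0;

# A character above 255 cannot be stored in one byte, so the result
# of the concatenation has to use the upgraded format.
my $t     = $s . "\x{263A}";
my $after = utf8::is_utf8($t) ? 1 : 0;

# Decoding UTF-8 bytes that contain a multi-byte sequence also
# produces a flagged string.
my $decoded = "Ein \xC3\x96konomisches Modell";
utf8::decode($decoded);
my $decoded_flag = utf8::is_utf8($decoded) ? 1 : 0;

print "literal: $before, after concat: $after, after decode: $decoded_flag\n";
```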

print_chrcode doesn't look at the internal format. It looks at the content of the string. That's why it didn't tell you anything.

(The previous paragraph is wrong if you happen to use the buggy version of Perl the OP is using. I didn't notice the OP had included the output of this program. With 5.10, you get

    Ein <#d6>konomisches Modell
    Ein <#d6>konomisches Modell 1
    Ein <#d6>konomisches Modell 1
    Ein <#c3><#96>konomisches Modell
    Ein <#c3><#96>konomisches Modell
    Ein <#c3><#96>konomisches Modell
)
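The difference between content and internal format can be sketched like this. It is not the OP's print_chrcode, just an assumed stand-in that looks at characters with ord and length. Both copies hold the same characters, so anything working at the character level sees no difference; only utf8::is_utf8 (or Devel::Peek) tells them apart.

```perl
use strict;
use warnings;

# Two copies of the same text; only the internal storage will differ.
my $plain    = "Ein \xD6konomisches Modell";
my $upgraded = $plain;
utf8::upgrade($upgraded);    # switch the copy to the UTF8=1 format

# Character-level view: what eq, length() and ord() see.
my $same_content   = $plain eq $upgraded ? 1 : 0;
my @codes_plain    = map { ord } split //, $plain;
my @codes_upgraded = map { ord } split //, $upgraded;

# Internal view: the flag is the only thing that differs.
my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

printf "equal=%d length=%d/%d flag=%d/%d char4=0x%02x/0x%02x\n",
    $same_content,
    length($plain), length($upgraded),
    $flag_plain, $flag_upgraded,
    $codes_plain[4], $codes_upgraded[4];
```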

How can I force the internal perl representation to be two-byte utf-8

utf8::upgrade and utf8::downgrade are used to switch between the two internal formats.

    use Devel::Peek qw( Dump );

    my $s1 = "Ein Ökonomisches Modell";
    my $s2 = "Ein \326konomisches Modell";

    Dump($s1);
    Dump($s2);

    utf8::upgrade( $s1 );
    utf8::upgrade( $s2 );

    Dump($s1);
    Dump($s2);

    SV = PV(0x2369cc) at 0x182a354
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x23fcc4 "Ein \326konomisches Modell"\0
      CUR = 23
      LEN = 24
    SV = PV(0x2369dc) at 0x182a384
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x23fd9c "Ein \326konomisches Modell"\0
      CUR = 23
      LEN = 24
    SV = PV(0x2369cc) at 0x182a354
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x182430c "Ein \303\226konomisches Modell"\0 [UTF8 "Ein \x{d6}konomisches Modell"]
      CUR = 24
      LEN = 25
    SV = PV(0x2369dc) at 0x182a384
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x1832744 "Ein \303\226konomisches Modell"\0 [UTF8 "Ein \x{d6}konomisches Modell"]
      CUR = 24
      LEN = 25
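The same round trip can be sketched without Devel::Peek. The snapshot helper below is hypothetical, written only for this example; it uses bytes::length to report the size of the internal buffer next to the character count. It also shows that utf8::downgrade cannot succeed once the string holds a character above 255, which is why the second argument (fail-ok) is passed there.

```perl
use strict;
use warnings;
use bytes ();    # makes bytes::length available without changing string semantics

my @log;

# Hypothetical helper: record character count, internal byte count and flag.
sub snapshot {
    my ($label, $str) = @_;
    push @log, sprintf "%-10s chars=%d bytes=%d utf8=%d",
        $label, length($str), bytes::length($str),
        utf8::is_utf8($str) ? 1 : 0;
}

my $s = "Ein \xD6konomisches Modell";
snapshot('initial', $s);

utf8::upgrade($s);      # one-byte format -> UTF-8 based format
snapshot('upgraded', $s);

utf8::downgrade($s);    # and back again; the content is unchanged throughout
snapshot('downgraded', $s);

# A character above 255 has no one-byte form. With a true second
# argument, downgrade returns false instead of dying.
my $wide    = "\x{263A}";
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "$_\n" for @log;
print "downgrade of wide string succeeded: $wide_ok\n";
```

The byte count goes from 23 to 24 and back, matching CUR in the dumps above, while the character count stays at 23.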

All that being said, I have no idea what you are trying to accomplish. It sounds very, very wrong.

Replies are listed 'Best First'.
Re^2: Why does perl's internal utf8 seem to allow single-byte latin1?
by brycen (Monk) on Mar 24, 2010 at 04:36 UTC
    The given output was from perl v5.10.0, from Debian stable. I see the flaw in print_chrcode, and have added 'use bytes' to get it to display the true internal format (matching Devel::Peek). Using "export PERL_UNICODE=SAD" or "export PERL_UNICODE=IE" switches the behavior of the script.
      Oh, your source file is encoded using UTF-8 despite the no utf8;.

      Using "export PERL_UNICODE=SAD" or "export PERL_UNICODE=IE" switches the behavior of the script.

      No, they don't. You don't use @ARGV, you don't use STD* for anything but 7-bit chars, and you don't open any file handles. They have no effect whatsoever.

      Again, what are you trying to do? Whatever it is, you seem to be taking the worst possible approach.
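On the PERL_UNICODE point: those settings act on @ARGV, the standard handles and the default layers of handles you open, not on how a string is stored. As a rough sketch (the bytes_written helper is made up for this example, and it writes to in-memory handles so the result doesn't depend on the environment), the bytes that come out are decided by the layer on the handle:

```perl
use strict;
use warnings;

my $text = "\xD6";    # one character, U+00D6

# Hypothetical helper: print a string through the given layer into an
# in-memory buffer and return the bytes written, as hex.
sub bytes_written {
    my ($layer, $str) = @_;
    my $buf = '';
    open my $fh, ">$layer", \$buf or die "open: $!";
    print {$fh} $str;
    close $fh or die "close: $!";
    return join ' ', map { sprintf '%02x', ord } split //, $buf;
}

my $raw_plain = bytes_written(':raw', $text);
my $enc_plain = bytes_written(':encoding(UTF-8)', $text);

# Changing the internal format of the string does not change the output.
my $up = $text;
utf8::upgrade($up);
my $raw_up = bytes_written(':raw', $up);
my $enc_up = bytes_written(':encoding(UTF-8)', $up);

print "raw:      $raw_plain / $raw_up\n";
print "encoding: $enc_plain / $enc_up\n";

# ${^UNICODE} holds the settings taken from -C or PERL_UNICODE.
print "\${^UNICODE} = ${^UNICODE}\n";
```

If a particular output encoding is what you are after, an explicit layer on the handle is the place to ask for it.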
