Re: Why does perl's internal utf8 seem to allow single-byte latin1?

What I expected from the script was the two byte sequence in all cases.

Your expectations are wrong for ustring1. Nothing caused it to be upgraded to the less efficient (UTF-8) storage format.

utf8::is_utf8 pointed this out, and pointed out your expectations were accurate for ustring2 and ustring3.

print_chrcode doesn't look at the internal format. It looks at the content of the string. That's why it didn't tell you anything.

(The previous paragraph is wrong if you happen to use the buggy version of Perl the OP is using. I didn't notice the OP had included the output of this program. With 5.10, you get:

Ein <#d6>konomisches Modell
Ein <#d6>konomisches Modell 1
Ein <#d6>konomisches Modell 1

Ein <#c3><#96>konomisches Modell
Ein <#c3><#96>konomisches Modell
Ein <#c3><#96>konomisches Modell

)
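
The point about print_chrcode can be sketched as follows. This is only an illustration, not the OP's script: chr_codes is a hypothetical stand-in for print_chrcode, and the string is written with an escape so the result doesn't depend on how the source file is encoded. Anything that works on characters (eq, length, ord) should report the same thing for both storage formats; only utf8::is_utf8 tells them apart.

```perl
use strict;
use warnings;

# Two copies of the same string; "\xD6" is U+00D6.
my $downgraded = "Ein \xD6konomisches Modell";
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);    # changes the internal storage only

# Hypothetical helper standing in for the OP's print_chrcode:
# it lists the code point of every character in the string.
sub chr_codes {
    my ($s) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $s;
}

my $same_content = $downgraded eq $upgraded ? 1 : 0;
my $same_length  = length($downgraded) == length($upgraded) ? 1 : 0;
my $same_codes   = chr_codes($downgraded) eq chr_codes($upgraded) ? 1 : 0;

# Only the internal flag differs between the two copies.
my @flags = map { utf8::is_utf8($_) ? 1 : 0 } $downgraded, $upgraded;

print "same content: $same_content\n";
print "same length:  $same_length\n";
print "same codes:   $same_codes\n";
print "is_utf8:      @flags\n";
```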

How can I force the internal perl representation to be two-byte utf-8

utf8::upgrade and utf8::downgrade are used to switch between the two internal formats.

use Devel::Peek qw( Dump );

my $s1 = "Ein Ökonomisches Modell";
my $s2 = "Ein \326konomisches Modell";

Dump($s1);
Dump($s2);

utf8::upgrade( $s1 );
utf8::upgrade( $s2 );

Dump($s1);
Dump($s2);

SV = PV(0x2369cc) at 0x182a354
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x23fcc4 "Ein \326konomisches Modell"\0
  CUR = 23
  LEN = 24
SV = PV(0x2369dc) at 0x182a384
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x23fd9c "Ein \326konomisches Modell"\0
  CUR = 23
  LEN = 24
SV = PV(0x2369cc) at 0x182a354
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x182430c "Ein \303\226konomisches Modell"\0 [UTF8 "Ein \x{d6}konomisches Modell"]
  CUR = 24
  LEN = 25
SV = PV(0x2369dc) at 0x182a384
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1832744 "Ein \303\226konomisches Modell"\0 [UTF8 "Ein \x{d6}konomisches Modell"]
  CUR = 24
  LEN = 25
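
If you'd rather not read Dump output, the same round trip can be sketched with utf8::is_utf8 and bytes::length. This is a minimal sketch for inspection only: snapshot and @report are names made up for the example, and bytes::length is used purely as a debugging aid to count stored bytes (the CUR field above). The character count should stay the same throughout while the stored byte count changes.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling the bytes pragma

my $s = "Ein \xD6konomisches Modell";
my @report;

# Hypothetical helper: record the flag, the character count and the
# number of bytes currently used to store $s.
sub snapshot {
    my ($label) = @_;
    push @report, sprintf('%-10s utf8=%d chars=%d stored_bytes=%d',
        $label, (utf8::is_utf8($s) ? 1 : 0), length($s), bytes::length($s));
}

snapshot('original');

utf8::upgrade($s);      # switch to the UTF-8 internal format
snapshot('upgraded');

utf8::downgrade($s);    # switch back; dies if the string has chars > 255
snapshot('downgraded');

print "$_\n" for @report;
```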

All that being said, I have no idea what you are trying to accomplish. It sounds very, very wrong.

Re^2: Why does perl's internal utf8 seem to allow single-byte latin1? by brycen (Monk) on Mar 24, 2010 at 04:36 UTC
The given output was from perl v5.10.0, from Debian stable. I see the flaw in print_chrcode, and have added 'use bytes' to get it to display the true internal format (matching Devel::Peek). Using "export PERL_UNICODE=SAD" or "export PERL_UNICODE=IE" switches the behavior of the script.
Re^3: Why does perl's internal utf8 seem to allow single-byte latin1? by ikegami (Patriarch) on Mar 24, 2010 at 05:08 UTC
Oh, your source file is encoded using UTF-8 despite the `no utf8;`.
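
A rough sketch of what that means for a literal, assuming the file really is UTF-8 and `use utf8;` is not in effect: the parser sees the two bytes of the encoded character as two separate characters. The escapes below simulate that so the example doesn't depend on the encoding of the file it is pasted into, and the variable names are invented for the example.

```perl
use strict;
use warnings;

# What Perl sees for the UTF-8 encoded letter when the source is
# not declared as UTF-8: two characters, one per byte.
my $raw       = "\xC3\x96";
my $raw_len   = length($raw);
my $raw_first = sprintf('%02X', ord($raw));

# Decoding the bytes gives the single character that was intended.
my $text       = $raw;
my $decoded_ok = utf8::decode($text) ? 1 : 0;
my $text_len   = length($text);
my $text_first = sprintf('%02X', ord($text));

print "raw:     length=$raw_len first=$raw_first\n";
print "decoded: length=$text_len first=$text_first ok=$decoded_ok\n";
```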
Re^3: Why does perl's internal utf8 seem to allow single-byte latin1? by ikegami (Patriarch) on Mar 24, 2010 at 05:15 UTC
Using "export PERL_UNICODE=SAD" or "export PERL_UNICODE=IE" switches the behavior of the script.

No, they don't. You don't use @ARGV, you don't use STD* for anything but 7-bit chars, and you don't open any file handles. They have no effect whatsoever. Again, what are you trying to do? Whatever it is, you seem to be taking the worst possible approach.