in reply to possible missunderstanding of package Encode

.. which tells me that this is the internal Perl representation of the string in UTF-8 encoding. What am I doing wrong?

It's unclear what you think is going on.

See perlunitut: Unicode in Perl

The default encoding is something like latin-1, not utf-8. So you start with some latin-1 string, encode it as latin-1 (nothing changes), then you use length, and you're confused :)

See this: only once you "decode" do you have an actual Perl "unicode string"; until then it's "binary" (latin1).

#!/usr/bin/perl --
use strict;
use warnings;
use Devel::Peek;
use Data::Dump;
use Encode;

our $f = "K\366ln";
sub ff { dd($f); Dump($f); }

ff;
$f = encode('iso-8859-1', $f); # bytes encoded as latin1
ff;
$f = encode('UTF-8', $f); # bytes encoded as utf8
ff;
$f = decode('UTF-8', $f); # unicode string
ff;
__END__
"K\xF6ln"
SV = PVNV(0xb18114) at 0x99b8f4
  REFCNT = 1
  FLAGS = (POK,pIOK,pNOK,pPOK)
  IV = 0
  NV = 0
  PV = 0xada504 "K\366ln"\0
  CUR = 4
  LEN = 12
"K\xF6ln"
SV = PVNV(0xb18114) at 0x99b8f4
  REFCNT = 1
  FLAGS = (POK,pIOK,pNOK,pPOK)
  IV = 0
  NV = 0
  PV = 0xb2f424 "K\366ln"\0
  CUR = 4
  LEN = 12
"K\xC3\xB6ln"
SV = PVNV(0xb18114) at 0x99b8f4
  REFCNT = 1
  FLAGS = (POK,pIOK,pNOK,pPOK)
  IV = 0
  NV = 0
  PV = 0xb2f3dc "K\303\266ln"\0
  CUR = 5
  LEN = 12
"K\xF6ln"
SV = PVMG(0xacf3cc) at 0x99b8f4
  REFCNT = 1
  FLAGS = (SMG,POK,pIOK,pNOK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0xb26374 "K\303\266ln"\0 [UTF8 "K\x{f6}ln"]
  CUR = 5
  LEN = 12
  MAGIC = 0xae478c
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 4

Re^2: possible missunderstanding of package Encode
by toohoo (Beadle) on Oct 20, 2015 at 10:20 UTC

    Hello,

    I may not have expressed myself correctly. The first value that should be used is:

    'Köln'

    .. as written there in single quotes. There might or might not be a further assignment to the scalar variable from a database or elsewhere, but the first assignment should work just as well as any later one. When Perl tells me that the length is 5, then this is in my eyes not correct iso-8859-1, because in this case it should be only 4 characters. This means that, independent of what I have in this variable at runtime, encode should convert it to the ANSI or ASCII representation. And yes, I know that there is a difference between these two. But the character 'ö' should be only one byte and not 2. I hope I have expressed myself more correctly now.

    thanks

    The last version of my test-script so far:

    #!/usr/bin/perl
    use v5.10;
    use Encode;
    use Data::Dumper;

    my $temp = encode( "iso-8859-1", 'Köln' );
    say Dumper "========== encode string ==========";
    say $temp, "(", length($temp), ")";

    my $VUOrt0 = 'Köln';
    $temp = encode( "iso-8859-1", $VUOrt0 );
    say Dumper "========== encode scalar variable ==========";
    say $temp, "(", length($temp), ")";

      I may not have expressed myself correctly. The first value that should be used is: 'Köln'

      ;) That's the exact value I used. All of the values produced by encode/decode in my program are exactly 'Köln': the latin1 and binary version and the utf8 version, they're all 'Köln'.

      When Perl tells me that the length is 5, then this is in my eyes not correct iso-8859-1, because in this case it should be only 4 characters. ... say Dumper "========== encode string ==========";

      Why are you looking at "length" at all?

      You start with unknown bytes (either utf8 or latin1), and perl treats them as bytes or latin1. Whether the length is 4 or 5 doesn't matter; it's not a "unicode string", it's a binary string or a latin1 string.

      Then you encode this string to latin1 explicitly, so now it's bytes for sure. This time it makes no sense to look at length: it's the length of the bytes, whatever they are. Since you don't know what you started with, the new length doesn't matter.

      Also, if you're going to Dumper anything, it should be data, not banners.

      I/O flow (the actual 5 minute tutorial)
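
      To make the point about length concrete, here is a minimal sketch. It assumes the script file is saved as UTF-8 and has no "use utf8", so the literal 'Köln' reaches perl as five bytes; the variable names are only for illustration.

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode encode);

      # Assumption: these are the bytes perl sees for 'Köln' in a UTF-8 source file.
      my $bytes = "K\xC3\xB6ln";

      # decode() turns the UTF-8 bytes into a four-character string.
      my $chars = decode('UTF-8', $bytes);

      # encode() turns the characters into ISO-8859-1 bytes, one byte per character.
      my $latin1 = encode('iso-8859-1', $chars);

      printf "%-20s length %d\n", 'undecoded bytes',    length $bytes;   # 5
      printf "%-20s length %d\n", 'decoded characters', length $chars;   # 4
      printf "%-20s length %d\n", 'latin-1 bytes',      length $latin1;  # 4
      ```

      Only the decoded string has a length that counts characters; the other two lengths count bytes in whatever encoding the data happens to be in.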

        Thanks, and a short answer: I put the string in an XML document and send it to the web service on the other side. The web service (which requires iso-8859-1) tells me that I have delivered 'Köln' instead of 'Köln' and that it is not able to identify this correctly.

        regards
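
        A possible explanation for what the web service reports, as a hedged sketch: it assumes the value arrives as undecoded UTF-8 bytes, and the element name "city" is made up for illustration. Encoding such bytes to iso-8859-1 leaves them unchanged, while decoding first gives the single byte expected for 'ö'.

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode encode);

        # Assumption: the value arrives as UTF-8 bytes (e.g. from a UTF-8 source file).
        my $utf8_bytes = "K\xC3\xB6ln";

        # Encoding without decoding first: each byte is treated as a Latin-1
        # character, so the two UTF-8 bytes C3 B6 pass through unchanged.
        my $wrong = encode('iso-8859-1', $utf8_bytes);

        # Decoding first, then encoding, yields the single byte F6 for 'ö'.
        my $right = encode('iso-8859-1', decode('UTF-8', $utf8_bytes));

        # Hypothetical payload; the element name <city> is invented for this sketch.
        my $xml = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<city>$right</city>\n};

        # Show the bytes in hex so the terminal encoding cannot hide the difference.
        printf "wrong: %s\n", join ' ', map { sprintf '%02X', ord } split //, $wrong;
        printf "right: %s\n", join ' ', map { sprintf '%02X', ord } split //, $right;
        ```

        If the real data comes from a database or another source, the decode step has to name whatever encoding that source actually delivers; UTF-8 is only an assumption here.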