I suppose the question is: why can't I use decode_utf8() on output from XML::LibXML?

Because XML::LibXML (see https://metacpan.org/pod/XML::LibXML#ENCODINGS-SUPPORT-IN-XML::LIBXML) already gives you decoded "characters" instead of raw bytes/octets, so there is nothing left for decode_utf8() to decode.
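The same effect can be reproduced without XML::LibXML. This is only a sketch using the core Encode module; the sample string is made up, and the exact error text depends on your Encode version, so it is printed rather than assumed.

```perl
#!/usr/bin/perl --
use strict;
use warnings;
use Encode qw/ encode decode /;

# A character string (hypothetical sample data): 4 characters,
# the last one (U+263A) is above 255.
my $chars = "s\x{FC}\x{DF}\x{263A}";

# encode: characters -> UTF-8 octets
my $bytes = encode( 'UTF-8', $chars );

# decode: UTF-8 octets -> the same characters again
my $round_trip = decode( 'UTF-8', $bytes );

# Decoding something that is already characters is the mistake;
# with a character above 255 in the string, Encode is expected to die.
my $again = eval { decode( 'UTF-8', $chars ) };
my $error = $@;

printf "characters: %d, octets: %d\n", length $chars, length $bytes;
print "round trip ok\n" if $round_trip eq $chars;
print "decoding characters failed: $error" if $error;
```

If the string only holds characters below 256 you may get no error at all, just wrong data, which is the harder case to notice.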

... yields ...

See the replies by McA and farang. Thanks, monks :)

Also, terminals lie, even browsers lie, but the bytes do not lie. See the comments in this code (and the code itself), and keep the 5-minute tutorial in mind.

#!/usr/bin/perl --
use strict;
use warnings;
use Data::Dump qw/ dd /;
use Encode qw/ encode decode /;
use Path::Tiny qw/ path /;

my $tmpfile = path( 'deleteme.txt' );

## CHARACTERS HERE
#~ ordinal= ord( chr( 115 ) ) alias \N{U+0073} alias \163 alias LATIN SMALL LETTER S alias s
#~ ordinal= ord( chr( 195 ) ) alias \N{U+00C3} alias \303 alias LATIN CAPITAL LETTER A TILDE alias Ã
#~ ordinal= ord( chr( 188 ) ) alias \N{U+00BC} alias \274 alias FRACTION ONE QUARTER alias ¼
#~ ordinal= ord( chr( 195 ) ) alias \N{U+00C3} alias \303 alias LATIN CAPITAL LETTER A TILDE alias Ã
#~ ordinal= ord( chr( 159 ) ) alias \N{U+009F} alias \237 alias APPLICATION PROGRAM COMMAND alias Ÿ
#~ ordinal= ord( chr( 101 ) ) alias \N{U+0065} alias \145 alias LATIN SMALL LETTER E alias e
my $ords = join q{}, map { chr $_ } ( 115, 195, 188, 195, 159, 101 );

$tmpfile->spew_raw( $ords );
dd( { ords => $ords, raw => $tmpfile->slurp_raw, utf8 => $tmpfile->slurp_utf8 } );
#~ {
#~   ords => "s\xC3\xBC\xC3\x9Fe",
#~   raw  => "s\xC3\xBC\xC3\x9Fe",
#~   utf8 => "s\xFC\xDFe",
#~ }
## when you write raw without encoding
## when read that stuff as utf8, you get a surprise
#~ ordinal= ord( chr( 223 ) ) alias \N{U+00DF} alias \337 alias LATIN SMALL LETTER SHARP S alias ß
#~ ordinal= ord( chr( 252 ) ) alias \N{U+00FC} alias \374 alias LATIN SMALL LETTER U DIAERESIS alias ü

## >>>> OUTPUT encoded, the raw bytes change
$tmpfile->spew_utf8( $ords );
dd( { ords => $ords, raw => $tmpfile->slurp_raw, utf8 => $tmpfile->slurp_utf8 } );
#~ {
#~   ords => "s\xC3\xBC\xC3\x9Fe",
#~   raw  => "s\xC3\x83\xC2\xBC\xC3\x83\xC2\x9Fe",
#~   utf8 => "s\xC3\xBC\xC3\x9Fe",
#~ }
## utf8 is an encoding, representing characters (ordinals)

$tmpfile->spew_raw( encode 'UTF-8', $ords );
dd( { ords => $ords, raw => $tmpfile->slurp_raw, utf8 => $tmpfile->slurp_utf8 } );
#~ {
#~   ords => "s\xC3\xBC\xC3\x9Fe",
#~   raw  => "s\xC3\x83\xC2\xBC\xC3\x83\xC2\x9Fe",
#~   utf8 => "s\xC3\xBC\xC3\x9Fe",
#~ }
## decode raw bytes to get characters
## encode characters to get raw bytes/octets

dd( {
    ords            => $ords,
    decode_utf8_raw => decode( 'UTF-8', $tmpfile->slurp_raw ),
    utf8            => $tmpfile->slurp_utf8,
} );
#~ {
#~   decode_utf8_raw => "s\xC3\xBC\xC3\x9Fe",
#~   ords => "s\xC3\xBC\xC3\x9Fe",
#~   utf8 => "s\xC3\xBC\xC3\x9Fe",
#~ }
## hooray

$tmpfile->remove;
__END__
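If you are already stuck with double-encoded bytes like the 10-octet raw value dumped above, one possible way back is to decode, turn the resulting characters back into octets one-to-one, and decode again. This is a sketch and not a general repair tool; it assumes the data really went through UTF-8 encoding exactly twice.

```perl
#!/usr/bin/perl --
use strict;
use warnings;
use Encode qw/ encode decode /;

# Octets after encoding already-encoded data
# (the same bytes as the 10-octet raw value in the dumps).
my $double = "s\xC3\x83\xC2\xBC\xC3\x83\xC2\x9Fe";

# First decode: 10 octets -> 6 characters, all of them below 256.
my $once = decode( 'UTF-8', $double );

# Those 6 characters are really UTF-8 octets in disguise; ISO-8859-1
# maps each character 0..255 to the octet with the same number.
my $octets = encode( 'ISO-8859-1', $once );

# Second decode: 6 octets -> 4 characters ("s", U+00FC, U+00DF, "e").
my $text = decode( 'UTF-8', $octets );

printf "%d octets -> %d characters -> %d characters\n",
    length $double, length $once, length $text;
```

Fixing the place that encodes twice is still better than repairing the output afterwards.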

_Is_ any of this covered in perlunitut, and if so, under what section?

You can start with I/O flow (the actual 5 minute tutorial)
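The flow from that section (decode input, work with characters, encode output) could look roughly like this. It is a minimal sketch with made-up sample octets and only the core Encode module; real code would more likely use an :encoding(UTF-8) layer on the filehandles.

```perl
#!/usr/bin/perl --
use strict;
use warnings;
use Encode qw/ encode decode /;

# 1. Input: octets arrive from outside (hypothetical sample data).
my $in_octets = "s\xC3\xBC\xC3\x9Fe";

# 2. Decode once, at the boundary: 6 octets become 4 characters.
my $text = decode( 'UTF-8', $in_octets );

# 3. Work with characters: reverse sees 4 characters, so the
#    two-octet sequences are not torn apart.
my $reversed = scalar reverse $text;

# 4. Encode once, on the way out.
my $out_octets = encode( 'UTF-8', $reversed );

binmode STDOUT;    # STDOUT carries raw octets here
print $out_octets, "\n";
```

Reversing $in_octets directly would have produced invalid UTF-8, which is the kind of surprise the tutorial is trying to prevent.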

You should read the whole thing, and the documents it links to.

Also download the tarball from Perl Unicode Essentials (OSCON 2011, O'Reilly Conferences, July 25 - 29, 2011, Portland, OR) for even more Unicode info.


In reply to Re^3: Encode double encoding? by Anonymous Monk
in thread Encode double encoding? by DrkShadow
