perl has to assume a character encoding.
Not at all

Of course it has to. My example used a binary operation (IO), and then a text operation. Since the text operation implies character context, the byte needs to be interpreted in some way. And this interpretation happens to be Latin-1.

"\x{E4}" =~ /\w/

A string literal is not the same as IO; my explanation only applies to my example, not yours.

In your example, the string is generated inside perl, and can thus be treated transparently with respect to any encoding. When the string comes from the outside, it is transported as a stream of bytes (because STDIN is a byte stream on UNIX platforms), and when Perl treats it as a text string, some interpretation has to happen.

To come back to my previous example, executed in bash:

#               | the UNIX pipe transports bytes, not
#               | codepoints. So Perl sees the byte E4
$ echo -e "\xE4"|perl -wE 'say <> ~~ /\w/'
#                                 ^^^^^^^ a text operation
#                                 sees the codepoint U+00E4

So, at one point we have a byte, and later a codepoint. The mapping from bytes to codepoints is what an encoding does, so Perl needs to use one, and it uses ISO-8859-1. Implicitly, because I never said decode('ISO-8859-1', ...).

So I cannot see why you insist that Perl never implicitly uses ISO-8859-1, when I've provided an example that demonstrates just that.
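
The same effect can be reproduced without a shell. What follows is only a minimal sketch, under two assumptions of mine: Perl 5.14 or later (so that "use v5.14" turns on the unicode_strings feature, much as -E does on a recent perl), and an in-memory filehandle standing in for the UNIX pipe. The variable names are made up for the illustration. It reads the byte E4 with no decoding layer, then compares it with the result of an explicit decode('ISO-8859-1', ...):

```perl
use v5.14;        # enables strict, say and the unicode_strings feature
use warnings;
use Encode qw(decode);

# Stand-in for the pipe: an in-memory filehandle that delivers the raw byte E4.
my $buffer = "\xE4";
open my $fh, '<:raw', \$buffer or die "cannot open in-memory handle: $!";
my $from_io = <$fh>;    # no decoding layer is applied to this read
close $fh;

# The same byte, decoded explicitly as ISO-8859-1.
my $decoded = decode('ISO-8859-1', "\xE4");

# A text operation on each string, plus a comparison of the two strings.
my $implicit_match = $from_io =~ /\w/ ? 1 : 0;
my $explicit_match = $decoded =~ /\w/ ? 1 : 0;
my $same_string    = $from_io eq $decoded ? 1 : 0;

say "undecoded byte matches \\w:     $implicit_match";
say "decoded character matches \\w:  $explicit_match";
say "both strings compare equal:    $same_string";
printf "code point seen by Perl:       U+%04X\n", ord $from_io;
```

If the undecoded string behaves exactly like the explicitly decoded one, then the implicit interpretation and the explicit ISO-8859-1 decoding are indistinguishable from the program's point of view, which is the claim I am making.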

Or what do you think it is, if not ISO-8859-1?
A Unicode code point, regardless of the state of the UTF8 flag.

But it was a byte at the level of the UNIX pipe. Now it is a code point. What mechanism changed it from a byte to a codepoint, if not (implicit) decoding as ISO-8859-1?

Since ISO-8859-1 provides a trivial mapping between the 256 possible byte values and the first 256 code points, it's really more of an interpretation than an actual decoding step, but it's there nonetheless.
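
The trivial mapping is easy to check. A small sketch, again assuming Perl 5.14 or later and the core Encode module (the counter name is mine): it decodes every possible byte value as ISO-8859-1 and counts how many end up at a code point different from the byte value.

```perl
use v5.14;
use warnings;
use Encode qw(decode);

# Count the byte values whose ISO-8859-1 decoding yields a code point
# that differs from the numeric value of the byte itself.
my $mismatches = 0;
for my $byte (0 .. 255) {
    my $char = decode('ISO-8859-1', chr $byte);
    $mismatches++ if ord($char) != $byte;
}

say "byte values that map to a different code point: $mismatches";
```

A count of zero means the decoding is the identity on the numeric values, which is why the step is so easy to overlook.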


In reply to Re^5: How does the built-in function length work? by moritz
in thread How does the built-in function length work? by PerlOnTheWay
