in reply to Re^2: How does the built-in function length work?
in thread How does the built-in function length work?

(Formerly: Expected Unicode code points or ASCII depending on UTF8 flag.)

I guess that what you call "Unicode code point" is what I call "ISO-8859-1". ISO-8859-1 is simply the encoding that maps the byte values from 0 to 255 to the Unicode codepoints from 0 to 255, in that order.
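
A quick sketch of that mapping, using the core Encode module (variable names are just for illustration):

use Encode qw(decode);

my $bytes   = "\xE4";                        # a single byte, value 0xE4
my $decoded = decode('ISO-8859-1', $bytes);  # one character, code point U+00E4
printf "byte 0x%02X -> U+%04X\n", ord($bytes), ord($decoded);
# prints: byte 0xE4 -> U+00E4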

Perl never assumes or expects iso-8859-1.
$ echo -e "\xE4"|perl -wE 'say <> ~~ /\w/'
1
$   # this is perl 5.14.1

Since no decoding step happened here, and <> is a binary operation while the regex match is a text operation, perl has to assume a character encoding. And that happens to be ISO-8859-1. Or what do you think it is, if not ISO-8859-1?
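
The same thing can be seen from within perl, without the shell. A minimal sketch (assuming perl 5.14 or later, so that use v5.14 gives the same feature set as -E; Devel::Peek is a core module):

use v5.14;                    # enables say and unicode_strings, like -E on 5.14
use Devel::Peek;

my $str = "\xE4";             # stands in for the byte that came off the pipe
Dump($str);                   # shows a one-byte PV with the UTF8 flag off
say length $str;              # 1
printf "U+%04X\n", ord $str;  # U+00E4
say $str =~ /\w/ ? 1 : 0;     # 1, \w treats it as a word character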

Re^4: How does the built-in function length work?
by JavaFan (Canon) on Dec 02, 2011 at 21:27 UTC
    Or what do you think it is, if not ISO-8859-1?
    EBCDIC? Binary? ISO-8859-15?
Re^4: How does the built-in function length work?
by ikegami (Patriarch) on Dec 02, 2011 at 21:48 UTC

    perl has to assume a character encoding.

    Not at all. If it must assume an encoding, and that encoding is iso-8859-1 for

    "\x{E4}" =~ /\w/

    then what encoding is assumed for the following?

    "\x{2660}" =~ /\w/

    It never deals with any encoding. It always deals with string elements (characters). And those string elements (characters) are required to be Unicode code points.

    • Character E4 is taken as Unicode code point E4, not some byte produced by iso-8859-1.
    • Character 2660 is taken as Unicode code point 2660, not some byte produced by iso-8859-1.

    It's entirely up to you to create a string with the right elements, which may or may not involve character encodings.
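
    For instance (a sketch, with made-up variable names), the same one-character string can be produced with or without an encoding being involved, and nothing downstream can tell the difference:

    use v5.14;
    use Encode qw(decode);

    my $direct  = chr 0xE4;                      # no encoding involved
    my $decoded = decode('ISO-8859-1', "\xE4");  # an encoding involved

    say $direct eq $decoded ? "same string" : "different";   # same string
    say $direct  =~ /\w/ ? 1 : 0;                             # 1
    say $decoded =~ /\w/ ? 1 : 0;                             # 1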

    Or what do you think it is, if not ISO-8859-1?

    A Unicode code point, regardless of the state of the UTF8 flag.

    • Character E4 (UTF8=0) is taken as Unicode code point E4, not some byte produced by iso-8859-1.
    • Character E4 (UTF8=1) is taken as Unicode code point E4, not some byte produced by iso-8859-1.

    In short, you're overcomplicating things. It's NOT:

    Each character is expected to be an iso-8859-1 byte if UTF8=0 or a Unicode code point if UTF8=1.

    It's simply:

    Each character is expected to be a Unicode code point.
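
    A sketch of that point: utf8::upgrade changes only the internal representation, never the characters the string contains.

    use v5.14;

    my $down = "\xE4";       # UTF8=0: stored internally as the byte E4
    my $up   = "\xE4";
    utf8::upgrade($up);      # UTF8=1: now stored internally as C3 A4

    say $down eq $up ? "equal" : "not equal";   # equal
    say ord $down;                              # 228, i.e. U+00E4
    say ord $up;                                # 228, i.e. U+00E4
    say length($down), " ", length($up);        # 1 1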

      perl has to assume a character encoding.
      Not at all

      Of course it has to. My example used a binary operation (IO), and then a text operation. Since the text operation implies character context, the byte needs to be interpreted in some way. And this interpretation happens to be Latin-1.

      "\x{E4}" =~ /\w/

      A string literal is not the same as IO; my explanation only applies to my example, not yours.

      In your example, the string is generated from inside perl, and can thus be treated independently of any encoding. When the string comes from the outside, it is transported as a stream of bytes (because STDIN is a byte stream on UNIX platforms), and when Perl treats it as a text string, some interpretation has to happen.

      To come back to my previous example, executed in bash:

      #               | the UNIX pipe transports bytes, not
      #               | codepoints. So Perl sees the byte E4
      $ echo -e "\xE4"|perl -wE 'say <> ~~ /\w/'
      #                                 ^^^^^^^ a text operation
      #                                         sees the codepoint U+00E4

      So, at one point we have a byte, and later a codepoint. The mapping from bytes to codepoints is what an encoding does, so Perl needs to use one, and it uses ISO-8859-1. Implicitly, because I never said decode('ISO-8859-1', ...).

      So I cannot see why you insist that Perl never implicitly uses ISO-8859-1, when I've provided an example that demonstrates just that.

      Or what do you think it is, if not ISO-8859-1?
      A Unicode code point, regardless of the state of the UTF8 flag.

      But it was a byte at the level of the UNIX pipe. Now it is a code point. What mechanism changed it from a byte to a codepoint, if not (implicit) decoding as ISO-8859-1?

      Since ISO-8859-1 provides a trivial mapping between the 256 byte values and the first 256 code points, it's really more of an interpretation than an actual decoding step, but it's there nonetheless.
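
      To make that concrete (a sketch using the core Encode module): an explicit decode changes nothing at the character level, only the internal bookkeeping, which is why the implicit interpretation is so easy to overlook.

      use v5.14;
      use Encode qw(decode);

      my $bytes = "\xE4";                       # what the pipe delivers
      my $chars = decode('ISO-8859-1', $bytes); # explicit Latin-1 decoding

      say $bytes eq $chars ? "same characters" : "different";  # same characters
      say utf8::is_utf8($bytes) ? 1 : 0;                        # 0
      say utf8::is_utf8($chars) ? 1 : 0;                        # 1, only the flag differs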

        Since the text operation implies character context, the byte needs to be interpreted in some way.

        Yes, as a Unicode code point.

        A string literal is not the same as IO; my explanation only applies to my example, not yours.

        Both readline and the string literal create the same string, so that only makes sense if you say that readline is the one that does the iso-8859-1 decoding. Is that what you're saying?

        (I hope not, cause it's preposterous to say that copying bytes from disk to memory is a decoding operation. In binmode no less!)

        But it was a byte at level of the UNIX pipe. Now it is a code point. What mechanism changed it from a byte to a codepoint, if not (implicit) decoding as ISO-8859-1?

        None. There's no now and then; it's always a code point, and it was always stored in a byte.

        The mapping from bytes to codepoints is what an encoding does

        I don't call the following iso-8859-1 decoding:

        UV codepoint = s[i];