in reply to regex for utf-8

The /e modifier on your substitution says that the replacement string is to be evaluated as a Perl expression. The '<<' operator is 'shift left'; shifting left by n bits multiplies by 2**n. The '|' operator is 'bitwise or', not alternation. See perlop for details.

Update: The & operator is 'bitwise and'.
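
A minimal sketch of the three operators and of /e, in case it helps to see them in isolation. The values and variable names are made up for illustration and are not taken from the substitution in the original question.

```perl
use strict;
use warnings;

# << shifts bits left: each position shifted doubles the value.
my $shifted = 3 << 6;        # 3 * 2**6

# | sets a bit in the result if it is set in either operand.
my $ored = 0xC0 | 0x29;

# & keeps only the bits that are set in both operands (a mask).
my $masked = 0xA9 & 0x3F;

printf "%d 0x%02X 0x%02X\n", $shifted, $ored, $masked;   # 192 0xE9 0x29

# With /e the replacement is Perl code, so << here is the operator,
# not literal text.
my $text = "2 3";
$text =~ s/(\d+) (\d+)/$1 << $2/e;
print "$text\n";             # 16
```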

After Compline,
Zaxo

Re: Re: regex for utf-8
by jjohhn (Scribe) on Feb 27, 2003 at 23:30 UTC
    Hmm. Thank you. Somehow these operations are converting an 8-bit character representation to a multibyte (UTF-8) representation, for those values that are greater than 7F, and leaving the values alone when they fall within ASCII range. Probably it converts to Latin-1, since I think you would need a lookup table to convert to one of the other encodings. I am still a little over my head here, but I am swimming upwards. What is the "&" doing here? Thanks again.
      >> Somehow these operations are converting an 8-bit character representation to a multibyte (UTF-8) representation

      Actually, it is the other way around. It converts UTF-8 encoded characters to plain 8-bit values, for code points in the range 0x80 through 0xFF inclusive. It ignores anything outside that range: anything lower is already ASCII, and anything higher is left unchanged, which would leave incorrect stuff in the string.
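
The substitution under discussion is not quoted in this thread, so the following is only a sketch of the kind of conversion described, under that assumption. The helper name utf8_to_latin1 is hypothetical. It works on raw bytes and only touches two-byte sequences whose lead byte is 0xC2 or 0xC3, which are the ones that encode U+0080 through U+00FF.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse two-byte UTF-8 sequences for
# U+0080..U+00FF into single 8-bit characters.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    # The lead byte (0xC2 or 0xC3) carries the top two bits of the
    # code point; the continuation byte (0x80..0xBF) carries the low six.
    $bytes =~ s/([\xC2\xC3])([\x80-\xBF])/
        chr( (ord($1) << 6 & 0xC0) | (ord($2) & 0x3F) )/eg;
    return $bytes;
}

my $cafe = utf8_to_latin1("caf\xC3\xA9");   # UTF-8 bytes for U+00E9
printf "%02X\n", ord(substr($cafe, -1));    # E9
```

For the bytes C3 A9, the shift gives 0x30C0, the mask with 0xC0 keeps 0xC0, the mask with 0x3F turns A9 into 0x29, and the bitwise or joins them into 0xE9. ASCII bytes and longer sequences (for example the three bytes E2 82 AC) do not match the pattern and pass through untouched.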

      Yes, the output is Latin-1, because Unicode's first 256 code points are identical to Latin-1.
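
One way to see this is to encode the same code point both ways with the core Encode module. This is only an illustrative sketch; the choice of U+00E9 and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $e_acute = chr(0xE9);                        # code point U+00E9

my $latin1 = encode('ISO-8859-1', $e_acute);    # a single byte
my $utf8   = encode('UTF-8',      $e_acute);    # two bytes

# Print each encoded form as hex bytes.
for my $pair ( [ 'Latin-1', $latin1 ], [ 'UTF-8', $utf8 ] ) {
    my ($name, $bytes) = @$pair;
    my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
    print "$name: $hex\n";
}
```

The Latin-1 byte has the same numeric value as the code point (E9), while UTF-8 spreads the same number over the bytes C3 A9.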

        I understood that Latin-1 is an 8-bit extension to ASCII, and that any code points >= \x80 are represented in multiple bytes. Does "code points are identical" mean identical once the leading high bit is taken away? Please explain; I am understanding this, but slowly. John