in reply to Matching/replacing a unicode character only works after decode()

Why?! Before decoding the utf8 string, how could the string go from input to output unchanged but fail to match the regex?
Basically, that's because Perl by default assumes a binary string is Latin-1, rather than UTF-8. And that's a problem - every byte string, in any encoding (UTF-8 or anything else), is also valid Latin-1, so Perl never complains.

Character \x{b5} is one byte in Latin-1, but two bytes in UTF-8. And \x{3bc} is simply too big for a one-byte encoding.
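To make those byte counts concrete, here is a minimal sketch using the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $micro = "\x{b5}";    # MICRO SIGN, one character

# One byte in Latin-1, two bytes in UTF-8:
print length( encode('latin1', $micro) ), "\n";   # 1
print length( encode('UTF-8',  $micro) ), "\n";   # 2

# U+03BC (GREEK SMALL LETTER MU) has no Latin-1 representation;
# by default encode() substitutes a '?' for it:
print encode('latin1', "\x{3bc}"), "\n";          # ?
```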

Why do I need to decode the utf8 string to match a UTF-8 character?
If you have a string in UTF-8 and want to apply regexes to it, get its length in characters, etc., you always have to decode it first. The reason is backwards compatibility: Perl is old. Other languages (Python, Ruby) broke compatibility to get better Unicode handling. Perl didn't.
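A minimal sketch of the before/after behaviour, using GREEK SMALL LETTER MU (U+03BC), whose UTF-8 encoding is the two bytes 0xCE 0xBC:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xce\xbc";    # raw UTF-8 bytes, as e.g. a database might return them

# Perl treats the undecoded string as two Latin-1 characters:
print length($bytes), "\n";                          # 2
print $bytes =~ /\x{3bc}/ ? "match" : "no match";    # no match
print "\n";

# After decoding, it is one Unicode code point:
my $text = decode('UTF-8', $bytes);
print length($text), "\n";                           # 1
print $text =~ /\x{3bc}/ ? "match" : "no match";     # match
print "\n";
```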

Re^2: Matching/replacing a unicode character only works after decode()
by Your Mother (Archbishop) on Jul 25, 2014 at 15:32 UTC

    It would only be a backwards compatibility issue if you accept that UTF-8 is the ONLY encoding used in computing. It’s not even a common default yet. Ruby’s Unicode support was terrible and is only made passable by installing some specific gems, and even then it’s not as good as Perl’s. Here’s an overview from tchrist: http://dheeb.files.wordpress.com/2011/07/gbu.pdf.

    Christiansen also once published a Yes/No style table comparing all the languages, and Perl was by far the best among Java, Python, Ruby, and PHP. I’m sorry I could not find this table again to link.

      It would only be a backwards compatibility issue if you accept that UTF-8 is the ONLY encoding used in computing.
      In the year 2014 UTF-8 is a more useful default than Latin-1, I'd say. BUT, the real problem is implicit upgrading from / downgrading to Latin-1. This is very similar to what Perl does with numbers / numeric-looking strings. The difference is that not all strings look like numbers, but absolutely any binary string looks like Latin-1 (and some Unicode strings can be downgraded to Latin-1 without warnings).

      Consider this:

      perl -MDevel::Peek -wE 'my $r = qr/\x{03bc}/; Dump $r'
      ...
      FLAGS = (OBJECT,FAKE,UTF8)
      PV = 0x10eff20 "(?^u:\\x{03bc})" [UTF8 "(?^u:\\x{03bc})"]
      Now, what happens when a UTF-8 regex meets a binary string? My guess is that the string gets upgraded to (Perl's internal) UTF-8... FROM (what Perl thinks is) Latin-1, as happens in other situations. Which is the wrong thing to do.
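      That implicit Latin-1 upgrade is easy to demonstrate by mixing a decoded string with raw UTF-8 bytes (a sketch; the two extra characters are the Latin-1 reading of the bytes 0xCE 0xBC):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $text  = decode('UTF-8', "\xce\xbc");   # one character: U+03BC
my $bytes = "\xce\xbc";                    # two bytes, never decoded

# Concatenation silently upgrades $bytes AS IF it were Latin-1,
# turning its two bytes into the two characters U+00CE and U+00BC:
my $mixed = $text . $bytes;
print length($mixed), "\n";                # 3, not the 2 you might expect
```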
      Ruby’s Unicode support was terrible
      It's still terribad. But at least Ruby defaults to UTF-8 in its source, for example.

      There is a big difference between excellent Unicode support (which Perl has, of course) and convenient Unicode support. You know, something that is not a pain in the ass. For example: what can go wrong with

      open my $file, '<', '/bogus_file' or die "Can't open: $!\n";

      ?
Re^2: Matching/replacing a unicode character only works after decode()
by ikegami (Patriarch) on Jul 27, 2014 at 04:03 UTC

    No, it's because the regex engine expects Unicode code points (the result of decode), not UTF-8 (what MySQL returned).

    It has nothing to do with backwards compatibility, or with being an old language. E.g., Java's regex library similarly expects chars, not bytes.