in reply to Re: Matching/replacing a unicode character only works after decode()
in thread Matching/replacing a unicode character only works after decode()

It would only be a backwards compatibility issue if you accept that UTF-8 is the ONLY encoding used in computing. It’s not even a common default yet. Ruby’s Unicode support was terrible and is only made passable by installing some specific gems, and even then it’s not as good as Perl’s. Here’s an overview from tchrist: http://dheeb.files.wordpress.com/2011/07/gbu.pdf.

Christiansen also once published a Yes/No-style table comparing languages, and Perl was by far the best among Java/Python/Ruby/PHP. I’m sorry I could not find this table again to link.

Replies are listed 'Best First'.
Re^3: Matching/replacing a unicode character only works after decode()
by Anonymous Monk on Jul 25, 2014 at 20:15 UTC
Re^3: Matching/replacing a unicode character only works after decode()
by Anonymous Monk on Jul 25, 2014 at 16:45 UTC
    It would only be a backwards compatibility issue if you accept that UTF-8 is the ONLY encoding used in computing.
    In the year 2014 UTF-8 is a more useful default than Latin-1, I'd say. BUT, the real problem is implicit upgrading from / downgrading to Latin-1. This is very similar to what Perl does with numbers / numeric-looking strings. The difference is that not all strings look like numbers, but absolutely any binary string looks like Latin-1 (and some Unicode strings can be downgraded to Latin-1 without warnings).
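
    A minimal sketch of that implicit upgrade, assuming the bytes are UTF-8 data that was read without decoding (the variable names are only illustrative). Concatenating them with a character string makes Perl treat each byte as a Latin-1 character:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "\xCE\xBC" is the UTF-8 encoding of U+03BC (GREEK SMALL LETTER MU),
# held here as two raw bytes, as if read from a file without decoding.
my $bytes = "\xCE\xBC";

# A genuine character string with one character above U+00FF.
my $chars = "\x{263A}";

# Concatenation silently upgrades $bytes, treating each byte as a
# Latin-1 character: U+00CE and U+00BC, not U+03BC. No warning is issued.
my $joined = $bytes . $chars;
printf "joined: length %d, first char U+%04X\n",
    length $joined, ord substr($joined, 0, 1);

# Decoding explicitly before mixing gives the intended characters.
my $fixed = decode('UTF-8', $bytes) . $chars;
printf "fixed:  length %d, first char U+%04X\n",
    length $fixed, ord substr($fixed, 0, 1);
```

    The first line reports three characters starting with U+00CE, the second two characters starting with U+03BC.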

    Consider this:

    perl -MDevel::Peek -wE 'my $r = qr/\x{03bc}/; Dump $r'
    ...
    FLAGS = (OBJECT,FAKE,UTF8)
    PV = 0x10eff20 "(?^u:\\x{03bc})" [UTF8 "(?^u:\\x{03bc})"]
    Now, what happens when a UTF-8 regex meets a binary string? My guess is that the string gets upgraded to (Perl's internal) UTF-8... FROM (what Perl thinks is) Latin-1, as happens in other situations. Which is the wrong thing to do.
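
    The visible effect can be sketched like this, assuming the input is the UTF-8 encoding of U+03BC that has not been decoded yet. It does not show what happens internally, only that the result is consistent with the bytes being read as Latin-1 characters:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $re    = qr/\x{03bc}/;   # character-level pattern for U+03BC
my $bytes = "\xCE\xBC";     # UTF-8 bytes for U+03BC, still undecoded

# The two bytes are compared as the characters U+00CE and U+00BC,
# so the pattern finds nothing.
my $bytes_match = $bytes =~ $re ? 1 : 0;

# After an explicit decode the string holds the single character U+03BC.
my $chars       = decode('UTF-8', $bytes);
my $chars_match = $chars =~ $re ? 1 : 0;

print "undecoded bytes match: $bytes_match\n";
print "decoded string match:  $chars_match\n";
```

    This prints 0 for the undecoded bytes and 1 for the decoded string, which is the behaviour in the thread title.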
    Ruby’s Unicode support was terrible
    It's still terribad. But at least Ruby defaults to UTF-8 in its source, for example.

    There is a big difference between excellent Unicode support (which Perl has, of course) and convenient Unicode support. You know, something that is not a pain in the ass. For example: what can go wrong with

    open my $file, '<', '/bogus_file' or die "Can't open: $!\n";

    ?