Re: Re: Re: s/.// increases length - bug or badly documented feature

I'm reading from 'perlunicode' here, in my Perl V5.6.0 documentation:

Important Caveat

WARNING: The implementation of Unicode support in Perl is incomplete.

The following areas need further work.

Input and Output Disciplines
There is currently no easy way to mark data read from a file or other external source as being utf8. This will be one of the major areas of focus in the near future.

Regular Expressions
The existing regular expression compiler does not produce polymorphic opcodes. This means that the determination on whether to match Unicode characters is made when the pattern is compiled, based on whether the pattern contains Unicode characters, and not when the matching happens at run time. This needs to be changed to adaptively match Unicode if the string to be matched is Unicode.

use utf8 still needed to enable a few features
The utf8 pragma implements the tables used for Unicode support. These tables are automatically loaded on demand, so the utf8 pragma need not normally be used.
However, as a compatibility measure, this pragma must be explicitly used to enable recognition of UTF-8 encoded literals and identifiers in the source text.
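
For what it's worth, the first caveat above (no easy way to mark external data as utf8) is what PerlIO layers were later introduced for. The following is only a sketch and assumes Perl 5.8 or later; it does not work on 5.6.0. The in-memory filehandle and the sample string are stand-ins for a real file, chosen just to keep the example self-contained.

```perl
use strict;
use warnings;

# Assumption: Perl 5.8+ with PerlIO layers; this is not available in 5.6.0.
# "caf\xC3\xA9" is the word "cafe" with an accented e, as raw UTF-8 bytes.
my $bytes = "caf\xC3\xA9";

# Read the bytes back through an in-memory filehandle with a decoding layer,
# which marks the resulting string as character data.
open(my $fh, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";
my $text = <$fh>;
close($fh);

print length($bytes), "\n";   # 5 (bytes in the raw data)
print length($text),  "\n";   # 4 (characters after decoding)
```

With a real file, the same layer would go in the mode argument of a normal three-argument open.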

"a little extra white space can make your code a lot easier to read"
Heh! I thought that was easy to read. I explicitly made the code snippet 'readable'. I even put in parentheses around the strings to be printed.
Anyway, TMTOWTDI and "style" is just a question of... Well, style. ;-)

Everything will go worng!

Re: Re: Re: Re: s/.// increases length - bug or badly documented feature
by Juerd (Abbot) on Mar 01, 2002 at 19:32 UTC

"There is currently no easy way to mark data read from a file or other external source as being utf8."

So adding broken Unicode support has, in a way, rendered Perl unusable for external string input. Great! Now we have a really great and fast programming language that can handle text very well, but not if the text has Unicode and the utf8 pragma has not been used.

Is the moral of this story: "don't just always use strict, always use utf8 too"?

sub byte_length {
    # depends on bugs
    no utf8;
    my ($string) = @_;
    my $counter = 0;
    $counter++ while $string =~ s/.//s;
    return $counter;
}

sub has_multibytes {
    my ($string) = @_;
    return length($string) != byte_length($string);
}

Alternatives for these subs are welcome, of course.

Lbh ebgngrq guvf grkg naq abj lbh pna ernq vg. Fb jung? :) -- Whreq
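
One possible alternative for these subs, offered as a sketch and not as the canonical answer: the bytes pragma switches length() to byte semantics within its lexical scope, so no destructive s/.//s loop is needed. The sub names are kept from the post above; the sample strings are made up for illustration. Note that this counts bytes in Perl's internal representation, which is an assumption about what "byte length" should mean here.

```perl
use strict;
use warnings;

# Count the bytes in Perl's internal representation of $string.
sub byte_length {
    my ($string) = @_;
    use bytes;    # byte semantics, lexically scoped to the rest of this sub
    return length($string);
}

# True if at least one character is stored in more than one byte.
sub has_multibytes {
    my ($string) = @_;
    return length($string) != byte_length($string);
}

my $smiley = chr(0x263A);    # one character, stored as three UTF-8 bytes
print byte_length($smiley), "\n";                                          # 3
print has_multibytes($smiley)       ? "multibyte\n" : "single-byte only\n";  # multibyte
print has_multibytes("plain ASCII") ? "multibyte\n" : "single-byte only\n";  # single-byte only
```

Because use bytes is lexically scoped, the plain length() call in has_multibytes still counts characters.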
