Re: utf8, locale and regexp
by Joost (Canon) on Apr 10, 2007 at 12:19 UTC
1. I'm fairly sure it's use encoding 'utf8'; not UTF-8. But that should be equivalent to use utf8; which you already use. I wouldn't be too confident that using both "use encoding" AND "use utf8" will work as expected.
2. Both pragmas only influence the encoding of your code. That means that a) your code file should now be utf-8 encoded or you'll get all kinds of errors. b) your output and input encoding is not affected at all so it's still possible that input and output will be in some other encoding.
3. perl does not understand unicode locales. In fact the documentation specifically warns against using both locales and unicode. You're probably better off specifically setting a binmode(STDOUT,":utf8"); instead of using the locale.
|
Regarding your first point, UTF-8 is the standard, UTF8 is Perl-specific and means "assume it's valid UTF-8". "UTF-8" would therefore be better. (They're case-insensitive.)
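For anyone who wants to check the naming point themselves, here is a minimal sketch using the core Encode module. The variable names are only for illustration. It asks Encode which encoding object each spelling resolves to; the names in the comments are the ones given in the Encode documentation.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Ask Encode which encoding object each spelling resolves to.
my $strict = find_encoding('UTF-8')->name;   # 'utf-8-strict'
my $lower  = find_encoding('utf-8')->name;   # 'utf-8-strict' (case-insensitive)
my $lax    = find_encoding('utf8')->name;    # 'utf8' (Perl's lax variant)

print "UTF-8 -> $strict\n";
print "utf-8 -> $lower\n";
print "utf8  -> $lax\n";
```

So case does not matter, but the hyphen does: with and without it you get two different encodings.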
|
That's what I understood from the doc.
What seems to happen: use encoding 'UTF-8'; is screwing up the regexp engine in the configuration above.
|
I need the use encoding 'utf8' plumbing to make CGI and CGI::FastTemplate work properly with utf8 pages.
I need the locale support since I'm using my locale collation.
But:
#!/usr/bin/perl
use strict;
use utf8;
use encoding 'utf8';
my $test = "État";
my $test2 = "É";
if ($test =~ /$test2/) {print "$test matches $test2\n"} else {print "does not match\n"}
does not work either. use encoding 'utf8'; seems to be enough to screw up the regexp engine.
Re: utf8, locale and regexp
by ruoso (Curate) on Apr 11, 2007 at 10:35 UTC
I usually like to say: The correct way of handling encodings in Perl is not to care about them. If you're caring too much, you're doing it the wrong way...
The only two things you need to do to work properly with whatever encoding in Perl are:
- Tell Perl the encoding of your Inputs and of your Outputs.
- Tell Perl the encoding of your source file.
Matching accented characters in regexps has nothing to do with encoding at all, just with locale. So if your locale is set correctly, the match will work in whatever encoding.
This way, the code you sent would be like the following (I included some more CGI code to exemplify your case).
use strict;
use warnings;
use CGI;
# this tells Perl that my source file is UTF-8
use utf8;
# the latin accented characters are valid
# for this locale, for instance.
BEGIN { $ENV{LC_CTYPE} = 'pt_BR' }
# tell Perl I want it to consider that
use locale;
# The good thing about CGI is that it already
# honors the input encoding, so you don't need
# to care.
my $cgi = CGI->new();
my $string = q( éáaíóúÁAÉÍÓÚ );
# this match works because of the use locale,
# not because of encodings...
$string =~ s/Á/b/g;
# now two important things:
# the first is to tell Perl that your STDOUT
# is utf8 (this may not be the default depending
# on the operating system, the environment and a
# lot of other stuff). So it's better to do it
# explicitly.
binmode STDOUT, ':utf8';
# The second is to properly say that to the browser
# (this is actually HTTP specific, not exactly Perl
# related, but, as you said you're working with CGI
# I decided to cite here).
print $cgi->header(-type => 'text/plain', -charset => 'utf-8');
# then the string will be printed correctly
print $string;
Hope this helps...
Update: I missed "-type => " in the first version...
|
The correct way of handling encodings in Perl is not to care about them.
If you're caring too much, you're doing it the wrong way...
I wish I could agree with this statement... but I'm afraid I can't.
During the last few months at work, I've been involved in a number
of Perl projects in Japanese and Chinese environments, where correct
handling of encodings is of paramount importance (in particular on
Windows, with its unholy mixture of encodings, like UCS-2, UTF-8 and
various legacy codepages.) During that time, I've run into several
encoding issues, where you just have to "care too much" (to use
your words), or else things simply won't work.
For one, Perl doesn't (yet) provide any convenient abstraction
layer for handling file names (as opposed to file contents),
which means you have to take care of everything yourself manually (by
writing wrapper functions, using Encode::(en|de)code explicitly,
etc.). In case you're interested in the details, look here
for the kind of things I'm having in mind.
This isn't the only problem, though. There are a few "borderline"
bugs, like the one I posted recently, in the hope of
getting some feedback on whether other people would also consider this a
bug. (Didn't work out, btw. Not a single reply -- which makes me
conclude that, with respect to unicode issues, there's not exactly an
overwhelming amount of interest in the Perl community. Kind of a pity,
but such is life.). Anyway, what I mean to say is that, having to
figure out that you need to specify :raw:encoding(ucs-2le):crlf:utf8 to
read/write ordinary UCS-2 files (as frequently encountered on Windows
platforms) is just a bit "having to care too much" for my taste...
Not to forget the bug revealed in this thread, and
other oddities related to subtle differences between use utf8
and use encoding 'utf8', for example.
Of course, whether something is a bug is always somewhat subjective, as
it largely depends on your expectations of how things should
work, but I think we're not doing ourselves a favor to pretend that
everything encoding-related in Perl is working without hassles...
Sorry for the rant, and don't get me wrong. I'm a big fan of Perl, and
I would surely advocate Perl wherever appropriate. However, in one of
the projects mentioned above, I've had a rather hard time convincing my
clients to stick with Perl, and not switch to some other language
altogether. This involved investing quite a few unpaid hours on my side
(spent on debugging and working around various peculiarities) to keep
the price competitive.
Hope you can forgive the somewhat emotional tone of this post. In any
case it's not meant to attack you personally, ruoso. Just needed to
vent a little... and I'm feeling better now :)
|
Perl's unicode support is far from complete, especially when you consider the outside-the-base-distro modules that everyone relies on. I've been running into bugs in DBD::mysql myself. I even supplied a couple of patches. Right now I would say that perl's unicode support is better than that of most programming languages, if you only look at the base language.
I believe perl's internal distinction between the 8-bit (latin-1) encoding and the internal multibyte (utf8) representation is the right choice for a language that has to keep strings == bytearray backward compatibility. It also keeps C <-> perl translation relatively straightforward.
Also, I must say I've not run into any unicode bugs in perl itself in the 7 months since I started working on a fairly large multi-language system.
But like I said, it's not quite like that when you consider modules. Most modules on CPAN aren't under the kind of scrutiny that the base perl distro is under. Right now I'm examining a DBD::mysql bug that seems to not affect the system I'm working on, but I can't figure out why it doesn't. <--- that means; no one's going to pay me for fixing it, probably. :-)
|
Actually, your post was very informative. Thank you. And yes, I do believe the points you made are about bugs (especially the crlf issue).
As to filenames, I think this is a wishlist item for File::Spec. As far as I understand, File::Spec should also be able to deal with the encoding used by the operating system (or at least be able to receive the information about which encoding to use).
Re: utf8, locale and regexp
by ikegami (Patriarch) on Apr 10, 2007 at 13:42 UTC
Your directives tell perl that your source code is UTF-8, but the snippet you presented to us is not valid UTF-8.
|
The 'É' characters in my script are utf8 encoded.
|
There's indeed a problem.
#!/usr/bin/perl
use strict;
use warnings;
use encoding 'UTF-8';
#use encoding 'utf8';
#use utf8;
use Devel::Peek qw( Dump );
my $word = "État"; # UTF-8 encoding of "État"
my $char = "É"; # UTF-8 encoding of "É"
Dump($word);
print("String Length: ", length($word), "\n");
print("\n");
Dump($char);
print("String Length: ", length($char), "\n");
print("\n");
if ($word =~ /$char/) {
print "Matches\n";
} else {
print "Does not match\n";
}
if ($word =~ /\Q$char/) {
print "Matches\n";
} else {
print "Does not match\n";
}
if (substr($word, 0, 1) eq $char) {
print "Equal\n";
} else {
print "Not equal\n";
}
SV = PV(0x22608c) at 0x225f9c
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8) <-- That's good
PV = 0x1822634 "\303\211tat"\0 [UTF8 "\x{c9}tat"] <-- That's good
CUR = 5
LEN = 8
String Length: 4 <-- That's good
SV = PV(0x2260a4) at 0x225f3c
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8) <-- That's good
PV = 0x22f43c "\303\211"\0 [UTF8 "\x{c9}"] <-- That's good
CUR = 2
LEN = 4
String Length: 1 <-- That's good
Does not match <-- WTF?
Does not match <-- WTF?
Equal <-- That's good
Replacing use encoding 'UTF-8'; with use encoding 'utf8'; yields the same results.
Replacing use encoding 'UTF-8'; with use utf8; produces the same dumps, but the matches succeed.
My suggestion:
- Use use utf8; to treat the source as UTF-8.
- Use binmode(STDOUT, ":utf8"); to output UTF-8.
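Put together, the two suggestions above could look like the following minimal sketch. It assumes the file is really saved as UTF-8, and $matched is just an illustrative variable name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                     # the source file itself is saved as UTF-8

binmode(STDOUT, ':utf8');     # encode everything printed to STDOUT as UTF-8

my $word = "État";
my $char = "É";

# With "use utf8" both literals are character strings, so the
# interpolated pattern is compared character by character.
my $matched = ($word =~ /\Q$char\E/) ? 1 : 0;

print "String Length: ", length($word), "\n";
print $matched ? "Matches\n" : "Does not match\n";
```

No use encoding and no use locale are involved here; the match only depends on both strings being decoded character strings.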
Re: utf8, locale and regexp
by syphilis (Archbishop) on Apr 10, 2007 at 14:45 UTC
Hmmm ... on Windows I can run:
use strict;
use warnings;
my $test = "État";
my $test2 = "É";
if ($test =~ /$test2/) {print "matches\n"} else {print "does not match\n"}
which results in the output of matches.
Not sure if that helps ... perhaps not.
Cheers, Rob
|
Nope: your script must be utf8. Namely the char 'É' should be utf8 encoded.
Re: utf8, locale and regexp
by Juerd (Abbot) on Jun 13, 2007 at 19:52 UTC
"use encoding" is broken in several ways, and there will not be a fix soon. Stop using it if you can. And I'm quite sure that you can.
If your source code is UTF-8 encoded, tell Perl by adding "use utf8;". If your input and output must be UTF-8 encoded, tell Perl by adding "binmode STDIN, ':encoding(UTF-8)'; binmode STDOUT, ':encoding(UTF-8)';".
Just don't "use encoding", please.
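As a rough illustration of the setup described above, here is a minimal sketch. The in-memory filehandle and the $raw bytes are assumptions that stand in for real input so the example is self-contained; in a real script you would simply read from STDIN.

```perl
use strict;
use warnings;
use utf8;                               # source code is UTF-8

binmode STDIN,  ':encoding(UTF-8)';     # decode whatever arrives on STDIN
binmode STDOUT, ':encoding(UTF-8)';     # encode whatever is printed

# Stand-in for real input: the raw UTF-8 bytes of "État\n",
# read through the same decoding layer that STDIN now has.
my $raw = "\xC3\x89tat\n";
open my $in, '<:encoding(UTF-8)', \$raw or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

# $line is now a 4-character string, so the literal É matches it.
my $matched = ($line =~ /^É/) ? 1 : 0;
print "$line: ", ($matched ? "matches" : "does not match"), "\n";
```

The decoding happens at the I/O boundary and the regexp only ever sees characters, which is why use encoding is not needed.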