utf8, locale and regexp

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: utf8, locale and regexp by Joost (Canon) on Apr 10, 2007 at 12:19 UTC
1. I'm fairly sure it's `use encoding 'utf8';`, not UTF-8. But that should be equivalent to `use utf8;`, which you already use. I wouldn't be too confident that combining `use encoding` and `use utf8` will work as expected.
2. Both pragmas only influence the encoding of your code. That means that:
   a) your code file must now be UTF-8 encoded, or you'll get all kinds of errors;
   b) your input and output encodings are not affected at all, so input and output may still be in some other encoding.
3. perl does not understand Unicode locales. In fact, the documentation specifically warns against using both locales and Unicode. You're probably better off explicitly setting `binmode(STDOUT, ":utf8");` instead of relying on the locale.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^2: utf8, locale and regexp by ikegami (Patriarch) on Apr 10, 2007 at 13:44 UTC
Regarding your first point: `UTF-8` is the standard; `utf8` is Perl-specific and means "assume it's valid UTF-8". `UTF-8` would therefore be better. (Both names are case-insensitive.)
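Editor's sketch of the naming distinction described above, using the core Encode module. It only looks up encoding names and prints the canonical name each one resolves to; the loop and variable names are illustrative and not from the thread.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Names are looked up case-insensitively, but the hyphen matters:
# "UTF-8" resolves to the strict encoding, "utf8" to Perl's lax variant.
for my $name ('UTF-8', 'utf-8', 'utf8', 'UTF8') {
    printf "%-6s -> %s\n", $name, find_encoding($name)->name;
}
# UTF-8  -> utf-8-strict
# utf-8  -> utf-8-strict
# utf8   -> utf8
# UTF8   -> utf8
```

The strict variant is the one that follows the standard; the lax one is more permissive about what it accepts as UTF-8.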
Re^3: utf8, locale and regexp by Anonymous Monk on Apr 10, 2007 at 15:14 UTC
That's what I understood from the doc. What seems to happen: `use encoding 'UTF-8';` is screwing up the regexp engine in the configuration above.
Re^2: utf8, locale and regexp by Anonymous Monk on Apr 10, 2007 at 15:08 UTC
I need the `use encoding 'utf8'` plumbing to make CGI and CGI::FastTemplate work properly with UTF-8 pages. I need the locale support because I'm using my locale's collation. But:

    #!/usr/bin/perl
    use strict;
    use utf8;
    use encoding 'utf8';

    my $test  = "État";
    my $test2 = "É";
    if ($test =~ /$test2/) { print "$test matches $test2\n" }
    else                   { print "does not match\n" }

does not work either. `use encoding 'utf8';` seems to be enough to screw up the regexp engine.
Re^3: utf8, locale and regexp by Joost (Canon) on Apr 10, 2007 at 15:51 UTC
> I need the use encoding 'utf8' plumbing to make CGI and CGI::FastTemplate work properly with utf8 pages.

As I pointed out above, there should be no need for either `use utf8` or `use encoding "utf8"` if your code and string literals are not UTF-8 encoded, provided all the code opens the template files with the right settings. I'm going to blame CGI::FastTemplate, since I highly doubt it opens Unicode templates using the UTF-8 layer. If it did, there would be no need to use any of the encoding modules.

> `use utf8; use encoding 'utf8';`

As I said above, there should be no need for `use encoding "utf8"` when you already have `use utf8;`. In fact, they seem to break each other. As far as I know, `use utf8` is the more standard of the two. If you want to have UTF-8-encoded scripts, you should probably use (only) `use utf8;`.
Re: utf8, locale and regexp by ruoso (Curate) on Apr 11, 2007 at 10:35 UTC
I usually like to say: the correct way of handling encodings in Perl is not to care about them. If you're caring too much, you're doing it the wrong way...

The only two things you need to do to work properly with whatever encoding in Perl are:

1. Tell Perl the encoding of your inputs and of your outputs.
2. Tell Perl the encoding of your source file.

Matching accented characters in regexps has nothing to do with encoding at all, just with locale. So if your locale is set correctly, the match will work in whatever encoding.

This way, the code you sent would look like the following (I included some more CGI code to illustrate your case).

    use strict;
    use warnings;
    use CGI;

    # this tells Perl my source file is UTF-8
    use utf8;

    # the Latin accented characters are valid
    # for this locale, for instance.
    BEGIN { $ENV{LC_CTYPE} = 'pt_BR' }

    # tell Perl I want it to take that into account
    use locale;

    # The good thing about CGI is that it already
    # honors the input encoding, so you don't need
    # to care.
    my $cgi = CGI->new();

    my $string = q( éáaíóúÁAÉÍÓÚ );

    # this match works because of the use locale,
    # not because of encodings...
    $string =~ s/Á/b/g;

    # now two important things:
    # the first is to tell Perl that your STDOUT
    # is utf8 (this may not be the default depending
    # on the operating system, the environment and a
    # lot of other stuff). So it's better to do it
    # explicitly.
    binmode STDOUT, ':utf8';

    # The second is to say so properly to the browser
    # (this is actually HTTP specific, not exactly Perl
    # related, but, as you said you're working with CGI,
    # I decided to mention it here).
    print $cgi->header(-type => 'text/plain', -charset => 'utf-8');

    # then the string will be printed correctly
    print $string;

Hope this helps...

Update: I missed "-type => " in the first version...

daniel
Re^2: utf8, locale and regexp by almut (Canon) on Apr 11, 2007 at 17:39 UTC
> The correct way of handling encodings in Perl is not caring about. If you're caring too much, you're doing the wrong way...

I wish I could agree with this statement... but I'm afraid I can't.

During the last few months at work, I've been involved in a number of Perl projects in Japanese and Chinese environments, where correct handling of encodings is of paramount importance (in particular on Windows, with its unholy mixture of encodings, like UCS-2, UTF-8 and various legacy codepages). During that time, I've run into several encoding issues where you just have to "care too much" (to use your words), or else things simply won't work.

For one, Perl doesn't (yet) provide any convenient abstraction layer for handling file names (as opposed to file contents), which means you have to take care of everything yourself manually (by writing wrapper functions, using `Encode::(en|de)code` explicitly, etc.). In case you're interested in the details, look here for the kind of things I have in mind.

This isn't the only problem, though. There are a few "borderline" bugs, like the one I posted recently, in the hope of getting some feedback on whether other people would also consider this a bug. (Didn't work out, btw. Not a single reply -- which makes me conclude that, with respect to Unicode issues, there's not exactly an overwhelming amount of interest in the Perl community. Kind of a pity, but such is life.) Anyway, what I mean to say is that having to figure out that you need to specify `:raw:encoding(ucs-2le):crlf:utf8` to read/write ordinary UCS-2 files (as frequently encountered on Windows platforms) is just a bit too much "having to care" for my taste...

Not to forget the bug revealed in this thread, and other oddities related to subtle differences between `use utf8` and `use encoding 'utf8'`, for example.

Of course, whether something is a bug is always somewhat subjective, as it largely depends on your expectations of how things should work, but I think we're not doing ourselves a favor by pretending that everything encoding-related in Perl works without hassles...

Sorry for the rant, and don't get me wrong. I'm a big fan of Perl, and I would surely advocate Perl wherever appropriate. However, in one of the projects mentioned above, I've had a rather hard time convincing my clients to stick with Perl and not switch to some other language altogether. This involved investing quite a few unpaid hours on my side (spent on debugging and working around various peculiarities) to keep the price competitive.

Hope you can forgive the somewhat emotional tone of this post. In any case, it's not meant to attack you personally, ruoso. Just needed to vent a little... and I'm feeling better now :)
Re^3: utf8, locale and regexp by Joost (Canon) on Apr 14, 2007 at 01:39 UTC
Perl's Unicode support is far from complete, especially when you consider the outside-the-base-distro modules that everyone relies on. I've been running into bugs in DBD::mysql myself. I even supplied a couple of patches.

Right now I would say that perl's Unicode support is better than that of most programming languages, if you only look at the base language. I believe perl's internal distinction between the 8-bit (latin-1) encoding and the internal, multibyte (utf8) representation is the right choice for a language that has to keep strings == bytearray backward compatibility. It also keeps C <-> perl translation relatively straightforward.

Also, I must say I've not run into any Unicode bugs in perl since 7 months ago, when I started working on a fairly large multi-language system.

But like I said, it's not quite like that when you consider modules. Most modules on CPAN aren't under the kind of scrutiny that the base perl distro is under. Right now I'm examining a DBD::mysql bug that seems not to affect the system I'm working on, but I can't figure out why it doesn't. <--- that means no one's going to pay me for fixing it, probably. :-)
Re^3: utf8, locale and regexp by ruoso (Curate) on Apr 12, 2007 at 10:53 UTC
Actually, your post was very informative. Thank you. And yes, I do believe the points you made are about bugs (especially the crlf issue). As for filenames, I think this is a wishlist bug for File::Spec. As far as I understand, File::Spec should also be able to deal with the encoding used by the operating system (or at least be able to receive the information about which encoding to use).

daniel
Re: utf8, locale and regexp by ikegami (Patriarch) on Apr 10, 2007 at 13:42 UTC
Your directives tell `perl` that your source code is UTF-8, but the snippet you presented to us is not valid UTF-8.
Re^2: utf8, locale and regexp by Anonymous Monk on Apr 10, 2007 at 15:12 UTC
The 'É' characters in my script are UTF-8 encoded.
Re^3: utf8, locale and regexp by ikegami (Patriarch) on Apr 10, 2007 at 15:32 UTC
There's indeed a problem.

    #!/usr/bin/perl
    use strict;
    use warnings;

    use encoding 'UTF-8';
    #use encoding 'utf8';
    #use utf8;

    use Devel::Peek qw( Dump );

    my $word = "État";  # stored as UTF-8 in the source file
    my $char = "É";     # stored as UTF-8 in the source file

    Dump($word);
    print("String Length: ", length($word), "\n");
    print("\n");

    Dump($char);
    print("String Length: ", length($char), "\n");
    print("\n");

    if ($word =~ /$char/) {
        print "Matches\n";
    } else {
        print "Does not match\n";
    }

    if ($word =~ /\Q$char/) {
        print "Matches\n";
    } else {
        print "Does not match\n";
    }

    if (substr($word, 0, 1) eq $char) {
        print "Equal\n";
    } else {
        print "Not equal\n";
    }

Output:

    SV = PV(0x22608c) at 0x225f9c
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)              <-- That's good
      PV = 0x1822634 "\303\211tat"\0 [UTF8 "\x{c9}tat"]  <-- That's good
      CUR = 5
      LEN = 8
    String Length: 4                                     <-- That's good

    SV = PV(0x2260a4) at 0x225f3c
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)              <-- That's good
      PV = 0x22f43c "\303\211"\0 [UTF8 "\x{c9}"]         <-- That's good
      CUR = 2
      LEN = 4
    String Length: 1                                     <-- That's good

    Does not match                                       <-- WTF?
    Does not match                                       <-- WTF?
    Equal                                                <-- That's good

Replacing `use encoding 'UTF-8';` with `use encoding 'utf8';` yields the same results.

Replacing `use encoding 'UTF-8';` with `use utf8;` produces the same dumps, but the matches succeed.

My suggestion:

1. Use `use utf8;` to treat the source as UTF-8.
2. Use `binmode(STDOUT, ":utf8");` to output UTF-8.
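Editor's sketch of the suggested configuration, assuming this file is itself saved as UTF-8. It is the same test reduced to `use utf8;` plus `binmode`, with no `use encoding` line; variable names follow the post above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                     # the string literals below are UTF-8 in the source

binmode(STDOUT, ":utf8");     # emit UTF-8 on output, as suggested above

my $word = "État";
my $char = "É";

# With "use utf8" the literals are decoded to characters, so lengths
# are counted in characters rather than bytes.
printf "length(word) = %d, length(char) = %d\n", length($word), length($char);

print(($word =~ /\Q$char\E/)        ? "Matches\n" : "Does not match\n");
print((substr($word, 0, 1) eq $char) ? "Equal\n"   : "Not equal\n");
```

This matches the behaviour reported in the post: with `use utf8;` alone the strings are 4 and 1 characters long and the regexp match succeeds.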
Re^4: utf8, locale and regexp by Krambambuli (Curate) on Apr 11, 2007 at 08:41 UTC
Re^4: utf8, locale and regexp by Anonymous Monk on Apr 10, 2007 at 16:47 UTC
Re^5: utf8, locale and regexp by Joost (Canon) on Apr 10, 2007 at 19:48 UTC
Re^5: utf8, locale and regexp by ikegami (Patriarch) on Apr 10, 2007 at 17:08 UTC
Re: utf8, locale and regexp by syphilis (Archbishop) on Apr 10, 2007 at 14:45 UTC
Hmmm ... on Windows I can run:

    use strict;
    use warnings;

    my $test  = "État";
    my $test2 = "É";
    if ($test =~ /$test2/) { print "matches\n" }
    else                   { print "does not match\n" }

which results in the output of `matches`. Not sure if that helps ... perhaps not.

Cheers, Rob
Re^2: utf8, locale and regexp by Anonymous Monk on Apr 10, 2007 at 15:11 UTC
Nope: your script must be UTF-8. Namely, the character 'É' must be UTF-8 encoded.
Re: utf8, locale and regexp by Juerd (Abbot) on Jun 13, 2007 at 19:52 UTC
`use encoding` is broken in several ways, and there will not be a fix soon. Stop using it if you can. And I'm quite sure that you can.

If your source code is UTF-8 encoded, tell Perl by adding `use utf8;`.

If your input and output must be UTF-8 encoded, tell Perl by adding `binmode STDIN, ':encoding(UTF-8)'; binmode STDOUT, ':encoding(UTF-8)';`.

Just don't `use encoding`, please.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
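Editor's sketch of what the `:encoding(UTF-8)` layer does, under one assumption not in the thread: an in-memory filehandle stands in for STDOUT and STDIN so the bytes can be inspected. Characters are encoded to UTF-8 bytes on the way out and decoded back to characters on the way in.

```perl
use strict;
use warnings;

my $text  = "\x{C9}tat";   # the character string "État" (4 characters)
my $bytes = '';            # in-memory "file" standing in for STDOUT/STDIN

# Writing through the layer encodes characters to UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', \$bytes) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";
printf "bytes written:   %d\n", length($bytes);   # 5 (C3 89 74 61 74)

# Reading through the same layer decodes the bytes back to characters.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open for read: $!";
my $back = do { local $/; <$in> };
close($in);
printf "characters read: %d\n", length($back);    # 4

print(($back eq $text) ? "round trip ok\n" : "round trip failed\n");
```

The same layer string can be passed to `binmode` on STDIN and STDOUT, as in the post above; the in-memory handle is only there to make the byte count visible.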