in reply to utf8, locale and regexp

1. I'm fairly sure it's use encoding 'utf8'; and not 'UTF-8'. Either way, as far as your source code goes it should do much the same as use utf8;, which you already have. I wouldn't count on "use encoding" AND "use utf8" working as expected together.

2. use utf8 only tells perl how your code file is encoded. That means that a) your code file must actually be saved as utf-8 or you'll get all kinds of errors, and b) your input and output encoding is not affected at all, so input and output can still be in some other encoding. use encoding additionally sets STDIN and STDOUT to the given encoding, but it leaves every other filehandle alone.

3. perl does not understand unicode locales. In fact, the documentation specifically warns against mixing locales and unicode. You're probably better off explicitly calling binmode(STDOUT,":utf8"); instead of relying on the locale.
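To illustrate points 2 and 3, here is a minimal sketch of keeping the source encoding and the output encoding as two separate, explicit settings. It assumes the script file itself is saved as UTF-8; the in-memory buffer and the variable names are only there for illustration and are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # assumption: this file is saved as UTF-8

# Because of `use utf8`, this literal is a 4-character string.
my $word = "État";

# Write it through an explicit UTF-8 layer into an in-memory buffer,
# so we can look at the bytes that would actually leave the program.
my $bytes = '';
open(my $out, '>:encoding(UTF-8)', \$bytes) or die "open failed: $!";
print {$out} $word;
close($out);

# Output encoding is set separately from the source encoding.
binmode(STDOUT, ':encoding(UTF-8)');
printf "%d characters, %d bytes\n", length($word), length($bytes);
# prints "4 characters, 5 bytes"
```

The "É" is one character inside perl but two bytes once encoded, which is why the source pragma alone says nothing about what ends up on STDOUT.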

Re^2: utf8, locale and regexp
by ikegami (Patriarch) on Apr 10, 2007 at 13:44 UTC
    Regarding your first point, UTF-8 is the standard, UTF8 is Perl-specific and means "assume it's valid UTF-8". "UTF-8" would therefore be better. (They're case-insensitive.)
      That's what I understood from the doc. What seems to happen: use encoding 'UTF-8'; is screwing up the regexp engine in the configuration above.
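The difference between the two names can be checked with the Encode module, which ships with perl. This is only a small sketch; the variable names are mine, and the canonical names in the comments are the ones reported by reasonably recent Encode versions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up the canonical name behind each spelling.
my $strict = find_encoding('UTF-8')->name;   # the strict, standard encoding
my $lax    = find_encoding('utf8')->name;    # perl's own, more lenient variant

print "UTF-8 resolves to $strict\n";   # utf-8-strict
print "utf8  resolves to $lax\n";      # utf8
```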
Re^2: utf8, locale and regexp
by Anonymous Monk on Apr 10, 2007 at 15:08 UTC
    I need the use encoding 'utf8' plumbing to make CGI and CGI::FastTemplate work properly with utf8 pages. I need the locale support since I'm using my locale's collation. But:
    #!/usr/bin/perl
    use strict;
    use utf8;
    use encoding 'utf8';
    my $test = "État";
    my $test2 = "É";
    if ($test =~ /$test2/) {print "$test matches $test2\n"}
    else {print "does not matches\n"}
    does not work either. use encoding 'utf8'; seems enough to screw up the regexp engine.
      I need the use encoding 'utf8' plumbing to make CGI and CGI::FastTemplate work properly with utf8 pages.
      As I pointed out above, there should be no need for either use utf8 or use encoding "utf8" if your code and string literals are not utf-8 encoded, provided that all the code opens the template files with the right settings.

      I'm going to blame CGI::FastTemplate, since I highly doubt it opens unicode templates using the utf-8 layer. If it did there would be no need to use any of the encoding modules.
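For comparison, this is roughly what opening a template "using the utf-8 layer" looks like. It is a sketch only: the temporary file stands in for a real template and is created here just so the example runs, and nothing in it reflects how CGI::FastTemplate actually reads its files.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical template file containing raw UTF-8 bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "Bonjour \xC3\x89tat\n";    # \xC3\x89 is "É" encoded as UTF-8
close($tmp);

# Read it back through a decoding layer: the two bytes become one character.
open(my $in, '<:encoding(UTF-8)', $path) or die "cannot open $path: $!";
my $line = <$in>;
close($in);
chomp $line;

binmode(STDOUT, ':encoding(UTF-8)');
print "read ", length($line), " characters: $line\n";
```

Code that reads its templates this way hands perl proper character strings, which is why no source-level encoding pragma would be needed for them.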

      use utf8; use encoding 'utf8';
      As I said above, there should be no need for use encoding "utf8"; when you already have use utf8;. In fact, the two seem to break each other. As far as I know, "use utf8" is the more standard of the two. If you want utf8-encoded scripts, you should probably use (only) use utf8;.
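A minimal sketch of the test script with only use utf8, assuming the file is saved as UTF-8. The \Q...\E quoting, the explicit binmode and the $matched variable are my additions and were not in the posted code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # assumption: this file is saved as UTF-8

# Encode output explicitly instead of relying on `use encoding`.
binmode(STDOUT, ':encoding(UTF-8)');

my $test  = "État";
my $test2 = "É";

# \Q...\E keeps any regexp metacharacters in $test2 literal.
my $matched = ($test =~ /\Q$test2\E/) ? 1 : 0;

if ($matched) { print "$test matches $test2\n" }
else          { print "$test does not match $test2\n" }
```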