cosmicperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
  I'm updating one of my scripts so that it supports language packs: all the output text is being swapped for variables stored in a separate file that's pulled in with require. Translating that one language file should then be all that's needed to produce output in a new language.

I'm happy with that, but I haven't worked with Unicode before and I'm worried about languages with non-Latin scripts, such as Chinese and Japanese. Will it be as simple as having use utf8; at the top of the language file? Will this mess up all my regexps or sprintfs? What about handling these languages as input from HTML forms?

Any advice welcome, thanks.

Lyle

Re: Coding for multiple Languages
by ikegami (Patriarch) on Jan 03, 2008 at 14:15 UTC

    use utf8 affects string literals in the source (and STDIN and STDOUT?). Strings stored in a data file won't be affected by it.

    Simply specify the encoding of the data file when opening it.

    open(my $fh, '<:encoding(UTF-8)', $language_qfn)
        or die "Can't open $language_qfn: $!";
    while (<$fh>) { ... }  # $_ holds decoded characters here

    If you have a handle and not a file name, use binmode.

    binmode(STDIN, ':encoding(UTF-8)');
    while (<STDIN>) { ... }

    Or decode the string explicitly.

    use Encode qw( decode );
    open(my $fh, '<', $language_qfn)
        or die "Can't open $language_qfn: $!";
    while (<$fh>) {
        $_ = decode('UTF-8', $_);  # bytes in, characters out
        ...
    }

    Output is very similar. Set the encoding on the output handle or use encode.

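    # Alternative 1: set the encoding when opening the output file.
    open(my $fh, '>:encoding(UTF-8)', $output_fqn) or die "Can't open $output_fqn: $!";
    print $fh ($text);

    # Alternative 2: set the encoding on an existing handle.
    binmode(STDOUT, ':encoding(UTF-8)');
    print($text);

    # Alternative 3: encode explicitly and write the resulting bytes.
    use Encode qw( encode );
    open(my $fh, '>', $output_fqn) or die "Can't open $output_fqn: $!";
    print $fh (encode('UTF-8', $text));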
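
On the regexp and sprintf worry from the original question, here is a minimal sketch of the decode-on-input, encode-on-output pattern described above. The hard-coded byte string is an assumption standing in for data read from a file or an HTML form field; once it has been decoded, length, regexps and sprintf all work on characters rather than bytes.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Raw UTF-8 bytes for a three-character Japanese string (assumed sample
# input, standing in for what a file or form field would deliver).
my $bytes = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

# Decode once at the input boundary: bytes become characters.
my $text = decode('UTF-8', $bytes);

# String operations now see characters, not bytes.
my $char_count = length($text);             # counts characters
my ($first)    = $text =~ /^(.)/;           # . matches one whole character
my $padded     = sprintf('%-5s|', $text);   # pads to five characters

# Encode once at the output boundary: characters become bytes again.
my $out = encode('UTF-8', $padded);
print $out, "\n";
```

Without the decode step, the same regexp would capture only the first byte of the first character, which is the usual source of mangled output.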

    Update: Added output info.

      (and STDIN and STDOUT?)
      It doesn't. For that you need use open qw/:utf8 :std/;
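
A small sketch of the open pragma mentioned in this reply, assuming the script wants UTF-8 on all three standard handles. PerlIO::get_layers is used only to show that a utf8 layer ends up on STDOUT.

```perl
use strict;
use warnings;
use open qw/:utf8 :std/;   # UTF-8 layers on STDIN, STDOUT and STDERR

# Inspect the layers on STDOUT to confirm the pragma took effect.
my @layers   = PerlIO::get_layers(*STDOUT);
my $has_utf8 = grep { $_ eq 'utf8' } @layers;

print "STDOUT layers: @layers\n";

# Printing characters above U+00FF no longer needs an explicit encode().
print "\x{65E5}\x{672C}\x{8A9E}\n";
```
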
      The separate file is a Perl file such as:
      $stringhello = 'Hello'; $stringlogin = 'Login';...
      that's brought in with require 'lang.pl'; I was referring to adding use utf8; to the top of that file. Wouldn't that do the same?

        Yes. It can even be in other encodings using use encoding '...';.

        You still need to handle the output, though. The strings of characters need to be encoded into strings of bytes using one of the methods I posted.
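
To illustrate both halves of this answer, here is a sketch in which the variable assignments stand in for the contents of a translated lang.pl (the Japanese strings are illustrative, not taken from the thread). Because use utf8 is lexically scoped, it has to appear in whichever file contains the literals, which in the real setup means lang.pl itself.

```perl
use strict;
use warnings;
use utf8;                    # the literals in this file are UTF-8 source text
use Encode qw( encode );

# Stand-in for a translated lang.pl (illustrative translations).
our $stringhello = 'こんにちは';
our $stringlogin = 'ログイン';

# With use utf8 in effect, the literals are character strings,
# so sprintf and length behave as expected.
my $line = sprintf('%s / %s', $stringhello, $stringlogin);

# The output still has to be encoded into bytes before printing.
my $bytes = encode('UTF-8', $line);
print $bytes, "\n";
```
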

Re: Coding for multiple Languages
by ikegami (Patriarch) on Jan 03, 2008 at 14:31 UTC
    On the other hand, there's likely a module on CPAN that will handle this. What you are doing is called "internationalization", or "i18n" for short.
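
The reply does not name a module; one candidate is Locale::Maketext, which ships with Perl. The sketch below is only an assumption about how it might be applied here: the MyApp::L10N package names and the French strings are hypothetical, and a real project would normally keep each language class in its own file.

```perl
use strict;
use warnings;

# Base class for the project's language handles (hypothetical name).
package MyApp::L10N;
use parent 'Locale::Maketext';

# English lexicon.
package MyApp::L10N::en;
use parent -norequire, 'MyApp::L10N';
our %Lexicon = (
    'Hello'         => 'Hello',
    'Login'         => 'Login',
    'Welcome, [_1]' => 'Welcome, [_1]',
);

# French lexicon; [_1] is replaced by the first extra argument.
package MyApp::L10N::fr;
use parent -norequire, 'MyApp::L10N';
our %Lexicon = (
    'Hello'         => 'Bonjour',
    'Login'         => 'Connexion',
    'Welcome, [_1]' => 'Bienvenue, [_1]',
);

package main;

# Ask for a French handle; get_handle returns undef if none can be found.
my $lh = MyApp::L10N->get_handle('fr')
    or die "No language handle available";

my $hello   = $lh->maketext('Hello');
my $welcome = $lh->maketext('Welcome, [_1]', 'Lyle');
print "$hello\n$welcome\n";
```

Compared with a required file of plain variables, this keeps placeholders inside the translated string, so translators can reorder them as the target language requires.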