manni has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks!

At work, we are finally moving to a new (to us) platform. We are going to leave Perl 5.8.5 behind us and move our home-grown webapp framework to 5.10.1. That also means replacing mod_perl 1 on Apache 1.3 with mod_perl 2 on Apache 2.2. Our initial tests are looking good, but it seems that every UTF-8 string is broken in pages rendered in the new environment.

Some debugging tells me that my raw strings (which come from a variety of sources) don't have the utf8 flag set. When I run them through Encode::decode_utf8 or utf8::decode, I get broken UTF-8. When I do a utf8::upgrade, they come out just fine. Needless to say, our code base has lots and lots of calls to Encode::decode_utf8.

I've been reading through perldelta and everything referenced there, but I couldn't find anything that would really fit.

Of course, all the CPAN modules we use also get updated to their latest versions, but installing some candidates (HTML::Template, e.g.) on the old test system did not replicate the behavior we are seeing on the new one.

Can anyone help and direct me towards that silver bullet?

Update: solved, see below.

Re: UTF-8 trouble moving from perl 5.8.5 to 5.10.1
by zentara (Cardinal) on Sep 09, 2011 at 20:03 UTC
    Maybe the thread "Help needed understanding unicode in perl" would help with your Perl 5.10 code. It's too bad you can't move up to at least Perl 5.12 for better Unicode handling. See "OSCON Perl Unicode Slides" and check out the slideshow for Unicode::Tussle. Although they require Perl 5.12, the scripts may yield some clues on how to handle things.
    # yikes!!!
    use v5.12;   # minimal for unicode_strings feature
    use v5.14;   # optimal for unicode_strings feature

    Maybe you could install Perl 5.14 in a home directory and use it to process the files?


    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: UTF-8 trouble moving from perl 5.8.5 to 5.10.1
by Corion (Patriarch) on Sep 10, 2011 at 07:47 UTC

    "Somewhere" along the path in your new environment, at least one component changed the way it handles encoding/decoding data.

    You will need to track down every border between your components and make sure that all data is in the format you expect. Preferably, you transfer all data encoded as UTF-8 between your components, decode to Unicode on input/retrieval, and encode on output/web page.

    The checklist is roughly:

    1. Find out how the data is stored in the database
    2. Find out how the database driver delivers the data. Preferably make it deliver the data encoded as UTF-8 and have the DBD decode it to Unicode.
    3. Find out how the data is stored in text files on disk. Preferably encode them as UTF-8.
    4. Find out how the data is read from the files. Preferably decode it to Unicode.
    5. Find out how the data is converted/concatenated with other data (for example, templates). Either encode to the target character set or decode to Unicode.
    6. Find out how the data is written. Make sure that the encoding used for the data matches the encoding used for the headers and the encoding stated in the HTML page.
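
The checklist above boils down to one pattern: decode at every input border, work on character strings inside, and encode at every output border. Here is a minimal sketch of that pattern, assuming all external data is UTF-8. The helper names `from_outside` and `to_outside` and the sample data are made up for illustration and are not part of the framework discussed in this thread.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode UTF-8 bytes arriving from a database, file, or request.
# FB_CROAK makes malformed input die instead of being silently replaced;
# LEAVE_SRC keeps Encode from modifying the buffer it is given.
sub from_outside {
    my ($bytes) = @_;
    return Encode::decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Hypothetical helper: encode a character string right before it leaves the program.
sub to_outside {
    my ($chars) = @_;
    return Encode::encode('UTF-8', $chars);
}

my $name_bytes = "J\xC3\xBCrgen";            # sample input: UTF-8 bytes for "Jürgen"
my $name       = from_outside($name_bytes);  # now a character string

my $template = "Gr\x{FC}\x{DF}e, %s!";       # sample template, a character string
my $page     = sprintf $template, $name;     # all string work happens on characters
my $out      = to_outside($page);            # bytes again, ready for the wire

binmode STDOUT;                              # raw bytes: $out is already encoded
print $out, "\n";
printf "%d characters, %d bytes\n", length $page, length $out;
```

Keeping the decode and encode calls in two small functions makes it easier to audit each border in turn, which is the point of the checklist.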
Re: UTF-8 trouble moving from perl 5.8.5 to 5.10.1
by moritz (Cardinal) on Sep 11, 2011 at 08:44 UTC
    Some debugging tells me that my raw strings (which come from a variety of sources) don't have the utf8-flag set. When I run them through Encode::decode_utf8 or utf8::decode, I get broken UTF8. When I do a utf8::upgrade, they come out just fine.

    Which means they are stored in Latin-1, not UTF-8. The proper solution is to run them through Encode::decode('ISO-8859-1', $yourstring), or to recode them to UTF-8 in the storage location.
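
A small sketch of the difference between the two decodings, using a single made-up sample character ("ü") rather than data from the original application. It shows that the same character can arrive as different bytes, and that decoding with the wrong encoding name does not fail by default but yields a different string.

```perl
use strict;
use warnings;
use Encode ();

# Sample data (assumed for illustration): "ü" in two different encodings.
my $latin1_bytes = "\xFC";        # ISO-8859-1: one byte
my $utf8_bytes   = "\xC3\xBC";    # UTF-8: two bytes

# Decoding each with its own encoding gives the same character, U+00FC.
my $from_latin1 = Encode::decode('ISO-8859-1', $latin1_bytes);
my $from_utf8   = Encode::decode('UTF-8',      $utf8_bytes);
my $same        = $from_latin1 eq $from_utf8 ? 1 : 0;

# Decoding Latin-1 bytes as UTF-8 does not die by default; the malformed
# byte is replaced, so the result no longer matches the intended character.
my $wrong = Encode::decode('UTF-8', $latin1_bytes);

printf "correct decodings agree: %s\n", $same ? 'yes' : 'no';
printf "Latin-1 bytes decoded as UTF-8 give U+%04X\n", ord $wrong;
```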

      Thank you all for your input.

      After a little more debugging, I now have found the silver bullet and it seems that all that was missing was a single line.

      We use Encode::decode everywhere we should, we're fine in that department. But we never told Perl how we would like our output.

      All that was missing was:

      binmode STDOUT, ':utf8';

      I guess the question was not why Unicode was broken on the new system, but rather why it worked on the old one.
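
A minimal sketch of what that one line changes, using a made-up sample string. Without an output layer, Perl writes characters up to U+00FF as single Latin-1 bytes and warns about "Wide character in print" for anything above; with the layer, characters are written as UTF-8. The thread uses ':utf8'; ':encoding(UTF-8)' is a commonly recommended, stricter variant.

```perl
use strict;
use warnings;

# Sample data (assumed): a decoded character string that includes U+263A,
# which has no single-byte representation.
my $text = "Gr\x{FC}\x{DF}e \x{263A}";

# Tell Perl to encode characters as UTF-8 when writing to STDOUT.
binmode STDOUT, ':utf8';

print $text, "\n";

# List the layers now active on STDOUT to confirm the setting.
my @layers = PerlIO::get_layers(\*STDOUT, output => 1);
print "STDOUT layers: @layers\n";
```

Under mod_perl the response may not go through STDOUT in the same way, so treat this as an illustration of the layer mechanism rather than a drop-in fix for every setup.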

        I guess the question was not why Unicode was broken on the new system, but rather why it worked on the old one.

        Probably locale related, i.e. export LC_CTYPE=de_DE.UTF-8 or some such,

        or set via the -C switch (see perlrun).
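
One way to test this guess is to run a small diagnostic on both the old and the new system and compare the output. The sketch below only reports settings; which of them (if any) explains the old behaviour is an open question in this thread.

```perl
use strict;
use warnings;

# ${^UNICODE} holds the numeric value of the -C switch (or PERL_UNICODE).
my $flags = ${^UNICODE};

# Bit values as documented for -C in perlrun: I = 1, O = 2, E = 4.
my %bit = (STDIN => 1, STDOUT => 2, STDERR => 4);
for my $name (sort keys %bit) {
    printf "%-6s UTF-8 via -C: %s\n", $name, ($flags & $bit{$name}) ? 'yes' : 'no';
}

# Bit 64 (-CL) makes the settings above conditional on the locale variables.
printf "locale-conditional (-CL): %s\n", ($flags & 64) ? 'yes' : 'no';

# Environment variables that can influence the above.
for my $var (qw(PERL_UNICODE LC_ALL LC_CTYPE LANG)) {
    printf "%-12s = %s\n", $var, defined $ENV{$var} ? $ENV{$var} : '(unset)';
}
```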