adsb has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to fix a bug in a script. The section of code in question converts a string supplied on the command line to a format suitable for a changelog (which is required to be utf-8). The reduced script below works for ASCII text, but not when the string contains high-bit or multi-byte characters. For example, passing "l'été sera chaud" produces
* l'été sera chaud
   ud
The characters on the second line are always the final n characters from the first, where n is the number of non-ASCII characters in the string; I'm therefore guessing this is a bytes vs characters issue. Could anyone suggest a solution?
#!/usr/bin/perl
use strict;
use warnings;
use open ':utf8';
use Encode 'decode_utf8';

my $CHGLINE = decode_utf8(join(" ", @ARGV));

open O, ">testout";
format O =
* ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  $CHGLINE
~~ ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  $CHGLINE
.
write O;
close O;

Re: Format eating too few characters with utf-8?
by ikegami (Patriarch) on Apr 01, 2008 at 20:00 UTC

    I can reproduce* your problem in 5.8.8 and 5.10.0. It seems to be a bug in Perl.

    I have heard people speak highly of Perl6::Form, a module providing Perl6 format syntax to Perl5. Perhaps that would be a viable alternative.

    * — I used the following to populate @ARGV:

    use Encode qw( encode );
    use HTML::Entities qw( decode_entities );
    @ARGV = encode('UTF-8', decode_entities("l'&eacute;t&eacute; sera chaud"));
Re: Format eating too few characters with utf-8?
by jdporter (Paladin) on Apr 01, 2008 at 23:16 UTC

    Did you try outputting the strings by some other means (e.g. print), rather than format? If other indicators say that the input data is hosed, then there could be something wrong with the command line interface. What shell are you doing this under? Is it really necessary to pass the string on the command line? Could you feed it in via stdin?

      Yes, the data really has to be passed on the command line; this is a several-year-old, established script whose semantics I can't change. ikegami's reply populated @ARGV inside the script rather than passing the text on the command line, which should avoid any issues there.

      print()ing the string before passing it to decode_utf8 gives "l'été sera chaud". Decoding it gives "l'�t� sera chaud".
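
      The difference between the two prints is consistent with the input itself being fine: the raw argument is UTF-8 bytes, while the decoded string holds characters that get written as single Latin-1 bytes when the output handle has no encoding layer. A minimal sketch of that, assuming the argument arrives as UTF-8 bytes (the in-memory handles and variable names are illustrative only, not part of the original script):

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# UTF-8 bytes for "l'été sera chaud", as they would arrive in @ARGV (assumption).
my $bytes = "l'\xC3\xA9t\xC3\xA9 sera chaud";
my $chars = decode_utf8($bytes);

# Print the decoded string to two in-memory handles:
# one without an encoding layer, one with a UTF-8 layer.
my ($raw_out, $utf8_out) = ('', '');
open my $raw_fh,  '>:raw',             \$raw_out  or die "open: $!";
open my $utf8_fh, '>:encoding(UTF-8)', \$utf8_out or die "open: $!";
print {$raw_fh}  $chars;
print {$utf8_fh} $chars;
close $raw_fh;
close $utf8_fh;

printf "input: %d bytes, decoded: %d characters\n",
    length($bytes), length($chars);
printf "written without layer: %d bytes, with UTF-8 layer: %d bytes\n",
    length($raw_out), length($utf8_out);
```

      Only the handle with the UTF-8 layer writes back the original byte sequence; on a UTF-8 terminal, the single bytes from the other handle would show up as replacement characters.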

      I've found a solution that seems to fix the problem, although it is rather hacky. Appending n spaces to the string before write()ing it, where n is the number of non-ASCII characters inside the string, seems to fix the issue without polluting the eventual output.
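
      The padding workaround described above could be sketched roughly as follows. The helper name pad_for_format is hypothetical, and the sketch assumes the string has already been decoded into characters; it only illustrates the counting and padding step, not a fix for the underlying format behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): append one space per
# non-ASCII character, as in the workaround described above.
sub pad_for_format {
    my ($text) = @_;
    # Count the characters outside the ASCII range in the decoded string.
    my $non_ascii = () = $text =~ /[^\x00-\x7F]/g;
    return $text . (' ' x $non_ascii);
}

# "l'été sera chaud" as a character string; it has two non-ASCII characters.
my $line   = "l'\x{e9}t\x{e9} sera chaud";
my $padded = pad_for_format($line);

printf "original: %d characters, padded: %d characters\n",
    length($line), length($padded);
```

      In the reduced script, the result of pad_for_format would be assigned to $CHGLINE before calling write.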