Format eating too few characters with utf-8?

adsb has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to fix a bug in a script. The section of code in question converts a string supplied on the command line to a format suitable for a changelog (which is required to be utf-8). The reduced script below works for ASCII text, but not when the string contains high-bit or multi-byte characters. For example, passing "l'été sera chaud" produces

  * l'été sera chaud
    ud

The characters on the second line are always the final n characters from the first, where n is the number of non-ASCII characters in the string; I'm therefore guessing this is a bytes vs characters issue. Could anyone suggest a solution?

#!/usr/bin/perl

use strict;
use warnings;
use open ':utf8';
use Encode 'decode_utf8';

my $CHGLINE = decode_utf8(join(" ", @ARGV));
open O, ">testout";

format O =
  * ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$CHGLINE
 ~~ ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$CHGLINE
.

write O;
close O;

Re: Format eating too few characters with utf-8? by ikegami (Patriarch) on Apr 01, 2008 at 20:00 UTC
I can reproduce* your problem in 5.8.8 and 5.10.0. It seems to be a bug in Perl. I have heard people speak highly of Perl6::Form, a module providing Perl6 format syntax to Perl5. Perhaps that would be a viable alternative. * I used the following to populate `@ARGV`: `use Encode qw( encode ); use HTML::Entities qw( decode_entities ); @ARGV = encode('UTF-8', decode_entities("l'été sera chaud"));`
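If pulling in a CPAN module is not an option, one possible alternative to `format` is the core Text::Wrap module. This is a different technique from the Perl6::Form suggestion above, not a demonstration of it. The sketch below is a minimal example under assumptions: the column width of 20 and the sample text are arbitrary choices for illustration, and it has not been checked against the original changelog layout. On a decoded string, Text::Wrap should measure line length in characters rather than bytes.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Text::Wrap qw(wrap);

# Assumed width for illustration; not the field width of the original format.
$Text::Wrap::columns = 20;

# Sample input given as raw UTF-8 bytes, standing in for join(" ", @ARGV).
my $text = decode_utf8("l'\xC3\xA9t\xC3\xA9 sera chaud et long");

# The first line gets the "  * " bullet; continuation lines get a
# four-space indent, roughly mirroring the two format picture lines.
my $wrapped = wrap("  * ", "    ", $text);

binmode STDOUT, ':encoding(UTF-8)';
print "$wrapped\n";
```

Text::Wrap breaks only at whitespace, so the result will not match a `^<<<` field exactly when a single word is longer than the line.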
Re: Format eating too few characters with utf-8? by jdporter (Paladin) on Apr 01, 2008 at 23:16 UTC
Did you try outputting the strings by some other means (e.g. print), rather than format? If other indicators say that the input data is hosed, then there could be something wrong with the command line interface. What shell are you doing this under? Is it really necessary to pass the string on the command line? Could you feed it in via stdin?
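One way to check the input independently of `format` is to compare the byte length of the raw argument with the character length of the decoded string. The sketch below is only an illustration: the sample string is hard-coded as UTF-8 bytes in place of `join(" ", @ARGV)`, and the variable names are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Raw UTF-8 bytes of the test string, standing in for join(" ", @ARGV).
my $raw     = "l'\xC3\xA9t\xC3\xA9 sera chaud";
my $decoded = decode_utf8($raw);

my $byte_len = length $raw;        # counts bytes before decoding
my $char_len = length $decoded;    # counts characters after decoding

# Count the non-ASCII characters in the decoded string.
my $non_ascii = () = $decoded =~ /[^\x00-\x7F]/g;

binmode STDOUT, ':encoding(UTF-8)';
printf "bytes=%d chars=%d non-ASCII=%d\n", $byte_len, $char_len, $non_ascii;
print "$decoded\n";
```

For this sample the byte count exceeds the character count by two, which matches the two characters ("ud") repeated on the second line of the reported output. If the decoded string prints correctly here, the command line is probably not at fault.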
Re^2: Format eating too few characters with utf-8? by adsb (Initiate) on Apr 02, 2008 at 06:05 UTC
Yes, the data really has to be passed on the command line. This is an established script, several years old, and I can't change its semantics. ikegami's reply populated @ARGV inside the script rather than passing the text on the command line, which should avoid any issues there. print()ing the string before passing it to decode_utf8 gives "l'été sera chaud". Decoding it gives "l'�t� sera chaud". I've found a solution that seems to fix the problem, although it is rather hacky. Appending n spaces to the string before write()ing it, where n is the number of non-ASCII characters inside the string, seems to fix the issue without polluting the eventual output.
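The padding workaround described above could be sketched roughly as follows. `pad_for_format` is a hypothetical helper name, not part of the original script, and the sample string is hard-coded as UTF-8 bytes in place of the command-line arguments. The sketch only builds the padded string; in the real script the result would be assigned to `$CHGLINE` before the `write`.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper: append one space per non-ASCII character so that
# the format consumes the whole string despite the miscount.
sub pad_for_format {
    my ($text) = @_;
    my $non_ascii = () = $text =~ /[^\x00-\x7F]/g;
    return $text . (' ' x $non_ascii);
}

# Raw UTF-8 bytes of the test string, standing in for join(" ", @ARGV).
my $line   = decode_utf8("l'\xC3\xA9t\xC3\xA9 sera chaud");
my $padded = pad_for_format($line);

binmode STDOUT, ':encoding(UTF-8)';
print "[$padded]\n";    # brackets make the trailing padding visible
```

This relies on the buggy behaviour staying the same, so it may need revisiting on a Perl version where the bug is fixed. Whether one space per character is still the right amount for characters that need more than two bytes in UTF-8 is not established in this thread.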