in reply to Re: One bird, two Unicode names
in thread One bird, two Unicode names

The files are in OpenOffice.org format (*.ods),
so I open them like this:
    my $zip = Archive::Zip->new( $infile );
    my $content = $zip->contents('content.xml');
    my $workbook_ref = read_xml_string($content);
    foreach my $sheet ( sort keys %$workbook_ref ) {
        foreach my $row ( @{$$workbook_ref{$sheet}} ) {
            foreach my $cell_contents ( @{$row} ) {
                next unless defined( $cell_contents );
                $cell_contents = replace_higher_unicode_code_points($cell_contents);
                # etc.
How would I apply your suggestion to decode the file in this instance?
Thanks in advance
RichardH

Re^3: One bird, two Unicode names
by ikegami (Patriarch) on Mar 11, 2011 at 08:38 UTC
    hmmmm, the XML parser used by read_xml_string should decode the text. What's that function?
      The XML parser comes with
          use Spreadsheet::ReadSXC qw(read_xml_string);
      My local copy of its documentation (file:///C:/Perl/html/site/lib/Spreadsheet/ReadSXC.html) suggests this:
          use Unicode::String qw(utf8);
          print utf8(" '$cell_contents'")->as_string;
      That correctly converts most of the file, which seems to be in latin-1, into UTF-8, at least for the lower code points. For example:
      Rougequeue de Güldenstädt => 'Rougequeue de Güldenstädt'
      But it fails on the higher code points, e.g. "'" in latin-1 does not (unsurprisingly) turn into RIGHT SINGLE QUOTATION MARK (U+2019, decimal 8217).
      Instead, the latin-1 text turns into this:
      Güldenstädt's Redstart => 'Güldenstädt's Redstart'
      which does not equal the name of the same bird in the UTF-8 coded file
      Richard H
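
One possible explanation for the apostrophe, sketched in Perl: ISO-8859-1 has no curly apostrophe, whereas Windows-1252 stores it as byte 0x92. The sketch below assumes, without confirmation from this thread, that the "latin-1" file is really Windows-1252. The sample byte string is made up for illustration and is not taken from the poster's data.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample bytes: 0xFC and 0xE4 are the umlauts in both encodings,
# 0x92 is the curly apostrophe in Windows-1252 only (assumed source encoding).
my $bytes = "G\xFCldenst\xE4dt\x92s Redstart";
my $pos   = index($bytes, "\x92");

# Decode the same bytes under each assumption about the encoding.
my $as_latin1 = decode('iso-8859-1', $bytes);
my $as_cp1252 = decode('cp1252',     $bytes);

# Report the code point that the apostrophe byte became in each case.
printf "iso-8859-1: U+%04X\n", ord substr($as_latin1, $pos, 1);
printf "cp1252:     U+%04X\n", ord substr($as_cp1252, $pos, 1);
```

Decoded as ISO-8859-1, the byte stays at U+0092 (a C1 control character); decoded as cp1252 it becomes U+2019, which would match the name in a UTF-8 file.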

        Spreadsheet::ReadSXC uses XML::Parser, which decodes properly.

        $ perl -CSDA -MEncode -MXML::Parser -E'XML::Parser->new(Handlers => { Char => sub { print "$_[1]" } })->parse(encode($ARGV[0], qq{<?xml version="1.0" encoding="$ARGV[0]"?><root>\xC9ric\n</root>}));' iso-8859-1
        Éric

        $ perl -CSDA -MEncode -MXML::Parser -E'XML::Parser->new(Handlers => { Char => sub { print "$_[1]" } })->parse(encode($ARGV[0], qq{<?xml version="1.0" encoding="$ARGV[0]"?><root>\xC9ric\n</root>}));' UTF-8
        Éric

        Could you provide the output of either of the following

        use Devel::Peek;
        Dump($s);

        or

        {
            use Data::Dumper;
            local $Data::Dumper::Useqq = 1;
            print(Dumper($s));
        }

        (preferably the former) for both versions of the string?

        Update: Looks like you already did. I followed up there.