Re^2: One bird, two Unicode names

The files are in OpenOffice.org *.ods format, so I open them like this:

my $zip = Archive::Zip->new( $infile );
my $content = $zip->contents('content.xml');
my $workbook_ref = read_xml_string($content);
foreach my $sheet ( sort keys %$workbook_ref ) {
  foreach my $row( @{$$workbook_ref{$sheet}} ) {
    foreach my $cell_contents (@{$row}){
      next unless defined( $cell_contents );
      $cell_contents = replace_higher_unicode_code_points($cell_contents);
etc

How would I apply your "decode the file" advice in this instance?
Thanks in advance
RichardH


Re^3: One bird, two Unicode names by ikegami (Patriarch) on Mar 11, 2011 at 08:38 UTC
Hmm, the XML parser used by `read_xml_string` should decode the text. What's that function?
Re^4: One bird, two Unicode names by Anonymous Monk on Mar 11, 2011 at 10:11 UTC
The XML parser comes with

use Spreadsheet::ReadSXC qw(read_xml_string);

My local copy of the documentation (file:///C:/Perl/html/site/lib/Spreadsheet/ReadSXC.html) suggests this:

use Unicode::String qw(utf8);
print utf8(" '$cell_contents'")->as_string;

That correctly forces most of the file, which seems to be in Latin-1, into UTF-8, at least for the lower code points. For example:

Rougequeue de Güldenstädt => 'Rougequeue de GÃ¼ldenstÃ¤dt'

But it fails on the higher code points: the "'" in the Latin-1 file does not (unsurprisingly) turn into RIGHT SINGLE QUOTATION MARK (U+2019, decimal 8217). Instead, the Latin-1 text turns into this:

Güldenstädt's Redstart => 'GÃ¼ldenstÃ¤dt's Redstart'

which does not equal the name of the same bird in the UTF-8 encoded file.

Richard H
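For readers wondering where a string like 'GÃ¼ldenstÃ¤dt' comes from, here is a minimal sketch of one possible explanation, which the thread does not confirm: the text is already correct, gets encoded to UTF-8, and those bytes are then read or displayed as Latin-1. The sample string and variable names are made up for illustration. Only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: the name as properly decoded characters.
my $name = "G\x{FC}ldenst\x{E4}dt";    # Güldenstädt

# Encoding to UTF-8 turns each accented letter into two bytes.
my $utf8_bytes = encode('UTF-8', $name);

# Reading those bytes as Latin-1 maps every byte to one character,
# so U+00FC becomes the pair U+00C3 U+00BC, and so on.
my $misread = decode('ISO-8859-1', $utf8_bytes);

binmode(STDOUT, ':encoding(UTF-8)');
print "$misread\n";
```

If that is what is happening, the extra conversion step is the problem, not the parser, and the fix would be to remove it and not to add another.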
Re^5: One bird, two Unicode names by ikegami (Patriarch) on Mar 11, 2011 at 20:10 UTC
Spreadsheet::ReadSXC uses XML::Parser, which decodes properly.

$ perl -CSDA -MEncode -MXML::Parser -E'XML::Parser->new(Handlers => { Char => sub { print "$_[1]" } })->parse(encode($ARGV[0], qq{<?xml version="1.0" encoding="$ARGV[0]"?><root>\xC9ric\n</root>}));' iso-8859-1
Éric

$ perl -CSDA -MEncode -MXML::Parser -E'XML::Parser->new(Handlers => { Char => sub { print "$_[1]" } })->parse(encode($ARGV[0], qq{<?xml version="1.0" encoding="$ARGV[0]"?><root>\xC9ric\n</root>}));' UTF-8
Éric

Could you provide me the output from either of the following

use Devel::Peek;
Dump($s);

or

{
    use Data::Dumper;
    local $Data::Dumper::Useqq = 1;
    print(Dumper($s));
}

(preferably the former) for both versions of the string?

Update: Looks like you already did. I followed up there.
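To illustrate why such a dump is useful, here is a small sketch that prints the same name once as decoded characters and once as UTF-8 bytes. The sample string and the `show` helper are hypothetical and not taken from the poster's data. With `Useqq` set, non-ASCII content is printed as escapes, so the two forms can be told apart even on a terminal that would render them identically.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Data::Dumper;

# Hypothetical sample: the name with U+2019 as its apostrophe.
my $chars = "G\x{FC}ldenst\x{E4}dt\x{2019}s Redstart";   # decoded characters
my $bytes = encode('UTF-8', $chars);                      # UTF-8 encoded bytes

# Hypothetical helper: dump one string with escapes on a single line.
sub show {
    my ($s) = @_;
    local $Data::Dumper::Useqq  = 1;
    local $Data::Dumper::Terse  = 1;
    local $Data::Dumper::Indent = 0;
    return Dumper($s);
}

print "chars: ", show($chars), "\n";
print "bytes: ", show($bytes), "\n";
```

Two strings that look the same on screen but dump differently are not equal under `eq`, which is the symptom described in this thread.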
Re^6: One bird, two Unicode names by RCH (Sexton) on Mar 14, 2011 at 16:02 UTC