in reply to Re^3: One bird, two Unicode names
in thread One bird, two Unicode names

The XML parser comes with this module:
use Spreadsheet::ReadSXC qw(read_xml_string);
My local copy of the documentation (file:///C:/Perl/html/site/lib/Spreadsheet/ReadSXC.html) suggests this:
use Unicode::String qw(utf8); print utf8(" '$cell_contents'")->as_string;
That correctly converts most of the file, which seems to be in latin-1, into UTF-8, at least for the lower code points. For example:
Rougequeue de Güldenstädt => 'Rougequeue de Güldenstädt'
But it fails on the higher code points: the "'" in the latin-1 file does not (unsurprisingly) turn into RIGHT SINGLE QUOTATION MARK (U+2019, decimal 8217).
Instead, the latin-1 name comes out like this:
Güldenstädt's Redstart => 'Güldenstädt's Redstart'
which does not equal the name of the same bird in the UTF-8 encoded file.
Richard H
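
One way to see exactly where the two names diverge is to list their code points side by side. The following is a minimal sketch, not code from the thread: the two strings are copied from the Devel::Peek output posted further down, and `code_points` is a hypothetical helper name chosen for illustration.

```perl
use strict;
use warnings;

# The two spellings, written with escapes so the script does not
# depend on the encoding of its own source file.
my $aerc = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";   # typographic apostrophe
my $pal  = "G\x{fc}ldenst\x{e4}dt's Redstart";          # ASCII apostrophe

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub code_points {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

print code_points($aerc), "\n";
print code_points($pal),  "\n";
print $aerc eq $pal ? "equal\n" : "different\n";
```

If the only difference reported is U+2019 against U+0027, the two files agree on the accented letters and disagree only on the apostrophe.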

Re^5: One bird, two Unicode names
by ikegami (Patriarch) on Mar 11, 2011 at 20:10 UTC

    Spreadsheet::ReadSXC uses XML::Parser, which decodes properly.

    $ perl -CSDA -MEncode -MXML::Parser -E'XML::Parser->new(Handlers => { Char => sub { print "$_[1]" } })->parse(encode($ARGV[0], qq{<?xml version="1.0" encoding="$ARGV[0]"?><root>\xC9ric\n</root>}));' iso-8859-1
    Éric

    $ perl -CSDA -MEncode -MXML::Parser -E'XML::Parser->new(Handlers => { Char => sub { print "$_[1]" } })->parse(encode($ARGV[0], qq{<?xml version="1.0" encoding="$ARGV[0]"?><root>\xC9ric\n</root>}));' UTF-8
    Éric

    Could you provide me with the output of either of the following

    use Devel::Peek; Dump($s);

    or

    { use Data::Dumper; local $Data::Dumper::Useqq = 1; print(Dumper($s)); }

    (preferably the former) for both versions of the string?

    Update: Looks like you already did. I followed up there.

      I'm not sure whether you still want this. Here, FWIW, is the output of your Devel::Peek suggestion:
      1 of 2
      Here comes $s, the contents of cell at row 636, column 2 of file .../BirdLists_in_english/AERC WPlist July 2010 version 2.0.ods
      $s = Güldenstädt’s Redstart
      SV = PV(0x201601c) at 0x2020ca0
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
        PV = 0x468ffe4 "G\303\274ldenst\303\244dt\342\200\231s Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart"]
        CUR = 26
        LEN = 27
      SV = PVMG(0x460a77c) at 0x2020ca0
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
        IV = 0
        NV = 0
        PV = 0x45ff86c "G\303\274ldenst\303\244dt\342\200\231s Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart"]
        CUR = 26
        LEN = 175
        MAGIC = 0x418ce64
          MG_VIRTUAL = &PL_vtbl_utf8
          MG_TYPE = PERL_MAGIC_utf8(w)
          MG_LEN = 22

      2 of 2
      Here comes $s, the contents of cell at row 763, column 8 of file .../BirdLists_in_both_languages/53174_Liste_Pal_OccO2008.ods
      $s = Güldenstädt's Redstart
      SV = PVMG(0x460a77c) at 0x2020ca0
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
        IV = 0
        NV = 0
        PV = 0x27bd014 "G\303\274ldenst\303\244dt's Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt's Redstart"]
        CUR = 24
        LEN = 779
        MAGIC = 0x280913c
          MG_VIRTUAL = &PL_vtbl_utf8
          MG_TYPE = PERL_MAGIC_utf8(w)
          MG_LEN = 22

      RichardH
      Update
      Summary
      Here are the summarized results of
      "use Devel::Peek;"

      WITH:- use open ':std', ':encoding(cp1252)';
      File AERC*.ods :-
      PV = 0x34b5cf4 "G\303\274ldenst\303\244dt's Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt's Redstart"]

      File Pal_*.ods :-
      PV = 0x34b5cf4 "G\303\274ldenst\303\244dt's Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt's Redstart"]

      WITHOUT:-
      File AERC*.ods :-
      PV = 0x45fc33c "G\303\274ldenst\303\244dt\342\200\231s Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart"]

      File Pal_*.ods :-
      PV = 0x4660024 "G\303\274ldenst\303\244dt's Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt's Redstart"]

      Conclusion
      To remove the differences between the two OOorg encodings,
      include the line
      "use open ':std', ':encoding(cp1252)';"
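
If the goal is only to make the two bird names compare equal, another option is to fold the typographic apostrophe to its ASCII counterpart before comparing. This is a rough sketch of that idea and not something proposed in the thread; `normalize_name` is a hypothetical helper, and the decision to also fold U+2018 is an assumption about what other lists might contain.

```perl
use strict;
use warnings;

# Hypothetical helper: map typographic single quotes (U+2018, U+2019)
# to the ASCII apostrophe (U+0027) so that names compare equal.
sub normalize_name {
    my ($name) = @_;
    $name =~ tr/\x{2018}\x{2019}/''/;
    return $name;
}

# The two spellings reported by Devel::Peek above.
my $aerc = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";
my $pal  = "G\x{fc}ldenst\x{e4}dt's Redstart";

print normalize_name($aerc) eq normalize_name($pal)
    ? "same bird\n"
    : "still different\n";
```

This works on the decoded strings themselves, so it does not depend on which I/O layers are active when the results are printed.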