in reply to Re^2: One bird, two Unicode names
in thread One bird, two Unicode names

hmmmm, the XML parser used by read_xml_string should decode the text. What's that function?

Re^4: One bird, two Unicode names
by Anonymous Monk on Mar 11, 2011 at 10:11 UTC
    The XML parser comes with
    use Spreadsheet::ReadSXC qw(read_xml_string);
    My local copy of the docs (file:///C:/Perl/html/site/lib/Spreadsheet/ReadSXC.html) suggests this:
    use Unicode::String qw(utf8); print utf8(" '$cell_contents'")->as_string;
    That correctly forces most of the file that seems to be in latin-1 into UTF-8, at least for the lower code points, for example
    Rougequeue de Güldenstädt => 'Rougequeue de Güldenstädt'
    But it fails on the higher code points, e.g. "'" in latin-1 does not (unsurprisingly) turn into RIGHT SINGLE QUOTATION MARK (U+2019, decimal 8217).
    Instead, the latin-1 turns into this
    Güldenstädt's Redstart => 'Güldenstädt's Redstart'
    which does not equal the name of the same bird in the UTF-8 coded file
    Richard H
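    For reference, latin-1 has no RIGHT SINGLE QUOTATION MARK at all; the single-byte typographic apostrophe belongs to Windows-1252 (cp1252). A minimal sketch using the core Encode module, with the byte value 0x92 picked purely for illustration (it is not taken from either spreadsheet):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative input: byte 0x92, the typographic apostrophe in cp1252.
my $byte = "\x92";

# Decoded as cp1252 it becomes U+2019 RIGHT SINGLE QUOTATION MARK.
my $as_cp1252 = decode('cp1252', $byte);

# Decoded as ISO-8859-1 (latin-1) it becomes U+0092, a C1 control character.
my $as_latin1 = decode('iso-8859-1', $byte);

printf "cp1252:     U+%04X\n", ord($as_cp1252);
printf "iso-8859-1: U+%04X\n", ord($as_latin1);
```

    So a byte that means "’" in cp1252 can never come out as U+2019 if it is decoded as latin-1.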

      Spreadsheet::ReadSXC uses XML::Parser which properly decodes.

      $ perl -CSDA -MEncode -MXML::Parser -E'XML::Parser->new(Handlers => { Char => sub { print "$_[1]" } })->parse(encode($ARGV[0], qq{<?xml version="1.0" encoding="$ARGV[0]"?><root>\xC9ric\n</root>}));' iso-8859-1
      Éric

      $ perl -CSDA -MEncode -MXML::Parser -E'XML::Parser->new(Handlers => { Char => sub { print "$_[1]" } })->parse(encode($ARGV[0], qq{<?xml version="1.0" encoding="$ARGV[0]"?><root>\xC9ric\n</root>}));' UTF-8
      Éric

      Could you provide me the output from either of the following

      use Devel::Peek; Dump($s);

      or

      { use Data::Dumper; local $Data::Dumper::Useqq = 1; print(Dumper($s)); }

      (preferably the former) for both versions of the string?
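      If neither module is handy, the same information can be read off with plain Perl. A minimal sketch; codepoints() is a hypothetical helper written for this example, and the sample string is only a fragment of the bird name:

```perl
use strict;
use warnings;

# Hypothetical helper: render each character of a string as U+XXXX.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

# Sample fragment containing the typographic apostrophe.
my $s = "dt\x{2019}s";
print codepoints($s), "\n";   # U+0064 U+0074 U+2019 U+0073
```

      Unlike Devel::Peek, this shows only the characters, not the internal flags or byte buffer.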

      Update: Looks like you already did. I followed up there.

        I'm not sure if you still want this or not. Here, fwiw, is output of your use Devel::Peek
        1 of 2
        Here comes $s, the contents of cell at row 636, column 2 of file .../BirdLists_in_english/AERC WPlist July 2010 version 2.0.ods
        $s = Güldenstädt’s Redstart
        SV = PV(0x201601c) at 0x2020ca0
          REFCNT = 1
          FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
          PV = 0x468ffe4 "G\303\274ldenst\303\244dt\342\200\231s Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart"]
          CUR = 26
          LEN = 27
        SV = PVMG(0x460a77c) at 0x2020ca0
          REFCNT = 1
          FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
          IV = 0
          NV = 0
          PV = 0x45ff86c "G\303\274ldenst\303\244dt\342\200\231s Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart"]
          CUR = 26
          LEN = 175
          MAGIC = 0x418ce64
            MG_VIRTUAL = &PL_vtbl_utf8
            MG_TYPE = PERL_MAGIC_utf8(w)
            MG_LEN = 22

        2 of 2
        Here comes $s, the contents of cell at row 763, column 8 of file .../BirdLists_in_both_languages/53174_Liste_Pal_OccO2008.ods
        $s = Güldenstädt's Redstart
        SV = PVMG(0x460a77c) at 0x2020ca0
          REFCNT = 1
          FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
          IV = 0
          NV = 0
          PV = 0x27bd014 "G\303\274ldenst\303\244dt's Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt's Redstart"]
          CUR = 24
          LEN = 779
          MAGIC = 0x280913c
            MG_VIRTUAL = &PL_vtbl_utf8
            MG_TYPE = PERL_MAGIC_utf8(w)
            MG_LEN = 22

        RichardH
        Update
        Summary
        Here are the summarized results of "use Devel::Peek;"

        WITH:- use open ':std', ':encoding(cp1252)';
        File AERC*.ods :-
        PV = 0x34b5cf4 "G\303\274ldenst\303\244dt's Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt's Redstart"]

        File Pal_*.ods :-
        PV = 0x34b5cf4 "G\303\274ldenst\303\244dt's Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt's Redstart"]

        WITHOUT:-
        File AERC*.ods :-
        PV = 0x45fc33c "G\303\274ldenst\303\244dt\342\200\231s Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart"]

        File Pal_*.ods :-
        PV = 0x4660024 "G\303\274ldenst\303\244dt's Redstart"\0 [UTF8 "G\x{fc}ldenst\x{e4}dt's Redstart"]
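        The WITHOUT dumps can be cross-checked in a few lines. A minimal sketch: the two strings are copied from the [UTF8 "..."] parts above, and first_diff() is a hypothetical helper written for this example:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The two cell values as character strings, copied from the dumps.
my $aerc = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";  # typographic apostrophe
my $pal  = "G\x{fc}ldenst\x{e4}dt's Redstart";         # ASCII apostrophe

# Hypothetical helper: index of the first differing character, or -1 if equal.
sub first_diff {
    my ($x, $y) = @_;
    my $n = length($x) < length($y) ? length($x) : length($y);
    for my $i (0 .. $n - 1) {
        return $i if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return length($x) == length($y) ? -1 : $n;
}

my $i = first_diff($aerc, $pal);

# Character counts correspond to MG_LEN, UTF-8 byte counts to CUR.
printf "chars:       %d vs %d\n", length($aerc), length($pal);
printf "UTF-8 bytes: %d vs %d\n",
    length(encode('UTF-8', $aerc)), length(encode('UTF-8', $pal));
printf "first difference at index %d: U+%04X vs U+%04X\n",
    $i, ord(substr($aerc, $i, 1)), ord(substr($pal, $i, 1));
```

        Both names are 22 characters long (MG_LEN = 22); they differ only at index 11, U+2019 in the AERC file against U+0027 in the Pal file, which accounts for CUR = 26 against CUR = 24.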

        Conclusion
        To remove the differences between the OpenOffice.org (OOorg) encodings, include the line
        use open ':std', ':encoding(cp1252)';