in reply to One bird, two Unicode names

Don't work with the data in its encoded form. Decode the files when you read them.
open(my $fh1, '<:encoding(cp1252)', 'file1.txt');
open(my $fh2, '<:encoding(UTF-8)', 'file2.txt');

I'm guessing at the encodings of the files, since you didn't specify them.
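
To illustrate why decoding on read helps, here is a minimal sketch. The sample name, the two in-memory "files", and the variable names are assumptions made up for the illustration, and the encodings are the same guesses as above. It shows that the raw bytes of the two encodings differ, while the decoded character strings compare equal.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample data: one bird name, stored as raw bytes in two encodings.
my $name       = "G\x{FC}ldenst\x{E4}dt\x{2019}s Redstart";
my $cp1252_raw = encode('cp1252', $name);
my $utf8_raw   = encode('UTF-8',  $name);

# In-memory filehandles stand in for file1.txt and file2.txt.
open(my $fh1, '<:encoding(cp1252)', \$cp1252_raw) or die "open: $!";
open(my $fh2, '<:encoding(UTF-8)',  \$utf8_raw)   or die "open: $!";

# Each read returns decoded characters, not encoded bytes.
my $from_cp1252 = <$fh1>;
my $from_utf8   = <$fh2>;

my $bytes_differ = $cp1252_raw  ne $utf8_raw  ? 1 : 0;
my $text_matches = $from_cp1252 eq $from_utf8 ? 1 : 0;

print "raw bytes differ: $bytes_differ\n";
print "decoded text matches: $text_matches\n";
```

Once both sides are decoded, eq compares characters rather than bytes, so the source encoding of each file no longer matters.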

Re^2: One bird, two Unicode names
by RCH (Sexton) on Mar 11, 2011 at 08:19 UTC
    The files are in OpenOffice.org format (*.ods),
    so I open them thus:
    my $zip = Archive::Zip->new( $infile );
    my $content = $zip->contents('content.xml');
    my $workbook_ref = read_xml_string($content);
    foreach my $sheet ( sort keys %$workbook_ref ) {
        foreach my $row ( @{$$workbook_ref{$sheet}} ) {
            foreach my $cell_contents ( @{$row} ) {
                next unless defined( $cell_contents );
                $cell_contents = replace_higher_unicode_code_points($cell_contents);
                # etc.
    How would I apply your "decode the file" advice in this instance?
    Thanks in advance
    RichardH
      Hmm, the XML parser used by read_xml_string should already decode the text. Where does that function come from?
        The XML parser comes with
        use Spreadsheet::ReadSXC qw(read_xml_string);
        My local copy of the documentation (file:///C:/Perl/html/site/lib/Spreadsheet/ReadSXC.html) suggests this:
        use Unicode::String qw(utf8);
        print utf8(" '$cell_contents'")->as_string;
        That correctly converts most of the file, which seems to be in Latin-1, into UTF-8, at least for the lower code points. For example:
        Rougequeue de Güldenstädt => 'Rougequeue de Güldenstädt'
        But it fails on the higher code points, e.g. "'" in Latin-1 does not (unsurprisingly) turn into RIGHT SINGLE QUOTATION MARK (U+2019, decimal 8217).
        Instead, the Latin-1 text turns into this:
        Güldenstädt's Redstart => 'Güldenstädt's Redstart'
        which does not equal the name of the same bird in the UTF-8 encoded file.
        Richard H
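
One possible explanation for the quotation mark, sketched under an assumption the thread does not confirm: the apostrophe is stored as byte 0x92, which is how Windows-1252 (cp1252) encodes RIGHT SINGLE QUOTATION MARK. ISO-8859-1 (Latin-1) maps that same byte to a C1 control character, so a Latin-1 conversion can never produce U+2019.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the troublesome apostrophe is the single byte 0x92.
my $byte = "\x92";

my $as_latin1 = decode('iso-8859-1', $byte);
my $as_cp1252 = decode('cp1252',     $byte);

printf "iso-8859-1: U+%04X\n", ord $as_latin1;   # U+0092, a control character
printf "cp1252:     U+%04X\n", ord $as_cp1252;   # U+2019, RIGHT SINGLE QUOTATION MARK
```

If that assumption holds for the data in question, treating it as cp1252 rather than Latin-1 would give the same character as the UTF-8 file.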