RCH has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks
The following unicode problem has me baffled. (I'm a biologist, not a computer person)
I've got two "authoritative" lists of Palearctic birds.
I want to write one consolidated list, with notes on the various differences between list 1 and list 2.

Both lists are in OOorg spreadsheet format.
I'm using Spreadsheet::ReadSXC qw(read_xml_string) to read each list.
Then I examine differences between names, etc.
But I'm getting a lot of false differences, due to differences in the way that the same accented letter is represented in the two files.
For example one file has this
Güldenstädt's Redstart
The second file has this
Güldenstädtâ??s Redstart
for the same species

I've tried to replace UTF-8 chars by ISO 8859-1 thus:-
$string =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
And I've tried
use Unicode::Normalize 'normalize';
And
use Unicode::String qw(utf8 latin1);
No joy

So I'm doing this for every string retrieved from each spreadsheet
use Unicode::UCD 'charinfo';
# Look for codepoints not in Basic Latin
while ( $string =~ s/(\P{InBasic_Latin})// ) {
    my $U_char = $1;                            # e.g. U_char = ü
    my $U_codepoint = ord($U_char);             # so U_codepoint = ord(ü) = 252
    $string =~ s/$U_char/$subs{$U_codepoint}/;  # and $subs{252} = ü
}
The hash %subs was made by
foreach my $i (126 ... 255) { $subs{$i} = chr($i); }
This works, but seems ugly and suboptimal
Your help much appreciated
Richard H

Replies are listed 'Best First'.
Re: One bird, two Unicode names
by ikegami (Patriarch) on Mar 11, 2011 at 07:23 UTC
    Don't work with the data in its encoded form. Decode the files when you read them.
    open(my $fh1, '<:encoding(cp1252)', 'file1.txt') or die $!;
    open(my $fh2, '<:encoding(UTF-8)', 'file2.txt') or die $!;

    I'm guessing at the encodings of the files, since you didn't specify them.
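    As a minimal sketch of why decoding matters, the snippet below uses Encode's decode() on two byte strings that spell the same name in different encodings. The sample bytes and the choice of cp1252 and UTF-8 are assumptions for illustration, not something established about the actual spreadsheets.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same name as raw bytes in two encodings (hypothetical sample data).
my $bytes_cp1252 = "G\xFCldenst\xE4dt";          # cp1252: one byte per accented letter
my $bytes_utf8   = "G\xC3\xBCldenst\xC3\xA4dt";  # UTF-8: two bytes per accented letter

# Decode each byte string into Perl characters, using the encoding it was written in.
my $name1 = decode('cp1252', $bytes_cp1252);
my $name2 = decode('UTF-8',  $bytes_utf8);

# The encoded bytes differ, but the decoded character strings compare equal.
print "bytes equal: ", ($bytes_cp1252 eq $bytes_utf8 ? "yes" : "no"), "\n";  # no
print "chars equal: ", ($name1 eq $name2 ? "yes" : "no"), "\n";              # yes
```

    Comparisons made after decoding see characters rather than bytes, so the encoding used on disk no longer affects the result.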

      The files are OOorg format *.ods
      So I open thus
      my $zip = Archive::Zip->new( $infile );
      my $content = $zip->contents('content.xml');
      my $workbook_ref = read_xml_string($content);
      foreach my $sheet ( sort keys %$workbook_ref ) {
          foreach my $row ( @{$$workbook_ref{$sheet}} ) {
              foreach my $cell_contents ( @{$row} ) {
                  next unless defined( $cell_contents );
                  $cell_contents = replace_higher_unicode_code_points($cell_contents);
                  # etc
      How would I apply your "decode the file" advice in this instance?
      Thanks in advance
      RichardH
        hmmmm, the XML parser used by read_xml_string should decode the text. What's that function?
Re: One bird, two Unicode names
by vkon (Curate) on Mar 11, 2011 at 08:08 UTC
    Unicode::String is obsolete and deprecated; it dates from the Perl 5.6.0 era, when Perl had weak internal Unicode support.

    Use Encode instead. It is also more convenient, because it comes with Perl.

      Could you explain how to apply <Use Encode> in my particular case? The relevant bit of program is like this
      my $zip = Archive::Zip->new( $infile );
      my $content = $zip->contents('content.xml');
      my $workbook_ref = read_xml_string($content);
      foreach my $sheet ( sort keys %$workbook_ref ) {
          foreach my $row ( @{$$workbook_ref{$sheet}} ) {
              foreach my $cell_contents ( @{$row} ) {
                  next unless defined( $cell_contents );
                  DECODE $cell_contents SOMEHOW
      Thanks in advance
      RichardH

      P.S. For
      DECODE $cell_contents SOMEHOW
      I tried this
      use Encode;
      my $octets = encode("iso-8859-1", $string);
      return( $octets );

      Result of the above
      Güldenstädt’s Redstart => Güldenstädt?s Redstart
      i.e. "ü" for "ü" (good)
      and "ä" for "ä" (good)
      but "?" for "’" (Bad)
      "’" is RIGHT SINGLE QUOTATION MARK (U+2019), says charinfo(8217)
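The "?" seen above is what Encode substitutes when a character has no slot in the target encoding. The following sketch, which assumes the name contains U+2019 as charinfo reported, shows that default substitution, the FB_CROAK check that turns it into an error, and cp1252 as one encoding that does have a byte for U+2019. Whether cp1252 is acceptable for the final list is a separate decision.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The name with U+2019 as its apostrophe, written with escapes.
my $name = "G\x{fc}ldenst\x{e4}dt\x{2019}s";

# ISO 8859-1 has no slot for U+2019, so the default behaviour substitutes "?".
my $latin1 = encode('iso-8859-1', $name);

# With FB_CROAK, encode() dies instead of substituting silently.
my $copy    = $name;   # encode() may modify its input when a CHECK value is given
my $croaked = !eval { encode('iso-8859-1', $copy, Encode::FB_CROAK()); 1 };

# cp1252 maps U+2019 to the single byte 0x92.
my $cp1252 = encode('cp1252', $name);

print "latin1 ends in '?s': ", ($latin1 =~ /\?s\z/    ? "yes" : "no"), "\n";
print "FB_CROAK died:       ", ($croaked              ? "yes" : "no"), "\n";
print "cp1252 has 0x92:     ", ($cp1252 =~ /\x92s\z/  ? "yes" : "no"), "\n";
```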
Re: One bird, two Unicode names
by Eliya (Vicar) on Mar 11, 2011 at 16:14 UTC

    The first step to debugging Unicode/encoding issues is to check what you actually have to start with.

    So, use Devel::Peek to print (Dump) the original $string, and look at the PV entry.
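    As a lighter-weight complement to Devel::Peek, a string's code points can be listed directly. The helper name codepoints() below is hypothetical, and the two sample strings are shortened stand-ins for the bird names.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points of a string in U+XXXX notation.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

my $with_curly = "dt\x{2019}s";   # typographic apostrophe
my $with_ascii = "dt's";          # plain ASCII apostrophe

print codepoints($with_curly), "\n";   # U+0064 U+0074 U+2019 U+0073
print codepoints($with_ascii), "\n";   # U+0064 U+0074 U+0027 U+0073
```

    Because the output is pure ASCII, it reads the same in any terminal or editor, whatever its encoding settings.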

      That's very helpful
      use Devel::Peek; Dump $cell_contents;
      shows that the problem boils down to the difference between
      UTF8 "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart"
      and UTF8 "G\x{fc}ldenst\x{e4}dt's Redstart"

      and also makes me think that perhaps 1/2 my difficulties are due to the fact that I'm printing to STDOUT which I visualise in my (non-Unicode-aware) programmer's editor :-(
      RichardH

        aha! As I was starting to suspect, the problem is with your output! You should have been getting "Wide character in print" warnings, though. The following will fix your display problems if you're using STDOUT.

        use open ':std', ':locale';

        As for handling the fancy quotes, you could fix characters individually (e.g. s/\x{2019}/'/g;) or you could use Text::Unidecode.
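        On the display side, here is a sketch of what an output encoding layer does. It writes to an in-memory handle rather than STDOUT so the resulting bytes can be inspected, and UTF-8 is an assumed choice of output encoding. In a real script the equivalent for STDOUT would be binmode(STDOUT, ':encoding(UTF-8)').

```perl
use strict;
use warnings;

my $name = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";

# Write through an encoding layer to an in-memory file, so the bytes can be inspected.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open in-memory handle: $!";
print {$out} $name;
close($out);

# Each accented letter takes two bytes in UTF-8, and U+2019 takes three.
printf "%d characters became %d bytes\n", length($name), length($buffer);
```

        With the layer in place there is no "Wide character in print" warning, because Perl knows how to turn each character into bytes.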

        So, one way to make the two strings equal would be to replace the Unicode apostrophe U+2019 found in the first string with the ASCII single quote used in the second string:

        $s1 =~ s/\x{2019}/'/g;

        (just in case it's not obvious...)
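        Extending that idea, the sketch below builds a comparison key for each name before the two lists are compared. The helper name comparison_key() is hypothetical, and the NFC step is an added assumption: it only matters if one file stores accents as base letter plus combining mark, which this thread has not shown to be the case.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: reduce a name to a form suitable for comparison.
sub comparison_key {
    my ($name) = @_;
    $name = NFC($name);                 # unify precomposed and combining accents
    $name =~ tr/\x{2018}\x{2019}/''/;   # curly single quotes -> ASCII apostrophe
    $name =~ tr/\x{201C}\x{201D}/""/;   # curly double quotes -> ASCII double quote
    return $name;
}

my $s1 = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";
my $s2 = "G\x{fc}ldenst\x{e4}dt's Redstart";

print "raw equal:  ", ($s1 eq $s2 ? "yes" : "no"), "\n";                                  # no
print "keys equal: ", (comparison_key($s1) eq comparison_key($s2) ? "yes" : "no"), "\n";  # yes
```

        Comparing keys while keeping the original strings for the report means a difference in apostrophe style can still be noted in the consolidated list if that is wanted.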

Re: One bird, two Unicode names
by JavaFan (Canon) on Mar 11, 2011 at 11:14 UTC
    You could upgrade or downgrade all strings before comparing them. Note that downgrading isn't going to work if they contain characters with code points exceeding 255 - but then such strings don't have LATIN-1 equivalents anyway.

    See utf8; (But if you go this way, don't put a use utf8; in your code).

      I tried utf8::downgrade with FAIL_OK set to 1
      my $success = utf8::downgrade($string, 1);
      return($string);
      It didn't seem to do anything
      given Güldenstädt’s Redstart
      it returned Güldenstädt’s Redstart

      If I set FAIL_OK to 0, it died, with
      Wide character in subroutine entry

      RichardH
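      The behaviour reported above can be reproduced with a short sketch. It assumes the string contains U+2019, as the Devel::Peek output elsewhere in this thread shows. With FAIL_OK set, utf8::downgrade returns false and leaves the string untouched when any code point is above 255, which would look like "it didn't do anything".

```perl
use strict;
use warnings;

my $latin1_only = "G\x{fc}ldenst\x{e4}dt";            # every code point <= 255
my $with_curly  = "G\x{fc}ldenst\x{e4}dt\x{2019}s";   # contains U+2019

utf8::upgrade($latin1_only);   # force the internal UTF-8 representation first

# The second argument is FAIL_OK: return false instead of dying on failure.
my $ok1 = utf8::downgrade($latin1_only, 1);
my $ok2 = utf8::downgrade($with_curly, 1);

print "latin1_only downgraded: ", ($ok1 ? "yes" : "no"), "\n";   # yes
print "with_curly downgraded:  ", ($ok2 ? "yes" : "no"), "\n";   # no
```

      Note that upgrading and downgrading change only Perl's internal storage of a string, never its characters, so neither can make U+2019 compare equal to an ASCII apostrophe.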