One bird, two Unicode names

RCH has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks
The following Unicode problem has me baffled. (I'm a biologist, not a computer person.)
I've got two "authoritative" lists of Palearctic birds.
I want to write one consolidated list, with notes on the various differences between list 1 and list 2.

Both lists are in OOorg spreadsheet format.
I'm using Spreadsheet::ReadSXC qw(read_xml_string) to read each list.
Then I examine differences between names, etc.
But I'm getting a lot of false differences, due to differences in the way that the same accented letter is represented in the two files.
For example, one file has this:

            Güldenstädt's Redstart

The second file has this

            GÃ¼ldenstÃ¤dtâ€™s Redstart

for the same species

I've tried replacing the UTF-8 characters with their ISO 8859-1 equivalents, thus:

    $string =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

And I've tried

    use Unicode::Normalize 'normalize';

And

    use Unicode::String qw(utf8 latin1);

No joy

So I'm doing this for every string retrieved from each spreadsheet

  use Unicode::UCD 'charinfo';

  # Look for codepoints not in Basic Latin 
  while ( $string  =~ s/(\P{InBasic_Latin})// ) {     
        my $U_char = $1;                              
          # e.g. U_char = Ã¼  
        my $U_codepoint = ord($U_char);               
          # so U_codepoint = ord(Ã¼)  = 252
        $string =~ s/$U_char/$subs{$U_codepoint}/;    
          # and $subs{252} = ü
  }

The hash %subs was made by

    foreach my $i (126 ... 255) {
        $subs{$i} = chr($i);
    }

This works, but seems ugly and suboptimal.
Your help would be much appreciated.
Richard H


Re: One bird, two Unicode names by ikegami (Patriarch) on Mar 11, 2011 at 07:23 UTC
Don't work with the data in its encoded form. Decode the files when you read them:

    open(my $fh1, '<:encoding(cp1252)', 'file1.txt') or die $!;
    open(my $fh2, '<:encoding(UTF-8)',  'file2.txt') or die $!;

I'm guessing at the encodings, since you didn't specify them.
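A minimal sketch of what decoding on read buys you. The sample byte strings and the helper name `read_decoded_line` are assumptions made for illustration, and in-memory filehandles stand in for the two files; the encodings are the same guesses as above.

```perl
use strict;
use warnings;

# Hypothetical sample data: the same name as cp1252 bytes and as UTF-8 bytes.
my $cp1252_bytes = "G\xFCldenst\xE4dt\x92s Redstart\n";
my $utf8_bytes   = "G\xC3\xBCldenst\xC3\xA4dt\xE2\x80\x99s Redstart\n";

# Read one line through a decoding layer, so the caller gets characters
# rather than encoded bytes.
sub read_decoded_line {
    my ($encoding, $bytes) = @_;
    open(my $fh, "<:encoding($encoding)", \$bytes)
        or die "Cannot open in-memory file: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

my $name1 = read_decoded_line('cp1252', $cp1252_bytes);
my $name2 = read_decoded_line('UTF-8',  $utf8_bytes);

# Once both are decoded, a plain string comparison is enough.
print $name1 eq $name2 ? "same\n" : "different\n";
```

Once both inputs are decoded, the comparison no longer depends on how each file happened to store the accented letters.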
Re^2: One bird, two Unicode names by RCH (Sexton) on Mar 11, 2011 at 08:19 UTC
The files are in OOorg *.ods format, so I open them thus:

    my $zip = Archive::Zip->new( $infile );
    my $content = $zip->contents('content.xml');
    my $workbook_ref = read_xml_string($content);
    foreach my $sheet ( sort keys %$workbook_ref ) {
        foreach my $row ( @{$$workbook_ref{$sheet}} ) {
            foreach my $cell_contents ( @{$row} ) {
                next unless defined( $cell_contents );
                $cell_contents = replace_higher_unicode_code_points($cell_contents);
                # etc.

How would I apply your "Decode the file" advice in this instance?

Thanks in advance,
RichardH
Re^3: One bird, two Unicode names by ikegami (Patriarch) on Mar 11, 2011 at 08:38 UTC
Hmm, the XML parser used by `read_xml_string` should decode the text. What's that function?
Re^4: One bird, two Unicode names by Anonymous Monk on Mar 11, 2011 at 10:11 UTC
Re^5: One bird, two Unicode names by ikegami (Patriarch) on Mar 11, 2011 at 20:10 UTC
Re: One bird, two Unicode names by vkon (Curate) on Mar 11, 2011 at 08:08 UTC
Unicode::String is obsolete and deprecated; it was used with Perl 5.6.0, when Perl had weak internal Unicode support. Use `Encode` instead, which is also more convenient because it comes with Perl.
Re^2: One bird, two Unicode names by RCH (Sexton) on Mar 11, 2011 at 08:26 UTC
Could you explain how to apply "Use Encode" in my particular case? The relevant bit of the program looks like this:

    my $zip = Archive::Zip->new( $infile );
    my $content = $zip->contents('content.xml');
    my $workbook_ref = read_xml_string($content);
    foreach my $sheet ( sort keys %$workbook_ref ) {
        foreach my $row ( @{$$workbook_ref{$sheet}} ) {
            foreach my $cell_contents ( @{$row} ) {
                next unless defined( $cell_contents );
                DECODE $cell_contents SOMEHOW

Thanks in advance,
RichardH

P.S. For `DECODE $cell_contents SOMEHOW` I tried this:

    use Encode;
    my $octets = encode("iso-8859-1", $string);
    return( $octets );

The result of the above:

    GÃ¼ldenstÃ¤dtâ€™s Redstart => Güldenstädt?s Redstart

i.e. "ü" for "Ã¼" (good) and "ä" for "Ã¤" (good), but "?" for "â€™" (bad). "â€™" is RIGHT SINGLE QUOTATION MARK, says charinfo(8217).
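The "?" in that result is consistent with how `Encode` behaves by default: ISO 8859-1 has no slot for U+2019, so `encode` substitutes a question mark. A small sketch, assuming the cell really holds the characters ü, ä and U+2019 (the sample byte string is an assumption, not data from the spreadsheet):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the name as UTF-8 encoded bytes.
my $utf8_bytes = "G\xC3\xBCldenst\xC3\xA4dt\xE2\x80\x99s Redstart";

# decode() turns encoded bytes into a string of characters.
my $chars = decode('UTF-8', $utf8_bytes);

# encode() goes the other way. ISO 8859-1 cannot represent U+2019,
# so with the default CHECK setting a "?" is substituted for it.
my $latin1 = encode('iso-8859-1', $chars);

# cp1252 does have RIGHT SINGLE QUOTATION MARK, at byte 0x92.
my $cp1252 = encode('cp1252', $chars);

printf "latin1 contains '?': %s\n", ($latin1 =~ /\?/ ? 'yes' : 'no');
printf "cp1252 contains '?': %s\n", ($cp1252 =~ /\?/ ? 'yes' : 'no');
```

So the ü and ä survive the trip to ISO 8859-1 because they exist there, while the curly apostrophe is lost because it does not.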
Re: One bird, two Unicode names by Eliya (Vicar) on Mar 11, 2011 at 16:14 UTC
The first step in debugging Unicode/encoding issues is to check what you actually have to start with. So, use Devel::Peek to print (`Dump`) the original `$string`, and look at the PV entry.
Re^2: One bird, two Unicode names by RCH (Sexton) on Mar 11, 2011 at 16:44 UTC
That's very helpful.

    use Devel::Peek;
    Dump $cell_contents;

shows that the problem boils down to the difference between

    UTF8 "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart"

and

    UTF8 "G\x{fc}ldenst\x{e4}dt's Redstart"

It also makes me think that perhaps half my difficulties are due to the fact that I'm printing to STDOUT, which I view in my (non-Unicode-aware) programmer's editor :-(

RichardH
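Where the viewer cannot be trusted to display non-ASCII text, one alternative to `Dump` is to print the code points as ASCII escapes. This is only a sketch; `show_escaped` is a hypothetical helper name, and the escape format simply imitates the Devel::Peek output above.

```perl
use strict;
use warnings;

# Return a pure-ASCII rendering of a string: ASCII characters are kept,
# everything else is shown as \x{...} with its hexadecimal code point.
sub show_escaped {
    my ($string) = @_;
    return join '', map {
        my $code = ord($_);
        $code < 0x80 ? $_ : sprintf('\\x{%x}', $code);
    } split //, $string;
}

my $list1_name = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";
my $list2_name = "G\x{fc}ldenst\x{e4}dt's Redstart";

print show_escaped($list1_name), "\n";
print show_escaped($list2_name), "\n";
```

Because the output is plain ASCII, it looks the same in any editor or terminal, which makes the one differing character easy to spot.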
Re^3: One bird, two Unicode names by ikegami (Patriarch) on Mar 11, 2011 at 20:17 UTC
Aha! As I was starting to suspect, the problem is with your output! You should have been getting "Wide character in print" warnings, though. The following will fix your display problems if you're using STDOUT:

    use open ':std', ':locale';

As for handling the fancy quotes, you could fix characters individually (e.g. `s/\x{2019}/'/g;`) or you could use Text::Unidecode.
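To see what an output layer does to the string, here is a small sketch in which an in-memory filehandle stands in for STDOUT (an assumption made so the written bytes can be inspected). It uses an explicit UTF-8 layer rather than `:locale`, which assumes the viewer expects UTF-8.

```perl
use strict;
use warnings;

my $name = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";

# The encoding layer converts characters to UTF-8 bytes on the way out,
# which is also what silences "Wide character in print".
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open in-memory file: $!";
print {$out} $name;
close $out;

printf "%d characters were written as %d bytes\n",
    length($name), length($buffer);

# For a real script writing to a UTF-8 terminal or file viewer, the
# corresponding call would be:
#   binmode(STDOUT, ':encoding(UTF-8)');
```

The character count and the byte count differ because ü, ä and the curly apostrophe each take more than one byte in UTF-8.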
Re^4: One bird, two Unicode names by RCH (Sexton) on Mar 12, 2011 at 11:10 UTC
Re^5: One bird, two Unicode names by ikegami (Patriarch) on Mar 12, 2011 at 17:59 UTC
Re^3: One bird, two Unicode names by Eliya (Vicar) on Mar 11, 2011 at 17:10 UTC
So, one way to make the two strings equal would be to replace the Unicode apostrophe U+2019 found in the first string with the ASCII single quote used in the second string:

    $s1 =~ s/\x{2019}/'/g;

(just in case it's not obvious...)
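The same idea can be wrapped in a function that builds a comparison key for each name, so the original spelling is kept for the consolidated list. This is a sketch only: `comparison_key` is a hypothetical name, and the NFC step is an extra assumption, added in case one list stores accented letters as a base letter plus a combining mark.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Build a key used only for comparing names; the original string is untouched.
sub comparison_key {
    my ($name) = @_;
    my $key = NFC($name);               # compose base letter + combining mark
    $key =~ tr/\x{2018}\x{2019}/''/;    # curly single quotes -> ASCII apostrophe
    return $key;
}

# Hypothetical sample names: precomposed letters with a curly apostrophe,
# versus decomposed letters with an ASCII apostrophe.
my $list1_name = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";
my $list2_name = "Gu\x{308}ldensta\x{308}dt's Redstart";

if (comparison_key($list1_name) eq comparison_key($list2_name)) {
    print "same species name\n";
}
else {
    print "names differ\n";
}
```

Comparing keys rather than editing the names in place means any remaining differences reported between the two lists are real ones.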
Re^4: One bird, two Unicode names by RCH (Sexton) on Mar 11, 2011 at 18:04 UTC
Re: One bird, two Unicode names by JavaFan (Canon) on Mar 11, 2011 at 11:14 UTC
You could upgrade or downgrade all strings before comparing them. Note that downgrading isn't going to work if they contain characters with code points above 255, but then such strings don't have Latin-1 equivalents anyway. See `utf8`. (But if you go this way, don't put a `use utf8;` in your code.)
Re^2: One bird, two Unicode names by RCH (Sexton) on Mar 11, 2011 at 15:52 UTC
I tried utf8::downgrade with FAIL_OK set to 1:

    my $success = utf8::downgrade($string, 1);
    return($string);

It didn't seem to do anything. Given `GÃ¼ldenstÃ¤dtâ€™s Redstart`, it returned `GÃ¼ldenstÃ¤dtâ€™s Redstart`. If I set FAIL_OK to 0, it died with

    Wide character in subroutine entry

RichardH
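That outcome fits how `utf8::upgrade` and `utf8::downgrade` work: they change only Perl's internal storage of a string, never the characters in it, and a downgrade is impossible once a code point above 255 (such as U+2019) is present. A short sketch with made-up sample strings:

```perl
use strict;
use warnings;

# All code points are <= 255, so either internal representation is possible.
my $name = "G\xFCldenst\xE4dt";
my $copy = $name;
utf8::upgrade($copy);    # switch internal storage only

printf "upgraded copy still equal: %s\n", ($copy eq $name ? 'yes' : 'no');
printf "internal flag differs:     %s\n",
    (utf8::is_utf8($copy) && !utf8::is_utf8($name) ? 'yes' : 'no');

# U+2019 is above 255, so the string cannot be downgraded. With FAIL_OK
# true, downgrade returns false and leaves the string as it was.
my $wide    = "dt\x{2019}s";
my $success = utf8::downgrade($wide, 1);

printf "downgrade succeeded:       %s\n", ($success ? 'yes' : 'no');
```

Since neither function changes what the string contains, they cannot turn a curly apostrophe into a straight one; that still needs an explicit substitution.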