in reply to Re^2: One bird, two Unicode names
in thread One bird, two Unicode names

Aha! As I was starting to suspect, the problem is with your output. You should have been getting "Wide character in print" warnings, though. The following will fix your display problems if you're printing to STDOUT.

use open ':std', ':locale';

As for handling the fancy quotes, you could fix characters individually (e.g. s/\x{2019}/'/g;) or you could use Text::Unidecode.
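
A minimal sketch of the first option, using only core Perl. The helper name normalize_quotes and the sample strings are made up for illustration, and the tr/// list covers just the four common curly quotes; a real spreadsheet may well contain others.

```perl
use strict;
use warnings;

# Hypothetical helper: map typographic quotes to their ASCII equivalents.
# U+2018/U+2019 are the single curly quotes, U+201C/U+201D the double ones.
sub normalize_quotes {
    my ($s) = @_;
    $s =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $s;
}

my $plain = "G\x{FC}ldenst\x{E4}dt's Redstart";         # U+0027 APOSTROPHE
my $curly = "G\x{FC}ldenst\x{E4}dt\x{2019}s Redstart";  # U+2019 RIGHT SINGLE QUOTATION MARK

# After normalization the two spellings compare equal.
my $same = normalize_quotes($curly) eq normalize_quotes($plain);
print $same ? "same name\n" : "different names\n";
```

Normalizing both sides before comparing means it does not matter which spreadsheet uses which quote.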

Re^4: One bird, two Unicode names
by RCH (Sexton) on Mar 12, 2011 at 11:10 UTC
    The other thing that I have to confess is that I'm doing this in W*ndows. After struggling with
    use open ':std', ':locale';
    (fails "Cannot figure out an encoding to use")
    and
    Win32::Locale;
    and trying to understand layers, open, and PerlIO
    my head is spinning, and I'm nowhere nearer to my goal.
    (a neat way of getting both spreadsheets to return "Güldenstädt's Redstart" for "Phoenicurus erythrogastrus")

    The ugly hack that all this started with does work, so I guess I'll just stick with ugliness

    RichardH
      Probably
      use open ':std', ':encoding(cp1252)';
      for Windows.
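
      For what it's worth, here is a rough sketch of what that layer does to the characters, using the core Encode module. The sample string is just the bird name from this thread, and the hex dump is only there to make the bytes visible.

      ```perl
      use strict;
      use warnings;
      use Encode qw(encode);   # core module

      # Character string: U+00FC, U+00E4 and U+2019 are the non-ASCII characters.
      my $name = "G\x{FC}ldenst\x{E4}dt\x{2019}s Redstart";

      # Roughly what the :encoding(cp1252) layer does on output:
      # characters in, cp1252 bytes out.
      my $bytes = encode('cp1252', $name);

      # Show the bytes as hex; in cp1252, U+2019 becomes the single byte 0x92.
      my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
      print "$hex\n";
      ```

      All three characters exist in cp1252, so each one comes out as a single byte and nothing is lost on the way to STDOUT.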
        Thank you. That prints to STDOUT very nicely. So one problem resolved.
        But it doesn't solve the original problem. Here's a worked example.
        When I parse my two "authoritative" spreadsheets of names of Palearctic birds, I would hope that both authorities would have the same common name for the species whose Latin binomial is Phoenicurus erythrogastrus.
        But they don't.
        One calls it Güldenstädt's Redstart
        The other Güldenstädt’s Redstart
        (That's the difference between 0027;APOSTROPHE; and 2019;RIGHT SINGLE QUOTATION MARK (fide C:\Perl\lib\unicore\UnicodeData.txt))
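
        A quick way to see such differences, sketched with the core charnames module: list every character above ASCII together with its Unicode name. The helper name non_ascii_report is invented for this example.

        ```perl
        use strict;
        use warnings;
        use charnames ();   # core module; provides charnames::viacode

        # Hypothetical helper: return one "U+XXXX NAME" entry for each
        # character in the string whose code point is above 127.
        sub non_ascii_report {
            my ($s) = @_;
            my @out;
            for my $ch (split //, $s) {
                my $cp = ord($ch);
                next if $cp <= 127;
                my $name = charnames::viacode($cp);
                $name = '(unnamed)' unless defined $name;
                push @out, sprintf('U+%04X %s', $cp, $name);
            }
            return @out;
        }

        my @report = non_ascii_report("G\x{FC}ldenst\x{E4}dt\x{2019}s Redstart");
        print "$_\n" for @report;
        ```

        Running the same helper on the other spelling lists only the two umlauts, which pins the mismatch on the quote character.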
        My current solution is to run an s/// on each $string from each OOorg spreadsheet, as follows
        # Needs: use Unicode::UCD 'charinfo';
        $string =~ s/(\P{InBasic_Latin})/           # Look for code points that are not in Basic_Latin; for example the sign ü
            defined( $subs{ord($1)} )               # If $1 = ü, then ord($1) = 252. We ask: is there a value in %subs for key 252?
            ? $subs{ord($1)}                        # If yes ( $subs{252} = ü ), then return ü
            : ' <$subs{'                            # If no, then return <$subs{ ...
            . ord($1)                               # 252 ...
            . "} = ${charinfo(ord($1))}{name};> "   # } = LATIN SMALL LETTER U WITH DIAERESIS;>
        /egx;                                       # /egx = e execute, g repeated, x spaced-out regex
        # If a character was found that is absent from the hash, then the outfile will contain "<$subs{8224} = DAGGER;>" etc
        # You have to write into make_the_subs_hash() a line like this $subs{8224} = '¦'; . That's at [1] below
        # Then re-run the script with the extended %subs
        return($string);
        Where the hash %subs is made as follows
        foreach my $i (126 ... 255) { $subs{$i} = chr($i); }
        # Plus higher value code points found empirically; see [1] above
        $subs{338}  = 'OE'; # LATIN CAPITAL LIGATURE OE
        $subs{339}  = 'oe'; # LATIN SMALL LIGATURE OE
        $subs{8217} = "'";  # RIGHT SINGLE QUOTATION MARK
        $subs{8224} = '×';  # DAGGER
        Ugly, but at least everyone can see what is going on
        Richard H