in reply to One bird, two Unicode names

The first step to debugging Unicode/encoding issues is to check what you actually have to start with.

So, use Devel::Peek to print (Dump) the original $string, and look at the PV entry.

Re^2: One bird, two Unicode names
by RCH (Sexton) on Mar 11, 2011 at 16:44 UTC
    That's very helpful
    use Devel::Peek; Dump $cell_contents;
    shows that the problem boils down to the difference between
    UTF8 "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart"
    and UTF8 "G\x{fc}ldenst\x{e4}dt's Redstart"

    and also makes me think that perhaps 1/2 my difficulties are due to the fact that I'm printing to STDOUT which I visualise in my (non-Unicode-aware) programmer's editor :-(
    RichardH
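
The same comparison can be cross-checked without Devel::Peek by listing the non-ASCII codepoints of each string. This is only a sketch: non_ascii_codepoints() is a made-up helper name, and the two strings are copied from the Dump output quoted above.

```perl
use strict;
use warnings;

# Return a "U+XXXX" label for every character outside ASCII, in order.
# (Hypothetical helper, not something from the original thread.)
sub non_ascii_codepoints {
    my ($string) = @_;
    return map  { sprintf( 'U+%04X', ord($_) ) }
           grep { ord($_) > 0x7F }
           split //, $string;
}

my $first  = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";
my $second = "G\x{fc}ldenst\x{e4}dt's Redstart";

# The output is pure ASCII, so it is safe to read in any editor.
print join( ' ', non_ascii_codepoints($first) ),  "\n";
print join( ' ', non_ascii_codepoints($second) ), "\n";
```

The first line should list U+2019 in addition to the two umlauts; the second should list only the umlauts.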

      Aha! As I was starting to suspect, the problem is with your output. You should have been getting "Wide character in print" warnings, though. The following should fix your display problems if you're printing to STDOUT.

      use open ':std', ':locale';

      As for handling the fancy quotes, you could fix characters individually (e.g. s/\x{2019}/'/g;) or you could use Text::Unidecode.
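
Where ':locale' cannot work out an encoding, one alternative is to name the encoding explicitly on the output handle. A minimal sketch, assuming that whatever displays the output expects UTF-8 (that is an assumption, not something established in this thread):

```perl
use strict;
use warnings;

# Assumption: the terminal or editor reading STDOUT understands UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

my $name = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";

# With an encoding layer in place, characters above U+00FF no longer
# trigger "Wide character in print".
print "$name\n";
```

If the viewer expects a different encoding, the layer name would need to change accordingly.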

        The other thing that I have to confess is that I'm doing this in W*ndows. After struggling with
        use open ':std', ':locale';
        (fails "Cannot figure out an encoding to use")
        and
        Win32::Locale;
        and trying to understand layers, open, and PerlIO
        my head is spinning, and I'm nowhere nearer to my goal.
        (a neat way of getting both spreadsheets to return "Güldenstädt's Redstart" for "Phoenicurus erythrogastrus")

        The ugly hack that all this started with does work, so I guess I'll just stick with ugliness

        RichardH

      So, one way to make the two strings equal would be to replace the Unicode right single quotation mark U+2019 found in the first string with the ASCII single quote used in the second string:

      $s1 =~ s/\x{2019}/'/g;

      (just in case it's not obvious...)
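
As a quick check, here is a small sketch that applies that substitution and compares the result. The variable names $s1 and $s2 and the string values are taken from the Dump output earlier in the thread; nothing else is assumed.

```perl
use strict;
use warnings;

my $s1 = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";    # typographic quote
my $s2 = "G\x{fc}ldenst\x{e4}dt's Redstart";           # ASCII quote

print 'before: ', ( $s1 eq $s2 ? 'equal' : 'different' ), "\n";

# Replace every U+2019 with an ASCII single quote.
$s1 =~ s/\x{2019}/'/g;

print 'after:  ', ( $s1 eq $s2 ? 'equal' : 'different' ), "\n";
```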

        Yes, that's how I've been doing it:
        $editted_copy = $string;
        # Look for codepoints not in Basic Latin
        while ( $string =~ s/(\P{InBasic_Latin})// ) {
            my $U_char      = $1;
            my $U_codepoint = ord($U_char);
            # and try to replace them
            if ( defined $subs{$U_codepoint} ) {
                $editted_copy =~ s/\Q$U_char\E/$subs{$U_codepoint}/;
            }
            else {
                # add the missing U_codepoint by hand to the %subs hash
                # and iterate till no more U_codepoints causing problems
            }
        }

        (I was just hoping for something prettier)
        RichardH
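
One possibly tidier variant of the loop above is to key the table by character and do the whole job in a single s///g. This is a sketch only: the entries in %subs and the name normalise_punctuation() are illustrative assumptions, and the table would need to hold whatever characters actually turn up in the spreadsheets.

```perl
use strict;
use warnings;

# Hypothetical replacement table, keyed by character instead of codepoint.
my %subs = (
    "\x{2018}" => "'",    # LEFT SINGLE QUOTATION MARK
    "\x{2019}" => "'",    # RIGHT SINGLE QUOTATION MARK
    "\x{201C}" => '"',    # LEFT DOUBLE QUOTATION MARK
    "\x{201D}" => '"',    # RIGHT DOUBLE QUOTATION MARK
);

# Build one character class from the table's keys.
my $class = join '', map { quotemeta($_) } keys %subs;
my $re    = qr/([$class])/;

# Replace every character that has a table entry; leave the rest alone,
# so the umlauts in the bird's name survive.
sub normalise_punctuation {
    my ($string) = @_;
    $string =~ s/$re/$subs{$1}/g;
    return $string;
}

my $fixed = normalise_punctuation("G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart");
print $fixed eq "G\x{fc}ldenst\x{e4}dt's Redstart" ? "match\n" : "no match\n";
```

Characters missing from the table pass through unchanged, so they would still have to be spotted and added by hand, as in the original loop.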