Re: One bird, two Unicode names

Re^2: One bird, two Unicode names by RCH (Sexton) on Mar 11, 2011 at 16:44 UTC
That's very helpful. `use Devel::Peek; Dump $cell_contents;` shows that the problem boils down to the difference between the UTF8 string "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart" and the UTF8 string "G\x{fc}ldenst\x{e4}dt's Redstart". It also makes me think that perhaps half my difficulties are due to the fact that I'm printing to STDOUT, which I then view in my (non-Unicode-aware) programmer's editor :-( RichardH
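Since a non-Unicode-aware editor hides exactly this kind of difference, one workaround is to turn every non-ASCII character into a `\x{...}` escape before printing, so the output is plain ASCII. This is only a minimal sketch: `visible` is a hypothetical helper name, and the two strings are the ones quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: render every non-ASCII character as a \x{...} escape
# so the result is safe to view in an editor that only understands ASCII.
sub visible {
    my ($text) = @_;
    (my $copy = $text) =~ s/([^\x00-\x7F])/sprintf('\\x{%x}', ord $1)/ge;
    return $copy;
}

my $curly    = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";
my $straight = "G\x{fc}ldenst\x{e4}dt's Redstart";

print visible($curly),    "\n";
print visible($straight), "\n";
```

With the escapes spelled out, the `\x{2019}` in the first string stands out against the plain `'` in the second, whatever the editor does with the raw characters.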
Re^3: One bird, two Unicode names by ikegami (Patriarch) on Mar 11, 2011 at 20:17 UTC
Aha! As I was starting to suspect, the problem is with your output! You should have been getting "Wide character in print" warnings, though. The following will fix your display problems if you're using STDOUT: `use open ':std', ':locale';` As for handling the fancy quotes, you could fix characters individually (e.g. `s/\x{2019}/'/g;`) or you could use Text::Unidecode.
Re^4: One bird, two Unicode names by RCH (Sexton) on Mar 12, 2011 at 11:10 UTC
The other thing that I have to confess is that I'm doing this on W*ndows. After struggling with `use open ':std', ':locale';` (which fails with "Cannot figure out an encoding to use") and `Win32::Locale;`, and trying to understand layers, open, and PerlIO, my head is spinning and I'm no nearer to my goal (a neat way of getting both spreadsheets to return "Güldenstädt's Redstart" for "Phoenicurus erythrogastrus"). The ugly hack that all this started with does work, so I guess I'll just stick with ugliness. RichardH
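When `:locale` cannot work out an encoding, one possible fallback is to name the encoding explicitly instead of asking Perl to guess. The sketch below assumes that whatever reads the output expects UTF-8, which may not hold for a Windows console. It writes through an in-memory filehandle first, so the encoded bytes can be checked without a console or editor getting in the way.

```perl
use strict;
use warnings;

my $name = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";

# Write through an explicit UTF-8 layer into an in-memory buffer so the
# resulting bytes can be inspected without involving a console or editor.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "Cannot open buffer: $!";
print {$out} $name;
close $out;

# Show the encoded bytes as hex, e.g. c3 bc for U+00FC.
print join( ' ', map { sprintf( '%02x', ord $_ ) } split //, $buffer ), "\n";

# The same layer can be applied to STDOUT (assumes the reader of the output
# expects UTF-8; a Windows console may need a code page such as cp850 instead).
binmode STDOUT, ':encoding(UTF-8)';
print $name, "\n";
```

Naming the layer also silences the "Wide character in print" warning, because Perl no longer has to guess how to serialise characters above 0xFF.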
Re^5: One bird, two Unicode names by ikegami (Patriarch) on Mar 12, 2011 at 17:59 UTC
Re^6: One bird, two Unicode names by RCH (Sexton) on Mar 13, 2011 at 19:07 UTC
Some notes below your chosen depth have not been shown here
Re^3: One bird, two Unicode names by Eliya (Vicar) on Mar 11, 2011 at 17:10 UTC
So, one way to make the two strings equal would be to replace the Unicode apostrophe U+2019 found in the first string with the ASCII single quote used in the second string: `$s1 =~ s/\x{2019}/'/g;` (just in case it's not obvious...)
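If other typographic quotes might turn up as well, the same idea extends to a small transliteration. This is a sketch only: `ascii_quotes` is a hypothetical helper, and the set of characters it maps is an assumption about what the spreadsheets might contain.

```perl
use strict;
use warnings;

# Hypothetical helper: map common typographic quotes to their ASCII forms.
sub ascii_quotes {
    my ($text) = @_;
    (my $copy = $text) =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $copy;
}

my $s1 = "G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart";
my $s2 = "G\x{fc}ldenst\x{e4}dt's Redstart";

print ascii_quotes($s1) eq $s2 ? "equal\n" : "different\n";
```

The accented letters are left alone, so only the punctuation is normalised before the comparison.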
Re^4: One bird, two Unicode names by RCH (Sexton) on Mar 11, 2011 at 18:04 UTC
Yes, that's how I've been doing it:

```perl
$editted_copy = $string;
# Look for codepoints not in Basic Latin
while ( $string =~ s/(\P{InBasic_Latin})// ) {
    my $U_char      = $1;
    my $U_codepoint = ord($U_char);
    # and try to replace them
    if ( defined $subs{$U_codepoint} ) {
        $editted_copy =~ s/$U_char/$subs{$U_codepoint}/;
    }
    else {
        # add the missing U_codepoint by hand to the %subs hash
        # and iterate till no more U_codepoints cause problems
    }
}
```

(I was just hoping for something prettier) RichardH
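One possibly tidier variant is to do the lookup inside a single `s///ge` pass and collect whatever is still unmapped, rather than consuming the string one character at a time. This is a sketch under assumptions: the `%subs` table here is illustrative and keyed by character rather than by code point, and `normalise` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Illustrative table (an assumption, not the poster's actual %subs):
# keyed by character rather than by code point.
my %subs = (
    "\x{2018}" => "'",
    "\x{2019}" => "'",
);

# Hypothetical helper: replace mapped characters in one pass and collect
# the code points of any non-ASCII characters that have no mapping yet.
sub normalise {
    my ($text) = @_;
    my %unmapped;
    (my $copy = $text) =~ s{([^\x00-\x7F])}{
        exists $subs{$1} ? $subs{$1} : do { $unmapped{ ord $1 }++; $1 }
    }gex;
    return ( $copy, sort { $a <=> $b } keys %unmapped );
}

my ( $clean, @todo ) = normalise("G\x{fc}ldenst\x{e4}dt\x{2019}s Redstart");
printf "U+%04X has no mapping\n", $_ for @todo;
```

Unmapped characters are passed through unchanged and reported, so the table can still be grown by hand as before, but without a second copy of the string or a loop that destroys the original.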