Re^5: One bird, two Unicode names

Re^6: One bird, two Unicode names by RCH (Sexton) on Mar 13, 2011 at 19:07 UTC
Thank you. That prints to STDOUT very nicely, so one problem is resolved. But it doesn't solve the original problem. Here's a worked example.

When I parse my two "authoritative" spreadsheets of names of Palearctic birds, I would hope that both authorities would have the same common name for the species whose Latin binomial is Phoenicurus erythrogastrus. But they don't. One calls it

Güldenstädt's Redstart

The other

Güldenstädt’s Redstart

(That's the difference between 0027;APOSTROPHE; and 2019;RIGHT SINGLE QUOTATION MARK, fide C:\Perl\lib\unicore\UnicodeData.txt.)

My current solution is to do a s/// on each $string from each OOorg spreadsheet, as follows:

```perl
use Unicode::UCD 'charinfo';    # provides charinfo()

$string =~ s/(\P{InBasic_Latin})/             # Look for code points that are not in Basic_Latin; for example the sign ü
    defined( $subs{ord($1)} )                 # if $1 = ü, then ord($1) = 252. We ask: is there a value in %subs for key 252?
    ? $subs{ord($1)}                          # If yes ( $subs{252} = ü ), then return ü
    : ' <$subs{'                              # if no, then return <$subs{ ...
      . ord($1)                               # 252 ...
      . "} = ${charinfo(ord($1))}{name};> "   # } = LATIN SMALL LETTER U WITH DIAERESIS;>
    /egx;                                     # /egx = e execute, g repeated, x spaced-out regex
# If a sign was found that is absent from the hash, the outfile will contain "<$subs{8224} = DAGGER;>" etc.
# You then have to write a line like $subs{8224} = '¦'; into make_the_subs_hash(). That's at [1] below.
# Then re-run the script with the extended %subs.
return($string);
```

Where the hash %subs is made as follows:

```perl
foreach my $i (126 .. 255) { $subs{$i} = chr($i); }
# Plus higher value code points found empirically; see [1] above
$subs{338}  = 'OE';   # LATIN CAPITAL LIGATURE OE
$subs{339}  = 'oe';   # LATIN SMALL LIGATURE OE
$subs{8217} = "'";    # RIGHT SINGLE QUOTATION MARK
$subs{8224} = '×';    # DAGGER
```

Ugly, but at least everyone can see what is going on.

Richard H
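For readers who want to try the substitution idea on the two redstart names, here is a minimal self-contained sketch. It is not RCH's script: the table is a cut-down, hypothetical `%subs`, `normalise_name` is an invented helper name, and the character class `[^\x00-\x7F]` is used as a stand-in for `\P{InBasic_Latin}` (the Basic Latin block is U+0000 to U+007F). Only core modules are assumed.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Hypothetical, cut-down substitution table (the real %subs is larger).
my %subs = map { $_ => chr($_) } 128 .. 255;   # keep Latin-1 letters such as ü and ä
$subs{0x2019} = "'";                           # RIGHT SINGLE QUOTATION MARK -> APOSTROPHE

# Hypothetical helper: replace each non-ASCII character via %subs,
# or flag it with its code point and Unicode name if it is not in the table.
sub normalise_name {
    my ($string) = @_;
    $string =~ s{([^\x00-\x7F])}{
        my $cp = ord $1;
        defined $subs{$cp}
            ? $subs{$cp}
            : sprintf '<$subs{%d} = %s;>', $cp, (charinfo($cp) || {})->{name} // 'UNKNOWN'
    }ge;
    return $string;
}

my $name_a = "G\x{FC}ldenst\x{E4}dt's Redstart";          # U+0027 APOSTROPHE
my $name_b = "G\x{FC}ldenst\x{E4}dt\x{2019}s Redstart";   # U+2019 RIGHT SINGLE QUOTATION MARK

print normalise_name($name_a) eq normalise_name($name_b) ? "same\n" : "different\n";
```

With this table the two spellings compare equal after normalisation, while a character that is missing from the table (a dagger, say) is left in the output as a visible marker so that it can be added later.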
Re^7: One bird, two Unicode names by ikegami (Patriarch) on Mar 13, 2011 at 19:11 UTC
The closest to a generic solution is Text::Unidecode's `unidecode`. An alternative tack would be to measure how different two strings are, and to consider the two the same if the difference is sufficiently small. One measure of difference is the Hamming distance, though strictly that is defined only for strings of equal length; an edit distance such as Levenshtein also handles insertions and deletions.
Re^8: One bird, two Unicode names by RCH (Sexton) on Mar 14, 2011 at 08:10 UTC
Better and better! I'd been wondering how to analyse the differences between names, e.g. Ammomanes cinctura is either the "Bar-tailed Desert lark" or the "Bar-tailed Lark". With your kind hint, I found my way to Text::Brew, which not only tells me that the distance (Bar-tailed Desert lark, Bar-tailed Lark) is 8, but also tells me that the path is to DEL < Desert> and to SUBST<l,L>.

Many thanks,
RichardH
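The distance of 8 quoted above can be checked without installing anything. The sketch below is a plain Levenshtein distance with unit costs for insert, delete and substitute; it is not Text::Brew's implementation and does not report the edit path, and `edit_distance` is a hypothetical helper name. It assumes only the core module List::Util.

```perl
use strict;
use warnings;
use List::Util 'min';

# Plain Levenshtein distance: insert, delete and substitute each cost 1.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. @t);                  # distances from the empty prefix of $s
    for my $i (1 .. @s) {
        my @curr = ($i);                   # distance to the empty prefix of $t
        for my $j (1 .. @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,             # delete a character of $s
                $curr[$j - 1] + 1,         # insert a character of $t
                $prev[$j - 1] + $cost,     # substitute, or keep a match
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Seven deletions (" Desert") plus one substitution (l -> L).
print edit_distance('Bar-tailed Desert lark', 'Bar-tailed Lark'), "\n";    # prints 8

# The two redstart spellings differ only in the apostrophe.
print edit_distance("G\x{FC}ldenst\x{E4}dt's Redstart",
                    "G\x{FC}ldenst\x{E4}dt\x{2019}s Redstart"), "\n";      # prints 1
```

A small threshold on this number (relative to the name length, perhaps) is one way to treat near-identical names as the same species label; where to set it is a judgment call that depends on the data.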