RCH has asked for the wisdom of the Perl Monks concerning the following question:
The first file has this:

    Güldenstädt's Redstart

The second file has this:

    Güldenstädtâ??s Redstart

for the same species.

I've tried:

    $string =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

And I've tried:

    use Unicode::Normalize 'normalize';

And:

    use Unicode::String qw(utf8 latin1);

No joy.

    use Unicode::UCD 'charinfo';
    # Look for codepoints not in Basic Latin
    while ( $string =~ s/(\P{InBasic_Latin})// ) {
        my $U_char = $1;                             # e.g. U_char = ü
        my $U_codepoint = ord($U_char);              # so U_codepoint = ord(ü) = 252
        $string =~ s/$U_char/$subs{$U_codepoint}/;   # and $subs{252} = ü
    }

The hash %subs was made by:

    foreach my $i (126 ... 255) {
        $subs{$i} = chr($i);
    }

This works, but seems ugly and suboptimal.
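For comparison, here is a minimal sketch of one way to make the two spellings compare equal. It assumes both files are UTF-8 encoded and that the names differ only in the apostrophe (ASCII `'` versus U+2019) and possibly in composed versus decomposed accents; neither assumption is confirmed by the samples above. The helper `canonical_name` is a hypothetical name, and the two sample strings are written as byte escapes so the snippet does not depend on the encoding of the script itself.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: turn raw UTF-8 bytes into a canonical key for comparison.
sub canonical_name {
    my ($bytes) = @_;
    my $name = decode('UTF-8', $bytes);   # bytes -> Perl character string
    $name = NFC($name);                   # compose e.g. "u" + U+0308 into U+00FC
    $name =~ tr/\x{2018}\x{2019}/''/;     # fold curly single quotes to ASCII '
    return $name;
}

# Assumed contents of the two files, as raw UTF-8 bytes.
my $first  = "G\xC3\xBCldenst\xC3\xA4dt's Redstart";             # ASCII apostrophe
my $second = "G\xC3\xBCldenst\xC3\xA4dt\xE2\x80\x99s Redstart";  # U+2019 apostrophe

print canonical_name($first) eq canonical_name($second) ? "same\n" : "different\n";
```

Unlike the hand-written substitution on `[\xC2\xC3]`, `decode` also handles three-byte sequences such as the one for U+2019, which may be why the earlier attempts did not match. If the files turn out to be in another encoding, the first argument to `decode` would need to change.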
Replies are listed 'Best First'.

Re: One bird, two Unicode names
by ikegami (Patriarch) on Mar 11, 2011 at 07:23 UTC
by RCH (Sexton) on Mar 11, 2011 at 08:19 UTC
by ikegami (Patriarch) on Mar 11, 2011 at 08:38 UTC
by Anonymous Monk on Mar 11, 2011 at 10:11 UTC
by ikegami (Patriarch) on Mar 11, 2011 at 20:10 UTC

Re: One bird, two Unicode names
by vkon (Curate) on Mar 11, 2011 at 08:08 UTC
by RCH (Sexton) on Mar 11, 2011 at 08:26 UTC

Re: One bird, two Unicode names
by Eliya (Vicar) on Mar 11, 2011 at 16:14 UTC
by RCH (Sexton) on Mar 11, 2011 at 16:44 UTC
by ikegami (Patriarch) on Mar 11, 2011 at 20:17 UTC
by RCH (Sexton) on Mar 12, 2011 at 11:10 UTC
by ikegami (Patriarch) on Mar 12, 2011 at 17:59 UTC
by Eliya (Vicar) on Mar 11, 2011 at 17:10 UTC
by RCH (Sexton) on Mar 11, 2011 at 18:04 UTC

Re: One bird, two Unicode names
by JavaFan (Canon) on Mar 11, 2011 at 11:14 UTC
by RCH (Sexton) on Mar 11, 2011 at 15:52 UTC