Dear PerlMonks
The following Unicode problem has me baffled. (I'm a biologist, not a computer person.)
I've got two "authoritative" lists of Palearctic birds.
I want to write one consolidated list, with notes on the various differences
between list 1 and list 2.
Both lists are in OpenOffice.org spreadsheet format.
I'm using Spreadsheet::ReadSXC qw(read_xml_string) to read each list.
Then I examine differences between names, etc.
But I'm getting a lot of false differences, because the same accented letter is represented differently in the two files.
For example, one file has this:
Güldenstädt's Redstart
The second file has this:
Güldenstädtâ??s Redstart
for the same species.
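To see exactly where two such names differ, it can help to print every character as a code point. The following is only a minimal sketch: the helper name codepoints is hypothetical, the strings are assumed to be decoded Perl character strings already, and the typographic apostrophe U+2019 in the second name is a guess at what the garbled characters above stand for.

```perl
use strict;
use warnings;
use utf8;    # the string literals in this script are written in UTF-8

# Hypothetical helper: list each character of a string as U+XXXX.
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $text;
}

my $plain = "Güldenstädt's";          # ASCII apostrophe, U+0027
my $curly = "Güldenstädt\x{2019}s";   # assumed: right single quotation mark

# Print both dumps so the differing positions can be compared by eye.
print codepoints($plain), "\n";
print codepoints($curly), "\n";
```

If the two dumps differ only at the apostrophe, the accented letters themselves are fine and only the punctuation needs mapping.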
I've tried replacing the UTF-8 characters with their ISO 8859-1 equivalents, thus:
$string =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
And I've tried
use Unicode::Normalize 'normalize';
And
use Unicode::String qw(utf8 latin1);
No joy.
So I'm doing this for every string retrieved from each spreadsheet:
use Unicode::UCD 'charinfo';
# Look for codepoints not in Basic Latin
while ( $string =~ s/(\P{InBasic_Latin})// ) {
    my $U_char = $1;
    # e.g. U_char = ü
    my $U_codepoint = ord($U_char);
    # so U_codepoint = ord(ü) = 252
    $string =~ s/$U_char/$subs{$U_codepoint}/;
    # and $subs{252} = ü
}
The hash %subs was made by:
foreach my $i (126 .. 255) {
    $subs{$i} = chr($i);
}
This works, but seems ugly and suboptimal.
Your help is much appreciated.
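One more conventional route would be to decode each string once, normalize it, and compare the results. This is only a sketch under assumptions: the helper name name_key is hypothetical, the sample byte strings are made up for illustration, and it assumes the strings arrive as raw UTF-8 bytes (if the reader module already returns decoded characters, the decode step should be skipped).

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: turn raw UTF-8 bytes into a canonical comparison key.
sub name_key {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> Perl characters
    $text = NFC($text);                   # compose base letter + combining mark
    $text =~ tr/\x{2018}\x{2019}/''/;     # typographic quotes -> ASCII apostrophe
    return $text;
}

# Made-up samples: precomposed letters with an ASCII apostrophe, versus
# combining diaereses (U+0308) with a typographic apostrophe (U+2019).
my $list1 = "G\xC3\xBCldenst\xC3\xA4dt's Redstart";
my $list2 = "Gu\xCC\x88ldensta\xCC\x88dt\xE2\x80\x99s Redstart";

print name_key($list1) eq name_key($list2) ? "same\n" : "different\n";
```

Comparing keys rather than the original strings leaves the spelling in each list untouched, so the consolidated list can still report which form each source used.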
Richard H