While this is a confusing topic, at the root it's not too hard as long as you know the input encoding and adjust for the output. UTF-8 probably covers all the characters you want. You will have to do quite a bit of reading to understand what you're doing here, though. This is the nuclear option for explanations: tchrist on UTF-8 in Perl.

Normally the only parts you really have to understand are: decode your input from whatever encoding it arrives in (usually UTF-8) into Perl characters, do your business in Perl, and encode your output to UTF-8 (or in your case, ASCII HTML entities).

And another basic caveat. If you expect to be able to see UTF-8 in a display layer like a terminal, the layer has to be aware of the encoding you want to use. Unicode is the character set, not an encoding. You will be dealing with *an* encoding at input and *an* encoding at output that may or may not be the same: Latin-1, CP-1252, UTF-8, UTF-16, Big5, etc.
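To make the decode, work, encode cycle concrete, here is a minimal sketch that uses only the core Encode module. The Latin-1 input bytes and the variable names are assumptions for illustration, not anything from the original question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: "antennae" spelled with the ae ligature, as raw
# Latin-1 bytes (the ligature is the single byte 0xE6 in Latin-1).
my $input_bytes = "antenn\xE6";

# 1. Decode: input bytes -> Perl characters.
my $text = decode("ISO-8859-1", $input_bytes);

# 2. Do your business on characters, e.g. count them.
my $char_count = length($text);            # 7

# 3. Encode: Perl characters -> output bytes.
my $utf8_bytes = encode("UTF-8", $text);
my $byte_count = length($utf8_bytes);      # 8, the ligature takes two bytes

# ASCII-only HTML output: let Encode replace everything outside ASCII with
# a decimal character reference. LEAVE_SRC keeps $text from being modified.
# Note this does not escape &, < or >; HTML::Entities handles those too.
my $ascii_html = encode("ascii", $text,
    Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());   # antenn&#230;

print "$char_count characters, $byte_count bytes as UTF-8\n";
print "$ascii_html\n";
```

The input and output encodings are independent: here the text comes in as Latin-1 and goes out as UTF-8 or as plain ASCII, and the Perl code in the middle never needs to care which.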
This little snippet might get you started. I had to use <pre/> tags because PM's <code/> tags don't like wide characters. :P
use utf8;                            # source is UTF-8; DATA is then read as UTF-8 too
use strictures;
use HTML::Entities "encode_entities_numeric";
binmode STDOUT, ":encoding(UTF-8)";  # encode characters to UTF-8 on output
# OR use Encode, print encode_utf8(...)
while (<DATA>)
{
    chomp;
    next unless /\w/;                # skip lines with no word characters
    print $_, $/;
    print " -> ", length, " characters long", $/;
    print " -> ", encode_entities_numeric($_), $/;
}
__DATA__
antennæ
עברית
Ελληνικά
العَرَبِية
Output:
antennæ
 -> 7 characters long
 -> antennæ
עברית
 -> 5 characters long
 -> עברית
Ελληνικά
 -> 8 characters long
 -> Ελληνικά
العَرَبِية
 -> 11 characters long
 -> العَرَبِية
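The snippet's comment mentions Encode as an alternative to binmode. A rough sketch of that route, with an illustrative word list that is not part of the original snippet: encode each string to UTF-8 bytes yourself and print the bytes to a handle that has no encoding layer.

```perl
use strict;
use warnings;
use utf8;                          # string literals below are written in UTF-8
use Encode qw(encode_utf8);

# Illustrative sample words, taken from the DATA section of the snippet.
my @words = ("antennæ", "עברית");

for my $word (@words) {
    # Characters -> UTF-8 bytes, done explicitly instead of by an I/O layer.
    my $bytes = encode_utf8($word);

    # STDOUT has no :encoding layer here, so the bytes pass through as-is.
    print $bytes, "\n";
    print " -> ", length($word), " characters, ",
          length($bytes), " bytes as UTF-8\n";
}
```

Pick one approach per filehandle. Printing already-encoded bytes to a handle that also has an :encoding(UTF-8) layer encodes them twice and produces mojibake.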
Further reading: Encode, utf8, perlunitut. Branch out from those as desired.
In reply to Re: Unicode words match and catch
by Your Mother
in thread Unicode words match and catch
by kepler