Re: Unicode words match and catch

While this is a confusing topic, at the root it's not too hard as long as you know the input encoding and adjust for the output. UTF-8 probably covers all the characters you want. You will have to do quite a bit of reading to really understand what you're doing here, though. This is the nuclear option for explanations: tchrist on UTF-8 in Perl. Normally the only parts you have to understand are: decode your input (from UTF-8, or whatever encoding it arrives in) into Perl character strings, do your business in Perl, and encode your output to UTF-8 (or, in your case, ASCII HTML entities).

Another basic caveat: if you expect to see UTF-8 rendered correctly in a display layer like a terminal, that layer has to be set to the encoding you are emitting. Unicode itself is the character set, not an encoding. You will be dealing with *an* encoding at input and *an* encoding at output, and they may or may not be the same: Latin-1, CP-1252, UTF-8, UTF-16, Big5, etc.

This little snippet might get you started. I had to use <pre/> tags because PM's <code/> tags don't like wide characters. :P

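use utf8;        # this source file, including __DATA__, is UTF-8
use strictures;  # CPAN module: strict plus fatal warnings
use HTML::Entities "encode_entities_numeric";

# Encode character strings to UTF-8 bytes on the way out.
binmode STDOUT, ":encoding(UTF-8)";
# OR use Encode, print encode_utf8(...)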

while (<DATA>)
{
    chomp;
    next unless /\w/;
    print $_, $/;
    print "  -> ",  length, " characters long", $/;
    print "  -> ", encode_entities_numeric($_), $/;
}

__DATA__
antennæ
עברית
Ελληνικά
العَرَبِية‎

antennæ
  -> 7 characters long
  -> antenn&#xE6;
עברית
  -> 5 characters long
  -> &#x5E2;&#x5D1;&#x5E8;&#x5D9;&#x5EA;
Ελληνικά
  -> 8 characters long
  -> &#x395;&#x3BB;&#x3BB;&#x3B7;&#x3BD;&#x3B9;&#x3BA;&#x3AC;
العَرَبِية‎
  -> 11 characters long
  -> &#x627;&#x644;&#x639;&#x64E;&#x631;&#x64E;&#x628;&#x650;&#x64A;&#x629;&#x200E;
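
The snippet above leans on use utf8 to get characters out of __DATA__. When the text arrives as raw bytes instead (a file, a socket, a form post), the decode / work / encode cycle has to be spelled out. Here is a rough sketch of that, assuming the input really is UTF-8. The sub name words_as_entities and the sample byte string are mine, made up for illustration; only decode and encode_entities_numeric come from the modules.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities_numeric);

# Hypothetical helper: raw UTF-8 bytes in, list of ASCII-safe words out.
sub words_as_entities {
    my ($bytes) = @_;

    # 1. Decode: bytes become a Perl character string.
    my $text = decode('UTF-8', $bytes);

    # 2. Work in characters: \w now matches Unicode word characters.
    my @words = $text =~ /(\w+)/g;

    # 3. Encode for output: non-ASCII characters become numeric entities.
    return map { encode_entities_numeric($_) } @words;
}

# Sample input: "Ελληνικά antennæ" written out as the UTF-8 bytes
# a filehandle without an :encoding layer would hand you.
my $bytes = "\xCE\x95\xCE\xBB\xCE\xBB\xCE\xB7\xCE\xBD\xCE\xB9\xCE\xBA\xCE\xAC"
          . " antenn\xC3\xA6";

print "$_\n" for words_as_entities($bytes);
```

Because the output is pure ASCII, no binmode on STDOUT is needed here. If you skip the decode step, \w sees individual bytes rather than characters and the matches fall apart, which is usually where "Unicode words won't match" problems start.
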

Further reading: Encode, utf8, perlunitut. Branch out from those as desired.
