This is a confusing topic, but at root it's not too hard as long as you know the input encoding and adjust for the output. UTF-8 probably covers all the characters you want. You will have to do quite a bit of reading to understand what you're doing here, though. This is the nuclear option for explanations: tchrist on UTF-8 in Perl.

Normally the only parts you really have to understand are: decode your input (from UTF-8 or whatever encoding it arrives in) into Perl characters, do your business in Perl, and encode your output to UTF-8 (or in your case, ASCII HTML entities).

And another basic caveat: if you expect to be able to see UTF-8 in a display layer like a terminal, the layer has to be aware of the encoding you want to use. Unicode is the umbrella standard, not an encoding in itself. You will be dealing with *an* encoding at input and *an* encoding at output that may or may not be the same: Latin-1, CP-1252, UTF-8, UTF-16, Big5, etc.
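To make the "an encoding in, an encoding out" point concrete, here is a minimal sketch (not from the original post) that dumps the bytes a single character turns into under a few encodings. The helper name hex_bytes is made up for illustration; Encode's encode() is the only library call involved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the bytes a character string
# becomes when encoded with the named encoding.
sub hex_bytes {
    my ($encoding, $chars) = @_;
    return join " ", map { sprintf "%02X", ord } split //, encode($encoding, $chars);
}

my $ae = "\x{E6}";    # U+00E6 LATIN SMALL LETTER AE, one character

printf "%-10s %s\n", $_, hex_bytes($_, $ae)
    for qw(ISO-8859-1 cp1252 UTF-8 UTF-16BE);
```

The same character comes out as one byte in Latin-1 and CP-1252, two bytes in UTF-8, and two different bytes in UTF-16BE, which is why you have to know which encoding you were handed before you can decode it.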

This little snippet might get you started. I had to use <pre/> tags because PM's <code/> tags don't like wide characters. :P

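use utf8;        # the source, including __DATA__, is UTF-8
use strictures;  # CPAN module: strict plus fatal warnings
use HTML::Entities "encode_entities_numeric";

binmode STDOUT, ":encoding(UTF-8)";  # encode characters to UTF-8 on output
# OR: use Encode; print encode_utf8(...)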

while (<DATA>)
{
    chomp;
    next unless /\w/;
    print $_, $/;
    print "  -> ",  length, " characters long", $/;
    print "  -> ", encode_entities_numeric($_), $/;
}

__DATA__
antennæ
עברית
Ελληνικά
العَرَبِية‎

Output:

antennæ
  -> 7 characters long
  -> antenn&#xE6;
עברית
  -> 5 characters long
  -> &#x5E2;&#x5D1;&#x5E8;&#x5D9;&#x5EA;
Ελληνικά
  -> 8 characters long
  -> &#x395;&#x3BB;&#x3BB;&#x3B7;&#x3BD;&#x3B9;&#x3BA;&#x3AC;
العَرَبِية‎
  -> 11 characters long
  -> &#x627;&#x644;&#x639;&#x64E;&#x631;&#x64E;&#x628;&#x650;&#x64A;&#x629;&#x200E;
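
The snippet above leans on use utf8 and binmode to do the decoding and encoding. When the input shows up as raw bytes instead, the decode step has to be spelled out. A minimal sketch of the whole decode, work, encode cycle, assuming the input is UTF-8 (the byte string here is a made-up stand-in for whatever you read from a file or socket):

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(encode_entities_numeric);

# Hypothetical input: "antennæ" as raw UTF-8 bytes. The source encoding
# is something you have to know; it is assumed to be UTF-8 here.
my $bytes = "antenn\xC3\xA6";

# 1. Decode: bytes in a known encoding become Perl characters.
my $chars = decode("UTF-8", $bytes);

# 2. Work on characters, not bytes.
printf "%d bytes, %d characters\n", length $bytes, length $chars;

# 3. Encode on the way out: UTF-8 bytes, or pure-ASCII HTML entities.
my $utf8_out = encode("UTF-8", $chars);
my $html_out = encode_entities_numeric($chars);

print "$html_out\n";
```

The byte string is 8 long but the decoded string is 7, matching the character count in the output above. Swap the encoding name passed to decode() for whatever your real input uses.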

Further reading: Encode, utf8, perlunitut. Branch out from those as desired.


In reply to Re: Unicode words match and catch by Your Mother
in thread Unicode words match and catch by kepler
