I am not familiar with this stuff either, but I think you
need to know a little more about your data: what is the encoding
of this text? Unicode or Shift-JIS (a Japanese encoding
that can also encode Roman characters)? Look for the
encoding declaration in your HTML document; for Shift-JIS it
should look like
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=x-sjis">.
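To check the declaration from a script rather than by eye, something like this could work. It is only a rough sketch: declared_charset is a name I made up, and the regex assumes the charset sits inside a META tag like the one shown, which real pages do not always guarantee.

```perl
use strict;
use warnings;

# Hypothetical helper: return the charset named in a META declaration,
# lowercased, or nothing if no declaration is found.
sub declared_charset {
    my ($html) = @_;
    if ($html =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([\w.:-]+)/i) {
        return lc $1;
    }
    return;
}

my $html = '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=x-sjis">';
my $charset = declared_charset($html);
print defined $charset ? "declared charset: $charset\n" : "no declaration found\n";
```

A page with no declaration at all just gets you undef back, and then you have to guess or ask whoever produced the data.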
If it is Shift-JIS, you can use the Shift-JIS table
to figure out how to separate the double-byte Roman letters
(130,96 to 130,154) from the rest; plain ASCII letters keep
their usual single-byte codes. You still have to decide what to
do with punctuation, spaces, $ and the like, which can belong
to either kind of text.
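Here is a rough sketch of the table approach, working on the raw bytes. The helper name split_sjis_runs and the three labels are my own invention, and the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are the usual Shift-JIS ones, so check them against the table for your variant. ASCII non-letters are tagged 'neutral' so that you can attach them to whichever side you decide.

```perl
use strict;
use warnings;

# Hypothetical helper: walk raw Shift-JIS bytes and return a list of
# [kind, bytes] runs, where kind is 'roman', 'neutral' or 'other'.
sub split_sjis_runs {
    my ($bytes) = @_;
    my @runs;
    my $i = 0;
    while ($i < length($bytes)) {
        my $b1    = ord(substr($bytes, $i, 1));
        my $width = 1;
        my $kind;
        if (($b1 >= 0x81 && $b1 <= 0x9F) || ($b1 >= 0xE0 && $b1 <= 0xFC)) {
            # Lead byte of a two-byte character (a truncated one stays 1 byte).
            $width = 2 if $i + 1 < length($bytes);
            my $b2 = $width == 2 ? ord(substr($bytes, $i + 1, 1)) : 0;
            # 130,96 to 130,154 is the double-byte Roman letter range.
            $kind = ($b1 == 130 && $b2 >= 96 && $b2 <= 154) ? 'roman' : 'other';
        }
        elsif ($b1 < 0x80) {
            # Single-byte ASCII: letters are Roman, the rest is undecided.
            $kind = chr($b1) =~ /[A-Za-z]/ ? 'roman' : 'neutral';
        }
        else {
            $kind = 'other';    # e.g. single-byte half-width katakana
        }
        my $chunk = substr($bytes, $i, $width);
        if (@runs && $runs[-1][0] eq $kind) { $runs[-1][1] .= $chunk }
        else                                { push @runs, [$kind, $chunk] }
        $i += $width;
    }
    return @runs;
}

# "Perl", a space, two kanji, a space, then two double-byte Roman letters.
my @runs = split_sjis_runs("Perl \x93\xFA\x96\x7B \x82\x60\x82\x61");
printf "%-7s %d byte(s)\n", $_->[0], length($_->[1]) for @runs;
```

Note that the second byte of a two-byte character can look like plain ASCII (0x7B in the sample is "{" on its own), which is why the loop consumes both bytes at once instead of testing each byte separately.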
Another way is to convert to Unicode using Text::Iconv, and then use
Unicode::CharName
to get the name of each character (if it starts with LATIN,
or FULLWIDTH LATIN for the double-byte forms, it's a Latin
character!).
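A sketch of the Unicode route follows. To keep it self-contained I have swapped in the Encode module and charnames::viacode, which ship with newer Perls (5.8 and up, as far as I know), in place of the two CPAN modules; the idea is the same. is_latin_char is a made-up name, and the check also accepts names starting with FULLWIDTH LATIN, since that is how the double-byte letters are named in Unicode.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Hypothetical helper: true if the character's Unicode name says LATIN.
sub is_latin_char {
    my ($char) = @_;
    my $name = charnames::viacode(ord($char));
    return (defined($name) && $name =~ /^(?:FULLWIDTH )?LATIN\b/) ? 1 : 0;
}

# Raw Shift-JIS bytes: "Perl", a space, two kanji, a space, full-width A.
my $bytes = "Perl \x93\xFA\x96\x7B \x82\x60";
my $text  = decode('shiftjis', $bytes);

for my $char (split //, $text) {
    printf "U+%04X %s\n", ord($char), is_latin_char($char) ? 'latin' : 'other';
}
```

Spaces and punctuation come out as 'other' here, so the question of which side they belong to is still yours to answer.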
In any case, please let us know how you solve that problem.
By the way, I think you need Perl 5.6 to do Unicode processing,
so be ready to upgrade if you haven't already.
Mirod, ready and fully functional (see picture)