I am not familiar with this stuff either, but I think you first need to know a little more about your data: what is the encoding of this text? Is it Unicode or Shift-JIS (a Japanese encoding that can also encode Roman characters)? Look for the encoding declaration in your HTML document; for Shift-JIS it should look like <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=x-sjis">.
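As a rough sketch of that first step, the declared charset can be pulled out of the document with a regex. The function name declared_charset is hypothetical, and a regex is only an approximation here; a real HTML parser would cope better with unusual markup.

```perl
use strict;
use warnings;

# Return the charset named in a META declaration (lowercased),
# or an empty return if none is found. Regex sketch only, not a full parser.
sub declared_charset {
    my ($html) = @_;
    if ($html =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([\w.:-]+)/i) {
        return lc $1;
    }
    return;
}

my $html = '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=x-sjis">';
my $charset = declared_charset($html);
print defined $charset ? "charset: $charset\n" : "no charset declared\n";
```

If nothing is declared you are back to guessing, so it is worth checking the HTTP headers as well.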
If it is Shift-JIS, you can use the Shift-JIS table to figure out how to separate the two-byte Roman characters (130,96 to 130,154, that is 0x82 0x60 to 0x82 0x9A) from the rest. You still have to decide what to do with punctuation, spaces, $ and the like, which can belong to either kind of text.
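Here is a minimal sketch of that byte-level approach, assuming the input is a raw Shift-JIS byte string. The name split_sjis_runs and the three labels (ascii, wide_roman, other) are made up for illustration; the byte ranges are the standard Shift-JIS lead and trail bytes, with the full-width letters at 0x82 0x60-0x79 (A-Z) and 0x82 0x81-0x9A (a-z).

```perl
use strict;
use warnings;

# One Shift-JIS character: a lead byte (0x81-0x9F or 0xE0-0xFC) plus a
# trail byte, or else any single byte (ASCII, half-width katakana, junk).
my $SJIS_CHAR = qr/[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]|[\x00-\xFF]/;

# Split a Shift-JIS byte string into runs of [kind, bytes], where kind is
# 'ascii' (one-byte Roman), 'wide_roman' (two-byte Roman) or 'other'.
sub split_sjis_runs {
    my ($bytes) = @_;
    my @runs;
    while ($bytes =~ /\G($SJIS_CHAR)/g) {
        my $ch = $1;
        my $kind;
        if (length($ch) == 1) {
            $kind = ord($ch) < 0x80 ? 'ascii' : 'other';
        }
        elsif ($ch =~ /\A\x82[\x60-\x79\x81-\x9A]\z/) {
            $kind = 'wide_roman';
        }
        else {
            $kind = 'other';
        }
        # Extend the current run if the kind has not changed.
        if (@runs && $runs[-1][0] eq $kind) {
            $runs[-1][1] .= $ch;
        }
        else {
            push @runs, [$kind, $ch];
        }
    }
    return @runs;
}

# "ab", then full-width "A" and "a", then two kanji (0x93FA 0x967B).
my $sample = "ab" . "\x82\x60\x82\x81" . "\x93\xFA\x96\x7B";
for my $run (split_sjis_runs($sample)) {
    printf "%-10s %s\n", $run->[0], unpack('H*', $run->[1]);
}
```

Note that everything below 0x80 is lumped into ascii here, so spaces, punctuation and $ land in that bucket; move them elsewhere if your data calls for it.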
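Another way is to convert to Unicode using Text::Iconv, and then use Unicode::CharName to get the name of each character (if the name contains the word LATIN, it's a Latin character! Full-width letters are named FULLWIDTH LATIN ..., so don't just test the start of the name).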
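The Unicode route might look something like the sketch below. To keep it self-contained it swaps in the core Encode and charnames modules (bundled with Perl 5.8 and later) for Text::Iconv and Unicode::CharName, so treat it as an illustration of the idea rather than of those two modules' APIs. The helper name is_latin_char is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ':full';

# True if the character's Unicode name contains the word LATIN.
# This also catches names like "FULLWIDTH LATIN CAPITAL LETTER A".
sub is_latin_char {
    my ($ch) = @_;
    my $name = charnames::viacode(ord($ch));
    return (defined($name) && $name =~ /\bLATIN\b/) ? 1 : 0;
}

# "a", full-width "A" (0x8260) and one kanji (0x93FA), as Shift-JIS bytes.
my $bytes = "a" . "\x82\x60" . "\x93\xFA";
my $text  = decode('shiftjis', $bytes);

for my $ch (split //, $text) {
    my $name = charnames::viacode(ord($ch));
    $name = '(no name)' unless defined $name;
    printf "U+%04X %-35s %s\n", ord($ch), $name,
        is_latin_char($ch) ? 'latin' : 'other';
}
```

Digits, spaces and punctuation do not have LATIN in their names, so they come out as other here and need their own rule.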
In any case, please let us know how you solve that problem.
By the way, I think you need Perl 5.6 or later to do Unicode processing, so be ready to upgrade if you haven't already.
Mirod, ready and fully functional (see picture)
In reply to Re: Regular Expression 1 byte vs 2 byte characters by mirod, in thread Regular Expression 1 byte vs 2 byte characters by feloniousMonk