Re: Regular Expression 1 byte vs 2 byte characters

I am not familiar with this stuff either, but I think you need to know a little more about your data: what is the encoding of this text? Unicode or Shift-JIS (a Japanese encoding that can also encode Roman characters)? Look for the encoding declaration in your HTML document; for Shift-JIS it should look like <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=x-sjis">.
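To check the declaration from a script rather than by eye, something like this might do. It is only a rough sketch: charset_from_html is a name I made up, and the regex assumes the declaration sits in a META tag like the one above. It is not a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the declared charset (lowercased),
# or undef if no META charset declaration is found.
sub charset_from_html {
    my ($html) = @_;
    return lc $1
        if $html =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([\w.:-]+)/i;
    return;
}
```

For the META tag quoted above it should give you x-sjis.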

If it is Shift-JIS, you can use the Shift-JIS table to figure out how to separate the two-byte Roman letters (130,96 to 130,154) from the rest. You will have to decide what to do with punctuation, spaces, $ and the like, though, since they can belong to either kind of text.
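Here is a minimal sketch of that table-based approach, working on raw bytes. The sub name and the class labels are my own invention. I am assuming the usual Shift-JIS layout: lead bytes 129-159 and 224-252 start a two-byte character, and within the 130,96 to 130,154 range the letters themselves sit at 130,96-130,121 (A-Z) and 130,129-130,154 (a-z). Digits, punctuation and spaces all land in the "other" classes here, so you would still have to decide where they go.

```perl
use strict;
use warnings;

# Hypothetical helper: split a Shift-JIS byte string into characters
# and tag each one as roman1, roman2, other1 or other2
# (the digit is the number of bytes).
sub classify_sjis {
    my ($bytes) = @_;
    my @chars;
    # Try a two-byte character first, otherwise take a single byte.
    while ($bytes =~ /\G([\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]|.)/gs) {
        my $char = $1;
        my $class;
        if (length($char) == 2) {
            # Two-byte Roman letters: lead byte 130 (0x82).
            $class = $char =~ /\A\x82[\x60-\x79\x81-\x9A]\z/ ? 'roman2' : 'other2';
        }
        else {
            $class = $char =~ /\A[A-Za-z]\z/ ? 'roman1' : 'other1';
        }
        push @chars, [ $char, $class ];
    }
    return @chars;
}
```

Called on the bytes "A\x82\x60\x82\xA0 " (a one-byte A, a two-byte A, a hiragana and a space) it returns the classes roman1, roman2, other2, other1 in that order.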

Another way is to convert to Unicode using Text::Iconv, and then use Unicode::CharName to get the name of each character (if it starts with LATIN it's a Latin character!).
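A sketch of the same idea, with one big caveat: it does not use the two modules named above. It swaps in Encode and charnames, which ship with more recent Perls, so it will not run on an older installation. The sub names are made up. Note also that the two-byte Roman letters come out with names starting with FULLWIDTH LATIN rather than plain LATIN, so the test below accepts both.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Hypothetical helper: decode Shift-JIS bytes and return a list of
# [character, Unicode name] pairs (empty name if none is known).
sub char_names_from_sjis {
    my ($bytes) = @_;
    my $text = decode('shiftjis', $bytes);
    return map {
        my $name = charnames::viacode(ord $_);
        [ $_, defined $name ? $name : '' ];
    } split //, $text;
}

# True if the name marks a Latin letter, one-byte or fullwidth.
sub is_latin_name {
    my ($name) = @_;
    return $name =~ /\A(?:FULLWIDTH )?LATIN\b/ ? 1 : 0;
}
```

With this, a plain A is named LATIN CAPITAL LETTER A, the two-byte A is FULLWIDTH LATIN CAPITAL LETTER A, and a hiragana such as 130,160 is HIRAGANA LETTER A, so only the first two pass is_latin_name.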

In any case, please let us know how you solve that problem.

By the way, I think you need at least Perl 5.6 to do Unicode processing, so be ready to upgrade if you haven't already.

Mirod, ready and fully functional (see picture)
