feloniousMonk has asked for the wisdom of the Perl Monks concerning the following question:

Hello folks.

Today I am trying to learn how to strip out all western font
characters from a mixed Asian/Western text file.

Basically, it's an HTML file, markup and errata in
English with a smattering of Japanese characters.

I want to pull out the Japanese characters only and place
them gently into a separate file.

My understanding is that Japanese characters are 2 bytes each,
yet a regex pattern like [a-zA-Z0-9] seems to match everything,
Japanese included.

Anyone familiar with this stuff?

Thanks,
Felonious

(tye)Re: Regular Expression 1 byte vs 2 byte characters
by tye (Sage) on Apr 07, 2001 at 00:13 UTC

    Assuming Unicode, something I have only passing familiarity with (which, as usual, doesn't stop me from pontificating about it), the 2-byte characters will not have a first byte with a high-order bit of 0 -- that is, the first byte of each pair will have a value between 128 and 255. (You can actually narrow that range down further, since larger-than-two-byte characters, for example, take up part of it.)

    So something like s/[\200-\277].//gs should strip out two-byte characters.

    Simply matching on /\w/ doesn't work since you will match some of the bytes that are the second half of a two-byte character.
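
    A minimal sketch of that idea, flipped around to keep the two-byte pairs rather than strip them, since that is what you are after. The byte range, the filenames, and the assumption that every pair starts with a byte in that range are all guesses to check against your data:

        #!/usr/bin/perl -w
        use strict;

        # Slurp the mixed file as raw bytes.
        open(my $in, '<', 'mixed.html') or die "mixed.html: $!";
        binmode $in;
        my $text = do { local $/; <$in> };
        close $in;

        # Keep every two-byte sequence whose lead byte falls in that range;
        # the exact range to use depends on the file's encoding.
        my $japanese = join '', $text =~ /[\200-\277]./gs;

        open(my $out, '>', 'japanese.txt') or die "japanese.txt: $!";
        binmode $out;
        print $out $japanese;
        close $out;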

            - tye (but my friends call me "Tye")
      --
      OK, it looks like s/[^\200-\277].//sg yanked enough out.

      I'm still figuring out what some of the non-displayable
      characters are, but it does look like I didn't lose any
      Japanese text with the above regexp.

      (The goal here was to keep the Japanese and yank the English)

      As far as encoding goes, it's Shift-JIS. This is actually my
      first strange-text obstacle, and I hope I run into very few
      more, at least until I get a better handle on it.

      Thanks again,
      Felonious
Re: Regular Expression 1 byte vs 2 byte characters
by mirod (Canon) on Apr 07, 2001 at 00:33 UTC

    I am not familiar with this stuff either, but I think you need to know a little more about your data: what is the encoding of this text? Unicode or Shift-JIS (a Japanese encoding which can also encode Roman characters)? Look for the encoding declaration in your HTML document; it should look like <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=x-sjis"> for Shift-JIS.
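
    A quick way to sniff for that declaration, assuming the page is already in $html (a rough sketch -- real-world META tags vary in attribute order and quoting, so don't trust it blindly):

        # Pull the declared charset out of the HTML, if there is one.
        my ($charset) = $html =~ /charset\s*=\s*["']?([\w-]+)/i;
        print defined $charset ? "Declared charset: $charset\n"
                               : "No charset declaration found\n";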

    If it is Shift-JIS you can use the Shift-JIS table to figure out how to separate Roman characters (130,96 to 130,154) from the rest. You will have to decide what to do with punctuation, spaces, $ and the like, though, since those can belong to either kind of text.
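
    Here is a sketch of what that byte-level separation might look like, with $text assumed to hold the raw file contents. The lead-byte ranges (0x81-0x9F and 0xE0-0xEF), trail-byte ranges (0x40-0x7E and 0x80-0xFC), and half-width katakana range (0xA1-0xDF) are my reading of the standard Shift-JIS layout, so double-check them against the table:

        # One Shift-JIS double-byte character (lead byte then trail byte),
        # or a single-byte half-width katakana.
        my $sjis_jp = qr/[\x81-\x9F\xE0-\xEF][\x40-\x7E\x80-\xFC]|[\xA1-\xDF]/;

        my $kept = '';
        for my $char ($text =~ /($sjis_jp)/g) {
            # Fullwidth Roman letters sit at 130,96 - 130,154
            # (\x82\x60 - \x82\x9A); skip them if you only want kana and kanji.
            next if $char =~ /^\x82[\x60-\x9A]/;
            $kept .= $char;
        }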

    Another way is to convert to Unicode using Text::Iconv, and then use Unicode::CharName to get the name of each character (if it starts with LATIN, it's a Latin character!).
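
    A sketch of that route, as I remember the two interfaces (a Text::Iconv constructor taking from/to encoding names and a convert() method, plus a uname() function in Unicode::CharName that maps a code point to its name). The encoding names accepted depend on your iconv, and converting to big-endian UCS-2 is just a trick to get fixed-width code points you can unpack; $sjis_bytes is assumed to hold the file contents:

        use Text::Iconv;
        use Unicode::CharName qw(uname);

        # Shift-JIS bytes -> UCS-2BE, so every character is a 16-bit code point.
        my $conv = Text::Iconv->new("SHIFT-JIS", "UCS-2BE");
        my $ucs2 = $conv->convert($sjis_bytes);

        my @kept;
        for my $cp (unpack "n*", $ucs2) {
            my $name = uname($cp) || '';
            next if $name =~ /^LATIN/;    # drop Roman letters
            push @kept, $cp;              # keep everything else
        }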

    In any case, please let us know how you solve that problem.

    By the way, I think you need Perl 5.6 to do Unicode processing, so be ready to update if you haven't already.

    Mirod, ready and fully functional (see picture)