(tye)Re: Regular Expression 1 byte vs 2 byte characters

Assuming Unicode, something that I only have passing familiarity with (which, as usual, doesn't preclude me from pontificating about it), the 2-byte characters will not have a first byte with a high-order bit of 0 (that is, the first byte of each pair will have a value between 128 and 255 -- actually, you can narrow it down further than that since larger-than-two-byte characters, for example, take up part of that range).

So something like s/[\200-\277].//gs should strip out two-byte characters.

Simply matching on /\w/ doesn't work since you will match some of the bytes that are the second half of a two-byte character.
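To make the two points above concrete, here is a minimal sketch that assumes the data is Shift-JIS rather than Unicode, and assumes the commonly documented Shift-JIS byte ranges (lead byte 0x81-0x9F or 0xE0-0xFC, trail byte 0x40-0x7E or 0x80-0xFC). The helper name strip_two_byte and the sample string are made up for illustration; adjust the ranges for whatever encoding you really have.

```perl
use strict;
use warnings;

# Hypothetical helper: remove two-byte characters from a byte string.
# ASSUMPTION: Shift-JIS, where a two-byte character is a lead byte in
# 0x81-0x9F or 0xE0-0xFC followed by a trail byte in 0x40-0x7E or 0x80-0xFC.
sub strip_two_byte {
    my ($bytes) = @_;
    $bytes =~ s/[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]//g;
    return $bytes;
}

# Sample: "ab", then full-width katakana A (bytes 0x83 0x41), then "cd".
my $sample = "ab\x83\x41cd";
print strip_two_byte($sample), "\n";

# Why a bare /\w/ misleads: the trail byte 0x41 is also ASCII "A",
# so it matches \w even though it is only half of a character.
my @word_bytes = "\x83\x41" =~ /\w/g;
print scalar(@word_bytes), " byte(s) of the katakana matched \\w\n";
```

Because the substitution consumes the lead and trail byte together, the trail byte never gets a chance to be mistaken for a stand-alone ASCII character.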

- tye (but my friends call me "Tye")


Replies are listed 'Best First'.
Re: (tye)Re: Regular Expression 1 byte vs 2 byte characters
by feloniousMonk (Pilgrim) on Apr 07, 2001 at 01:35 UTC

OK, looks like `s/[^128-277].//sg` yanked enough out. I'm still figuring out what some of the non-displayable characters are, but it does look like I didn't lose any Japanese text with the above regexp. (The goal here was to keep the Japanese and yank the English.) As far as encoding goes, it's Shift-JIS. This is actually my first strange text obstacle, and I hope I have very few, at least until I get a better handle on it.

Thanks again,
Felonious
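For the stated goal (keep the Japanese, yank the English), one alternative is to collect the two-byte characters instead of deleting everything else. This is only a sketch, not the regexp used above: it assumes Shift-JIS with the commonly documented ranges (lead byte 0x81-0x9F or 0xE0-0xFC, trail byte 0x40-0x7E or 0x80-0xFC), and the helper name keep_two_byte is invented here. Note that it also drops single-byte half-width katakana (0xA1-0xDF), which may or may not be what you want.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the two-byte characters of a byte string.
# ASSUMPTION: the input is Shift-JIS; every other byte is discarded.
sub keep_two_byte {
    my ($bytes) = @_;
    # In list context, //g returns every captured lead+trail pair in order.
    my @pairs = $bytes =~ /([\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC])/g;
    return join '', @pairs;
}

# Sample: "Hi ", two full-width katakana (0x83 0x41 and 0x83 0x43), then "!".
my $mixed = "Hi \x83\x41\x83\x43!";
my $kept  = keep_two_byte($mixed);
printf "kept %d of %d bytes\n", length($kept), length($mixed);
```

Matching each pair as a unit keeps the scan aligned on character boundaries, so a trail byte that happens to look like ASCII stays attached to its lead byte.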
