RegExp to exclude 2-byte characters

feloniousMonk has asked for the wisdom of the Perl Monks concerning the following question:

--
Hello.

I am seeking the best regexp for a single task:
it must delete all characters except the ones in a
given set of ranges. The problem: the ranges are 2-byte chars
for which I have the oct and hex codes.
I can easily exclude single-byte chars, but I cannot make
the next step to 2-byte chars; my output turns to junk.

Here's what my character ranges are:

\201\101-\201\114
\202\237-\202\362
\203\100-\203\177
\203\200-\203\227
\210\237-\237\375

I thought as a start this would match, but now I need to negate it:

s/(([\201][\101-\114])|
  ([\202][\237-\362])|
  ([\203][\100-\177\200-\227)|
  ([\210][\237-777])|
  ([\211-\236][\000-\777])|
  ([\237][\000-\375]))//gx;

I think I'm a bit off 'cuz the above is matching nothing
and I dunno y.

Any advice would be greatly appreciated, thanks,

--
-Felonious

Re: RegExp to exclude 2-byte characters by Caillte (Friar) on Apr 10, 2001 at 20:45 UTC
By two-byte chars I presume you mean they are from the Unicode charset. Looking through the utf pages in the docs, I get the impression that your syntax may be a bit off. I had a play with the following code and had no problems deleting pairs of chars...

use utf8;
my $data = "\x{200}\x{201}\x{102}\x{200}\x{210}\x{375}\x{ab}\x{263A}";
#                  XXXXXXXXXXXXX        XXXXXXXXXXXXX
# Blocks marked with an x are deleted
print $data, "\n";
$data =~ s/
  (\x{201}[\x{101}-\x{114}]|
   \x{202}[\x{237}-\x{262}]|
   \x{203}[\x{100}-\x{177}]|
   \x{203}[\x{200}-\x{227}]|
   \x{210}[\x{237}-\x{375}])
  //gx;
print $data;

I hope this goes some way towards helping.

`$japh->{'Caillte'} = $me;`
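The escapes in the code above are hexadecimal code points, while the ranges in the question are written as octal byte values, so the two do not describe the same characters. A minimal sketch of the difference, assuming a Perl with Unicode string support (5.6 or later); the variable names are only for illustration:

```perl
use strict;
use warnings;

# "\201" is an octal escape: one character with ordinal 0201 (129 decimal).
my $octal = "\201";

# "\x{201}" is a hex escape: one character with ordinal 0x201 (513 decimal).
my $hex = "\x{201}";

printf "octal escape: length %d, ordinal %d\n", length($octal), ord($octal);
printf "hex escape:   length %d, ordinal %d\n", length($hex),   ord($hex);

# The octal form is the same single byte that pack("C", 0201) produces.
print "octal escape equals pack('C', 0201)\n" if $octal eq pack("C", 0201);
```

Both strings have length 1, but only the octal form is a single byte that can appear as half of a 2-byte character in raw data.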
(tye)Re2: RegExp to exclude 2-byte characters by tye (Sage) on Apr 10, 2001 at 22:00 UTC
"\201" is pack("C",0201) [that is, octal] while "\x{201}" is closer to pack("S",0x201) [that is, using hexadecimal]. - tye (but my friends call me "Tye")
(tye)Re: RegExp to exclude 2-byte characters by tye (Sage) on Apr 10, 2001 at 21:56 UTC
Negating multiple-byte regular expressions is quite a pain. It is possible, but even truly great regex hackers get it wrong over and over (I've seen it). An informative example is the classic failure to get a regular expression to match C-style /* comments */ (without using .*? which is flawed if used as part of a larger regular expression).

So don't negate the regular expression; reverse the process. That is, rather than deleting things that match, keep things that match:

`$string= join "", $string =~ /((?:$re)+)/g;`

where $re is your current regex minus the parens and with the typo fixed (a dropped ]).

- tye (but my friends don't call me not `/^[^T][^y][^e]$/`)
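Spelled out, the keep-what-matches approach might look like the sketch below. It assumes the data is a raw byte string and that the ranges from the question are meant as lead byte plus trail byte, with the out-of-range \777 in the original attempt read as \377. The helper name keep_wanted and the sample input are hypothetical.

```perl
use strict;
use warnings;

# One alternative per lead-byte range, built from the ranges in the question.
# Assumption: octal escapes here match single raw bytes, not Unicode characters.
my $re = qr/
      \201 [\101-\114]
    | \202 [\237-\362]
    | \203 [\100-\227]
    | \210 [\237-\377]
    | [\211-\236] [\000-\377]
    | \237 [\000-\375]
/x;

# Hypothetical helper: keep only runs of wanted 2-byte characters,
# dropping everything in between.
sub keep_wanted {
    my ($string) = @_;
    return join "", $string =~ /((?:$re)+)/g;
}

# Sample input: ASCII junk mixed with three wanted 2-byte characters.
my $input  = "AB\202\240\203\100xyz\201\101\n";
my $output = keep_wanted($input);

# Show the surviving bytes as dot-separated decimal ordinals.
printf "%vd\n", $output;
```

One caveat with this sketch: the match can start at any byte, so if an unwanted 2-byte character has a trail byte that looks like a wanted lead byte, the scan can fall out of step. Data like that would need the unwanted characters consumed explicitly as well.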
Re: (tye)Re: RegExp to exclude 2-byte characters by feloniousMonk (Pilgrim) on Apr 10, 2001 at 22:33 UTC
-- That be it..... I believe I have the output I need now. I will now slink off into the shadows and hide from my relentless Unicode hell. -- Thanks much (again), Felonious