feloniousMonk has asked for the wisdom of the Perl Monks concerning the following question:

--
Hello.

I am looking for the best regexp for a single task:
it must delete all characters except the ones in a
given set of ranges. The problem is that the ranges are
2-byte chars, for which I have the octal and hex codes.
I can easily exclude single-byte chars, but I cannot make
the next step to 2-byte ones; my output turns to junk.

Here's what my character ranges are:
\201\101-\201\114 \202\237-\202\362 \203\100-\203\177 \203\200-\203\227 \210\237-\237\375
I thought as a start this would match, but now I need to negate it:
s/(([\201][\101-\114])| ([\202][\237-\362])| ([\203][\100-\177\200-\227)| ([\210][\237-777])| ([\211-\236][\000-\777])| ([\237][\000-\375]))//gx;
I think I'm a bit off, because the above is matching nothing
and I don't know why.

Any advice would be greatly appreciated, thanks,

--
-Felonious

Replies are listed 'Best First'.
Re: RegExp to exclude 2-byte characters
by Caillte (Friar) on Apr 10, 2001 at 20:45 UTC

    By two-byte chars I presume you mean they are from the Unicode charset. Looking through the UTF pages in the docs, I get the impression that your syntax may be a bit off. I had a play with the following code and had no problems deleting pairs of chars...

    use utf8;

    my $data = "\x{200}\x{201}\x{102}\x{200}\x{210}\x{375}\x{ab}\x{263A}";
    #                  XXXXXXXXXXXXXX       XXXXXXXXXXXXXX
    # Blocks marked with an x are deleted

    print $data, "\n";

    $data =~ s/
        (\x{201}[\x{101}-\x{114}]|
         \x{202}[\x{237}-\x{262}]|
         \x{203}[\x{100}-\x{177}]|
         \x{203}[\x{200}-\x{227}]|
         \x{210}[\x{237}-\x{375}])
    //gx;

    print $data;

    I hope this goes some way towards helping.

    $japh->{'Caillte'} = $me;

      "\201" is pack("C",0201) [that is, octal] while "\x{201}" is closer to pack("S",0x201) [that is, using hexadecimal].

              - tye (but my friends call me "Tye")
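
A minimal sketch of the difference tye describes, assuming nothing beyond core Perl. The variable names are illustrative and not from the thread. It shows that the octal escape yields the single byte value 129, while the braced hex escape yields code point 513:

```perl
use strict;
use warnings;

# "\201" is an octal escape: one character with value 0201 (octal) = 129.
my $octal = "\201";

# "\x{201}" is a hex escape: one character with value 0x201 = 513.
my $hex = "\x{201}";

printf "octal escape: length %d, ord %d\n", length($octal), ord($octal);
printf "hex escape:   length %d, ord %d\n", length($hex),   ord($hex);

# The octal form is the same single byte that pack("C", 0201) produces.
print "same as pack C\n" if $octal eq pack("C", 0201);
```

So a pattern written with \x{201} cannot match data whose lead byte is \201; the hex spelling of that byte would be \x81.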
(tye)Re: RegExp to exclude 2-byte characters
by tye (Sage) on Apr 10, 2001 at 21:56 UTC

    Negating multiple-byte regular expressions is quite a pain. It is possible, but even truly great regex hackers get it wrong over and over (I've seen it). An informative example is the classic failures at getting a regular expression to match C-style /* comments */ (without using .*? which is flawed if used as part of a larger regular expression).

    So don't negate the regular expression; reverse the process. That is, rather than deleting things that match, keep things that match:

        $string = join "", $string =~ /((?:$re)+)/g;

    where $re is your current regex minus the parens and with the typo fixed (a dropped ]).

            - tye (but my friends don't call me not /^[^T][^y][^e]$/)
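
The keep-what-matches idea above can be sketched as follows. This is only an illustration under stated assumptions: the helper name keep_pairs and the sample string are hypothetical, and \377 stands in for the question's \777 on the assumption that the data is raw bytes (a byte cannot exceed \377). The two \203 ranges are merged into one class, since \100-\177 and \200-\227 are contiguous.

```perl
use strict;
use warnings;

# Two-byte sequences to keep, built from the octal ranges in the question.
# Assumption: \377 replaces \777 because the input is treated as raw bytes.
my $re = qr/
      \201 [\101-\114]
    | \202 [\237-\362]
    | \203 [\100-\227]
    | \210 [\237-\377]
    | [\211-\236] [\000-\377]
    | \237 [\000-\375]
/x;

# Hypothetical helper: return only the runs of wanted pairs, joined together.
sub keep_pairs {
    my ($string) = @_;
    return join "", $string =~ /((?:$re)+)/g;
}

# Hypothetical sample: "ab", one wanted pair, "c", two wanted pairs, "d".
my $sample = "ab\201\101c\202\240\203\100d";
my $kept   = keep_pairs($sample);

# Show the surviving bytes as octal escapes.
print join(" ", map { sprintf("\\%03o", ord($_)) } split(//, $kept)), "\n";
# prints: \201 \101 \202 \240 \203 \100
```

One caveat with this sketch: the scan is unanchored, so if the input also contains unwanted two-byte characters, a match could begin on the second byte of such a pair. The sample avoids that case by using only ASCII between the wanted pairs.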
      --
      That be it.....

      I believe I have the output I need now.

      I will now slink off into the shadows and hide from my
      relentless Unicode hell.

      --
      Thanks much (again),
      Felonious