in reply to regex: how to negate a set of character ranges?

As The Camel tells us, in chapter 5, verse 4 (Character Classes), you may combine ranges (note Table 5.8, it has several ranges combined in the lower portion). So - you can write the first few lines of $shiftjis as:
[\x30-\x39\x41-\x59\x61-\x7A]
instead of:
[\x30-\x39] | [\x41-\x59] | [\x61-\x7A]
This would allow you to have one huge character class instead of several. I don't have any source text with exotic characters, so this is untested. But I see other problems. The substitution regex is missing its second slash. Another issue is that you have tried to nest character classes, which doesn't work. So, assuming that you've done the above and combined the classes, the fix also removes the outer square brackets. Instead of:
s/[${shiftjis}]/ogx;
Try:
s/${shiftjis}//ogx;
If that doesn't help, please post a short section of your source material to help test other solutions.
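
To make the combined-class idea concrete, here is a minimal sketch. It assumes only the three single-byte ranges quoted above; the variable names and the sample string are made up for illustration, and the ranges are stored without brackets so that the same string can be used in both a plain and a negated class.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the single-byte part of $shiftjis:
# just the three ranges quoted above, stored without brackets.
my $ranges = '\x30-\x39\x41-\x59\x61-\x7A';

# Made-up sample text
my $text = 'abc-123_XY!';

# Delete every character that IS in the combined class
(my $removed = $text) =~ s/[${ranges}]//g;

# Delete every character that is NOT in the combined class
(my $kept = $text) =~ s/[^${ranges}]//g;

print "removed: $removed\n";   # removed: -_!
print "kept:    $kept\n";      # kept:    abc123XY
```

Because the ranges sit directly inside one pair of brackets, putting ^ right after the opening bracket negates the whole set at once. This only covers single-byte ranges; multibyte sequences are a separate matter.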

Re^2: regex: how to negate a set of character ranges?
by kettle (Beadle) on Apr 29, 2007 at 17:49 UTC
    Thanks for the speedy reply! I've kept both the single and multibyte ranges on separate lines just to help me keep track of what they actually represent.

    However, I do not think it is possible to combine the multibyte characters, which means that I can't quite combine everything.

    The missing slash was a typo.

    Also, I should point out that,
    s/[${shiftjis}]//ogx; (or s/${shiftjis}//ogx; or s/$shiftjis//ogx;) will work as expected.

    What doesn't work as expected is:
    s/[^${shiftjis}]//ogx;

    Unfortunately I'm now at home and do not have access to the text. However, I think that the problem is that I don't know this little corner of the regex syntax...
      Is this still not working as expected when you combine even just a couple of the ranges? I don't think that multiple bracketed ranges joined with | are doing what you want once they're expanded inside the char class brackets: in a regex, | is alternation rather than a bitwise or, and character classes don't nest, so the outer class ends at the first ] of the expanded string. Unless performance is a really big problem, if you can't combine the classes for whatever reason, or don't want to, try storing each class string in an array and looping over it, running the substitution once per char class on the source text. It'll get the job done. Good luck!
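
A small sketch of both points, using made-up ASCII classes in place of the real $shiftjis (the variable names and sample text are assumptions for illustration only). The first part shows what the alternated string turns into inside negated brackets; the second shows the one-substitution-per-class loop.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $shiftjis: bracketed classes joined with |
my $alternated = '[a-c] | [x-z]';
my $text       = 'abc-xyz-123';

# After interpolation the pattern is  [^[a-c] | [x-z]]  under /x.
# The inner "[" is a literal and the class ends at the first "]", so this
# reads: "any char except [, a, b, c"  OR  "x-z followed by a literal ]".
(my $surprise = $text) =~ s/[^${alternated}]//gx;
print "negated, nested: $surprise\n";   # negated, nested: abc

# One substitution per class: deletes characters that ARE in any class.
my @classes = ('a-c', 'x-z');
my $removed = $text;
for my $class (@classes) {
    $removed =~ s/[$class]//g;
}
print "per-class loop:  $removed\n";    # per-class loop:  --123
```

Note that the loop only helps for deleting characters that are in the classes. Running a negated substitution such as s/[^$class]//g once per class would strip everything that is missing from any single class, which is not the same as negating the whole set.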