in reply to Re: regex: how to negate a set of character ranges?
in thread regex: how to negate a set of character ranges?

Thanks for the reply!
"Which means you're using the literal and characters as part of a bigger character class."

Yeah, I sort of figured this out, but hadn't figured a way around it.

You're right, I am trying to match the byte sequences. For somewhat annoying reasons I have to first run a parser over the shiftjis text, then convert it to eucjp, run a utility on that (which only accepts eucjp input), and then output the final product in utf8. I could do what you say and then convert back to eucjp, but I'm processing a very large amount of data and need to do it as quickly as possible. I'm also a little worried that there may be a couple of shiftjis characters that don't translate properly into utf8 (I read about this issue somewhere...). Finally, I'd just like to know whether this is possible, and if so, how I can accomplish it.

Re^3: regex: how to negate a set of character ranges?
by Joost (Canon) on Apr 29, 2007 at 18:22 UTC
    Well, the problem with using regexes on raw variable-width encodings is that you can't match characters with character ranges anymore, since a range applied to raw data matches individual bytes, not characters. That means normal character ranges can match invalid data (for example, half of a multi-byte character), and inverted ranges can exclude data that's actually valid.

    You might have to give up on using combined character ranges altogether if you want to process the encoded data directly, and inverting ranges will be especially annoying. I mean, you could possibly match like this: /([\x00-\x40][\x56-\x90]|[\x50-\x60][\x56-\x90])*/ (numbers made up), but you can't (easily) invert that match. Also, keep in mind that your regexes might shift (eh) out of alignment, since shift-jis mixes single-byte and multi-byte characters. That means [\x00-\x40] might match either the first byte or a later byte of any character.
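To make the alignment problem concrete, here is a minimal sketch rather than a definitive solution. It assumes the commonly documented Shift-JIS byte ranges (vendor variants may differ), and the names `$sjis_char`, `$hiragana` and `$not_hiragana` are made up for illustration. It walks the data one whole character at a time with `\G`, and approximates an inverted set with a negative lookahead followed by exactly one character.

```perl
use strict;
use warnings;

# One Shift-JIS character in raw bytes. These ranges are the commonly
# documented ones and are an assumption here; vendor variants may differ.
my $sjis_char = qr{
    [\x00-\x7F\xA1-\xDF]                        # single byte: ASCII range or half-width katakana
  | [\x81-\x9F\xE0-\xFC] [\x40-\x7E\x80-\xFC]   # lead byte, then trail byte
}x;

# Hypothetical set to exclude: two-byte hiragana (0x82 0x9F .. 0x82 0xF1).
my $hiragana = qr{\x82[\x9F-\xF1]};

# "Any character NOT in the set": reject the set with a lookahead, then
# consume exactly one whole character so the match stays aligned.
my $not_hiragana = qr{(?!$hiragana)$sjis_char};

# Raw bytes for hiragana "a", ASCII "A", and katakana "so" (0x83 0x5C).
my $bytes = "\x82\xA0\x41\x83\x5C";

# An unanchored byte match sees the trail byte 0x5C as a backslash.
my $naive_hit = $bytes =~ /\\/ ? 1 : 0;

# Anchored walk with \G: each step consumes one whole character.
my @chars       = $bytes =~ /\G($sjis_char)/g;
my @kept        = grep { /\A$not_hiragana\z/ } @chars;
my $aligned_hit = grep { $_ eq "\\" } @chars;   # count of real backslash characters

printf "%d chars, %d kept, naive=%d aligned=%d\n",
    scalar(@chars), scalar(@kept), $naive_hit, $aligned_hit;
```

The naive match reports a backslash that is really the second byte of a katakana character, while the aligned walk does not. The price is that every pattern has to be built from whole-character alternations, which is the "annoying" part.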

    I think it's still likely that using Perl's internal multi-byte encoding (i.e. utf-8) will be a lot easier, but it depends on what exactly you're trying to do.
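For comparison, a rough sketch of the decoded approach, assuming the core Encode module with its `shiftjis` and `euc-jp` encodings; the sample bytes and the hiragana range U+3041..U+3096 are illustrative choices, not taken from the original question. Once the data is decoded, an ordinary negated character class works per character, and the result can be encoded straight to eucjp for the next step in the pipeline.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw Shift-JIS bytes for hiragana "a", ASCII "A", and katakana "so".
my $sjis_bytes = "\x82\xA0\x41\x83\x5C";

# Decode once into Perl's internal character representation.
my $text = decode('shiftjis', $sjis_bytes);

# A negated class now excludes whole characters, not bytes.
# U+3041..U+3096 is the hiragana letter range (illustrative choice).
my @non_hiragana = $text =~ /([^\x{3041}-\x{3096}])/g;

# Encode for the utility that only accepts eucjp input.
my $euc_bytes = encode('euc-jp', $text);

printf "%d characters, %d outside the hiragana range, %d eucjp bytes\n",
    length($text), scalar(@non_hiragana), length($euc_bytes);
```

Whether the extra decode and encode passes are acceptable for a very large data set is something only a benchmark on the real data can answer.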