in reply to regex: how to negate a set of character ranges?

Hmm... I don't think you can nest character classes. Your final regex looks something like:
/[^[\x30-39]|[\x41-\x59]|[\x61-\x7a]| ... #etc
Which means you're using the literal [, - | and ] characters as part of a bigger character class.
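
For what it's worth, the usual way around this is to put all the ranges inside one pair of brackets instead of alternating separate classes. A minimal sketch, assuming the goal is roughly "anything that is not an ASCII digit or letter" (the ranges and variable names are my own guesses, not taken from your code):

```perl
use strict;
use warnings;

# One negated class holding several ranges: anything that is NOT
# 0-9, A-Z or a-z. The ranges are illustrative guesses.
my $not_alnum = qr/[^\x30-\x39\x41-\x5a\x61-\x7a]/;

my $text = "abc-123_XYZ";
my @hits = $text =~ /($not_alnum)/g;

print join(",", @hits), "\n";    # prints "-,_"

# An inner "[" does not open a nested class; it is just one more
# literal member of the set, so here it is excluded like the digits.
my $bracket_excluded = "[" !~ /\A[^[\x30-\x39]\z/;
print $bracket_excluded ? "[ is part of the set\n" : "[ is not part of the set\n";
# prints "[ is part of the set"
```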

Also, it looks like you're trying to match raw byte sequences instead of characters. I have zero experience with shift-jis so I have no clue what characters (if any) you're trying to match, but as a wild stab, I would assume it's a lot easier to use Encode's decode() function to translate the shift-jis bytes into true (utf-8) characters and then match on characters (since you then can match multi-byte codepoints directly).
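
Something along these lines is what I have in mind. It's only a sketch: the sample string and the choice of "ASCII letters plus hiragana" as the set to negate are assumptions of mine, since I don't know which characters you actually care about.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Build some Shift-JIS bytes to play with: hiragana "a" (U+3042),
# an ASCII "A" and katakana "a" (U+30A2). In real code the bytes
# would come from the input file instead.
my $sjis_bytes = encode('shiftjis', "\x{3042}A\x{30A2}");

# Turn the bytes into Perl characters; from here on a character
# class matches whole code points, not individual bytes.
my $chars = decode('shiftjis', $sjis_bytes);

# Negated set of ranges: everything that is neither an ASCII letter
# nor hiragana (U+3041-U+3096).
my @other = $chars =~ /([^A-Za-z\x{3041}-\x{3096}])/g;

printf "U+%04X\n", ord($_) for @other;    # prints "U+30A2"
```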

Re^2: regex: how to negate a set of character ranges?
by kettle (Beadle) on Apr 29, 2007 at 17:56 UTC
    Thanks for the reply!
    "Which means you're using the literal [, - | and ] characters as part of a bigger character class."

    Yeah, I sort of figured this out, but hadn't figured a way around it.

    You're right, I am trying to match the byte sequences. For somewhat annoying reasons I have to first run a parser over the shiftjis text, then convert it to eucjp, run a utility on that (which only accepts eucjp input) and then output the final product in utf8. I could do what you say, and then convert back to eucjp, but I'm processing a very large amount of data and need to do it in as timely a manner as possible. I'm also a little bit worried that there may be a couple of shiftjis characters that don't translate properly into utf8 (I read about this issue somewhere...). Finally, I'd just like to know whether this is possible, and if so, how I can accomplish it.
      Well, the problem with using regexes for raw variable-width encodings is that you can't match characters with character ranges anymore, since ranges on raw data match only bytes. That means normal character ranges will match invalid data, and inverted ranges will exclude data that's possibly valid.

      You might have to give up on using combined character ranges altogether if you want to process the encoded data directly, and inverting ranges will be especially annoying. I mean, you could possibly match like this /([\x00-\x40][\x56-\x90]|[\x50-\x60][\x56-\x90])*/ (numbers made up), but you can't (easily) invert that match. Also, keep in mind that your regexes might shift (eh) out of alignment, since shift-jis has both single-byte and multi-byte characters, meaning [\x00-\x40] might match the first byte of a character as well as a later byte of some other character.
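
      To make the alignment problem concrete, and to show one possible (not especially pretty) way to do the inversion on raw bytes, here is a sketch. It assumes the usual Shift-JIS layout: single bytes 0x00-0x7F and 0xA1-0xDF, and lead bytes 0x81-0x9F or 0xE0-0xFC followed by a trail byte 0x40-0x7E or 0x80-0xFC. The names $sjis_char and $wanted are made up for this example, and the "wanted" set (ASCII capitals plus hiragana) is just a stand-in for whatever you need.

```perl
use strict;
use warnings;

# Raw Shift-JIS bytes: hiragana "a" (82 A0), ASCII "A" (41),
# katakana "a" (83 41). Note the 0x41 trail byte in the last one.
my $bytes = "\x82\xA0\x41\x83\x41";

# A naive byte-level class finds "A" twice: once for real, once
# inside the two-byte katakana character.
my $naive = () = $bytes =~ /[\x41-\x5a]/g;

# One whole Shift-JIS character: two-byte form first, then one byte.
my $sjis_char = qr/[\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]|[\x00-\x7f\xa1-\xdf]/;

# The set to exclude, written as whole characters: ASCII capitals
# and two-byte hiragana (lead byte 0x82, trail bytes 0x9F-0xF1).
my $wanted = qr/[\x41-\x5a]|\x82[\x9f-\xf1]/;

# \G keeps every match glued to the end of the previous one, so a
# match never starts in the middle of a character; (?!...) rejects
# characters from the excluded set.
my @others;
while ($bytes =~ /\G(?:(?!$wanted)($sjis_char)|$sjis_char)/g) {
    push @others, $1 if defined $1;
}

printf "naive: %d, others: %s\n",
    $naive, join(" ", map { unpack("H*", $_) } @others);
# prints "naive: 2, others: 8341"
```

      The lookahead only does the right thing because the loop consumes one complete character per iteration; without the \G anchoring the same pattern could start on a trail byte.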

      I think it's still likely that using the internal perl multi-byte encoding (i.e. utf-8) will be a lot easier, but it depends on what you're trying to do exactly.
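
      On the worry about characters that don't survive the trip through utf-8: one way to find out for your own data is to ask Encode to die on anything it can't convert and then compare the round-tripped bytes with the originals. A rough sketch, with a hard-coded sample string standing in for your real input:

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Die on anything that cannot be converted, and keep the source intact.
my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

# Sample Shift-JIS input (the three kanji of "nihongo"); real data
# would be read from the parser's output.
my $sjis = "\x93\xFA\x96\x7B\x8C\xEA";

# Round trip: Shift-JIS bytes -> characters -> Shift-JIS bytes.
my $chars = decode('shiftjis', $sjis, $check);
my $back  = encode('shiftjis', $chars, $check);
my $round_trip_ok = $back eq $sjis;

# In-place byte-to-byte conversion for the EUC-JP-only utility.
my $euc = $sjis;
from_to($euc, 'shiftjis', 'euc-jp');

printf "%d chars, round trip %s, euc-jp %s\n",
    length($chars), ($round_trip_ok ? "ok" : "lossy"), unpack("H*", $euc);
# prints "3 chars, round trip ok, euc-jp c6fccbdcb8ec"
```

      Whether the extra decode step is fast enough for your volume is something only a benchmark on your data can tell you.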