Well, the problem with using regexes on raw variable-width encodings is that you can't match characters with character ranges anymore, since a range applied to raw data matches only single bytes. That means normal character ranges can match invalid data (a fragment of a character), and inverted ranges can exclude data that's actually valid.
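To make that concrete, here's a minimal sketch. It assumes the input is Shift-JIS and uses the core Encode module; the sample bytes are my own illustration, not something from your data.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw Shift-JIS bytes for KATAKANA LETTER A (U+30A2):
# lead byte 0x83, trail byte 0x41.
my $raw = "\x83\x41";

# On raw bytes the class sees the trail byte 0x41, which is also ASCII "A".
my $raw_hit = $raw =~ /[A-Z]/ ? 1 : 0;

# After decoding, the string holds one character, U+30A2, outside A-Z.
my $text     = decode('shiftjis', $raw);
my $text_hit = $text =~ /[A-Z]/ ? 1 : 0;

print "raw bytes: $raw_hit, decoded: $text_hit\n";   # raw bytes: 1, decoded: 0
```

The same range gives a false positive on the raw bytes and the right answer on the decoded string.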

You might have to give up on combined character ranges altogether if you want to process the encoded data directly, and inverting ranges will be especially annoying. I mean, you could possibly match like this: /([\x00-\x40][\x56-\x90]|[\x50-\x60][\x56-\x90])*/ (numbers made up), but you can't (easily) invert that match. Also, keep in mind that your regexes might shift (eh) off their alignment, since Shift-JIS has both single-byte and multi-byte characters - meaning [\x00-\x40] might match either the first byte or a later byte of any character.
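If you do have to stay on the raw bytes, one way around the alignment problem is to walk the data one whole character at a time and only then test each character. This is just a sketch: the byte ranges in $sjis_char are a simplified Shift-JIS layout that I'm assuming here, and sjis_chars is a made-up helper name.

```perl
use strict;
use warnings;

# Assumed (simplified) Shift-JIS byte layout:
#   single byte: ASCII 0x00-0x7F, half-width katakana 0xA1-0xDF
#   double byte: lead 0x81-0x9F or 0xE0-0xFC, trail 0x40-0x7E or 0x80-0xFC
my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Split raw bytes into whole characters. \G keeps each match aligned to the
# end of the previous one, so a trail byte is never read as a new character.
sub sjis_chars {
    my ($bytes) = @_;
    my @chars;
    pos($bytes) = 0;
    while ($bytes =~ /\G($sjis_char)/gc) {
        push @chars, $1;
    }
    return @chars;
}

# KATAKANA LETTER A (0x83 0x41) followed by ASCII "A" (0x41).
my @chars       = sjis_chars("\x83\x41" . "A");
my @ascii_upper = grep {  /\A[A-Z]\z/ } @chars;   # the range, per character
my @not_upper   = grep { !/\A[A-Z]\z/ } @chars;   # the inverted range

print scalar(@chars), " characters, ", scalar(@ascii_upper), " ASCII uppercase\n";
```

Once every element is a whole character, inverting the test is just a negated grep. Note that the loop stops silently at the first byte sequence the pattern doesn't recognise, so real code would want to check pos() against the length.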

I still think it's likely that decoding to Perl's internal multi-byte encoding (i.e. UTF-8) will be a lot easier, but it depends on what exactly you're trying to do.
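For what it's worth, the decoded route might look something like this. Again a sketch: I'm assuming the input is Shift-JIS (if it's really the Microsoft variant you may want 'cp932' instead), and the kana ranges are only an example of "a set of character ranges" to negate.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw Shift-JIS input: KATAKANA LETTER A (0x83 0x41), ASCII "A",
# HIRAGANA LETTER A (0x82 0xA0).
my $raw = "\x83\x41" . "A" . "\x82\xA0";

# Decode once; from here on the regex engine works on characters, not bytes.
my $text = decode('shiftjis', $raw);

# Negate a set of character ranges: everything that is neither hiragana
# (U+3040-U+309F) nor katakana (U+30A0-U+30FF).
my @non_kana = $text =~ /([^\x{3040}-\x{309F}\x{30A0}-\x{30FF}])/g;

print scalar(@non_kana), " non-kana character(s): @non_kana\n";
```

Here the negated class picks out only the ASCII "A". If you need Shift-JIS on the way out, encode('shiftjis', ...) from the same module goes the other direction.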


In reply to Re^3: regex: how to negate a set of character ranges? by Joost
in thread regex: how to negate a set of character ranges? by kettle
