nysus has asked for the wisdom of the Perl Monks concerning the following question:

A quote from Camel Book, 3rd Ed., Chapter 5, "Pattern Matching", pg 139, Paragraph 2:

"While most of your data will probably be text strings, there's nothing stopping you from using regexes to search and replace any byte sequence..."

At the end of the same paragraph, there is a reference to Chapter 15, "Unicode". From all this, I'm guessing a byte sequence RE could be performed with a use bytes pragma within the scope of the RE. I'm also guessing that a byte sequence RE will be treated exactly the same as a "text string" RE for most Western European languages. Can someone please confirm my thinking on this?

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot";
$nysus = $PM . $MCF;

Re: How to perform a 'byte sequence' RE?
by John M. Dlugosz (Monsignor) on Jun 26, 2001 at 00:55 UTC
    That's how I understand it, but it doesn't always work. I've brought it up in other newsgroups and was told that strings will eventually be stored "transparently" while still handling binary data OK, though I can't really get a better answer than that.

    Currently you need use utf8 in the scope of the regex to enable certain behavior.

    There is also the fact that any particular string may be byte or character encoded, but no function to tell which. A regex on a byte string will work with binary data—no special pragma is needed.
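    To illustrate the point about byte strings, here is a minimal sketch of a regex applied directly to binary data with no pragma in scope. The buffer contents are made up for the example (a PNG-style signature followed by a few arbitrary bytes); only core Perl is used.

```perl
use strict;
use warnings;

# Hypothetical binary buffer: a PNG-style signature followed by some bytes.
my $data = "\x89PNG\x0D\x0A\x1A\x0A" . "\x00\x00\x00\x0DIHDR" . "\xFF\x00\xFE";

# Match the 8-byte signature at the start; \x escapes name raw byte values.
my $is_png = $data =~ /\A\x89PNG\x0D\x0A\x1A\x0A/ ? 1 : 0;

# Capture the four bytes that precede the "IHDR" tag. The /s flag lets
# "." match any byte, including 0x0A (newline).
my ($len_bytes) = $data =~ /(.{4})IHDR/s;
my $chunk_len = unpack 'N', $len_bytes;   # interpret as big-endian 32-bit

# Replace every 0xFF byte with 0x00 in a copy of the buffer.
(my $patched = $data) =~ s/\xFF/\x00/g;

printf "png=%d len=%d\n", $is_png, $chunk_len;
```

    The /s modifier matters for binary data: without it, "." refuses to match a newline byte, which is easy to overlook.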

    However, use bytes and use utf8 are not simple opposites as I had thought from reading those docs. They mean different things, and the real behavior is different from what p5p people tell me.
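    As a rough illustration of why the two pragmas are not opposites, the sketch below shows what use bytes does on a reasonably modern Perl 5 (this is an assumption about the reader's version; the behavior discussed in this thread may differ). use bytes switches built-ins such as length to byte semantics inside its lexical scope, whereas use utf8 only declares that the source file itself is written in UTF-8.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";          # one character, U+263A

my $chars = length $smiley;       # counts characters
my $octets;
{
    use bytes;                    # lexically scoped: byte semantics in this block
    $octets = length $smiley;     # counts bytes of Perl's internal UTF-8 form
}

print "chars=$chars octets=$octets\n";
```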

    So, always test it and try it. Unicode support is still "experimental" according to the docs.

Re: How to perform a 'byte sequence' RE?
by mattr (Curate) on Jun 26, 2001 at 12:44 UTC
    Um, how about encoding into ASCII somehow and doing a regex on that? It should work. I've had enough weird things crop up with binary code in the past that I'm paranoid on this issue (unless it's in C, in which case I'm both more at ease and more paranoid, if that makes sense).

    I haven't used utf8 regexes before, but when I want to compare Japanese strings I first convert them to a single known encoding (EUC for Japanese, which is two-byte and preserves ASCII codes). So maybe MIME::Base64 or some other packing method for the general case?
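    One way to sketch the "other packing method" idea is a hex dump via unpack, swapped in here instead of MIME::Base64 because hex keeps a fixed two characters per byte, which makes offsets easy to reason about. The sample bytes and the pattern being searched for are invented for the example.

```perl
use strict;
use warnings;

my $data = "\x00\x01\xDE\xAD\xBE\xEF\x10\xDE";   # hypothetical sample bytes

# Two lowercase hex digits per byte, so the result is plain ASCII.
my $hex = unpack 'H*', $data;

# Look for the byte sequence DE AD BE EF. The (?:..)*? prefix consumes whole
# digit pairs, which keeps a match from starting halfway through a byte.
my $offset = -1;
if ($hex =~ /\A((?:..)*?)deadbeef/) {
    $offset = length($1) / 2;    # convert hex-digit offset to byte offset
}

# Edit the ASCII form, then convert back to binary.
(my $edited_hex = $hex) =~ s/\A((?:..)*?)deadbeef/${1}cafebabe/;
my $edited = pack 'H*', $edited_hex;

print "offset=$offset\n";
```

    The cost of this approach is a string twice the size of the input and the need to keep matches aligned on byte boundaries.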

    I believe that JPerl, a Japanized Perl, actually lets you do tr// and similar things with binary, but this all seems pretty iffy. It seems best to me to do something you know must work, even if it is not so elegant. Then test anyway. :)

    By the way, I don't know where your binary is coming from, but watch out for endianness: if you are reading a resource file from a Mac, it probably stores a two- or four-byte value in the reverse byte order from a PC. If it is just an 8-bit ISO character set, that problem shouldn't arise.
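    The byte-order issue can be seen with unpack, whose n/N templates read big-endian values and v/V read little-endian ones. The four input bytes below are arbitrary sample data, not taken from any real file.

```perl
use strict;
use warnings;

my $raw = "\x12\x34\x56\x78";   # hypothetical four bytes read from a file

my $be16 = unpack 'n', $raw;    # 16-bit big-endian    -> 0x1234
my $le16 = unpack 'v', $raw;    # 16-bit little-endian -> 0x3412
my $be32 = unpack 'N', $raw;    # 32-bit big-endian    -> 0x12345678
my $le32 = unpack 'V', $raw;    # 32-bit little-endian -> 0x78563412

printf "%04X %04X %08X %08X\n", $be16, $le16, $be32, $le32;
```

    Naming the byte order explicitly in the template, rather than relying on the native order of the machine running the script, keeps the result the same on both kinds of hardware.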