in reply to regular expression searching in binary files

I've tried variously, use utf8; use various encodings but my matches don't capture the strings I'm seeking.
You must not have tried the most appropriate encoding (IMO), namely UCS-2. Most likely, this being Windows, it's little-endian (UCS-2LE): for plain ASCII/Latin-1 characters, the character's byte comes first and a null byte follows.

But GrandFather is most likely right: you're trying to find Unicode strings inside a binary file. Treating the whole file as 16-bit Unicode, by using binmode or open to set the encoding of the filehandle to 'ucs2le', for example with

open my $in, '<:encoding(ucs2le)', $file or die "Can't open $file: $!";
is likely to fail, as characters needn't start at even file positions in the binary file.
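
To illustrate the alignment problem, here is a minimal sketch using Encode's decode in place of the filehandle layer. The two buffers are made-up samples, not data from the original file: one holds 'Author' at an even offset, the other at an odd offset, and both are padded to an even length so that decoding itself succeeds.

```perl
use strict;
use warnings;
use Encode qw(decode);

# 'Author' as UCS-2LE bytes: each character followed by a null byte.
my $wide = pack 'v*', unpack 'C*', 'Author';

# Hypothetical sample buffers (assumptions for illustration only).
my $aligned    = "\x01\x02" . $wide;          # string starts at even offset 2
my $misaligned = "\x01" . $wide . "\x02";     # string starts at odd offset 1

# Decode each whole buffer as 16-bit units, then look for the text.
my $pos_aligned    = index(decode('UCS-2LE', $aligned),    'Author');
my $pos_misaligned = index(decode('UCS-2LE', $misaligned), 'Author');

print "aligned: $pos_aligned, misaligned: $pos_misaligned\n";
```

In the misaligned buffer every 16-bit unit straddles two characters, so the decoded text no longer contains 'Author' and index returns -1.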

So you could try GrandFather's approach, which is a very sensible one, or you could do the inverse: convert the strings you're searching for into UCS-2LE, and search the binary file for those.

Actually, I suspect that if Unicode strings do start at odd file (or buffer) positions, GrandFather's method will fail to find them.

BTW, a plain-Perl, non-Encode way to convert plain Latin-1 to UCS-2LE is to use pack/unpack:

$ucs2 = pack 'v*', unpack 'C*', $text;
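
As a rough sketch of the inverse approach, the converted string can be searched for directly in the raw bytes. The $buffer below is a made-up stand-in for a chunk read from a file opened with binmode; its contents are an assumption for illustration.

```perl
use strict;
use warnings;

my $text   = 'Author';
my $needle = pack 'v*', unpack 'C*', $text;   # "A\0u\0t\0h\0o\0r\0"

# Hypothetical binary chunk: three junk bytes, the wide string, two nulls.
my $buffer = "\xFF\x00\x10" . $needle . "\x00\x00";

# index compares raw bytes, so the alignment of the match is irrelevant.
my $offset = index $buffer, $needle;
printf "found at byte offset %d\n", $offset if $offset >= 0;
```

Because both the needle and the buffer are plain byte strings, the match is found even though it starts at an odd byte offset.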

Re^2: regular expression searching in binary files
by GrandFather (Saint) on Nov 12, 2006 at 08:46 UTC
    Actually, I suspect that if Unicode strings do start at odd file (or buffer) positions, GrandFather's method will fail to find them.

    Interesting thought. However, I checked it out with the sample code by inserting an extra byte before the 'Author' string, and the match string was still found.

    On reflection, Perl doesn't know anything special about either the match string or the buffer being matched: to the regex engine both are just sequences of bytes. So the meta-information associated with the data (the fact that it is actually UTF-16) is of no consequence, and neither is the alignment.


    DWIM is Perl's answer to Gödel
      I am now looking at what I had finally written, and I've clarified my question in my own mind to ask "how does the R.E. engine handle the metacharacters in a non-text environment?"

      GrandFather's example's \Q...\E led me to enlightenment in perlreref.
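
      For anyone following along, a small sketch of why \Q...\E matters here. The search string 'a.b' and both buffers are invented for illustration; they are not taken from GrandFather's sample code.

```perl
use strict;
use warnings;

# 'a.b' as UCS-2LE bytes; the '.' is a regex metacharacter.
my $needle = pack 'v*', unpack 'C*', 'a.b';   # "a\0.\0b\0"

# Hypothetical buffers: one holds 'axb', the other the real 'a.b'.
my $decoy = "\x07" . "a\0x\0b\0";
my $real  = "\x07" . "a\0.\0b\0";

# Unquoted, '.' matches any byte, so the decoy is a false positive.
my $loose_hit  = $decoy =~ /$needle/      ? 1 : 0;

# \Q...\E quotes every metacharacter, so only the literal bytes match.
my $strict_hit = $decoy =~ /\Q$needle\E/  ? 1 : 0;
my $real_hit   = $real  =~ /\Q$needle\E/  ? 1 : 0;

print "loose: $loose_hit, strict: $strict_hit, real: $real_hit\n";
```

      Without the quoting the engine still treats '.' (and friends) as metacharacters, binary data or not; with it, the pattern is matched byte for byte.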

      Many thi^Hanks
      Donald.

      Many thanks to all; I'll give those a try. I don't think I tried UCS-2, certainly not UCS-2LE.

      Donald.