dhlocker has asked for the wisdom of the Perl Monks concerning the following question:

ActiveState Perl 5.8.something (I'm not at the machine right now.) on Micros**t Windoze XP (or sometimes 2000)

I'm trying to extract subversion (or RCS, or CVS, or whatever version control system) keywords and values from compiled sources, encoded as 16-bit elements. I can see (with emacs) the characters in their place, but my REs don't really extract what I need. Sometimes, but not always. Looks like octet-alignment or line-alignment (which doesn't exist for binary files, of course) problems.

Sooooo. My statement (if it were dealing with a plain text file) would be

print $1 if (m#(\$(Author|Date|Id|URL|Version): [-\.\$ _a-zA-Z0-9]+\$) +#);

If I use \000A\000u\000t\000h\000o\000r for Author and similarly substitute for each of the literal characters I'm seeking I can find the keywords. But extracting the values as REs has eluded me.

All ASCII characters in the files I'm examining are represented in two octets, the first being 0x00, the second being the normal ASCII character.

I've tried variously, use utf8; use various encodings but my matches don't capture the strings I'm seeking.

Suggestions? and TIA.
(RTFMs would be welcome; perlre, perlreref, perlretut, and searching on unicode in perl docs found no help.)
Donald.

Re: regular expression searching in binary files
by GrandFather (Saint) on Nov 12, 2006 at 05:22 UTC

    Looks like the strings you are trying to match are UTF-16, but buried in a binary file. I'd recommend you use binmode on the file handle you are using to read the data and then you can:

    use warnings;
    use strict;
    use Encode;

    my $binstr = "\x{00}\x{01}\x{02}\x{03}\x{04}\x{05}"
        . "\x{00}A\x{00}u\x{00}t\x{00}h\x{00}o\x{00}r\x{00}"
        . "\x{80}\x{90}\x{a0}\x{b0}\x{c0}\x{d0}\x{e0}";
    my $matchStr = encode ('utf16be', 'Author');

    if ($binstr =~ /(\Q$matchStr\E)/) {
        my $match = decode ('utf16be', $1);
        print "Found $match\n";
    }

    Prints:

    Found Author

    Note that this assumes big-endian, which seems to match your example, but the data could be little-endian, which is native on Windows systems.
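The same byte-level idea can be stretched to capture the keyword values as well. The following is only a sketch, assuming big-endian data as described in the question (a NUL byte followed by the ASCII character); the sub name `find_keywords` and the sample buffer are made up for illustration. Each character of the original plain-text pattern is simply written as a two-byte unit.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Keyword alternatives, each encoded as UTF-16BE bytes and regex-quoted.
my $kw = join '|', map { quotemeta encode('UTF-16BE', $_) }
         qw(Author Date Id URL Version);

# One allowed value character is two bytes: a NUL, then the character.
my $value_char = qr/\x00[-.\$ _a-zA-Z0-9]/;

# "$Keyword: value $", written as two-byte units.
my $keyword_re = qr/(\x00\$(?:$kw)\x00:\x00 (?:$value_char)+\x00\$)/;

# Return every keyword string found in a raw byte buffer, decoded to text.
sub find_keywords {
    my ($bytes) = @_;
    my @found;
    while ($bytes =~ /$keyword_re/g) {
        my $hit = $1;                        # copy the capture before decoding
        push @found, decode('UTF-16BE', $hit);
    }
    return @found;
}

# Made-up sample: binary junk around one encoded keyword string.
my $sample = "\x01\x02\x03"
           . encode('UTF-16BE', '$Author: dhlocker $')
           . "\xff\xfe\x80";

print "$_\n" for find_keywords($sample);
```

For little-endian data the NUL would follow each character instead of preceding it, so both the encoding name and the `\x00` positions in the pattern would need to be swapped.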


    DWIM is Perl's answer to Gödel
Re: regular expression searching in binary files
by bart (Canon) on Nov 12, 2006 at 07:40 UTC
    I've tried variously, use utf8; use various encodings but my matches don't capture the strings I'm seeking.
    You must not have tried the most appropriate encoding (IMO), namely UCS-2, and most likely, this being Windows, it's Little Endian (UCS-2LE): the plain ASCII/Latin-1/Windows-1252 character comes first, the null byte comes next.

    But GrandFather is most likely right: you're trying to find Unicode strings inside a binary file. So treating the whole file as 16-bit Unicode, by using binmode or open to set the encoding of the filehandle to 'ucs2le', for example with

    open IN, '<:encoding(ucs2le)', $file
    may well fail, as characters needn't start at even file positions in the binary file.

    So you could try grandfather's approach, which is a very sensible one, or you could do the inverse, and convert the strings you're searching for into UCS-2LE, and search the binary file using that.

    Actually, I suspect that if indeed Unicode strings start at odd file (or buffer) positions, GrandFather's method will fail to find them.

    BTW, a plain-Perl, non-Encode way to convert plain Latin-1 to UCS-2LE is using pack/unpack:

    $ucs2 = pack 'v*', unpack 'C*', $text;
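As a rough illustration of the pack/unpack conversion above, the sketch below builds the little-endian form of a search string, checks it against Encode, and looks for it in a byte buffer. The buffer contents are invented for the example, and the input is assumed to be plain ASCII/Latin-1.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = 'Author';

# 'C*' unpacks each byte to its numeric value; 'v*' packs each value as a
# 16-bit little-endian integer, i.e. the character byte first, then a NUL.
my $ucs2le = pack 'v*', unpack 'C*', $text;

# For plain ASCII/Latin-1 input this should agree with Encode's UTF-16LE.
print "pack and Encode agree\n" if $ucs2le eq encode('UTF-16LE', $text);

# Made-up buffer: junk bytes, then the little-endian string, then more junk.
my $buffer = "\x07\x08\x09" . $ucs2le . "\xde\xad";

# A fixed string needs no regex; index works on the raw bytes.
my $pos = index $buffer, $ucs2le;
print "found at byte offset $pos\n";
```

Using 'n*' in place of 'v*' would give the big-endian byte order instead.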
      Actually, I suspect that if indeed Unicode strings start at odd file (or buffer) positions, GrandFather's method will fail to find them.

      Interesting thought. However I checked it out with the sample code by inserting an extra byte before the 'Author' string and the match string was still found.

      On reflection Perl doesn't know anything special about either the match string or the buffer being matched so the fact that there is meta information (the fact that it is actually utf-16) associated with the data is of no consequence.
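The check described above can be reproduced along these lines. This is a sketch rather than the code that was actually run: the sub name `offset_with_padding` and the 0xFF filler bytes are assumptions. It pads the buffer with a varying number of bytes so that the encoded string starts at both even and odd offsets.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $needle = encode('UTF-16BE', 'Author');

# Put $padding filler bytes in front of the encoded string and report the
# byte offset at which the pattern matches, or -1 if it does not match.
sub offset_with_padding {
    my ($padding) = @_;
    my $buffer = ("\xff" x $padding) . $needle . "\xff";
    return $buffer =~ /\Q$needle\E/ ? $-[0] : -1;
}

for my $padding (0 .. 3) {
    my $offset = offset_with_padding($padding);
    print "padding $padding: match at byte offset $offset\n";
}
```

Because the match runs over plain bytes, the reported offset simply tracks the padding, whether even or odd.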


      DWIM is Perl's answer to Gödel
        Many thanks to all; I'll give those a try. I don't think I tried UCS-2, certainly not UCS-2LE.

        Donald.

        I am now looking at what I had finally written, and I've clarified my question in my own mind to ask "how does the R.E. engine handle the metacharacters in a non-text environment."

        GrandFather's \Q...\E example led me to enlightenment in perlreref.

        Many thi^Hanks
        Donald.

Re: regular expression searching in binary files
by aufflick (Deacon) on Nov 13, 2006 at 01:18 UTC
    This is not really answering your question (especially since you're on Windows), but do you know the Unix command ident does exactly what you're after?

    IDENT(1) -- 1993/11/09 -- GNU

    NAME
        ident - identify RCS keyword strings in files

    SYNOPSIS
        ident [ -q ] [ -V ] [ file ... ]

    DESCRIPTION
        ident searches for all instances of the pattern $keyword: text $ in the named files or, if no files are named, the standard input.
    Certainly I have used ident successfully on binary files under Cygwin on Windows.
      Ident is exactly what I am trying to emulate on this "platform." Unfortunately, ident doesn't seem to find these double-octet encoded strings as I thought it would, so I turned to Perl. ident does work fine on the source code, of course. (I use a _lot_ of cygwin to get me through the day in this Micros**t shop.) Maybe I did something wrong, though with ident that's hard to do :)

      Thanks for the thought; I'll try again.

      Donald.