regular expression searching in binary files

dhlocker has asked for the wisdom of the Perl Monks concerning the following question:

ActiveState Perl 5.8.something (I'm not at the machine right now.) on Micros**t Windoze XP (or sometimes 2000)

I'm trying to extract subversion (or RCS, or CVS, or whatever version control system) keywords and values from compiled sources, encoded as 16-bit elements. I can see (with emacs) the characters in their place, but my REs don't really extract what I need. Sometimes, but not always. Looks like octet-alignment or line-alignment (which doesn't exist for binary files, of course) problems.

Sooooo. My statement (if it were dealing with a plain text file) would be

print $1 if (m#(\$(Author|Date|Id|URL|Version): [-\.\$ _a-zA-Z0-9]+\$)#);

If I use \000A\000u\000t\000h\000o\000r for Author and similarly substitute for each of the literal characters I'm seeking I can find the keywords. But extracting the values as REs has eluded me.

All ASCII characters in the files I'm examining are represented in two octets, the first being 0x00, the second being the normal ASCII character.

I've tried variously, use utf8; use various encodings but my matches don't capture the strings I'm seeking.

Suggestions? and TIA.
(RTFMs would be welcome; perlre, perlreref, perlretut, and searching on unicode in perl docs found no help.)
Donald.

Re: regular expression searching in binary files by GrandFather (Saint) on Nov 12, 2006 at 05:22 UTC
Looks like the strings you are trying to match are UTF-16, but buried in a binary file. I'd recommend you use `binmode` on the file handle you are using to read the data, and then you can do this:

`use warnings; use strict; use Encode; my $binstr = "\x{00}\x{01}\x{02}\x{03}\x{04}\x{05}" . "\x{00}A\x{00}u\x{00}t\x{00}h\x{00}o\x{00}r\x{00}" . "\x{80}\x{90}\x{a0}\x{b0}\x{c0}\x{d0}\x{e0}"; my $matchStr = encode ('utf16be', 'Author'); if ($binstr =~ /(\Q$matchStr\E)/) { my $match = decode ('utf16be', $1); print "Found $match\n"; }`

Prints: `Found Author`

Note that this assumes big endian, which seems to match your example, but it could be little endian, which is native for Windows systems and normal for the net.

DWIM is Perl's answer to Gödel
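To go from finding the keyword name to extracting the whole `$Keyword: value $` string, the same encode-then-quote idea can be extended to the value part. The following is only a sketch under assumptions: the helper name `find_keywords_utf16be` is made up for illustration, the data is assumed to be UTF-16BE as in the question, and the value character class is the one from the question minus the `$`, so that a value stops at the first closing `$`.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: return every "$Keyword: value $" string that is
# stored as UTF-16BE somewhere inside a buffer of raw bytes.
sub find_keywords_utf16be {
    my ($bytes) = @_;

    # Encode the literal parts and quote them, so that no byte (the "$"
    # in particular) is read as a regex metacharacter.
    my $names  = join '|', map { quotemeta(encode('UTF-16BE', $_)) }
                           qw(Author Date Id URL Version);
    my $dollar = quotemeta(encode('UTF-16BE', '$'));
    my $sep    = quotemeta(encode('UTF-16BE', ': '));

    # Each value character is a 0x00 byte followed by one ASCII byte.
    my $re = qr/($dollar(?:$names)$sep(?:\x00[-. _a-zA-Z0-9])+$dollar)/;

    my @found;
    while ($bytes =~ /$re/g) {
        my $hit = $1;                       # copy before decoding
        push @found, decode('UTF-16BE', $hit);
    }
    return @found;
}

# Sample buffer: binary junk around one UTF-16BE keyword string.
my $data = "\x01\x02\xff"
         . encode('UTF-16BE', '$Author: dhlocker $')
         . "\x80\x90";
print "$_\n" for find_keywords_utf16be($data);
```

For little-endian data the same sketch should work with 'UTF-16LE' in place of 'UTF-16BE' and the value part written as `(?:[-. _a-zA-Z0-9]\x00)+`.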
Re: regular expression searching in binary files by bart (Canon) on Nov 12, 2006 at 07:40 UTC
"I've tried variously, use utf8; use various encodings but my matches don't capture the strings I'm seeking."

You must not have tried the most appropriate encoding (IMO), namely UCS-2, and most likely, this being Windows, it's little endian (UCS-2LE): the plain ASCII/Latin-1/Windows-1252 character comes first, the null byte comes next.

But GrandFather is most likely right: you're trying to find Unicode strings inside a binary file. So treating the whole file as 16-bit Unicode, using binmode or open to set the encoding of the filehandle to 'ucs2le', for example with `open IN, '<:encoding(ucs2le)', $file`, is likely to fail, as characters needn't start at even file positions in the binary file.

So you could try GrandFather's approach, which is a very sensible one, or you could do the inverse, and convert the strings you're searching for into UCS-2LE, and search the binary file using that. Actually, I suspect that if indeed Unicode strings start at odd file (or buffer) positions, GrandFather's method will fail to find them.

BTW, a plain Perl, non-Encode way to convert plain Latin-1 to UCS-2LE is using pack/unpack (note the `*` repeat counts, without which only the first character is converted): `$ucs2 = pack 'v*', unpack 'C*', $text;`
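A minimal sketch of the pack/unpack route, assuming the text really is Latin-1 (one byte per character). The helper names `latin1_to_ucs2le` and `ucs2le_to_latin1` are invented for this example; the search itself uses `index`, so no regex quoting is needed.

```perl
use strict;
use warnings;

# Latin-1 text -> UCS-2LE bytes: each 8-bit character becomes a
# little-endian 16-bit value (character byte first, then 0x00).
sub latin1_to_ucs2le {
    my ($text) = @_;
    return pack 'v*', unpack 'C*', $text;
}

# The reverse direction; only meaningful while every 16-bit unit is
# below 0x100, which is an assumption of this sketch.
sub ucs2le_to_latin1 {
    my ($bytes) = @_;
    return pack 'C*', unpack 'v*', $bytes;
}

my $needle = latin1_to_ucs2le('Author');
my $buffer = "\x01\x02\x03" . latin1_to_ucs2le('$Author: bart $') . "\xfe";

# Plain byte search; works at odd and even offsets alike.
my $pos = index $buffer, $needle;
print "found at byte offset $pos\n" if $pos >= 0;
```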
Re^2: regular expression searching in binary files by GrandFather (Saint) on Nov 12, 2006 at 08:46 UTC
"Actually, I suspect that if indeed Unicode strings start at odd file (or buffer) positions, GrandFather's method will fail to find them."

Interesting thought. However, I checked it out with the sample code by inserting an extra byte before the 'Author' string, and the match string was still found. On reflection, Perl doesn't know anything special about either the match string or the buffer being matched, so the meta information associated with the data (the fact that it is actually UTF-16) is of no consequence.
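The check described here can be reproduced roughly as follows. This is a sketch, not the exact code that was run: `match_offset` is a hypothetical helper that reports where the byte pattern matched, and the buffer contents are made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: byte offset of the first match, or -1 if none.
sub match_offset {
    my ($buffer, $needle) = @_;
    return $buffer =~ /\Q$needle\E/ ? $-[0] : -1;
}

my $needle = encode('UTF-16BE', 'Author');

# Try the same buffer without and with one extra byte in front of the
# string, so the string starts once at an odd and once at an even offset.
for my $pad ('', "\xff") {
    my $buffer = "\x00\x01\x02" . $pad . $needle . "\x80\x90";
    printf "padding %d byte(s): match at offset %d\n",
        length($pad), match_offset($buffer, $needle);
}
```

Both iterations should report a match, because the regex engine compares bytes and has no notion of 16-bit alignment.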
Re^3: regular expression searching in binary files by dhlocker (Novice) on Nov 12, 2006 at 14:20 UTC
Many thanks to all; I'll give those a try. I don't think I tried UCS-2, certainly not UCS-2LE. Donald.
Re^3: regular expression searching in binary files by dhlocker (Novice) on Nov 13, 2006 at 13:25 UTC
I am now looking at what I had finally written, and I've clarified my question in my own mind to ask "how does the R.E. engine handle the metacharacters in a non-text environment?" GrandFather's example's \Q...\E led me to enlightenment in perlreref. Many thi^Hanks. Donald.
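To make the metacharacter point concrete: an interpolated byte string is compiled as a pattern, so a `$` byte acts as an end-of-string anchor and a `.` byte matches any character unless the string is quoted with \Q...\E (or `quotemeta`). A small sketch with made-up data and hypothetical helper names `matches_raw` and `matches_quoted`:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Interpolate the needle as-is: its bytes are compiled as regex syntax.
sub matches_raw {
    my ($buffer, $needle) = @_;
    return $buffer =~ /$needle/ ? 1 : 0;
}

# Interpolate the needle quoted: every byte is matched literally.
sub matches_quoted {
    my ($buffer, $needle) = @_;
    return $buffer =~ /\Q$needle\E/ ? 1 : 0;
}

# Case 1: the "$" byte in the needle becomes an anchor, so the raw
# pattern misses a string that is really there.
my $buffer    = "\x07" . encode('UTF-16BE', '$Id: x $') . "\x08";
my $dollar_id = encode('UTF-16BE', '$Id');
printf "\$Id  raw=%d quoted=%d\n",
    matches_raw($buffer, $dollar_id), matches_quoted($buffer, $dollar_id);

# Case 2: the "." byte in the needle matches any character, so the raw
# pattern reports a string that is not there.
my $axb = encode('UTF-16BE', 'axb');
my $a_b = encode('UTF-16BE', 'a.b');
printf "a.b  raw=%d quoted=%d\n",
    matches_raw($axb, $a_b), matches_quoted($axb, $a_b);
```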
Re: regular expression searching in binary files by aufflick (Deacon) on Nov 13, 2006 at 01:18 UTC
This is not really answering your question (especially since you're on Windows), but do you know that the Unix command `ident` does exactly what you're after?

`IDENT(1) -- 1993/11/09 -- GNU NAME ident - identify RCS keyword strings in files SYNOPSIS ident [ -q ] [ -V ] [ file ... ] DESCRIPTION ident searches for all instances of the pattern $keyword: text $ in the named files or, if no files are named, the standard input.`

Certainly I have used `ident` successfully on binary files under Cygwin on Windows.
Re^2: regular expression searching in binary files by dhlocker (Novice) on Nov 13, 2006 at 02:32 UTC
Ident is exactly what I am trying to emulate on this "platform." Unfortunately, ident doesn't seem to find these double-octet encoded strings as I thought it would, so I turned to Perl. ident does work fine on the source code, of course. (I use a _lot_ of cygwin to get me through the day in this Micros**t shop.) Maybe I did something wrong, though with ident that's hard to do :) Thanks for the thought; I'll try again. Donald.
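One way to emulate `ident` for both plain and double-octet strings in a single pass is to allow an optional 0x00 byte after every character of the pattern and strip the NULs from whatever matched. This is a deliberately loose sketch, not a reimplementation of `ident`: the names `ident_like` and `slurp_binary` are hypothetical, the keyword list and value character class are taken from the question (without the `$` in the class), and the pattern would also accept odd mixtures of NUL placement.

```perl
use strict;
use warnings;

# Hypothetical helper: find "$Keyword: value $" whether it is stored as
# plain ASCII or with a 0x00 byte before or after each character.
sub ident_like {
    my ($bytes) = @_;
    my $z = '\x00?';    # regex source text for "an optional NUL byte"

    # "Author" becomes A\x00?u\x00?t\x00?h\x00?o\x00?r, and so on.
    my $names = join '|',
        map { join $z, split //, $_ } qw(Author Date Id URL Version);
    my $value = "(?:[-. _a-zA-Z0-9]$z)+";
    my $src   = '\$' . $z . "(?:$names)" . $z . ':' . $z . ' ' . $z
              . $value . '\$';
    my $re    = qr/($src)/;

    my @found;
    while ($bytes =~ /$re/g) {
        (my $text = $1) =~ tr/\x00//d;    # drop the interleaved NULs
        push @found, $text;
    }
    return @found;
}

# Read a whole file as raw bytes.
sub slurp_binary {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    local $/;                             # slurp mode
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Usage: perl thisscript.pl file1 file2 ...
print "$_\n" for map { ident_like(slurp_binary($_)) } @ARGV;
```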