in reply to regular expression searching in binary files
"I've tried variously, use utf8; use various encodings but my matches don't capture the strings I'm seeking."
You must not have tried the most appropriate encoding (IMO), namely UCS-2. This being Windows, it's most likely little-endian (UCS-2LE): for a plain ASCII/Latin-1 character, the character's byte comes first and a null byte follows.
But GrandFather is most likely right: you're trying to find Unicode strings inside a binary file. Treating the whole file as 16-bit Unicode, by using binmode or open to set the encoding of the filehandle to 'ucs2le', for example with
    open IN, '<:encoding(ucs2le)', $file;
may well fail, as characters needn't start at even file positions in the binary file.
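To make the alignment problem concrete, here is a minimal sketch. The two buffers are made up for illustration (they are not from the original poster's file), and the second one is padded to an even length so the decoder sees only whole 16-bit units. Decoding the entire buffer as UCS-2LE recovers the text only when it happens to start on an even byte offset.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Hello" as UCS-2LE bytes: each character byte followed by a null byte.
my $needle_bytes = pack 'v*', unpack 'C*', 'Hello';

# Two hypothetical buffers: the string starts at byte offset 2 (even)
# in the first and at byte offset 1 (odd) in the second.
my $even_buf = "\x01\x02" . $needle_bytes;
my $odd_buf  = "\x01" . $needle_bytes . "\x02";   # padded to an even length

# Decode each whole buffer as UCS-2LE, then search the decoded characters.
my $even_pos = index(decode('UCS-2LE', $even_buf), 'Hello');
my $odd_pos  = index(decode('UCS-2LE', $odd_buf),  'Hello');

print "even offset: $even_pos\n";   # character position of the match
print "odd offset:  $odd_pos\n";    # -1 means "not found"
```

In the odd case every 16-bit unit straddles two neighbouring characters, so the decoded text contains no "Hello" at all.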
So you could try GrandFather's approach, which is a very sensible one, or you could do the inverse: convert the strings you're searching for into UCS-2LE, and search the raw bytes of the binary file for those.
Actually, I suspect that if Unicode strings do start at odd file (or buffer) positions, GrandFather's method will fail to find them.
BTW, a plain-Perl, non-Encode way to convert plain Latin-1 to UCS-2LE is to use pack/unpack:
$ucs2 = pack 'v*', unpack 'C*', $text;
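Putting the two ideas together, a rough sketch of the "inverse" search might look like the following. The helper names to_ucs2le and find_ucs2le and the sample buffer are hypothetical, invented here for illustration. Because the search runs over raw bytes, it does not care whether a match starts at an odd or an even offset.

```perl
use strict;
use warnings;

# Convert a Latin-1 string to UCS-2LE bytes (the pack/unpack trick above).
sub to_ucs2le {
    my ($text) = @_;
    return pack 'v*', unpack 'C*', $text;
}

# Return every byte offset at which $text, encoded as UCS-2LE,
# occurs in the raw byte string $buffer.
sub find_ucs2le {
    my ($buffer, $text) = @_;
    my $needle = to_ucs2le($text);
    my @offsets;
    my $pos = 0;
    while (($pos = index($buffer, $needle, $pos)) >= 0) {
        push @offsets, $pos;
        $pos++;    # step past this match and keep looking
    }
    return @offsets;
}

# Hypothetical sample data: one copy at an odd offset, one further along.
my $buffer = "\x01" . to_ucs2le('Hello') . "\xFF\xFE" . to_ucs2le('Hello');

my @found = find_ucs2le($buffer, 'Hello');
print "found at byte offsets: @found\n";
```

For a real file, the buffer would be read through a filehandle opened with the ':raw' layer (or with binmode applied), so that no decoding or newline translation touches the bytes before the search.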