Re: How do I search this binary file?
by tachyon (Chancellor) on Aug 20, 2002 at 19:45 UTC
my $last4  = '';
my $buffer = '';
my $find   = qr/1234/;
my $chunk  = 7; # 2**20 or some such is more logical!
while ( read( DATA, $buffer, $chunk ) ) {
    print "got last-$last4\tbuffer-$buffer\n";
    # matches wholly inside the newly read chunk (s/// removes them
    # so they cannot be matched a second time below)
    print "simple match\n" while $buffer =~ s/$find//;
    # prepend the carried-over tail, then keep a new 4-byte tail for
    # matches that straddle the chunk boundary
    $buffer = $last4 . $buffer;
    $last4  = substr $buffer, -4, 4, '';
    print "buffer match\n" if $buffer =~ m/$find/;
}
__DATA__
1234567890123456789012345678901234567890123456789012345678901234567890
__END__
got last- buffer-1234567
simple match
got last-567 buffer-8901234
simple match
got last-7890 buffer-5678901
got last-8901 buffer-2345678
buffer match
got last-5678 buffer-9012345
simple match
got last-8905 buffer-6789012
got last-9012 buffer-3456789
buffer match
got last-6789 buffer-0123456
simple match
got last-9056 buffer-7890123
got last-0123 buffer-4567890
buffer match
got last-7890 buffer-
In this example I use a chunk size of 7 so that the boundary buffer gets exercised often, but 2**20-byte (megabyte) chunks work well with big files (I wrote a node about processing big files fast at Re: Performance Question). You should be able to get around 4-8MB/s throughput depending on your hardware.
I'll leave it as an exercise for you as to what to do with the data between matches. Logging the positions and making a second pass through the file to get the data may well be most efficient.
Note that when we match within the newly read chunk we s/// the match out before we prepend the previously buffered tail and retest; this avoids matching the same occurrence twice. In real use you would substitute a filler string of the same length rather than deleting the match, so that lengths and offsets are maintained. There are 7 matches available and 7 found. A dozen lines with debugging - you gotta love Perl.
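The second pass itself is straightforward; a minimal sketch, assuming the match offsets were collected in @positions and each record is $record_len bytes:
# sketch only: $file, @positions and $record_len are placeholders
open my $fh, '<', $file or die "cannot open $file: $!";
binmode $fh;
for my $offset (@positions) {
    seek $fh, $offset, 0 or die "seek failed: $!";   # 0 = SEEK_SET
    read $fh, my $record, $record_len;
    # ... process $record here ...
}
close $fh;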
Update
Here is a version that records the positions of the matches for you:
my $last4   = '';
my $buffer  = '';
my $find    = '1234';
my $find_re = qr/$find/;
my $length  = length $find;
my $chunk   = 7; # 2**20 or some such is more logical!
my $pos     = 0;
while ( my $read_length = read( DATA, $buffer, $chunk ) ) {
    $buffer = $last4 . $buffer;
    print "got last-$last4\tbuffer-$buffer\n";
    while ( $buffer =~ m/$find_re/g ) {
        # pos() is the offset just past the match within $buffer;
        # subtracting the carried-over tail and the pattern length
        # converts it to the file offset of the match start
        print "match at ", $pos - (length $last4) - $length + pos $buffer, "\n";
    }
    $last4 = substr $buffer, -$length, $length, '';
    $last4 = '' if $last4 =~ m/$find_re/; # stop double match bug
    $pos += $read_length;
}
__DATA__
1234567890123456789012345678901234567890123456789012345678901234567890
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Very good start, but I think some more can be squeezed out of it.
my $buffer;
my $delim     = '1234';
my $delim_len = length $delim;
my $chunk_len = 65536 - $delim_len;
my $read_len  = read $fh, $buffer, $delim_len;  # prime the buffer
my $pos       = 0;   # file offset of the first byte in $buffer
while ($read_len) {
    my $rel_pos = -1;
    print "delim at offset: ", $pos + $rel_pos, "\n"
        while ($rel_pos = index $buffer, $delim, $rel_pos + 1) > -1;
    last if length($buffer) < $delim_len;
    # carry over one byte less than the delimiter length, so that a
    # delimiter found flush against the end is not reported twice
    $pos += length($buffer) - ($delim_len - 1);
    $buffer = substr $buffer, -($delim_len - 1);
    $read_len = read $fh, $buffer, $chunk_len, $delim_len - 1;
}
A slower but single-pass alternative would be to shift the buffer back every time we find a match.
my $buffer    = '';
my $delim     = '1234';
my $chunk_len = 65536;
my $delim_len = length $delim;
my $read_len  = read $fh, $buffer, $chunk_len;
while (length $buffer) {
    my $rel_pos = index $buffer, $delim;
    if ($rel_pos > -1) {
        do_checks_on(substr $buffer, 0, $rel_pos);
        # pull everything after the match to the front of the buffer
        $buffer = substr $buffer, $rel_pos + $delim_len;
    } else {
        last unless $read_len;   # EOF and no delimiter left
        # keep just enough trailing bytes for a split delimiter
        $buffer = substr $buffer, -$delim_len if length($buffer) > $delim_len;
    }
    $read_len = read $fh, $buffer, $chunk_len - length $buffer, length $buffer;
}
Warning: this is untested code, though I don't see any glaring mistakes in it. It pulls everything after a match to the front of the buffer, refills the back of the buffer and then looks for the next match. If none is found, it takes another whole bufferful bite out of the file.
Both of these snippets take care to carry the tail of the buffer over to the front, reading the next load into the buffer at that offset, so any delimiters falling across the top boundary of the buffer are not a concern.
Makeshifts last the longest.
I don't understand your point. Why does double-scanning (and copying a megabyte) prevent "double matches"? Copying just the last 3 bytes (not 4!) to the beginning and then reading the next meg into the scalar after that offset would avoid the copy, and it can't match twice because the 4 bytes in the pattern are all different, so matches can't overlap. Even without that restriction, the "junking" of previously found matches would take care of it.
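In code, that refill would look something like this (a sketch; $fh, $buffer and the one-megabyte chunk size are placeholders):
# keep only the last 3 bytes - at most a partial match - then let
# 4-argument read() append the next chunk after them in place
$buffer = substr $buffer, -3;
read $fh, $buffer, 2**20, 3;   # no big copy, no re-scanning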
When you read from a disk you get a minimum of 512 bytes (one sector), but in reality the disk reads and buffers a decent sized chunk - it varies, but ever wondered why disks have onboard RAM? Practical experimentation reveals an optimum read size (for a Perl match type program) of 1-4MB, as outlined in the Re: Performance Question link.
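A quick way to measure this on your own hardware (a sketch; $file stands in for some large file, and note that the OS cache will flatter repeated runs):
use Time::HiRes qw(time);
for my $chunk ( 2**12, 2**16, 2**20, 2**22 ) {
    open my $fh, '<', $file or die $!;
    binmode $fh;
    my ( $buf, $bytes ) = ( '', 0 );
    my $t0 = time;
    $bytes += length $buf while read $fh, $buf, $chunk;
    my $elapsed = ( time - $t0 ) || 1e-6;   # guard against tiny files
    printf "%9d-byte reads: %6.1f MB/s\n", $chunk, $bytes / $elapsed / 2**20;
    close $fh;
}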
OK, 3 bytes is fine if all the bytes are different so that matches cannot overlap - you did not specify that.
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: How do I search this binary file?
by bluto (Curate) on Aug 20, 2002 at 19:20 UTC
The only way I can think of doing this *scalably* wrt memory seems pretty boring (i.e. you have to read in fixed-size blocks or something similar). Could you just use sysopen/sysread to read blocks of data in and then remember the offsets of the 4-byte sequences? You could then go back, seek directly to a remembered offset and sysread the data in, repeating as needed. You do have to worry about the case where the 4 bytes overlap the end of a block. You could handle this by always reading ahead one block and copying the first 3 bytes of the next block onto the end of the one you are working on before you check for the 4-byte sequence, as sketched below. (I've written too much C to be able to come up with anything more elegant.)
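That scheme translates fairly directly (an untested sketch; $fh, $delim and @offsets are placeholders, and the block size is arbitrary):
# read one block ahead; stitch the first 3 bytes of the next block
# onto the current one so boundary-straddling sequences are caught
my $blk = 2**20;
my ( $cur, $next, $base ) = ( '', '', 0 );
my @offsets;
sysread $fh, $cur, $blk;
while ( length $cur ) {
    $next = '';
    sysread $fh, $next, $blk;
    my $scan = $cur . substr $next, 0, 3;
    my $at = -1;
    while ( ( $at = index $scan, $delim, $at + 1 ) > -1 ) {
        push @offsets, $base + $at;   # remember; sysseek back later
    }
    $base += length $cur;
    $cur = $next;
}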
bluto
Re: How do I search this binary file?
by derby (Abbot) on Aug 20, 2002 at 19:29 UTC
Or you could depend on the buffering power of the IO libraries and inspect your data four bytes at a time, rewinding three bytes whenever no match is found. Not as fast as fixed-size records, but not as bad as you probably think it is.
#!/usr/bin/perl -w
$file = shift @ARGV || die "$0 <filename>\n";
open( FH, "<$file" ) or die "cannot open file $!\n";
# the delimiter here is the literal string "MARK"
$in_read = 0;
while ( read( FH, $four_bytes, 4 ) > 0 ) {
    last if ( length( $four_bytes ) != 4 );  # EOF?
    if ( $four_bytes eq "MARK" ) {
        print "buffer was ", $buffer, "\n" if $in_read;
        $in_read = $in_read ? 0 : 1;  # toggle: entering/leaving a record
        $buffer = "";
    } else {
        $buffer .= substr( $four_bytes, 0, 1 ) if $in_read;
        # back up three bytes so the window slides one byte at a time
        $pos = tell( FH ) - 3;
        seek( FH, $pos, 0 );
    }
}
# catch any trailing bytes
$buffer .= $four_bytes if $in_read;
# do something with the last buffer
print "dangling buffer: ", $buffer, "\n" if $buffer;
-derby
Reading 4 bytes at a time is incredibly inefficient and slow. See my answer above for a solution, and Re: Performance Question for an explanation.
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: How do I search this binary file?
by Zaxo (Archbishop) on Aug 20, 2002 at 23:25 UTC
You've gotten plenty of good suggestions for the processing. I have one to add on the size of the chunks to read. In list context, the stat call returns a preferred block size at index eleven.
sub strings_from_bin {
    my ($file, $dlim) = @_;
    local *BIN;
    open BIN, "< " . $file or die $!;
    binmode BIN;
    # a reference to a number in $/ makes <BIN> return fixed-size
    # records of the filesystem's preferred block size
    local $/ = \(stat BIN)[11];
    my @strings;
    # ... populate @strings from <BIN>
    close BIN or die $!;
    @strings;
}
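The elided loop might be as simple as this (a sketch only; the real harvesting logic depends on the format):
# read block-sized records and keep those containing the delimiter
# (handling of delimiters split across record boundaries is omitted)
while ( my $rec = <BIN> ) {
    push @strings, $rec if index( $rec, $dlim ) > -1;
}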
Update: I believe that the block size on FAT filesystems can vary, but I have been unable to track down a way of reading it. I seem to recall that 4k and 32k were popular. Can anyone more familiar with Win32 help?
After Compline, Zaxo
I've never noticed those fields before. Hmm, on Win32, stat only returned 11 elements, not 13. Found: "Platforms that do not have rdev, blksize, or blocks will return these as '', so numeric comparison or manipulation of these fields may cause 'not numeric' warnings."
So it should really be something like:
$blocksize = (stat BIN)[11] || 2048;
Re: How do I search this binary file?
by kschwab (Vicar) on Aug 20, 2002 at 19:42 UTC
How about a find/rewind approach? Go ahead and set $/ to your delimiter, but don't assign to a variable as you read (for the sake of conserving memory on huge records). Once you have the record boundaries, you can seek() back and process them however you need to. This makes the assumption that the bare <FILE> construct isn't using memory; I'm not clear on whether that's true or not.
#!/usr/bin/perl
my $delim = "asdf";
$/ = $delim;
open( FILE, "/some/big/file" ) or die "cannot open: $!";
my $lastoffset = 0;
for (;;) {
    <FILE>;   # read up to the next delimiter, discarding the record
    my $offset = tell( FILE );
    last if eof( FILE );
    print "this record spans [$lastoffset] to [$offset], including ";
    print "the delimiter\n";
    print " this means you could seek to $lastoffset and read for ";
    printf "%d bytes to get the record\n",
        $offset - $lastoffset - length( $delim );
    $lastoffset = $offset;
}
If <HANDLE> in void context functioned as a scan rather than a read, that would be terrific and elegant.
It is a read, but in void context it doesn't appear to use excessive memory (I suppose since it's not filling $_ or any other variable). I did try this on a fairly large file (with large "records") and didn't notice significant growth in the memory footprint compared to a smaller file. Did you have a different experience?
Update: Tried it on a really large file, and sure enough, it eats up VM :) Never mind... sounds like asking the gods to map <HANDLE> in void context onto a scan might be a good idea, though...
Re: How do I search this binary file?
by talexb (Chancellor) on Aug 20, 2002 at 19:46 UTC
If I were faced with this challenge, I would reuse a technique from my days as a C programmer: read a chunk of N bytes from the file into a buffer and search my way through it, stopping once I got past the halfway point.
I'd then shift the buffer (and the point where I'm searching in it) down by N/2 bytes, read another N/2 bytes into the now-free half of the buffer, and continue until the pattern was found or the file was exhausted.
I can't provide any working code at this time (paying work intervenes) but that should get you going.
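Something along these lines, perhaps (an untested sketch of the scheme above; $fh and the 4-byte $pattern are placeholders):
my $n    = 2**16;   # full buffer size
my $half = $n / 2;
my ( $buffer, $base ) = ( '', 0 );
read $fh, $buffer, $n;
while (1) {
    my $at = index $buffer, $pattern;
    if ( $at > -1 ) {
        print "found at offset ", $base + $at, "\n";
        last;
    }
    last if length($buffer) <= $half;   # EOF tail already searched
    # slide: keep the top half of the buffer, refill behind it
    $base  += length($buffer) - $half;
    $buffer = substr $buffer, -$half;
    read $fh, $buffer, $half, $half;
}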
--t. alex
but my friends call me T.
Re: How do I search this binary file?
by dpuu (Chaplain) on Aug 20, 2002 at 20:04 UTC
An alternative approach is the old-fashioned state machine:
- set $/ to the first character of the separator (this is less likely to use gigabytes than looking for all 4). Read and discard a block.
- now read the next character: if it's the first char of the separator again, repeat this step; if it's not the second char, goto step 1
- now read the next: if it's not the third, goto step 1
- read the next: if it's the fourth, you've found the start of your block. Otherwise, goto 1
Now repeat a similar state machine for the end separator; the only difference is that you don't discard anything while you're in the middle of your block. A sketch of the first machine is below.
--Dave
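A rough sketch of that first state machine, assuming $fh and a separator of "1234" (so $/ is set to "1"):
local $/ = "1";   # "records" end right after each "1"
while ( defined( my $skip = <$fh> ) ) {
    # we are now positioned just past a "1"; check for "234"
    my $got = read $fh, my $next3, 3;
    last unless $got;   # EOF
    if ( $next3 eq "234" ) {
        print "separator ends at offset ", tell $fh, "\n";
        last;
    }
    # not a separator: back up, so a "1" among these bytes still counts
    seek $fh, -$got, 1;   # 1 = SEEK_CUR
}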
Re: How do I search this binary file?
by sauoq (Abbot) on Aug 20, 2002 at 19:02 UTC
Do you have any idea how long your delimited chunk of data will be? If not, how about maximum and/or minimum lengths for it?
Once you find the chunk you are searching for, how will you differentiate it from any other delimited chunk included in this "arbitrary binary file?" I ask because if you have some criteria by which you can evaluate it as being the one you want, that might help you in the actual search.
Update: It occurred to me that since you seem to have a maximum size buffer in mind anyway, the best approach might be to simply read in a buffer of that size, search it with index for your delimiter, discard everything up to the delimiter, and use read to fill the buffer back up to the max size. That's more C-ish than Perlish I suppose but does it matter?
Update 2: bluto's point about the delimiter crossing the block boundary is very well-taken. That would have been a nasty bug in the method I described; a fix is sketched below.
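With that fix folded in, the whole thing might look like this (a sketch; $fh and the 4-byte $delim are placeholders, and the buffer size is arbitrary):
my $max  = 2**16;
my $dlen = length $delim;
my ( $buf, $base ) = ( '', 0 );
my $got = read $fh, $buf, $max;
while ( length $buf ) {
    my $at = index $buf, $delim;
    if ( $at > -1 ) {
        print "delimiter at offset ", $base + $at, "\n";
        $base += $at + $dlen;   # discard through the delimiter
        $buf   = substr $buf, $at + $dlen;
    } else {
        last unless $got;   # EOF and nothing left to find
        if ( length($buf) >= $dlen ) {
            # keep 3 bytes in case a delimiter straddles the boundary
            $base += length($buf) - ( $dlen - 1 );
            $buf   = substr $buf, -( $dlen - 1 );
        }
    }
    $got = read $fh, $buf, $max - length $buf, length $buf;
}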
-sauoq
"My two cents aren't worth a dime.";
re: Do you have any idea how long your delimited chunk of data will be? If not, how about maximum and/or minimum lengths for it?
The minimum length (of what's between the two 4-byte endcaps) is only a couple of bytes; I don't have my notes, but it's a very small number, maybe 5. The largest possible value is 128 bytes greater than that.
re: Once you find the chunk you are searching for, how will you differentiate it from any other delimited chunk included in this "arbitrary binary file?" I ask because if you have some criteria by which you can evaluate it as being the one you want, that might help you in the actual search.
OK, I'll go into more detail. Between the two endcaps there will be another sequence of 3 or 4 specific bytes. Before that mark I expect 0-128 bytes of legal UTF-8 text. After that, the data also has internal format consistencies. Furthermore, the byte following the ending 4-byte delimiter is a hash "checksum" of the data block.
I consider it a valid "hit" if the middle mark is present, there are no FE or FF bytes before that mark, and the checksum checks out. (After I accept it, I decide whether the other data values are any good.)
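For what it's worth, that acceptance test could be phrased along these lines (a hypothetical sketch: $mid_mark, checksum() and the argument layout all stand in for the real format details):
# $payload is everything between the two endcaps; $cksum_byte is the
# byte following the closing endcap; checksum() is hypothetical
sub looks_valid {
    my ( $payload, $cksum_byte ) = @_;
    my $at = index $payload, $mid_mark;
    return 0 if $at < 0;   # middle mark missing
    return 0 if substr( $payload, 0, $at ) =~ /[\xFE\xFF]/;   # never legal UTF-8
    return checksum($payload) == $cksum_byte;
}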
Thanks;
—John
I think I would read the file in one block at a time, using the block size returned by stat as suggested by Zaxo (Zaxo++), as long as it was larger than your max chunk size (I can't imagine it wouldn't be).
I'd use a regular expression to search for the whole chunk. If found, great; process it. If not, I'd keep the last 150 or so bytes (one byte less than your max "chunk" size would do it), use a four-argument read to append the next block to that leftover, and search again... etc. etc.
I don't know how well this approach would do next to some of the other suggestions. It has the advantage of looking for the whole chunk at once and using the regex engine to do it; presumably that will be pretty quick. It has the disadvantage that you'll be searching through some fraction of the file twice: if you search the whole file, the number of bytes you'd search through twice would be approximately the max chunk size times the number of blocks in the file. Given that, you might be able to improve it by increasing the size of the block you read; as long as you keep it a multiple of the preferred size, it shouldn't hurt anything.
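As a sketch (untested; the endcaps "ABCD" and "WXYZ", the block size and the 150-byte carry-over are all placeholders):
my $blk  = 2**16;   # ideally a multiple of (stat $fh)[11]
my $keep = 150;     # one byte less than the max chunk size
my $buf  = '';
while ( read $fh, $buf, $blk, length $buf ) {
    my $done = 0;   # end of the last complete match
    while ( $buf =~ /ABCD(.{0,150}?)WXYZ/sg ) {
        # ... process $1 here ...
        $done = pos $buf;
    }
    # carry over at most $keep trailing, unmatched bytes
    my $from = length($buf) - $keep;
    $from = $done if $done > $from;
    $buf  = substr $buf, $from if $from > 0;
}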
-sauoq
"My two cents aren't worth a dime.";