How long is your current attempt taking?
The following code processes a 100MB file (including finding and recording 25 million hits of a 10-bit pattern) in ~20 seconds, and a 1GB file (250 million hits) in ~3 minutes 20 seconds.
I make that around half an hour to process your 9GB, and probably much less, as your hits will be less frequent and you can advance the buffer pointer by 480 bytes after each hit.
It uses a basic sliding buffer to process the file in 1 MB chunks with an overlap of enough bytes to ensure continuity. (You'll need to verify the math of the byte/bit offset calculations).
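One way to check the overlap and offset math is to run the same sliding-buffer idea over a small in-memory string, where the answer is easy to verify by hand. The sketch below is an assumption-laden stand-in, not the script itself: `substr` on `$data` replaces `sysread`, the names `overlap_bytes` and `chunked_hits` are hypothetical, and the overlap is taken as ceil((bits - 1) / 8) bytes, because a straddling match can begin up to bits - 1 bits before the chunk boundary. For a 10-bit pattern that gives 2 bytes, where `int( ( length( $pattern ) + 1 ) / 8 )` gives 1, so that formula looks worth a second look. Because the carried bytes can then hold a whole match, the search in each later buffer starts far enough in to avoid counting it twice.

```perl
use strict;
use warnings;

## Bytes to carry over so that a match straddling the chunk boundary
## is seen whole in the next buffer: ceil( ( bits - 1 ) / 8 ).
sub overlap_bytes {
    my( $nbits ) = @_;
    return int( ( $nbits + 6 ) / 8 );
}

## Scan $data in $bsize-byte chunks; return [ byte, bit ] for each hit,
## both 0-based and relative to the start of $data.
sub chunked_hits {
    my( $data, $pattern, $bsize ) = @_;
    my $nbits   = length( $pattern );
    my $overlap = overlap_bytes( $nbits );
    my $buffer  = '';
    my $base    = 0;    ## offset in $data of the buffer's first byte
    my @hits;

    for( my $off = 0; $off < length( $data ); $off += $bsize ) {
        my $carried = length( $buffer );
        $buffer .= substr( $data, $off, $bsize );   ## stands in for sysread
        my $bits = unpack 'B*', $buffer;

        ## Matches lying wholly inside the carried bytes were already
        ## reported from the previous buffer, so start just past them.
        my $p = $carried * 8 - ( $nbits - 1 );
        $p = 0 if $p < 0;

        while( ( my $at = index( $bits, $pattern, $p ) ) >= 0 ) {
            push @hits, [ $base + int( $at / 8 ), $at % 8 ];
            $p = $at + 1;
        }

        ## Carry the tail forward and remember where it sits in $data.
        my $keep = $overlap < length( $buffer ) ? $overlap : length( $buffer );
        $base  += length( $buffer ) - $keep;
        $buffer = $keep ? substr( $buffer, -$keep ) : '';
    }
    return @hits;
}

## The pattern starts at bit 7 of byte 0 and ends in byte 2,
## so with 1-byte chunks it crosses two chunk boundaries.
my $data = pack 'B*', '000000011000000110000000';
printf "byte %d bit %d\n", @$_ for chunked_hits( $data, '1100000011', 1 );
```

Tracking `$base` directly, rather than deriving it from the buffer count, sidesteps the question of how many overlap bytes to subtract: every buffer after the first begins exactly one overlap before its newly read data.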
Code + some timings
    #! perl -slw
    use strict;

    $|=1;

    ## -s sets the package variable $main::BSIZE, so declare it with 'our';
    ## a 'my' variable would never see -BSIZE=n from the command line.
    our $BSIZE ||= 1_000_000;

    open my $fh, '+<:raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";

    my $pattern = $ARGV[ 1 ] or die "no pattern supplied";

    my $overlap = int( ( length( $pattern ) + 1 ) / 8 );

    my $buffer = '';
    my $buffs = 0;
    my $found = 0;

    while( sysread( $fh, $buffer, $BSIZE, length $buffer ) ) {

        ## Convert the buffer to asciiized bits;
        my $bits = unpack 'B*', $buffer;
        printf "\r$buffs: [$found] ";

        ## Search for the pattern
        my $p = 0;
        while( $p = 1 + index( $bits, $pattern, $p ) ) {
            ## And record the hits
            $found++;

            ## Calculate byte/bit offsets
    #        my $byte = ( $buffs * $BSIZE )
    #                 - ( $overlap * $buffs )
    #                 + int( ( $p - 1 ) / 8 );
    #        my $bit  = ( $p - 1 ) % 8;
    #        printf "\rFound it at byte: $byte bit: $bit '%s'",
    #            substr( $bits, $p-1, length( $pattern ) );;
        }

        ## Keep track of the number of buffers processed
        $buffs++;

        ## Move enough bytes to the front of the buffer
        ## to ensure overlap.
        $buffer = substr( $buffer, -$overlap );
    }

    print "Found $found occurances of '$pattern'";

    __END__
    [16:40:12.64] P:\test>429065 data\100millionbytes.dat 1111111111
    100: [0] Found 0 occurances of '1111111111'

    [16:40:46.04] P:\test>
    [16:41:37.46] P:\test>429065 data\100millionbytes.dat 1100000011
    100: [24999982] Found 24999990 occurances of '1100000011'

    [16:41:56.28] P:\test>
    [16:42:03.65] P:\test>429065 data\1000millionbytes.dat 1100000011
    1000: [249999876] Found 249999900 occurances of '1100000011'

    [16:45:24.09] P:\test>
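The suggestion of advancing by 480 bytes after each hit could look something like the following when applied to the bit string that `unpack 'B*'` produces. This is only a sketch under assumptions: `find_hits_with_skip` is a hypothetical name, the skip is expressed in bits (480 bytes would be `480 * 8`), and it works on a single string, so carrying a skip across a buffer boundary is left out.

```perl
use strict;
use warnings;

## Return the 0-based bit offsets of $pattern in $bits, resuming each
## search $skip_bits past the start of the previous hit.
sub find_hits_with_skip {
    my( $bits, $pattern, $skip_bits ) = @_;
    $skip_bits = 1 if $skip_bits < 1;    ## always make progress
    my @hits;
    my $p = 0;
    while( ( my $at = index( $bits, $pattern, $p ) ) >= 0 ) {
        push @hits, $at;
        $p = $at + $skip_bits;           ## jump ahead instead of stepping by 1
    }
    return @hits;
}

## Small demonstration: '1100' occurs at bits 0, 4 and 8, but a skip
## of 5 bits after each hit steps over the one at bit 4.
my @hits = find_hits_with_skip( '110011001100', '1100', 5 );
print "@hits\n";    ## prints "0 8"
```

With real data the call would be along the lines of `find_hits_with_skip( $bits, $pattern, 480 * 8 )`. The saving comes from `index` never having to scan the bits between a hit and the next possible one.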