How long is your current attempt taking?

The following code processes a 100MB file (including finding and recording 25 million hits of a 10-bit pattern) in ~20 seconds, and a 1GB file (250 million hits) in ~3 minutes 20 seconds.

I make that roughly half an hour to process your 9GB. It will probably be much less, as your hits will be less frequent and you can advance the buffer pointer by 480 bytes after each hit.
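A minimal sketch of that skip-ahead, working on a single in-memory bit string and ignoring buffer boundaries. The helper name `find_hits` and its arguments are made up for illustration, and the skip is measured from the start of each hit, which is an assumption about how your records are laid out:

```perl
use strict;
use warnings;

## Hypothetical helper: return the 0-based bit offset of each hit of $pattern
## in $bits, resuming the search $skip_bits after the start of every hit.
sub find_hits {
    my( $bits, $pattern, $skip_bits ) = @_;
    $skip_bits = 1 unless $skip_bits && $skip_bits > 0;    ## 1 = examine every position
    my @hits;
    my $p = 0;
    while( ( my $i = index( $bits, $pattern, $p ) ) >= 0 ) {
        push @hits, $i;
        $p = $i + $skip_bits;
    }
    return @hits;
}

## Four bytes of 11000000; a 2-byte record stands in for the 480-byte one.
my $bits = unpack 'B*', "\xC0" x 4;
print join( ' ', find_hits( $bits, '11', 1 ) ), "\n";        ## every hit
print join( ' ', find_hits( $bits, '11', 2 * 8 ) ), "\n";    ## skip 2 bytes after each hit
```

The first call reports hits at bits 0, 8, 16 and 24; the second reports only 0 and 16, because each hit moves the search 16 bits on. With real data you would pass 480 * 8 as the skip.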

It uses a basic sliding buffer to process the file in 1 MB chunks, carrying enough bytes over from one chunk to the next to ensure continuity across chunk boundaries. (You'll need to verify the math of the byte/bit offset calculations.)

Code + some timings

#! perl -slw 
use strict;
$|=1;

## Chunk size in bytes; -s lets you override it with -BSIZE=n on the command line
our $BSIZE ||= 1_000_000;

open my $fh, '+<:raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";

my $pattern = $ARGV[ 1 ] or die "no pattern supplied";
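## A hit can straddle a buffer boundary by up to length( $pattern ) - 1 bits,
## so carry forward enough whole bytes to hold that many bits.
my $overlap = int( ( length( $pattern ) + 6 ) / 8 );
my $buffer = '';
my $buffs = 0;
my $found = 0;
while( sysread( $fh, $buffer, $BSIZE, length $buffer ) ) {
    ## Convert the buffer to a string of ASCII '0' and '1' characters
    my $bits = unpack 'B*', $buffer;
    
    printf "\r%d: [%d] ", $buffs, $found;
    
    ## Search for the pattern. After the first buffer, skip hits that lie
    ## wholly inside the carried-over bytes; they were counted last time.
    my $p = $buffs ? $overlap * 8 - length( $pattern ) + 1 : 0;
    while( $p = 1 + index( $bits, $pattern, $p ) ) {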
        ## And record the hits
        $found++;

        ## Calculate byte/bit offsets. Every buffer after the first starts
        ## $overlap bytes before the data just read.
#        my $byte = ( $buffs * $BSIZE ) 
#                 - ( $buffs ? $overlap : 0 ) 
#                 + int( ( $p - 1 ) / 8 );
#        my $bit = ( $p - 1 ) % 8;
#        printf "\rFound it at byte: $byte bit: $bit '%s'", 
#            substr( $bits, $p-1, length( $pattern ) );
    }
    
    ## Keep track of the number of buffers processed
    $buffs++;
    
    ## Move enough bytes to the front of the buffer
    ## to ensure overlap. (substr with -0 would keep the whole buffer.)
    $buffer = $overlap ? substr( $buffer, -$overlap ) : '';
}
print "Found $found occurances of '$pattern'";
__END__
[16:40:12.64] P:\test>429065 data\100millionbytes.dat 1111111111
100: [0] Found 0 occurances of '1111111111'

[16:40:46.04] P:\test>

[16:41:37.46] P:\test>429065 data\100millionbytes.dat 1100000011
100: [24999982] Found 24999990 occurances of '1100000011'

[16:41:56.28] P:\test>

[16:42:03.65] P:\test>429065 data\1000millionbytes.dat 1100000011
1000: [249999876] Found 249999900 occurances of '1100000011'

[16:45:24.09] P:\test>
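
One way to verify the byte/bit offset math is to run the same sliding-buffer idea over a small in-memory string with tiny chunk sizes, and compare the result against a single search of the whole string. The sketch below does that. `chunked_hits` and `brute_hits` are names invented here, and it uses substr on a string in place of sysread on a file, so treat it as a check of the arithmetic rather than of the I/O:

```perl
use strict;
use warnings;

## Hypothetical helper: search $data in $bsize-byte chunks, carrying the tail
## of each chunk forward, and return a "byte:bit" string for every hit.
sub chunked_hits {
    my( $data, $pattern, $bsize ) = @_;
    my $n       = length $pattern;
    my $overlap = int( ( $n + 6 ) / 8 );    ## whole bytes covering $n - 1 bits
    my $buffer  = '';
    my @hits;
    for( my $pos = 0; $pos < length( $data ); $pos += $bsize ) {
        my $carried = length $buffer;
        $buffer .= substr( $data, $pos, $bsize );
        my $bits = unpack 'B*', $buffer;

        ## Skip hits wholly inside the carried bytes; the previous pass saw them.
        my $p = $carried * 8 - $n + 1;
        $p = 0 if $p < 0;
        while( ( my $i = index( $bits, $pattern, $p ) ) >= 0 ) {
            my $abs = ( $pos - $carried ) * 8 + $i;    ## absolute bit offset
            push @hits, int( $abs / 8 ) . ':' . ( $abs % 8 );
            $p = $i + 1;
        }
        my $keep = $overlap < length( $buffer ) ? $overlap : length( $buffer );
        $buffer = $keep ? substr( $buffer, -$keep ) : '';
    }
    return @hits;
}

## Reference answer: one search over the whole string at once.
sub brute_hits {
    my( $data, $pattern ) = @_;
    my $bits = unpack 'B*', $data;
    my( $p, @hits ) = ( 0 );
    while( ( my $i = index( $bits, $pattern, $p ) ) >= 0 ) {
        push @hits, int( $i / 8 ) . ':' . ( $i % 8 );
        $p = $i + 1;
    }
    return @hits;
}

## The "\x01\x81\x80" run holds a hit that starts at bit 7 of its first byte
## and spans three bytes, so it needs 9 bits of overlap to be found.
my $data    = "\x03\x03\x01\x81\x80\x03\x03";
my $pattern = '1100000011';
my $want    = join ' ', brute_hits( $data, $pattern );
for my $bsize ( 1 .. 7 ) {
    my $got = join ' ', chunked_hits( $data, $pattern, $bsize );
    print "bsize $bsize: ", ( $got eq $want ? 'ok' : "MISMATCH ($got)" ), "\n";
}
```

Every chunk size should print ok. If you change the overlap or the starting position and some sizes report a mismatch, the listed offsets show which boundary case is being missed or counted twice.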
Examine what is said, not who speaks.

Silence betokens consent.

Love the truth but pardon error.

In reply to Re^5: pack/unpack binary editing by BrowserUk
in thread pack/unpack binary editing by tperdue
