How long is your current attempt taking?

The following code processes a 100 MB file (including finding and recording 25 million hits of a 10-bit pattern) in ~20 seconds, and a 1 GB file (250 million hits) in ~3 minutes 20 seconds.

At roughly 5 MB/s, I make that about half an hour to process your 9 GB. It will probably take much less, as your hits will be less frequent and you can advance the buffer pointer by 480 bytes after each hit.

It uses a basic sliding buffer to process the file in 1 MB chunks, carrying enough bytes over from one chunk to the next that a hit straddling a chunk boundary is not missed. (You'll need to verify the math of the overlap and of the byte/bit offset calculations.)

Code + some timings

#! perl -slw
use strict;
$|=1;

## 'our' rather than 'my' so that -BSIZE=n on the command line (-s switch) takes effect
our $BSIZE ||= 1_000_000;

open my $fh, '+<:raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
my $pattern = $ARGV[ 1 ] or die "no pattern supplied";

## Number of bytes carried over between buffers (verify for your pattern length)
my $overlap = int( ( length( $pattern ) + 1 ) / 8 );

my $buffer = '';
my $buffs = 0;
my $found = 0;

while( sysread( $fh, $buffer, $BSIZE, length $buffer ) ) {
    ## Convert the buffer to asciiized bits;
    my $bits = unpack 'B*', $buffer;
    printf "\r$buffs: [$found] ";

    ## Search for the pattern
    my $p = 0;
    while( $p = 1 + index( $bits, $pattern, $p ) ) {
        ## And record the hits
        $found++;

        ## Calculate byte/bit offsets
        ## (every buffer after the first starts $overlap bytes before its chunk)
#        my $byte = ( $buffs * $BSIZE )
#                 - ( $buffs ? $overlap : 0 )
#                 + int( ( $p - 1 ) / 8 );
#        my $bit = ( $p - 1 ) % 8;
#        printf "\rFound it at byte: $byte bit: $bit '%s'",
#            substr( $bits, $p-1, length( $pattern ) );;
    }

    ## Keep track of the number of buffers processed
    $buffs++;

    ## Move enough bytes to the front of the buffer
    ## to ensure overlap.
    $buffer = substr( $buffer, -$overlap );
}

print "Found $found occurances of '$pattern'";

__END__
[16:40:12.64] P:\test>429065 data\100millionbytes.dat 1111111111
100: [0] Found 0 occurances of '1111111111'

[16:40:46.04] P:\test>
[16:41:37.46] P:\test>429065 data\100millionbytes.dat 1100000011
100: [24999982] Found 24999990 occurances of '1100000011'

[16:41:56.28] P:\test>
[16:42:03.65] P:\test>429065 data\1000millionbytes.dat 1100000011
1000: [249999876] Found 249999900 occurances of '1100000011'

[16:45:24.09] P:\test>
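For anyone checking the overlap and offset math, here is a minimal sketch of one way to do the bookkeeping. It is not the code that produced the timings above. The names find_bit_offsets, $base and $next are made up for illustration. It assumes an overlap of ceil((pattern length - 1) / 8) bytes, tracks the file offset of the buffer explicitly, and skips start positions already examined so that a hit lying wholly inside the carried-over bytes is not reported twice. It uses read rather than sysread only so that it can be tried on an in-memory filehandle.

```perl
use strict;
use warnings;

# Return the absolute bit offsets (0-based, most significant bit first)
# of every occurrence of $pattern in the data read from $fh.
sub find_bit_offsets {
    my( $fh, $pattern, $bsize ) = @_;
    my $plen = length $pattern;

    # A hit can begin at most $plen - 1 bits before a chunk boundary,
    # so carry over enough whole bytes to hold that many bits.
    my $overlap = int( ( $plen - 1 + 7 ) / 8 );

    my $buffer = '';
    my $base   = 0;   # file offset (in bytes) of the first byte of $buffer
    my $next   = 0;   # first absolute bit offset not yet tried as a start
    my @hits;

    while( read( $fh, $buffer, $bsize, length $buffer ) ) {
        my $bits = unpack 'B*', $buffer;

        # Resume where the previous pass stopped, relative to this buffer.
        my $p = $next - $base * 8;
        while( ( $p = index( $bits, $pattern, $p ) ) >= 0 ) {
            push @hits, $base * 8 + $p;
            $p++;
        }

        # Every start position that leaves room for a full pattern is done.
        my $end = $base * 8 + length( $bits ) - $plen + 1;
        $next = $end if $end > $next;

        # Keep only the tail; substr( $buffer, -0 ) would keep everything.
        my $keep = $overlap < length $buffer ? $overlap : length $buffer;
        $base  += length( $buffer ) - $keep;
        $buffer = $keep ? substr( $buffer, -$keep ) : '';
    }
    return @hits;
}

# Two bytes, 00000011 00000011: the 10-bit pattern straddles the byte
# boundary, and a 1-byte chunk size forces it across a buffer boundary too.
my $data = "\x03\x03";
open my $in, '<:raw', \$data or die "in-memory open failed: $!";
printf "byte %d bit %d\n", int( $_ / 8 ), $_ % 8
    for find_bit_offsets( $in, '1100000011', 1 );   # prints "byte 0 bit 6"
```

Overlapping hits are all reported here. To skip ahead after each hit (the 480 bytes mentioned above), advance $p by that many bits instead of by one.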

Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

In reply to Re^5: pack/unpack binary editing by BrowserUk
in thread pack/unpack binary editing by tperdue
