Re^4: What is a "big job" in the industry?

It's not at all clear to me that is true, but I'm open to pursuasion. Particularly in the form of a benchmark :)

To start with, there is no need to invoke the regex engine in order to search a string of ascii 0s & 1s for another string of ascii 0s & 1s. A straight string search (ala index will do fine). If that is indeed what the OPs code does which appears to be in question, but he isn't forthcoming with a definitive answer.

If you are to do this is Perl, I'm absolutely certain that a single call to the index opcode is going to win hands down over the required loop of shifts, ands and ors required to do this at the binary level.

Even if you dropped into C to perform the shifting and masking, and left the string search as a call to index, it's still not a certainty that the former would win out. Shifting a string through a byte register 1 bit at a time is a fairly complex task with lots of edge cases.

You might do it with hand coded assembler, but then a well-optimised string search can be pretty damn fast. For a start, you can do the bulk of the searching 4 bytes at a time by treating both strings as arrays of unsigned longs until you reach the residual 1, 2, or 3 bytes. And full register opcodes tend to be more efficient to boot.

Finally, if the OPs data is already in the ascii-ised binary form, then you'd have to factor in the conversion to real binary, which whilst not onnerous to do, still occupies time and favours the string search approach.

If you take up the benchmark challenge, these two scripts that were the basis of my "under an hour" challenge above might give you a basis for comparison:

Generate two test files that fit the OPs description as close as I could figure:

#! perl -slw
use strict;
use Math::Random::MT qw[ rand ];

open O, '>:raw', 'data/650MB.bin';
print O unpack 'B44', pack'NN', rand( 2**32 ), rand( 2**32 ) for 1 .. 
+15e6;
close O;

open O, '>:raw', 'data/600.bin';
print O unpack 'B7', pack 'C', rand(256)  for 1 .. 600;
close O;
[download]

And a test script that loads the big file as a single string, and then searches that for each line from the small file, logging the hits to separate files for each pattern. Note: The 600 lines/7-bits 127 patterns dichotomy.

#! perl -slw
use strict;

my $data;
open my $bigFH, '<:raw', 'data/650MB.bin' or die $!;
sysread( $bigFH, $data, -s( 'data/650MB.bin' ) );
close $bigFH;

open my $smallFH, '<', 'data/600.bin' or die $!;
while( <$smallFH> ) {
    print STDERR "\n$. ";
    chomp;
    my $name = "data/$_.posns";
    next if -e $name;
    open O, '>', $name or die $!;
    print STDERR "File: '$name' opened";
    my $p = 0;
    while( $p = 1+index( $data, $_, $p ) ) {
        print O $p;
    }
    close O;
}
close $smallFH;
[download]

On my system, the second script completed in about 45 minutes.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Comment on Re^4: What is a "big job" in the industry? Select or Download Code