Well the question is whether punch_card_don needs to do that. I'm not clear on that. It seems to me he's just comparing things on known positions, not searching for them, but I may easily be wrong. I have no idea whether pack()ing the data (at whatever stage of the processing or generation) can be used or whether it would help or just complicate things. I was just trying to suggest something that punch_card_don might have omited, hoping it will be of some use.
Besides ... even if he does need to search for the substrings it might still be beneficial to pack the data for storage. He might save up to 7/8 of the disk access. Which might help even if he has to unpack the data to process them.
| [reply] |
You're right. Sorry.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
I C, I suspect. Even the brute-force method of shifting a bit at a time into a register, masking, & comparing would surely beat a regexp engine. Worst case, that's a shift, bitwise or, bitwise and, & a branch if not 0.
I'm sure there are much better algorithms, too. | [reply] |
It's not at all clear to me that is true, but I'm open to pursuasion. Particularly in the form of a benchmark :)
To start with, there is no need to invoke the regex engine in order to search a string of ascii 0s & 1s for another string of ascii 0s & 1s. A straight string search (ala index will do fine). If that is indeed what the OPs code does which appears to be in question, but he isn't forthcoming with a definitive answer.
If you are to do this is Perl, I'm absolutely certain that a single call to the index opcode is going to win hands down over the required loop of shifts, ands and ors required to do this at the binary level.
Even if you dropped into C to perform the shifting and masking, and left the string search as a call to index, it's still not a certainty that the former would win out. Shifting a string through a byte register 1 bit at a time is a fairly complex task with lots of edge cases.
You might do it with hand coded assembler, but then a well-optimised string search can be pretty damn fast. For a start, you can do the bulk of the searching 4 bytes at a time by treating both strings as arrays of unsigned longs until you reach the residual 1, 2, or 3 bytes. And full register opcodes tend to be more efficient to boot.
Finally, if the OPs data is already in the ascii-ised binary form, then you'd have to factor in the conversion to real binary, which whilst not onnerous to do, still occupies time and favours the string search approach.
If you take up the benchmark challenge, these two scripts that were the basis of my "under an hour" challenge above might give you a basis for comparison:
Generate two test files that fit the OPs description as close as I could figure:
#! perl -slw
use strict;
use Math::Random::MT qw[ rand ];
open O, '>:raw', 'data/650MB.bin';
print O unpack 'B44', pack'NN', rand( 2**32 ), rand( 2**32 ) for 1 ..
+15e6;
close O;
open O, '>:raw', 'data/600.bin';
print O unpack 'B7', pack 'C', rand(256) for 1 .. 600;
close O;
And a test script that loads the big file as a single string, and then searches that for each line from the small file, logging the hits to separate files for each pattern. Note: The 600 lines/7-bits 127 patterns dichotomy.
#! perl -slw
use strict;
my $data;
open my $bigFH, '<:raw', 'data/650MB.bin' or die $!;
sysread( $bigFH, $data, -s( 'data/650MB.bin' ) );
close $bigFH;
open my $smallFH, '<', 'data/600.bin' or die $!;
while( <$smallFH> ) {
print STDERR "\n$. ";
chomp;
my $name = "data/$_.posns";
next if -e $name;
open O, '>', $name or die $!;
print STDERR "File: '$name' opened";
my $p = 0;
while( $p = 1+index( $data, $_, $p ) ) {
print O $p;
}
close O;
}
close $smallFH;
On my system, the second script completed in about 45 minutes.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |