Actually, in looking at the code a second time, the problem is with the
$final = pack("B*", $block); statement. It should read
$final = pack("B*", substr($block,0,BLOCKSZ)); Sorry about that. Please see the amended code above. (In the code I used a variable, $blocksz, in place of the BLOCKSZ used in this discussion.)
$block = $block01.$block02 creates a single variable, $block, of size BLOCKSZ * 2 (4096 in my code). The substitution works across the read boundary at 2048 between blocks 01 and 02 for this one instance. It will fail if the pattern crosses the upper boundary of $block02, since the pattern is incomplete at that point. Thus, after writing out $block01, you move $block02 into $block01 so that the next pattern substitution will catch any pattern that crosses that boundary. Actually, come to think of it, you should be assigning the upper BLOCKSZ of $block to $block01, i.e.
$block01 = substr($block,-BLOCKSZ).
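Here is a minimal sketch of the overlap scheme described above. It is not the amended code itself. To keep it short it works on plain strings rather than on the bit strings produced by unpack("B*"), and the helper name substitute_in_blocks is made up for illustration. It assumes the replacement has the same length as the match and does not itself match the pattern, since the carried-over half is scanned a second time.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a same-length substitution to a stream that is
# read in fixed-size blocks, catching matches that straddle a block boundary.
sub substitute_in_blocks {
    my ($in, $out, $blocksz, $pattern, $replacement) = @_;

    my $block01 = '';
    read($in, $block01, $blocksz);              # prime the lower half

    while (read($in, my $block02, $blocksz)) {
        my $block = $block01 . $block02;        # up to 2 * $blocksz long
        $block =~ s/$pattern/$replacement/g;

        # Write out the lower half only.
        print {$out} substr($block, 0, $blocksz);

        # Carry the upper half forward. For a full block this is the same
        # as substr($block, -$blocksz); this form also copes with a short
        # final block.
        $block01 = substr($block, $blocksz);
    }

    # Flush whatever is left (this also covers a file shorter than one block).
    $block01 =~ s/$pattern/$replacement/g;
    print {$out} $block01;
    return;
}
```

A pattern longer than one block can still be missed, so the block size has to be at least as large as the longest possible match.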
As for speed, you could increase the size of your blocks maybe to 32768 or 65536 or larger if you have the memory.
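To get a feel for what a larger block buys you, here is a rough calculation of how many reads a file needs at each block size. The ~3 GB file size is only an assumed figure for illustration, and reads_needed is a made-up helper name.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: number of read() calls needed to consume a file
# of $file_size bytes in blocks of $block_size bytes.
sub reads_needed {
    my ($file_size, $block_size) = @_;
    return ceil($file_size / $block_size);
}

my $file_size = 3 * 1024**3;    # ~3 GB, assumed for illustration
for my $block_size (4096, 32768, 65536) {
    printf "%6d-byte blocks: %d reads\n",
        $block_size, reads_needed($file_size, $block_size);
}
```

Fewer reads means fewer trips through the loop, though each trip then runs the substitution over a longer string, so the gain is worth measuring rather than assuming.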
You're using some pretty big sequences in the substitution regex; I wonder if that isn't your biggest bottleneck. Is it possible to break up your pattern into parts? You might pick up some speed there by using several smaller substitutions rather than one big one. I'm not a regex guru (sort of a novice, really), but it seems that there is the potential for a lot of backtracking in your regex, and that has got to take time. Maybe one of the more experienced monks can speak to that.
The rest of the algorithm should be fairly quick. I would recommend that you move the file open operation
open OUT, ">>tmp"; (and the related close op) out of your first loop. That will cut some of the overhead of opening and closing a file. pack and unpack are pretty efficient, so you probably can't squeeze any more out of those ops. I'm not sure if this matters any, but you don't have to undef $array each time in the first loop. There is a little overhead involved in reinitializing $array each time.
Setting
$array = '' will accomplish the same thing without forcing the loop to recreate $array each time through. Every little bit adds up, particularly when a loop repeats tens of thousands of times.
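The two suggestions above can be sketched together like this. The helper name write_chunks and its argument list are made up for illustration, and I have swapped the bareword OUT for a lexical filehandle with three-argument open, which is not what the original code uses.

```perl
use strict;
use warnings;

# Hypothetical helper: append every chunk to $path, opening the file once
# instead of once per loop iteration.
sub write_chunks {
    my ($path, @chunks) = @_;

    open my $out, '>>', $path or die "Cannot open $path: $!";

    my $array = '';                 # declared once, outside the loop
    for my $chunk (@chunks) {
        $array = $chunk;
        print {$out} $array;
        $array = '';                # reset instead of undef
    }

    close $out or die "Cannot close $path: $!";
    return;
}
```

Called as write_chunks('tmp', @blocks), this does one open and one close no matter how many blocks are written.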
I'll have to try benchmarking this sometime. Maybe after work ...
Update:
Running a simple benchmark on undef vs. assigning an empty string produced this (786500 is approximately the number of reads necessary to absorb a file of ~3 GB in 4K chunks). The second option runs about 17% faster in the first test. In the second comparison, which tests the open and close ops, the version with the open moved out of the loop ran over 900% faster, even on a short run of 3 CPU seconds.
use strict;
use warnings;
use diagnostics;
use Benchmark qw(cmpthese);

cmpthese(-60, {
    a => sub { for (0..786500) { my $array = '1'; undef $array; } },
    b => sub { for (0..786500) { my $array = '1'; $array = ''; } },
});

cmpthese(0, {
    a => sub {
        for (0..10) {
            my $array = '1';
            open OUT, ">>tmp";
            print OUT "$array";
            undef $array;
            close OUT;
        }
    },
    b => sub {
        open OUT, ">>tmp";
        for (0..10) {
            my $array = '1';
            print OUT "$array";
            undef $array;
        }
    },
});
__END__
Benchmark: running a, b, each for at least 60 CPU seconds...
         a: 62 wallclock secs (60.50 usr + 0.00 sys = 60.50 CPU) @ 1.69/s (n=102)
         b: 64 wallclock secs (62.31 usr + 0.00 sys = 62.31 CPU) @ 1.97/s (n=123)
    Rate    a    b
a 1.69/s   -- -15%
b 1.97/s  17%   --
Benchmark: running a, b, each for at least 3 CPU seconds...
         a: 11 wallclock secs ( 0.03 usr + 3.75 sys = 3.78 CPU) @ 2.12/s (n=8)
         b: 8 wallclock secs ( 0.00 usr + 3.14 sys = 3.14 CPU) @ 23.24/s (n=73)
    Rate    a    b
a 2.12/s   -- -91%
b 23.2/s 998%   --
PJ
use strict; use warnings; use diagnostics;