An excellent suggestion!!
Adding this method to the tests and benchmark ...
unpackM => sub {    # Multi-line unpack suggested by LanX
    # Read the first line to find the line length (newline included).
    seek $inFH, 0, 0;
    my $buffer    = <$inFH>;
    my $lineLen   = length $buffer;
    my $nLines    = 500;
    my $chunkSize = $lineLen * $nLines;
    seek $inFH, 0, 0;
    my $retStr;
    # Per line: skip to the column, take one character, skip the rest.
    my $fmt = qq{(x${offset}ax@{ [ $lineLen - $offset - 1 ] })*};
    while ( my $bytesRead = read $inFH, $buffer, $chunkSize )
    {
        $retStr .= join q{}, unpack $fmt, $buffer;
    }
    return \ $retStr;
},
... produced a new clear winner.
ok 1 - brutish
ok 2 - pushAoA
ok 3 - regex
ok 4 - rsubstr
ok 5 - seek
ok 6 - split
ok 7 - substr
ok 8 - unpack
ok 9 - unpackM
          Rate pushAoA brutish split  seek regex unpack substr rsubstr unpackM
pushAoA 1.14/s      --    -32%  -60%  -61%  -90%   -97%   -98%    -98%    -98%
brutish 1.68/s     47%      --  -41%  -43%  -86%   -95%   -96%    -97%    -97%
split   2.83/s    148%     69%    --   -3%  -76%   -92%   -94%    -94%    -95%
seek    2.93/s    157%     75%    4%    --  -76%   -92%   -94%    -94%    -95%
regex   12.0/s    952%    618%  325%  310%    --   -66%   -75%    -75%    -80%
unpack  35.1/s   2970%   1993% 1140% 1097%  192%     --   -27%    -28%    -43%
substr  47.8/s   4081%   2751% 1588% 1530%  297%    36%     --     -3%    -22%
rsubstr 49.1/s   4193%   2827% 1634% 1574%  308%    40%     3%      --    -20%
unpackM 61.1/s   5247%   3546% 2059% 1985%  408%    74%    28%     25%      --
1..9
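For anyone who wants to see what the repeated unpack group does without the file handling, here is a minimal sketch on an in-memory string. The sub name column_by_unpack and the sample data are made up for illustration, and it assumes every line has the same length, newline included.

```perl
use strict;
use warnings;

# Hypothetical helper: return the character at $offset from every
# fixed-length line held in $buffer, using one repeated unpack group.
sub column_by_unpack {
    my ( $buffer, $lineLen, $offset ) = @_;
    my $skipAfter = $lineLen - $offset - 1;

    # x$offset skips to the column, "a" takes one byte, x$skipAfter
    # skips the rest of the line (newline included). The trailing *
    # repeats the group until the buffer is used up.
    my $fmt = "(x${offset}ax${skipAfter})*";
    return join q{}, unpack $fmt, $buffer;
}

my $sample = "abcde\nfghij\nklmno\n";    # three 6-byte lines
print column_by_unpack( $sample, 6, 2 ), "\n";    # prints "chm"
```

The buffer has to contain whole lines only, which is why the routine above reads in multiples of the line length.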
However, your idea of reading and processing larger chunks of the file led me to wonder whether ANDing a larger buffer with a mask would bring any improvement. Initial attempts using a regex to pull out the non-NUL characters after ANDing were not encouraging, but using tr to delete the NULs instead was much better. This routine ...
ANDmask => sub {    # Multi-line AND mask by johngg
    # Read the first line to find the line length (newline included).
    seek $inFH, 0, 0;
    my $buffer    = <$inFH>;
    my $lineLen   = length $buffer;
    my $nLines    = 500;
    my $chunkSize = $lineLen * $nLines;
    seek $inFH, 0, 0;
    my $retStr;
    # One-line mask: \xff at the wanted column, \x00 everywhere else.
    my $mask
        = qq{\x00} x ${offset}
        . qq{\xff}
        . qq{\x00} x ( $lineLen - $offset - 1 );
    $mask x= $nLines;
    while ( my $bytesRead = read $inFH, $buffer, $chunkSize )
    {
        # AND zeroes every unwanted byte; tr then deletes the NULs.
        ( my $anded = $buffer & $mask ) =~ tr{\x00}{}d;
        $retStr .= $anded;
    }
    return \ $retStr;
},
... seems to produce the best result so far.
ok 1 - ANDmask
ok 2 - brutish
ok 3 - pushAoA
ok 4 - regex
ok 5 - rsubstr
ok 6 - seek
ok 7 - split
ok 8 - substr
ok 9 - unpack
ok 10 - unpackM
          Rate pushAoA brutish split  seek regex unpack substr rsubstr unpackM ANDmask
pushAoA 1.11/s      --    -35%  -61%  -62%  -91%   -97%   -98%    -98%    -98%    -99%
brutish 1.71/s     55%      --  -39%  -41%  -86%   -95%   -96%    -96%    -97%    -98%
split   2.82/s    155%     65%    --   -3%  -77%   -92%   -94%    -94%    -95%    -97%
seek    2.91/s    163%     70%    3%    --  -76%   -92%   -94%    -94%    -95%    -97%
regex   12.3/s   1010%    617%  336%  322%    --   -65%   -74%    -75%    -79%    -88%
unpack  35.0/s   3060%   1943% 1141% 1102%  185%     --   -25%    -27%    -40%    -67%
substr  46.9/s   4137%   2638% 1564% 1512%  282%    34%     --     -3%    -20%    -55%
rsubstr 48.2/s   4254%   2714% 1610% 1556%  292%    38%     3%      --    -18%    -54%
unpackM 58.7/s   5194%   3321% 1979% 1914%  377%    68%    25%     22%      --    -44%
ANDmask  105/s   9407%   6045% 3634% 3517%  757%   201%   124%    118%     80%      --
1..10
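To make the masking idea easier to poke at, here is a minimal sketch of the same technique on an in-memory string. The sub name column_by_mask and the sample data are invented for illustration. It assumes fixed-length lines, byte strings, and that the "bitwise" feature is not enabled, so that & on two strings is a string operation.

```perl
use strict;
use warnings;

# Hypothetical helper: return the character at $offset from every
# fixed-length line in $buffer by ANDing with a mask and deleting NULs.
sub column_by_mask {
    my ( $buffer, $lineLen, $offset ) = @_;

    # One-line mask: \xff keeps the wanted byte, \x00 zeroes the others.
    my $lineMask
        = ( "\x00" x $offset )
        . "\xff"
        . ( "\x00" x ( $lineLen - $offset - 1 ) );

    # Repeat the mask once per line held in the buffer.
    my $mask = $lineMask x int( length($buffer) / $lineLen );

    # String AND, then strip the zeroed bytes.
    ( my $anded = $buffer & $mask ) =~ tr/\x00//d;
    return $anded;
}

my $sample = "abcde\nfghij\nklmno\n";    # three 6-byte lines
print column_by_mask( $sample, 6, 2 ), "\n";    # prints "chm"
```

One caveat with this approach: a genuine NUL byte in the wanted column would be deleted along with the masked bytes, so it only suits data that cannot contain NULs there.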
I would be interested to know whether any Monk can spot flaws in the benchmark.