An excellent suggestion!!

Adding this method to the tests and benchmark ...

   unpackM => sub { # Multi-line unpack suggested by LanX
       seek $inFH, 0, 0;

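       # Read the first line to get the record length (newline included).
       my $buffer     = <$inFH>;
       my $lineLen    = length $buffer;
       my $nLines     = 500;
       my $chunkSize  = $lineLen * $nLines;    # read $nLines records per chunk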

       seek $inFH, 0, 0;

       my $retStr;
       # Per record: skip $offset bytes, take one character, skip the rest.
       my $fmt = qq{(x${offset}ax@{ [ $lineLen - $offset - 1 ] })*};
       while ( my $bytesRead = read $inFH, $buffer, $chunkSize )
       {
           $retStr .= join q{}, unpack $fmt, $buffer;
       }

       return \ $retStr;
   },

... produced a new clear winner.

ok 1 - brutish
ok 2 - pushAoA
ok 3 - regex
ok 4 - rsubstr
ok 5 - seek
ok 6 - split
ok 7 - substr
ok 8 - unpack
ok 9 - unpackM
          Rate pushAoA brutish  split   seek regex unpack substr rsubstr unpackM
pushAoA 1.14/s      --    -32%   -60%   -61%  -90%   -97%   -98%    -98%    -98%
brutish 1.68/s     47%      --   -41%   -43%  -86%   -95%   -96%    -97%    -97%
split   2.83/s    148%     69%     --    -3%  -76%   -92%   -94%    -94%    -95%
seek    2.93/s    157%     75%     4%     --  -76%   -92%   -94%    -94%    -95%
regex   12.0/s    952%    618%   325%   310%    --   -66%   -75%    -75%    -80%
unpack  35.1/s   2970%   1993%  1140%  1097%  192%     --   -27%    -28%    -43%
substr  47.8/s   4081%   2751%  1588%  1530%  297%    36%     --     -3%    -22%
rsubstr 49.1/s   4193%   2827%  1634%  1574%  308%    40%     3%      --    -20%
unpackM 61.1/s   5247%   3546%  2059%  1985%  408%    74%    28%     25%      --
1..9
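
For anyone who wants to see what the repeated-group template does without the benchmark harness, here is a minimal sketch on a small in-memory buffer. The sample data, the 6-byte record length and the offset of 2 are made up for illustration and are not the benchmark's input; only the shape of the template follows the routine above.

```perl
use strict;
use warnings;

# Hypothetical sample: three fixed-length records of 6 bytes each
# (5 characters plus the newline). Not the benchmark's data.
my $buffer  = "abcde\nfghij\nklmno\n";
my $lineLen = 6;
my $offset  = 2;    # zero-based column to extract

# Per record: skip $offset bytes, take one character, skip the rest.
# The trailing '*' repeats the group until the buffer is exhausted.
my $fmt = sprintf '(x%d a x%d)*', $offset, $lineLen - $offset - 1;

my $column = join q{}, unpack $fmt, $buffer;
print "$column\n";    # prints "chm"
```

One unpack call therefore returns one character per record, which is presumably where the saving over a per-line unpack comes from.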

However, your idea of reading and processing larger chunks of the file led me to wonder whether ANDing a mask with a larger buffer would be any faster. Initial attempts using a regex to pull out the non-NUL characters after ANDing were not encouraging, but using tr instead was much better. This routine ...

   ANDmask => sub { # Multi-line AND mask by johngg
       seek $inFH, 0, 0;

       my $buffer     = <$inFH>;
       my $lineLen    = length $buffer;
       my $nLines     = 500;
       my $chunkSize  = $lineLen * $nLines;

       seek $inFH, 0, 0;

       my $retStr;
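       # Per-record mask: 0xff at the wanted column, 0x00 everywhere else,
       # repeated to cover a whole chunk.
       my $mask
          = qq{\x00} x ${offset}
          . qq{\xff}
          . qq{\x00} x ( $lineLen - $offset - 1 );
       $mask x= $nLines;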
       while ( my $bytesRead = read $inFH, $buffer, $chunkSize )
       {
           # AND zeroes every unwanted byte; tr then deletes the NULs.
           ( my $anded = $buffer & $mask ) =~ tr{\x00}{}d;
           $retStr .= $anded;
       }

       return \ $retStr;
   },

... seems to produce the best result so far.

ok 1 - ANDmask
ok 2 - brutish
ok 3 - pushAoA
ok 4 - regex
ok 5 - rsubstr
ok 6 - seek
ok 7 - split
ok 8 - substr
ok 9 - unpack
ok 10 - unpackM
          Rate pushAoA brutish split  seek regex unpack substr rsubstr unpackM ANDmask
pushAoA 1.11/s      --    -35%  -61%  -62%  -91%   -97%   -98%    -98%    -98%    -99%
brutish 1.71/s     55%      --  -39%  -41%  -86%   -95%   -96%    -96%    -97%    -98%
split   2.82/s    155%     65%    --   -3%  -77%   -92%   -94%    -94%    -95%    -97%
seek    2.91/s    163%     70%    3%    --  -76%   -92%   -94%    -94%    -95%    -97%
regex   12.3/s   1010%    617%  336%  322%    --   -65%   -74%    -75%    -79%    -88%
unpack  35.0/s   3060%   1943% 1141% 1102%  185%     --   -25%    -27%    -40%    -67%
substr  46.9/s   4137%   2638% 1564% 1512%  282%    34%     --     -3%    -20%    -55%
rsubstr 48.2/s   4254%   2714% 1610% 1556%  292%    38%     3%      --    -18%    -54%
unpackM 58.7/s   5194%   3321% 1979% 1914%  377%    68%    25%     22%      --    -44%
ANDmask  105/s   9407%   6045% 3634% 3517%  757%   201%   124%    118%     80%      --
1..10
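
The AND-mask step can be tried in isolation with a small sketch like the one below. The buffer, record length and offset are invented for illustration (they are not the benchmark's input), and the second part uses a deliberately short buffer to stand in for a final partial chunk.

```perl
use strict;
use warnings;

# Hypothetical sample: three 6-byte records (5 characters plus newline).
my $buffer  = "abcde\nfghij\nklmno\n";
my $lineLen = 6;
my $offset  = 2;    # zero-based column to keep
my $nLines  = 3;

# One record's mask: 0xff at the wanted column, 0x00 elsewhere,
# then repeated once per record in the chunk.
my $mask = ( "\x00" x $offset ) . "\xff" . ( "\x00" x ( $lineLen - $offset - 1 ) );
$mask x= $nLines;

# String AND zeroes every unwanted byte; tr///d then removes the NULs.
( my $anded = $buffer & $mask ) =~ tr/\x00//d;
print "$anded\n";    # prints "chm"

# A short final chunk: string '&' yields a result as long as the
# shorter operand, so the full-size mask can be reused unchanged.
my $partial = "abcde\nfgh";
( my $tail = $partial & $mask ) =~ tr/\x00//d;
print "$tail\n";     # prints "ch"
```

Two things worth keeping in mind with this approach: a NUL byte that genuinely occurs in the selected column would be deleted by the tr along with the masked ones, and the mask only lines up if every line really has the same length as the first.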

I would be interested to know if any Monk can spot flaws in the benchmark.

Cheers,

JohnGG

In reply to Re^5: Faster and more efficient way to read a file vertically by johngg
in thread Faster and more efficient way to read a file vertically by Anonymous Monk
