Re^5: Faster and more efficient way to read a file vertically

An excellent suggestion!!

Adding this method to the tests and benchmark ...

   unpackM => sub { # Multi-line unpack suggested by LanX
       seek $inFH, 0, 0;

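       # Read one line to measure the fixed record length (terminator included)
       my $buffer     = <$inFH>;
       my $lineLen    = length $buffer;
       my $nLines     = 500;
       my $chunkSize  = $lineLen * $nLines;

       seek $inFH, 0, 0;

       my $retStr;
       # Per record: skip $offset bytes, take one byte, skip the remainder
       my $fmt = qq{(x${offset}ax@{ [ $lineLen - $offset - 1 ] })*};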
       while ( my $bytesRead = read $inFH, $buffer, $chunkSize )
       {
           $retStr .= join q{}, unpack $fmt, $buffer;
       }

       return \ $retStr;
   },

... produced a new clear winner.

ok 1 - brutish
ok 2 - pushAoA
ok 3 - regex
ok 4 - rsubstr
ok 5 - seek
ok 6 - split
ok 7 - substr
ok 8 - unpack
ok 9 - unpackM
          Rate pushAoA brutish   split    seek   regex  unpack  substr rsubstr unpackM
pushAoA 1.14/s      --    -32%    -60%    -61%    -90%    -97%    -98%    -98%    -98%
brutish 1.68/s     47%      --    -41%    -43%    -86%    -95%    -96%    -97%    -97%
split   2.83/s    148%     69%      --     -3%    -76%    -92%    -94%    -94%    -95%
seek    2.93/s    157%     75%      4%      --    -76%    -92%    -94%    -94%    -95%
regex   12.0/s    952%    618%    325%    310%      --    -66%    -75%    -75%    -80%
unpack  35.1/s   2970%   1993%   1140%   1097%    192%      --    -27%    -28%    -43%
substr  47.8/s   4081%   2751%   1588%   1530%    297%     36%      --     -3%    -22%
rsubstr 49.1/s   4193%   2827%   1634%   1574%    308%     40%      3%      --    -20%
unpackM 61.1/s   5247%   3546%   2059%   1985%    408%     74%     28%     25%      --
1..9
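
As a minimal sketch of what the unpack template in unpackM expands to, the snippet below applies the same construction to a tiny in-memory buffer. The sample data, line length and offset are made up for illustration; only the template construction is taken from the routine above.

```perl
use strict;
use warnings;

# Hypothetical sample data: three fixed-width 6-byte records (5 chars + "\n").
my $buffer  = "abcde\nfghij\nklmno\n";
my $lineLen = 6;
my $offset  = 2;    # zero-based column to extract

# Same construction as in unpackM: skip $offset bytes ("x"), take one byte
# ("a"), skip the rest of the record; "(...)*" repeats the group until the
# buffer is exhausted.
my $fmt = qq{(x${offset}ax@{ [ $lineLen - $offset - 1 ] })*};

my $column = join q{}, unpack $fmt, $buffer;

print "$fmt\n";       # (x2ax3)*
print "$column\n";    # chm
```

Note that the template assumes the buffer holds whole records; a trailing partial record would make the "x" skips run past the end of the string.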

However, your idea of reading and processing larger chunks of the file led me to consider whether ANDing a mask with a larger buffer would produce any improvement. Initial attempts using a regex to pull out the non-NUL characters after ANDing were not encouraging, but using tr instead was much better. This routine ...

   ANDmask => sub { # Multi-line AND mask by johngg
       seek $inFH, 0, 0;

       my $buffer     = <$inFH>;
       my $lineLen    = length $buffer;
       my $nLines     = 500;
       my $chunkSize  = $lineLen * $nLines;

       seek $inFH, 0, 0;

       my $retStr;
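       # Per-record mask: 0xff at the wanted column, 0x00 everywhere else,
       # repeated to cover a whole chunk
       my $mask
          = qq{\x00} x ${offset}
          . qq{\xff}
          . qq{\x00} x ( $lineLen - $offset - 1 );
       $mask x= $nLines;
       while ( my $bytesRead = read $inFH, $buffer, $chunkSize )
       {
           # AND zeroes the unwanted bytes; tr///d then deletes the NULs
           ( my $anded = $buffer & $mask ) =~ tr{\x00}{}d;
           $retStr .= $anded;
       }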
       }

       return \ $retStr;
   },

... seems to produce the best result so far.

ok 1 - ANDmask
ok 2 - brutish
ok 3 - pushAoA
ok 4 - regex
ok 5 - rsubstr
ok 6 - seek
ok 7 - split
ok 8 - substr
ok 9 - unpack
ok 10 - unpackM
          Rate pushAoA brutish   split    seek   regex  unpack  substr rsubstr unpackM ANDmask
pushAoA 1.11/s      --    -35%    -61%    -62%    -91%    -97%    -98%    -98%    -98%    -99%
brutish 1.71/s     55%      --    -39%    -41%    -86%    -95%    -96%    -96%    -97%    -98%
split   2.82/s    155%     65%      --     -3%    -77%    -92%    -94%    -94%    -95%    -97%
seek    2.91/s    163%     70%      3%      --    -76%    -92%    -94%    -94%    -95%    -97%
regex   12.3/s   1010%    617%    336%    322%      --    -65%    -74%    -75%    -79%    -88%
unpack  35.0/s   3060%   1943%   1141%   1102%    185%      --    -25%    -27%    -40%    -67%
substr  46.9/s   4137%   2638%   1564%   1512%    282%     34%      --     -3%    -20%    -55%
rsubstr 48.2/s   4254%   2714%   1610%   1556%    292%     38%      3%      --    -18%    -54%
unpackM 58.7/s   5194%   3321%   1979%   1914%    377%     68%     25%     22%      --    -44%
ANDmask  105/s   9407%   6045%   3634%   3517%    757%    201%    124%    118%     80%      --
1..10
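
To make the masking step concrete, here is a small sketch of the same AND-then-tr idea applied to an in-memory buffer. The sample data and sizes are invented for illustration; the mask construction follows the ANDmask routine above.

```perl
use strict;
use warnings;

# Hypothetical sample data: three fixed-width 6-byte records (5 chars + "\n").
my $buffer  = "abcde\nfghij\nklmno\n";
my $lineLen = 6;
my $offset  = 2;    # zero-based column to extract
my $nLines  = length( $buffer ) / $lineLen;

# One record's worth of mask: 0xff at the wanted column, 0x00 elsewhere,
# then repeated once per record in the buffer.
my $mask
    = qq{\x00} x $offset
    . qq{\xff}
    . qq{\x00} x ( $lineLen - $offset - 1 );
$mask x= $nLines;

# String-wise AND zeroes every unwanted byte; tr///d then deletes the NULs.
( my $anded = $buffer & $mask ) =~ tr{\x00}{}d;

print "$anded\n";    # chm
```

Two caveats worth keeping in mind with this approach: a genuine NUL byte in the wanted column would be deleted along with the padding, and under `use feature 'bitwise'` (which `use v5.28` or later enables) the string form of the operator is `&.` rather than `&`.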

I would be interested to know whether any Monk can spot flaws in the benchmark.

Cheers,

JohnGG


Re^6: Faster and more efficient way to read a file vertically by vr (Curate) on Nov 07, 2017 at 01:40 UTC
Not "flaws", but it's measuring performance per size of read buffer... (something LanX was saying needs tuning)

~$ perl vert4.pl
Benchmark: timing 3 iterations of 1, 10, 100, 1000, 10000, 100000, 1000000...
         1:  1 wallclock secs ( 0.94 usr +  0.01 sys =  0.95 CPU) @  3.16/s (n=3)
            (warning: too few iterations for a reliable count)
        10:  1 wallclock secs ( 0.23 usr +  0.02 sys =  0.25 CPU) @ 12.00/s (n=3)
            (warning: too few iterations for a reliable count)
       100:  0 wallclock secs ( 0.17 usr +  0.01 sys =  0.18 CPU) @ 16.67/s (n=3)
            (warning: too few iterations for a reliable count)
      1000:  0 wallclock secs ( 0.16 usr +  0.03 sys =  0.19 CPU) @ 15.79/s (n=3)
            (warning: too few iterations for a reliable count)
     10000:  0 wallclock secs ( 0.18 usr +  0.00 sys =  0.18 CPU) @ 16.67/s (n=3)
            (warning: too few iterations for a reliable count)
    100000:  0 wallclock secs ( 0.20 usr +  0.03 sys =  0.23 CPU) @ 13.04/s (n=3)
            (warning: too few iterations for a reliable count)
   1000000:  1 wallclock secs ( 0.23 usr +  0.03 sys =  0.26 CPU) @ 11.54/s (n=3)
            (warning: too few iterations for a reliable count)
Re^7: Faster and more efficient way to read a file vertically by johngg (Canon) on Nov 07, 2017 at 16:39 UTC
I guess that the tuning parameters will vary depending on the specification of the target system and the line length of the data file. On your system the best performance (without narrowing it down further) looks to be with a 10,000 line buffer. On my rather elderly, vintage 2008 IIRC, Core 2 Duo laptop the sweet spot is around 1,000 lines for both the unpack and mask methods. Working on a 2,500,000 line file with 51 byte (inc. line terminator) lines I get the following ...

ok 1 - ANDmask
ok 2 - unpackM
        Rate  u950 u1050  u900 u1100 u1000  A950 A1050 A1100  A900 A1000
u950  1.28/s    --   -0%   -0%   -1%   -1%  -39%  -39%  -39%  -39%  -41%
u1050 1.28/s    0%    --   -0%   -0%   -1%  -39%  -39%  -39%  -39%  -41%
u900  1.28/s    0%    0%    --   -0%   -1%  -39%  -39%  -39%  -39%  -41%
u1100 1.28/s    1%    0%    0%    --   -1%  -39%  -39%  -39%  -39%  -41%
u1000 1.29/s    1%    1%    1%    1%    --  -38%  -38%  -39%  -39%  -40%
A950  2.10/s   65%   64%   64%   64%   62%    --   -0%   -0%   -0%   -3%
A1050 2.10/s   65%   64%   64%   64%   63%    0%    --   -0%   -0%   -3%
A1100 2.10/s   65%   64%   64%   64%   63%    0%    0%    --   -0%   -3%
A900  2.11/s   65%   65%   65%   64%   63%    0%    0%    0%    --   -2%
A1000 2.16/s   69%   69%   69%   68%   67%    3%    3%    3%    2%    --
1..2

... with this code.

Read more... (5 kB)

Cheers,

JohnGG
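
The buffer-size tuning discussed above could be explored with something like the following sketch. It is not the elided benchmark code: the in-memory test data, the `andMask` helper and the candidate sizes are assumptions chosen for illustration, and the `A<nLines>` labels merely imitate the naming in the table.

```perl
use strict;
use warnings;
use Benchmark qw{ cmpthese };

# Hypothetical test data held in memory: 20,000 records of 51 bytes
# (50 characters plus "\n"), with a known letter in the wanted column.
my $lineLen = 51;
my $offset  = 9;
my @bases   = qw{ A C G T };
my $data    = join q{}, map {
    my $line = q{N} x ( $lineLen - 1 );
    substr $line, $offset, 1, $bases[ $_ % 4 ];
    "$line\n";
} 0 .. 19_999;

open my $inFH, q{<}, \ $data or die $!;

# andMask() is a name made up for this sketch; the body follows the ANDmask
# routine but takes the buffer size (in lines) as a parameter.
sub andMask {
    my $nLines = shift;
    seek $inFH, 0, 0;
    my $mask
        = qq{\x00} x $offset
        . qq{\xff}
        . qq{\x00} x ( $lineLen - $offset - 1 );
    $mask x= $nLines;
    my $retStr = q{};
    my $buffer;
    while ( read $inFH, $buffer, $lineLen * $nLines )
    {
        ( my $anded = $buffer & $mask ) =~ tr{\x00}{}d;
        $retStr .= $anded;
    }
    return \ $retStr;
}

# One candidate per buffer size, labelled A<nLines>.
my %candidates
    = map { my $n = $_; ( "A$n" => sub { andMask( $n ) } ) } 10, 100, 1000, 10000;

cmpthese( 50, \%candidates );
```

With data this small the timings are only indicative; a realistic run would use the real file and a CPU-time target such as `cmpthese( -10, ... )`.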
Re^8: Faster and more efficient way to read a file vertically by vr (Curate) on Nov 08, 2017 at 19:57 UTC
Hi, johngg, I rather tried to communicate that all contestants should be placed in equal conditions, i.e. let them all read in chunks, rather than in single lines, as it was for some of them.

But all the same your "ANDmask" is fastest. Which is weird. The simple act of extracting part of the data requires, to be fast, modification of unrelated data instead of just indexing. Sure, it's because of the speed of "anding" and transliteration, but still...

Therefore, here is something completely different (sorry I keep adjusting your setup to suit my "dna.txt"):

    use strict;
    use warnings;

    use Benchmark qw{ cmpthese timethese };
    use Test::More qw{ no_plan };
    use String::Random 'random_regex';

    my $fn = 'dna.txt';

    # Generate a million random 42-character lines on first run
    unless ( -e $fn ) {
        open my $fh, '>', $fn or die $!;
        print $fh random_regex( '[ACTG]{42}' ), "\n" for 1 .. 1e6;
    }

    open my $inFH, q{<}, $fn or die $!;
    binmode $inFH;

    my $buffer    = <$inFH>;
    my $lineLen   = length $buffer;
    my $nLines    = 500;
    my $chunkSize = $lineLen * $nLines;
    my $offset    = 9; # Column 10 if numbering from 1

    my %methods = (
        ANDmask => sub { # Multi-line AND mask by johngg
            seek $inFH, 0, 0;
            my $retStr;
            my $mask
                = qq{\x00} x ${offset}
                . qq{\xff}
                . qq{\x00} x ( $lineLen - $offset - 1 );
            $mask x= $nLines;
            while ( my $bytesRead = read $inFH, $buffer, $chunkSize )
            {
                ( my $anded = $buffer & $mask ) =~ tr{\x00}{}d;
                $retStr .= $anded;
            }
            return \ $retStr;
        },
        pdl => sub {
            seek $inFH, 0, 0;
            use PDL;
            my $retStr;
            # Read each chunk straight into the data buffer of a
            # $lineLen x $nLines byte piddle, then slice out one column
            my $chunkPDL = zeroes( byte, $lineLen, $nLines );
            my $bufRef   = $chunkPDL-> get_dataref;
            while ( my $bytesRead = read $inFH, $$bufRef, $chunkSize )
            {
                my $lastLine = $bytesRead / $lineLen - 1;
                $retStr .= ${ $chunkPDL-> slice( "$offset,0:$lastLine" ) -> get_dataref }
            }
            return \ $retStr;
        },
    );

    ok ${ $methods{ ANDmask }-> ()} eq ${ $methods{ pdl }-> ()};

    cmpthese( -10, { map { $_ => $methods{ $_ } } keys %methods });

>perl vert5.pl
ok 1
          Rate ANDmask     pdl
ANDmask 7.86/s      --    -55%
pdl     17.3/s    120%      --
1..1
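
One way to put an index-based contestant on the same footing, as suggested above, is to let it read in chunks too. The following is only a sketch under assumptions: `substrChunked` is a made-up name, the data is a tiny in-memory sample, and the original line-at-a-time substr routine is not shown in this thread, so this is not a rewrite of it.

```perl
use strict;
use warnings;

# Hypothetical sample data: four fixed-width 6-byte records in memory.
my $data    = "abcde\nfghij\nklmno\npqrst\n";
my $lineLen = 6;
my $offset  = 2;    # zero-based column to extract
my $nLines  = 3;    # deliberately not a divisor of the record count

open my $inFH, q{<}, \ $data or die $!;

# substrChunked() is a name made up for this sketch: read $nLines records
# at a time, then index into the buffer once per record.
sub substrChunked {
    seek $inFH, 0, 0;
    my $retStr = q{};
    my $buffer;
    while ( my $bytesRead = read $inFH, $buffer, $lineLen * $nLines )
    {
        for ( my $pos = $offset; $pos < $bytesRead; $pos += $lineLen )
        {
            $retStr .= substr $buffer, $pos, 1;
        }
    }
    return \ $retStr;
}

print ${ substrChunked() }, "\n";    # chmr
```

Bounding the inner loop by `$bytesRead` lets a short final chunk be handled without special-casing.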
Re^9: Faster and more efficient way to read a file vertically by johngg (Canon) on Nov 09, 2017 at 11:33 UTC
Re^10: Faster and more efficient way to read a file vertically by etj (Priest) on May 07, 2022 at 23:19 UTC