Re^2: Faster and more efficient way to read a file vertically

by LanX (Saint)
on Nov 05, 2017 at 21:48 UTC


in reply to Re: Faster and more efficient way to read a file vertically
in thread Faster and more efficient way to read a file vertically

> unpack  => sub { # Suggested but not implemented by pryrt

Actually unpack was suggested (and not implemented) by me first. ;)

FWIW: My idea was to unpack multiple lines simultaneously instead of going line by line.

If you are interested, and all lines really have the same length (the OP never clarified):

  • read a chunk of complete lines larger than 4 or 8kb (depending on the OS block size, to optimize read operations)
  • run a repeated unpack pattern over the chunk
  • get a list with one result for each line in the chunk

Then please check whether substr on single lines is still faster.

$line_length += $newline_length;                    # newline length is OS-dependent
$line_count   = int( 8 * 1024 / $line_length ) + 1;
$chunk_size   = $line_count * $line_length;

And yes, I'm still reluctant to implement it; it smells too much like an XY Problem. :)

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

update

In hindsight, a slightly smaller chunk is probably more efficient, since dropping the +1 keeps the chunk within a single 8kb block:

$line_count   = int(8 * 1024 / $line_length)
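
For the record, a minimal sketch of how these pieces could fit together, assuming fixed-length lines and a single wanted column (the file name and variable names below are illustrative, not taken from the thread):

use strict;
use warnings;

# Illustrative values; a real script would derive them from the input file.
my $offset         = 2;                       # wanted column (0-based)
my $newline_length = 1;                       # OS-dependent: 1 for "\n", 2 for "\r\n"
my $line_length    = 52 + $newline_length;    # fixed record length, newline included

my $line_count = int( 8 * 1024 / $line_length );      # stay within one 8kb block
my $chunk_size = $line_count * $line_length;
my $rest       = $line_length - $offset - 1;
my $fmt        = qq{(x${offset}ax${rest})*};          # '*' repeats the group per line

open my $fh, '<', 'vertical.txt' or die "open failed: $!";
my $column = '';
while ( read $fh, my $chunk, $chunk_size ) {
    # every chunk holds whole lines only, so one unpack yields one character per line
    $column .= join q{}, unpack $fmt, $chunk;
}
close $fh;
print $column, "\n";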

Re^3: Faster and more efficient way to read a file vertically
by johngg (Canon) on Nov 06, 2017 at 00:36 UTC
    Actually unpack was suggested (and not implemented) by me first. ;)

    Ah! Sorry, I missed that :-/

    Cheers,

    JohnGG

      > Ah! Sorry, I missed that :-/

      No problem at all... ;-)

      I just wanted to point you to the possibility that unpack can process many lines at the same time (which essentially means "reading vertically").

      use strict;
      use warnings;

      my $line_count = 10;
      my $line       = join( "", 'a' .. 'z', 'A' .. 'Z' ) . "\n";
      my $file       = "$line" x $line_count;
      my $offset     = 2;
      my $rest       = length($line) - $offset - 1;
      my $fmt        = qq{(x${offset}ax${rest})${line_count}};
      my @results    = unpack $fmt, $file;
      print @results;

      This will print cccccccccc because of the offset of 2.
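
      Spelling out the interpolated template may help; this continues from the snippet above, and the numbers follow directly from the 53-character lines it builds:

      # length($line) == 53 (52 letters plus "\n"), so $rest == 50 and
      # $fmt interpolates to "(x2ax50)10", which unpack reads as:
      #   x2   skip two characters             (columns 0 and 1)
      #   a    take one ASCII character        (column 2, the 'c')
      #   x50  skip the remaining 50 characters, newline included
      # repeated 10 times, once per line of $file.  Written out literally:
      my @same = unpack "(x2ax50)10", $file;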

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        An excellent suggestion!!

        Adding this method to the tests and benchmark ...

        unpackM => sub {    # Multi-line unpack suggested by LanX
            seek $inFH, 0, 0;
            my $buffer    = <$inFH>;
            my $lineLen   = length $buffer;
            my $nLines    = 500;
            my $chunkSize = $lineLen * $nLines;
            seek $inFH, 0, 0;
            my $retStr;
            my $fmt = qq{(x${offset}ax@{ [ $lineLen - $offset - 1 ] })*};
            while ( my $bytesRead = read $inFH, $buffer, $chunkSize )
            {
                $retStr .= join q{}, unpack $fmt, $buffer;
            }
            return \ $retStr;
        },

        ... produced a new clear winner.

        ok 1 - brutish
        ok 2 - pushAoA
        ok 3 - regex
        ok 4 - rsubstr
        ok 5 - seek
        ok 6 - split
        ok 7 - substr
        ok 8 - unpack
        ok 9 - unpackM
                   Rate pushAoA brutish split  seek regex unpack substr rsubstr unpackM
        pushAoA  1.14/s      --    -32%  -60%  -61%  -90%   -97%   -98%    -98%    -98%
        brutish  1.68/s     47%      --  -41%  -43%  -86%   -95%   -96%    -97%    -97%
        split    2.83/s    148%     69%    --   -3%  -76%   -92%   -94%    -94%    -95%
        seek     2.93/s    157%     75%    4%    --  -76%   -92%   -94%    -94%    -95%
        regex    12.0/s    952%    618%  325%  310%    --   -66%   -75%    -75%    -80%
        unpack   35.1/s   2970%   1993% 1140% 1097%  192%     --   -27%    -28%    -43%
        substr   47.8/s   4081%   2751% 1588% 1530%  297%    36%     --     -3%    -22%
        rsubstr  49.1/s   4193%   2827% 1634% 1574%  308%    40%     3%      --    -20%
        unpackM  61.1/s   5247%   3546% 2059% 1985%  408%    74%    28%     25%      --
        1..9

        However, your idea of reading and processing larger chunks of the file led me to consider whether ANDing a mask with a larger buffer would produce any improvement. Initial attempts using a regex to pull out non-NULL characters after ANDing were not encouraging, but using tr instead was much better. This routine ...

        ANDmask => sub {    # Multi-line AND mask by johngg
            seek $inFH, 0, 0;
            my $buffer    = <$inFH>;
            my $lineLen   = length $buffer;
            my $nLines    = 500;
            my $chunkSize = $lineLen * $nLines;
            seek $inFH, 0, 0;
            my $retStr;
            my $mask =
                qq{\x00} x ${offset}
              . qq{\xff}
              . qq{\x00} x ( $lineLen - $offset - 1 );
            $mask x= $nLines;
            while ( my $bytesRead = read $inFH, $buffer, $chunkSize )
            {
                ( my $anded = $buffer & $mask ) =~ tr{\x00}{}d;
                $retStr .= $anded;
            }
            return \ $retStr;
        },

        ... seems to produce the best result so far.

        ok 1 - ANDmask
        ok 2 - brutish
        ok 3 - pushAoA
        ok 4 - regex
        ok 5 - rsubstr
        ok 6 - seek
        ok 7 - split
        ok 8 - substr
        ok 9 - unpack
        ok 10 - unpackM
                   Rate pushAoA brutish split  seek regex unpack substr rsubstr unpackM ANDmask
        pushAoA  1.11/s      --    -35%  -61%  -62%  -91%   -97%   -98%    -98%    -98%    -99%
        brutish  1.71/s     55%      --  -39%  -41%  -86%   -95%   -96%    -96%    -97%    -98%
        split    2.82/s    155%     65%    --   -3%  -77%   -92%   -94%    -94%    -95%    -97%
        seek     2.91/s    163%     70%    3%    --  -76%   -92%   -94%    -94%    -95%    -97%
        regex    12.3/s   1010%    617%  336%  322%    --   -65%   -74%    -75%    -79%    -88%
        unpack   35.0/s   3060%   1943% 1141% 1102%  185%     --   -25%    -27%    -40%    -67%
        substr   46.9/s   4137%   2638% 1564% 1512%  282%    34%     --     -3%    -20%    -55%
        rsubstr  48.2/s   4254%   2714% 1610% 1556%  292%    38%     3%      --    -18%    -54%
        unpackM  58.7/s   5194%   3321% 1979% 1914%  377%    68%    25%     22%      --    -44%
        ANDmask   105/s   9407%   6045% 3634% 3517%  757%   201%   124%    118%     80%      --
        1..10
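
        For anyone who wants to try the mask trick in isolation, here is a minimal self-contained sketch of the same idea (toy data as in the earlier unpack example, not the benchmark code itself):

        use strict;
        use warnings;

        my $offset = 2;
        my $line   = join( "", 'a' .. 'z', 'A' .. 'Z' ) . "\n";
        my $buffer = $line x 10;                             # ten identical 53-byte lines

        # one line's worth of mask: NUL everywhere except the wanted column ...
        my $mask = "\x00" x $offset
                 . "\xff"
                 . "\x00" x ( length($line) - $offset - 1 );
        $mask x= 10;                                         # ... repeated for every line

        # the string AND keeps only the masked column, tr/// deletes the NUL bytes
        ( my $anded = $buffer & $mask ) =~ tr{\x00}{}d;
        print $anded, "\n";                                  # prints "cccccccccc"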

        I would be interested to know whether any Monk can spot flaws in the benchmark.

        Cheers,

        JohnGG
