PerlMonks
Re^2: Faster and more efficient way to read a file vertically

by vr (Curate)
on Nov 05, 2017 at 17:22 UTC ( [id://1202781] )


in reply to Re: Faster and more efficient way to read a file vertically
in thread Faster and more efficient way to read a file vertically

Interesting. I had a similar partial synthetic benchmark yesterday and thought to publish it, mainly to advise against my "seek" solution as too slow, but then decided not to :), because maybe it wasn't worth readers' effort.

Nevertheless, here are somewhat different results, for a 1-million-line file on fast NVMe SSD storage. Below is the case of returning a hash of character counts; results are similar for returning a string.

$ perl vert2.pl
ok 1 - same results
ok 2 - same results
ok 3 - same results
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
          Rate   seek    buk substr  slurp
seek   0.920/s     --   -61%   -84%   -88%
buk     2.36/s   157%     --   -58%   -69%
substr  5.66/s   515%   140%     --   -26%
slurp   7.69/s   736%   226%    36%     --
1..3

use strict;
use warnings;
use feature 'say';
use String::Random 'random_regex';
use Benchmark 'cmpthese';
use Test::More 'no_plan';

my $fn  = 'dna.txt';
my $POS = 10;

unless ( -e $fn ) {
    open my $fh, '>', $fn;
    print $fh random_regex( '[ACTG]{42}' ), "\n" for 1 .. 1e6;
}

is_deeply _seek(), _substr(), 'same results';
is_deeply slurp(), _substr(), 'same results';
is_deeply buk(),   _substr(), 'same results';

cmpthese( 3, {
    substr => \&_substr,
    seek   => \&_seek,
    buk    => \&buk,
    slurp  => \&slurp,
});

sub slurp {
    open my $fh, '<', $fn;
    my $s = do { local $/ = undef; <$fh> };
    my $count;
    $count->{ substr $s, $POS - 1 + 43 * $_, 1 }++
        for 0 .. length( $s ) / 43 - 1;
    return $count
}

sub buk {
    open my $fh, '<', $fn;
    my $buf = chr( 0 ) x 43;
    my $ref = \substr( $buf, $POS - 1, 1 );
    my $count;
    until ( eof $fh ) {
        substr( $buf, 0 ) = <$fh>;
        $count->{ $$ref }++
    }
    return $count
}

sub _seek {
    open my $fh, '<', $fn;
    my $L = length( <$fh> ) - 1;
    seek $fh, $POS - 1, 0;
    my $count;
    until ( eof $fh ) {
        $count->{ getc $fh }++;
        seek $fh, $L, 1
    }
    return $count
}

sub _substr {
    open my $fh, '<', $fn;
    my $count;
    $count->{ substr $_, $POS - 1, 1 }++ while <$fh>;
    return $count
}

$ perl -v

This is perl 5, version 26, subversion 0 (v5.26.0) built for x86_64-linux-thread-multi
(with 1 registered patch, see perl -V for more detail)

Replies are listed 'Best First'.
Re^3: Faster and more efficient way to read a file vertically
by marioroy (Prior) on Nov 05, 2017 at 21:35 UTC

    The following provides a parallel version of the slurp routine. I'm not sure why, or where to look, but running MCE via cmpthese reports inaccurate results, showing MCE as 300x faster, which is wrong. So I needed to benchmark another way.

    Regarding MCE, workers receive the next chunk and tally into a local hash, then update the shared hash.

    use strict;
    use warnings;
    use MCE;
    use MCE::Shared;
    use String::Random 'random_regex';
    use Time::HiRes 'time';

    my $fn  = 'dna.txt';
    my $POS = 10;

    my $shrcount = MCE::Shared->hash();
    my $mce;

    unless ( -e $fn ) {
        open my $fh, '>', $fn;
        print $fh random_regex( '[ACTG]{42}' ), "\n" for 1 .. 1e6;
    }

    sub slurp {
        open my $fh, '<', $fn;
        my $s = do { local $/ = undef; <$fh> };
        my $count;
        $count->{ substr $s, $POS - 1 + 43 * $_, 1 }++
            for 0 .. length( $s ) / 43 - 1;
        return $count
    }

    sub mce {
        unless ( defined $mce ) {
            $mce = MCE->new(
                max_workers => 4,
                chunk_size  => '300k',
                use_slurpio => 1,
                user_func   => sub {
                    my ( $mce, $slurp_ref, $chunk_id ) = @_;
                    my ( $count, @todo );

                    $count->{ substr ${ $slurp_ref }, $POS - 1 + 43 * $_, 1 }++
                        for 0 .. length( ${ $slurp_ref } ) / 43 - 1;

                    # Each key involves one IPC trip to the shared-manager.
                    #
                    # $shrcount->incrby( $_, $count->{$_} )
                    #     for ( keys %{ $count } );

                    # The following is faster for smaller chunk size.
                    # Basically, send multiple commands at once.

                    push @todo, [ "incrby", $_, $count->{$_} ]
                        for ( keys %{ $count } );

                    $shrcount->pipeline( @todo );
                }
            )->spawn();
        }

        $shrcount->clear();
        $mce->process($fn);

        return $shrcount->export();
    }

    for (qw/ slurp mce /) {
        no strict 'refs';
        my $start = time();
        my $func  = "main::$_";
        $func->() for 1 .. 3;
        printf "%5s: %0.03f secs.\n", $_, time() - $start;
    }

    __END__

    slurp: 0.487 secs.
      mce: 0.149 secs.
