PerlMonks
Re^2: Faster and more efficient way to read a file vertically

by vr (Curate)
on Nov 05, 2017 at 17:22 UTC ( [id://1202781] )


in reply to Re: Faster and more efficient way to read a file vertically
in thread Faster and more efficient way to read a file vertically

Interesting. I had a similar partial synthetic benchmark yesterday and thought to publish it, mainly to advise against my "seek" solution as too slow, but then decided not to :), because maybe it wasn't worth readers' effort.

Nevertheless, here are somewhat different results, for a 1-million-line file on fast NVMe SSD storage. Below is the case of returning a hash of character counts; results are similar for returning a string.

$ perl vert2.pl
ok 1 - same results
ok 2 - same results
ok 3 - same results
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
          Rate   seek    buk substr  slurp
seek   0.920/s     --   -61%   -84%   -88%
buk     2.36/s   157%     --   -58%   -69%
substr  5.66/s   515%   140%     --   -26%
slurp   7.69/s   736%   226%    36%     --
1..3

use strict;
use warnings;
use feature 'say';
use String::Random 'random_regex';
use Benchmark 'cmpthese';
use Test::More 'no_plan';

my $fn  = 'dna.txt';
my $POS = 10;

unless ( -e $fn ) {
    open my $fh, '>', $fn;
    print $fh random_regex( '[ACTG]{42}' ), "\n" for 1 .. 1e6;
}

is_deeply _seek(), _substr(), 'same results';
is_deeply slurp(), _substr(), 'same results';
is_deeply buk(),   _substr(), 'same results';

cmpthese( 3, {
    substr => \&_substr,
    seek   => \&_seek,
    buk    => \&buk,
    slurp  => \&slurp,
});

sub slurp {
    open my $fh, '<', $fn;
    my $s = do { local $/ = undef; <$fh> };
    my $count;
    $count->{ substr $s, $POS - 1 + 43 * $_, 1 }++
        for 0 .. length( $s ) / 43 - 1;
    return $count
}

sub buk {
    open my $fh, '<', $fn;
    my $buf = chr( 0 ) x 43;
    my $ref = \substr( $buf, $POS - 1, 1 );
    my $count;
    until ( eof $fh ) {
        substr( $buf, 0 ) = <$fh>;
        $count->{ $$ref }++
    }
    return $count
}

sub _seek {
    open my $fh, '<', $fn;
    my $L = length( <$fh> ) - 1;
    seek $fh, $POS - 1, 0;
    my $count;
    until ( eof $fh ) {
        $count->{ getc $fh }++;
        seek $fh, $L, 1
    }
    return $count
}

sub _substr {
    open my $fh, '<', $fn;
    my $count;
    $count->{ substr $_, $POS - 1, 1 }++ while <$fh>;
    return $count
}

$ perl -v

This is perl 5, version 26, subversion 0 (v5.26.0) built for x86_64-linux-thread-multi
(with 1 registered patch, see perl -V for more detail)

Replies are listed 'Best First'.
Re^3: Faster and more efficient way to read a file vertically
by marioroy (Prior) on Nov 05, 2017 at 21:35 UTC

    The following provides a parallel version of the slurp routine. I'm not sure why, or where to look, but running MCE via cmpthese reports inaccurate results, showing MCE as 300x faster, which is wrong. So I needed to benchmark another way.

    Regarding MCE, workers receive the next chunk and tally into a local hash, then update the shared hash.

    use strict;
    use warnings;
    use MCE;
    use MCE::Shared;
    use String::Random 'random_regex';
    use Time::HiRes 'time';

    my $fn  = 'dna.txt';
    my $POS = 10;

    my $shrcount = MCE::Shared->hash();
    my $mce;

    unless ( -e $fn ) {
        open my $fh, '>', $fn;
        print $fh random_regex( '[ACTG]{42}' ), "\n" for 1 .. 1e6;
    }

    sub slurp {
        open my $fh, '<', $fn;
        my $s = do { local $/ = undef; <$fh> };
        my $count;
        $count->{ substr $s, $POS - 1 + 43 * $_, 1 }++
            for 0 .. length( $s ) / 43 - 1;
        return $count
    }

    sub mce {
        unless ( defined $mce ) {
            $mce = MCE->new(
                max_workers => 4,
                chunk_size  => '300k',
                use_slurpio => 1,
                user_func   => sub {
                    my ( $mce, $slurp_ref, $chunk_id ) = @_;
                    my ( $count, @todo );

                    $count->{ substr ${ $slurp_ref }, $POS - 1 + 43 * $_, 1 }++
                        for 0 .. length( ${ $slurp_ref } ) / 43 - 1;

                    # Each key involves one IPC trip to the shared-manager.
                    #
                    # $shrcount->incrby( $_, $count->{$_} )
                    #     for ( keys %{ $count } );

                    # The following is faster for smaller chunk size.
                    # Basically, send multiple commands at once.

                    push @todo, [ "incrby", $_, $count->{$_} ]
                        for ( keys %{ $count } );

                    $shrcount->pipeline( @todo );
                }
            )->spawn();
        }

        $shrcount->clear();
        $mce->process($fn);

        return $shrcount->export();
    }

    for (qw/ slurp mce /) {
        no strict 'refs';
        my $start = time();
        my $func  = "main::$_";
        $func->() for 1 .. 3;
        printf "%5s: %0.03f secs.\n", $_, time() - $start;
    }

    __END__

    slurp: 0.487 secs.
      mce: 0.149 secs.
