comment on

does the time consumption of perl's seek is related with file size, or it's a constant time operation?

If the file fits in the OS cache and you have already read it on the preprocessing stage, seek is going to be a very cheap operation as all the information will be available from fast RAM.

When the file doesn't fit on the cache, randomly accessing it would be several orders of magnitude slower (per operation). In those cases, it is usually better to revert to an algorithm that uses a linear access pattern to the data even if in paper it is worse .

A classical example is removing duplicates from some data using a hash to detect duplicates (O(N)) or using sort to make duplicates contiguous and trivial to detect (O(NlogN)). When the data becomes too big to fit in the available RAM, the random data access pattern on the algorithm using the hash degrades its performance to the point of making it useless.

Update: Some code (Linux only, because of the non-portable disk cache cleaning stuff):

#!/usr/bin/perl;

use strict;
use warnings;
use autodie;
use Benchmark qw(timethis);
use List::Util qw(shuffle);

my $block_size = 8 * 1024;     #   8 KB
my $size = 128 * 1024 * 1024;  # 128 MB
my $blocks = $size / $block_size;

my $fn = '/tmp/seekme.pl';

warn "file size: $size, block size: $block_size, blocks: $blocks\n";

warn "writing file...\n";
my $kb = '1' x $block_size;
open my $fh, ">", $fn;
print $fh $kb for (1..$blocks);
close $fh;
warn "file written\n";

sub clean_cache {
    warn "cleaning disk cache\n";
    system "sync ; sudo echo 3 > /proc/sys/vm/drop_caches";
    warn "disk cache clean\n"
}


my @order = shuffle 0..($blocks -1);
my $buf;

clean_cache();

warn "timing preload and random seek\n";
timethis 1 => sub {
    open my $fh, '<', $fn;
    for (@order) {
        read $fh, $buf, $block_size;
    }

    for (@order) {
        seek($fh, $_ * $block_size, 0);
        read $fh, $buf, $block_size;
    }
};

warn "timing random seek with preloaded disk cache\n";
timethis 1 => sub {
    open my $fh, '<', $fn;
    for (@order) {
        seek($fh, $_ * $block_size, 0);
        read $fh, $buf, $block_size;
    }
};

clean_cache();

warn "timing random seek with empty disk cache\n";
timethis 1 => sub {
    open my $fh, '<', $fn;
    for (@order) {
        seek($fh, $_ * $block_size, 0);
        read $fh, $buf, $block_size;
    }
};
[download]

That on my computer...

file size: 134217728, block size: 8192, blocks: 16384
writing file...
file written
cleaning disk cache
disk cache clean
timing preload and random seek
timethis 1:  6 wallclock secs ( 1.34 usr +  0.36 sys =  1.70 CPU) @  0
+.59/s (n=1)
timing random seek with preloaded disk cache
timethis 1:  1 wallclock secs ( 0.88 usr +  0.06 sys =  0.94 CPU) @  1
+.06/s (n=1)
cleaning disk cache
disk cache clean
timing random seek with empty disk cache
timethis 1: 93 wallclock secs ( 1.29 usr +  1.15 sys =  2.44 CPU) @  0
+.41/s (n=1)
[download]

In reply to Re: Efficiency of seek by salva
in thread Efficiency of seek by llancet

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.