does the time consumption of perl's seek is related with file size, or it's a constant time operation?If the file fits in the OS cache and you have already read it on the preprocessing stage, seek is going to be a very cheap operation as all the information will be available from fast RAM.
When the file doesn't fit on the cache, randomly accessing it would be several orders of magnitude slower (per operation). In those cases, it is usually better to revert to an algorithm that uses a linear access pattern to the data even if in paper it is worse .
A classical example is removing duplicates from some data using a hash to detect duplicates (O(N)) or using sort to make duplicates contiguous and trivial to detect (O(NlogN)). When the data becomes too big to fit in the available RAM, the random data access pattern on the algorithm using the hash degrades its performance to the point of making it useless.
Update: Some code (Linux only, because of the non-portable disk cache cleaning stuff):
That on my computer...#!/usr/bin/perl; use strict; use warnings; use autodie; use Benchmark qw(timethis); use List::Util qw(shuffle); my $block_size = 8 * 1024; # 8 KB my $size = 128 * 1024 * 1024; # 128 MB my $blocks = $size / $block_size; my $fn = '/tmp/seekme.pl'; warn "file size: $size, block size: $block_size, blocks: $blocks\n"; warn "writing file...\n"; my $kb = '1' x $block_size; open my $fh, ">", $fn; print $fh $kb for (1..$blocks); close $fh; warn "file written\n"; sub clean_cache { warn "cleaning disk cache\n"; system "sync ; sudo echo 3 > /proc/sys/vm/drop_caches"; warn "disk cache clean\n" } my @order = shuffle 0..($blocks -1); my $buf; clean_cache(); warn "timing preload and random seek\n"; timethis 1 => sub { open my $fh, '<', $fn; for (@order) { read $fh, $buf, $block_size; } for (@order) { seek($fh, $_ * $block_size, 0); read $fh, $buf, $block_size; } }; warn "timing random seek with preloaded disk cache\n"; timethis 1 => sub { open my $fh, '<', $fn; for (@order) { seek($fh, $_ * $block_size, 0); read $fh, $buf, $block_size; } }; clean_cache(); warn "timing random seek with empty disk cache\n"; timethis 1 => sub { open my $fh, '<', $fn; for (@order) { seek($fh, $_ * $block_size, 0); read $fh, $buf, $block_size; } };
file size: 134217728, block size: 8192, blocks: 16384 writing file... file written cleaning disk cache disk cache clean timing preload and random seek timethis 1: 6 wallclock secs ( 1.34 usr + 0.36 sys = 1.70 CPU) @ 0 +.59/s (n=1) timing random seek with preloaded disk cache timethis 1: 1 wallclock secs ( 0.88 usr + 0.06 sys = 0.94 CPU) @ 1 +.06/s (n=1) cleaning disk cache disk cache clean timing random seek with empty disk cache timethis 1: 93 wallclock secs ( 1.29 usr + 1.15 sys = 2.44 CPU) @ 0 +.41/s (n=1)
In reply to Re: Efficiency of seek
by salva
in thread Efficiency of seek
by llancet
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |