llancet has asked for the wisdom of the Perl Monks concerning the following question:

I have to read records of a text file in random order, so I preprocessed the file and created a hash that maps each record's key to its position. When I want a record, I can seek directly to it.
However, I know very little about the seek operation. As my file is fairly large, does the time taken by Perl's seek depend on the file size, or is it a constant-time operation? Is it a costly operation?
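
Here is roughly what my preprocessing looks like (simplified; the file name and the tab-separated key field are placeholders):

    use strict;
    use warnings;

    # first pass: build the index -- record key => byte offset of its line
    open my $fh, '<', 'records.txt' or die "open: $!";
    my %index;
    while (1) {
        my $pos  = tell $fh;            # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        my ($key) = split /\t/, $line;  # assume the key is the first tab-separated field
        $index{$key} = $pos;
    }

    # later: fetch a single record directly
    sub fetch {
        my ($key) = @_;
        seek $fh, $index{$key}, 0 or die "seek failed: $!";
        return scalar <$fh>;
    }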

Replies are listed 'Best First'.
Re: Efficiency of seek
by roboticus (Chancellor) on Aug 31, 2011 at 06:31 UTC

    llancet:

    The easy question first: bad news -- it's a costly operation. Generally, if you're forced to make the heads move, you're talking 5+ milliseconds (and 5 ms assumes a high-end drive).

    The good news: Modern drives typically read an entire cylinder at a time, so if your seeks are mostly local, you can often bypass the head motion. (The more platters in your drive, the more information in the cylinder.) So when you preprocess the file, try to arrange things to keep the seek distances small, and you'll have fewer head movements.
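
    For instance, if you know up front which records a pass will need, sorting the lookups by file offset turns scattered seeks into a single forward sweep. A minimal sketch, assuming a %index hash of key => offset like the one described in the question (the keys are hypothetical):

        # visit the wanted records in offset order to minimise head travel
        my @wanted = qw(key42 key7 key99);
        for my $key (sort { $index{$a} <=> $index{$b} } @wanted) {
            seek $fh, $index{$key}, 0 or die "seek failed: $!";
            my $record = <$fh>;
            # ... process $record ...
        }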

    When a head motion occurs, it has a trapezoidal motion profile: at the start of the movement, the heads accelerate to "top speed", which takes a constant time. At the end of the movement, they decelerate back to zero speed, which takes another constant time. After that, it takes a little while to stabilize the motion and sync up with the servo track. Between the start and end of the movement, the heads travel at a fixed velocity, so the more tracks you cross, the more time it takes. (However, I doubt it matters much, as I believe (no evidence) that the total is dominated by the speed-up, slow-down and synchronize phases.) (Note: I glossed over a good few details, such as short head movements, where the heads never reach full speed before they start slowing down.)
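
    To put rough numbers on that intuition, here is a toy model of the trapezoidal profile. Every constant below is invented purely for illustration; the point is just that the fixed speed-up/slow-down/settle costs swamp the distance term for short moves:

        # toy seek-time model: fixed accelerate + decelerate + settle costs,
        # plus a distance-dependent cruise term (all constants invented)
        my ($accel_ms, $settle_ms, $tracks_per_ms) = (1.0, 1.5, 500);

        sub seek_time_ms {
            my ($tracks) = @_;
            return 2 * $accel_ms + $settle_ms + $tracks / $tracks_per_ms;
        }

        printf "%6d tracks: ~%.1f ms\n", $_, seek_time_ms($_)
            for 10, 1_000, 100_000;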

    Note: I'm not a hardware manufacturer, though I play one on TV.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      It's a costly operation. Generally, if you're forced to make the heads move, you're talking 5+ milliseconds (and 5 ms assumes a high-end drive)
      Technically, you're only paying the price when reading (or writing) after the seek, and then only if the requested page isn't already in the buffer. The seek itself doesn't move the heads. And you may have to pay that price anyway, even without seeking: the heads will stay at the same spot only if your process is the only one accessing the disk.

        JavaFan:

        Yep, very true. That's among the stuff I was glossing over (I also didn't want to talk about caching on the drive, in the OS, and whatever other optimizations exist.)

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: Efficiency of seek
by salva (Canon) on Aug 31, 2011 at 07:52 UTC
    does the time taken by Perl's seek depend on the file size, or is it a constant-time operation?
    If the file fits in the OS cache and you have already read it on the preprocessing stage, seek is going to be a very cheap operation as all the information will be available from fast RAM.

    When the file doesn't fit in the cache, randomly accessing it will be several orders of magnitude slower (per operation). In those cases, it is usually better to switch to an algorithm that accesses the data with a linear pattern, even if on paper it is worse.

    A classic example is removing duplicates from some data: you can use a hash to detect duplicates (O(N)), or sort the data so that duplicates become contiguous and trivial to detect (O(N log N)). Once the data grows too big to fit in the available RAM, the random access pattern of the hash-based algorithm degrades its performance to the point of uselessness.
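
    A sketch of the two approaches, assuming $in and $out are already-open read and write handles:

        # O(N) dedup: fast while %seen fits in RAM, pathological once it doesn't
        my %seen;
        while (my $line = <$in>) {
            print $out $line unless $seen{$line}++;
        }

        # O(N log N) dedup: an external sort streams through the data
        # sequentially, so it keeps working when the data far exceeds RAM
        # (file names are placeholders)
        system "sort big_file | uniq > deduped_file";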

    Update: Some code (Linux only, because of the non-portable disk cache cleaning stuff):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use autodie;
    use Benchmark qw(timethis);
    use List::Util qw(shuffle);

    my $block_size = 8 * 1024;            # 8 KB
    my $size       = 128 * 1024 * 1024;   # 128 MB
    my $blocks     = $size / $block_size;
    my $fn         = '/tmp/seekme.pl';

    warn "file size: $size, block size: $block_size, blocks: $blocks\n";
    warn "writing file...\n";
    my $block_data = '1' x $block_size;   # one block's worth of data
    open my $fh, ">", $fn;
    print $fh $block_data for 1 .. $blocks;
    close $fh;
    warn "file written\n";

    sub clean_cache {
        warn "cleaning disk cache\n";
        # the echo must run inside the sudo'ed shell; a plain
        # "sudo echo 3 > /proc/..." would redirect as the unprivileged user
        system "sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'";
        warn "disk cache clean\n";
    }

    my @order = shuffle 0 .. ($blocks - 1);
    my $buf;

    clean_cache();
    warn "timing preload and random seek\n";
    timethis 1 => sub {
        open my $fh, '<', $fn;
        read $fh, $buf, $block_size for @order;   # sequential pass fills the cache
        for (@order) {
            seek($fh, $_ * $block_size, 0);
            read $fh, $buf, $block_size;
        }
    };

    warn "timing random seek with preloaded disk cache\n";
    timethis 1 => sub {
        open my $fh, '<', $fn;
        for (@order) {
            seek($fh, $_ * $block_size, 0);
            read $fh, $buf, $block_size;
        }
    };

    clean_cache();
    warn "timing random seek with empty disk cache\n";
    timethis 1 => sub {
        open my $fh, '<', $fn;
        for (@order) {
            seek($fh, $_ * $block_size, 0);
            read $fh, $buf, $block_size;
        }
    };
    That, on my computer, produces...
    file size: 134217728, block size: 8192, blocks: 16384
    writing file...
    file written
    cleaning disk cache
    disk cache clean
    timing preload and random seek
    timethis 1:  6 wallclock secs ( 1.34 usr + 0.36 sys =  1.70 CPU) @  0.59/s (n=1)
    timing random seek with preloaded disk cache
    timethis 1:  1 wallclock secs ( 0.88 usr + 0.06 sys =  0.94 CPU) @  1.06/s (n=1)
    cleaning disk cache
    disk cache clean
    timing random seek with empty disk cache
    timethis 1: 93 wallclock secs ( 1.29 usr + 1.15 sys =  2.44 CPU) @  0.41/s (n=1)
Re: Efficiency of seek
by BrowserUk (Patriarch) on Aug 31, 2011 at 13:02 UTC

    I think you've been given a lot of irrelevant information that completely misses the answer you are seeking [sic].

    does the time taken by Perl's seek depend on the file size, or is it a constant-time operation?

    For the purposes of random access, seeking is an amortised, constant-time operation, regardless of the size of the file.

    That is, all else -- hardware, system buffers, OS, system load, file fragmentation etc. -- being equal, reading the first and then the last record will take the same time as reading the last and then the first, regardless of whether the file is a few tens of megabytes or a few tens of gigabytes.

    Is it a costly operation?

    It depends upon the number and distribution of the records you must access. If you read the file's entire contents in random order, it will be significantly slower than reading the same content sequentially.

    If, however, for any given run or time period you only need a few records from the entire file, seeking to them and reading just the ones you need is much quicker than reading through the entire file to get to them.

    Finally, there are various things that can be done to optimise your seek/read times. Foremost among these is to ensure that your file is contiguous on disk. Second, if at all possible, ensure that records do not span disk blocks.
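
    One way to get the second property, assuming fixed-size slots are acceptable (a sketch, with @records and $out as placeholders):

        # pad each record to a full block so none straddles a block boundary;
        # record i then starts exactly at offset i * $block
        my $block = 4096;                 # assumed block size
        for my $rec (@records) {
            die "record too long\n" if length($rec) > $block;
            print $out $rec, "\0" x ($block - length $rec);
        }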

    It used to be the case -- where the algorithm allowed -- that deferring reads and re-ordering them could reduce head movement and improve throughput, but most disk controllers do this as a matter of course now.

    The biggest problem with trying to optimise random access is accounting for the effects of other processes accessing the same file-system and/or hardware.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      that deferring reads and re-ordering them could reduce head movement and improve throughput, but most disk controllers do this as a matter of course now
      That is impossible unless the application uses some kind of asynchronous IO in order to queue several read operations and let the OS/hardware handle them in some order that minimizes head movement delays.

      But for common applications using synchronous IO, the app will ask for some data, wait for the OS to fulfill the request, ask for some other data, etc., so there is no way for the OS to reorder the operations.

      Another matter is considering a multithreaded/multiprocess environment where the OS/hardware can reorder the IO operations queued at any point in time by all the running threads.
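
      For illustration, queueing reads asynchronously could look roughly like this with the IO::AIO module ($fh, @offsets and $block_size are placeholders, and using IO::AIO at all is an assumption):

          use IO::AIO;   # assumed available

          # queue many reads at once so the OS/hardware has something to reorder
          my @bufs;
          for my $i (0 .. $#offsets) {
              $bufs[$i] = '';
              aio_read $fh, $offsets[$i], $block_size, $bufs[$i], 0, sub {
                  # ... process $bufs[$i] here ...
              };
          }
          IO::AIO::flush;   # block until every queued request has completed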

        That is impossible ...

        Look up the SCAN and C-SCAN algorithms.

        Start here...
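
        For a flavour of what SCAN does, a toy illustration (track numbers invented):

            # SCAN ("elevator") ordering: from the current head position, service
            # all requests in one direction, then sweep back for the rest
            my @queue = (900, 150, 520, 700, 80);    # pending track numbers
            my $head  = 300;                         # current head position
            my @up    = sort { $a <=> $b } grep { $_ >= $head } @queue;
            my @down  = sort { $b <=> $a } grep { $_ <  $head } @queue;
            print join(' -> ', @up, @down), "\n";    # 520 -> 700 -> 900 -> 150 -> 80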


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Efficiency of seek
by JavaFan (Canon) on Aug 31, 2011 at 08:22 UTC
    As my file is fairly large, does the time taken by Perl's seek depend on the file size, or is it a constant-time operation? Is it a costly operation?
    That depends. How is your file stored? Is it on a disk, or on a tape? If the former, you're likely not to notice: seek time will not depend strongly on file size or seek distance, although the exact details depend on how the file is stored (is the page being seeked to already in the buffer cache, is striping/mirroring in use, does the physical device spin or not, etc.). On a tape, however, seek time will be roughly linear in the seek distance.

    Note that the bulk of the work will be done by the OS, not by Perl.

Re: Efficiency of seek
by thargas (Deacon) on Aug 31, 2011 at 13:59 UTC

    To quote DJB: "Profile. Don't speculate."

    In other words: measure how long it takes instead of guessing. The Benchmark module will help here. No matter how much information you give us, we won't be running your program on your machine, so YMMV.
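
    For example, Benchmark's cmpthese runs each alternative and prints a comparison table (the two subs are placeholders for your own access strategies):

        use Benchmark qw(cmpthese);

        # negative count = run each sub for ~3 CPU seconds
        cmpthese(-3, {
            seek_fetch => sub { fetch_by_seek('some_key') },   # placeholder
            full_scan  => sub { fetch_by_scan('some_key') },   # placeholder
        });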

Re: Efficiency of seek
by flexvault (Monsignor) on Aug 31, 2011 at 12:30 UTC

    llancet

    You have received good answers, but I suspect you need to clarify/update your question to get the answer you may still need. For instance:

  • does your file constantly change or is it static?
  • will it increase to a size that is greater than the memory in your computer?
  • does it need to run during prime time or can it run off-shift?
  • can you use a data base to do the seeking for you?

    None of these has anything to do with Perl directly, but for Perl to work well for you, questions like these need to be answered by you.

    But if you have the answer you want, then have a great day.

    Good Luck

    "Well done is better than well said." - Benjamin Franklin

      Thanks for all the answers to my question! I've finished my work.

      In my actual application, I have two files: one (file A) is the original data file, and the other (file B) is a kind of analysis result derived from it. For some reason, the record order in file B is not the same as in file A.

      Now I need to do some analysis based on each record in files A and B. So I decided to create an index for file B, read file A sequentially, and seek to the current record's content in file B.
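
      In outline, it looks roughly like this (simplified; %index_b is the key => offset index for file B built in a first pass, and the key extraction depends on the record format):

          open my $fa, '<', 'fileA' or die "open A: $!";
          open my $fb, '<', 'fileB' or die "open B: $!";
          while (my $a_rec = <$fa>) {            # file A is read sequentially
              my ($key) = split /\t/, $a_rec;    # format-dependent key extraction
              seek $fb, $index_b{$key}, 0 or die "seek failed: $!";
              my $b_rec = <$fb>;
              # ... combined analysis of $a_rec and $b_rec ...
          }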

Re: Efficiency of seek
by sundialsvc4 (Abbot) on Aug 31, 2011 at 16:09 UTC

    You can also learn a lesson from all those spinning magnetic tapes, and all those punched cards, in campy science-fiction movies.

    In the years before computers existed, data was processed very successfully using the strategy of sorting one or more data streams according to the same key, after which it could be processed in one sequential pass through one or more files in parallel. Even when you are using an indexed file, much the same benefit can be obtained by sorting the file of keys that are being sought.

    When you know that you are dealing with sorted files, you know two things to be true:

    1. All occurrences of any given key-value will be adjacent.
    2. If any gaps occur in the sequence, no records with those key-values exist anywhere.

    External sorting operations are among the most-studied algorithms, and they are exceptionally efficient. But notice that I said external sorting. “Memory” is virtual, hence it is actually a disk file, and virtual-memory paging results (more or less) in random seeks. “In-memory” operations, including hashing and lists, become considerably less efficient than external-sort-based algorithms once “thrashing” begins in earnest – by several orders of magnitude.
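
    A sketch of that strategy in Perl: rather than seeking once per key, sort the keys you are looking for and make one sequential pass over a key-sorted copy of the data (file name and field layout are assumptions):

        # merge a sorted list of wanted keys against a key-sorted data file:
        # one sequential pass, no random seeks
        my @keys = sort @keys_to_find;          # sort the lookups themselves
        open my $fh, '<', 'sorted_data.txt' or die "open: $!";
        while (@keys and defined(my $line = <$fh>)) {
            my ($key) = split /\t/, $line;      # assume the key is the first field
            shift @keys while @keys and $keys[0] lt $key;   # wanted key absent
            print $line if @keys and $keys[0] eq $key;      # adjacent dups all match
        }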
