llancet has asked for the wisdom of the Perl Monks concerning the following question:
Re: Efficiency of seek
by roboticus (Chancellor) on Aug 31, 2011 at 06:31 UTC
The easy question first: Bad news -- it's a costly operation. Generally, if you're forced to make the heads move, you're talking 5+ milliseconds (and the 5 ms is probably a high-end drive).

The good news: modern drives typically read an entire cylinder at a time, so if your seeks are mostly local, you can often bypass the head motion. (The more platters in your drive, the more information in the cylinder.) So when you preprocess the file, try to arrange things to keep the seek distances small, and you'll have fewer head movements.

When a head motion occurs, it has a trapezoidal motion profile: at the start of the movement, the heads accelerate to "top speed", which is a constant-time operation. At the end of the movement, they decelerate back to zero speed, which is another constant-time operation. After that, it takes a little time to stabilize the motion and sync up with the servo track. Between the start and end of the movement, the heads travel at a fixed velocity, so the more tracks you cross, the more time it will take. (However, I doubt it matters much, as I believe -- no evidence -- that the time is dominated by the speed-up, slow-down and synchronize operations.)

(Note: I glossed over a good few details, such as short head movements, in which case the heads never reach full speed before they start slowing down.)

Note: I'm not a hardware manufacturer, though I play one on TV.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
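A toy model of the trapezoidal profile described above; all the constants are invented for illustration, and real values vary by drive:

```perl
use strict;
use warnings;

my $accel_ms  = 1.0;    # accelerate the heads to full speed (roughly constant)
my $decel_ms  = 1.0;    # decelerate back to zero (roughly constant)
my $settle_ms = 1.5;    # stabilize and sync with the servo track
my $ms_per_1k_tracks = 0.1;   # constant-velocity cruise between tracks

sub estimated_seek_ms {
    my $tracks = shift;
    return 0 if $tracks == 0;    # no head movement at all
    return $accel_ms + $decel_ms + $settle_ms
         + $ms_per_1k_tracks * $tracks / 1000;
}

printf "%7d tracks: %.2f ms\n", $_, estimated_seek_ms($_)
    for 0, 100, 10_000, 100_000;
```

The point the numbers make is the one in the post: the fixed accelerate/decelerate/settle costs dominate unless the seek crosses a very large number of tracks.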
by JavaFan (Canon) on Aug 31, 2011 at 12:56 UTC
"It's a costly operation. Generally, if you're forced to make the heads move, you're talking 5+ milliseconds (and the 5ms is probably a high-end drive)"

Technically, you're only paying the price when reading (or writing) after the seek, and then only if the requested page isn't already in the buffer. The seek itself doesn't move the heads. And you may have to pay said price anyway, even without seeking: the heads may stay at the same spot only if your process is the only one accessing the disk.
by roboticus (Chancellor) on Aug 31, 2011 at 21:39 UTC
Re: Efficiency of seek
by salva (Canon) on Aug 31, 2011 at 07:52 UTC
"does the time consumption of perl's seek is related with file size, or it's a constant time operation?"

If the file fits in the OS cache and you have already read it during the preprocessing stage, seek is going to be a very cheap operation, as all the information will be available from fast RAM.

When the file doesn't fit in the cache, randomly accessing it will be several orders of magnitude slower (per operation). In those cases it is usually better to revert to an algorithm that uses a linear access pattern over the data, even if on paper it is worse. A classical example is removing duplicates from some data, either using a hash to detect duplicates (O(N)) or using sort to make duplicates contiguous and trivial to detect (O(N log N)). When the data becomes too big to fit in the available RAM, the random access pattern of the hash-based algorithm degrades its performance to the point of making it useless.

Update: Some code (Linux only, because of the non-portable disk cache cleaning stuff):

That on my computer...
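A minimal sketch of the two approaches being contrasted (illustrative only, not the benchmark code referred to above), assuming a plain text file with one record per line and a Unix sort(1) on the PATH:

```perl
use strict;
use warnings;

my $file = 'records.txt';   # hypothetical input, one record per line

# Approach 1: hash-based dedup, O(N), but the hash is accessed randomly,
# which thrashes once it no longer fits in RAM.
sub dedup_hash {
    open my $in, '<', $file or die "$file: $!";
    my (%seen, @unique);
    while (my $line = <$in>) {
        push @unique, $line unless $seen{$line}++;
    }
    return \@unique;
}

# Approach 2: sort first (O(N log N)) so duplicates become adjacent and the
# scan stays strictly linear. Piping through the external sort(1) utility
# keeps memory bounded even for files much larger than RAM.
sub dedup_sort {
    open my $in, '-|', 'sort', $file or die "sort: $!";
    my @unique;
    while (my $line = <$in>) {
        push @unique, $line if !@unique or $line ne $unique[-1];
    }
    return \@unique;
}
```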
Re: Efficiency of seek
by BrowserUk (Patriarch) on Aug 31, 2011 at 13:02 UTC
I think you've been given a lot of irrelevant information that completely misses the answer you are seeking.

For the purposes of random access, seeking is an amortised, constant-time operation, regardless of the size of the file. That is, all else -- hardware, system buffers, OS, system load, file fragmentation etc. -- being equal, reading the first and then the last record will take the same time as reading the last and then the first, regardless of whether the file is a few tens of megabytes or a few tens of gigabytes.

Is it a costly operation? It depends upon the amount and distribution of the records you must access. If you were to read the file's entire contents in a random order, it will be significantly slower than reading that same content sequentially. If, however, for any given run or time period, you only need a few records from the entire file, seeking to them and reading just those you need is much quicker than reading through the entire file to get to them.

Finally, there are various things that can be done to optimise your seek/read times. Primary amongst these is to ensure that your file is contiguous on disk. Second, if at all possible, ensure that records do not span blocks. It used to be the case -- where the algorithm allowed -- that deferring reads and re-ordering them could reduce head movement and improve throughput, but most disk controllers do this as a matter of course now. The biggest problem with trying to optimise random access is accounting for the effects of other processes accessing the same file-system and/or hardware.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
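A minimal sketch of that access pattern, assuming fixed-length records (the file name and record length here are invented):

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

my $file    = 'records.dat';   # hypothetical file of fixed-length records
my $rec_len = 128;             # hypothetical record length in bytes

open my $fh, '<:raw', $file or die "$file: $!";

# Fetch a single record by its zero-based index: one seek plus one read,
# regardless of how large the file is.
sub fetch_record {
    my $index = shift;
    seek $fh, $index * $rec_len, SEEK_SET or die "seek: $!";
    my $got = read $fh, my $record, $rec_len;
    defined $got && $got == $rec_len or die "short read at record $index";
    return $record;
}

# Reading the first and then the last record (or vice versa) costs the same.
my $first = fetch_record(0);
my $last  = fetch_record( int( (-s $file) / $rec_len ) - 1 );
```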
by salva (Canon) on Sep 01, 2011 at 07:18 UTC
"... that deferring reads and re-ordering them could reduce head movement and improve throughput, but most disk controllers do this as a matter of course now"

That is impossible unless the application uses some kind of asynchronous IO in order to queue several read operations and let the OS/hardware handle them in some order that minimizes head-movement delays. But for common applications using synchronous IO, the app will ask for some data, wait for the OS to fulfill the request, ask for some other data, etc., so there is no way for the OS to reorder the operations.

Another matter is a multithreaded/multiprocess environment, where the OS/hardware can reorder the IO operations queued at any point in time by all the running threads.
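A rough sketch of what queuing several reads at once can look like with asynchronous IO, using the IO::AIO module (module usage sketched from memory; the file name, offsets and read size are invented). All the requests are submitted before any of them is waited on, so the OS and disk are free to reorder them:

```perl
use strict;
use warnings;
use IO::AIO;

open my $fh, '<:raw', 'big.dat' or die $!;   # hypothetical data file

my @offsets = (900_000_000, 4_096, 250_000_000);   # scattered positions
my @bufs;

# Queue every read first; only then wait for completion.
for my $i (0 .. $#offsets) {
    $bufs[$i] = '';
    aio_read $fh, $offsets[$i], 4096, $bufs[$i], 0, sub {
        $_[0] >= 0 or warn "read at $offsets[$i] failed: $!";
    };
}

IO::AIO::flush;   # block until every queued request has completed
```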
by BrowserUk (Patriarch) on Sep 01, 2011 at 16:25 UTC
"That is impossible ..."

Look up the SCAN and C-SCAN algorithms.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
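A toy illustration of the C-SCAN ordering, just to show the idea (the block numbers are invented): pending requests are serviced in ascending order from the current head position, then the head wraps around to the lowest outstanding block.

```perl
use strict;
use warnings;

# Order pending block requests the way C-SCAN would service them.
sub cscan_order {
    my ($head, @pending) = @_;
    my @sorted = sort { $a <=> $b } @pending;
    my @ahead  = grep { $_ >= $head } @sorted;   # serviced on the current sweep
    my @behind = grep { $_ <  $head } @sorted;   # serviced after wrapping around
    return (@ahead, @behind);
}

my @order = cscan_order(53, 98, 183, 37, 122, 14, 124, 65, 67);
print "@order\n";   # 65 67 98 122 124 183 14 37
```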
by salva (Canon) on Sep 01, 2011 at 19:42 UTC
by BrowserUk (Patriarch) on Sep 01, 2011 at 20:06 UTC
Re: Efficiency of seek
by JavaFan (Canon) on Aug 31, 2011 at 08:22 UTC
"As my file is fairly large, does the time consumption of perl's seek is related with file size, or it's a constant time operation? Is it a costly operation?"

That depends. How is your file stored? Is it on a disk, or is it on a tape? If the former, you're likely not to notice, and seek time will not be strongly dependent on file size or seek distance, although exact details depend on how the file is stored (is the page being seeked to already in the file buffer, are we using striping/mirroring, does the physical device spin or not, etc.). However, on a tape, seek time will be roughly linear with seek distance.

Note that the bulk of the work will be done by the OS, not by Perl.
Re: Efficiency of seek
by Anonymous Monk on Aug 31, 2011 at 06:40 UTC
It doesn't matter :) Perl's seek is an interface to the underlying operating system's C runtime fseek function -- there is nothing you can do to speed it up short of upgrading hardware / hardware drivers.

On a related note, see Suffering From Buffering and 4k read buffer is too small.
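If buffering does turn out to matter, a small sketch contrasting Perl's buffered seek/read with the unbuffered sysseek/sysread layer (the file name and offset are made up):

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

my $file = 'data.bin';   # hypothetical file

# Buffered: seek/read go through PerlIO, which fetches buffer-sized chunks.
open my $fh, '<:raw', $file or die $!;
seek $fh, 1_000_000, SEEK_SET or die "seek: $!";
read $fh, my $buffered, 512 or die "read: $!";

# Unbuffered: sysseek/sysread map directly onto lseek(2)/read(2), so you get
# exactly the IO you ask for, with no read-ahead done on your behalf.
open my $raw, '<:raw', $file or die $!;
sysseek $raw, 1_000_000, SEEK_SET or die "sysseek: $!";
sysread $raw, my $unbuffered, 512 or die "sysread: $!";
```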
Re: Efficiency of seek
by thargas (Deacon) on Aug 31, 2011 at 13:59 UTC
To quote DJB: "Profile. Don't speculate." In other words: measure how long it takes instead of guessing. The Benchmark module will help here. No matter how much information you give us, we won't be running your program on your machine, so YMMV.
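A minimal sketch of that kind of measurement with the Benchmark module; the two subs here are trivial stand-ins for your real sequential-scan and seek-based code:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Placeholder bodies -- swap in your real sequential and seek-based access.
sub scan_whole_file { my $x = 0; $x += $_  for 1 .. 1000; return $x }
sub seek_to_records { my $x = 1; $x *= 1.01 for 1 .. 1000; return $x }

# Run each sub for at least 3 CPU-seconds and print a comparison table.
cmpthese(-3, {
    sequential => \&scan_whole_file,
    random     => \&seek_to_records,
});
```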
Re: Efficiency of seek
by flexvault (Monsignor) on Aug 31, 2011 at 12:30 UTC
You've received all good answers, but I suspect that you need to clarify/update your question to get the answer you may still need. For instance,
by llancet (Friar) on Oct 21, 2011 at 01:10 UTC
Thanks for all the answers to my question! I've finished my work.

In my actual application, I have two files: one (file A) is the original data file, and the other (file B) is some kind of analysis result for the original data file. For some reason, the record order in file B is not the same as in file A. I need to do some analysis based on each record in files A and B, so I decided to create an index for file B, access file A sequentially, and seek to the current record's content in file B.
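A rough sketch of that plan, assuming both files are plain text with the record ID as the first whitespace-separated field (file names and layout are assumptions, not the actual format):

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Pass 1: index file B -- record ID => byte offset of its line.
open my $fh_b, '<', 'file_B.txt' or die "file_B.txt: $!";
my %offset_in_b;
while (1) {
    my $pos  = tell $fh_b;
    my $line = <$fh_b>;
    last unless defined $line;
    my ($id) = split ' ', $line;
    $offset_in_b{$id} = $pos;
}

# Pass 2: walk file A sequentially, seeking into B for the matching record.
open my $fh_a, '<', 'file_A.txt' or die "file_A.txt: $!";
while (my $a_line = <$fh_a>) {
    my ($id) = split ' ', $a_line;
    next unless exists $offset_in_b{$id};
    seek $fh_b, $offset_in_b{$id}, SEEK_SET or die "seek: $!";
    my $b_line = <$fh_b>;
    # ... combine $a_line and $b_line for the analysis ...
}
```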
Re: Efficiency of seek
by sundialsvc4 (Abbot) on Aug 31, 2011 at 16:09 UTC
You can also learn a lesson from all those spinning magnetic tapes, and all those punched cards, in campy science-fiction movies. In the years before computers existed, data was processed very successfully using the strategy of sorting one or more data streams according to the same key, after which it could be processed by one sequential run through one or more files in parallel. Even when you are using an indexed file, much the same benefit can be obtained by sorting the file of keys that are being sought. When you know that you are dealing with sorted files, you know three things to be true:

External sorting operations are among the most-studied algorithms, and they are exceptionally efficient. But notice that I said external sorting. “Memory” is virtual, hence it is actually a disk file, and virtual-memory paging results (more or less) in random seeks. “In-memory” operations, including hashing and lists, are considerably less efficient than external-sorting-based algorithms when “thrashing” begins in earnest – by several orders of magnitude.
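A minimal sketch of the sort-then-merge idea, for two text files that have already been sorted on their first field (file names and record layout are assumptions):

```perl
use strict;
use warnings;

# Merge two streams that are both sorted on the same key: one sequential
# pass over each file, no seeking and no in-memory index.
open my $left,  '<', 'a.sorted' or die $!;
open my $right, '<', 'b.sorted' or die $!;

my $l = <$left>;
my $r = <$right>;

while (defined $l and defined $r) {
    my ($lk) = split ' ', $l;
    my ($rk) = split ' ', $r;
    if    ($lk lt $rk) { $l = <$left>  }   # key present only in the left file
    elsif ($lk gt $rk) { $r = <$right> }   # key present only in the right file
    else {                                 # keys match: process the pair
        # ... process $l and $r together ...
        $l = <$left>;
        $r = <$right>;
    }
}
```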