in reply to RE: RE: RE (tilly) 2 (blame): File reading efficiency and other surly remarks
in thread File reading efficiency and other surly remarks

While demonstrating that one of my points is wrong (and again making it clear that until you benchmark, you don't really know what is faster), you demonstrate the other.

What happens in your chunk code with the last line? Which is more code? And once you are done fixing that, you may still be twice as fast, but with quite a bit more (and harder to read) code. Going forward, that is more to maintain.
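To make the last-line problem concrete, here is a minimal sketch of block-at-a-time reading done carefully. This is not the code from the earlier post; the function name `read_lines_blockwise` and its interface are my own invention for illustration. The point is the `$leftover` buffer: a block almost never ends exactly on a newline, so the tail of each block must be carried into the next one, and whatever remains at EOF is the file's final (possibly newline-less) line.

```perl
use strict;
use warnings;

# Sketch only: read a filehandle in fixed-size blocks, splitting lines
# by hand. $leftover carries a partial last line from one block into
# the next; without it, the final line of each block gets mangled and
# the file's last line can be lost entirely.
sub read_lines_blockwise {
    my ($fh, $callback, $blocksize) = @_;
    $blocksize ||= 8192;
    my $leftover = '';
    while (read($fh, my $block, $blocksize)) {
        $block = $leftover . $block;
        # LIMIT of -1 keeps trailing empty fields, so blank lines at
        # the end of a block are not silently dropped.
        my @lines = split /\n/, $block, -1;
        # The last field is either a partial line or '' (if the block
        # ended exactly on a newline); save it for the next read.
        $leftover = pop @lines;
        $callback->($_) for @lines;
    }
    # Anything left over is the final line of a file with no trailing
    # newline.
    $callback->($leftover) if length $leftover;
}
```

For a quick check, you can run it against an in-memory filehandle (`open my $fh, '<', \$string`) with a tiny block size to force partial lines at every boundary.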

I would strongly argue against this optimization (which I think might well give different results on different operating systems) until after your system is built and performance is known to be a problem.

One note, though. The IO::* modules generally carry significant overhead, and I don't recommend using them.

EDIT
Another bug: you used split in the chunk method without the third (LIMIT) argument. Should your block boundary land at the start of a paragraph, you would incorrectly lose lines!
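A two-line illustration of that bug (the string is made up, but the behavior is documented Perl): without a LIMIT argument, split silently discards trailing empty fields, so blank lines at the end of a chunk simply vanish. A negative LIMIT preserves them.

```perl
use strict;
use warnings;

# "line\n\n\n" ends in two blank lines. Without LIMIT, split throws
# away the trailing empty fields; with LIMIT -1 it keeps them all.
my @lossy = split /\n/, "line\n\n\n";        # trailing blanks dropped
my @kept  = split /\n/, "line\n\n\n", -1;    # blanks preserved
```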


Replies are listed 'Best First'.
RE: RE (tilly) 5: File reading efficiency and other surly remarks
by lhoward (Vicar) on Aug 26, 2000 at 21:17 UTC
    I never said that the method I posted was easier to maintain; I only stated that it was significantly more efficient. If fast reading of large files (ones you can't fit into memory all at once) is your concern, then the block/hand-split method is better. Also, the code I used for the "block and manual split" approach is not my own, but lifted from an earlier perlmonks discussion.
      Specifically see RE (tilly) 6 (bench): File reading efficiency and other surly remarks. Your speed claim can only be made for the specific setup you tested. If your code will need to run on multiple machines then the optimization is almost certainly wasted effort. If performance does not turn out to be a problem, it is likewise counterproductive to have sacrificed maintainability for this.

      In short, the fact that this might be faster is very good to know for the times that you need to squeeze performance out on one specific platform. But don't apply such optimizations until you know that you need to, and don't apply this one until you have benchmarked it against your target setup.
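      Since the advice here is "benchmark on your target setup before committing", here is a hedged sketch of how one might do that with the standard Benchmark module. The function name `compare_read_styles` and the 64KB block size are placeholders of mine, not anything from the thread; point it at a representative file on the machine you actually care about.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sketch only: time line-at-a-time reading against block-at-a-time
# reading over the same file, counting newlines in both so the two
# subs do comparable work. Results vary by platform and data.
sub compare_read_styles {
    my ($file, $iterations) = @_;
    cmpthese($iterations, {
        line_at_a_time => sub {
            open my $fh, '<', $file or die $!;
            my $n = 0;
            $n++ while <$fh>;
        },
        block_at_a_time => sub {
            open my $fh, '<', $file or die $!;
            my $n = 0;
            while (read $fh, my $block, 65536) {
                $n += ($block =~ tr/\n//);
            }
        },
    });
}
```

Run it with a negative count (e.g. `cmpthese(-3, ...)` style, or pass enough iterations) to get CPU-time-normalized rates rather than a fixed loop count.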

      A few general notes on optimization. Given the overhead of an interpreted language, shorter code is likely to be faster. With well modularized code you retain the ability to recognize algorithm improvements later - which is almost always a better win. Worrying about debuggability up front speeds development and gives more time to worry after the fact about performance. And readable code is easier to understand and optimize.

      Which all boils down to, don't prematurely optimize. Aim for good solid code that you are able to modify after you have enough of your project up and running that you can identify where the bottlenecks really turned out to be.

        I agree %100. If you get to the point where you absolutely-positively need to squeeze more performanceout of your file-reads you can try the block-at-a-time approach and see if it helps. If you don't absolutely need the performance boost stay with something that is easier to read and less platform-tweaking dependant. On our web-farm it has cut down the run time of our log-file analysis jobs nearly in half.