in reply to the sands of time (in search of an optimisation)

Define "takes some time". If it's something that can be done overnight and it takes less than 8 hours, you're fine. In other words, don't optimize without demonstrating a need to do so.

My criteria for good software:
  1. Does it work?
  2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

Re^2: the sands of time (in search of an optimisation)
by spx2 (Deacon) on Mar 04, 2008 at 02:30 UTC
    Hello,

    Benchmarks show that, at the moment, building the database over 110,000 files
    takes 110 minutes.
    Whether that counts as "some time" or "little time" is arguable.
    My plan is for this to scale to a server with, say, 100,000,000 files and still
    perform well.
    That is why I am searching for optimisations like this one.
    I have also run some benchmarks on the software; here they are.
      So, 110,000 files in 110 minutes works out to 1,000 files per minute, or roughly 17 files/second. To me, this means you're hitting the fundamental limits of Perl. Perl is, frankly, a very slow language from a CPU perspective. That's not what it was optimized for; it's been optimized for developer speed.

      So, I would put forward that you really have two options:

      • Rewrite in C - should give you a 10-1000x speed improvement.
      • Fork. A lot.
      I would try the forking option first. Look at Parallel::ForkManager. There are a number of ways you can split the work among the children, depending on how your directories and files are laid out; a minimal sketch follows below. But that's what I'd do first.
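      A minimal sketch of that approach, assuming one independent job per file;
      process_file() here is a hypothetical stand-in for whatever your real
      per-file processing does:

        use strict;
        use warnings;
        use Parallel::ForkManager;

        my $pm = Parallel::ForkManager->new(8);   # cap at 8 concurrent children

        for my $file (glob '*.dat') {
            $pm->start and next;    # parent: spawn a child, move to next file
            process_file($file);    # child: do the real work
            $pm->finish;            # child exits here
        }
        $pm->wait_all_children;     # parent: wait for the stragglers

        sub process_file {
            my ($file) = @_;
            # placeholder for the actual per-file work
            print "processed $file\n";
        }

      Note that each child is a separate process, so anything it computes has to
      be written somewhere the parent can see (per-child output files, a shared
      database, etc.) rather than into in-memory Perl structures.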

        The first step should be to find out whether CPU or I/O is the bottleneck.

        There's no point in optimizing CPU usage if the program is not blocking on CPU.

        You can simply watch the CPU usage: if it sits at a constant 100% for the whole run, you know CPU optimization is worth pursuing. See the sketch below.
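        One way to quantify that from inside Perl, as a minimal sketch (do_work()
        is a hypothetical stand-in for the real processing loop): compare
        wall-clock time against CPU time as reported by times(). If the two are
        close, you are CPU-bound; if wall-clock is much larger, you are mostly
        waiting on I/O.

          use strict;
          use warnings;
          use Time::HiRes qw(time);   # float-precision wall-clock time

          my $wall_start = time();
          do_work();
          my $wall = time() - $wall_start;

          # times() returns user/system CPU seconds for this process
          # and its reaped children
          my ($user, $system, $cuser, $csystem) = times();
          my $cpu = $user + $system + $cuser + $csystem;

          printf "wall-clock: %.1fs, CPU: %.1fs (%.0f%% CPU-bound)\n",
              $wall, $cpu, 100 * $cpu / $wall;

          sub do_work {
              # placeholder workload: burn a little CPU
              my $x = 0;
              $x += sqrt($_) for 1 .. 2_000_000;
          }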

        If file I/O is the bottleneck, you can experiment with different file systems, RAID configurations, different hard discs, etc.

        What do you think about POE? Do you think it could also be used in this case to parallelize the processing?