in reply to Re: the sands of time(in search of an optimisation)
in thread the sands of time(in search of an optimisation)

Hello,

At the moment, benchmarks show that building the database over 110,000 files takes 110 minutes.
Whether that counts as slow or fast is debatable.
My plan is to have this scale well on a server with, say, 100,000,000 files; at the current rate that would take roughly 100,000 minutes, which is why I am looking for optimisations like this one.
I have also run some benchmarks on the software; here they are.

Replies are listed 'Best First'.
Re^3: the sands of time(in search of an optimisation)
by dragonchild (Archbishop) on Mar 04, 2008 at 12:51 UTC
    So, it takes you 1 minute for every 1000 files you work with. Alternately, you can process 17 files/second. To me, this means you're hitting the fundamental limits of Perl. Perl is, frankly, a very slow language from a CPU perspective. That's not what it was optimized for. It's been optimized for developer speed.

    So, I would put forward that you really have two options:

    • Rewrite in C - should give you a 10-1000x speed improvement.
    • Fork. A lot.
    I would try the forking option first. Look at Parallel::ForkManager. There are a number of ways you can split the work among the children, depending on how your directories and files are laid out. But, that's what I'd do first.
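    A minimal sketch of the forking approach with Parallel::ForkManager might look like the following. The division of work by top-level directory, the `$MAX_WORKERS` value, and the `process_dir()` helper are all assumptions for illustration; the real partitioning depends on how the file tree is laid out.

        use strict;
        use warnings;
        use Parallel::ForkManager;   # CPAN module suggested above

        my $MAX_WORKERS = 4;         # tune: roughly one worker per CPU for CPU-bound work
        my $pm = Parallel::ForkManager->new($MAX_WORKERS);

        # One child per top-level directory; passing the directories on
        # the command line is just one way to carve up the work.
        for my $dir (@ARGV) {
            $pm->start and next;     # parent forks a child and moves on
            process_dir($dir);       # child builds the DB entries for its slice
            $pm->finish;             # child exits here
        }
        $pm->wait_all_children;      # parent blocks until every child is done

        sub process_dir {
            my ($dir) = @_;
            # placeholder: scan $dir and insert its rows into the database
        }

    Note that each child is a separate process, so results have to come back through the database itself, files, or Parallel::ForkManager's data-passing support rather than shared Perl variables.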

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      The first step should be to find out if CPU or I/O is the bottleneck.

      There's no point in optimizing CPU usage if the program is not blocking on CPU.

      A quick check is to watch CPU usage while the program runs: if it sits at a constant 100%, optimizing the CPU side is worthwhile.

      If file I/O is the bottleneck you can try to experiment with different file systems, RAID, different hard discs etc.
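      A crude way to measure this from inside the program is to compare CPU time consumed against wall-clock time over a representative slice of the run. The busy-loop below is only a stand-in for the real workload (e.g. processing a few thousand files); a CPU fraction near 100% suggests CPU-bound, a much lower one suggests the process is mostly waiting on disk.

          use strict;
          use warnings;
          use Time::HiRes qw(time);

          # Stand-in workload; replace with a representative chunk of the
          # real database build.
          sub run_workload_sample {
              my $x = 0;
              $x += sqrt($_) for 1 .. 2_000_000;
              return $x;
          }

          my $wall0 = time;
          my ($user0, $sys0) = (times)[0, 1];

          run_workload_sample();

          my $wall = time - $wall0;
          my $cpu  = ((times)[0] - $user0) + ((times)[1] - $sys0);

          # Near 100% => CPU-bound; much lower => blocked on I/O.
          printf "wall %.2fs, cpu %.2fs (%.0f%% CPU)\n",
                 $wall, $cpu, 100 * $cpu / ($wall || 1);

      Alternatively, system tools like top, vmstat, or iostat give the same answer without touching the code.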

        CPU usage goes to about 70-90%. The hardware used to test this is 2 x 900 MHz CPUs with 1024 MB of RAM.
      What do you think about POE? Could it also be used in this case to parallelize the processing?
        If Perl's speed is the problem and the reason you're moving to forking, why on earth would you do said forking with a massively large Perl framework rather than a lightweight wrapper around fork?

        Also, moritz has a good point - have you determined if you're CPU-bound or I/O-bound? Forking or rewriting in C isn't going to help if your disk is pegged.
