
System Performance

by Massyn (Hermit)
on Aug 23, 2007 at 01:03 UTC ( [id://634552] )

Massyn has asked for the wisdom of the Perl Monks concerning the following question:

#!/fellow/monks.pl

I wrote a CPU benchmarking script (at http://www.massyn.net/?p=131) a few weeks back to compare system performance at home with our big IBM AIX servers (just because I'm curious). I realized that using a "prime number" counter is probably not the best way to go, since the cost of finding each new prime grows as the numbers get bigger, so the numbers being returned won't be entirely accurate.

Ok, regardless of that fact, what I did find very strange was that my cpu_bench.pl script performed much slower on a big IBM P690 Regatta (AIX 5.3) than it did on my AMD Athlon XP2600 (running Windows XP, ActiveState Perl 5.6).

Now here's my question: when writing a system performance script, what should I do or not do? It looks like the different versions of perl (between Windows & AIX) do things differently, or my code is just not optimized enough to handle different hardware layers. Maybe Perl just isn't the language to do this in...

What do you say, gentle monks? Have you done something similar? What can we use to determine system performance, both CPU and storage, so I can compare my desktop with my enterprise server and make sure I'm getting the right level of service?
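For what it's worth, Perl's core Benchmark module gives steadier numbers than hand-rolled wall-clock timing, since it reports CPU time. A minimal sketch (the trial-division prime counter here is a hypothetical stand-in for whatever cpu_bench.pl actually does, not taken from it):

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical stand-in workload: count primes below a bound
# by trial division, roughly the kind of loop cpu_bench.pl runs.
sub count_primes {
    my ($limit) = @_;
    my $count = 0;
    CANDIDATE: for my $n (2 .. $limit) {
        for (my $d = 2; $d * $d <= $n; $d++) {
            next CANDIDATE if $n % $d == 0;
        }
        $count++;
    }
    return $count;
}

# timethese runs each sub a fixed number of times and reports
# user/system CPU time, which is more comparable across machines
# than elapsed wall-clock time on a loaded box.
timethese(10, {
    primes_10k => sub { count_primes(10_000) },
});
```

Note this still only measures single-threaded integer work on one core, which is exactly the limitation discussed in the replies below.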

Thanks!

     |\/| _. _ _  ._
www. |  |(_|_>_>\/| | .net
                /
The more I learn the more I realise I don't know.
- Albert Einstein

Replies are listed 'Best First'.
Re: System Performance
by BrowserUk (Patriarch) on Aug 23, 2007 at 04:21 UTC

    The problem is that whilst the Regatta has 32 processors and a huge memory bandwidth, your single-tasking benchmark will only utilise a tiny fraction of all that power. It will only ever run on one of the POWER4 processors, which were introduced in 2001 and run at a now stately 1.1 or 1.3 GHz.

    There is a whole sub-industry devoted to constructing, maintaining and running specialised benchmarks for this kind of hardware to generate headline-grabbing numbers. If you could run one of those on your home machine, it would fare very badly by comparison, despite being (I'm guessing) two or three Moore's-law generations newer hardware.

    Generalised hardware benchmarks are generally pretty useless. A slower (GHz) machine with a top-of-the-range video card will outperform a faster machine with a bad one. Benchmarks only work to the extent that they reflect the realistic operations they are a substitute for.

    And a simple, single-tasking Perl script doing a little repetitive math won't begin to exercise the potential of even a dual-core or dual-processor system, let alone the kind of two-cores-per-die, 4-dies-to-a-card, 4-cards-to-a-box machine like the p690 with its "variable frequency 'distributed switch', wave-pipelined expansion bus".

    It's another indication that the future is multi-tasking, and that languages that rely on the programmer to partition their algorithms, and use fork and pipes or sockets to distribute and coalesce the data, are doomed to disappear.

    Take your prime sieve as an example. It is almost impossible to distribute the processing of a sieve across processors using fork. But it's easy to set multiple threads running that increment the shared candidate counter and then scan the shared sieve array, 'striking off' multiples of their candidate.
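    A minimal sketch of that idea, using Perl's core threads and threads::shared modules (the thread count, sieve limit, and sub name are my own choices, not from the OP's script):

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $limit = 1000;
my @sieve :shared = (1) x ($limit + 1);  # shared array; 1 = "still prime"
my $candidate :shared = 1;               # shared candidate counter

sub worker {
    while (1) {
        my $n;
        # Each thread atomically claims the next candidate.
        { lock $candidate; $n = ++$candidate; }
        last if $n * $n > $limit;        # no multiples left to strike
        next unless $sieve[$n];          # skip already-struck composites
        # Strike off multiples of this candidate in the shared sieve.
        for (my $m = $n * $n; $m <= $limit; $m += $n) {
            $sieve[$m] = 0;
        }
    }
}

my @threads = map { threads->create(\&worker) } 1 .. 4;
$_->join for @threads;

my @primes = grep { $sieve[$_] } 2 .. $limit;
print scalar(@primes), " primes up to $limit\n";   # 168
```

    A thread may occasionally strike multiples of a number another thread is about to rule out; that wastes a little work but never marks a prime as composite, so the result stays correct.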

    Not a convincing example? Then consider manipulating very large digital images, say digital X-rays or CAT or PET scans. Or searching and matching huge strings like genome sequences.

    Threading is coming, like it or not. It's just a matter of which languages are going to make using it easiest.

    (And sorry for hijacking your question to grind my axe :)


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Threads are known to work well only on middle-ground 2-4 CPU machines; beyond that they start hitting the memory-locking wall. And when it comes to things "coming", most of the industry thinks that something like transactional memory is what's actually coming.

      Not to mention that when you look at efficient multi-CPU systems out there, you start noticing designs like Erlang's (granted, what they're using is lightweight threads, but that's not the power of their solution).

      Threads are here, they are ugly, and we're growing out of them pretty fast.

        ... most of the industry thinks that things like transaction memory is the thing that's coming.... Threads are here, they are ugly, and we're growing out of them pretty fast.

        Do you not see the contradiction in those two statements?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

      Just curious: can't you just test processes? I mean like this (from your favourite shell, say) on a unixlike system:

      % cat run
      for i in {1..$1}
      do
          cpu-intensive-task &
      done
      wait
      % for t in 100 200 300; do time run $t; done   # and plot the results
      cheers --stephan

        For performance testing, yes. Absolutely.

        My point about threading is simply that if you have a 32-way processor, unless you are constantly running 32+ separate tasks, you're wasting some of that power. However, if some or all of your fewer-than-32 concurrent tasks are set up for threading, then they will benefit (a little or a lot) whenever there are fewer than 32 tasks running.

        And there are many tasks, like the OP's primes algorithm and the other examples I cited, that do not lend themselves to being multi-tasked through forking, because they need access to shared data.

        All the problems with threading lie with the nature of the low-level abstractions for controlling shared memory access. The language that makes that easier, preferably transparent, will clean up in the future.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: System Performance
by jbert (Priest) on Aug 23, 2007 at 11:08 UTC
    What you should always do is test something which is as similar as possible to what you care about.

    That's all it comes down to really.

    In this particular case, the big-iron machine *is* slower than your desktop for a single-threaded load (as others have pointed out).

    If what you care about is a non-parallelisable problem, your home machine *will* run it faster than the big box.

    If what you care about is producing numbers which show your big box is fast (which is fair enough, it's fun to play with these things), then do what the others mentioned and run multiple cpu crunchers and aggregate the results.

    But this is only CPU, of course. If your problem has a 2 GB dataset, your desktop has 1 GB of RAM and the big box has plenty, then you'll see a big difference depending on how much RAM you use in your performance test.

    So... perf comparisons between hardware come down to how accurately you can model the load you care about. Which is generally limited by how well you understand it and how well you can reproduce 'real world' conditions. (A thousand users on slow modems nibbling away at your app, plus 10% on screaming broadband, can produce a very different load to 100 looping procs on another box on your LAN.)

Re: System Performance
by grinder (Bishop) on Aug 23, 2007 at 13:19 UTC

    I'd stake a beer or two on one area where your IBM big iron will walk all over your piddly desktop, and that is raw I/O throughput.

    Try setting up a dozen processes running in parallel, reading and writing files on the disk. For instance, each process writes a few million records to 20 files opened simultaneously. Then go back and read all the files simultaneously (that is, one record from each open file before getting the next), and write those records out to another file. Then delete everything and start again. Do that about 10 times. See how long it takes from the first process launched until all are finished.
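    A rough sketch of the write phase of such a test, using fork (the process, file, and record counts are arbitrary placeholders; the read-back and repeat phases described above are omitted for brevity):

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical scale: 12 writer processes, 20 files each.
my $PROCS = 12;
my $FILES = 20;
my $RECS  = 50_000;

my $start = time;
for my $p (1 .. $PROCS) {
    defined(my $pid = fork) or die "fork failed: $!";
    next if $pid;                        # parent: keep spawning children
    # Child: open all 20 files at once, then interleave writes to them.
    my @fh;
    for my $f (0 .. $FILES - 1) {
        open $fh[$f], '>', "bench.$p.$f" or die "open: $!";
    }
    for my $r (1 .. $RECS) {
        print { $fh[$_] } "proc $p file $_ rec $r\n" for 0 .. $FILES - 1;
    }
    close $_ for @fh;
    exit 0;
}
wait for 1 .. $PROCS;                    # block until every child finishes
printf "write phase: %.2fs\n", time - $start;
unlink glob 'bench.*';                   # tidy up
```

    The point of interleaving writes across many open files from many processes is to defeat per-file buffering and force the I/O subsystem to seek, which is where enterprise storage shows its worth.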

    The idea is to saturate the IO channel on the machine. If my theory is right, your IBM machine will do much better at dealing with IO, and will finish miles ahead of your desktop.

    • another intruder with the mooring in the heart of the Perl

Node Type: perlquestion [id://634552]
Approved by GrandFather