
because the OS can switch contexts between processes at any time,

But switching processes is not cheap. Take a good look at the code at the heart of your OS kernel and see what is involved with switching processes.

All the saving and restoring of registers alone is not insubstantial, but before you get to any of that you have to go through the mechanics of deciding which process runs next. This involves some sort of prioritised queue mechanism. You also have to update any dynamic priorities (e.g. foreground boost) and check whether the next round-robin process within the current priority level is eligible to run: is it sleeping, in an IO wait state, etc.?

And once you have chosen the next process to run, you have to check whether it has been swapped out, and potentially shuffle memory to and from disk. Did the process swap invalidate any COW memory that now needs replicating? And almost every process swap is going to cause the processor to stall while the L2 cache is refreshed. Even kernel-level thread swaps involve a substantial amount of housekeeping by the kernel. Less than for a process, but still substantial.

Process swaps are not instantaneous. Can they be done in less than 2000 cycles/instructions? In C it's vaguely possible, but 18500 for perl?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^4: OT How fast a cpu to overwhelm Time::HiRes
by tirwhan (Abbot) on Dec 01, 2005 at 00:42 UTC
    But switching processes is not cheap.

    That depends on your OS ;-). Linux tries to make process context switches extremely cheap (and as a result you can get away with using processes instead of threads for parallel performance). This recent mail on LKML states that on a 3GHz P4 the 2.6 kernel can do up to 700,000 process context switches per second. That's only if the processes do nothing except switch; the mail goes on to explain that under normal workloads you'd only get about 10,000 switches per second. I just took a quick look around the machines I have at hand, and I found one which reports an average of ~60,000 cs/s over a period of fifteen minutes (via sar -w). Running the lat_ctx benchmark from the lmbench suite on an AMD64 machine gives me a minimum context switch overhead of 0.55 microseconds. Given these figures it seems conceivable to me that a context switch can take place in significantly under a microsecond on an extremely fast processor with a large cache. I agree that this is definitely not something you'd expect to happen, but it seems possible.
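To put the quoted rates on the same scale as the microsecond resolution of gettimeofday, here is a minimal sketch of the arithmetic. The helper names `us_per_switch` and `cycles_for` are made up for illustration, and the 3 GHz clock is applied to all three rates purely as an assumption (only the 700,000/s figure was reported for a 3GHz P4). A rate only gives the average interval between switches, which is an upper bound on what one switch costs, not a measurement of the switch itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Average time per context switch, in microseconds, for a given
# rate in switches per second.
sub us_per_switch {
    my ($switches_per_sec) = @_;
    return 1_000_000 / $switches_per_sec;
}

# Number of CPU cycles that fit into a duration (in microseconds)
# at a given clock speed (in GHz). 1 GHz = 1000 cycles per microsecond.
sub cycles_for {
    my ($microseconds, $ghz) = @_;
    return $microseconds * $ghz * 1000;
}

# Rates quoted in the post; the 3 GHz clock is an assumption for all of them.
for my $rate (700_000, 60_000, 10_000) {
    my $us = us_per_switch($rate);
    printf "%7d switches/s -> %7.2f us per switch (~%.0f cycles at 3 GHz)\n",
        $rate, $us, cycles_for($us, 3);
}
```

At 700,000 switches per second the average interval is a little over 1.4 microseconds, so even the best case quoted from LKML is still longer than one gettimeofday tick; the 0.55 microsecond lat_ctx figure is the only one in the post that goes below it.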

    Anyway, code walks, as they say. Here's a little script which forks off a number of processes and tries to get the same gettimeofday value in different children:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday usleep);

    my $parent_time  = (gettimeofday)[0] + 5;
    my $children     = 10;
    my $measurements = 5000;
    my $pid;

    for my $child (1..$children) {
        if ($pid = fork()) {
            # Parent: nothing to do here, go on to fork the next child
        }
        elsif (defined $pid) {
            my ($times, @temp_times);

            # Make all children start measuring as nearly simultaneously as we can
            while (1) {
                last if ((gettimeofday)[0] > $parent_time);
                usleep 1;
            }

            # Get time measurements, stored as seconds followed by
            # zero-padded microseconds, one per line
            for (1..$measurements) {
                @temp_times = gettimeofday();
                $times .= $temp_times[0] . sprintf("%06d", $temp_times[1]) . "\n";
                usleep 2;
            }
            sleep 10;
            open(my $record, ">", "timerecord$child") or die "Can't open record file";
            print $record $times;
            close $record or die "Can't close record file";
            exit;
        }
        else {
            die("Cannot fork");
        }
    }

    # Wait for all children to finish (waitpid returns -1 once none are left)
    my $kid;
    do {
        $kid = waitpid(-1, 0);
    } until $kid == -1;

    # Put measurements into a hashtable and end if any duplicates are found
    my %measured;
    for my $child (1..$children) {
        open(my $record, "<", "timerecord$child") or die "Can't open record file";
        while (<$record>) {
            chomp;
            if (exists($measured{$_})) {
                print "Found duplicate: $_, gettimeofday returned the same value in child $child and $measured{$_}\n";
                exit;
            }
            $measured{$_} = $child;
        }
        close $record or die "Can't close record file";
    }

    # Check for shortest time passed between two measurements
    # taken in different children
    my $difference = 42;
    my ($t1, $t2);
    my $last_i = 0;
    my $last_t = 0;
    for my $t (sort keys %measured) {
        if ($last_i == $measured{$t}) {
            # Same child as the previous timestamp: remember it and move on
            $last_t = $t;
            next;
        }
        my $cur_diff = $t - $last_t;
        if ($cur_diff < $difference) {
            $difference = $cur_diff;
            ($t1, $t2) = ($last_t, $t);
        }
        $last_t = $t;
        $last_i = $measured{$t};
    }
    print "Found minimum delay of $difference between $t2 - child $measured{$t2} and $t1 - child $measured{$t1}\n";

    On SMP machines this easily finds duplicate measurements, so (no surprise there) calls to gettimeofday can return the same value in different processes on SMP. The smallest interval I was able to achieve on a single-processor machine was 11 microseconds. Strangely enough, this was by far not the fastest CPU I tried it on, so I suspect it has something to do with the Linux kernel version (this one is running the Debian 2.6.8 kernel, whereas all others have newer versions).

    So, at least from this practical test it appears you are right, processes don't switch quickly enough for Time::HiRes to return the same result.


    Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
      That depends on your OS ;-). Linux tries to make process context switches extremely cheap (and as a result you can get away with using processes instead of threads for parallel performance). This recent mail on LKML states that on a 3GHz P4 the 2.6 kernel can do up to 700,000 process context switches per second. That's only if the processes do nothing except switch, the mail goes on to explain that under normal workloads you'd only get about 10,000 switches per second.

      I just ran the following script:

      #!perl -slw
      use strict;
      use threads;

      sub thread{
          Win32::Sleep 0 while 1;
      }

      my @threads = map{ threads->create( \&thread ) } 1 .. 100;
      <STDIN>;

      Which sets 100 threads going that do nothing but relinquish the processor in a tight loop. With a single copy of this running, I get sustained measurements of 320,000 context switches/second with occasional peaks of up to 345,000.

      However, if I set a second copy of the script running concurrently, so that roughly 1 in 2 context switches will be a process swap as well as a thread switch, the numbers drop to a sustained average of around 215,000/s, which clearly shows the extra cost of switching processes (on my OS :). It would be interesting to see the numbers you get for a Linux system. Do you run a threaded Perl?

      However, you do not have to do very much at all to drop these figures way down. Calling gettimeofday() on each thread slows this to around 90,000/second for one copy of the 100 thread process, with a second copy of the script bringing it down to 80,000 or so, and each subsequent process causing a similar drop.

      #!perl -slw
      use strict;
      use threads;
      use Time::HiRes qw[ gettimeofday ];

      sub thread{
          my( $s, $u );
          while( 1 ){
              Win32::Sleep 0;
              ( $s, $u ) = gettimeofday();
          }
      }

      my @threads = map{ threads->create( \&thread ) } 1 .. 100;
      <STDIN>;

      Of course, do any form of IO, or anything that semaphores (like shared variable accesses) and the number drops like a stone.

      So, at least from this practical test it appears you are right, processes don't switch quickly enough for Time::HiRes to return the same result.

      No, but it would appear possible for it to happen from within the same process--under Linux and 5.6.2 at least. See the subthread starting at Re^2: OT How fast a cpu to overwhelm Time::HiRes.

      I couldn't get anywhere near it on my system, but the need to convert between the Win32 APIs and the return values dictated by the gettimeofday() call has a significant impact. Even going direct to the high performance timer from C, I couldn't get any closer than just over 1 microsecond elapsed.



        Hmm, how do you measure context switches between threads? This is done in the perl process internally, so the OS doesn't know anything about them, or am I missing something?

        Anyway, if I modify your code to use fork instead of threading I do not get any significant amount of process context switching in vmstat either. I assume this is because of the way the Linux scheduler works (it assigns long timeslices to CPU-bound processes and preempts them if there is a higher-priority IO-bound task). I'll have to figure out how to make Linux context-switch rapidly from Perl.
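One way to force rapid switching, sketched here as an assumption rather than as anything either poster ran, is to bounce a single byte between two processes over a pair of pipes. Each side blocks in a read until the other side writes, so the processes have to take turns. The function name `ping_pong` and the round count are invented for the example. On a single CPU every round trip needs at least two context switches; on SMP the two processes may sit on different CPUs, so the figure is only a rough indicator.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

$| = 1;    # flush STDOUT so the forked child inherits no buffered output

# Bounce one byte between parent and child $rounds times and return the
# elapsed wall-clock time in seconds. Each read blocks until the other
# process has written, so the two processes are forced to alternate.
sub ping_pong {
    my ($rounds) = @_;
    pipe(my $from_parent, my $to_child)  or die "pipe failed: $!";
    pipe(my $from_child,  my $to_parent) or die "pipe failed: $!";

    my $pid = fork();
    die "Cannot fork: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: echo every byte straight back until the parent closes its end
        close $to_child;
        close $from_child;
        my $byte;
        while (sysread($from_parent, $byte, 1)) {
            syswrite($to_parent, $byte, 1);
        }
        exit 0;
    }

    # Parent: send a byte, wait for the echo, repeat
    close $from_parent;
    close $to_parent;
    my $byte;
    my $start = time();
    for (1 .. $rounds) {
        syswrite($to_child, 'x', 1)    or die "write failed: $!";
        sysread($from_child, $byte, 1) or die "read failed: $!";
    }
    my $elapsed = time() - $start;

    close $to_child;      # EOF ends the child's read loop
    waitpid($pid, 0);
    return $elapsed;
}

my $rounds  = 10_000;
my $elapsed = ping_pong($rounds);
printf "%d round trips in %.3f s (%.2f us per round trip)\n",
    $rounds, $elapsed, $elapsed / $rounds * 1e6;
```

Watching vmstat or sar -w while this runs should show whether the switch rate actually climbs; the per-round-trip time also includes the cost of the pipe system calls, so it overstates the cost of the switch alone.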

        Do you run a threaded Perl?

        Yes, 5.8.4 i386-linux-thread-multi.

        I modified my code to use threads instead of processes and got the minimal delay time down to 10 microseconds (from 11 with processes). So it looks like the context switch overhead for Linux Perl processes can be only minimally higher than that of Perl threads (for this benchmark, other workloads are bound to exhibit totally different behaviour). What do you get when you run my code?

        No, but it would appear possible for it to happen from within the same process--under Linux and 5.6.2 at least.

        Yes, I can confirm that this is true for 5.8.4 as well. So it seems to be possible to get identical results from Time::HiRes if

        • subsequent calls are made in quick succession from one process, or
        • several processes are running on an SMP architecture, or
        • possibly, processes are running on a hyperthreaded processor. I don't have a non-SMP HT machine to test this with, but I'd guess it is possible there as well.
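The first case in the list is the easiest to check for yourself. This is a minimal sketch, not the code from the subthread referred to above: it calls gettimeofday in a tight loop and counts how often two consecutive calls return the same seconds and microseconds pair. The function name `count_duplicates` and the call count are assumptions made for the example, and whether any duplicates show up depends on the CPU, the OS and the perl build.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday);

# Call gettimeofday() $calls times back to back and return how many
# times two consecutive calls gave the identical (seconds, microseconds)
# pair.
sub count_duplicates {
    my ($calls) = @_;
    my ($prev_s, $prev_us) = gettimeofday();
    my $duplicates = 0;
    for (2 .. $calls) {
        my ($s, $us) = gettimeofday();
        $duplicates++ if $s == $prev_s && $us == $prev_us;
        ($prev_s, $prev_us) = ($s, $us);
    }
    return $duplicates;
}

my $calls = 100_000;
printf "%d of %d consecutive call pairs returned the same timestamp\n",
    count_duplicates($calls), $calls - 1;
```

A non-zero count means the loop body ran in under one microsecond at least once, which is all it takes for a timestamp-based identifier to collide within a single process.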
