olafmar has asked for the wisdom of the Perl Monks concerning the following question:

Hello, some time ago I wrote a script to calculate the standard deviation of the corresponding values found in different input files, and this week I decided to convert it to use multiple processors. I am using Perl 5.8.9 compiled with MacPorts with support for ithreads.

Conceptually, the master thread reads the data of the different files, one at a time, and interleaves them in an array. Later I start multiple threads, each one working on a different part of the initial array. I use Statistics::Basic to calculate the standard deviation of the values of a small slice of the initial array and put the results in a temporary array, which is then copied into the results array, the only shared array I use.
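To make the layout concrete, here is a minimal sketch of the interleaving (the variable names and values are only for this example): with $files input files, value number $row of file $f ends up at index $files * $row + $f, so one dataset occupies a contiguous slice.

my $files = 3;
my @input_data;
for my $f ( 0 .. $files - 1 ) {
    # stand-in for the values read from file number $f
    my @values = map { 10 * $_ + $f } 0 .. 4;
    $input_data[ $files * $_ + $f ] = $values[$_] for 0 .. $#values;
}
# dataset $row is then @input_data[ $files * $row .. $files * $row + $files - 1 ]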

This is the code of the main thread managing the work:

my %mythreads;
my $work_package = int($elements / $number_threads + 0.5);
my @ReturnData :shared;
my $start = 0;
my $stop = $start + $work_package * $files - 1;
$stop = $elements * $files - 1 if ($stop >= $elements * $files);
for (my $i = 0; $i < $number_threads; $i++) {
    $mythreads{$i} = threads->create(\&do_work);
    $start = $stop + 1;
    $stop += $work_package * $files;
    $stop = $elements * $files - 1 if ($stop >= $elements * $files);
}
# collect data
foreach (sort(keys(%mythreads))) {
    $mythreads{$_}->join();
}
Each worker thread executes the following sub:
sub do_work {
    my @working_block = @input_data[$start .. $stop];
    my @partial_output_data;
    $, = undef;
    $\ = undef;
    my $output_line = 0;
    for (my $i = 0; $i < @working_block; $i += $files) {
        # calculates the stddev of each slice and outputs the formatted results
        if (max(@working_block[$i .. $i + $files - 1]) == 0) {
            $partial_output_data[$output_line] = '0.00000E+00';
        }
        else {
            $partial_output_data[$output_line] = sprintf('%.5E',
                stddev(@working_block[$i .. $i + $files - 1]) /
                mean(@working_block[$i .. $i + $files - 1]));
        }
        $output_line++;
    }
    @ReturnData[$start / $files .. $stop / $files] = @partial_output_data;
    return;
}
I timed the execution; the results are shown here: http://img62.imageshack.us/i/perlthreading.png/ Six seconds are taken by the other parts of the script; the remainder is spent in the code displayed in this post.

Is this behaviour normal? What could I do to improve the performance? This was more a test than a real need, but I would like to know for the future. I thought threads were easy enough not to increase the code size/effort too much, while processes would have required IPC or something similar.

Thank you very much.

Re: Poor performances with threads
by BrowserUk (Patriarch) on Oct 23, 2009 at 11:18 UTC

    It doesn't help that you are making so many copies of your data and results:

sub do_work {
    my @working_block = @input_data[$start .. $stop];                       ## COPY 1
    my @partial_output_data;
    $, = undef;
    $\ = undef;
    my $output_line = 0;
    for (my $i = 0; $i < @working_block; $i += $files) {
        # calculates the stddev of each slice and outputs the formatted results
        if (max(@working_block[$i .. $i + $files - 1]) == 0) {              ## COPY 2
            $partial_output_data[$output_line] = '0.00000E+00';
        }
        else {
            $partial_output_data[$output_line] = sprintf('%.5E',
                stddev(@working_block[$i .. $i + $files - 1]) /             ## COPY 3
                mean(@working_block[$i .. $i + $files - 1]));               ## COPY 4
        }
        $output_line++;
    }
    @ReturnData[$start / $files .. $stop / $files] = @partial_output_data;  ## COPY 5
    return;
}

    And three of those copies are made as lists on the stack, which then get copied again once, and often twice, into data structures internal to Statistics::Basic.

    I suspect, but cannot verify without trying to re-create your script (why not post the whole thing if you want help?), that you are spending most of your time allocating memory and copying data rather than calculating.
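    For illustration only (this is not your code; the sub name and calling convention are made up), one way to cut those list copies is to pass an array reference plus an index range and work on the original array in place:

sub coeff_of_var_in_place {
    my( $aref, $from, $to ) = @_;
    my $n = $to - $from + 1;

    # first pass over the slice: the mean, without copying the slice
    my $sum = 0;
    $sum += $aref->[ $_ ] for $from .. $to;
    my $mean = $sum / $n;
    return 0 unless $mean;

    # second pass: sum of squared differences, still in place
    my $ss = 0;
    $ss += ( $aref->[ $_ ] - $mean ) ** 2 for $from .. $to;
    return sqrt( $ss / ( $n - 1 ) ) / $mean;
}

# e.g. for the dataset starting at index $i:
#   coeff_of_var_in_place( \@input_data, $i, $i + $files - 1 );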

    Here is a simple script that calculates the coefficient of variation of 55k 7-element datasets in just over half a second:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use Data::Dump qw[ pp ];
use List::Util qw[ sum ];

sub coffOfVar {
    my $mean = sum( @_ ) / @_;
    return 0 unless $mean;
    my @dif2s = map{ ( $_ - $mean )**2 } @_;
    my $stddev = sqrt( sum( @dif2s ) / $#_ );   # sample standard deviation
    return $stddev / $mean;
}

my @data;
push @data, [ map rand( 2000 )-1000, 1..7 ] for 1 .. 55*1024;

my $start = time;
my @results;
push @results, sprintf '%.5E', coffOfVar( @$_ ) for @data;

printf "Took %.2f seconds to find the coffOfVar of %d datasets\n",
    time - $start, scalar @data;

__END__
C:\test>802855.pl
Took 0.59 seconds to find the coffOfVar of 56320 datasets

    Re-using other people's code is all well and good. But when it leads to you writing and supporting far more complex code yourself in order to meet your operational performance requirements--because you are constantly having to restructure the natural state of your data to force-fit it to that of the modules you use, and you have no choice but to inherit a bunch of functionality (or just the preparation for that functionality) that you do not require--then it is time to consider extracting just the code that you actually need from the module and making your life simpler.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      One can calculate the standard deviation in a single pass over the data set, accumulating both the sum and the sum of squares as you iterate.
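      A sketch of that approach (the sub name is only for illustration):

sub coffOfVar_onepass {
    # accumulate the sum and the sum of squares in one loop
    my( $sum, $sumsq ) = ( 0, 0 );
    for ( @_ ) {
        $sum   += $_;
        $sumsq += $_ * $_;
    }
    my $n    = @_;
    my $mean = $sum / $n;
    return 0 unless $mean;
    my $var  = ( $sumsq - $n * $mean * $mean ) / ( $n - 1 );   # sample variance
    return sqrt( $var ) / $mean;
}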

      In the case of only 7 numbers to sum over that doesn't make much of a difference (0.57s vs. 0.48s on my machine). The difference becomes more pronounced when there are more numbers in one set and the whole array no longer fits into the cache (2.94s vs. 4.04s for 1k lists with 7k items each).

      But when such micro optimizations become important, it's worth pointing out that writing things in C (via XS or Inline::C) might be much faster.
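      For example, a rough Inline::C sketch (the function name and glue here are made up, and it assumes the dataset is passed as an array reference):

#!/usr/bin/perl
use strict;
use warnings;

use Inline C => <<'END_C';
#include <math.h>

/* Coefficient of variation of a numeric array, computed in C. */
double coff_of_var_c( AV *data ) {
    int    n = av_len( data ) + 1;
    int    i;
    double sum = 0.0, sumsq = 0.0, mean, var;

    if ( n < 2 ) return 0.0;
    for ( i = 0; i < n; i++ ) {
        SV **elem = av_fetch( data, i, 0 );
        double x  = elem ? SvNV( *elem ) : 0.0;
        sum   += x;
        sumsq += x * x;
    }
    mean = sum / n;
    if ( mean == 0.0 ) return 0.0;
    var = ( sumsq - n * mean * mean ) / ( n - 1 );   /* sample variance */
    return sqrt( var ) / mean;
}
END_C

printf "%.5E\n", coff_of_var_c( [ map rand( 2000 ) - 1000, 1 .. 7 ] );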

      Perl 6 - links to (nearly) everything that is Perl 6.
        doesn't make much of a difference (0.57s vs. 0.48s on my machine)

        Agreed. But once you've reduced the time from 60 seconds to 0.6 of a second, the extra saving hardly seems worth it :) But I guess that's up to the OP.

        But it does highlight the problems with modules like Statistics::Basic. It offers an OO interface, storing copies of the data internally, with the intention that it only needs to calculate things like the mean once whilst providing access to many types of statistic that re-use that calculated value.

        The problem is, calculating the mean of 7 values manually is quicker than calling a nested inherited method to retrieve it from storage. If you are processing single large datasets, the amortisation can lead to savings (assuming you actually use the OO interface, which the OP isn't). But for processing many small datasets, there are no amortisation gains.

        The upshot is, users pay the penalties of the OO without gaining any benefits and then go off looking at complex solutions (threads in this case) to try and recover the losses.

        So: 41 installed files and 3 dozen lines of code (and more) to achieve what can be done far more productively in 6 lines.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      In the first version of the script I used hashes of arrays to iterate through the different input files and data; then I realised that I didn't need the hashes at all and switched to a single 1D array containing interleaved data from the different inputs. 20% faster.

      I really don't know why I kept the Statistics module... for such a simple computation. I will now apply your suggestions and I will study them a bit...

      Thank you very much.
        Done, I used the manual calculation of the stddev on slices of the original array containing the interleaved data. It takes longer than the whole-set calculation you wrote, but it is still 9 seconds against the original 60 seconds (meaning 3 seconds of actual calculation against 54).

        Thank you.

        I am posting the whole code, even though I am already satisfied. If you have time to read it, maybe I can learn something more.
        The script requires an input file on the command line. That file lists the actual starting data files, each one with a weight.
        The header of one input data file is simply copied to the output; the following values (4 per line) are split, multiplied by the weight factor and interleaved into the data array (reading is slow, so I tried to do some of the work in this step).
        In the last part I recreate the file format, with 4 results per line after the headers. In between is the code I asked you about.

#!/usr/bin/perl -w
# (C) Olaf Marzocchi, 2009
# Calculates the std dev of different tallies; the size must be the same.
# It assumes a standard order of the values: after LOOKUP_TABLE always the data.
require 5.004;
use strict;
use List::Util qw[ sum ];
# use Data::Dump qw[ pp ];
use POSIX qw(strtod setlocale LC_NUMERIC);
setlocale(LC_NUMERIC, "en_US");

$\ = "\n";
$, = "\t";
if (@ARGV < 1) {
    print "Usage: $0 config_file [output_file]";
    print "  Config file must contain on each line the relative path to one ";
    print "  .vtk file and the weight of each one, separated with tabulation.";
    exit;
}
print "Processing $ARGV[0]...";

# Formats the filenames in order to always get the extension
my %inputs;
my $output_name = "";
my $weights = 0;
open(INPUTS, '<', $ARGV[0]) or die "Error opening $ARGV[0]: $!";
while (<INPUTS>) {
    next if /^#/;
    s|\.vtk\t|\t|;
    /^([^\t]+)\t+([0-9eE+-.,]+)/;
    $weights += strtod($2);
    $inputs{$1 . '.vtk'} = strtod($2);
}
close(INPUTS);
$output_name = $ARGV[0] . ".vtk";
$output_name =~ s/\.txt//g;
$output_name = $ARGV[1] if (@ARGV > 1);

# Sum of the weights normalised to 1; also check that each file exists
foreach (keys(%inputs)) {
    $inputs{$_} /= $weights;
    next if (-e $_);
    print "$_ is not available.\n";
    exit;
}

# The header of the first file is also copied to the output.
my @input_data;
my @output_data;
my $header_lines = 0;
my $elements;
my $filename = each(%inputs);
open(INPUT, '<', $filename) or die "Error opening $filename: $!";
print "Using $filename for headers.\n";
while (<INPUT>) {
    chomp;
    push(@output_data, $_);
    last if (/lookup_table/i);
    $elements = $1 if (/point_data +(\d+)/i);
    $header_lines++;
}
close(INPUT);

# Saves the actual data of the different input files into an array with
# values already interleaved.
my $files = keys(%inputs);
my $current_file = 0;
foreach $filename (sort(keys(%inputs))) {
    print $filename, $inputs{$filename};
    open(INPUT, '<', $filename) or die "Error opening $filename: $!";
    my $i = 0;
    while (<INPUT>) {
        last if ($i == $header_lines);
        $i++;
    }
    my $j = 0;
    while (<INPUT>) {
        chomp;
        my @records = split;
        foreach (@records) {
            # multiplies by the weight and puts the value in the array
            $input_data[$files * $j + $current_file] = $_ * $inputs{$filename};
            $j++;
        }
    }
    close(INPUT);
    $current_file++;
}

my @ReturnData;
for (my $i = 0; $i < @input_data; $i += $files) {
    push @ReturnData, sprintf '%.5E', coffOfVar( @input_data[$i .. $i + $files - 1] );
}

# file generation
for (my $i = 0; $i < @ReturnData; $i += 4) {
    if (@ReturnData - $i < 4) {
        push(@output_data, join(' ', @ReturnData[$i .. $#ReturnData]));
        last;
    }
    push(@output_data, join(' ', @ReturnData[$i .. $i + 3]));
}

$, = " ";
$\ = "\n";
print $output_name;
open(OUTPUT, '>', $output_name) or die "Error opening output: $!";
foreach (@output_data) {
    print OUTPUT $_;
}
close(OUTPUT);
exit(0);

sub coffOfVar {
    my $mean = sum( @_ ) / $files;
    return 0 unless $mean;
    my @dif2s = map { ( $_ - $mean )**2 } @_;
    my $stddev = sum( @dif2s ) / $files;
    return $stddev**0.5 / $mean;
}
Re: Poor performances with threads
by BrowserUk (Patriarch) on Oct 23, 2009 at 09:59 UTC
    ... calculate the standard deviation of the values of a small slice of the initial array ...

    If it takes longer to spawn a thread than to process that small slice, you have a net slowdown.
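    A quick, rough way to get a feel for the spawn cost (illustrative only): time creating and joining a few do-nothing threads, and bear in mind that each spawn also clones whatever data the parent interpreter already holds at that point.

#! perl -slw
use strict;
use threads;
use Time::HiRes qw[ time ];

# Time N empty threads; the cost grows with the amount of data the
# parent holds at spawn time, since it all gets cloned.
my $n = 8;
my $start = time;
my @threads = map { threads->create( sub { 1 } ) } 1 .. $n;
$_->join for @threads;
printf "Spawned and joined %d empty threads in %.3f seconds\n",
    $n, time - $start;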

    How long does a single (non-threaded) solution take? How many cores do you have? How many concurrent threads does that graph represent?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      The graph represents 1, 2, 3, 4 and 8 threads, and I have 8 cores.

      Actually I hadn't thought about the time required to spawn the threads... since each thread has at least 55k sets of 7 values to work on, I assumed the spawning time was negligible.

      1 thread: 60 seconds
      2 threads: 40 seconds
      4 threads: 37 seconds
      8 threads: 54 seconds

      I think I will settle on 2 threads for the time being; it should be a good compromise for smaller datasets as well.

Re: Poor performances with threads
by markuhs (Scribe) on Oct 23, 2009 at 11:27 UTC
    I would instantiate the worker threads before reading the data. You could even let them do the reading themselves, but either way it can speed things up since thread creation is faster...
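    Something along these lines, with Thread::Queue (a rough sketch; the worker count and the work items are only placeholders):

#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Spawn the workers while the parent is still small and cheap to clone,
# then feed them work items through a queue.
my $queue   = Thread::Queue->new;
my @workers = map {
    threads->create( sub {
        while ( defined( my $item = $queue->dequeue ) ) {
            # ... process one chunk of data here ...
        }
    } );
} 1 .. 4;

# only now read the input files and hand the chunks out
$queue->enqueue( $_ ) for 1 .. 100;       # placeholder work items

$queue->enqueue( undef ) for @workers;    # one terminator per worker
$_->join for @workers;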

    Consider: http://perldoc.perl.org/perlthrtut.html#Performance-considerations

    Lukas
Re: Poor performances with threads
by zentara (Cardinal) on Oct 23, 2009 at 12:52 UTC
    I think you are just running some intensive math computations, which is one of the slower aspects of Perl due to its variable storage techniques.... threads run as a concurrent process... so yeah... the only thing you can do is find the best way to run in separate processes. Your single process is given a nice level by the kernel, and the many threads split that.

    I've been away too long, but maybe there is a module available to increase the "nice" value of your threaded script..... it would be even cooler if you could adjust each thread's priority within the main process. Of course that may bog down your other apps.

    But even if you run each computation in a separate process, you will still hit the limit of your computer's processing power..... If you went through IPC, you might want to write the computational code in C.

    Maybe set up a distributed computing network and farm out the statistical computations to other machines... :-)


    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku