jaa has asked for the wisdom of the Perl Monks concerning the following question:

A novice breaks his silence, seeking wisdom in the typing of many monkees...

I am after recommendations for a structural rewrite to improve performance: priorities are 0) maintainable 1) max speed, 2) min memory. This is in a data-processing context, not web pages etc.

Problem: read a number of input items (500K items), transform them into output items and save them into load tables in the database.

Focus: optimal loop structure and transformation.

Current overview
load all items into array of hash for each input item { complex transformation rules } save all items (timestamped) remove all items with old timestamp
Some alternatives I am testing for the loop:
OUTER-IF: if ( $type eq 'apple' ) { for my $item ( @items ) { apple_wash($item); apple_core($item); apple_pulp($item); } } elsif ( $type eq 'banana') { for my $item ( @items ) { banana_bend($item); banana_hang($item); } } else { die "bad fruit: $type"; } INNER-IF: for my $item ( @items ) { if ( $type eq 'apple' ) { apple_wash($item); apple_core($item); apple_pulp($item); } elsif ( $type eq 'banana') { banana_bend($item); banana_hang($item); } else { die "bad fruit: $type"; } } ARRAY OF SUB-REF: my @funcs; if ( $type eq 'apple' ) { push @funcs, \&apple_wash, \&apple_core, \&apple_pulp; } elsif ($type eq 'bananas') { push @funcs, \&banana_bend, \&banana_hang; } else { die "bad fruit: $type"; } for my $item ( @items ) { for my $func (@funcs) { $func->($item); } } SYMBOL-TABLE FIDDLE: if ( $type eq 'apple' ) { *func1 = \&apple_wash; *func2 = \&apple_core; *func3 = \&apple_pulp; } elsif ($type eq 'bananas') { *func1 = \&banana_bend; *func2 = \&banana_hang; *func3 = \&noop; } else { die "bad fruit: $type"; } for my $item ( @items ) { func1($item) unless \&func1 == \&noop; func2($item) unless \&func2 == \&noop; func3($item) unless \&func3 == \&noop; }
Specifics: I have profiled some test code with these four approaches -

OUTER-IF is the (slightly) fastest, but unwieldy in the real instance, as there are hundreds of infrastructural lines omitted from the example that make if hard to maintain duplicate FOR LOOPs.

INNER-IF is also fast, but results in difficult to maintain and long... code inside the FOR LOOP

ARRAY OF SUB-REF: slower than IFs - must be cost of dereferencing the function ref? but makes huge FOR LOOP mucho clearer.

SYMBOL-TABLE FIDDLE: I am a symbol-table virgin, and my symbol-table stuff looks poor - are there better sym-t syntacies? Generates 'Redefined function XXX' warnings.

Oh monks of Perl, I seekest thy wisdom... TIA Jeff

edited: Tue Dec 17 15:23:04 2002 by jeffa - title truncation (was: performance - loops and complex decisions, sub refs, symbol tables, inner and outer if/elses)

update 2 (broquaint): added <readmore> tag

Replies are listed 'Best First'.
Re: performance - loops and complex decisions, sub refs, symbol tables, inner and outer if/elses
by Fletch (Bishop) on Dec 17, 2002 at 15:30 UTC

    Don't fiddle with the symbol table when just a simple hash of code refs will do.

    sub noop { 1; } my %funcs = ( apple => [\&apple_wash, \&apple_core, \&apple_pulp,], bananas => [\&banana_bend, \&banana_hang, \&noop,], ); ## . . . for my $i ( @items ) { unless( exists $funcs{ $type } ) { warn "Bad type `$type'\n"; next; } $_->( $item ) foreach( @{ $funcs{ $type } } ); }

    Of course it might make more sense to just make your items proper objects rather than using a dispatch table.

(z) Re: performance - loops, sub refs, symbol tables, and if/elses
by zigdon (Deacon) on Dec 17, 2002 at 15:25 UTC

    Current overview load all items into array of hash for each input item { complex transformation rules } save all items (timestamped) remove all items with old timestamp

    Not commenting about the rest, but this part here will cost you both in speed and in memory. Any reason why you can't process each item seperately? If you avoid loading all your data into memory, your program will defenitly benefit:

    while (read next item) { complex transformation rules save item unless it's too old }

    -- Dan

Re: performance - loops, sub refs, symbol tables, and if/elses
by MarkM (Curate) on Dec 17, 2002 at 15:46 UTC

    In terms of performance, and maintenance, with a variable set of $type's mapping to a variable set of subroutines that need to be invoked, I would combine and enhance two of your approaches to be:

    my %dispatch_table = ( 'apple' => [\&a, ...], 'orange' => [\&o, ...], ... ); while (defined($object = read one)) { $_->($object) for @{$dispatch_table{$type}}; }

    This approach avoids using a large amount of memory up front to read all objects in to memory, and dispatches using a hash lookup, which will ensure that dispatching is far more linear in terms of number of possible $type values, than would be possible with a string of if, elsif, elsif, ... would be.

    Another small trick that I employ above, is to use the suffix modifier form of the for() loop. Although in theory, for(){statement;} should be as fast as statement for();, it is not. The second does not use a block, and avoids some of the block overhead by using simpler op codes underneath.

Re: performance - loops, sub refs, symbol tables, and if/elses
by jdporter (Paladin) on Dec 17, 2002 at 15:38 UTC
    I'd try the Symbol Table Fiddle approach, but only do one function that way, e.g.
    if ( $type eq 'apple' ) { *func = sub { &apple_wash; &apple_core; &apple_pulp; }; }
    Then I'd use the statement-modifier form of "for", which avoids setting up a lexical block:
    &func for @items;
    (Pass the item in $_ instead of in an explicit variable.)

    jdporter
    ...porque es dificil estar guapo y blanco.

      I'm confused. You want create a new function every time you encounter an object of type 'apple'? Also, it seems the subroutines for each object take a parameter (unless you mean to use & to preserve @_, but he said it should be maintainable ;^) so the new sub would have to be:

      *apple_proc = sub { my $item = shift; apple_wash($item); apple_core($item); apple_pulp($item); }; # same for bananas here.

      I don't think you can avoid the lexical block here since we still need to test for the type of object before executing the appropriate subroutine:

      foreach my $item (@items) { SWITCH: for ($type) { # not sure from the writeup where # $type is supposed to come from /apple/ && do { apple_proc($item); }; /banana/ && do { banana_proc($item); }; warn "Unknown thingy: $type\n"; # Default case } }

Re: performance - loops, sub refs, symbol tables, and if/elses
by rir (Vicar) on Dec 17, 2002 at 17:14 UTC
    In your FIDDLE and your SUB-REF code you seem to be building your call list dynamically. This is a waste.

    You assume you are cpu bound. You're right. But you are bound to be, cause you are waiting to read everything, bound to process it all, then waiting to write it. You may be solving the wrong problem.

    It is not clear whether you have enough memory to avoid swapping. It is unknown whether you have other tasks running in your environment. Also unknown is whether you have multiprocessors.

    Some ideas:

    • Do items one by one.
    • Do items in batches small enough to fit memory.
    • Additionally have multiple programs doing this job, 1 to 3 more processes than cpus, to start.
    • Eliminate routine calls, i.e. apple_wash_core_pulp.
    • Use a hash to look up your processes:
      # off-hand untried code PROGRAM_INIT: my %proc_table = ( type => \&process, apple => \&apple_wash_core_pulp, banana => \&banana_bend_hang, ); foreach ( @item){ &$proc_table{$type} if exists $proc_table{$type}; next; unknown_type(); }
    • Use an array and constants instead of a hash. Similar to the above but use &$proc_array[$type]. Also your data could loaded into an array of arrays.
Re: performance - loops, sub refs, symbol tables, and if/elses
by pg (Canon) on Dec 17, 2002 at 16:38 UTC
    Form a maintainable view, I would suggest you, to put things in OO style:
    workflow.pm: package workflow; use strict; sub new { my $self = {}; $self->{TASKS} = []; bless $self; return $self; } sub add_task { my $self = shift; push @{$self->{TASKS}}, shift; } sub do_it { my $self = shift; my $item = shift; foreach my $task (@{$self->{TASKS}}) { &{$task}($item); } } 1; workflow_table.pm: package workflow_table; use workflow; use strict; sub new { my $self = {}; $self->{WORKFLOWS} = {}; bless $self; return $self; } sub add_task { my $self = shift; my $type = shift; my $task = shift; if (!defined($self->{WORKFLOWS}->{$type})) { $self->{WORKFLOWS}->{$type} = new workflow; } $self->{WORKFLOWS}->{$type}->add_task($task); } sub do_it { my $self = shift; my $type = shift; my $item = shift; $self->{WORKFLOWS}->{$type}->do_it($item); } 1; test.pl use workflow_table; use Data::Dumper; use strict; sub apple_wash {print "wash apple ".shift()."\n";} sub apple_core {print "core apple ".shift()."\n";} sub apple_pulp {print "pulp apple ".shift()."\n";} sub banana_bend {print "bend banana ".shift()."\n";} sub banana_hang {print "hang banana ".shift()."\n";} my $t = new workflow_table; $t->add_task("apple", \&apple_wash); $t->add_task("apple", \&apple_core); $t->add_task("apple", \&apple_pulp); $t->add_task("banana", \&banana_bend); $t->add_task("banana", \&banana_hang); $t->do_it("banana", "banana1"); $t->do_it("banana", "banana2"); $t->do_it("apple", "orange?");
      Except that the OO style has a significant performance penalty relative to the sub style, and the idea here is for performance....
Re: performance - loops, sub refs, symbol tables, and if/elses
by Mr. Muskrat (Canon) on Dec 17, 2002 at 15:38 UTC

    -- for calling us monkees.
    Just kidding!

    But we really prefer the term "Monks" when you are referring to us collectively.

      ...that's right! Strictly speaking, I believe "monkee" is the person receiving the answer, and "monker" is the person creating the answer.

      ...All the world looks like -well- all the world, when your hammer is Perl.
      ---v

        And here I thought we were being referred to as a "TV show about a band that became a band without a TV show"...

        psst... The Monkees - That Was Then

Re: performance - loops, sub refs, symbol tables, and if/elses
by BrowserUk (Patriarch) on Dec 18, 2002 at 10:07 UTC

    The biggest problem with your performance testing is that they do not appear to me to be equivalent?

    In the OUTER-IF case, your check some flag, and then process your entire array as whichever type that flag indicates.

    In the INNER-IF case, you check that flag for every item to decide how to process it.

    Personally, I can see no logic at all to pushing sub references into an array and invoking them in a seperate loop later.

    You end up calling exactly the same number of subroutines, except that you have to dereferece an array element to get a scalar and then dereference that to invoke the sub. On top of that, you have to build the array(s) to store the references in in the first place. It may take less space on the screen, but your increasing both memory and cpu requirements for no gain at all.

    The last one is equally bewildering. You save the conditional for deciding which function to call, but replace it with a condition to test whether the coderef has a value and can be called!

    Again, your making the decision as to the type outside the loop, so whichever type $type is set to (however it is set?), every item is being treated as that same type.

    Whatever benchmarking you did do, if it was based on a comparison of these 4 peices of code was a complete waste of time as they do not do the same things. Also, if your really after performance, then dump the 2/3/? functions to process each type and call one function with all the code inside it.

    If you can make your $type decision once and then process the entire array based on that decision as implied by three of your 4 peices of sample code, then move the seperate processing sections (wash/peal/code) and the loop into a single sub (which could branch internally if needed) and pass a reference to the data array and your code could become.

    $type = determine_type(); { process_apples(\@items), last if $type eq 'apples'; process_bananas(\@items), last if $type eq 'bananas'; ... } #final stuff common stuff exit(0);

    If you really needed to wring the last ounce of performance, you might find the computed goto and avoiding the sub call might gain you a % or two, but it probably wouldn't improve your readability.

    If however, you need to make the decision of $type for every item as implied by your INNER-IF sample then

    $type = determine_type(); for $item (@items) { process_apples($item), next if $type eq 'apples'; process_bananas($item), next if $type eq 'bananas'; ... } #final stuff common stuff exit(0);

    ..

    I seekest thy wisdom

    Erm...cut it with the cutesy examples and show us the real (or at least a good simulation of the) DATA, and an example of the type of processing you are doing.

    Then you might get some useful answers.


    Examine what is said, not who speaks.

Re: performance - loops, sub refs, symbol tables, and if/elses
by chromatic (Archbishop) on Dec 17, 2002 at 19:19 UTC

    It's hard to answer this in isolation, because so much depends on the data and the rest of the program. If TYPE is consistent through the list, go for the outer-if.

    Did you profile your program, though?

Re: performance - loops, sub refs, symbol tables, and if/elses
by sporte01 (Acolyte) on Dec 17, 2002 at 16:26 UTC
    Why not use OOP? Create base class architecture (too bad there are no java-like interfaces in perl). Inheret what you need and implement the rest. No more need for all this string parsing.. just execute on your data. All the function references will be taken care of.
        I wouldn't categorise OO-Perl as all-caps-with-emphasis slow. In practical terms, it's just an extra argument to your functions, which just happen to be in a different namespace.

        Here's a simple benchmark that reveals a maybe 6-10% penalty for using OO without any optimizations.
        use warnings; use strict; use Benchmark qw[ cmpthese ]; package Banana; sub new { my $class = shift; return bless({ @_ }, $class); } sub peel { my ($self) = @_; my $x; foreach (keys %$self) { $x += $self->{$_}; } $x; } package main; sub peel { my ($banana) = @_; my $x; foreach (keys %$banana) { $x += $banana->{$_}; } $x; } cmpthese(10_000,{ oo_peel => sub { my $banana = Banana->new( foo => 10, bar => 1 +5); for (1..250) { $banana->peel(); } }, pr_peel => sub { my $banana = { foo => 10, bar => 15 }; for (1..250) { peel($banana); } }, });
        I've found that while using an object-oriented approach might be slow at the outset, having a good framework does make optimising easier.
Re: performance - loops, sub refs, symbol tables, and if/elses
by krazken (Scribe) on Dec 19, 2002 at 16:31 UTC
    You have 3 perl programs...

    first one...reads the data and decides which program to send the data to.
    second program...does the apple transformations and the apple database stuff
    third program...does the banana transformations and the banana database stuff.


    Why break it up like this you ask?? Well, you don't have to, but where I am going with this is that the first program can fork off a process and call either the second or third program..and assuming your database can handle multiple transactions at once, you shouldn't be in any trouble from that point. Plus this method would allow you to work over multiple processors on the server to get the maximum performance out of it. Also, from a maintability standpoint, you know you have an apple program and a banana program, so if you have to change the apple logic, you don't have to sift through extraneous code looking for it.

    Just a thought. later.
    krazken
Re: performance - loops, sub refs, symbol tables, and if/elses
by jaa (Friar) on Dec 20, 2002 at 19:26 UTC
    Folks,

    Thanks for the feedback and suggestions. It's great to get a variety of responses, as it enables me to test my assumptions against the experience of others. I will be digesting your suggestions over the next week or so.

    For those who asked why the data is all read up front, the answer is that it is confidential / sensitive, and loaded from an encrypted file on disk. The decryption happens all in a single call and the decrypted results get stored in an array. I guess we could have done that bit as a filter...

    Thanks again,

    Jeff