Odud has asked for the wisdom of the Perl Monks concerning the following question:

Many of you are experts in the Perl performance area, and I would appreciate your comments on the following code snippets - they do the job, but I have no idea whether they are particularly efficient. What I am trying to do is produce a sorted list of words from which duplicates and common words (a, at, the, if, etc.) have been removed.

The size of the list is typically about 300 items but could, in theory, be much larger.

The first thing I do is produce a sorted unique list using some code that I saw somewhere (probably the Perl Cookbook)
# @words is the list of words (composed of letters and numbers only, no punctuation)
my($self) = @_;
foreach $r (@words) {
    $r{lc $r} = 1;
}
@words = sort keys %r;
The next thing I do is check whether any of the words are also in the common list; if so, they don't get carried forward. I'm suspicious of this code because @only is a temporary variable, which often suggests an improvement is possible.
# %seen is a hash used as a set of common words (built via a hash slice)
@common = qw(a and at);    # bigger in real life
@seen{@common} = ();
foreach $item (@words) {
    push(@only, $item) unless exists $seen{$item};
}
@words = @only;
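In case the slice line looks odd in isolation: @seen{@common} = () is a hash slice on %seen, not a pseudo-hash. It creates one key per common word, each with the value undef, which is all an exists test needs. A minimal standalone sketch of the idiom (same variable names as above):

#!/usr/bin/perl -w
use strict;

# Build a set from a list via a hash slice: each common word becomes
# a key with the value undef -- enough for an exists() membership test.
my @common = qw(a and at);
my %seen;
@seen{@common} = ();

print exists $seen{and}  ? "common\n" : "rare\n";   # common
print exists $seen{perl} ? "common\n" : "rare\n";   # rare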
I'd appreciate your comments and suggestions very much as this is an area in which I'm happy with the basics but haven't much knowledge of the finer points.

Re: List processing performance
by btrott (Parson) on Jul 11, 2000 at 19:56 UTC
    Well, in general you're doing pretty well, I think. You're using hashes for uniqueness lookups, which is always a good first step. :)

    One thing that came to my mind is: why not combine the two steps? You'd end up with something like this:

    my($self) = @_;
    my(%seen, %r);
    my @common = qw(a and at);
    @seen{@common} = ();
    for my $r (@words) {
        next if exists $seen{ lc $r };
        $r{lc $r} = 1;
    }
    my @words = sort keys %r;
      playing golf,
      my @words  = ( ... );
      my @common = qw|a and at|;
      my %seen;
      @seen{@common} = (1) x @common;
      @words = grep { $_ = lc; ! $seen{$_}++ } sort @words;
      :-) Autark
        Oh, is *that* what we're doing then. :) In which case replace your last line with this:
        @words = sort grep !$seen{$_ = lc}++, @words
        A couple of characters shorter, and faster too, because the sort runs after you've filtered out duplicates and common words.
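        If you want to see that effect in numbers, a minimal Benchmark sketch (word list and counts invented purely for illustration) would look something like this; filtering first means sort sees far fewer elements:

        #!/usr/bin/perl -w
        use strict;
        use Benchmark;

        # Invented data: 5000 words, heavy duplication, so the filtered
        # list is roughly a tenth the size of the raw one.
        my @words  = map { "word" . ($_ % 500) } 1 .. 5000;
        my @common = qw(a and at);

        timethese(500, {
            sort_first => sub {
                my %seen;
                @seen{@common} = (1) x @common;
                my @out = grep { !$seen{lc $_}++ } sort @words;
            },
            filter_first => sub {
                my %seen;
                @seen{@common} = (1) x @common;
                my @out = sort grep { !$seen{lc $_}++ } @words;
            },
        });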
      A good point about combining the two parts - the real code is actually two subroutines, but only because I developed it in stages; there's no compelling reason for it. Your method also eliminates the temporary list that was bugging me. Thanks muchly.
RE: List processing (benchmark results)
by Russ (Deacon) on Jul 12, 2000 at 01:21 UTC
    Update: btrott found that I had left out a piece of the original code. New code and results are shown here.

    Being interminably curious, I benchmarked a few of the suggested snippets of code.

    Benchmark: timing 1000 iterations of Odud, PMGurus, btrott, visnu...
          Odud: 10 wallclock secs ( 9.61 usr +  0.01 sys =  9.62 CPU)
       PMGurus:  6 wallclock secs ( 5.41 usr +  0.01 sys =  5.42 CPU)
        btrott:  7 wallclock secs ( 7.17 usr +  0.04 sys =  7.21 CPU)
         visnu: 11 wallclock secs (11.31 usr +  0.00 sys = 11.31 CPU)

    So, combining the two actions into one (btrott's suggestion) gives an immediate 25% improvement, and further refinement by the Perl Monks guru squad shaves off another 25% on top of that. The one-liner (and I always lean toward one-liners in Perl) performed worse than the original... How depressing! :-)

    I am pasting my entire testing code (since I have been known to do silly things in the name of benchmarking ;-)):

    #!/usr/bin/perl -w
    use strict;
    use Benchmark;
    use vars qw/@words @common/;

    @words = qw/Notice the use of an anonymous inner class Here an instance of
        an unnamed class derived from Thread is created and a run method that
        calls endApp is defined for the class All of this means that when the
        application is about to terminate the JVM starts the thread representing
        by the passed-in thread object When the thread starts the run method is
        called The run method calls endApp which closes the log file This
        flushes the output buffer To underscore the effect of the shutdown hook
        comment out the addShutdownHook lines in ShutdownDemo You'll see that
        the log file is empty when the program terminates You can register
        multiple shutdown hooks In this case each thread that represents a hook
        is started in an unspecified order and the various threads run
        simultaneously You cannot register or unregister a shutdown hook after
        the shutdown sequence has started Doing so results in an
        IllegalStateException Because shutdown hooks run as threads you must use
        thread-safe programming techniques Otherwise you risk having threads
        interfere with each other Also it's wise to design your application for
        simple and fast shutdown processing For example you might run into
        trouble if your application uses services during shutdown that are
        themselves in the processing of being shut down There are cases where
        shutdown processing does not happen even if you have registered shutdown
        hooks One example is corrupted native methods for example when you
        dereference a null pointer in C code This feature is somewhat similar to
        the atexit library function in /;
    @common = qw|a an and at the to|;

    timethese(1000, {
        Odud => q{
            my (@only, %r);
            foreach $r (@words){
                $r{lc $r} = 1;
            }
            my @uniqwords = sort keys %r;
            @seen{@common} = ();
            foreach $item (@uniqwords) {
                push(@only,$item) unless exists $seen{$item};
            }
            my @newwords = @only;
        },
        PMGurus => q{
            my %seen;
            @seen{@common} = (1) x @common;
            my @newwords = sort grep !$seen{+lc}++, @words;
        },
        btrott => q{
            my(%seen, %r);
            @seen{@common} = ();
            for my $r (@words) {
                next if exists $seen{ lc $r };
                $r{lc $r} = 1;
            }
            my @newwords = sort keys %r;
        },
        visnu => q{
            my @newwords = sort keys %{+{ map { !$common{$_ = lc $_} ? ($_, 1) : () } @words }};
        },
    });
    Russ
    Brainbench 'Most Valuable Professional' for Perl
      That is disappointing... the overhead of map creating the temporary list must be too great. Hmmm... now I'm trying to think of a way to pre-allocate so I can skew the benchmark results more in my favor.
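      One pre-allocation knob that does exist is assigning to keys() as an lvalue, which pre-extends the hash's bucket count; whether it helps at these sizes is anyone's guess. A sketch:

      my %words;
      keys(%words) = 512;   # pre-allocate buckets for roughly 512 keys
      # ... fill %words as usual; perl skips some rehashing as the hash grows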
        I hope you can. I'd love to see the one-liner win out! :-)

        Russ
        Brainbench 'Most Valuable Professional' for Perl

      I ran this myself, and visnu's code is broken in your benchmark: it checks the %common hash for the common words, and you don't have such a hash defined. Since the other benchmarked snippets include their setup code (building the common-words hash) inside the benchmark, we should do the same for visnu's, so add this at the beginning:
      my %common;
      @common{@common} = (1) x @common;
      Also, the PMGurus code doesn't work quite right, as it doesn't lower-case the words in the grep.

      After taking out the PMGurus code, and fixing up visnu's, I got:

      Benchmark: timing 1000 iterations of Odud, btrott, visnu...
          Odud:   4 secs ( 3.55 usr  0.00 sys = 3.55 cpu)
          btrott: 2 secs ( 2.27 usr  0.00 sys = 2.27 cpu)
          visnu:  4 secs ( 3.98 usr  0.00 sys = 3.98 cpu)
        I made the changes you mentioned. I had missed visnu's %common, and in my version your code and the PMGurus code ignored the common words. Also, since the PMGurus code does lower-case the words in the grep, I have included those results.

        Thanks for pointing out what I had missed.

        Here is the latest code, and results:

        @common = qw|a an and at the to|;

        timethese(1000, {
            Odud => q{
                my (@only, %r, %seen);
                foreach $r (@words){
                    $r{lc $r} = 1;
                }
                my @uniqwords = sort keys %r;
                foreach my $item (@uniqwords) {
                    push(@only,$item) unless exists $seen{$item};
                }
                my @newwords = @only;
            },
            PMGurus => q{
                my %seen;
                @seen{@common} = (1) x @common;
                my @newwords = sort grep !$seen{+lc}++, @words;
            },
            btrott_OneLiner => q{
                my %seen;
                @seen{@common} = (1) x @common;
                my @newwords = sort grep !$seen{$_=lc}++, @words;
            },
            btrott_FirstPass => q{
                my (%seen, %r);
                @seen{@common} = (1) x @common;
                for my $r (@words) {
                    next if exists $seen{ lc $r };
                    $r{lc $r} = 1;
                }
                my @newwords = sort keys %r;
            },
            visnu => q{
                my %common;
                @common{@common} = (1) x @common;
                my @newwords = sort keys %{+{ map { !$common{$_ = lc $_} ? ($_, 1) : () } @words }};
            },
        });

        Benchmark: timing 1000 iterations of Odud, PMGurus, btrott_FirstPass, btrott_OneLiner, visnu...
                    Odud:  9 wallclock secs ( 9.61 usr +  0.00 sys =  9.61 CPU)
                 PMGurus:  6 wallclock secs ( 5.40 usr +  0.00 sys =  5.40 CPU)
        btrott_FirstPass:  7 wallclock secs ( 7.07 usr +  0.00 sys =  7.07 CPU)
         btrott_OneLiner:  6 wallclock secs ( 5.73 usr +  0.00 sys =  5.73 CPU)
                   visnu: 10 wallclock secs (10.29 usr +  0.00 sys = 10.29 CPU)
        Russ
        Brainbench 'Most Valuable Professional' for Perl
RE: List processing performance
by visnu (Sexton) on Jul 11, 2000 at 23:22 UTC
    #!/usr/bin/perl -w
    %common = map { lc $_, 1 } qw/ a and the /;
    @words  = qw/ hello there Hello hello the a A /;
    @words  = keys %{+{ map { !$common{lc $_} ? (lc $_, 1) : () } @words }};
    $/ = "\n";
    print @words;
      revision 3:
      @words = sort keys %{+{ map { !$common{$_ = lc $_} ? ($_, 1) : () } @words }};
      And the $/ = "\n" didn't do what I wanted, but oh well.. I probably wanted $\ = ' ', but that seems broken right now too.. oh well again.
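      (For the record: $/ is the *input* record separator, so it has no effect on print; $\ is appended once after the whole print; and $, is what goes *between* list elements, which is probably the effect wanted here. A quick sketch:)

      #!/usr/bin/perl -w
      my @words = qw(a few words);
      {
          local $\ = "\n";   # appended once, after the whole print
          local $, = ' ';    # inserted between list elements
          print @words;      # prints "a few words" plus a newline
      }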
      whoops, forgot the sort:
      @words = sort keys %{+{ map { !$common{lc $_} ? (lc $_, 1) : () } @words }};
      Amazing - I'm going to have to take this off-line to work out what it is doing. map is currently top of my list of functions to try to understand. I've only ever used it based on examples in the Cookbook, and even with the explanation I've never been too sure of how it was working. I'll print this out with the man page and look at it on the train tomorrow. Cheers, Odud
        final revision:
        #!/usr/bin/perl -w
        %common = map { lc $_, 1 } qw/ a and the /;
        @words  = qw/ hello there Hello hello the a A /;
        @words  = sort keys %{+{ map { !$common{$_ = lc $_} ? ($_, 1) : () } @words }};
        print "@words\n";
        Haha! It's not too bad efficiency-wise. It gets rid of all the temporaries and does everything in one pass through the word list. The only thing I'm concerned about is the list-flattening that's going on, but in reality I'm sure it's nothing. A more readable version would be:
        %common = map { lc $_, 1 } qw/ the and a /;
        @words  = qw/ hello Hello hello a The there /;

        # lowercase all the words
        @words = map { lc $_ } @words;    # shorter: @words = map { lc } @words;

        # filter out common words
        @words = grep { !$common{$_} } @words;

        # make a hash
        %words = map { $_, 1 } @words;

        # get the sorted keys
        @words = sort keys %words;
        The only problem here is that we're going through the list three or so times, while the one-liner goes through it only once. The key to understanding the %words = map { $_, 1 } @words line is to remember that a hash can be built from a list in (key, value) order: map walks the @words list and returns a new list (word1, 1, word2, 1, word3, 1, ...). The %{+{ .. }} funny business was just a way to get keys to play fair.
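        To see both halves of that in isolation, here's a tiny sketch; the unary + in %{+{ ... }} simply tells the parser that the inner braces build an anonymous hash rather than a block:

        #!/usr/bin/perl -w
        use strict;

        my @words = qw(apple banana apple);

        # map returns the flat list (apple, 1, banana, 1, apple, 1);
        # the hash assignment consumes it in (key, value) pairs, and
        # duplicate keys simply overwrite -- uniqueness for free.
        my %words = map { $_ => 1 } @words;

        # Same thing in one expression: build an anonymous hash and
        # dereference it on the spot.
        my @uniq = sort keys %{ +{ map { $_ => 1 } @words } };
        print "@uniq\n";    # apple banana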
Re: List processing performance
by vijeno (Initiate) on Nov 14, 2001 at 16:13 UTC
    I did a little, very basic, test, and I find the result quite interesting (and also confusing):
    #!/usr/bin/perl
    use Benchmark;

    timethese(10000, {
        'foreachmy' => sub {
            my ($sum,$avg);
            foreach my $i (1..1000) { $sum += $i; $avg = $sum/$i; }
        },
        'foreachglobal' => sub {
            my ($sum,$avg,$i);
            foreach $i (1..1000) { $sum += $i; $avg = $sum/$i; }
        },
        'foreachnovar' => sub {
            my ($sum,$avg);
            foreach (1..1000) { $sum += $_; $avg = $sum/$_; }
        },
        'fornovar' => sub {
            my ($sum,$avg);
            for (1..1000) { $sum += $_; $avg = $sum/$_; }
        },
        'ctypeFor' => sub {
            my ($i,$sum,$avg);
            for ($i=1;$i<=1000;$i++) { $sum += $i; $avg = $sum/$i; }
        },
    });
    Benchmark: timing 10000 iterations of ctypeFor, foreachglobal, foreachmy, foreachnovar, fornovar...
         ctypeFor: 26 wallclock secs (26.27 usr + 0.00 sys = 26.27 CPU)
    foreachglobal: 22 wallclock secs (22.50 usr + 0.00 sys = 22.50 CPU)
        foreachmy: 23 wallclock secs (22.42 usr + 0.00 sys = 22.42 CPU)
     foreachnovar: 23 wallclock secs (23.33 usr + 0.00 sys = 23.33 CPU)
         fornovar: 24 wallclock secs (23.44 usr + 0.03 sys = 23.47 CPU)
    While I'm not sure what to make of the small difference between the foreachglobal and foreachmy results, the construct foreach $i (.....) with a predeclared variable definitely seems to be faster than foreach (......) ... or am I doing something terrible to $_ IN the foreach block?
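    (On the $_ question: foreach implicitly localizes its loop variable, $_ included, so the outer value is restored when the loop exits. The real thing to watch is that the loop variable aliases the list elements, so assigning to it modifies them in place. A small sketch:)

    #!/usr/bin/perl -w
    use strict;

    $_ = "outer";
    foreach (1 .. 3) {
        # $_ is implicitly localized here; the outer value is safe
    }
    print "$_\n";       # prints "outer"

    # The actual hazard: the loop variable aliases the list elements.
    my @nums = (1, 2, 3);
    foreach my $n (@nums) {
        $n *= 10;       # modifies @nums through the alias
    }
    print "@nums\n";    # prints "10 20 30"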