Odud has asked for the wisdom of the Perl Monks concerning the following question:

Many of you are experts in the Perl performance area, and I would appreciate your comments on the following code snippets - they do the job, but I have no idea whether they are particularly efficient. What I am trying to do is produce a sorted list of words from which duplicates and common words (a, at, the, if, etc.) have been removed.

The size of the list is typically about 300 items but could, in theory, be much larger.

The first thing I do is produce a sorted unique list using some code that I saw somewhere (probably the Perl Cookbook)
# @words is the list of words (composed of letters and numbers only, no punctuation)
my($self) = @_;
foreach $r (@words) {
    $r{lc $r} = 1;
}
@words = sort keys %r;
The next thing I do is check whether any of the words are also in the common list; if so, they don't get carried forward. I'm suspicious of this code because @only is a temporary variable, which often suggests an improvement is possible.
# %seen is a hash used as a set of common words (built via a hash slice)
@common = qw(a and at);    # bigger in real life
@seen{@common} = ();
foreach $item (@words) {
    push(@only, $item) unless exists $seen{$item};
}
@words = @only;
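In case the slice line looks odd in isolation: @seen{@common} = () is a hash slice on %seen, not a pseudo-hash. It creates one key per common word, each with the value undef, which is all an exists test needs. A minimal standalone sketch of the idiom (same variable names as above):

#!/usr/bin/perl -w
use strict;

# Build a set from a list via a hash slice: each common word becomes
# a key with the value undef -- enough for an exists() membership test.
my @common = qw(a and at);
my %seen;
@seen{@common} = ();

print exists $seen{and}  ? "common\n" : "rare\n";   # common
print exists $seen{perl} ? "common\n" : "rare\n";   # rare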
I'd appreciate your comments and suggestions very much as this is an area in which I'm happy with the basics but haven't much knowledge of the finer points.

Re: List processing performance
by btrott (Parson) on Jul 11, 2000 at 19:56 UTC
    Well, in general you're doing pretty well, I think. You're using hashes for uniqueness lookups, which is always a good first step. :)

    One thing that came to my mind is: why not combine the two steps? You'd end up with something like this:

    my($self) = @_;
    my(%seen, %r);
    my @common = qw(a and at);
    @seen{@common} = ();
    for my $r (@words) {
        next if exists $seen{ lc $r };
        $r{lc $r} = 1;
    }
    my @words = sort keys %r;
      playing golf,
      my @words  = ( ... );
      my @common = qw|a and at|;
      my %seen;
      @seen{@common} = (1) x @common;
      @words = grep { $_ = lc; ! $seen{$_}++ } sort @words;
      :-) Autark
        Oh, is *that* what we're doing then. :) In which case replace your last line with this:
        @words = sort grep !$seen{$_ = lc}++, @words
        A couple of characters shorter, and faster too, because the sort runs after you've filtered out duplicates and common words.
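        If you want to see that effect in numbers, a minimal Benchmark sketch (word list and counts invented purely for illustration) would look something like this; filtering first means sort sees far fewer elements:

        #!/usr/bin/perl -w
        use strict;
        use Benchmark;

        # Invented data: 5000 words, heavy duplication, so the filtered
        # list is roughly a tenth the size of the raw one.
        my @words  = map { "word" . ($_ % 500) } 1 .. 5000;
        my @common = qw(a and at);

        timethese(500, {
            sort_first => sub {
                my %seen;
                @seen{@common} = (1) x @common;
                my @out = grep { !$seen{lc $_}++ } sort @words;
            },
            filter_first => sub {
                my %seen;
                @seen{@common} = (1) x @common;
                my @out = sort grep { !$seen{lc $_}++ } @words;
            },
        });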
      A good point about combining the two parts - the real code is actually two subroutines, but only because I developed it in stages; there's no compelling reason for it. Your method also eliminates the temporary list that was bugging me. Thanks muchly.
RE: List processing (benchmark results)
by Russ (Deacon) on Jul 12, 2000 at 01:21 UTC
    Update: btrott found that I had left out a piece of the original code. New code and results are shown here.

    Being interminably curious, I benchmarked a few of the suggested snippets of code.

    Benchmark: timing 1000 iterations of Odud, PMGurus, btrott, visnu...
          Odud: 10 wallclock secs ( 9.61 usr +  0.01 sys =  9.62 CPU)
       PMGurus:  6 wallclock secs ( 5.41 usr +  0.01 sys =  5.42 CPU)
        btrott:  7 wallclock secs ( 7.17 usr +  0.04 sys =  7.21 CPU)
         visnu: 11 wallclock secs (11.31 usr +  0.00 sys = 11.31 CPU)

    So, combining the two actions into one (btrott's suggestion) gives an immediate 25% improvement, and further refinement by the Perl Monks guru squad shaves off another 25% on top of that. The one-liner (and I always lean toward one-liners in Perl) performed worse than the original... How depressing! :-)

    I am pasting my entire testing code (since I have been known to do silly things in the name of benchmarking ;-)):

    #!/usr/bin/perl -w
    use strict;
    use Benchmark;
    use vars qw/@words @common/;

    @words = qw/Notice the use of an anonymous inner class Here an instance of
        an unnamed class derived from Thread is created and a run method that
        calls endApp is defined for the class All of this means that when the
        application is about to terminate the JVM starts the thread representing
        by the passed-in thread object When the thread starts the run method is
        called The run method calls endApp which closes the log file This
        flushes the output buffer To underscore the effect of the shutdown hook
        comment out the addShutdownHook lines in ShutdownDemo You'll see that
        the log file is empty when the program terminates You can register
        multiple shutdown hooks In this case each thread that represents a hook
        is started in an unspecified order and the various threads run
        simultaneously You cannot register or unregister a shutdown hook after
        the shutdown sequence has started Doing so results in an
        IllegalStateException Because shutdown hooks run as threads you must use
        thread-safe programming techniques Otherwise you risk having threads
        interfere with each other Also it's wise to design your application for
        simple and fast shutdown processing For example you might run into
        trouble if your application uses services during shutdown that are
        themselves in the processing of being shut down There are cases where
        shutdown processing does not happen even if you have registered shutdown
        hooks One example is corrupted native methods for example when you
        dereference a null pointer in C code This feature is somewhat similar to
        the atexit library function in /;
    @common = qw|a an and at the to|;

    timethese(1000, {
        Odud => q{
            my (@only, %r);
            foreach $r (@words){
                $r{lc $r} = 1;
            }
            my @uniqwords = sort keys %r;
            @seen{@common} = ();
            foreach $item (@uniqwords) {
                push(@only,$item) unless exists $seen{$item};
            }
            my @newwords = @only;
        },
        PMGurus => q{
            my %seen;
            @seen{@common} = (1) x @common;
            my @newwords = sort grep !$seen{+lc}++, @words;
        },
        btrott => q{
            my(%seen, %r);
            @seen{@common} = ();
            for my $r (@words) {
                next if exists $seen{ lc $r };
                $r{lc $r} = 1;
            }
            my @newwords = sort keys %r;
        },
        visnu => q{
            my @newwords = sort keys %{+{ map { !$common{$_ = lc $_} ? ($_, 1) : () } @words }};
        },
    });
    Russ
    Brainbench 'Most Valuable Professional' for Perl
      That is disappointing... the overhead of map creating the temporary list must be too great. Hmmm... now I'm trying to think of a way to pre-allocate so I can skew the benchmark results more in my favor.
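      One pre-allocation knob that does exist is assigning to keys() as an lvalue, which pre-extends the hash's bucket count; whether it helps at these sizes is anyone's guess. A sketch:

      my %words;
      keys(%words) = 512;   # pre-allocate buckets for roughly 512 keys
      # ... fill %words as usual; perl skips some rehashing as the hash grows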
        I hope you can. I'd love to see the one-liner win out! :-)

        Russ
        Brainbench 'Most Valuable Professional' for Perl

      I ran this myself, and visnu's code is broken in your benchmark: it checks the %common hash for the common words, and you don't have such a hash defined. Since the other benchmarked snippets include their setup code (building the common-words hash) inside the benchmark, we should do the same for visnu's, so add this at the beginning:
      my %common;
      @common{@common} = (1) x @common;
      Also, the PMGurus code doesn't work quite right, as it doesn't lower-case the words in the grep.

      After taking out the PMGurus code, and fixing up visnu's, I got:

      Benchmark: timing 1000 iterations of Odud, btrott, visnu...
          Odud:   4 secs ( 3.55 usr  0.00 sys = 3.55 cpu)
          btrott: 2 secs ( 2.27 usr  0.00 sys = 2.27 cpu)
          visnu:  4 secs ( 3.98 usr  0.00 sys = 3.98 cpu)
        I made the changes you mentioned. I had missed visnu's %common, and in my version your code and the PMGurus code ignored the common words. Also, since the PMGurus code does lower-case the words in the grep, I have included those results.

        Thanks for pointing out what I had missed.

        Here is the latest code, and results:

        @common = qw|a an and at the to|;

        timethese(1000, {
            Odud => q{
                my (@only, %r, %seen);
                foreach $r (@words){
                    $r{lc $r} = 1;
                }
                my @uniqwords = sort keys %r;
                foreach my $item (@uniqwords) {
                    push(@only,$item) unless exists $seen{$item};
                }
                my @newwords = @only;
            },
            PMGurus => q{
                my %seen;
                @seen{@common} = (1) x @common;
                my @newwords = sort grep !$seen{+lc}++, @words;
            },
            btrott_OneLiner => q{
                my %seen;
                @seen{@common} = (1) x @common;
                my @newwords = sort grep !$seen{$_=lc}++, @words;
            },
            btrott_FirstPass => q{
                my (%seen, %r);
                @seen{@common} = (1) x @common;
                for my $r (@words) {
                    next if exists $seen{ lc $r };
                    $r{lc $r} = 1;
                }
                my @newwords = sort keys %r;
            },
            visnu => q{
                my %common;
                @common{@common} = (1) x @common;
                my @newwords = sort keys %{+{ map { !$common{$_ = lc $_} ? ($_, 1) : () } @words }};
            },
        });

        Benchmark: timing 1000 iterations of Odud, PMGurus, btrott_FirstPass, btrott_OneLiner, visnu...
                    Odud:  9 wallclock secs ( 9.61 usr +  0.00 sys =  9.61 CPU)
                 PMGurus:  6 wallclock secs ( 5.40 usr +  0.00 sys =  5.40 CPU)
        btrott_FirstPass:  7 wallclock secs ( 7.07 usr +  0.00 sys =  7.07 CPU)
         btrott_OneLiner:  6 wallclock secs ( 5.73 usr +  0.00 sys =  5.73 CPU)
                   visnu: 10 wallclock secs (10.29 usr +  0.00 sys = 10.29 CPU)
        Russ
        Brainbench 'Most Valuable Professional' for Perl
RE: List processing performance
by visnu (Sexton) on Jul 11, 2000 at 23:22 UTC
    #!/usr/bin/perl -w
    %common = map { lc $_, 1 } qw/ a and the /;
    @words  = qw/ hello there Hello hello the a A /;
    @words  = keys %{+{ map { !$common{lc $_} ? (lc $_, 1) : () } @words }};
    $/ = "\n";
    print @words;
      revision 3:
      @words = sort keys %{+{ map { !$common{$_ = lc $_} ? ($_, 1) : () } @words }};
      And the $/ = "\n" didn't do what I wanted, but oh well.. I probably wanted $\ = ' ', but that seems broken right now too.. oh well again.
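      (For the record: $/ is the *input* record separator, so it has no effect on print; $\ is appended once after the whole print; and $, is what goes *between* list elements, which is probably the effect wanted here. A quick sketch:)

      #!/usr/bin/perl -w
      my @words = qw(a few words);
      {
          local $\ = "\n";   # appended once, after the whole print
          local $, = ' ';    # inserted between list elements
          print @words;      # prints "a few words" plus a newline
      }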
      whoops, forgot the sort:
      @words = sort keys %{+{ map { !$common{lc $_} ? (lc $_, 1) : () } @words }};
      Amazing - I'm going to have to take this off-line to work out what it is doing. map is currently top of my list of functions to try to understand. I've only ever used it based on examples in the Cookbook, and even with the explanation I've never been too sure of how it was working. I'll print this out with the man page and look at it on the train tomorrow. Cheers, Odud
        final revision:
        #!/usr/bin/perl -w
        %common = map { lc $_, 1 } qw/ a and the /;
        @words  = qw/ hello there Hello hello the a A /;
        @words  = sort keys %{+{ map { !$common{$_ = lc $_} ? ($_, 1) : () } @words }};
        print "@words\n";
        Haha! It's not too bad efficiency-wise. It gets rid of all the temporaries and does everything in one pass through the word list. The only thing I'm concerned about is the list-flattening that's going on, but in reality I'm sure it's nothing. A more readable version would be:
        %common = map { lc $_, 1 } qw/ the and a /;
        @words  = qw/ hello Hello hello a The there /;

        # lowercase all the words
        @words = map { lc $_ } @words;    # shorter: @words = map { lc } @words;

        # filter out common words
        @words = grep { !$common{$_} } @words;

        # make a hash
        %words = map { $_, 1 } @words;

        # get the sorted keys
        @words = sort keys %words;
        The only problem here is that we're going through the list three or so times, while the one-liner goes through it only once. The key to understanding the %words = map { $_, 1 } @words line is to remember that a hash can be built from a list in (key, value) order: map walks the @words list and returns a new list (word1, 1, word2, 1, word3, 1, ...). The %{+{ .. }} funny business was just a way to get keys to play fair.
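        To see both halves of that in isolation, here's a tiny sketch; the unary + in %{+{ ... }} simply tells the parser that the inner braces build an anonymous hash rather than a block:

        #!/usr/bin/perl -w
        use strict;

        my @words = qw(apple banana apple);

        # map returns the flat list (apple, 1, banana, 1, apple, 1);
        # the hash assignment consumes it in (key, value) pairs, and
        # duplicate keys simply overwrite -- uniqueness for free.
        my %words = map { $_ => 1 } @words;

        # Same thing in one expression: build an anonymous hash and
        # dereference it on the spot.
        my @uniq = sort keys %{ +{ map { $_ => 1 } @words } };
        print "@uniq\n";    # apple banana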
Re: List processing performance
by vijeno (Initiate) on Nov 14, 2001 at 16:13 UTC
    I did a little, very basic, test, and I find the result quite interesting (and also confusing):
    #!/usr/bin/perl
    use Benchmark;

    timethese(10000, {
        'foreachmy' => sub {
            my ($sum,$avg);
            foreach my $i (1..1000) { $sum += $i; $avg = $sum/$i; }
        },
        'foreachglobal' => sub {
            my ($sum,$avg,$i);
            foreach $i (1..1000) { $sum += $i; $avg = $sum/$i; }
        },
        'foreachnovar' => sub {
            my ($sum,$avg);
            foreach (1..1000) { $sum += $_; $avg = $sum/$_; }
        },
        'fornovar' => sub {
            my ($sum,$avg);
            for (1..1000) { $sum += $_; $avg = $sum/$_; }
        },
        'ctypeFor' => sub {
            my ($i,$sum,$avg);
            for ($i=1;$i<=1000;$i++) { $sum += $i; $avg = $sum/$i; }
        },
    });
    Benchmark: timing 10000 iterations of ctypeFor, foreachglobal, foreachmy, foreachnovar, fornovar...
         ctypeFor: 26 wallclock secs (26.27 usr + 0.00 sys = 26.27 CPU)
    foreachglobal: 22 wallclock secs (22.50 usr + 0.00 sys = 22.50 CPU)
        foreachmy: 23 wallclock secs (22.42 usr + 0.00 sys = 22.42 CPU)
     foreachnovar: 23 wallclock secs (23.33 usr + 0.00 sys = 23.33 CPU)
         fornovar: 24 wallclock secs (23.44 usr + 0.03 sys = 23.47 CPU)
    While I'm not sure what to make of the small difference between the foreachglobal and foreachmy results, the construct foreach $i (.....) with a predeclared variable definitely seems to be faster than foreach (......) ... or am I doing something terrible to $_ IN the foreach block?
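    (On the $_ question: foreach implicitly localizes its loop variable, $_ included, so the outer value is restored when the loop exits. The real thing to watch is that the loop variable aliases the list elements, so assigning to it modifies them in place. A small sketch:)

    #!/usr/bin/perl -w
    use strict;

    $_ = "outer";
    foreach (1 .. 3) {
        # $_ is implicitly localized here; the outer value is safe
    }
    print "$_\n";       # prints "outer"

    # The actual hazard: the loop variable aliases the list elements.
    my @nums = (1, 2, 3);
    foreach my $n (@nums) {
        $n *= 10;       # modifies @nums through the alias
    }
    print "@nums\n";    # prints "10 20 30"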