in reply to Re^3: removing duplicate lines
in thread removing duplicate lines

As a preface, I would like to note that I don't think this benchmark is that meaningful. This type of operation would probably not benefit much from the kind of adversarial optimizations currently being engaged in... but for future reference, I'd like to point out that your benchmark is slightly flawed.

Your benchmark code isn't representative of Not_a_Number's original code. You're pre-loading the @lines array once and using it for both subs, but Not_a_Number's code does not do that. When I recast the code in a more representative form, the results come out differently:

use strict;
use warnings;

use Benchmark qw(cmpthese);

my $startat = tell DATA;

our $rsHashSlice = sub {
    seek DATA, $startat, 0;
    our @lines = <DATA>;
    chomp @lines;
    my %uniques;
    @uniques{@lines} = ();
    my @sorted;
    push @sorted, $_ for sort keys %uniques;
    return @sorted;
};

our $rsSeen = sub {
    seek DATA, $startat, 0;
    my %seen;
    my @sorted;
    while (<DATA>) {
        chomp;
        push @sorted, $_ unless $seen{$_}++;
    }
    return @sorted;
};

# For double checking results
#print $rsHashSlice->(), "\n" for 1 .. 2;
#print $rsSeen->(), "\n" for 1 .. 2;

cmpthese(100000, { HashSlice => $rsHashSlice, Seen => $rsSeen });

__END__
black
black
black
black
black
black
black
black
black
black
black
black
black
black
black
black
blue
blue
blue
blue
blue
blue
blue
blue
blue
green
green
green
green
green
green
green
green
green
green
grey
grey
grey
grey
iolet
mauve
mauve
mauve
mauve
mauve
mauve
mauve
mauve
pink
pink
pink
pink
pink
purple
purple
purple
red
red
red
red
red
red
red
red
violet
violet
violet
violet
violet
violet
violet
violet
violet
white
white
white
white
white
white
white
yellow
yellow
yellow
yellow
yellow
yellow
yellow

And here are the results:

$ perl 542392
            Rate HashSlice      Seen
HashSlice 7210/s        --      -24%
Seen      9533/s       32%        --

Update: Also, note that the earlier post you linked to -- Re^3: What does 'next if $hash{$elem}++;' mean? -- suffers from the same flaw.

Re^5: removing duplicate lines
by johngg (Canon) on Apr 11, 2006 at 08:55 UTC
    That's very interesting. I am sure I read somewhere that it was good practice to factor out system overheads such as I/O when benchmarking algorithms, which is why I laid out the scripts that way. However, that now appears to be a flawed approach for real-world problems.

    I will mend my ways :-)

    Thank you,

    JohnGG

      I read somewhere that it was good practice to factor out system overheads such as I/O when benchmarking algorithms

      That may be a good rule of thumb, but -- as should now be obvious -- it doesn't apply all the time. In this case, the I/O is part of what we're testing. Factoring it out alters the algorithm significantly.

        I will definitely bear this in mind in future. It had never occurred to me that you could do seeks on the DATA filehandle; that's really neat.

        Thank you for the instruction,

        JohnGG