in reply to Re^2: removing duplicate lines
in thread removing duplicate lines

I've had a go at benchmarking the two methods now. I have assumed that the data is clean, with no need to strip trailing whitespace other than the newline, which we can chomp. The benchmark seems to show that using a hash slice and subsequently sorting the keys is still more efficient than using %seen, but YMMV depending on hardware etc.; I'm using SuSE 10.0 OSS / Perl v5.8.7 on an AMD Athlon 2500+.

use strict;
use warnings;

use Benchmark qw(cmpthese);

our @lines = <DATA>;
chomp @lines;

# Hash slice: the keys of %uniques are the distinct lines.
our $rsHashSlice = sub {
    my %uniques;
    @uniques{@lines} = ();
    my @sorted;
    push @sorted, $_ for sort keys %uniques;
    return @sorted;
};

# %seen filter: keep each line only the first time it appears.
our $rsSeen = sub {
    my %seen;
    my @sorted;
    foreach ( @lines ) {
        push @sorted, $_ unless $seen{$_}++;
    }
    return @sorted;
};

cmpthese( 100000, { HashSlice => $rsHashSlice, Seen => $rsSeen } );

__END__
black
black
black
black
black
black
black
black
black
black
black
black
black
black
black
black
blue
blue
blue
blue
blue
blue
blue
blue
blue
green
green
green
green
green
green
green
green
green
green
grey
grey
grey
grey
iolet
mauve
mauve
mauve
mauve
mauve
mauve
mauve
mauve
pink
pink
pink
pink
pink
purple
purple
purple
red
red
red
red
red
red
red
red
violet
violet
violet
violet
violet
violet
violet
violet
violet
white
white
white
white
white
white
white
yellow
yellow
yellow
yellow
yellow
yellow
yellow

produces

                Rate      Seen HashSlice
Seen         18939/s        --      -34%
HashSlice    28571/s       51%        --

I have returned a list in each case, as that seems closer in essence to the print in the OP than the array reference I would normally use.
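
For comparison, the array-reference form mentioned above might look like this; a minimal sketch only, not part of the benchmark:

our $rsHashSliceRef = sub {
    my %uniques;
    @uniques{@lines} = ();
    # Return a reference to the sorted unique lines rather than a list.
    return [ sort keys %uniques ];
};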

Cheers,

JohnGG

Re^4: removing duplicate lines
by revdiablo (Prior) on Apr 10, 2006 at 23:19 UTC

    As a preface, I would like to note that I don't think this benchmark is that meaningful. This type of operation would probably not benefit much from the kind of adversarial optimizations currently being engaged in... but for future reference, I'd like to point out that your benchmark is slightly flawed.

    Your benchmark code isn't representative of Not_a_Number's original code. You're pre-loading the @lines array and using it for both of them, but Not_a_Number's code does not do that. When I recast the code in a more representative form, the results come out differently:
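
    A minimal sketch of such a recast -- assuming the input lives in a file (hypothetically named colours.txt, one colour per line) and that each sub does its own reading -- might look like this:

    use strict;
    use warnings;

    use Benchmark qw(cmpthese);

    my $file = 'colours.txt';    # hypothetical input file, one colour per line

    # Read everything, dedupe with a hash slice, then sort the keys.
    my $rsHashSlice = sub {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        chomp( my @lines = <$fh> );
        my %uniques;
        @uniques{@lines} = ();
        return sort keys %uniques;
    };

    # Filter with %seen while reading, line by line.
    my $rsSeen = sub {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        my ( %seen, @unique );
        while ( my $line = <$fh> ) {
            chomp $line;
            push @unique, $line unless $seen{$line}++;
        }
        return @unique;
    };

    cmpthese( 10_000, { HashSlice => $rsHashSlice, Seen => $rsSeen } );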

    And here are the results:

    $ perl 542392
                Rate HashSlice      Seen
    HashSlice 7210/s        --      -24%
    Seen      9533/s       32%        --
    

    Update: Also, note that the earlier post you linked to -- Re^3: What does 'next if $hash{$elem}++;' mean? -- suffers from the same flaw.

      That's very interesting. I am sure I read somewhere that it was good practice to factor out system overheads, such as I/O, when benchmarking algorithms, which is why I laid the scripts out that way. However, that now appears to be a flawed approach to real-world problems.

      I will mend my ways :-)

      Thank you,

      JohnGG

        I read somewhere that it was good practice to factor out system overheads, such as I/O, when benchmarking algorithms

        That may be a good rule of thumb, but -- as should now be obvious -- it doesn't apply all the time. In this case, the I/O is part of what we're testing. Factoring it out alters the algorithm significantly.
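
        When the reading is part of the job, the %seen filter can work line by line and never needs the whole input in memory. A minimal sketch, assuming the input arrives on STDIN:

        use strict;
        use warnings;

        # Filter duplicates while reading, so the I/O sits inside the
        # algorithm being measured; only %seen grows, not a line array.
        my %seen;
        while ( my $line = <STDIN> ) {
            print $line unless $seen{$line}++;
        }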