in reply to Re: removing duplicate lines
in thread removing duplicate lines

A more canonical (and efficient) way of preserving order, whether the input data is sorted or not, would be along the lines of:

    my %seen;
    while ( <DATA> ) {
        s/\s+$//;    # Remove trailing spaces and newline.
        print "$_\n" unless $seen{$_}++;
    }

By the way, why do you use our instead of my?

Re^3: removing duplicate lines
by johngg (Canon) on Apr 10, 2006 at 22:14 UTC
    I used the hash slice method to find unique names because it seems to be the most efficient method following benchmarking, see Re^3: What does 'next if $hash{$elem}++;' mean?. I felt that the overhead of re-sorting the keys of the %uniques hash would not outweigh the efficiencies of the hash slice method but I admit I have not benchmarked this.
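    For reference, a minimal sketch of the hash slice method (the data and variable names here are illustrative, not the original code):

        my @lines = ( 'tom', 'dick', 'harry', 'dick', 'tom' );

        my %uniques;
        @uniques{ @lines } = ();            # one slice assignment marks every name as seen
        my @unique = sort keys %uniques;    # hash keys come back unordered, hence the re-sort

    The slice assignment touches the hash once per batch rather than once per test, which is where the measured efficiency comes from; the price is the final sort.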

    As to our versus my, I tend to reserve my for subroutines or scoped code blocks and use our in the main part of the script but I realise that in this case it would make little difference.
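    A minimal illustration of the difference under strict (the names here are made up):

        use strict;
        use warnings;

        our $count = 0;    # package variable $main::count, visible anywhere in the file
        my  $total = 0;    # lexical, visible only from here to the end of this scope

        sub bump {
            $count++;          # the our declaration makes the package variable usable here
            my $inner = 42;    # lexical, gone when the sub returns
            return $inner;
        }

        bump();
        print "$count\n";    # prints 1

    In a short script the two behave much the same, as noted above; the distinction matters once subroutines or other packages come into play.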

    Cheers,

    JohnGG

Re^3: removing duplicate lines
by johngg (Canon) on Apr 10, 2006 at 22:53 UTC
    I've had a go at benchmarking the two methods now. I have assumed that the data is clean with no need to strip trailing spaces other than a newline which we can chomp. The benchmark seems to show that using a hash slice and subsequently sorting the keys is still more efficient than using %seen but YMMV depending on hardware etc.; I'm using Suse 10.0 OSS/Perl v5.8.7 on an AMD Athlon 2500+.
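    A sketch of such a benchmark, using the core Benchmark module with the data pre-loaded and chomped outside the timed subs (the test data here is made up, not the original script):

        use strict;
        use warnings;
        use Benchmark qw( cmpthese );

        # 100 lines containing 50 unique names, chomped once up front.
        my @lines = map { "name$_\n" } ( 1 .. 50, 1 .. 50 );
        chomp( my @clean = @lines );

        cmpthese( -1, {
            Seen => sub {
                my %seen;
                return grep { ! $seen{$_}++ } @clean;
            },
            HashSlice => sub {
                my %uniques;
                @uniques{ @clean } = ();
                return sort keys %uniques;
            },
        } );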

    Running the benchmark produces:

                   Rate      Seen HashSlice
    Seen      18939/s        --      -34%
    HashSlice 28571/s       51%        --

    I have returned a list in each case as that seems to be closer in essence to the print in the OP than the list reference I would normally use.

    Cheers,

    JohnGG

      As a preface, I would like to note that I don't think this benchmark is that meaningful. This type of operation would probably not benefit much from the kind of adversarial optimizations currently being engaged in... but for future reference, I'd like to point out that your benchmark is slightly flawed.

      Your benchmark code isn't representative of Not_a_Number's original code. You're pre-loading the @lines array and using it for both of them, but Not_a_Number's code does not do that. When I recast the code in a more representative form, the results come out differently:
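      A reconstruction of what the recast benchmark might look like, with the read and chomp inside each timed sub (the in-memory filehandle stands in for DATA; this is a sketch, not the actual script):

          use strict;
          use warnings;
          use Benchmark qw( cmpthese );

          # Each timed sub now opens and reads the data itself, as the
          # original code would have done with a real filehandle.
          my $data = join q{}, map { "name$_\n" } ( 1 .. 50, 1 .. 50 );

          cmpthese( -1, {
              Seen => sub {
                  open my $fh, '<', \$data or die $!;
                  my ( %seen, @unique );
                  while ( <$fh> ) {
                      chomp;
                      push @unique, $_ unless $seen{$_}++;
                  }
                  return @unique;
              },
              HashSlice => sub {
                  open my $fh, '<', \$data or die $!;
                  chomp( my @lines = <$fh> );
                  my %uniques;
                  @uniques{ @lines } = ();
                  return sort keys %uniques;
              },
          } );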

      And here are the results:

      $ perl 542392
                  Rate HashSlice      Seen
      HashSlice 7210/s        --      -24%
      Seen      9533/s       32%        --
      

      Update: Also, note your earlier post that you linked to -- Re^3: What does 'next if $hash{$elem}++;' mean? -- suffers from the same flaw.

        That's very interesting. I am sure I read somewhere that it is good practice to factor out system overheads such as I/O when benchmarking algorithms, which is why I laid out the scripts that way. However, that now appears to be a flawed approach for real-world problems.

        I will mend my ways :-)

        Thank you,

        JohnGG