in reply to removing duplicate lines

If you want to keep your names in the same order, which already seems to be sorted, then you can't rely on keys giving you back the names in that order. Adding a few more names to your list gives unpredictable orderings, so you need to sort the keys to preserve your original order. Doing

use strict;
use warnings;

our @lines = <DATA>;
chomp @lines;

our %uniques;
@uniques{@lines} = ();

print "$_\n" for keys %uniques;

__END__
Name1
Name2
Name2
Name3
Name3
Name3
Name3
Name4
Name4
Name5
Name5
Name5
Name5
Name5
Name6
Name6
Name7
Name7
Name7
Name8
Name8
Name9
Name9
Name9

Gives

Name8
Name9
Name1
Name2
Name3
Name4
Name5
Name6
Name7

Adding a sort to the print ... line like this

print "$_\n" for sort keys %uniques;

corrects the problem (assuming your "names" are real names that sort lexically; if you go on to Name10, Name11 etc., they will sort after Name1 and before Name2).
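
As a toy illustration of that caveat (not from the original post), the default sort compares strings, so Name10 lands between Name1 and Name2:

    use strict;
    use warnings;

    my @names = qw{ Name2 Name10 Name1 };

    # Default sort is a string comparison, not a numeric one.
    print join( q{ }, sort @names ), "\n";    # prints: Name1 Name10 Name2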

Cheers,

JohnGG

Re^2: removing duplicate lines
by Not_a_Number (Prior) on Apr 10, 2006 at 19:34 UTC

    A more canonical (and efficient) way of preserving order, whether the input data is sorted or not, would be along the lines of:

    my %seen;

    while ( <DATA> ) {
        s/\s+$//;    # Remove trailing spaces and newline.
        print "$_\n" unless $seen{$_}++;
    }

    By the way, why do you use our instead of my?

      I used the hash slice method to find unique names because it seems to be the most efficient method following benchmarking, see Re^3: What does 'next if $hash{$elem}++;' mean?. I felt that the overhead of re-sorting the keys of the %uniques hash would not outweigh the efficiency of the hash slice method, but I admit I have not benchmarked this.

      As to our versus my, I tend to reserve my for subroutines or scoped code blocks and use our in the main part of the script, but I realise that in this case it would make little difference.
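
      To illustrate the distinction with a toy example (not from the script above): our declares an alias to a package variable, while my creates a lexical confined to its enclosing scope.

          use strict;
          use warnings;

          our $greeting = q{Hello};    # package variable, also reachable as $main::greeting
          my  $name     = q{world};    # lexical, visible only in the enclosing scope

          print "$main::greeting, $name\n";    # prints: Hello, world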

      Cheers,

      JohnGG

      I've had a go at benchmarking the two methods now. I have assumed that the data is clean, with no need to strip trailing spaces other than a newline, which we can chomp. The benchmark seems to show that using a hash slice and subsequently sorting the keys is still more efficient than using %seen, but YMMV depending on hardware etc.; I'm using Suse 10.0 OSS/Perl v5.8.7 on an AMD Athlon 2500+.
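
      In rough outline, the benchmark amounts to something like this sketch (the sub names and generated test data here are illustrative; both subs are assumed to work on a pre-loaded, chomped @lines array and to return the list of unique names):

          use strict;
          use warnings;

          use Benchmark qw{ cmpthese };

          # Illustrative test data: a sorted list of names with duplicates.
          my @lines = map { ( qq{Name$_\n} ) x 3 } 1 .. 9;
          chomp @lines;

          sub hashSlice
          {
              my %uniques;
              @uniques{ @lines } = ();
              return sort keys %uniques;
          }

          sub seen
          {
              my %seen;
              return grep { ! $seen{ $_ }++ } @lines;
          }

          cmpthese( -5, { HashSlice => \&hashSlice, Seen => \&seen } );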

      produces

                     Rate      Seen HashSlice
      Seen        18939/s        --      -34%
      HashSlice   28571/s       51%        --

      I have returned a list in each case as that seems to be closer in essence to the print in the OP than the list reference I would normally use.

      Cheers,

      JohnGG

        As a preface, I would like to note that I don't think this benchmark is that meaningful. This type of operation would probably not benefit much from the kind of adversarial optimizations currently being engaged in... but for future reference, I'd like to point out that your benchmark is slightly flawed.

        Your benchmark code isn't representative of Not_a_Number's original code. You're pre-loading the @lines array and using it for both of them, but Not_a_Number's code does not do that. When I recast the code in a more representative form, the results come out differently:
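
        Sketched roughly (the file name and test data here are illustrative), the recast has each sub open and clean its own input rather than sharing a pre-loaded array:

            use strict;
            use warnings;

            use Benchmark qw{ cmpthese };

            # Illustrative test file: a sorted list of names with duplicates.
            my $file = q{names.txt};
            open my $out, q{>}, $file or die qq{open: $!\n};
            print {$out} map { ( qq{Name$_\n} ) x 3 } 1 .. 9;
            close $out or die qq{close: $!\n};

            sub hashSlice
            {
                open my $in, q{<}, $file or die qq{open: $!\n};
                my @lines = <$in>;
                close $in;
                chomp @lines;

                my %uniques;
                @uniques{ @lines } = ();
                return sort keys %uniques;
            }

            sub seen
            {
                open my $in, q{<}, $file or die qq{open: $!\n};
                my ( %seen, @uniques );

                while ( <$in> )
                {
                    s/\s+$//;    # strip trailing whitespace, including the newline
                    push @uniques, $_ unless $seen{ $_ }++;
                }

                close $in;
                return @uniques;
            }

            cmpthese( -5, { HashSlice => \&hashSlice, Seen => \&seen } );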

        And here are the results:

        $ perl 542392
                    Rate HashSlice      Seen
        HashSlice 7210/s        --      -24%
        Seen      9533/s       32%        --
        

        Update: Also, note that the earlier post you linked to -- Re^3: What does 'next if $hash{$elem}++;' mean? -- suffers from the same flaw.

Re^2: removing duplicate lines
by Anonymous Monk on Apr 10, 2006 at 19:16 UTC
    Thank you all for your quick, insightful responses! I'm still not sure what the problem was, because there were no trailing spaces in the file (although my output here accidentally suggested there were). But I approached it using `uniq`... and it worked. Thanks!