Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to remove duplicate lines from a file. File format goes something like this:
Name1
Name2
Name2
Name3
Name3
Name3
Name3
Name4
Name4
I want to output the following to another file:
Name1
Name2
Name3
Name4
This is what I have, but although it removes duplicates up to Name3, it outputs Name4 multiple times. Any ideas as to what I'm doing wrong?
while ($line = <FILENAME>) {
    chomp $line; {
    $uniquelines{$line}=1;
}
foreach $k (keys %uniquelines) {
    print NEWFILE "$k\n";
}
Thanks!

Re: removing duplicate lines
by ikegami (Patriarch) on Apr 10, 2006 at 15:55 UTC
    Maybe one of the Name4's has trailing spaces?
    while ($line = <FILENAME>) {
        $line =~ s/\s+$//;    # Remove trailing spaces and newline.
        $uniquelines{$line} = 1;
    }
    foreach $k (keys %uniquelines) {
        print NEWFILE "$k\n";
    }
Re: removing duplicate lines
by japhy (Canon) on Apr 10, 2006 at 15:57 UTC
    From what you've pasted, the last "Name4" has a space after it.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: removing duplicate lines
by davidrw (Prior) on Apr 10, 2006 at 16:07 UTC
    can also use the standard *nix tool uniq:
    # assuming pre-sorted as in OP:
    uniq file1 > file2
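
    If the input weren't already pre-sorted, sort's own dedup switch would handle both steps at once (a variant not in the original reply):

    sort -u file1 > file2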
    As for your code's issue, try Data::Dumper to see what's different about the 'Name4' keys:
    use Data::Dumper;
    print Dumper \%uniquelines;
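
    If a trailing space were the culprit, the dump might show two distinct keys along these lines (hypothetical output, not from the original thread):

    $VAR1 = {
              'Name4' => 1,
              'Name4 ' => 1,
              ...
            };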
Re: removing duplicate lines
by johngg (Canon) on Apr 10, 2006 at 16:22 UTC
    If you want to keep your names in the same order, since they already seem to be sorted, you can't depend on keys giving you back the names in that order. Adding a few more names to your list gives unpredictable orders, so you need to sort the keys to preserve your original order. Doing

    use strict;
    use warnings;

    our @lines = <DATA>;
    chomp @lines;

    our %uniques;
    @uniques{@lines} = ();

    print "$_\n" for keys %uniques;

    __END__
    Name1
    Name2
    Name2
    Name3
    Name3
    Name3
    Name3
    Name4
    Name4
    Name5
    Name5
    Name5
    Name5
    Name5
    Name6
    Name6
    Name7
    Name7
    Name7
    Name8
    Name8
    Name9
    Name9
    Name9

    Gives

    Name8
    Name9
    Name1
    Name2
    Name3
    Name4
    Name5
    Name6
    Name7

    Adding a sort to the print ... line like this

    print "$_\n" for sort keys %uniques;

    corrects the problem (assuming your "names" are real names that sort lexically; synthetic names like Name10 and Name11 would sort after Name1 and before Name2).
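
    If the names really do carry numeric suffixes like Name10, a sketch of a numeric-aware sort (not in the original post, and assuming every key contains a number):

    # Sort NameNN keys by their embedded number rather than lexically.
    print "$_\n"
        for sort { ( $a =~ /(\d+)/ )[0] <=> ( $b =~ /(\d+)/ )[0] } keys %uniques;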

    Cheers,

    JohnGG

      A more canonical (and efficient) way of preserving order, whether the input data is sorted or not, would be along the lines of:

      my %seen;
      while ( <DATA> ) {
          s/\s+$//;    # Remove trailing spaces and newline.
          print "$_\n" unless $seen{$_}++;
      }

      By the way, why do you use our instead of my?

        I used the hash slice method to find unique names because it seems to be the most efficient method following benchmarking, see Re^3: What does 'next if $hash{$elem}++;' mean?. I felt that the overhead of re-sorting the keys of the %uniques hash would not outweigh the efficiencies of the hash slice method but I admit I have not benchmarked this.

        As to our versus my, I tend to reserve my for subroutines or scoped code blocks and use our in the main part of the script but I realise that in this case it would make little difference.

        Cheers,

        JohnGG

        I've had a go at benchmarking the two methods now. I have assumed that the data is clean, with nothing to strip but the trailing newline, which we can chomp. The benchmark seems to show that using a hash slice and subsequently sorting the keys is still more efficient than using %seen, but YMMV depending on hardware etc.; I'm using SuSE 10.0 OSS/Perl v5.8.7 on an AMD Athlon 2500+.
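
        The benchmark code itself did not survive in this copy of the thread; the following is a reconstruction sketch, assuming Benchmark's cmpthese, both subs returning lists, and the data from the earlier post (sub names and details are assumptions):

        use strict;
        use warnings;

        use Benchmark qw( cmpthese );

        # Same duplicated names as in the earlier post.
        my @lines = map { "Name$_\n" }
            qw( 1 2 2 3 3 3 3 4 4 5 5 5 5 5 6 6 7 7 7 8 8 9 9 9 );

        # Hash slice to find the unique names, then sort the keys.
        sub hashSlice {
            my @copy = @lines;
            chomp @copy;
            my %uniques;
            @uniques{ @copy } = ();
            return sort keys %uniques;
        }

        # The %seen approach, preserving first-seen order.
        sub seen {
            my %seen;
            my @unique;
            for ( @lines ) {
                chomp( my $line = $_ );
                push @unique, $line unless $seen{$line}++;
            }
            return @unique;
        }

        cmpthese( -5, { HashSlice => \&hashSlice, Seen => \&seen } );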

        produces

                       Rate      Seen HashSlice
        Seen        18939/s        --      -34%
        HashSlice   28571/s       51%        --

        I have returned a list in each case as that seems to be closer in essence to the print in the OP than the list reference I would normally use.

        Cheers,

        JohnGG

      Thank you all for your quick, insightful responses! I'm still not sure what the problem was, because there were no trailing spaces in the file (although my output here accidentally suggested otherwise). But I approached it using `uniq`...and it worked. Thanks!
Re: removing duplicate lines
by davido (Cardinal) on Apr 10, 2006 at 15:58 UTC

    The code, as you've posted it, doesn't even compile. You've got three '{' braces, and only two '}' braces. Please post the actual code.


    Dave

Re: removing duplicate lines
by explorer (Chaplain) on Apr 10, 2006 at 16:07 UTC
    Unix command: uniq
Re: removing duplicate lines
by CountZero (Bishop) on Apr 10, 2006 at 20:36 UTC
    Or if you do not want to re-invent the wheel:

    use strict;
    use List::MoreUtils qw( uniq );

    my @uniques = uniq( <DATA> );
    print @uniques;

    __DATA__
    Name1
    Name2
    Name2
    Name3
    Name3
    Name3
    Name3
    Name4
    Name4

    Output:

    Name1
    Name2
    Name3
    Name4

    The initial list does not have to be sorted (unlike with the Unix uniq) and the order is preserved.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: removing duplicate lines
by codeacrobat (Chaplain) on Apr 10, 2006 at 22:50 UTC
    Why make things complicated? A little magic with the perl switches and you are there.
    perl -ni -e 'print if $_ ne $old; $old = $_' duplicates
    Note that, like uniq, this only collapses adjacent duplicates, which is fine for pre-sorted data like the OP's; -i rewrites the file in place.
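
    If the duplicates weren't guaranteed to be adjacent, a %seen variant of the same one-liner (not from the original reply) works regardless of order:

    perl -ni -e 'print unless $seen{$_}++' duplicates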