Re: removing duplicate lines
by ikegami (Patriarch) on Apr 10, 2006 at 15:55 UTC
Maybe one of the Name4's has trailing spaces?
while (my $line = <FILENAME>) {
    $line =~ s/\s+$//;    # Remove trailing spaces and newline.
    $uniquelines{$line} = 1;    # duplicate lines collapse onto one key
}
foreach my $k (keys %uniquelines) {
    print NEWFILE "$k\n";
}
Re: removing duplicate lines
by japhy (Canon) on Apr 10, 2006 at 15:57 UTC
From what you've pasted, the last "Name 4" has a space after it.
Re: removing duplicate lines
by davidrw (Prior) on Apr 10, 2006 at 16:07 UTC
You can also use the standard *nix tool uniq:
# assuming pre-sorted input, as in the OP:
uniq file1 > file2
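If the input weren't already sorted, sort can collapse the duplicates itself (a side note added here, not part of the original reply):
# sorts and de-duplicates in one step:
sort -u file1 > file2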
As for your code's issue, try Data::Dumper to see what's different about the 'Name4' keys:
use Data::Dumper;
print Dumper \%uniquelines;
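With a trailing space on one of the lines, the dump would show two distinct keys, something like this illustrative (not actual) output:
$VAR1 = {
          'Name4' => 1,
          'Name4 ' => 1,
          # ... other names ...
        };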
Re: removing duplicate lines
by johngg (Canon) on Apr 10, 2006 at 16:22 UTC
If you want to keep your names in the same order (they already seem to be sorted), you can't depend on keys giving you back the names in that order: hash keys come back in an unpredictable order, as adding a few more names to your list shows. You therefore need to sort them to restore your original order. Doing
use strict;
use warnings;
our @lines = <DATA>;
chomp @lines;
our %uniques;
@uniques{@lines} = ();    # hash slice: each line becomes a key, so duplicates collapse
print "$_\n" for keys %uniques;
__END__
Name1
Name2
Name2
Name3
Name3
Name3
Name3
Name4
Name4
Name5
Name5
Name5
Name5
Name5
Name6
Name6
Name7
Name7
Name7
Name8
Name8
Name9
Name9
Name9
Gives
Name8
Name9
Name1
Name2
Name3
Name4
Name5
Name6
Name7
Adding a sort to the print ... line like this
print "$_\n" for sort keys %uniques;
corrects the problem (assuming your "names" are real names that sort lexically; if they go on to Name10, Name11 etc., those sort after Name1 and before Name2, so a numeric sort would be needed, as sketched below). Cheers, JohnGG
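A minimal sketch of such a numeric sort on the trailing digits (an illustration added here, not from the original post, and assuming every key ends in digits as in NameN):
print "$_\n" for sort { ($a =~ /(\d+)$/)[0] <=> ($b =~ /(\d+)$/)[0] } keys %uniques;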
my %seen;
while ( <DATA> ) {
    s/\s+$//;    # Remove trailing spaces and newline.
    print "$_\n" unless $seen{$_}++;    # print only the first occurrence
}
By the way, why do you use our instead of my?
I've had a go at benchmarking the two methods now. I have assumed that the data is clean, with nothing to strip beyond the trailing newline, which we can chomp. The benchmark seems to show that using a hash slice and subsequently sorting the keys is still more efficient than using %seen, but YMMV depending on hardware etc.; I'm using SUSE 10.0 OSS / Perl v5.8.7 on an AMD Athlon 2500+. On my machine the comparison produces
             Rate      Seen HashSlice
Seen      18939/s        --      -34%
HashSlice 28571/s       51%        --
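The benchmark script itself isn't shown in this extract; a minimal sketch of such a comparison, using the core Benchmark module's cmpthese with made-up sample data (the sub names are assumptions chosen to match the output above):
use strict;
use warnings;
use Benchmark qw( cmpthese );

my @lines = map { ( "Name$_" ) x 3 } 1 .. 9;    # sample data containing duplicates

cmpthese( -5, {
    Seen => sub {
        my %seen;
        return grep { !$seen{$_}++ } @lines;    # first occurrence only, input order
    },
    HashSlice => sub {
        my %uniques;
        @uniques{@lines} = ();                  # duplicates collapse onto keys
        return sort keys %uniques;              # restore sorted order
    },
} );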
I have returned a list in each case as that seems to be closer in essence to the print in the OP than the list reference I would normally use. Cheers, JohnGG
Thank you all for your quick, insightful responses! I'm still not sure what the problem was, because there were no trailing spaces in the file (although they did show up in my output here, accidentally). But I approached it using `uniq`... and it worked. Thanks!
Re: removing duplicate lines
by davido (Cardinal) on Apr 10, 2006 at 15:58 UTC
The code, as you've posted it, doesn't even compile. You've got three '{' braces, and only two '}' braces. Please post the actual code.
Re: removing duplicate lines
by explorer (Chaplain) on Apr 10, 2006 at 16:07 UTC
Re: removing duplicate lines
by CountZero (Bishop) on Apr 10, 2006 at 20:36 UTC
Or if you do not want to re-invent the wheel:
use strict;
use warnings;
use List::MoreUtils qw( uniq );
my @uniques = uniq( <DATA> );    # keeps the first occurrence of each line
print @uniques;
__DATA__
Name1
Name2
Name2
Name3
Name3
Name3
Name3
Name4
Name4
Output:
Name1
Name2
Name3
Name4
The initial list does not have to be sorted (as it must be for the Unix uniq) and the original order is preserved.
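For illustration (a snippet added here, not from the original post), unsorted input keeps its first-seen order:
use List::MoreUtils qw( uniq );
print join( ' ', uniq( qw(b a b c a) ) ), "\n";    # prints: b a c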
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: removing duplicate lines
by codeacrobat (Chaplain) on Apr 10, 2006 at 22:50 UTC
Why make things complicated? A little magic with the perl switches and you are there. Like uniq, this only drops adjacent duplicates, which suits the pre-sorted data:
perl -ni -e 'print if $_ ne $old; $old = $_' duplicates
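A %seen-based variant (an addition for illustration, not part of the original reply) would also handle input that isn't sorted:
perl -i -ne 'print unless $seen{$_}++' duplicates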