in reply to Remove Duplicates from one text file and print them in another text file.
My requirement is to identify the duplicate records in a text file.
while (<>) {
    chomp;
    $no_dupes{$_} = $. if !exists $no_dupes{$_};
}
...and what you're doing is recording the location of only the first occurrence of each line's content. Is it reasonable to ignore the possibility of more than one duplicate of a given line? (It may be fine; I don't know your data set. Or it may be a bug waiting to be discovered.) The code you posted is fine if you only care to know the content of any line that is duplicated somewhere in the file. But it falls short if you care how many times, and where, those duplicates are found.
To allow for the possibility of flagging more than one duplicate, you might try this instead:
use strict;
use warnings;

# RESULT_FILE must be opened before use; 'result.txt' is a placeholder name.
open RESULT_FILE, '>', 'result.txt' or die "Cannot open result.txt: $!";

my %folded_lines;
while ( <> ) {
    chomp;
    push @{ $folded_lines{$_} }, $.;    # record every line number for this content
}

# Keep only content seen more than once, ordered by first occurrence.
my @dupes = sort { $folded_lines{$a}[0] <=> $folded_lines{$b}[0] }
            grep { @{ $folded_lines{$_} } > 1 }
            keys %folded_lines;

print RESULT_FILE $_, ': ', join( ', ', @{ $folded_lines{$_} } ), "\n" for @dupes;
Every unique line of the file becomes a hash key. Lines that are repeated will be detected, and for each of them every line number where that content appears will be listed. If you want to omit the first occurrence from the list, that is simple as well. It would look like this:
foreach ( @dupes ) {
    print RESULT_FILE $_, ': ',
        join( ', ', @{ $folded_lines{$_} }[ 1 .. $#{ $folded_lines{$_} } ] ),
        "\n";
}
Sample output might look like this:
don't panic: 34, 55, 89, 144
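To see the whole approach end to end without needing an input file, here is a self-contained sketch; the in-memory input string and its contents are invented for illustration, and output goes to STDOUT rather than a result file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: six lines, with 'apple' and 'banana' duplicated.
my $input = "apple\nbanana\napple\ncherry\nbanana\napple\n";
open my $in_fh, '<', \$input or die "Cannot open input: $!";

my %folded_lines;
while ( my $line = <$in_fh> ) {
    chomp $line;
    push @{ $folded_lines{$line} }, $.;   # $. is the current line number
}
close $in_fh;

# Keep only content seen more than once, ordered by first occurrence.
my @dupes = sort { $folded_lines{$a}[0] <=> $folded_lines{$b}[0] }
            grep { @{ $folded_lines{$_} } > 1 }
            keys %folded_lines;

for my $line (@dupes) {
    print $line, ': ', join( ', ', @{ $folded_lines{$line} } ), "\n";
}
# Prints:
# apple: 1, 3, 6
# banana: 2, 5
```

Because every occurrence is pushed onto the array, slicing off element 0, as shown above, gives only the positions of the extra copies.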
Dave
Replies are listed 'Best First'.
Re^2: Remove Duplicates from one text file and print them in another text file.
by bedohave9 (Acolyte) on Jun 04, 2012 at 21:12 UTC
by davido (Cardinal) on Jun 04, 2012 at 22:13 UTC
by bedohave9 (Acolyte) on Jun 05, 2012 at 17:05 UTC
by davido (Cardinal) on Jun 05, 2012 at 18:43 UTC