My requirement is to identify the duplicate records in a text file.
while (<>) { chomp; $no_dupes{$_} = $. if (! exists $no_dupes{$_}); }
...and what you're doing is recording the location of only a single occurrence for each line's content. Is it reasonable to ignore the possibility of more than one duplicate of a given line? (It may be fine; I don't know your data set. Or it may be a bug waiting to be discovered.) The code you posted is fine if you only care to know the content of any line that is duplicated somewhere in the file. But it falls short if you care how many times, and where, those duplicates are found.
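To make that limitation concrete, here is a minimal, self-contained sketch built around the same single-location hash (the %is_dupe flag hash and the final report are my additions, purely for illustration); it can tell you which contents repeat, but not where the repeats are or how many there were:

use strict;
use warnings;

my %no_dupes;    # content => line number of first occurrence
my %is_dupe;     # content => 1 if the content was seen again later

while (<>) {
    chomp;
    if ( exists $no_dupes{$_} ) {
        $is_dupe{$_} = 1;        # we know it repeats...
    }
    else {
        $no_dupes{$_} = $.;      # ...but we only ever store the first location
    }
}

# All we can report is the duplicated content and its first location;
# the positions (and count) of the repeats were never kept.
print "$_ (first seen at line $no_dupes{$_})\n" for sort keys %is_dupe;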
To allow for the possibility of flagging more than one duplicate, you might try this instead:
use strict;
use warnings;

# RESULT_FILE must be opened for writing; the file name here is only an example.
open RESULT_FILE, '>', 'duplicates.txt'
    or die "Cannot open output file: $!";

my %folded_lines;
while ( <> ) {
    chomp;
    push @{ $folded_lines{$_} }, $.;    # record every line number for this content
}

# Keep only content that appears more than once, ordered by first occurrence.
my @dupes = sort { $folded_lines{$a}[0] <=> $folded_lines{$b}[0] }
            grep { @{ $folded_lines{$_} } > 1 }
            keys %folded_lines;

print RESULT_FILE $_, ': ', join( ', ', @{ $folded_lines{$_} } ), "\n"
    for @dupes;
Every unique line of the file becomes a hash key. Lines that are repeated are detected, and for each one the full list of line numbers where it appears is printed. If you want to omit the first occurrence from that list, that is simple as well. It would look like this:
foreach ( @dupes ) {
    # Slice off element 0 (the first occurrence) and report only the repeats.
    print RESULT_FILE $_, ': ',
          join( ', ', @{ $folded_lines{$_} }[ 1 .. $#{ $folded_lines{$_} } ] ),
          "\n";
}
Sample output might look like this:
don't panic: 34, 55, 89, 144
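For instance (the file and script names here are only examples), if the complete script above is saved as find_dupes.pl with RESULT_FILE opened on duplicates.txt, it could be run as:

perl find_dupes.pl records.txt

and duplicates.txt would then contain one line per duplicated record, listing every line number at which it appears.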
Dave