in reply to Remove Duplicates from one text file and print them in another text file.

My requirement is to identify the duplicate records in a text file.

while (<>) { chomp; $no_dupes{$_} = $. if (! exists $no_dupes{$_}); }

...and what you're doing is recording only one line number (the first occurrence) for each distinct line's content. Is it reasonable to ignore the possibility of more than one duplicate of a given line? (It may be fine; I don't know about your data set. Or it may be a bug waiting to be discovered.) The code you posted is fine if you only care to know the content of any line that is duplicated somewhere in the file. But it falls short if you care about how many times a line is repeated and where those duplicates are found.

To allow for the possibility of flagging more than one duplicate, you might try this instead:

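    use strict;
    use warnings;

    my %folded_lines;

    # Map each distinct line to the list of line numbers where it occurs.
    while ( <> ) {
        chomp;
        push @{ $folded_lines{$_} }, $.;
    }

    # Keep only lines seen more than once, ordered by first occurrence.
    my @dupes = sort { $folded_lines{$a}[0] <=> $folded_lines{$b}[0] }
                grep { @{ $folded_lines{$_} } > 1 }
                keys %folded_lines;

    # RESULT_FILE is assumed to be a filehandle already opened for writing.
    print RESULT_FILE $_, ': ', join( ', ', @{ $folded_lines{$_} } ), "\n"
        for @dupes;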

Every unique line of the file will become a hash key. Those lines that are repeated will be detected, and the line numbers of every occurrence will be listed. If you want to remove the first occurrence from the list, that becomes simple as well. It would look like this:

    foreach ( @dupes ) {
        print RESULT_FILE $_, ': ',
            join( ', ', @{ $folded_lines{$_} }[ 1 .. $#{ $folded_lines{$_} } ] ),
            "\n";
    }

Sample output might look like this:

don't panic: 34, 55, 89, 144
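
If it helps to see the idea in isolation, here is a minimal sketch of the same hash-of-arrays approach that runs on in-memory data instead of a file. The find_duplicates helper and the sample lines are made up for illustration; they are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: takes an array ref of already-chomped lines and
# returns a hash ref mapping each repeated line to the list of line
# numbers (1-based) where it occurs.
sub find_duplicates {
    my ($lines) = @_;
    my %seen_at;
    my $line_no = 0;

    for my $line (@$lines) {
        $line_no++;
        push @{ $seen_at{$line} }, $line_no;
    }

    # Drop every line that occurs only once.
    for my $line ( keys %seen_at ) {
        delete $seen_at{$line} if @{ $seen_at{$line} } == 1;
    }

    return \%seen_at;
}

# Made-up sample data standing in for the contents of a file.
my @sample = ( 'alpha', 'beta', 'alpha', 'gamma', 'beta', 'alpha' );
my $dupes  = find_duplicates( \@sample );

# Report the duplicates in order of first occurrence.
for my $line ( sort { $dupes->{$a}[0] <=> $dupes->{$b}[0] } keys %$dupes ) {
    print "$line: ", join( ', ', @{ $dupes->{$line} } ), "\n";
}
```

With that sample data, 'alpha' is reported at lines 1, 3, 6 and 'beta' at lines 2, 5, while 'gamma' is left out because it occurs only once.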

Dave

Re^2: Remove Duplicates from one text file and print them in another text file.
by bedohave9 (Acolyte) on Jun 04, 2012 at 21:12 UTC

    The code did execute without errors, but the console just shows a blinking cursor. I have tried it with many files, but each time I get only a horizontal cursor blinking in the console, and it is still sitting there after 5-6 minutes.

      How big are these files?


      Dave

        The files range in size from 8KB to 250KB. I have used a couple of files of 10.83KB and 109.27KB.