My requirement is to identify the duplicate records in a text file.

while (<>) {
    chomp;
    $no_dupes{$_} = $. if (! exists $no_dupes{$_});
}

...and what you're doing is identifying the location of a single duplicate for each line content. Is it reasonable to ignore the possibility of more than one duplicate of a given line? (It may be fine; I don't know about your data set. Or it may be a bug waiting to be discovered.) The code you posted is fine if you only care to know the content of any line that is duplicated somewhere in the file. But it falls short if you care about how many times, and where, those duplicates are found.
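To make the limitation concrete, here is a small sketch that applies the same logic to four in-memory lines instead of a file. The sub name first_seen and the sample data are made up for illustration; they are not from your code.

```perl
use strict;
use warnings;

# Same logic as the snippet above, applied to a list instead of <>.
# Returns a hash mapping each distinct line to the number of the
# line where it was first seen.
sub first_seen {
    my @lines = @_;
    my %no_dupes;
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        $no_dupes{$line} = $line_no if !exists $no_dupes{$line};
    }
    return %no_dupes;
}

# Hypothetical input: 'apple' appears on lines 1, 3 and 4.
my %seen = first_seen( 'apple', 'pear', 'apple', 'apple' );
print "$_ => $seen{$_}\n" for sort keys %seen;
```

This prints "apple => 1" and "pear => 2". The hash holds only one line number per distinct line, so the occurrences of 'apple' on lines 3 and 4 are not recorded anywhere.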

To allow for the possibility of flagging more than one duplicate, you might try this instead:

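use strict;
use warnings;

my %folded_lines;
while ( <> ) {
    chomp;
    # Collect the line number of every occurrence of this line's content.
    push @{ $folded_lines{$_} }, $.;
}

# Keep only content seen more than once, ordered by first occurrence.
my @dupes =
    sort { $folded_lines{$a}[0] <=> $folded_lines{$b}[0] }
    grep { @{ $folded_lines{$_} } > 1 }
    keys %folded_lines;

# RESULT_FILE is assumed to have been opened for writing elsewhere.
print RESULT_FILE $_, ': ', join( ', ', @{ $folded_lines{$_} } ), "\n"
    for @dupes;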

Every unique line of the file will become a hash key. Lines that are repeated will be detected, and the line numbers of every occurrence will be listed. If you want to leave the first occurrence out of the list, that is simple as well. It would look like this:

foreach ( @dupes ) {
    print RESULT_FILE $_, ': ',
        join( ', ', @{ $folded_lines{$_} }[ 1 .. $#{ $folded_lines{$_} } ] ),
        "\n";
}

Sample output might look like this:

don't panic: 34, 55, 89, 144
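The snippets above print to RESULT_FILE, which is assumed to be opened somewhere else in your script. If it helps, here is a self-contained sketch of the same idea using lexical filehandles. The sub name report_dupes and the sample data are hypothetical, and the in-memory filehandles are there only so the example runs without any files on disk.

```perl
use strict;
use warnings;

# Read lines from $in and write "content: n1, n2, ..." to $out for
# every line content that occurs more than once.
# Returns the number of distinct duplicated lines.
sub report_dupes {
    my ( $in, $out ) = @_;
    my %folded_lines;
    my $line_no = 0;
    while ( my $line = <$in> ) {
        chomp $line;
        push @{ $folded_lines{$line} }, ++$line_no;
    }
    # Duplicated content only, ordered by first occurrence.
    my @dupes =
        sort { $folded_lines{$a}[0] <=> $folded_lines{$b}[0] }
        grep { @{ $folded_lines{$_} } > 1 }
        keys %folded_lines;
    print {$out} $_, ': ', join( ', ', @{ $folded_lines{$_} } ), "\n"
        for @dupes;
    return scalar @dupes;
}

# Demo with in-memory filehandles (made-up data).
my $input = "alpha\nbeta\nalpha\ngamma\nbeta\nalpha\n";
open my $in,  '<', \$input     or die "Cannot read input: $!";
open my $out, '>', \my $report or die "Cannot write report: $!";
report_dupes( $in, $out );
close $out;
print $report;
```

With that input the report reads "alpha: 1, 3, 6" followed by "beta: 2, 5". For real files you would pass handles opened on your input and output paths instead of the scalar references.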

Dave


In reply to Re: Remove Duplicates from one text file and print them in another text file. by davido
in thread Remove Duplicates from one text file and print them in another text file. by bedohave9
