Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've looked high and low for answers to my dilemma. I need to search through a CSV file, find the duplicate records, and write those records to another file. A typical file is 30,000+ records, about 4.3 MB of text. I figured out how to use the following to remove the duplicates, but I also have to capture them so the charges can be reversed.
open(CDR, $CDR);
open(SORTED_CDR, ">$SORTED_CDR");
@ARRAY = <CDR>;
my @unique = do { my %h; grep { !$h{$_}++ } @ARRAY };
print SORTED_CDR @unique;
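
A small variation on the same grep idiom would keep both sets in one pass (a sketch, assuming a duplicate means a byte-for-byte identical line):

my %h;
my ( @unique, @dups );
for my $line (@ARRAY) {
    if ( $h{$line}++ ) { push @dups,   $line }    # seen before: duplicate
    else               { push @unique, $line }    # first occurrence
}
print SORTED_CDR @unique;
# @dups now holds the records whose charges need reversing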

Re: print out duplicate records
by graff (Chancellor) on Aug 02, 2002 at 02:18 UTC
    Perhaps this would do what you want:
    use strict;

    my $CDR = shift;    # input file name, taken from the command line

    open( CDR, $CDR )        || die "Can't read $CDR\n";
    open( DUP, ">$CDR.dup" ) || die "Can't write $CDR.dup\n";
    open( UNQ, ">$CDR.unq" ) || die "Can't write $CDR.unq\n";

    my %seen;
    while (<CDR>) {
        if ( exists( $seen{$_} ) ) {
            print DUP;       # seen before: a duplicate
        }
        else {
            $seen{$_}++;     # first occurrence: remember it
            print UNQ;
        }
    }

    Having all the unique data lines in a hash in memory shouldn't be a problem for the size of files you mentioned: 30,000 records at about 4.3 MB works out to roughly 150 bytes per line, so even with hash overhead the whole set is only a handful of megabytes.

    Note that the number of lines in the "dup" file plus the number of lines in the "unq" file should sum to the number of lines in the input file.
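
    As a quick sanity check, something like this (a sketch; it reuses the $CDR.dup and $CDR.unq names from the code above) should print "counts add up":

    sub line_count {
        my ($file) = @_;
        open( my $fh, $file ) || die "Can't read $file\n";
        my $n = 0;
        $n++ while <$fh>;
        return $n;
    }
    print "counts add up\n"
        if line_count("$CDR.dup") + line_count("$CDR.unq") == line_count($CDR);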

Re: print out duplicate records
by DamnDirtyApe (Curate) on Aug 02, 2002 at 02:18 UTC

    This is untested, but it ought to do the trick, and it has the added bonus of not having to read 4 megs of text into memory all at once.

    #! /usr/bin/perl

    use strict ;
    use warnings ;

    $|++ ;

    my $CDR        = 'cdr.csv' ;      # placeholder input file name
    my $SORTED_CDR = 'sorted_cdr' ;   # deduplicated records end up here

    open(CDR, $CDR) ;
    open(SORTED_CDR, ">$SORTED_CDR") ;
    open DUPS, '>dup_records.txt' ;   # '>' added -- the original opened this read-only

    #@ARRAY=<CDR>;
    #my @unique = do {my %h; grep {!$h{$_}++} @ARRAY};

    my %seen = () ;
    while ( <CDR> ) {
        if ( $seen{$_}++ ) {
            print DUPS $_ ;           # second and later copies
        }
        else {
            print SORTED_CDR $_ ;     # first copy
        }
    }

    #print SORTED_CDR @unique;

    exit ;

    __END__

    _______________
    D a m n D i r t y A p e
Re: print out duplicate records
by BrowserUk (Patriarch) on Aug 02, 2002 at 06:01 UTC

    'scuse if I got this wrong--I haven't used a *nix system for several years--but wouldn't this work for you (on a *nix system)?

    uniq -cd sorted_cdr > duplicates
    uniq -u sorted_cdr > unique_cdr
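
    Note that uniq only compares adjacent lines, so the input has to be sorted first; if it isn't already, something like this (cdr.csv is just a stand-in for the input file name):

    sort cdr.csv > sorted_cdr

    Also, -c prefixes each duplicated line with its repeat count, which would need stripping before the records go back for reversal; plain -d prints each duplicated line once, unmodified. And -u prints only lines that never repeat at all, whereas plain uniq keeps one copy of every line, which is what the Perl solutions above do.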

    Of course, a Perl solution will be more portable if that's a requirement.

      Sorry BrowserUK, I should have mentioned it was not a *nix system.