Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've looked high and low for answers to my dilemma. I need to search through a CSV file, find the duplicate records, and write those records to another file. A typical file is 30,000+ records, about 4.3 MB of text. I figured out how to use the following to remove the duplicates, but I also have to capture them so the charges can be reversed.
open(CDR, $CDR);
open(SORTED_CDR, ">$SORTED_CDR");
@ARRAY = <CDR>;
my @unique = do { my %h; grep { !$h{$_}++ } @ARRAY };
print SORTED_CDR @unique;
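
A small variation on the same grep idiom would keep both sets in one pass (a sketch, assuming a duplicate means a byte-for-byte identical line):

my %h;
my ( @unique, @dups );
for my $line (@ARRAY) {
    if ( $h{$line}++ ) { push @dups,   $line }    # seen before: duplicate
    else               { push @unique, $line }    # first occurrence
}
print SORTED_CDR @unique;
# @dups now holds the records whose charges need reversing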

Re: print out duplicate records
by graff (Chancellor) on Aug 02, 2002 at 02:18 UTC
    Perhaps this would do what you want:
    use strict;

    my $CDR = shift;    # input file name, taken from the command line

    open( CDR, $CDR )        || die "Can't read $CDR\n";
    open( DUP, ">$CDR.dup" ) || die "Can't write $CDR.dup\n";
    open( UNQ, ">$CDR.unq" ) || die "Can't write $CDR.unq\n";

    my %seen;
    while (<CDR>) {
        if ( exists( $seen{$_} ) ) {
            print DUP;       # seen before: a duplicate
        }
        else {
            $seen{$_}++;     # first occurrence: remember it
            print UNQ;
        }
    }

    Having all the unique data lines in a hash in memory shouldn't be a problem for the size of files you mentioned: 30,000 records at about 4.3 MB works out to roughly 150 bytes per line, so even with hash overhead the whole set is only a handful of megabytes.

    Note that the number of lines in the "dup" file plus the number of lines in the "unq" file should sum to the number of lines in the input file.
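
    As a quick sanity check, something like this (a sketch; it reuses the $CDR.dup and $CDR.unq names from the code above) should print "counts add up":

    sub line_count {
        my ($file) = @_;
        open( my $fh, $file ) || die "Can't read $file\n";
        my $n = 0;
        $n++ while <$fh>;
        return $n;
    }
    print "counts add up\n"
        if line_count("$CDR.dup") + line_count("$CDR.unq") == line_count($CDR);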

Re: print out duplicate records
by DamnDirtyApe (Curate) on Aug 02, 2002 at 02:18 UTC

    This is untested, but it ought to do the trick, and it has the added bonus of not having to read 4 megs of text into memory all at once.

    #! /usr/bin/perl

    use strict ;
    use warnings ;

    $|++ ;

    my $CDR        = 'cdr.csv' ;      # placeholder input file name
    my $SORTED_CDR = 'sorted_cdr' ;   # deduplicated records end up here

    open(CDR, $CDR) ;
    open(SORTED_CDR, ">$SORTED_CDR") ;
    open DUPS, '>dup_records.txt' ;   # '>' added -- the original opened this read-only

    #@ARRAY=<CDR>;
    #my @unique = do {my %h; grep {!$h{$_}++} @ARRAY};

    my %seen = () ;
    while ( <CDR> ) {
        if ( $seen{$_}++ ) {
            print DUPS $_ ;           # second and later copies
        }
        else {
            print SORTED_CDR $_ ;     # first copy
        }
    }

    #print SORTED_CDR @unique;

    exit ;

    __END__

    _______________
    D a m n D i r t y A p e
Re: print out duplicate records
by BrowserUk (Patriarch) on Aug 02, 2002 at 06:01 UTC

    'scuse if I got this wrong--I haven't used a *nix system for several years--but wouldn't this work for you (on a *nix system)?

    uniq -cd sorted_cdr > duplicates
    uniq -u sorted_cdr > unique_cdr
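
    Note that uniq only compares adjacent lines, so the input has to be sorted first; if it isn't already, something like this (cdr.csv is just a stand-in for the input file name):

    sort cdr.csv > sorted_cdr

    Also, -c prefixes each duplicated line with its repeat count, which would need stripping before the records go back for reversal; plain -d prints each duplicated line once, unmodified. And -u prints only lines that never repeat at all, whereas plain uniq keeps one copy of every line, which is what the Perl solutions above do.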

    Of course, a Perl solution will be more portable if that's a requirement.

      Sorry BrowserUK, I should have mentioned it was not a *nix system.