wsee has asked for the wisdom of the Perl Monks concerning the following question:

I want to separate all the duplicate records into a duplicate file and write out all the unique records to a unique file.

My input file is about 10MB and the records are sorted numerically. Here are some sample records:

00000
11111
11111
11115
22222
33333
33333
33333
44444
55555
and so on.

The dups file should look like:

11111
11111
33333
33333
33333
The unique file should look like:
00000
11115
22222
44444
55555
Here is my code:
open(IN, $ifile);
open(UNQ, $ufile);
open(DUP, $dupfile);
my %seen;
while (<IN>) {
    if ( exists( $seen{$_} )) {
        print DUP "$_\n";
    }
    else {
        $seen{$_}++;
        print UNQ "$_\n";
    }
}
close(IN);
close(UNQ);
close(DUP);

My code does not pull ALL the duplicates into the dup file. Instead, it keeps the first occurrence of each duplicate record in the unique file and writes only the remaining copies to the dup file.

Any suggestion?

edited: Fri Aug 8 13:37:06 2003 by jeffa - code tags, removed br tags, added readmore

Replies are listed 'Best First'.
Re: Separate duplicate and unique records
by japhy (Canon) on Aug 07, 2003 at 20:38 UTC
    You can't know an ID is unique until you've finished reading the file.
    my (%count, %unique);
    while (<IN>) {
        print DUP if $count{$_}++;
        if ($count{$_} == 1) { $unique{$_} = 1 }
        else                 { delete $unique{$_} }
    }
    print UNQ for sort keys %unique;

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Separate duplicate and unique records
by Thelonius (Priest) on Aug 07, 2003 at 21:39 UTC
    Oy, young people these days! Always with the hashes it is. Like memory grows on trees! When I was your age, we had no RAM. We had 4K of core and we were thankful for it!

    Since the input is sorted, you just need one record of "lookahead":

    use strict;
    my $prev;
    my $dup = 0;
    my ($ifile, $ufile, $dupfile) = qw(data uniq dups);
    open(IN, $ifile) or die "Cannot open $ifile: $!\n";
    open(UNQ, ">", $ufile) or die "Cannot open $ufile: $!\n";
    open(DUP, ">", $dupfile) or die "Cannot open $dupfile: $!\n";
    while (<IN>) {
        if (defined($prev)) {
            if ($prev eq $_) {
                $dup = 1;
                print DUP $prev;
            }
            else {
                if ($dup) {
                    print DUP $prev;
                    $dup = 0;
                }
                else {
                    print UNQ $prev;
                }
            }
        }
        $prev = $_;
    }
    if (defined($prev)) {
        if ($dup) {
            print DUP $prev;
            $dup = 0;
        }
        else {
            print UNQ $prev;
        }
    }
      Theo, this solution, if I'm parsing it correctly, assumes that all duplicate records are 'stacked' together... ie, all 1111 lines occur one after another, correct? I pretty much discarded that pattern as soon as I saw it, since rarely do the problem sets order themselves up that conveniently. ;)

      If that pattern is guaranteed, then yes, the hash is a waste of memory. If it's not, then your solution... isn't. Personally, I'd rather eat the memory usage and feel comfortable knowing I didn't have to rely on my input to match my expectations, which are fraught with danger and ignorance most days.

      If the memory usage of the hash is that problematic (and let's be honest... when folks say "I only had X amount of memory to use!" it's not because that was all they *wanted* to use, now was it? :) ), then stash results off in some other manner (DBI leaps to mind) and read in a limited set of lines at a time.
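      To make that concrete, here's a rough, untested sketch of the DBI route. It assumes DBD::SQLite is available and makes up the database, table, and file names (counts.db, counts, data/uniq/dups), so adjust to taste:

      use strict;
      use warnings;
      use DBI;

      # Keep the counts on disk instead of in a hash, so memory stays flat
      # no matter how large the input grows.
      my $dbh = DBI->connect("dbi:SQLite:dbname=counts.db", "", "",
                             { RaiseError => 1, AutoCommit => 0 });
      $dbh->do("CREATE TABLE IF NOT EXISTS counts (rec TEXT PRIMARY KEY, n INTEGER)");

      my $ins = $dbh->prepare("INSERT OR IGNORE INTO counts (rec, n) VALUES (?, 0)");
      my $upd = $dbh->prepare("UPDATE counts SET n = n + 1 WHERE rec = ?");

      open(IN, "data") or die "Cannot open data: $!\n";
      while (<IN>) {
          chomp;
          $ins->execute($_);   # create the row the first time we see a record
          $upd->execute($_);   # bump its count on every appearance
      }
      close(IN);
      $dbh->commit;

      open(UNQ, ">", "uniq") or die "Cannot open uniq: $!\n";
      open(DUP, ">", "dups") or die "Cannot open dups: $!\n";

      my $sth = $dbh->prepare("SELECT rec, n FROM counts ORDER BY rec");
      $sth->execute;
      while (my ($rec, $n) = $sth->fetchrow_array) {
          if ($n == 1) { print UNQ "$rec\n" }
          else         { print DUP "$rec\n" for 1 .. $n }
      }
      close(UNQ);
      close(DUP);
      $dbh->disconnect;

      Wrapping the whole load in one transaction (AutoCommit off, commit at the end) keeps SQLite from syncing on every insert, which matters even for a 10MB input.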
        Theo, this solution, if I'm parsing it correctly, assumes that all duplicate records are 'stacked' together... ie, all 1111 lines occur one after another, correct? I pretty much discarded that pattern as soon as I saw it, since rarely do the problem sets order themselves up that conveniently. ;)
        He said in the specification that the input file was already sorted.
Re: Separate duplicate and unique records
by shemp (Deacon) on Aug 07, 2003 at 20:34 UTC
    Collect the stats first, then if the count is 1, it's unique, otherwise it's a duplicate:
    ...
    while (<IN>) {
        $seen{$_}++;
    }
    foreach my $id (sort keys %seen) {
        if ( $seen{$id} == 1 ) {
            print UNQ "$id\n";
        }
        else {
            print DUP "$id\n";
        }
    }
    ...
      A dup needs to be printed to the file each time it appears.

      (Untested)

      ...
      while (<IN>) {
          $seen{$_}++;
      }
      foreach my $id (sort keys %seen) {
          if ( $seen{$id} == 1 ) {
              print UNQ "$id\n";
          }
          else {
              print DUP "$id\n" foreach 1..$seen{$id};
          }
      }
      ...
      The sort keys %seen offered here would render the outputs sorted, not in the order they were originally found. This may or may not be suitable.

      You can either keep a separate array of all unique items in the order discovered, or you can look at Tie::IxHash for an already-bundled solution.
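      For instance, a minimal sketch of the separate-array idea (untested; it assumes the IN, UNQ and DUP handles are already opened as in the other replies):

      my (%seen, @order);
      while (<IN>) {
          chomp;
          push @order, $_ unless $seen{$_}++;   # remember first-seen order
      }
      for my $id (@order) {
          if ($seen{$id} == 1) {
              print UNQ "$id\n";
          }
          else {
              print DUP "$id\n" for 1 .. $seen{$id};
          }
      }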

      --
      [ e d @ h a l l e y . c c ]

Re: Separate duplicate and unique records
by revdiablo (Prior) on Aug 07, 2003 at 21:28 UTC

    Note: I know this is Seekers of Perl Wisdom, but there were already plenty of good Perl solutions, and I couldn't resist.

    If you're on a unix (should I say GNU?) machine, or have cygwin installed on a Windows machine, you can very easily accomplish these tasks with sort and uniq.

    Print unique lines:

    sort -u numberlist

    Print all duplicates:

    sort numberlist | uniq -D

    (Or if you know the numberlist is already sorted, uniq -D numberlist will do.)

      Almost right. sort -u will give one copy of each input record whether it was unique or not to begin with. What he wants is uniq -u and uniq -D since he stated the input was already sorted. My program below does it in one pass, though.

        Well, sort -u (or simply the default behavior of uniq) matches the behavior of the OP's sample code -- that is, it prints all values once, but only once. Perhaps that is not what he wanted, but if he stated that in his original post, it's not clear to me. Either way, it seems he has many solutions to any number of problems, some of which may be the actual problem he was looking for help with. :)

Re: Separate duplicate and unique records
by markguy (Scribe) on Aug 07, 2003 at 20:56 UTC
    EDIT: It somehow escaped me that others had suggested effectively this very thing, although I did use a little-used operator for printing out the DUP records, so technically it's a different solution! I so need to go home.


    Is there some reason why just reading all the keys into a hash while incrementing the value wouldn't net you what you want?
    my %hash;
    while ( <IN> ) {
        $hash{ $_ }++;
    }
    foreach my $key ( sort keys %hash ) {
        if ( $hash{ $key } > 1 ) {
            print DUP "$key\n" x $hash{ $key };
        }
        else {
            print UNQ "$key\n";
        }
    }
Re: Separate duplicate and unique records
by ajdelore (Pilgrim) on Aug 07, 2003 at 21:58 UTC

    I suggest that you create a hash to store the number of times you have seen something. Then, you can iterate over the hash to create the output.

    Updated: So, I was a little behind on this and basically solved it the same way as other monks. Oops. That's what happens when your boss walks in while you are playing around on PM. :)

    use strict;
    open (IN, "test.txt") or die "Can't open file: $!";
    open (UNQ, "> unq.txt") or die "Can't open file: $!";
    open (DUP, "> dup.txt") or die "Can't open file: $!";
    my %hash;
    while (<IN>) {
        chomp;
        $hash{$_}++;
    }
    foreach (keys %hash) {
        if ( $hash{$_} > 1) {
            print DUP "$_\n";
        }
        else {
            print UNQ "$_\n";
        }
    }
    __END__
    fraser:~$ cat test.txt
    0000
    1111
    2222
    3333
    4444
    0000
    3333
    1111
    5555
    1111
    0000
    0000
    1111
    3333
    6666
    0000
    fraser:~$ cat unq.txt
    6666
    4444
    2222
    5555
    fraser:~$ cat dup.txt
    0000
    3333
    1111
    fraser:~$

    </ajdelore>

Re: Separate duplicate and unique records
by rir (Vicar) on Aug 08, 2003 at 03:04 UTC
    I got interrupted or I'd have put this up earlier. This is much like thelonius's solution. I post it only because it shows the queue idea much more clearly.

    In this code the line:

    ($cur, $next) = ( $next, scalar( <IN>));
    may be generalized to slide any size window over a stream:
    @queue[0 .. 4 ] = ( @queue[ 1 .. 4 ], scalar( <IN> ));
    Also, I like to set up such a queue going into a loop rather than clean up a queue when exiting a loop. It seems clearer to me, but I can only call that a personal preference.

    This code seems to work.

    #!/usr/bin/perl -T
    use strict;
    use warnings;

    my ( $input, $unique, $dupe) = qw( input unique dupes );
    my $is_dupe;

    open( IN, $input ) or die "Can not open $input";
    open( UNIQUE, ">", $unique ) or die "Can not open $unique";
    open( DUPES, ">", $dupe ) or die "Can not open $dupe";

    my $cur = <IN>;
    exit if not defined $cur;
    my $next = <IN>;

    no warnings "uninitialized";
    while ( 1) {
        if ( $cur == $next) {
            print DUPES $cur;
            $is_dupe = 1;
        }
        else {
            if ( $is_dupe) {
                print DUPES $cur;
            }
            else {
                print UNIQUE $cur;
            }
            $is_dupe = 0;
        }
        ($cur, $next) = ( $next, scalar( <IN>));
        last if not defined $cur;
    }
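    For what it's worth, the "any size window" generalization above might be fleshed out roughly like this (an untested sketch using a five-line window; the processing step is left as a placeholder):

    my @queue;
    @queue[0 .. 4] = map { scalar <IN> } 0 .. 4;           # prime the window
    while ( defined $queue[0] ) {
        # ... inspect @queue[0 .. 4] here ...
        @queue[0 .. 4] = ( @queue[1 .. 4], scalar <IN> );  # slide one line
    }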
Re: Separate duplicate and unique records
by BrowserUk (Patriarch) on Aug 08, 2003 at 07:48 UTC

    Of course, there's no need to write a whole program, compute expensive hashes, and use gobs of memory for things that you can do with a nice, easy-to-remember one-liner :)

    perl -ne"print{$*ne$_&&$*ne$@?STDOUT:STDERR}$*if$*;($@,$*)=($*,$_);END +{print{$*ne$_&&$*ne$@?STDOUT:STDERR}$*}" in 1>uniq 2>dups

    Caveat: The usual OS quoting rule changes apply.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.