wsee has asked for the wisdom of the Perl Monks concerning the following question:

I want to separate all the duplicate records into a duplicate file and write out all the unique records to a unique file.

My input file is about 10MB and the records are sorted numerically. Here are some sample records:

00000
11111
11111
11115
22222
33333
33333
33333
44444
55555
and so on.

The dups file should look like:

11111
11111
33333
33333
33333
The unique file should look like:
00000
11115
22222
44444
55555
Here is my code:
open(IN, $ifile);
open(UNQ, $ufile);
open(DUP, $dupfile);
my %seen;
while (<IN>) {
    if ( exists( $seen{$_} )) {
        print DUP "$_\n";
    }
    else {
        $seen{$_}++;
        print UNQ "$_\n";
    }
}
close(IN);
close(UNQ);
close(DUP);

My code does not pull ALL the duplicates into the dup file. Instead, it keeps the first occurrence of each duplicate record in the unique file and writes only the remaining copies to the dup file.

Any suggestion?

edited: Fri Aug 8 13:37:06 2003 by jeffa - code tags, removed br tags, added readmore

Replies are listed 'Best First'.
Re: Separate duplicate and unique records
by japhy (Canon) on Aug 07, 2003 at 20:38 UTC
    You can't know an ID is unique until you've finished reading the file.
    my (%count, %unique);
    while (<IN>) {
        print DUP if $count{$_}++;
        if ($count{$_} == 1) { $unique{$_} = 1 }
        else                 { delete $unique{$_} }
    }
    print UNQ for sort keys %unique;

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Separate duplicate and unique records
by Thelonius (Priest) on Aug 07, 2003 at 21:39 UTC
    Oy, young people these days! Always with the hashes it is. Like memory grows on trees! When I was your age, we had no RAM. We had 4K of core and we were thankful for it!

    Since the input is sorted, you just need one record of "lookahead":

    use strict;
    my $prev;
    my $dup = 0;
    my ($ifile, $ufile, $dupfile) = qw(data uniq dups);
    open(IN, $ifile) or die "Cannot open $ifile: $!\n";
    open(UNQ, ">", $ufile) or die "Cannot open $ufile: $!\n";
    open(DUP, ">", $dupfile) or die "Cannot open $dupfile: $!\n";
    while (<IN>) {
        if (defined($prev)) {
            if ($prev eq $_) {
                $dup = 1;
                print DUP $prev;
            }
            else {
                if ($dup) {
                    print DUP $prev;
                    $dup = 0;
                }
                else {
                    print UNQ $prev;
                }
            }
        }
        $prev = $_;
    }
    if (defined($prev)) {
        if ($dup) {
            print DUP $prev;
            $dup = 0;
        }
        else {
            print UNQ $prev;
        }
    }
      Theo, this solution, if I'm parsing it correctly, assumes that all duplicate records are 'stacked' together... ie, all 1111 lines occur one after another, correct? I pretty much discarded that pattern as soon as I saw it, since rarely do the problem sets order themselves up that conveniently. ;)

      If that pattern is guaranteed, then yes, the hash is a waste of memory. If it's not, then your solution... isn't. Personally, I'd rather eat the memory usage and feel comfortable knowing I didn't have to rely on my input to match my expectations, which are fraught with danger and ignorance most days.

      If the memory usage of the hash is that problematic (and let's be honest... when folks say "I only had X amount of memory to use!" it's not because that was all they *wanted* to use, now was it? :) ), then stash results off in some other manner (DBI leaps to mind) and read in a limited set of lines at a time.
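      To make that concrete, here's a rough, untested sketch of the DBI route. It assumes DBD::SQLite is available and makes up the database, table, and file names (counts.db, counts, data/uniq/dups), so adjust to taste:

      use strict;
      use warnings;
      use DBI;

      # Keep the counts on disk instead of in a hash, so memory stays flat
      # no matter how large the input grows.
      my $dbh = DBI->connect("dbi:SQLite:dbname=counts.db", "", "",
                             { RaiseError => 1, AutoCommit => 0 });
      $dbh->do("CREATE TABLE IF NOT EXISTS counts (rec TEXT PRIMARY KEY, n INTEGER)");

      my $ins = $dbh->prepare("INSERT OR IGNORE INTO counts (rec, n) VALUES (?, 0)");
      my $upd = $dbh->prepare("UPDATE counts SET n = n + 1 WHERE rec = ?");

      open(IN, "data") or die "Cannot open data: $!\n";
      while (<IN>) {
          chomp;
          $ins->execute($_);   # create the row the first time we see a record
          $upd->execute($_);   # bump its count on every appearance
      }
      close(IN);
      $dbh->commit;

      open(UNQ, ">", "uniq") or die "Cannot open uniq: $!\n";
      open(DUP, ">", "dups") or die "Cannot open dups: $!\n";

      my $sth = $dbh->prepare("SELECT rec, n FROM counts ORDER BY rec");
      $sth->execute;
      while (my ($rec, $n) = $sth->fetchrow_array) {
          if ($n == 1) { print UNQ "$rec\n" }
          else         { print DUP "$rec\n" for 1 .. $n }
      }
      close(UNQ);
      close(DUP);
      $dbh->disconnect;

      Wrapping the whole load in one transaction (AutoCommit off, commit at the end) keeps SQLite from syncing on every insert, which matters even for a 10MB input.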
        Theo, this solution, if I'm parsing it correctly, assumes that all duplicate records are 'stacked' together... ie, all 1111 lines occur one after another, correct? I pretty much discarded that pattern as soon as I saw it, since rarely do the problem sets order themselves up that conveniently. ;)
        He said in the specification that the input file was already sorted.
Re: Separate duplicate and unique records
by shemp (Deacon) on Aug 07, 2003 at 20:34 UTC
    Collect the stats first, then if the count is 1, it's unique, otherwise it's a duplicate:
    ...
    while (<IN>) {
        $seen{$_}++;
    }
    foreach my $id (sort keys %seen) {
        if ( $seen{$id} == 1 ) {
            print UNQ "$id\n";
        }
        else {
            print DUP "$id\n";
        }
    }
    ...
      A dup needs to be printed to the file each time it appears.

      (Untested)

      ...
      while (<IN>) {
          $seen{$_}++;
      }
      foreach my $id (sort keys %seen) {
          if ( $seen{$id} == 1 ) {
              print UNQ "$id\n";
          }
          else {
              print DUP "$id\n" foreach 1..$seen{$id};
          }
      }
      ...
      The sort keys %seen offered here would render the outputs sorted, not in the order they were originally found. This may or may not be suitable.

      You can either keep a separate array of all unique items in the order discovered, or you can look at Tie::IxHash for an already-bundled solution.
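      For instance, a minimal sketch of the separate-array idea (untested; it assumes the IN, UNQ and DUP handles are already opened as in the other replies):

      my (%seen, @order);
      while (<IN>) {
          chomp;
          push @order, $_ unless $seen{$_}++;   # remember first-seen order
      }
      for my $id (@order) {
          if ($seen{$id} == 1) {
              print UNQ "$id\n";
          }
          else {
              print DUP "$id\n" for 1 .. $seen{$id};
          }
      }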

      --
      [ e d @ h a l l e y . c c ]

Re: Separate duplicate and unique records
by revdiablo (Prior) on Aug 07, 2003 at 21:28 UTC

    Note: I know this is Seekers of Perl Wisdom, but there were already plenty of good Perl solutions, and I couldn't resist.

    If you're on a unix (should I say GNU?) machine, or have cygwin installed on a Windows machine, you can very easily accomplish these tasks with sort and uniq.

    Print unique lines:

    sort -u numberlist

    Print all duplicates:

    sort numberlist | uniq -D

    (Or if you know the numberlist is already sorted, uniq -D numberlist will do.)

      Almost right. sort -u will give one copy of each input record whether it was unique or not to begin with. What he wants is uniq -u and uniq -D since he stated the input was already sorted. My program below does it in one pass, though.

        Well, sort -u (or simply the default behavior of uniq) matches the behavior of the OP's sample code -- that is, it prints all values once, but only once. Perhaps that is not what he wanted, but if he stated that in his original post, it's not clear to me. Either way, it seems he has many solutions to any number of problems, some of which may be the actual problem he was looking for help with. :)

Re: Separate duplicate and unique records
by markguy (Scribe) on Aug 07, 2003 at 20:56 UTC
    EDIT: It somehow escaped me that others had suggested effectively this very thing, although I did use a little-used operator for printing out the DUP records, so technically it's a different solution! I so need to go home.


    Is there some reason why just reading all the keys into a hash while incrementing the value wouldn't net you what you want?
    my %hash;
    while ( <IN> ) {
        $hash{ $_ }++;
    }
    foreach my $key ( sort keys %hash ) {
        if ( $hash{ $key } > 1 ) {
            print DUP "$key\n" x $hash{ $key };
        }
        else {
            print UNQ "$key\n";
        }
    }
Re: Separate duplicate and unique records
by ajdelore (Pilgrim) on Aug 07, 2003 at 21:58 UTC

    I suggest that you create a hash to store the number of times you have seen something. Then, you can iterate over the hash to create the output.

    Updated: So, I was a little behind on this and basically solved it the same way as other monks. Oops. That's what happens when your boss walks in while you are playing around on PM. :)

    use strict;
    open (IN, "test.txt") or die "Can't open file: $!";
    open (UNQ, "> unq.txt") or die "Can't open file: $!";
    open (DUP, "> dup.txt") or die "Can't open file: $!";
    my %hash;
    while (<IN>) {
        chomp;
        $hash{$_}++;
    }
    foreach (keys %hash) {
        if ( $hash{$_} > 1) {
            print DUP "$_\n";
        }
        else {
            print UNQ "$_\n";
        }
    }
    __END__
    fraser:~$ cat test.txt
    0000
    1111
    2222
    3333
    4444
    0000
    3333
    1111
    5555
    1111
    0000
    0000
    1111
    3333
    6666
    0000
    fraser:~$ cat unq.txt
    6666
    4444
    2222
    5555
    fraser:~$ cat dup.txt
    0000
    3333
    1111
    fraser:~$

    </ajdelore>

Re: Separate duplicate and unique records
by rir (Vicar) on Aug 08, 2003 at 03:04 UTC
    I got interrupted or I'd have put this up earlier. This is much like thelonius's solution. I post it only because it shows the queue idea much more clearly.

    In this code the line:

    ($cur, $next) = ( $next, scalar( <IN>));
    may be generalized to slide any size window over a stream:
    @queue[0 .. 4 ] = ( @queue[ 1 .. 4 ], scalar( <IN> ));
    Also, I like to set up such a queue going into a loop rather than clean up a queue when exiting a loop. It seems clearer to me, but I can only call that a personal preference.

    This code seems to work.

    #!/usr/bin/perl -T
    use strict;
    use warnings;

    my ( $input, $unique, $dupe) = qw( input unique dupes );
    my $is_dupe;

    open( IN, $input ) or die "Can not open $input";
    open( UNIQUE, ">", $unique ) or die "Can not open $unique";
    open( DUPES, ">", $dupe ) or die "Can not open $dupe";

    my $cur = <IN>;
    exit if not defined $cur;
    my $next = <IN>;

    no warnings "uninitialized";
    while ( 1) {
        if ( $cur == $next) {
            print DUPES $cur;
            $is_dupe = 1;
        }
        else {
            if ( $is_dupe) {
                print DUPES $cur;
            }
            else {
                print UNIQUE $cur;
            }
            $is_dupe = 0;
        }
        ($cur, $next) = ( $next, scalar( <IN>));
        last if not defined $cur;
    }
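    For what it's worth, the "any size window" generalization above might be fleshed out roughly like this (an untested sketch using a five-line window; the processing step is left as a placeholder):

    my @queue;
    @queue[0 .. 4] = map { scalar <IN> } 0 .. 4;           # prime the window
    while ( defined $queue[0] ) {
        # ... inspect @queue[0 .. 4] here ...
        @queue[0 .. 4] = ( @queue[1 .. 4], scalar <IN> );  # slide one line
    }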
Re: Separate duplicate and unique records
by BrowserUk (Patriarch) on Aug 08, 2003 at 07:48 UTC

    Of course, there's no need to write a whole program, compute expensive hashes, and use gobs of memory for things that you can do with a nice, easy-to-remember one-liner :)

    perl -ne"print{$*ne$_&&$*ne$@?STDOUT:STDERR}$*if$*;($@,$*)=($*,$_);END +{print{$*ne$_&&$*ne$@?STDOUT:STDERR}$*}" in 1>uniq 2>dups

    Caveat: The usual OS quoting rule changes apply.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.