in reply to Separate duplicate and unique records

Oy, young people these days! Always with the hashes it is. Like memory grows on trees! When I was your age, we had no RAM. We had 4K of core and we were thankful for it!

Since the input is sorted, you just need one record of "lookahead":

use strict;

my $prev;
my $dup = 0;
my ($ifile, $ufile, $dupfile) = qw(data uniq dups);

open(IN,  $ifile)        or die "Cannot open $ifile: $!\n";
open(UNQ, ">", $ufile)   or die "Cannot open $ufile: $!\n";
open(DUP, ">", $dupfile) or die "Cannot open $dupfile: $!\n";

while (<IN>) {
    if (defined($prev)) {
        if ($prev eq $_) {          # current line repeats the previous one
            $dup = 1;
            print DUP $prev;
        }
        elsif ($dup) {              # previous line was the last of a run of dups
            print DUP $prev;
            $dup = 0;
        }
        else {                      # previous line stood alone
            print UNQ $prev;
        }
    }
    $prev = $_;
}

# Flush the final record of lookahead.
if (defined($prev)) {
    if ($dup) { print DUP $prev }
    else      { print UNQ $prev }
}

Re: Re: Separate duplicate and unique records
by markguy (Scribe) on Aug 08, 2003 at 13:07 UTC
    Theo, this solution, if I'm parsing it correctly, assumes that all duplicate records are 'stacked' together... i.e., all 1111 lines occur one after another, correct? I pretty much discarded that pattern as soon as I saw it, since rarely do problem sets order themselves that conveniently. ;)

    If that pattern is guaranteed, then yes, the hash is a waste of memory. If it's not, then your solution... isn't. Personally, I'd rather eat the memory usage and feel comfortable knowing I didn't have to rely on my input to match my expectations, which are fraught with danger and ignorance most days.
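    For what it's worth, the hash-based alternative being defended here can be sketched roughly as follows. This is only an illustration, not code from the thread: the in-memory @records array and the sample values stand in for reading the data file, and everything is kept in arrays rather than theo's output filehandles.

    ```perl
    use strict;
    use warnings;

    # Sketch of the hash-based approach: count every record first, then
    # route each record by its count.  No sorted-input assumption, at the
    # cost of one hash entry per distinct record.  Sample records below
    # are illustrative stand-ins for the file reads.
    my @records = ("1111\n", "2222\n", "1111\n", "3333\n");

    my %count;
    $count{$_}++ for @records;            # first pass: tally occurrences

    my (@uniq, @dups);
    for my $rec (@records) {              # second pass: route by count
        if ($count{$rec} > 1) { push @dups, $rec }
        else                  { push @uniq, $rec }
    }

    # @uniq now holds "2222\n" and "3333\n"; @dups holds both "1111\n" copies.
    ```

    The second pass walks the original records (rather than the hash keys) so the output preserves input order, which matters if the file turns out not to be sorted after all.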

    If the memory usage of the hash is that problematic (and let's be honest... when folks say "I only had X amount of memory to use!", it's not because that was all they *wanted* to use, now was it? :)), then stash the results off in some other manner (DBI leaps to mind) and read in a limited number of lines at a time.
      He said in the specification that the input file was already sorted.