in reply to How to eliminate redundancy in huge dataset (1,000 - 10,000)

This is an easy job for Perl. The general idea is:
    my %seen;
    while (<>) {
        chomp;
        my @fields = split /\|/, $_;
        if (my $record = $seen{$fields[0]}) {
            ... append @fields to $record ...
        }
        else {
            $seen{$fields[0]} = ...some data structure...;
        }
    }
    for my $k (keys %seen) {
        my $record = $seen{$k};
        # process $record
    }
The ellipses are there because I'm not sure exactly what data you want to parse out. After looking at your data more carefully, I realize I should ask a few questions about its structure.

Is each record one long line, or is it four lines: a |-separated line, two alphabet lines, and a blank line?

Also, do you know whether your input is sorted? Even though Perl can easily hold 10K lines in memory, we can simplify the code if we know the input is already sorted by id.
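If the input does turn out to be sorted by id, you don't need a hash at all: stream the records and flush a group whenever the id changes. Here is a rough sketch, assuming one |-separated record per line with the id in the first field; `process_group` and the sample data are stand-ins I made up for illustration:

```perl
# Sample input, already sorted by id (the first |-separated field).
# An in-memory filehandle stands in for <> so the sketch is self-contained.
my $sample = "id1|a|b\nid1|c\nid2|d\n";
open my $in, '<', \$sample or die $!;

my %merged;    # id => merged fields, just so we can see the result
sub process_group {
    my ($id, @group) = @_;
    # merge everything after the id from each record into one list
    $merged{$id} = [ map { @{$_}[1 .. $#$_] } @group ];
}

my ($cur_id, @group);
while (my $line = <$in>) {
    chomp $line;
    my @fields = split /\|/, $line;
    if (defined $cur_id && $fields[0] ne $cur_id) {
        process_group($cur_id, @group);    # id changed: flush previous group
        @group = ();
    }
    $cur_id = $fields[0];
    push @group, [ @fields ];
}
process_group($cur_id, @group) if defined $cur_id;    # flush the final group
```

Because sorted input guarantees that equal ids are adjacent, this version holds only one group in memory at a time instead of the whole file.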

For the following, I'll assume that each record is four lines.

Then the above loop would look like:

    my %seen;
    $/ = "\n\n";    # read one blank-line-separated record at a time
    while (<>) {
        chomp;
        my ($idline, @letters) = split /\n/, $_;
        my @fields = split /\|/, $idline;
        if (my $record = $seen{$fields[0]}) {
            push @{ $record->{letters} }, @letters;
        }
        else {
            $seen{$fields[0]} = { idline => $idline, letters => [ @letters ] };
        }
    }
    for my $k (keys %seen) {
        print join("\n", $seen{$k}{idline}, @{ $seen{$k}{letters} }), "\n\n";
    }