
> Whenever you want the unique members of a data-set, think about using a hash
When you want the pairwise unique members of a serial set, think about a state variable.

If you need uniqueness across an entire set, there is no question that hashes are the most useful tool. The problem, though, is that you then have to store every key in memory.
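To make that storage cost concrete, here is a minimal sketch of the hash approach. It assumes tab-separated lines with the key in the first field; the sub name unique_by_key and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Return the lines whose key (the text before the first tab) has not
# been seen before. %seen gains one entry per distinct key, and that
# is the memory cost of deduplicating across the whole set.
sub unique_by_key {
    my @lines = @_;
    my %seen;
    return grep {
        my ($key) = /^([^\t]*)\t/;
        defined $key && !$seen{$key}++;
    } @lines;
}

# "a" shows up twice, non-adjacent; only its first line survives.
print unique_by_key("a\t1\n", "b\t2\n", "a\t3\n");
```

This prints the "a\t1" and "b\t2" lines. Duplicates are caught wherever they occur, which is exactly why every key has to be remembered.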

It is not uncommon to want to dedup only successive runs of duplicates (think Unix's 'uniq'). That's when this second class comes into play. Set a state variable, and read one line at a time. You may have to keep around the previous line or two to compute your state. You may have to do some work at the end to flush any stored lines.

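    my $thisKey  = '';
    my $lastLine = <>;    # first line is header, so always print
    my $lastKey  = '';    # empty, so the first real key forces the header out
    while (<>) {
        if (/^(.*?)\t/) {
            $thisKey = $1;
        }
        else {
            # key is left unchanged, so the line counts as part of the current run
            warn "bad data: line $. had no tab\n";
        }
        # key changed: the stored line was the last one of its run
        if ($thisKey ne $lastKey) {
            print $lastLine;
        }
        $lastLine = $_;
        $lastKey  = $thisKey;
    }
    print $lastLine if defined $lastLine;    # flush the last line of the final run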
This is a big win when you have millions and millions of entries to sift through, because you only ever hold one line and one key in memory, no matter how long the input is.
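
For comparison, here is a rough sketch of the variant that keeps the first line of each run instead, which is closer to what 'uniq' does. The only state is the previous key, and there is nothing to flush at the end. The sub name first_of_each_run and the in-memory sample data are assumptions for illustration; in real use you would pass it an ordinary filehandle.

```perl
use strict;
use warnings;

# Copy the first line of every run of equal keys from $in to $out.
# The key is the text before the first tab. The only state carried
# from one line to the next is $lastKey.
sub first_of_each_run {
    my ($in, $out) = @_;
    my $lastKey;
    while (my $line = <$in>) {
        my ($key) = $line =~ /^([^\t]*)\t/;
        next unless defined $key;    # skip lines without a tab
        print {$out} $line if !defined $lastKey || $key ne $lastKey;
        $lastKey = $key;
    }
    return;
}

# Demo on an in-memory filehandle.
my $data = "a\t1\na\t2\nb\t3\na\t4\n";
open my $in, '<', \$data or die "cannot open in-memory input: $!";
first_of_each_run($in, \*STDOUT);
```

This prints the "a\t1", "b\t3" and "a\t4" lines. The second run of "a" is kept because it is not adjacent to the first one; that is the trade-off you accept in exchange for not storing every key.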