in reply to Removing repeated lines from file

After thinking about this WAY too long, two answers came to me: one kind of obscure, the other much simpler.

The first one used set theory and recursion. It went like this:

    Until dataset is 1 line
        Split dataset into two halves
        Take intersection of sets
        Store intersection in duplicate list
        Split each dataset into two datasets, and repeat
    end
    Open original dataset file
    Until EOD
        read line
        compare to list of known duplicates
        if in that list
            if duplicate flag not marked
                emit line to output
                mark duplicate as emitted
            endif
        else
            emit line on output
        endif
    end
I thought this was a pretty cool way to generate a list of duplicates. I believe there are modules on CPAN which can do this kind of set operation.
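Here is a rough Perl sketch of that first idea, the way I read it: a repeated line either spans the split (caught by the intersection of the two halves) or sits entirely inside one half (caught by the recursion). For clarity it slurps everything into memory, which a truly huge dataset wouldn't allow; the point is the set logic.

    use strict;
    use warnings;

    # Return each line that appears more than once in @_.
    sub find_dups {
        my @lines = @_;
        return () if @lines <= 1;

        my $mid   = int( @lines / 2 );
        my @left  = @lines[ 0 .. $mid - 1 ];
        my @right = @lines[ $mid .. $#lines ];

        # Intersection of the halves: lines present in both are duplicates.
        my %in_left = map { $_ => 1 } @left;
        my %dups;
        $dups{$_} = 1 for grep { $in_left{$_} } @right;

        # Recurse to catch duplicates that lie within a single half.
        $dups{$_} = 1 for find_dups(@left), find_dups(@right);
        return keys %dups;
    }

    # Second pass over the original data: emit each duplicate only once.
    my @data = <>;
    my %dup  = map { $_ => 1 } find_dups(@data);
    my %emitted;
    for my $line (@data) {
        next if $dup{$line} && $emitted{$line}++;
        print $line;
    }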

Then I realized it should be much easier:

    Sort a copy of the datafile
    Open sorted copy
    Until EOD
        Read line
        Compare to previous line
        If line == previous line
            if line not in duplicate table
                put line in duplicate table
            endif
        else
            previous line = line
        endif
    end
    Open original data file
    Until EOD
        read line
        if line in duplicate table
            if duplicate not marked
                emit line on output
                mark duplicate line
            endif
        else
            emit line on output
        endif
    end
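A minimal Perl sketch of this second approach (the copy is sorted in memory here; for a file too big for that, an external sort(1) into a temporary file would do the same job):

    use strict;
    use warnings;

    my $file = shift or die "usage: $0 datafile\n";

    # Pass 1 over a sorted copy: any line equal to its predecessor
    # goes into the duplicate table.
    open my $in, '<', $file or die "$file: $!";
    my @sorted = sort <$in>;
    close $in;

    my %dup;
    my $prev;
    for my $line (@sorted) {
        $dup{$line} = 1 if defined $prev && $line eq $prev;
        $prev = $line;
    }

    # Pass 2 over the original order: emit each duplicate only once.
    open $in, '<', $file or die "$file: $!";
    my %emitted;
    while ( my $line = <$in> ) {
        next if $dup{$line} && $emitted{$line}++;
        print $line;
    }
    close $in;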

Both of these have the advantage of needing to store only the duplicate lines. Both have the disadvantage of reading through the input set multiple times.

Although the first solution seems more "cool" to me, the second is certainly more practical and likely faster (unless the dataset is so large you can't sort it either).

Re^2: Removing repeated lines from file
by Aristotle (Chancellor) on Jun 30, 2003 at 22:52 UTC
    Your second solution is exactly what the proposed solutions using a hash all do.

    Makeshifts last the longest.

      It looked to me like they put the entire file contents in a hash. Mine only puts duplicate lines in the hash.
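      For contrast, the usual one-pass hash idiom keeps every distinct line as a key, so its memory grows with the whole file rather than with just the duplicates:

          perl -ne 'print unless $seen{$_}++' datafile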