in reply to Removing repeated lines from file

After thinking about this WAY too long, two answers came to me: one kind of obscure, the other much simpler.

The first one used set theory and recursion. It went like this:

    Until dataset is 1 line
        Split dataset into two halves
        Take intersection of sets
        Store intersection in duplicate list
        Split each dataset into two datasets, and repeat
    end
    Open original dataset file
    Until EOD
        read line
        compare to list of known duplicates
        if in that list
            if duplicate flag not marked
                emit line to output
                mark duplicate as emitted
            endif
        else
            emit line on output
        endif
    end
I thought this was a pretty cool way to generate a list of duplicates. I believe there are modules on CPAN which can do this kind of set operation.
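Here is a rough Perl sketch of that first idea, the way I read it: a repeated line either spans the split (caught by the intersection of the two halves) or sits entirely inside one half (caught by the recursion). For clarity it slurps everything into memory, which a truly huge dataset wouldn't allow; the point is the set logic.

    use strict;
    use warnings;

    # Return each line that appears more than once in @_.
    sub find_dups {
        my @lines = @_;
        return () if @lines <= 1;

        my $mid   = int( @lines / 2 );
        my @left  = @lines[ 0 .. $mid - 1 ];
        my @right = @lines[ $mid .. $#lines ];

        # Intersection of the halves: lines present in both are duplicates.
        my %in_left = map { $_ => 1 } @left;
        my %dups;
        $dups{$_} = 1 for grep { $in_left{$_} } @right;

        # Recurse to catch duplicates that lie within a single half.
        $dups{$_} = 1 for find_dups(@left), find_dups(@right);
        return keys %dups;
    }

    # Second pass over the original data: emit each duplicate only once.
    my @data = <>;
    my %dup  = map { $_ => 1 } find_dups(@data);
    my %emitted;
    for my $line (@data) {
        next if $dup{$line} && $emitted{$line}++;
        print $line;
    }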

Then I realized it should be much easier:

    Sort a copy of the datafile
    Open sorted copy
    Until EOD
        Read line
        Compare to previous line
        If line == previous line
            if line not in duplicate table
                put line in duplicate table
            endif
        else
            previous line = line
        endif
    end
    Open original data file
    Until EOD
        read line
        if line in duplicate table
            if duplicate not marked
                emit line on output
                mark duplicate line
            endif
        else
            emit line on output
        endif
    end
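A minimal Perl sketch of this second approach (the copy is sorted in memory here; for a file too big for that, an external sort(1) into a temporary file would do the same job):

    use strict;
    use warnings;

    my $file = shift or die "usage: $0 datafile\n";

    # Pass 1 over a sorted copy: any line equal to its predecessor
    # goes into the duplicate table.
    open my $in, '<', $file or die "$file: $!";
    my @sorted = sort <$in>;
    close $in;

    my %dup;
    my $prev;
    for my $line (@sorted) {
        $dup{$line} = 1 if defined $prev && $line eq $prev;
        $prev = $line;
    }

    # Pass 2 over the original order: emit each duplicate only once.
    open $in, '<', $file or die "$file: $!";
    my %emitted;
    while ( my $line = <$in> ) {
        next if $dup{$line} && $emitted{$line}++;
        print $line;
    }
    close $in;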

Both of these have the advantage of needing to store only the duplicate lines. Both have the disadvantage of reading through the input set multiple times.

Although the first solution seems more "cool" to me, the second is certainly more practical and likely faster (unless the dataset is so large you can't sort it either).

Re^2: Removing repeated lines from file
by Aristotle (Chancellor) on Jun 30, 2003 at 22:52 UTC
    Your second solution is exactly what the proposed solutions using a hash all do.

    Makeshifts last the longest.

      It looked to me like they put the entire file contents in a hash. Mine only puts duplicate lines in the hash.
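      For contrast, the usual one-pass hash idiom keeps every distinct line as a key, so its memory grows with the whole file rather than with just the duplicates:

          perl -ne 'print unless $seen{$_}++' datafile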