After thinking about this WAY too long, I came up with two answers: one kind of obscure, the other much simpler.

The first one used set theory and recursion. It went like this:

    Until dataset is 1 line
       Split dataset into two halves
       Take intersection of the two halves
       Store intersection in duplicate list
       Repeat on each half separately
    end
    Open original dataset file
    Until EOD
       read line
       if line in duplicate list
          if duplicate not marked as emitted
             emit line on output
             mark duplicate as emitted
          endif
       else
          emit line on output
       endif
    end

I thought this was a pretty cool way to generate a list of duplicates. I believe there are modules on CPAN which can do this kind of set operation.
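
For anyone who wants to play with it, here is a minimal sketch of that first approach in Python. It assumes the whole dataset fits in memory as a list of lines, and the names `find_duplicates` and `emit_unique` are made up for illustration; treat it as one possible reading of the pseudocode, not a tuned implementation.

```python
def find_duplicates(lines):
    """Return the set of lines that occur more than once in `lines`."""
    # Base case: zero or one line cannot contain a duplicate.
    if len(lines) <= 1:
        return set()
    mid = len(lines) // 2
    left, right = lines[:mid], lines[mid:]
    # A line present in both halves is a duplicate.
    dups = set(left) & set(right)
    # Duplicates confined to a single half are found by recursing into it.
    return dups | find_duplicates(left) | find_duplicates(right)


def emit_unique(lines, dups):
    """Second pass: emit each line once, keeping first-seen order."""
    emitted = set()  # duplicates that have already been written out
    out = []
    for line in lines:
        if line in dups:
            if line not in emitted:
                out.append(line)
                emitted.add(line)
        else:
            out.append(line)
    return out


data = ["a", "b", "a", "c", "b", "a", "d"]
result = emit_unique(data, find_duplicates(data))
```

Note that building a set for each half means this sketch holds the whole input in memory during the first phase; only the second pass gets by with just the duplicate list.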

Then I realized it should be much easier:

    Sort a copy of the datafile
    Open sorted copy
    Until EOD
       read line
       compare to previous line
       if line == previous line
          if line not in duplicate table
             put line in duplicate table
          endif
       else
          previous line = line
       endif
    end
    Open original data file
    Until EOD
       read line
       if line in duplicate table
          if duplicate not marked
             emit line on output
             mark duplicate line
          endif
       else
          emit line on output
       endif
    end
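
The sort-based version can be sketched the same way in Python. This is only an illustration under a couple of assumptions: the lines are held in a list and sorted in memory with `sorted()`, whereas a really large file would need an external sort of the copy, and the function names `find_duplicates_sorted` and `dedupe` are hypothetical.

```python
def find_duplicates_sorted(lines):
    """Return the set of lines that appear more than once, via a sorted copy."""
    dups = set()
    previous = None  # no previous line before the first one
    for line in sorted(lines):  # the original order is left untouched
        if line == previous:
            # In sorted order, repeats sit next to each other.
            dups.add(line)
        else:
            previous = line
    return dups


def dedupe(lines):
    """Emit each line once, in original order, using the duplicate table."""
    dups = find_duplicates_sorted(lines)
    marked = set()  # duplicates already emitted
    out = []
    for line in lines:
        if line in dups:
            if line not in marked:
                out.append(line)
                marked.add(line)
        else:
            out.append(line)
    return out


print(dedupe(["pear", "apple", "pear", "fig", "apple"]))
```

The duplicate table here only ever holds the repeated lines, which is the point of the exercise; the cost is the sort plus a second read of the original data.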

Both of these have the advantage that the lookup table only needs to hold the duplicated lines rather than every line seen. Both have the disadvantage of having to read through the input set more than once.

Although the first solution seems more "cool" to me, the second is certainly more practical and likely faster (unless the dataset is so large you can't sort it either).

In reply to Re: Removing repeated lines from file by husker
in thread Removing repeated lines from file by matth
