After thinking about this WAY too long, two answers came to me: one kind of obscure, the other much more simple.

The first one used set theory and recursion. It went like this:

Until dataset is 1 line Split dataset into two halves Take intersection of sets Store intersection in duplicate list Split each dataset into two datasets, and repeat end Open original dataset file Until EOD read line compare to list of known duplicates if in that list if duplicate flag not marked emit line to output mark duplicate as emitted endif else emit line on output endif end
I thought this was a pretty cool way to generate a list of duplicates. I believe there are modules on CPAN which can do this kind of set operation.

Then I realized it should be much easier:

Sort a copy of the datafile Open sorted copy Until EOD Read line Compare to previous line If line == previous line if line not in duplicate table put line in duplicate table endif else previous line = line endif end Open original data file Until EOD read line if line in duplicate table if duplicate not marked emit line on output mark duplicate line end else emit line on output endif end

Both of these have the advantage of only needing to store the duplicate lines. Both have the disadvantage of having to read through the input set multiple times.

Although the first solution seems more "cool" to me, the second is certainly more practical and likely faster (unless the dataset is so large you can't sort it either).


In reply to Re: Removing repeated lines from file by husker
in thread Removing repeated lines from file by matth

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.