A more usual way to test for duplicates is to use a hash. Setting the input record separator ($/) to "@:" makes this easy, because each read then returns one whole record. There are a couple of ways to go about it, depending on whether order is important to you, whether you need to know when a duplicate was removed, and whether the entire file can be read into memory at one time.

    # Here is an untested example to get you started.
    # $file_name and $out_file are set somewhere else.
    open my $in,  '<', $file_name or die "Cannot read $file_name: $!";
    open my $out, '>', $out_file  or die "Cannot write $out_file: $!";

    local $/ = '@:';    # set the input record separator
    my %seen;
    while (my $record = <$in>) {
        if ($seen{$record}++) {    # undef the first time, then > 0
            print "Warning: duplicate found\n$record";
        }
        else {
            print {$out} $record;
        }
    }
    close $out or die "Cannot close $out_file: $!";
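If the whole file does fit in memory, the same hash idea can be written more compactly. The following is only a sketch under that assumption; dedupe_records is a name made up for illustration, and it takes the file contents as a single string. First-seen order is kept.

```perl
use strict;
use warnings;

# Hypothetical helper: return the records in $text with later duplicates
# dropped, keeping first-seen order. Assumes every record ends with '@:'.
sub dedupe_records {
    my ($text) = @_;
    my %seen;
    # Split just after each separator so every record keeps its trailing '@:'.
    my @records = split /(?<=\@:)/, $text;
    return grep { !$seen{$_}++ } @records;
}

# Example with three distinct records and one repeat.
my @unique = dedupe_records("a\@:b\@:a\@:c\@:");
print @unique, "\n";
```

To use it on a real file you would slurp the file first (for example with a localized undef $/) and print the returned list to the output handle.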

If the file is too big to hold in the hash, then perhaps use the first line of each record as the key and store the offset in the output file where that record was written. If you meet that key again, read back the record you wrote to see whether the entire record matches. Of course, the value under each key would have to be an array of offsets, to allow for multiple differing records with the same first line.
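The offset idea above could look roughly like this. It is a sketch, not tested against real data: dedupe_by_offset is a hypothetical name, the key is taken to be the first non-blank line of each record, and the output file is opened read/write so that earlier records can be read back.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);

# Hypothetical helper: copy records from $in_file to $out_file, skipping exact
# duplicates, while holding only first lines and file offsets in memory.
# Returns the number of duplicates skipped.
sub dedupe_by_offset {
    my ($in_file, $out_file) = @_;
    open my $in,  '<',  $in_file  or die "Cannot read $in_file: $!";
    open my $out, '+>', $out_file or die "Cannot open $out_file: $!";  # read/write, truncates

    local $/ = '@:';    # records end with '@:'
    my %offsets;        # first line => [ offsets of records in $out_file ]
    my $dups = 0;

    RECORD:
    while (my $rec = <$in>) {
        # Key: first non-blank line of the record (an assumption about the data).
        my ($key) = $rec =~ /\A\s*([^\n]*)/;

        # Compare against every record already written under the same key.
        for my $pos (@{ $offsets{$key} || [] }) {
            seek $out, $pos, SEEK_SET or die "seek failed: $!";
            my $old = <$out>;    # read the stored record back
            if (defined $old && $old eq $rec) {
                $dups++;
                next RECORD;
            }
        }

        # New record: go back to the end, remember where it starts, write it.
        seek $out, 0, SEEK_END or die "seek failed: $!";
        push @{ $offsets{$key} }, tell($out);
        print {$out} $rec;
    }

    close $out or die "Cannot close $out_file: $!";
    close $in;
    return $dups;
}
```

Called as dedupe_by_offset($file_name, $out_file), it trades extra seeks and reads for memory: only the first lines and offsets stay in the hash, so it helps most when first lines are short and reasonably distinctive.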

If you control the writing program, it may be better to fix that so it does not write duplicate records in the first place.

You would also get more out of the Monastery if you had a quick look at How do I post a question effectively?

Cheers,
R.

Pereant, qui ante nos nostra dixerunt!

In reply to Re: Regular Expression to find duplicate text blocks by Random_Walk
in thread Regular Expression to find duplicate text blocks by Anonymous Monk
