A few tips that haven't been mentioned yet (this list isn't meant to be complete, either).

First, you didn't say whether your data is ordered or not. If it happens to be ordered by either field, then you do not need to put much effort at all into the dup check for that field. Just keep the last item found. That is all you need to compare against to see if the next item is a duplicate.
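
Here is a minimal sketch of that check. The record layout ([serial, phone] array refs) and the name dedup_sorted are made up for illustration, and it assumes the list is already sorted by the field being checked. In a real run you would do the same comparison inside your while (<$fh>) loop instead of on an array.

```perl
use strict;
use warnings;

# Keep only the first record of each run of equal keys.
# Assumes @$records is already sorted by the field at index $key_idx.
sub dedup_sorted {
    my ($records, $key_idx) = @_;
    my @unique;
    my $last;    # key of the most recent record we kept
    for my $rec (@$records) {
        my $key = $rec->[$key_idx];
        # In sorted input a duplicate can only sit right after its twin.
        next if defined $last && $key eq $last;
        push @unique, $rec;
        $last = $key;
    }
    return \@unique;
}

# Hypothetical sample data: [ serial, phone ], sorted by phone.
my $records = [
    [ 1, '555-0100' ],
    [ 2, '555-0100' ],
    [ 3, '555-0199' ],
];
my $unique = dedup_sorted($records, 1);
print join(',', @$_), "\n" for @$unique;
```

The only state carried between records is one scalar, so memory use does not grow with the size of the file.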

Of course, if they are not ordered, this will not be as good a solution. You should still consider it, though. For instance, if the files are semi-ordered (say about 5-10% of the records are out of place, but everything else is in the right order), you can still use the same routine, but treat the last value you kept as a sort of water mark: if you come across a value that is lower, it gets set to ++$water_mark.
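
One way to read the water mark idea, as a rough sketch: the function name assign_serials is hypothetical, the values are assumed to be numeric serials, and treating an equal value the same way as a lower one is my assumption so that exact duplicates get a fresh number too.

```perl
use strict;
use warnings;

# Walk a mostly-ascending list of serials. Anything that is not above
# the current water mark is replaced by the next value past the mark.
sub assign_serials {
    my @serials = @_;
    my $water_mark;    # highest serial handed out so far
    my @out;
    for my $serial (@serials) {
        if (!defined $water_mark || $serial > $water_mark) {
            $water_mark = $serial;          # in order: keep it as is
            push @out, $serial;
        }
        else {
            push @out, ++$water_mark;       # out of order or repeated: renumber
        }
    }
    return @out;
}

my @fixed = assign_serials(1, 5, 3, 6);
print "@fixed\n";    # prints "1 5 6 7"
```

The output is always strictly increasing, so no two records can end up with the same serial. The price is that any record arriving out of order loses its original number.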

Also, if the files are completely unordered, you may want to simply sort them beforehand. The up-front cost of the sort can easily be outweighed by the memory you save by not needing your hash.

Another, much simpler, method is to get rid of the serial numbers in the file completely and just start at 0 or 1 for the first record and count up. This only works if you don't care about your serial numbers changing each time you run the program. It is attractive because you can then use the previously mentioned idea of sorting by phone number to do your dup check on phone numbers, and avoid the dup check on serials entirely by making up your own.
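
A small sketch of the two ideas combined, assuming the records are already sorted by phone number. The [serial, phone] layout and the name renumber_records are hypothetical.

```perl
use strict;
use warnings;

# Input: array ref of [ old_serial, phone ], sorted by phone.
# Output: array ref of [ new_serial, phone ] with duplicate phones dropped
# and serials handed out as 1, 2, 3, ...
sub renumber_records {
    my ($records) = @_;
    my $next_serial = 1;
    my $last_phone;
    my @out;
    for my $rec (@$records) {
        my $phone = $rec->[1];    # the old serial in $rec->[0] is ignored
        next if defined $last_phone && $phone eq $last_phone;    # dup phone
        push @out, [ $next_serial++, $phone ];
        $last_phone = $phone;
    }
    return \@out;
}

my $renumbered = renumber_records([
    [ 17, '555-0100' ],
    [  4, '555-0100' ],
    [  9, '555-0199' ],
]);
print join(',', @$_), "\n" for @$renumbered;
```

Since the serials come from a counter, they are unique by construction, so there is nothing to look up in a hash for either field.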

Like I said, many caveats. But depending on what you are doing, these can really speed things up.

Oh, one other thing: if you have multiple files and you sort them beforehand, you can keep the files separate, but you'll need to open all of them at once so that you can always read from whichever one currently holds the lowest value.
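
Roughly, that merge could look like the sketch below. The name merge_sorted and the callback interface are made up for illustration, each input is assumed to be sorted as plain strings, and the demo uses in-memory filehandles so it runs without any real files.

```perl
use strict;
use warnings;

# Merge several already-sorted inputs, passing each line to $emit in
# overall sorted order. Only one pending line per handle is kept in memory.
sub merge_sorted {
    my ($handles, $emit) = @_;

    # Prime each source with its first line; skip sources that are empty.
    my @heads;
    for my $fh (@$handles) {
        my $line = <$fh>;
        next unless defined $line;
        chomp $line;
        push @heads, { fh => $fh, line => $line };
    }

    while (@heads) {
        # Find the source whose pending line is currently the lowest.
        my $min = 0;
        for my $i (1 .. $#heads) {
            $min = $i if $heads[$i]{line} lt $heads[$min]{line};
        }
        $emit->($heads[$min]{line});

        # Refill from that source, or retire it once it runs dry.
        my $fh   = $heads[$min]{fh};
        my $next = <$fh>;
        if (defined $next) {
            chomp $next;
            $heads[$min]{line} = $next;
        }
        else {
            splice @heads, $min, 1;
        }
    }
    return;
}

# Demo with two hypothetical pre-sorted "files" held in strings.
my @sources = ("555-0100\n555-0150\n", "555-0120\n555-0199\n");
my @handles;
for my $text (@sources) {
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    push @handles, $fh;
}
my @merged;
merge_sorted(\@handles, sub { push @merged, $_[0] });
print "$_\n" for @merged;
```

Because the lines come out in sorted order, the callback can do the same keep-the-last-item dup check as for a single sorted file. With a handful of files the linear scan for the lowest line is fine; with hundreds of them a heap would be the usual swap-in.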

Ciao,
Gryn


In reply to Somethings not mentioned yet by gryng
in thread Efficiency and Large Arrays by Kozz
