Greetings, esteemed monks!

Interesting problem, and several very good suggestions so far. The only thing I would add is that if the kill list is big, it might speed things up to delete each serial number from the kill list as you find and delete it from the big file. That might make the tests on subsequent lines faster. I believe reasonablekeith's suggestion is similar, but it is contingent on the files being sorted.

I am not sure how well this would work: would he spend more time on the code that deletes the hash key (assuming he goes that route) than he would save? It would be just one extra line in the loop, executed only when a deletion happens.
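To make the "one extra line" concrete, here is a minimal sketch of the hash approach. The sub name, the filehandle arguments, and the record layout (serial number as the first whitespace-delimited field) are all my assumptions, not anything from the original question. Note also that removing a key from the kill list is only safe if each serial number appears at most once in the big file.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $in to $out, dropping every line whose
# serial number is a key of %$kill. Assumes the serial number is the first
# whitespace-delimited field and occurs at most once in the input.
sub filter_with_shrinking_kill_list {
    my ($in, $out, $kill) = @_;
    my $deleted = 0;
    while (my $line = <$in>) {
        my ($serial) = split ' ', $line;
        if (defined $serial && exists $kill->{$serial}) {
            delete $kill->{$serial};    # the one extra line: shrink the kill list
            $deleted++;
            next;                       # do not copy the dead row
        }
        print {$out} $line;
    }
    return $deleted;
}
```

Since Perl hash lookups are already roughly constant time on average, I would benchmark this against the plain version before trusting that the delete pays for itself. One side benefit: whatever keys are left in the hash at the end are the serial numbers that were never found in the big file.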

I would NOT use a regex with more than a handful of alternations--now THAT I am pretty sure would be significantly slower than a simple hash key lookup.

It's possible that the time spent sorting both lists beforehand would be less than the time the sort saves (i.e., sorting would be a net win). A possible additional benefit of sorting is that you could store the kill list in an array instead of a hash and just increment the array index whenever you delete the currently indexed serial number. For a more robust approach, if the kill list might contain numbers that aren't in the big file, increment the index whenever the serial number read from the file is greater than the currently indexed serial number to kill.
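A rough sketch of the sorted-array idea, using the more robust variant. It assumes both the big file and the kill list are sorted in ascending numeric order and that the serial number is the first whitespace-delimited field; the sub name and arguments are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: $kill is an array ref of serial numbers sorted ascending;
# $in yields lines sorted ascending by their first (numeric) field.
sub filter_sorted {
    my ($in, $out, $kill) = @_;
    my $i = 0;                          # index of the current serial number to kill
    while (my $line = <$in>) {
        my ($serial) = split ' ', $line;
        unless (defined $serial) {      # blank line: pass it through untouched
            print {$out} $line;
            next;
        }
        # Skip kill entries smaller than this serial; they are not in the file
        # (or have already been handled).
        $i++ while $i < @$kill && $kill->[$i] < $serial;
        # Drop the line if it matches the current kill entry.
        next if $i < @$kill && $kill->[$i] == $serial;
        print {$out} $line;
    }
    return;
}
```

In this version the index only advances when the serial number from the file has moved past the current kill entry, so a serial number repeated in the big file is still removed every time it appears.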

Also, if we're talking about spending time preparing the data to make the actual update process faster, the gzip idea might be of benefit, but I am less sure of that, especially if the big file is read in one line at a time.

_________________________________________________________________________________

I like computer programming because it's like Legos for the mind.


In reply to Re: 15 billion row text file and row deletes - Best Practice? by OfficeLinebacker
in thread 15 billion row text file and row deletes - Best Practice? by awohld
