A limitation of the O notation is that doesn't take into account the cost of each operation.

Reading all of the data sequentially off of the disk for steps 1,2 is relatively fast. There can be trouble with step 3, because it involves seeking around within the files to gather up the records - reading randomly selected records usually takes a lot longer than reading sequential records. The "wall clock" time of a seek operation and rotational delays is in msec. One of the sort comparisons is in nsecs. Thousands of comparisons can be performed in the time it takes to do one seek.

Which way actually works out to be fastest "wall clock wise" in practice depends upon a lot of variables, including a lot of OS and file system configuration stuff (like details of the disk caching,etc.). As well as the details of exactly what kind of data format we are dealing with - size of records, how many, length of the file. Right now all we know is the the files are <1GB and will not fit into memory.

So like many things with programming, there are a lot of "yeah but's". The OP says he is having trouble solving the problem by any method. It could be that the run time doesn't matter and that what matters is the implementation effort. In that case, I would recommend the easiest to implement - whatever that is for the OP.

The observation that it is always going to be faster if all of the data can fit into memory is absolutely true. This is worth a considerable amount of coding effort if the task is to be performed often. I have no idea of how the OP is going to weigh the different trade-offs. Perhaps he will report back in after he has tried some of these ideas.


In reply to Re^2: Working on huge (GB sized) files by Marshall
in thread Working on huge (GB sized) files by vasavi

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.