Thank you for this analysis. I have some points to raise and some questions. At the outset, I'd like to say that this isn't intended as a nitpicking exercise (although it may look like that in parts); instead, I'm more interested in learning from it.

"It will attempt to build 4 huge lists:"

Possibly wishful thinking on my part, but I had imagined that as each list was used up, its data would no longer persist and the memory it used could be reclaimed.

To clarify, the first map generates a list then passes the elements of that list onto sort which, in turn, generates another list which is passed to the second map, and so on. When, say, the second map is being processed, $_ is an alias into the sort list (so sort's data is still in use); however, at this point, the list generated by the first map is not being used by anything and its memory could, at least in theory, be reused without causing problems. Anyway, that was my thinking: I haven't seen any documentation indicating whether this is how it works - I could be entirely wrong.

"1. a list of integers, one for every record in the file. Input to the first map."

Quite right. I did think about this when coding it, but clearly misremembered what I'd previously read. Going back to "perlop: Range Operators", I see "... no temporary array is created when the range operator is used as the expression in foreach loops, ...": I had thought that was more of a general rule for generating lists with the ".." operator (clearly, it isn't). So in retrospect, "for (0 .. $#input_records) {...}", instead of "... 0 .. $#input_records;", would have removed this overhead and been a better choice.
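As a rough sketch of the foreach form: the records and the key positions below are made up for illustration (they are not the OP's data), and @keyed is a hypothetical name. The loop walks the range without first building a list of indices, although the keyed list itself is still built in full.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real records are not shown in this thread.
my @input_records = ('abc300', 'abc100', 'abc200');

# A range used as the foreach expression is iterated without creating a
# temporary array of indices (see perlop, "Range Operators").
my @keyed;
for my $i (0 .. $#input_records) {
    my $record = $input_records[$i];
    # Assumed key layout: a 3-character prefix, then the remainder.
    push @keyed, [ $i, substr($record, 0, 3), substr($record, 3) ];
}

print scalar(@keyed), " keyed entries\n";    # prints "3 keyed entries"
```

This only removes the index list; the list of anonymous arrays is the same size as before.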

"2. & 3. a list of anonymous arrays -- each containing two elements ... (map1 to sort and sort to map2)"

Actually, that's three elements (numbers): integer, string, string. Looking back at it, I can see that changing "substr $_, 3" to "0 + substr $_, 3" would have produced three integers which would've saved some memory.
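To illustrate the difference between the two forms, here is a small sketch using the core B module to look at the internal flags of each scalar. The record is hypothetical, and I'm assuming everything from offset 3 onward is digits; $as_string and $as_number are names I've introduced for the example.

```perl
use strict;
use warnings;
use B ();

# Hypothetical record; characters from offset 3 onward are assumed to be digits.
my $record = 'abc12345';

my $as_string = substr($record, 3);        # substr returns a string
my $as_number = 0 + substr($record, 3);    # adding 0 yields a number

# POK set: the scalar holds a string; IOK set: it holds an integer.
for my $pair ([ string => \$as_string ], [ number => \$as_number ]) {
    my ($label, $ref) = @$pair;
    my $flags = B::svref_2object($ref)->FLAGS;
    printf "%-6s POK=%d IOK=%d\n", $label,
        ($flags & B::SVf_POK) ? 1 : 0,
        ($flags & B::SVf_IOK) ? 1 : 0;
}
```

I'd expect the first scalar to report POK and the second IOK; how much memory that actually saves per element is something I haven't measured.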

"4. An ordered list of all the records. Output from second map, input to for."

That's not a list of the records, it's a list of integers (i.e. (0 .. $#input_records) after being sorted). Possibly what you meant, but I thought I'd just clear that up in case it wasn't.
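To make that concrete, here's a minimal sketch of a map-sort-map over indices. The records and sort keys are invented for the example and won't match the original code exactly; the point is only that what comes out of the second map is a list of integers, which is then used to look up the records.

```perl
use strict;
use warnings;

# Hypothetical records: a 3-character prefix followed by a number.
my @input_records = ('bbb20', 'aaa30', 'aaa10');

my @sorted_indices =
    map  { $_->[0] }                                      # keep the index only
    sort { $a->[1] cmp $b->[1] or $a->[2] <=> $b->[2] }   # prefix, then number
    map  { [ $_, substr($input_records[$_], 0, 3),
                 0 + substr($input_records[$_], 3) ] }
    0 .. $#input_records;

print "@sorted_indices\n";                    # prints "2 1 0"
print "@input_records[@sorted_indices]\n";    # prints "aaa10 aaa30 bbb20"
```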

"If the OPs records average 64 bytes, that would require:"

The OP has indicated "there are abbout 5 mil records (lines) which sum up to 20GB in total." [sic]. I appreciate that was possibly written while you were composing your reply. Anyway, that gives an average closer to 4,295 bytes (20 * 1024**3 / 5e6); and your calculated ~335 million records (instead of the given ~5 million records) will mean the results will be out by a couple of orders of magnitude (where that figure is used).
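For anyone wanting to check my arithmetic, this is all I did (taking 1GB as 1024**3 bytes; the variable names are just for this example):

```perl
use strict;
use warnings;

my $total_bytes  = 20 * 1024**3;    # 20GB, assuming 1GB = 1024**3 bytes
my $actual_count = 5e6;             # record count given by the OP
my $assumed_size = 64;              # average record size assumed in the parent post

my $avg_size      = $total_bytes / $actual_count;
my $implied_count = $total_bytes / $assumed_size;

printf "average record size: %.0f bytes\n", $avg_size;                   # 4295
printf "records implied by a 64-byte average: %d\n", $implied_count;     # 335544320
printf "overestimate factor: %.1f\n", $implied_count / $actual_count;    # 67.1
```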

In the dot points that followed (with the various calculations), I have some questions regarding where some of the numbers came from. Where appropriate, if you could point me to documentation (e.g. some section in perlguts), that would be preferable as I'd be interested in reading about general cases rather than just specifics relating to this question: that might also be a lot less typing for you.

-- Ken


In reply to Re^3: sorting type question- space problems by kcott
in thread sorting type question- space problems by baxy77bax
