I was going to mention the advantage of not having to make a copy of each large record but thought maybe you had even avoided that penalty of the classic "decorate the original record" technique. But, looking again, I'm curious about that aspect of:

$data[ $i++ ] = pack "NNA*", m[(\d+)\D+(\d+)], $_ while <>;

That still involves an extra copy of each record. For long records, I've certainly seen that add up to a performance problem on some platforms. But that was so long ago that I wondered whether copying a large string of bytes has become cheap relative to other operations on modern platforms. Actually, no: I have seen it be a problem on a modern platform, in a logging system that kept re-copying strings as more information was appended, to the point that logging became the majority cost in the system.
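For context, here is a minimal sketch of the full decorate/sort/undecorate round trip around that line. The sub name grt_sort and the sample records in the usage note are made up for illustration, and it assumes both keys fit in 32 bits, since "N" packs an unsigned 32-bit big-endian integer.

```perl
use strict;
use warnings;

# Sketch of the GRT (hypothetical helper, not the code from the thread).
# Assumes every record contains two integers that each fit in 32 bits.
sub grt_sort {
    my @data;
    my $i = 0;

    # Decorate: prefix each record with its two keys as big-endian 32-bit
    # integers. This is the step that copies every record one extra time.
    $data[ $i++ ] = pack "NNA*", m[(\d+)\D+(\d+)], $_ for @_;

    # Sort as plain strings (no comparison block), then strip the 8-byte key.
    return map { substr $_, 8 } sort @data;
}
```

For example, grt_sort("10 apples 3\n", "2 pears 40\n", "2 plums 7\n") should return the "2 plums" record first, then "2 pears", then "10 apples".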

For fixed-length records, you could avoid the extra copying via:

for(  $data[$i++]  ) {
    read STDIN, $_, $reclen, $keylen;
    substr( $_, 0, $keylen, pack "NN", m[(\d+)\D+(\d+)] );
}

(That works as written because the "\0"x8 padding that read leaves at the front of the record can't match \d. For other cases, you can set pos and add a /g and a scalar context to the regex.)
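A sketch of that pos variant, in case the padding could match your key regex. The sub name read_decorated and the sizes in the usage note are assumptions for illustration; the key length has to be 8 to hold the two packed "N" values.

```perl
use strict;
use warnings;

# Hypothetical helper: read fixed-length records from $fh, leaving $keylen
# bytes of "\0" padding in front, then overwrite the padding with the key.
sub read_decorated {
    my ( $fh, $reclen, $keylen ) = @_;
    my @data;
    my $i = 0;
    until ( eof $fh ) {
        for ( $data[ $i++ ] ) {    # alias $_ to the new array element
            # Read the record in place at offset $keylen.
            read $fh, $_, $reclen, $keylen;

            # Start the match just past the padding: with /g in scalar
            # context the search begins at pos($_).
            pos($_) = $keylen;
            if ( m[(\d+)\D+(\d+)]g ) {
                substr( $_, 0, $keylen, pack "NN", $1, $2 );
            }
        }
    }
    return @data;
}
```

Calling read_decorated( $fh, 16, 8 ) on a handle holding 16-byte records would return 24-byte strings, each starting with its 8-byte packed key.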

I wonder if that would ever make a noticeable difference in performance.

Given Perl's unfortunately slower-than-it-should-be implementation of readline (last I checked, and except on ancient versions of Unix that are hardly used these days), you'd probably get a bigger win by unrolling the classic GRT even further and rolling your own readline using large buffers. Sadly, sysread with large buffers plus split has repeatedly been shown to be about 2x the speed of Perl's implemented-in-C readline.

Then you could combine the "copy one record out of the huge buffer" step with the "decorate record with key" step so that together they copy the record only once. But I suspect that would only matter if the cost of reading the records were not insignificant compared to the cost of sorting them.
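A rough sketch of what I mean, with a made-up sub name (slurp_decorated) and a made-up default buffer size. It uses index rather than split so that carving a record out of the buffer and decorating it happen in one expression. It assumes newline-terminated records that each contain two integers fitting in 32 bits. I haven't measured how many copies perl really makes internally here, nor benchmarked it against readline.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh with sysread in big chunks and decorate
# each record as it is carved out of the buffer.
sub slurp_decorated {
    my ( $fh, $bufsize ) = @_;
    $bufsize ||= 1 << 20;    # assumed default of 1 MiB; tune to taste
    my ( @data, $eof );
    my $buf = '';
    until ($eof) {
        # Append the next chunk after any partial record still in $buf.
        my $got = sysread $fh, $buf, $bufsize, length $buf;
        die "sysread failed: $!" unless defined $got;
        $eof = $got == 0;

        my $start = 0;
        while ( $start < length $buf ) {
            my $nl = index $buf, "\n", $start;
            if ( $nl < 0 ) {
                last unless $eof;          # partial record: wait for more
                $nl = length($buf) - 1;    # last record lacks a newline
            }
            # Find the keys starting at this record's first byte.
            pos($buf) = $start;
            $buf =~ m[(\d+)\D+(\d+)]g or die "no keys found in record";
            push @data,
              pack( "NN", $1, $2 ) . substr( $buf, $start, $nl - $start + 1 );
            $start = $nl + 1;
        }
        substr( $buf, 0, $start, '' );     # drop the records already handled
    }
    return \@data;
}
```

The result can then go through the usual "sort, then substr off the first 8 bytes" steps. Note that sysread needs a real file descriptor, so this won't work on an in-memory filehandle.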

- tye

In reply to Re^3: perl ST sort performance issue for large file? (copying) by tye
in thread perl ST sort performance issue for large file? by rkshyam
