Re: perl ST sort performance issue for large file?
by BrowserUk (Patriarch) on Mar 30, 2012 at 11:48 UTC
As already identified, the ST works fine for smallish datasets, but consumes prodigious amounts of memory for the millions of small anonymous arrays, when used on large datasets.
In addition to that, you are compounding things by creating multiple, huge, stack-based lists in the following line:
for $key ( map  { $_->[0] }                                     ## list1
           sort { $a->[1] cmp $b->[1] || $a->[2] cmp $b->[2] }  ## list2
           map  { [ $_, (split)[0], (split)[2] ] }              ## list3 -- note the duplicated split
           keys %hash )                                         ## list4
And duplicating effort by performing the same split twice for every record.
If you moved your code to using a GRT (see A brief tutorial on Perl's native sorting facilities.), it would probably use about 1/4 of the memory, and run in about 1/10th of the time.
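For your case (two string keys pulled from each hash key), a minimal GRT sketch -- untested against your data, and assuming the two key fields never contain a NUL byte -- might look like:

    my @sorted =
        map  { ( split /\0/, $_, 3 )[2] }        # undecorate: drop the packed keys
        sort                                     # plain string sort, no sortsub
        map  { join "\0", (split)[0, 2], $_ }    # decorate: "field0\0field2\0record"
        keys %hash;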
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
salva's suggestion of using the system sort followed by duplicate line consolidation seems the ideal way to go. It's a multi-key sort, but the keys don't seem wonky to the point that sort would choke on them; even if they were, an intermediate processing step to produce a 'normalized' file to feed to sort would, while a bit tedious, be fairly straightforward, quick and highly scalable. Am I missing something?
salva pretty much wrote the book on non-native sorting, so I wasn't contradicting him. Only trying to explain to the OP why his code might be taking so long.
That said, in general it is faster to sort stuff internal to perl (assuming sufficient memory) than to shell out to do it -- otherwise there would be no logic in having a built-in sort at all.
With the OP having 12GB of RAM and wanting to sort a datafile of only 3GB, the GRT usually requiring minimal extra memory over the storage required to build the array in the first place, and sort's ability to do its thing in-place on modern perls, I would not be at all surprised -- though I would have to benchmark to be sure(*) -- if using the GRT for the OP's purpose beat using an external sort.
That is, unless you have access to a version of (gnu) sort that isn't stuck in the stone age, using only a few tens of MB when the machine it is running on has tens of GB available.
And as the OP is a Windows user, unless he has found a version (of gnu sort) that I haven't succeeded in finding, he's out of luck on that score. Whilst the Windows-supplied sort utility is very capable, and will use as much memory as is available, it is limited to a single sort key.
(*) Which I would do, but I'm currently about 10% of the way through thrashing the life out of my machine running something that will take around 60 hours to complete.
Re: perl ST sort performance issue for large file?
by salva (Canon) on Mar 30, 2012 at 11:28 UTC
The ST uses huge amounts of memory and that may be pushing your script into swap usage.
Anyway, it seems to me that in your particular case, just sorting the lines unprocessed will produce the same result:
for my $key (sort keys %hash) {
    print FH1_sorting "$key -> $hash{$key}";
}
You could also use the external sort program and consolidate duplicates afterwards. Or use some sorting module such as Sort::External or Sort::Key.
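If you do want the two-field ordering, Sort::Key computes the key once per element instead of inside a comparison sub; a rough sketch (untested against your data, and assuming the key fields contain no NUL bytes):

    use Sort::Key qw(keysort);

    # $_ is the current element inside the block; the key is "field0\0field2"
    my @sorted_keys = keysort { join "\0", (split)[0, 2] } keys %hash;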
Just an update on this:
As of now I am trying with the sort option as mentioned by salva.
for my $key (sort keys %hash) {
    print FH1_sorting "$key -> $hash{$key}";
}
This is working. But the problem here is, it is consuming all of the system's 12GB of memory, and even after the sort, the script is not terminating, but shows memory consumption at 99% for an indefinite amount of time.
I have not looked at the GRT sort yet.
Is there any solution for this memory consumption of 99%?
Or should I concentrate on the GRT sort? Please advise.
It would be far easier to advise you if you would post the code you are using.
Often, the most innocent looking pieces of code conceal things that unnecessarily consume memory.
For example, the convenient: my @array = <$fh>; uses twice as much memory as:
my @array; my $n = 0;
$array[ $n++ ] = $_ while <$fh>;
Because in the first version, <$fh> first creates a list of the lines on the stack, which are then assigned to the array. For a brief time, you have two copies of the entire file in memory, with the obvious increase in total memory requirement.
In the second version, the array is populated directly thus avoiding that problem.
For the majority of uses, the first form is convenient and not a problem, but when you are butting your head against the capacity of your hardware, the change is worth the effort.
Then go for an external sort and do the duplicate consolidation as a postprocessing step:
open my $fh, '-|', 'sort', $filename or die;
chomp(my $line = <$fh>);
my ($prev_key, $prev_value) = split /,,/, $line;
while (<$fh>) {
    chomp;
    my ($key, $value) = split /,,/;
    if ($prev_key eq $key) {
        $prev_value .= $value;
    }
    else {
        print "$prev_key,,$prev_value\n";
        ($prev_key, $prev_value) = ($key, $value);
    }
}
print "$prev_key,,$prev_value\n";
Read the manual page for your OS sort utility in order to find out how to optimize its usage.
Or you could also use Sort::External.
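A minimal sketch of the Sort::External route (option and method names as per the module's synopsis -- check the docs for your installed version; $in is assumed to be a filehandle on the unsorted data):

    use Sort::External;

    my $sortex = Sort::External->new( mem_threshold => 2 ** 24 );
    $sortex->feed($_) while <$in>;      # lines spill to temp files past the threshold
    $sortex->finish;

    while ( defined( my $line = $sortex->fetch ) ) {
        print $line;                    # consolidate duplicates here, as above
    }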
Re: perl ST sort performance issue for large file?
by JavaFan (Canon) on Mar 30, 2012 at 11:33 UTC
split(/,,/);
I don't see any ",," in the sample data you gave.
Do realize, 3GB is a huge amount of data. It's going to take a long time, no matter what. Some things you may want to try out:
- Use a GRT instead of a ST. That will save many million times of entering and leaving a scope -- in fact, for a GRT no Perl code needs to be executed to determine which is "smaller".
- Use an external sort.
Oh, and in your map, you're splitting twice. Instead of writing map {[$_, (split)[0], (split)[2]]}, why not map {[$_, (split)[0,2]]}? But compared to the sort, this will only have a minor effect.
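Put together, the single-split version of your loop might look something like this (a sketch based on your posted line, untested):

    for my $key ( map  { $_->[0] }
                  sort { $a->[1] cmp $b->[1] || $a->[2] cmp $b->[2] }
                  map  { [ $_, (split)[0, 2] ] }    # one split, both key fields
                  keys %hash )
    {
        print FH1_sorting "$key -> $hash{$key}";
    }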
Re: perl ST sort performance issue for large file? (>GRT)
by tye (Sage) on Mar 30, 2012 at 13:45 UTC
I find that fast, flexible, stable sort has some important advantages over the GRT technique and it is just as fast. It will use less memory, especially if you have long records (which also means it can be faster than GRT). It is easier to code.
The problem with that -- as coded -- is that it will require 7 concurrent lists, each the same size as the input array.
Assuming an average 128-byte line length, that means it will need to have room for 175 million items on the stack, for a file that contains 25 million records?
The GRT -- coded in the same @out = map() sort map() @in; way -- would only require 5 times as many concurrent items.
And if the GRT is coded in three steps:
- Prefix added as the array is built:
my @data; $#data = 25e6; my $i = 0;
$data[ $i++ ] = pack "NNA*", m[(\d+)\D+(\d+)], $_ while <>;
- sort done in-place:
@data = sort @data;
- Prefix trimmed in-place:
substr $_, 0, 8, '' for @data;
It requires (for this example of two numeric keys) just 8 bytes/record extra memory (~6%).
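The reason a plain string sort is enough is that pack "N" produces fixed-width, big-endian byte strings, so byte-wise comparison of the prefixes agrees with numeric comparison of the unsigned 32-bit keys. A quick illustration:

    # byte-wise sorting of "N"-packed values matches numeric order
    print join " ", map { unpack "N", $_ } sort map { pack "N", $_ } 70, 5, 300;
    # prints: 5 70 300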
More items, smaller items. How that balances out depends (comparing the map+sort+map alternatives).
I've never seen what you've described proposed under the heading of "GRT" (even your own demonstration of GRT in A brief tutorial on Perl's native sorting facilities. lacks any of these improvements). It looks to be the most memory efficient way I've seen, though it also places several additional restrictions on how it can be used (so is less general-purpose).
Doing the same optimizations to my technique means 2 large lists instead of 1 large list but the contents of the lists are about the same total size. So, if we have a moderately short key for long records, then your above technique might be 10% overhead while mine (optimized) might be 20% overhead (Perl scalars not being thin on overhead). My technique (optimized) would allow the space for constructed sort keys to be freed when no longer in use (more important for larger keys if the sorted list persists long after the sorting).
Sorting large records based on a short key, I'm not sure I'd worry about any of your optimizations since even 6 extra small scalars can be a relatively small addition to one huge scalar.
Thanks for describing this approach. It seems it could make a significant difference in a lot of situations. You should write it up more prominently (or at least update your existing tutorial to cover this).
I was going to mention the advantage of not having to make a copy of each large record but thought maybe you had even avoided that penalty of the classic "decorate the original record" technique. But, looking again, I'm curious about that aspect of:
$data[ $i++ ] = pack "NNA*", m[(\d+)\D+(\d+)], $_ while <>;
That still involves an extra copying of each record. For long records, I've certainly seen that add up to a performance problem on some platforms. But that was so very long ago that I also wonder if the cost of copying a large string of bytes has become relatively low compared to the cost of other operations on modern platforms. Actually, no, I have seen that be a problem on a modern platform (in a logging system that kept copying strings repeatedly as more information was added that resulted in logging being the majority cost in the system).
For fixed-length records, you could avoid the extra copying via:
for ( $data[ $i++ ] ) {
    read STDIN, $_, $reclen, $keylen;
    substr( $_, 0, $keylen, pack "NN", m[(\d+)\D+(\d+)] );
}
(since "\0"x8 won't match \d -- for other cases, you can set pos and add a /g and a scalar context to the regex)
I wonder if that would ever make a noticeable difference in performance.
Given Perl's unfortunate significantly-slower-than-it-should-be implementation of readline (last I checked and except on ancient versions of Unix that are hardly even used these days), you'd probably get a bigger win by unrolling the classic GRT even further and rolling your own readline using large buffers (since sysread with large buffers and split has been repeatedly shown to be about 2x the speed of Perl's implemented-in-C readline, sadly).
Then you could combine the "copy one record out of the huge buffer" step with the "decorate record with key" step such that together they only copy the record once. But I suspect that would only matter if the cost of reading the records was not insignificant to the cost of sorting the records.
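A rough sketch of one way to do that ($fh is assumed to be a filehandle on the data; this still reads the whole file into @lines):

    my $buf = '';
    my @lines;
    while ( sysread( $fh, $buf, 1 << 20, length $buf ) ) {   # append to any leftover
        my @chunk = split /(?<=\n)/, $buf;                   # split, keeping newlines
        $buf = $chunk[-1] =~ /\n\z/ ? '' : pop @chunk;       # hold back a partial line
        push @lines, @chunk;
    }
    push @lines, $buf if length $buf;                        # final unterminated line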
I find that fast, flexible, stable sort has some important advantages over the GRT technique...
I'm confused. (And so...?) The approach described in the linked node seems to me to be the GRT strategy exactly: "... using the [Perl] sort function with packed sortkeys and without a sortsub." I've also seen this described as the "decorate-sort-undecorate pattern" and I'm sure there are a few other descriptions around, especially since GR apparently weren't the first to describe the GRT. In any event, can you discuss how the approach of fast, flexible, stable sort differs from 'classic' GRT?
Update: (Previous to tye's response below.) Well, sometimes reading a bit more helps dispel confusion. On closer inspection, it seems the linked node describes a technique for stabilizing an unstable fixed-field (No – doesn't have to be fixed-field!) sort; it doesn't directly bear upon GRT per se. Am I closer to the mark here?
It doesn't "decorate" the original records (thus saving memory) and what would be the "undecorate" step is invariant and so doesn't have to be coded each time. It also allows one to sort indices which can be quite useful, such as when one has multiple lists to be sorted in sync.
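A trivial illustration of the parallel-arrays case (just the idea of sorting indices, not the packed-key form):

    my @name = qw( carol alice bob );
    my @age  = ( 42, 37, 29 );

    my @idx = sort { $name[$a] cmp $name[$b] } 0 .. $#name;   # ordering of indices

    @name = @name[@idx];   # alice bob carol
    @age  = @age[@idx];    # 37 29 42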
I see now that one could view it as applying GRT to the list of indices. Though, using pack on the indices adds some minor advantages (especially if you use an old Perl where sort isn't stable).
I don't really care to split hairs about whether any particular person considers my technique to be something that they'd call "the GRT", but I have never seen anybody else suggest my technique under any name.
I can see my technique as combining three techniques, one of which is GRT (though that wasn't how I ever thought about it before now). But the selection of the three techniques and the details of precisely how to combine them and the fact that that combination has several important advantages (over any expression of GRT that I've seen) makes me consider this as something worth categorizing separately. Certainly, searching for "GRT" isn't enough to ensure somebody discovers this [combination of] technique[s] and the resulting advantages, IME.