in reply to Re: Out of Memory when generating large matrix
in thread Out of Memory when generating large matrix

The problem is a classic use case for counting with hashes, which others have demonstrated very well.

Delegating it to sort - be it the shell command or the Perl built-in - is a waste of resources and doesn't scale well.

Counting is O(n), but sorting is at best O(n log n). On top of that, your suggestion doubles the needed disc space and involves costly disc access time.

The time needed for a) merging, b) reading again and c) writing the sorted file can easily exceed that of just reading once and incrementing a hash.

Additionally, you are not saving much code, because the OP wants to choose the files dynamically.

If this were a one-shot problem, your approach could be OK'ish, but I doubt that's the case in Bio Inf.
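
For reference, a minimal sketch of what I mean, assuming one K-mer per line and the input files given on the command line (the OP's real record format may differ):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # one pass over all files named in @ARGV, one hash, no sorting
    my %count;
    while ( my $kmer = <> ) {    # <> reads every file in @ARGV in turn
        chomp $kmer;
        $count{$kmer}++;
    }

    print "$_\t$count{$_}\n" for keys %count;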

Last but not least

If it's possible to use sort, why not just use wc (word count)?

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Wikisyntax for the Monastery

Re^3: Out of Memory when generating large matrix
by Corion (Patriarch) on Mar 06, 2018 at 08:04 UTC
    Delegating it to sort - be it the shell command or the Perl built-in - is a waste of resources, which doesn't scale well.

    Except of course once the hash has become too large for memory. Then a disk-based mergesort (well, mergecount maybe) is the only thing that will still scale, provided that you have enough disk space to write the merged results.
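
    For illustration, the counting pass over an already sorted stream needs only constant memory; a sketch, assuming one key per line arrives sorted on STDIN (for example from an external sort):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # streaming "mergecount": equal keys are adjacent in sorted input,
        # so one remembered key and one counter suffice (O(1) memory)
        my ( $prev, $n );
        while ( my $key = <STDIN> ) {
            chomp $key;
            if ( defined $prev and $key eq $prev ) {
                $n++;
            }
            else {
                print "$prev\t$n\n" if defined $prev;
                ( $prev, $n ) = ( $key, 1 );
            }
        }
        print "$prev\t$n\n" if defined $prev;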

      Sure, but the OP is talking about counting thru 197 x 8000 records. That's at most 1.5 million hash entries when every entry is counted just once, and IIRC that would result in about 150 MB of RAM (at most, and that's the pathological case).

      IF the RAM wasn't even sufficient for counting, presorting a giant file wouldn't help. (wc could help, but an output with 1.5 million columns should be avoided...)

      I'd surely opt for a DB like SQLite.
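
      Roughly what I have in mind, assuming DBD::SQLite is available and again one K-mer per input line (table and file names are made up):

          #!/usr/bin/perl
          use strict;
          use warnings;
          use DBI;

          # keep the counts in an SQLite table, so the working set
          # can grow beyond RAM
          my $dbh = DBI->connect( 'dbi:SQLite:dbname=kmer_counts.db', '', '',
                                  { RaiseError => 1, AutoCommit => 0 } );
          $dbh->do('CREATE TABLE IF NOT EXISTS counts (kmer TEXT PRIMARY KEY, n INTEGER NOT NULL)');

          my $ins = $dbh->prepare('INSERT OR IGNORE INTO counts (kmer, n) VALUES (?, 0)');
          my $upd = $dbh->prepare('UPDATE counts SET n = n + 1 WHERE kmer = ?');

          while ( my $kmer = <> ) {          # files from @ARGV
              chomp $kmer;
              $ins->execute($kmer);
              $upd->execute($kmer);
          }
          $dbh->commit;

          # e.g. the 10 most frequent K-mers
          my $top = $dbh->selectall_arrayref(
              'SELECT kmer, n FROM counts ORDER BY n DESC LIMIT 10');
          print "$_->[0]\t$_->[1]\n" for @$top;
          $dbh->disconnect;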

      But I suppose only some thousands of the most frequent K-mers are of interest.

      (I can imagine a solution with a Hash of Hashes for counting, where only the most relevant hashes are kept in memory while the others are swapped out, but this would extend the scope of this thread.)

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Wikisyntax for the Monastery

Re^3: Out of Memory when generating large matrix
by Your Mother (Archbishop) on Mar 06, 2018 at 03:01 UTC
    Counting is O(n) but sort at best O(n log(n))

    Herr LanX! I believe you to be mistaken! I have it on the good authority of a 30-year veteran for whom the impossible is easy—and the legendary is been there done that—that sorting is, in point of factual experiences in those trenches of yore, actually and exactly more-or-less O(logn). I have no idea who you think you are to disagree with the wisdom of the ages! Harrumph!! Bully!!!1!

Re^3: Out of Memory when generating large matrix
by Anonymous Monk on Mar 06, 2018 at 13:13 UTC

    The OP's problem is counting unique elements. Sorting is implicit in that problem, whether by hashing (sorting into buckets according to an arbitrary hash function) or by a straight-up sort.

    Generic sorting is O(n log n), but there are also Counting Sort, Radix Sort, and other forms of distribution sorting that are of high utility in real-world applications. It is important to understand that hashing is not algorithmically superior to sorting; indeed, it is a specific form of sorting in disguise.

    For another example of a similar problem, and how sort can save the day, see Perl Hashes in C?.

      > but there is also Counting Sort, Radix Sort, and other forms of distribution sorting that are of high utility in real-world applications.

      These algorithms are all limited to elements of fixed length (which is the case here), but Sundial suggested unix sort, which knows nothing about the nature of the input.

      And did you try to look into the space complexity of your suggestions?

      Counting Sort requires 4**21 buckets. Seriously ???

      Radix Sort is still the best in this respect, but it is O(l*n) with l the length of the keys (here 21). Even worse, identical elements are in the worst case only identified in the last pass, which requires holding all elements in memory while shifting them all l times from bucket to bucket.

      A hash count can go thru different files without keeping them all in memory, because identical elements collapse into one count.

      Last but not least, a hash count is much easier to implement.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Wikisyntax for the Monastery

        Well, I glanced at "(space complexity)" in the title and thought there was a glimmer of hope for you yet.

        You have identified the problem area, but again deftly avoid seeing the light. Sorting (or deduplicating) is a problem with O(n log n) time complexity. If you have a hash function that successfully distributes the keys, you can cut down the problem and move some of the complexity into the space domain.

        Hashes are O(n) in both time and space complexity (list insertion). Streaming merge is O(1) in space complexity. Partial hashing is possible. Using a hash table of size k, you can modify the algorithm to achieve O(n log(n/k)) in time complexity and O(k) in space complexity. Increasing k scales well until you break out of the CPU caches, after which it scales rather poorly. I referenced another thread where someone ran into a brick wall trying to hash just 36M elements. Sort|uniq proved to be greatly superior in that case.
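
        A rough sketch of that partial hashing idea, with one key per input line; the cap $k and the spill format are arbitrary choices for illustration:

            #!/usr/bin/perl
            use strict;
            use warnings;
            use File::Temp qw(tempfile);

            # bounded-memory counting: keep at most $k distinct keys in RAM,
            # spill sorted partial counts to temp files, merge the spills at the end
            my $k = 1_000_000;                 # arbitrary cap on in-memory keys
            my ( %count, @spills );

            sub spill {
                return unless %count;
                my ( $fh, $name ) = tempfile( UNLINK => 1 );
                print {$fh} "$_\t$count{$_}\n" for sort keys %count;
                close $fh or die $!;
                push @spills, $name;
                %count = ();
            }

            while ( my $key = <> ) {           # one key per input line
                chomp $key;
                $count{$key}++;
                spill() if keys %count >= $k;
            }
            spill();

            # k-way merge of the sorted spill files, summing counts per key
            my @fh  = map { open my $f, '<', $_ or die "$_: $!"; $f } @spills;
            my @cur = map { scalar readline($_) } @fh;
            while ( grep { defined } @cur ) {
                my ($min) = sort map { defined($_) ? ( split /\t/ )[0] : () } @cur;
                my $sum = 0;
                for my $i ( 0 .. $#cur ) {
                    next unless defined $cur[$i];
                    my ( $key, $n ) = split /\t/, $cur[$i];
                    chomp $n;
                    if ( $key eq $min ) {
                        $sum += $n;
                        $cur[$i] = readline( $fh[$i] );
                    }
                }
                print "$min\t$sum\n";
            }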

        So far, you have

        • veered the topic into discussion of algorithmic complexity, where no-one really asked for it.
        • misapplied the big O notation. Big O does not say whether one solution is faster than another. It tells you how a problem scales.
        • made the incorrect statement that a hash based solution would scale better than merge sort. In practice, hashing does not scale beyond memory limits.
        • made some ludicrously inappropriate suggestion of using wc; this suggests you did not invest the cycles necessary to understand the problem, let alone offer a solution.
        • applied some technique (hashing) as a magic bullet, without a fundamental grasp of the subject matter. This is Cargo Cult by definition.

        By the way, I never argued that a hash count was unsuitable. By all means, ++$count{$key} if that works. But you chose to attack a broken clock, and forgot that a broken clock, too, is right two times a day.

      > Generic sort is O(n log n), but there is

      The best generic sorts are O(n log n) (the worst case matters); SD was seemingly talking about the command-line utility sort, which is generic.

      > It is important to understand that hashing is not algorithmically superior to sorting, indeed it is a specific form of sorting in disguise.

      I disagree, because as I already said, counting by hashing is one pass and loses any order information.

      Sorting is about ordering, and I don't see a way to do that in one pass.

      But I'd be interested to see your evidence of how the information loss from hashing can be compensated for ...

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Wikisyntax for the Monastery

      It is important to understand that hashing is not algorithmically superior to sorting, indeed it is a specific form of sorting in disguise.

      What utter baloney! Come out from your cloak and let's argue?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
      In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

        Hashing in a nutshell: apply hash function f() to the keys, bucket the data records accordingly. Where a radix sort would use part of the key directly (like a hash function that just masks bits), hashing picks a more complicated function. So there's a tradeoff. Your data is no longer sorted by the key, but by f(key). On the other hand, you get a flat distribution that makes the bucketing work.
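
        In code, the bucketing step might look roughly like this (the hash function below is a toy stand-in for f(), and the bucket file names are made up):

            #!/usr/bin/perl
            use strict;
            use warnings;

            # hash partitioning: distribute records over N bucket files by f(key);
            # each bucket can then be counted (or sorted) independently
            my $buckets = 16;

            sub f {                            # toy djb2-style hash, stands in for any flat f()
                my ($key) = @_;
                my $h = 5381;
                $h = ( $h * 33 + ord $_ ) & 0xFFFFFFFF for split //, $key;
                return $h % $buckets;
            }

            my @out;
            for my $i ( 0 .. $buckets - 1 ) {
                open $out[$i], '>', "bucket.$i" or die "bucket.$i: $!";
            }
            while ( my $key = <> ) {           # one record per line from @ARGV
                chomp $key;
                print { $out[ f($key) ] } "$key\n";
            }
            close $_ for @out;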

        Can you truly not see the similarity between distribution sort and hashing?

      > counting unique elements

      What is that supposed to mean?

      The elements are by definition unique IFF their count is one!

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Wikisyntax for the Monastery

        It was supposed to mean that a count is obtained for each of the (unique) elements in a set. (Not the number of unique elements.) It doesn't really matter though; what is important is that a set of unique elements has to be constructed.

        Constructing a set of unique elements is the same (algorithmically) as sorting, with the proviso that an ordered comparison function is available (i.e. the set can be ordered). If only comparison for equality is available, the construction becomes O(n*n).