"For one, sort reads its entire input before outputting a single character, so the sort process will grow not only comparably to the hash inside the perl process, it will in fact grow larger than the entire input file. It does considerably more work too - the single-pass approach doesn't need to sort the data since it uses a hash to keep words unique."

Notice that I was proposing an external sort, so your first objection is not correct. The external sort does not use much memory at all, only disk space. It will be slower than the single-pass script unless the hash runs out of memory, in which case it would be much faster.
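For concreteness, here is a rough sketch of the kind of first pass I have in mind (a sketch only: it assumes whitespace-separated words, one word per line fed to the pipeline, and the file names are just placeholders):

    # First pass (sketch): split the input into one word per line and
    # hand the stream to the external sort, keeping only the words that
    # occur more than once.  Memory use stays flat; only temporary disk
    # space grows with the size of the input.
    open my $sorter, '|-', 'sort | uniq -d > dups.txt'
        or die "cannot start sort pipeline: $!";
    open my $in, '<', 'input.txt' or die "cannot open input.txt: $!";
    while (<$in>) {
        print {$sorter} "$_\n" for split;
    }
    close $in;
    close $sorter or die "sort pipeline failed: $!";

The external sort does a disk-based merge sort, so its memory footprint stays small no matter how large the input gets.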
"Secondly, the hash you're creating in the second pass is exactly as large as the hash would be at the end of the single-pass script - they both contain all unique words in the file."

No, read again; my hash contains only the duplicated words. The words that are truly unique will never be in the hash at all. Of course, it's possible that all the words are duplicated at least once, in which case you are right.
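And a sketch of the matching second pass (same caveats as above; it also assumes the goal is to print each word once, in order of first appearance):

    # Second pass (sketch): load only the duplicated words into the hash,
    # then print every word the first time it is seen.  Truly unique
    # words never enter %dup, so they cost no memory at all.
    my %dup;
    open my $d, '<', 'dups.txt' or die "cannot open dups.txt: $!";
    while (<$d>) {
        chomp;
        $dup{$_} = 0;
    }
    close $d;

    open my $in, '<', 'input.txt' or die "cannot open input.txt: $!";
    while (<$in>) {
        for my $word (split) {
            if (!exists $dup{$word}) {      # truly unique: print, never store
                print "$word\n";
            }
            elsif (!$dup{$word}++) {        # duplicated: print only the first time
                print "$word\n";
            }
        }
    }
    close $in;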
Also, I suspect that he really wants a list of all the unique words. If he doesn't care about the order, then "sort -u" may well be faster.
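For example, something along these lines (again just a sketch, with placeholder file names): perl only streams one word per line, and "sort -u" does the deduplication on disk.

    perl -lne 'print for split' input.txt | sort -u > unique_words.txt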
In reply to Re: Re^2: Removing repeated words by Thelonius
in thread Removing repeated words by abitkin