According to Knuth's seminal book Sorting and Searching, an external merge sort has a complexity of O(n log(n)) and this will hold true for any data volume: it will never "hit the wall." Once the data has been [externally ...] sorted in this way, it now becomes trivial to know which URL-keys occur and also to know how many instances exist of each distinct value: a simple sequential read of the sorted file will tell you all of this at once. | [reply] |
| [reply] |