in reply to Contemplating some set comparison tasks
~~The idea is to read the data from the file once and store in a hash the keys, the first source found for each key, and the number of occurrences of that key. Since your data has 20,000 unique source values, the hash will have at most 20,000 entries, which is manageable (and even scalable to a certain extent).~~
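For concreteness, the crossed-out idea would look roughly like the single-pass sketch below. This is a hypothetical reconstruction, not the removed solution; the input file name and the tab-separated key/source layout are assumptions.

```perl
use strict;
use warnings;

# One pass over the file: for each key, remember the first source it was
# seen with and keep a running count of how many times it occurs.
my %seen;
open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($key, $source) = split /\t/, $line;   # assumed layout: key TAB source
    if (exists $seen{$key}) {
        $seen{$key}{count}++;                 # seen before: just bump the count
    } else {
        # first sighting: record which source it came from
        $seen{$key} = { source => $source, count => 1 };
    }
}
close $fh;

# %seen now maps each key to its first source and its occurrence count.
```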
Update: The paragraph I crossed out above is wrong. Therefore my solution, which worked on your small data, would not work for a very large data set. I have removed it from this post and will come back with a better one a bit later. I guess it is not a problem to remove the content, since nobody seems to have seen it (it had been there for only about 20 minutes).