Comparing large files

Herbert37 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Comparing large files by BrowserUk (Patriarch) on Feb 11, 2014 at 19:30 UTC
"I have two large (10Mg plus)"
10 milligram files? Must be light words :)
Assuming you mean 10 million, and you have at least 2GB of RAM, then: load the first file of words into a hash, then read the second file line by line and check whether each word is in the hash. Don't forget to chomp the newlines.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
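A minimal sketch of the load-then-lookup approach described above, assuming one word per line in both files. The name `missing_words` is hypothetical, and the function takes already-opened filehandles so it can be tried on any input.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the words read from
# $check_fh that never appear on $known_fh. Assumes one word per line.
sub missing_words {
    my ($known_fh, $check_fh) = @_;

    # Load every word of the first file into a hash for fast lookups.
    my %known;
    while (my $word = <$known_fh>) {
        chomp $word;
        $known{$word} = 1;
    }

    # Stream the second file and collect the words the hash does not hold.
    my @missing;
    while (my $word = <$check_fh>) {
        chomp $word;
        push @missing, $word unless exists $known{$word};
    }
    return @missing;
}
```

To use it, open both files with three-argument `open` and pass the two handles; the result is the list of words found only in the second file, in file order.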
Re: Comparing large files by LanX (Saint) on Feb 11, 2014 at 19:30 UTC
If 10 "Mg" is just the file size, that works out to roughly 1e6 words. IIRC one hash entry costs approx. 100 bytes of overhead, so putting them all in a hash should be feasible even on my pitiful netbook.
Parse the pronunciation file line by line and build a lookup hash. Then parse the other file line by line and look for missing entries.
If you really have RAM problems, try splitting the hash into several disjoint ones (say, one for every 10% of the file) and parse the second file once for each hash.
It shouldn't take longer than seconds (minutes at most).
HTH! :)
Cheers Rolf
( addicted to the Perl Programming Language)
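The split-hash fallback could be sketched as follows. This is only one way to do it: the name `missing_words_chunked` and the `$chunk_size` parameter are hypothetical, and the bit string that remembers which lines of the second file have been matched is an addition the post does not mention. It assumes one word per line and a seekable second filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: like a plain hash lookup, but holds at most
# $chunk_size distinct words of the first file in memory at any time.
sub missing_words_chunked {
    my ($known_fh, $check_fh, $chunk_size) = @_;
    my $found = '';    # bit i is set once line i of $check_fh has matched

    while (1) {
        # Build a lookup hash from the next slice of the first file.
        my %chunk;
        while (scalar(keys %chunk) < $chunk_size
               and defined(my $word = <$known_fh>)) {
            chomp $word;
            $chunk{$word} = 1;
        }
        last unless %chunk;    # first file exhausted

        # Re-read the second file once for this slice.
        seek($check_fh, 0, 0) or die "Cannot rewind: $!";
        my $i = 0;
        while (my $word = <$check_fh>) {
            chomp $word;
            vec($found, $i, 1) = 1 if exists $chunk{$word};
            $i++;
        }
    }

    # Final pass: report the lines that no slice ever matched.
    seek($check_fh, 0, 0) or die "Cannot rewind: $!";
    my @missing;
    my $i = 0;
    while (my $word = <$check_fh>) {
        chomp $word;
        push @missing, $word unless vec($found, $i, 1);
        $i++;
    }
    return @missing;
}
```

The bit string costs one bit per line of the second file, so even ten million lines need only a little over a megabyte, at the price of one extra read of that file at the end.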
Re: Comparing large files by wjw (Priest) on Feb 11, 2014 at 19:41 UTC
First, I would start the other way around: look for words in the words-only file that match those with pronunciations. The assumption is that the smaller set is going to be those with pronunciations.
pronunciation-words -> words
instead of
words -> pronunciation-words
Next, I think I would look for uniqueness. With a 10Mg file, it is hard to imagine that some words are not in there more than once. That could reduce the whole thing substantially. (I guess I could be way wrong there... but...)
The other thing I might look at, if this is not a one-off type of thing, is using a database if one is handy. Otherwise, pumping comparisons into a simple hash like `$words{$word} = $pronunciation` does a lot of this for you.
Hope that is somewhat helpful..
...the majority is always wrong, and always the last to know about it...
Insanity: Doing the same thing over and over again and expecting different results.
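A rough sketch of the `$words{$word} = $pronunciation` idea combined with the uniqueness check. The thread never shows the pronunciation file's layout, so the tab-separated "word, then pronunciation" format here is purely an assumption, and both function names are hypothetical.

```perl
use strict;
use warnings;

# ASSUMPTION: each line of the pronunciation file is "word<TAB>pronunciation".
# The thread does not state the real layout.
sub load_pronunciations {
    my ($fh) = @_;
    my %words;
    while (my $line = <$fh>) {
        chomp $line;
        my ($word, $pronunciation) = split /\t/, $line, 2;
        next unless defined $word && length $word;    # skip blank lines
        $words{$word} = $pronunciation;    # a later duplicate overwrites an earlier one
    }
    return \%words;
}

# Hypothetical helper: words from the word list with no pronunciation entry.
sub unpronounced_words {
    my ($words_fh, $words) = @_;
    my (%seen, @missing);
    while (my $word = <$words_fh>) {
        chomp $word;
        next if $seen{$word}++;    # each distinct word is checked only once
        push @missing, $word unless exists $words->{$word};
    }
    return @missing;
}
```

Because the hash keeps the pronunciation as its value, the same structure answers both "is this word known?" and "how is it pronounced?" without a second data file.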
Re^2: Comparing large files by Herbert37 (Novice) on Feb 12, 2014 at 20:44 UTC
Ah, in too much of a hurry... Turns out I am trying to replace a PostgreSQL database that already sort of does what I want to do (I created it myself), but I believe that Perl is a far more flexible and powerful tool. I can store my hashes, I believe, and once I have done that, I can do whatever I want... in my vague ideation. Any feelings about that? And thanks once again for the great help.
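On storing hashes: one option, not suggested by anyone in the thread, is the core `Storable` module, which writes a data structure to disk and reads it back. A minimal sketch, with hypothetical wrapper names:

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Hypothetical wrapper: write a hash reference to $path in Storable's binary format.
sub save_words {
    my ($words, $path) = @_;
    store($words, $path) or die "Cannot store to $path";
    return;
}

# Hypothetical wrapper: read the hash reference back from $path.
sub load_words {
    my ($path) = @_;
    return retrieve($path);
}
```

The file that `store` writes is tied to the machine's byte order; `Storable` also offers `nstore` for a portable format if the file has to move between systems.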
Re^3: Comparing large files by erix (Prior) on Feb 12, 2014 at 20:54 UTC
I don't really see what you are trying to do, but since you mention PostgreSQL: perhaps the hstore data type can help? It's a hash-like data type that is indexable (for read-mostly data, use a GIN index). Of course, Perl hashes will be much faster, if memory serves ;-)
Re^3: Comparing large files by wjw (Priest) on Feb 13, 2014 at 09:33 UTC
If your words are in your database already, along with your pronunciations, then your job just got much easier. I am just guessing here, as I don't know the DB schema you have... The approach remains the same whether you decide to use the DB or not. A database view will give you a long-term solution. Combine that with Perl and you can do anything you want very quickly... Based on your last post, you have found what you were looking for, which is what counts! Have a good one...
Re: Comparing large files by Laurent_R (Canon) on Feb 11, 2014 at 23:32 UTC
Hmm, not that easy... If your files are 10 MB, or probably even 100 MB, then a hash lookup is definitely the best solution. If your files are dozens of GB, then sorting them in order to compare them is the only solution I can think of. Between these two sizes, it is your call.
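For the sort-then-compare route, a sketch of the comparison step might look like this. It assumes both files were already sorted externally in plain byte order (for example with `LC_ALL=C sort`), so that the ordering agrees with Perl's `lt`. The function name is hypothetical, and it prints to a filehandle rather than returning a list so that nothing large is held in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: both inputs must already be sorted in byte order.
# Prints every line of $check_fh that does not occur in $known_fh.
sub print_missing_sorted {
    my ($known_fh, $check_fh, $out_fh) = @_;

    my $known = <$known_fh>;
    chomp $known if defined $known;

    while (my $word = <$check_fh>) {
        chomp $word;

        # Advance the first file until it catches up with the current word.
        while (defined $known and $known lt $word) {
            $known = <$known_fh>;
            chomp $known if defined $known;
        }

        print {$out_fh} "$word\n"
            unless (defined $known and $known eq $word);
    }
    return;
}
```

Each file is read exactly once and only one line of each is in memory at a time, which is why this scales to sizes where a hash no longer fits.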
Re: Comparing large files by Herbert37 (Novice) on Feb 12, 2014 at 05:15 UTC
Thank you all very much. Really and truly superb help. Thanks. As to Mg, well, I have written mg more often than Mb over the course of time, and I think Freud will give me a pass. Thanks again.