In my experience, this is only true if you need to partition both sides into three sets:
You can formulate these as three SQL queries:
SELECT left.keycol as onlyleft FROM left LEFT JOIN right ON left.keycol = right.keycol WHERE right.keycol IS NULL SELECT right.keycol as onlyright FROM right LEFT JOIN left ON left.keycol = right.keycol WHERE left.keycol IS NULL SELECT right.keycol as onlyright FROM right INNER JOIN left ON left.keycol = right.keycol
But you can get the same results by using a hash and "colliding out" the elements from the hash and then looking what's left over. The SQL approach does not handle duplicate keys well, while the hash-based approach is extended trivially to handle duplicate keys by storing arrayrefs of the rows instead of just using the keys as indicators.
I have to confesss, I've done lots and lots of such comparison programs. I haven't collected together the code into a released framework yet, because there are different business rules for when two rows are to be considered equal/identical, and I haven't found a convenient way to hide that in a framework.
In reply to Re^2: Matching data between huge files
by Corion
in thread Matching data between huge files
by est
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |