Re: Matching data between huge files

Update: Disregard, makes no sense. BrowserUk's solution is much better.

~~If you want to do it multiple times, the best solution could be to load it into a relation database (think postgres, mysql) and do a join over the two tables.~~

Comment on Re: Matching data between huge files

Replies are listed 'Best First'.
Re^2: Matching data between huge files by Corion (Patriarch) on Aug 27, 2008 at 08:00 UTC
In my experience, this is only true if you need to partition both sides into three sets: The elements in both sets The elements only in the left set The elements only in the right set You can formulate these as three SQL queries: `SELECT left.keycol as onlyleft FROM left LEFT JOIN right ON left.keycol = right.keycol WHERE right.keycol IS NULL SELECT right.keycol as onlyright FROM right LEFT JOIN left ON left.keycol = right.keycol WHERE left.keycol IS NULL SELECT right.keycol as onlyright FROM right INNER JOIN left ON left.keycol = right.keycol` [download] But you can get the same results by using a hash and "colliding out" the elements from the hash and then looking what's left over. The SQL approach does not handle duplicate keys well, while the hash-based approach is extended trivially to handle duplicate keys by storing arrayrefs of the rows instead of just using the keys as indicators. I have to confesss, I've done lots and lots of such comparison programs. I haven't collected together the code into a released framework yet, because there are different business rules for when two rows are to be considered equal/identical, and I haven't found a convenient way to hide that in a framework.	[reply] [d/l]

Replies are listed 'Best First'.

Re^2: Matching data between huge files
by Corion (Patriarch) on Aug 27, 2008 at 08:00 UTC

In my experience, this is only true if you need to partition both sides into three sets:

The elements in both sets
The elements only in the left set
The elements only in the right set

You can formulate these as three SQL queries:

SELECT left.keycol as onlyleft
FROM left LEFT JOIN right ON left.keycol = right.keycol
WHERE right.keycol IS NULL


SELECT right.keycol as onlyright
FROM right LEFT JOIN left ON left.keycol = right.keycol
WHERE left.keycol IS NULL


SELECT right.keycol as onlyright
FROM right INNER JOIN left ON left.keycol = right.keycol
[download]

But you can get the same results by using a hash and "colliding out" the elements from the hash and then looking what's left over. The SQL approach does not handle duplicate keys well, while the hash-based approach is extended trivially to handle duplicate keys by storing arrayrefs of the rows instead of just using the keys as indicators.

I have to confesss, I've done lots and lots of such comparison programs. I haven't collected together the code into a released framework yet, because there are different business rules for when two rows are to be considered equal/identical, and I haven't found a convenient way to hide that in a framework.

[reply]
[d/l]