in reply to associations and sorting
Sounds to me like your sorting idea will work. The command-line sort can handle huge files. Sort both files. Now that they are in a predictable ascending order, open both and do a "jagged walk" down them. You are looking for equal values in two ascending streams, e.g. (1,2,4,6,8) vs (5,6,7,8). Start with 1 and 5. Read lines from the first file until you see 5 or something bigger; in this case 6 shows up. Then read lines from the second file until you see 6 or something bigger; here 6 appears in both, so that is a match. Then advance both files to their next numbers and repeat. You will also have to decide what to do if a number appears more than once in a file.
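Here is a rough sketch of that jagged walk, in Python just for illustration. The function name `common_keys` is made up, and it assumes both inputs are already in ascending order and never contain `None`. It handles repeats by reporting each matching value once, which is only one of several reasonable choices.

```python
def common_keys(a, b):
    """Yield values that appear in both ascending iterables a and b."""
    it_a, it_b = iter(a), iter(b)
    x = next(it_a, None)  # None marks end of stream (assumed never a real value)
    y = next(it_b, None)
    while x is not None and y is not None:
        if x < y:
            x = next(it_a, None)   # first stream is behind, advance it
        elif y < x:
            y = next(it_b, None)   # second stream is behind, advance it
        else:
            yield x                # same value in both streams
            match = x
            # Skip any repeats of the matched value on both sides.
            while x is not None and x == match:
                x = next(it_a, None)
            while y is not None and y == match:
                y = next(it_b, None)


print(list(common_keys([1, 2, 4, 6, 8], [5, 6, 7, 8])))  # [6, 8]
```

With real files you would pass in generators that read one line at a time and pull out column one, so neither file has to fit in memory.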
If you made a pass that added leading zeroes so that all numbers of interest have the same number of digits, you could use string comparison for gt, lt, eq instead of numeric comparison. There is also a bigInt module that handles ints larger than 32 bits.
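To show why the padding matters, here is a small Python sketch. The helper name `pad_key` and the width of 20 digits are assumptions for illustration, and it assumes the keys are non-negative integers written as plain digit strings.

```python
def pad_key(key, width=20):
    """Left-pad a digit string with zeroes so all keys have the same length."""
    return key.zfill(width)


a, b = "9", "10"
print(a < b)                    # False: unpadded strings compare character by character
print(pad_key(a) < pad_key(b))  # True: equal-width strings order like the numbers
```

The width has to be at least as large as the longest key in either file, otherwise the longer keys will still compare incorrectly.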
Update: I don't know how slow the sort will be, but since you are only interested in column one of both files, you could combine the two files and sort by the first column. Then you could make one pass through the sorted file and look for adjacent lines that have the same column one. The format varies between the two files (the second column is a number in one and a string in the other), so you could tell which file each line came from. In this case, numeric vs ASCII sort wouldn't matter, as you would just be looking for lines that share the same column one. I don't know how these sorting approaches would compare performance-wise with a DB-based solution.
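A rough Python sketch of the single-pass idea follows. The function name `keys_in_both` and the sample lines are invented, and it assumes whitespace-separated columns, at least two columns per line, and that an all-digit second column means the line came from the first file.

```python
from itertools import groupby


def keys_in_both(sorted_lines):
    """Yield each column-one value that occurs in lines from both files.

    Lines must already be sorted so that equal keys are adjacent.
    """
    rows = (line.split() for line in sorted_lines)
    for key, group in groupby(rows, key=lambda fields: fields[0]):
        # Tag each line by origin: numeric second column -> "A", otherwise "B".
        sources = {"A" if fields[1].isdigit() else "B" for fields in group}
        if sources == {"A", "B"}:
            yield key


# Hypothetical combined data from the two files.
combined = ["5 apple", "6 pear", "1 100", "6 200", "8 300", "8 fig"]
combined.sort(key=lambda line: line.split()[0])  # stands in for the external sort
print(list(keys_in_both(combined)))  # ['6', '8']
```

Because only equality of adjacent keys is tested, the keys are compared as strings here and the sort order itself does not need to be numeric.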