in reply to Most efficient record selection method?

It may be impossible to get an optimal result, but I think this heuristic might be quite successful:

Keep three hashes of arrays, one per field. Each hash is keyed by the values of its field and maps every value to an array of the CSV lines that contain it. Only lines that are part of the current minimal set are stored in these arrays.

Loop through the files with a module like Text::CSV or Parse::CSV.

For every line, check whether each of its three field values is already a key in the corresponding hash. If any of the three values is missing, add the line to all three hashes (i.e. push the line onto the array at $hash1{$field_value1}, and likewise for the other two).
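A minimal sketch of this first pass, assuming the lines have already been parsed into array refs of three fields (for example by Text::CSV). The name add_if_new and the sample rows are made up for illustration, and the three hashes are held in one array (@index) rather than as hash1, hash2 and hash3:

```perl
use strict;
use warnings;

# Add $rec (an array ref of three field values) to the three hashes of
# arrays if at least one of its values is not yet a key in its hash.
# Returns 1 if the line was kept, 0 if it was skipped.
sub add_if_new {
    my ($index, $rec) = @_;
    my $missing = grep { !exists $index->[$_]{ $rec->[$_] } } 0 .. 2;
    return 0 unless $missing;
    push @{ $index->[$_]{ $rec->[$_] } }, $rec for 0 .. 2;
    return 1;
}

my @index = ({}, {}, {});    # $index[0] plays the role of hash1, and so on
my @rows  = (
    [1, 'x', 'red'],
    [2, 'x', 'red'],     # kept: 2 is new in field 1
    [1, 'x', 'red'],     # skipped: all three values already seen
    [2, 'y', 'blue'],    # kept: y and blue are new
);
my @kept = grep { add_if_new(\@index, $_) } @rows;
print scalar(@kept), " lines kept\n";
```

Lines whose three values are all known already are only skipped here; deciding whether they are worth swapping in is the next step.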

If all three field values are already in the hashes, check whether adding this line would allow you to drop two or three other lines from the hashes. Let's call the three values of your line a, b and c. hash1 for a points to some line (a,x,y). Check whether x in hash2 has at least two lines in its array and y in hash3 has at least two lines. If so, (a,x,y) could be removed once the new line is added, because x and y stay covered by other lines and a is covered by the new line. Do the same for b and c. If you collected more than one line to remove, remove them and add the new line; removing only one would just trade one line for another. There is the complication that you could find the same line more than once in the three searches, and that two candidates may be each other's only backup for some value, so you have to be careful about the edge cases and re-check coverage before each removal.
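One possible way to code this check, as a sketch rather than a tested solution for the real data. It assumes the kept lines are held in a plain array ref and recounts coverage on every call instead of maintaining the three hashes incrementally; try_swap and the sample lines are hypothetical. Counts are decremented as soon as a line is dropped, which is how it sidesteps the edge case of two candidates backing each other up:

```perl
use strict;
use warnings;

my @FIELDS = (0, 1, 2);

# Try to add $new to the kept lines. Returns a new array ref if that lets
# at least two kept lines go, otherwise returns $kept unchanged.
sub try_swap {
    my ($kept, $new) = @_;

    # Count how many lines (kept lines plus $new) carry each value, per field.
    my @count = ({}, {}, {});
    for my $rec (@$kept, $new) {
        $count[$_]{ $rec->[$_] }++ for @FIELDS;
    }

    my (@stay, @dropped);
    for my $rec (@$kept) {
        # Only lines sharing a value with $new are candidates for removal.
        my $shares = grep { $rec->[$_] eq $new->[$_] } @FIELDS;
        # A value carried by fewer than two lines would be lost on removal.
        my $uncovered = grep { $count[$_]{ $rec->[$_] } < 2 } @FIELDS;
        if ($shares && !$uncovered) {
            $count[$_]{ $rec->[$_] }-- for @FIELDS;    # update before the next check
            push @dropped, $rec;
        }
        else {
            push @stay, $rec;
        }
    }
    return @dropped >= 2 ? [ @stay, $new ] : $kept;
}

my $kept  = [ [1, 'x', 'red'], [2, 'y', 'red'], [2, 'x', 'blue'] ];
my $after = try_swap($kept, [1, 'y', 'red']);
print join(',', @$_), "\n" for @$after;
```

In this example the new line (1,y,red) replaces both (1,x,red) and (2,y,red), so three kept lines shrink to two while every value of every field stays covered.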
