in reply to Comparing elements in Array of Hashes (AoH)
The first question in my mind is: of the 10,000 rows, how many duplicates actually exist? Let's start just with nodes: how many nodes, regardless of link, actually exist in both lists? Then, how about links, regardless of node? In both cases, which set of “hits” is smaller in the actual, observed data?
The device of choice is a hash. Loop through the first list and, for each key, set $$hash{$key1} = 0. Now, having processed the first list entirely, move on to the second list: if (exists($$hash{$key2})) { $$hash{$key2} = 1; }. When that second loop is finished, every hash-table entry having a value of 1 represents an entry which occurs in both lists. Notice that you can, in just these two loops, simultaneously find the present-in-both values for node and, separately, for link.
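Here is a minimal sketch of those two loops. It assumes each row in the two lists is a hashref with 'node' and 'link' fields; those field names, and the tiny sample data, are illustrative assumptions only:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical sample rows: hashrefs with 'node' and 'link' fields.
    my @list1 = (
        { node => 'A', link => 'x' },
        { node => 'B', link => 'y' },
        { node => 'C', link => 'z' },
    );
    my @list2 = (
        { node => 'B', link => 'y' },
        { node => 'D', link => 'x' },
    );

    my ( %node_seen, %link_seen );

    # First loop: record every node and every link seen in the first list.
    for my $row (@list1) {
        $node_seen{ $row->{node} } = 0;
        $link_seen{ $row->{link} } = 0;
    }

    # Second loop: flag the ones that also appear in the second list.
    for my $row (@list2) {
        $node_seen{ $row->{node} } = 1 if exists $node_seen{ $row->{node} };
        $link_seen{ $row->{link} } = 1 if exists $link_seen{ $row->{link} };
    }

    # Every entry whose value is 1 occurs in both lists.
    my @common_nodes = grep { $node_seen{$_} } keys %node_seen;
    my @common_links = grep { $link_seen{$_} } keys %link_seen;
    print "nodes in both: @common_nodes\n";   # B
    print "links in both: @common_links\n";   # x y (in some order)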
The next step, then, is to loop through both lists one final time. This time, you consider only rows whose node and link values are each known to be present in both lists. For those rows you construct a hash key that is a concatenation of both values, e.g. "$node::$link". The keys that go into this final hash-table are therefore the ones that are highly likely to be a double-match, and the same mark-then-check pass over the second list confirms which of them actually are.
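Continuing the sketch above, that final pass might look like this; the "$node::$link" composite key is the one described in the text, and the same mark-then-check idea is simply reused:

    my %pair_seen;

    # Consider only rows whose node AND link survived the presence test.
    for my $row (@list1) {
        next unless $node_seen{ $row->{node} } && $link_seen{ $row->{link} };
        $pair_seen{ "$row->{node}::$row->{link}" } = 0;
    }

    # Mark the composite keys that also occur in the second list.
    for my $row (@list2) {
        my $key = "$row->{node}::$row->{link}";
        $pair_seen{$key} = 1 if exists $pair_seen{$key};
    }

    # Keys with value 1 are genuine (node, link) double-matches.
    my @matches = grep { $pair_seen{$_} } keys %pair_seen;
    print "double-matches: @matches\n";       # B::y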
This strategy arrives at the answer after three full-length passes through the lists. Memory can be reclaimed in each case by deleting the hash entries whose value is not 1 once the second loop of each pair has finished.
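For instance, again continuing the sketch, the unmatched entries can be dropped with a hash-slice delete as soon as the marking loop is done:

    # Reclaim memory: keep only the entries that were marked with a 1.
    delete @node_seen{ grep { !$node_seen{$_} } keys %node_seen };
    delete @link_seen{ grep { !$link_seen{$_} } keys %link_seen };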
The algorithm is suggested on the presumption that duplicate keys are a usefully-small percentage of the total.
As a footnote: what you are doing here is called a merge, and also an inner join. This approach is based on a plentiful-memory assumption, and it sets aside alternatives which would require minimal, if any, programming:
If you have to do a lot of stuff like this, or you want to do a bunch of ad-hoc work with this data, SQLite (together with Perl, of course) might be just the ticket. Quite an amazing piece of software... (and so is Perl!)
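For what it's worth, here is a sketch of that route using DBI with an in-memory SQLite database (it assumes DBD::SQLite is installed, reuses the hypothetical @list1 / @list2 sample data from above, and the table and column names are made up for the example):

    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
                            { RaiseError => 1, AutoCommit => 1 } );

    $dbh->do('CREATE TABLE list1 (node TEXT, link TEXT)');
    $dbh->do('CREATE TABLE list2 (node TEXT, link TEXT)');

    my $ins1 = $dbh->prepare('INSERT INTO list1 (node, link) VALUES (?, ?)');
    my $ins2 = $dbh->prepare('INSERT INTO list2 (node, link) VALUES (?, ?)');
    $ins1->execute( @{$_}{qw(node link)} ) for @list1;
    $ins2->execute( @{$_}{qw(node link)} ) for @list2;

    # One SQL statement does the inner join for us.
    my $rows = $dbh->selectall_arrayref(
        'SELECT a.node, a.link
           FROM list1 a
           JOIN list2 b ON a.node = b.node AND a.link = b.link'
    );
    print "match: $_->[0] / $_->[1]\n" for @$rows;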