in reply to Re^2: Fastest way to merge (and de-dup) two large arrays?
in thread Fastest way to merge (and de-dup) two large arrays?

Every array element for the lookup will need to be hashed anyway, either for "storage" as a hash key or for lookup as a hash key, so that is a wash.

I usually choose the smaller array for the hash because the smaller array is more likely to fit in memory.

  • Comment on Re^3: Fastest way to merge (and de-dup) two large arrays?

Replies are listed 'Best First'.
Re^4: Fastest way to merge (and de-dup) two large arrays?
by SimonPratt (Friar) on Aug 12, 2016 at 07:59 UTC

    I think perldigious is going with the idea that building the hash from the larger array initially and adding the smaller array to the larger one would be faster, as the initial hash build doesn't run any if checks (saving potentially millions of cycles)

    It would need testing, but on the face of it this seems to be a reasonable idea. Given that all of the records will eventually end up in the hash, if this doesn't fit in memory, then the entire solution should be reworked to find a completely different method

    my %seen = map {$_ => 1} @rows; foreach my $rawData (@data) { push(@rows,$rawData) unless $seen{$rawData}++; }

      Well SimonPratt I wish I had been considering that now that you mention it :-), good catch.

      I was actually just wondering about the difference in efficiency between checking for the existence of a key in a hash of 60k elements 600k times vs. checking for the existence of a key in a hash of 600k elements 60k times. I'm guessing the answer is "it depends," and I'm further guessing that most of that depends on the nature of the keys themselves.

      I asked Corion if he could give me any insight on the efficiency of such things because I'm estimating he has about three orders of magnitude more Perl and coding experience than I do (as do a lot of the other Monks here). Since I have done pretty much this exact same thing before, and do similar things rather frequently, it seemed worth asking these sorts of questions. As it is I usually just try it multiple ways and test to figure out which is best if run time is a major consideration, and maybe that's the only real way to feel certain about anything anyway.

      I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
      I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious