in reply to Re: Fastest way to merge (and de-dup) two large arrays?
in thread Fastest way to merge (and de-dup) two large arrays?

This was where I was going as well, but I was going to suggest hashing the larger array instead of the smaller one. I guess I assumed it would be faster to do fewer loop iterations checking for the existence of a key in a larger hash than the other way around. Is this something that would have to be tested on a real data set to know for sure, or is the way you show it reliably faster?

I ask because I've done this exact thing before and I'm curious for an explanation on the efficiency difference.

I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious

Re^3: Fastest way to merge (and de-dup) two large arrays?
by Corion (Patriarch) on Aug 11, 2016 at 19:12 UTC

    Every array element for the lookup will need to be hashed anyway, either for "storage" as a hash key or for lookup as a hash key, so that is a wash.

    I usually choose the smaller array for the hash because the smaller array is more likely to fit in memory.
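For reference, a minimal sketch of that approach (hash the smaller array, then scan the larger one); the `@small` and `@large` names here are placeholders, not anything from the original post:

```perl
use strict;
use warnings;

# Hypothetical input arrays; in practice @small would be the array
# more likely to fit in memory as a hash.
my @small = qw(a b c);
my @large = qw(b c d e f);

# Build the lookup hash from the smaller array.
my %seen = map { $_ => 1 } @small;

# Start the merged list with the smaller array, then append only
# elements of the larger array we have not seen yet.
my @merged = @small;
for my $item (@large) {
    push @merged, $item unless $seen{$item}++;
}

print "@merged\n";    # a b c d e f
```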

      I think perldigious is going with the idea that building the hash from the larger array first, and then adding the smaller array to it, would be faster, since the initial hash build doesn't run any if checks (potentially saving millions of cycles).

      It would need testing, but on the face of it this seems a reasonable idea. Given that all of the records will eventually end up in the hash anyway, if that hash doesn't fit in memory then the entire solution should be reworked using a completely different method.

      my %seen = map { $_ => 1 } @rows;
      foreach my $rawData (@data) {
          push(@rows, $rawData) unless $seen{$rawData}++;
      }

        Well SimonPratt, now that you mention it, I wish I had been considering that :-). Good catch.

        I was actually just wondering about the difference in efficiency between checking for the existence of a key in a hash of 60k elements 600k times vs. checking for the existence of a key in a hash of 600k elements 60k times. I'm guessing the answer is "it depends," and I'm further guessing that most of that depends on the nature of the keys themselves.

        I asked Corion if he could give me any insight on the efficiency of such things because I'm estimating he has about three orders of magnitude more Perl and coding experience than I do (as do a lot of the other Monks here). Since I have done pretty much this exact same thing before, and do similar things rather frequently, it seemed worth asking these sorts of questions. As it is, I usually just try it multiple ways and benchmark to figure out which is best when run time is a major consideration; maybe that's the only real way to feel certain about anything anyway.
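        Trying it both ways is easy to do with the core Benchmark module. A sketch of such a test, with sizes scaled down from the 60k/600k case for a quick run (the `merge_hash_first` helper and the key-generation scheme are my own invention for illustration):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hash one array into a lookup table, then scan the other,
# appending only elements not already seen.
sub merge_hash_first {
    my ($first, $second) = @_;
    my %seen   = map { $_ => 1 } @$first;
    my @merged = @$first;
    for my $item (@$second) {
        push @merged, $item unless $seen{$item}++;
    }
    return @merged;
}

# Hypothetical overlapping data sets (keys 3000..6000 appear in both).
my @small = map { "key$_" } 1 .. 6_000;
my @large = map { "key$_" } 3_000 .. 63_000;

# Run each variant for at least one CPU second and compare rates.
cmpthese(-1, {
    hash_small_first => sub { my @m = merge_hash_first(\@small, \@large) },
    hash_large_first => sub { my @m = merge_hash_first(\@large, \@small) },
});
```

        Whichever direction wins, both produce the same merged set, so the choice is purely a speed/memory trade-off.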

        I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
        I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious