f77coder has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have two snippets of code that compute the intersection of two hashes, and I'm wondering about performance on large-scale data.

In a nutshell, what kind of performance should I expect from grep vs. looping through the hashes? Looking at the grep source code, it does a lot of buffering. I've tried to find details on the internals of Perl's hash memory allocation, but without much luck.

I did read this in the Devel::Size docs (http://cpansearch.perl.org/src/TELS/Devel-Size-0.71/lib/Devel/Size.pm): "This does mean that hashes eat up a I<lot> of memory, both in memory Devel::Size can track (the memory actually in the structures and strings) and that it can't (the malloc alignment and length overhead)."
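
For what it's worth, Devel::Size can at least measure that overhead directly. A minimal, untested sketch (a throwaway hash here stands in for my real data):

use Devel::Size qw(total_size);

# build a dummy 100k-key hash and see what it actually costs
my %h = map { $_ => 1 } 1 .. 100_000;
printf "total_size: %d bytes (%.1f bytes/key)\n",
    total_size(\%h), total_size(\%h) / keys %h;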

Or perhaps it's possible to unroll the loop and thread it? (A rough sketch of what I mean follows the snippets below.)

# loop version
my %inter;
for (keys %hist1) {
    if (exists $hist2{$_}) {
        my $val1 = $hist1{$_};
        my $val2 = $hist2{$_};
        $inter{$_} = ($val1 <= $val2) ? $val1 : $val2;
    }
}

# grep version
my @common = grep exists $hist1{$_}, sort keys %hist2;
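
Here is the kind of threaded version I mean (untested; as I understand it, ithreads clone %hist1 and %hist2 into every thread, so peak memory would go up, not down):

use threads;

# split the keys of %hist1 into $n chunks, intersect each chunk
# in its own thread, then merge the partial results
my $n    = 4;
my @keys = keys %hist1;
my @threads;
for my $t (0 .. $n - 1) {
    my @chunk = @keys[ grep { $_ % $n == $t } 0 .. $#keys ];
    push @threads, threads->create(sub {
        my %part;
        for (@chunk) {
            next unless exists $hist2{$_};
            $part{$_} = $hist1{$_} <= $hist2{$_} ? $hist1{$_} : $hist2{$_};
        }
        return \%part;
    });
}
my %inter;
for my $thr (@threads) {
    my $part = $thr->join;                   # each thread hands back a hashref
    @inter{ keys %$part } = values %$part;   # merge into the final result
}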

Thanks for any insight.

Replies are listed 'Best First'.
Re: memory:: grep vs looping through hash
by thomas895 (Deacon) on Aug 18, 2014 at 06:39 UTC
    Premature optimization is the root of all evil.
    — Donald E. Knuth

    As for your memory concern: the best thing to do is just try it. In almost all cases you'll find it won't be a problem.
    I made it a bit shorter for you, though:

    # untested
    %inter = map  { $_ => ($hist1{$_} <= $hist2{$_} ? $hist1{$_} : $hist2{$_}) }
             grep { exists $hist2{$_} }   # exists, not truth -- a count of 0 still matches
             keys %hist1;
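
    And rather than guessing about grep vs. the loop, Benchmark will tell you on your own data. A rough, untested sketch (note that your two snippets don't compute the same result, so treat the comparison as indicative only):

    use Benchmark qw(cmpthese);

    cmpthese(-3, {    # run each variant for at least 3 CPU-seconds
        loop => sub {
            my %inter;
            for (keys %hist1) {
                next unless exists $hist2{$_};
                $inter{$_} = $hist1{$_} <= $hist2{$_} ? $hist1{$_} : $hist2{$_};
            }
        },
        grep => sub {
            my @common = grep exists $hist1{$_}, keys %hist2;
        },
    });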

    Should you run into problems, it's probably less of a headache to throw more memory at it. Purchase it for your PC or server, or rent from one of the popular "cloud" providers that charge only several cents an hour for quite a bit of memory.

    -Thomas
    "Excuse me for butting in, but I'm interrupt-driven..."

      Thomas, thanks for reply/code.

      My system is already maxed out, and I've already hit the wall with R and Matlab, so I need something brutally fast at strings/IO. In your opinion/experience, will the grep be faster?

        My system is already maxed out, and I've already hit the wall with R and Matlab, so I need something brutally fast at strings/IO
        In that case, try my suggestion of renting a server somewhere to do this for you. Digital Ocean is a popular one, and their prices range from USD 0.007/hour for 512MB of RAM to USD 0.952/hour for 64GB with a 20-core processor. One of their big selling points is that when you're not using it, you simply "turn it off" and don't pay anything.
        You get a virtualised Linux instance to do whatever you'd like with.

        Amazon also offers their "EC2" service which is essentially the same thing -- but beware of large corporations trying to lock you in!
        HP has one as well, I believe, as does Microsoft.

        In your opinion/experience, will the grep be faster?
        I really couldn't say. The largest array I've ever grepped had maybe a few hundred elements.

        -Thomas
        "Excuse me for butting in, but I'm interrupt-driven..."
Re: memory:: grep vs looping through hash
by Anonymous Monk on Aug 18, 2014 at 12:54 UTC

    To conserve memory, iterate over the hash with each. Calling keys builds the entire list of keys up front, which for a large hash is a substantial allocation in itself.
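
    For example (untested; the same min-of-counts merge as the OP's loop, but without materialising the full key list):

    my %inter;
    while (my ($k, $v1) = each %hist1) {   # walks %hist1 one entry at a time
        next unless exists $hist2{$k};
        my $v2 = $hist2{$k};
        $inter{$k} = $v1 <= $v2 ? $v1 : $v2;
    }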

    Then again, access to the second hash may hit memory in random patterns and thus perform sub-optimally. Large datasets are best managed using a database.
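
    For instance, with DBD::SQLite the whole intersection becomes one join. A sketch only -- the table and column names are made up, and each histogram is assumed to have been loaded into a (key, n) table beforehand:

    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=hist.db", "", "",
                           { RaiseError => 1 });
    my $sth = $dbh->prepare(q{
        SELECT a.key, MIN(a.n, b.n)   -- SQLite's two-argument MIN is a scalar min
        FROM   hist1 a JOIN hist2 b ON a.key = b.key
    });
    $sth->execute;
    while (my ($k, $n) = $sth->fetchrow_array) {
        print "$k\t$n\n";             # key and the smaller of the two counts
    }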

      Many thanks for the replies.

      I've considered databases, but the incoming data is fairly dirty and can't be loaded into a DB until it's been cleaned/munged quite a bit. So far I've only looked at traditional SQL and Hadoop.

Re: memory:: grep vs looping through hash
by sundialsvc4 (Abbot) on Aug 18, 2014 at 12:54 UTC

    The admonition to “try it” really is the best advice that can be given. If your system already has its back up against a memory wall, and if in your case grep is going to return lists so long that your system simply has nowhere to put them, then you probably can't use it; you have to use a loop. However, in that case “you have bigger fish to fry,” because you are seriously overloading your system.

    There's a reason why early systems used the term “SOS (as in ‘... --- ...’) = Short On Storage.” If you seriously over-commit the RAM resource, the system will start thrashing, and it will be the futile disk I/O that sinks the boat, not any clever algorithm or the absence thereof. If you have literally stuffed your machine with all the RAM a 64-bit motherboard can take, then you simply have to expand the job across multiple machines, each similarly equipped.

    And you are probably equally screwed whether you use a loop or not, because it is memory references themselves that cause page faults. When memory references that ought to take nanoseconds consistently start taking milliseconds, “ding dong, you're dead.”