Jeri has asked for the wisdom of the Perl Monks concerning the following question:

Hi! The code below is in working order, but it is a simpler version of what I'm actually working with. My question to you is, "What method of comparison is the fastest and least straining on memory if each of the strings were at least 1,000 unique values in length?"

I don't think using two arrays, or a hash, would be the most efficient use of my memory, but I'm running out of options. Are there any better ideas?

Thanks!

#!/usr/bin/perl5.8.8
use strict;
use warnings;

my @line1 = qw(C A B G F E);
my @line2 = qw(D B F);    # assume line 2 is always equal in size or smaller

foreach my $i (@line2) {
    foreach my $j (@line1) {
        if ($i eq $j) { print "$i\n"; }
    }
}

The reason this is a concern is that I'm going to have to do this type of comparison at least 10,000 times (I'm estimating) per file.

Replies are listed 'Best First'.
Re: Iterating through Two Arrays. Is there a better use of memory?
by AR (Friar) on Oct 13, 2011 at 16:56 UTC

    This is probably the least efficient, yet sane way to do this. I recommend using the elements of the shorter array as the keys of a hash. Then iterate over the longer array, checking whether that element has been seen.

    Hash lookups are generally O(1) (if there are few to no collisions), so this reduces your complexity from O(m*n) to O(m+n).
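    For what it's worth, a rough sketch of that approach, reusing the array names from the original post (%seen is just an illustrative name):

    my @line1 = qw(C A B G F E);
    my @line2 = qw(D B F);                # the shorter array

    # Build a lookup hash from the shorter array ...
    my %seen = map { $_ => 1 } @line2;

    # ... then make a single pass over the longer one.
    foreach my $elem (@line1) {
        print "$elem\n" if $seen{$elem};
    }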

      Nitpick: O(m+n) where m>n is just O(m). The point is, the operation is of linear order in the larger number of items.
      Also, that is the computational complexity, but the original post asks about memory usage, not computation. I'm not a high enough level monk to know what the memory usage of arrays vs. hashes is, but I would hazard a guess that unless you're using an embedded environment or something else with very little available memory, at 10k items you don't need to care too much about that.

      Ben
      My mission: To boldly split infinitives that have never been split before!

        So you're saying, if I do just as AR suggested, I will have a linear growth rate instead of an exponential one?

        Thank you. Hopefully, this spring I will take some more comp. math so I can become somewhat better at growth estimates.

      Thanks, O(1) couldn't be more perfect.

      I need some clarification. What does it mean exactly when "hash keys collide"?

Re: Iterating through Two Arrays. Is there a better use of memory?
by jethro (Monsignor) on Oct 13, 2011 at 17:02 UTC

    The fastest is using a hash. For very structured data you might in a few cases find slightly faster methods, but really, nothing normally beats hashes in this discipline.

    But please be more specific: You talk about string lengths, but you show us arrays. Strings are not arrays outside of the C language. And qw() does produce arrays, not strings.

      I worry about hashes because of some previous work I've done and some points made in this article.

      HashMemoryBlog

      Especially where it says "Perl's hashes are inefficient. You need A LOT of memory if you intend to hash tens of millions of keys."

      This is a concern for me, because the datasets I'm using are really really gigantic.

      And I meant to say, 1,000 elements in each array.

        "Tens of millions of keys" sounds like a big problem, but it's not a problem that's unique to hashes. If there's the potential of having tens of millions of units of anything (stored in arrays, hashes, whatever), then there's the potential for even more. And suddenly we're talking about a solution that just won't scale well.

        It seems that you're getting deep into the database domain. The data set is growing (or will grow) to a point where holding it all in memory at once becomes a bad design decision.

        How did we get from 1000 elements in the original post to tens of millions of elements? Just because some article points out the obvious (that holding tens of millions of items in memory at once isn't a good solution) doesn't mean that holding thousands is a problem.


        Dave

        Back in the day, I made a traveling salesman problem solver which used Perl hashes to reduce the search space. The script started using gigabytes of memory once you got above a few million hash keys. If you're dealing with thousands of items, I wouldn't worry unless you're in a low-memory environment.
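        If you'd rather measure than guess, the CPAN module Devel::Size can report roughly how many bytes a hash occupies; a small sketch (it has to be installed first, and the one-million-key count is only an example):

        use strict;
        use warnings;
        use Devel::Size qw(total_size);    # CPAN module, not in core

        # Build a hash with a million keys and see what it actually costs.
        my %lookup = map { $_ => 1 } 1 .. 1_000_000;
        printf "1e6 keys: %.1f MB\n", total_size(\%lookup) / 2**20;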
        My mission: To boldly split infinitives that have never been split before!
Re: Iterating through Two Arrays. Is there a better use of memory?
by ikegami (Patriarch) on Oct 13, 2011 at 17:42 UTC

    If memory really is a problem (and I doubt it from the numbers you gave), you could use

    for (0..$#line2) {
        my $i = $line2[$_];
        for (0..$#line1) {
            my $j = $line1[$_];
            print $i if $i eq $j;
        }
    }

    If memory isn't a problem, you can use a much faster solution:

    my %lookup;
    ++$lookup{$_} for @line2;

    for (@line1) {
        print if $lookup{$_};
    }

    Finally, you might want to get rid of duplicate messages.

    my %lookup;
    $lookup{$_} = -1 for @line2;

    for (@line1) {
        print if !++$lookup{$_};
    }

      Again, many thanks. And yes, everyone is correct. The numbers I gave won't affect my memory now, but I'm working on a preliminary model, which means I will probably run into some memory issues later. I know for the actual model I am dealing with at least two files that are 3.6GiB and 5.9GiB. Others could potentially be larger. I will heed and take everyone's advice! Thanks!

      Jeri

        I know for the actual model I am dealing with at least two files that are 3.6GiB and 5.9GiB.

        Totally irrelevant to your question. The sum of the number of elements in @line1 and @line2 is what matters. You said the product of their sizes is going to be 10,000, so the sum of their sizes is going to be between 200 and 10,001, rather small numbers.

        If you can fit 10GiB into memory, don't fret about using less than 100KB to loop through the data.

Re: Iterating through Two Arrays. Is there a better use of memory?
by dd-b (Pilgrim) on Oct 13, 2011 at 17:57 UTC

    I'm accepting your analysis that memory use may be an issue, but CPU use probably isn't (though that's a rare enough state that I'm a bit skeptical internally).

    You can store only the smaller file in memory; then read through the larger file, checking each entry against the smaller one in memory just after reading it. That should cut your memory use by half or better (depending how asymmetrical the real pairs of files are), and doesn't cost any performance. (But doesn't leave you with the data in memory, which may be valuable to a later stage of processing.) Write your results as you go to another file. (But this may not fit the overall workflow of your problem. However, in general, if you're approaching memory size limits, adopting designs that stream data through rather than holding it all in memory at once is your winning strategy.)
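    A rough sketch of that layout, with made-up file names (small.txt, big.txt, matches.txt) and assuming one value per line:

    use strict;
    use warnings;

    # Hold only the smaller file in memory.
    open my $small_fh, '<', 'small.txt' or die "small.txt: $!";
    my %seen;
    while (my $line = <$small_fh>) {
        chomp $line;
        $seen{$line} = 1;
    }
    close $small_fh;

    # Stream the larger file, writing matches out as we go.
    open my $big_fh, '<', 'big.txt'     or die "big.txt: $!";
    open my $out_fh, '>', 'matches.txt' or die "matches.txt: $!";
    while (my $line = <$big_fh>) {
        chomp $line;
        print {$out_fh} "$line\n" if $seen{$line};
    }
    close $big_fh;
    close $out_fh;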

Re: Iterating through Two Arrays. Is there a better use of memory?
by Anonymous Monk on Oct 13, 2011 at 22:22 UTC

    If you're working with strings, then work with strings. A string uses less memory than an array of characters, and string operations are fast:

     my $line1 = 'CABGFE';
     my $line2 = 'DBF';

     say for $line1 =~ /[$line2]/g;
    Even if you're working with arrays of characters, joining into strings will probably be faster:
     my @line1 = qw(C A B G F E);
     my @line2 = qw(D B F);

     my $line1 = join '', @line1;
     my $line2 = join '', @line2;

     say for $line1 =~ /[$line2]/g;

    If you've only simplified your example to arrays of characters, but they're really arrays of something else, then use a hash, as others have already suggested.

      Hi, thanks for replying. I've decided to work with strings instead of arrays. Thanks for the advice. However, I can't get your code to work properly.

      say for $line1 =~ /[$line2]/g;

      it complains...

      Bareword "say" not allowed while "strict subs" in use at evan.pl line +19.

      when I delete say, it complains...

      syntax error at evan.pl line 19, near "$line1 =~"

      What's the trick to this string comparison?

      thanks!

Re: Iterating through Two Arrays. Is there a better use of memory?
by JavaFan (Canon) on Oct 13, 2011 at 20:47 UTC
    "What method of comparison is the fastest and least straining on memory if each of the strings were at least 1,000 unique values in length?"
    As with many problems, you will have to make a trade-off. Do you go for speed, or do you go for less memory? More often than not, there isn't an algorithm that is both the fastest and uses the least memory.

    It's entirely possible to use an algorithm that uses O(1) memory (not counting disk space). It will be the least straining on memory - but it's not the fastest. Fastest is to put the strings of one set in a hash and check the other set against it. But that uses more memory.

    So, make your pick. The least straining on memory, or the fastest. You cannot have both.
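    For the low-memory end of that trade-off, one common shape is to sort both files on disk first (the system sort is fine for that) and then merge-scan them, holding only one line from each file in memory at a time. A sketch, assuming sorted1.txt and sorted2.txt each hold one sorted, unique value per line:

    use strict;
    use warnings;

    open my $a_fh, '<', 'sorted1.txt' or die "sorted1.txt: $!";
    open my $b_fh, '<', 'sorted2.txt' or die "sorted2.txt: $!";

    my $a = <$a_fh>;
    my $b = <$b_fh>;
    while (defined $a and defined $b) {
        chomp(my $x = $a);
        chomp(my $y = $b);
        if    ($x lt $y) { $a = <$a_fh>; }    # advance whichever file is behind
        elsif ($x gt $y) { $b = <$b_fh>; }
        else {                                # common element
            print "$x\n";
            $a = <$a_fh>;
            $b = <$b_fh>;
        }
    }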