Jeri has asked for the wisdom of the Perl Monks concerning the following question:

Hi! The code below is in working order, but it is a simpler version of what I'm actually working with. My question to you is, "What method of comparison is the fastest and least straining on memory if each of the strings were at least 1,000 unique values in length?"

I don't think using two arrays, or a hash, would be the most efficient use of my memory, but I'm running out of options. Are there any better ideas?

Thanks!

#!/usr/bin/perl5.8.8
use strict;
use warnings;

my @line1 = qw(C A B G F E);
my @line2 = qw(D B F);    # assume line 2 is always equal in size or smaller

foreach my $i (@line2) {
    foreach my $j (@line1) {
        if ($i eq $j) { print "$i\n"; }
    }
}

The reason this is a concern is that I'm going to have to do this type of comparison at least 10,000 times (I'm estimating) per file.

Replies are listed 'Best First'.
Re: Iterating through Two Arrays. Is there a better use of memory?
by AR (Friar) on Oct 13, 2011 at 16:56 UTC

    This is probably the least efficient, yet sane way to do this. I recommend using the elements of the shorter array as the keys of a hash. Then iterate over the longer array, checking whether that element has been seen.

    Hash lookups are generally O(1) (if there are few to no collisions), so this reduces your complexity from O(m*n) to O(m+n).
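    For what it's worth, a rough sketch of that approach, reusing the array names from the original post (%seen is just an illustrative name):

    my @line1 = qw(C A B G F E);
    my @line2 = qw(D B F);                # the shorter array

    # Build a lookup hash from the shorter array ...
    my %seen = map { $_ => 1 } @line2;

    # ... then make a single pass over the longer one.
    foreach my $elem (@line1) {
        print "$elem\n" if $seen{$elem};
    }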

      Nitpick: O(m+n) where m>n is just O(m). The point is, the operation is of linear order in the larger number of items.
      Also, that is the computational complexity, but the original post asks about memory usage, not computation. I'm not a high enough level monk to know what the memory usage of arrays vs. hashes is, but I would hazard a guess that unless you're using an embedded environment or something else with very little available memory, at 10k items you don't need to care too much about that.

      Ben
      My mission: To boldly split infinitives that have never been split before!

        So you're saying, if I do just as AR suggested, I will have a linear growth rate instead of an exponential one?

        Thank you. Hopefully, this spring I will take some more comp. math so I can become somewhat better at growth estimates.

      Thanks, O(1) couldn't be more perfect.

      I need some clarification. What does it mean exactly when "hash keys collide"?

Re: Iterating through Two Arrays. Is there a better use of memory?
by jethro (Monsignor) on Oct 13, 2011 at 17:02 UTC

    The fastest is using a hash. For very structured data you might in a few cases find slightly faster methods, but really, nothing normally beats hashes in this discipline.

    But please be more specific: You talk about string lengths, but you show us arrays. Strings are not arrays outside of the C language. And qw() does produce arrays, not strings.

      I worry about hashes because of some previous work I've done and some points made in this article.

      HashMemoryBlog

      Especially where it says "Perl's hashes are inefficient. You need A LOT of memory if you intend to hash tens of millions of keys."

      This is a concern for me, because the datasets I'm using are really really gigantic.

      And I meant to say, 1,000 elements in each array.

        "Tens of millions of keys" sounds like a big problem, but it's not a problem that's unique to hashes. If there's the potential of having tens of millions of units of anything (stored in arrays, hashes, whatever), then there's the potential for even more. And suddenly we're talking about a solution that just won't scale well.

        It seems that you're getting deep into the database domain. The data set is growing (or will grow) to a point where holding it all in memory at once becomes a bad design decision.

        How did we get from 1000 elements in the original post to tens of millions of elements? Just because some article points out the obvious (that holding tens of millions of items in memory at once isn't a good solution) doesn't mean that holding thousands is a problem.


        Dave

        Back in the day, I made a traveling salesman problem solver which used Perl hashes to reduce the search space. The script started using gigabytes of memory once you got above a few million hash keys. If you're dealing with thousands of items, I wouldn't worry unless you're in a low-memory environment.
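        If you'd rather measure than guess, the CPAN module Devel::Size can report roughly how many bytes a hash occupies; a small sketch (it has to be installed first, and the one-million-key count is only an example):

        use strict;
        use warnings;
        use Devel::Size qw(total_size);    # CPAN module, not in core

        # Build a hash with a million keys and see what it actually costs.
        my %lookup = map { $_ => 1 } 1 .. 1_000_000;
        printf "1e6 keys: %.1f MB\n", total_size(\%lookup) / 2**20;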
        My mission: To boldly split infinitives that have never been split before!
Re: Iterating through Two Arrays. Is there a better use of memory?
by ikegami (Patriarch) on Oct 13, 2011 at 17:42 UTC

    If memory really is a problem (and I doubt it from the numbers you gave), you could use

    for (0..$#line2) {
        my $i = $line2[$_];
        for (0..$#line1) {
            my $j = $line1[$_];
            print $i if $i eq $j;
        }
    }

    If memory isn't a problem, you can use a much faster solution:

    my %lookup;
    ++$lookup{$_} for @line2;

    for (@line1) {
        print if $lookup{$_};
    }

    Finally, you might want to get rid of duplicate messages.

    my %lookup;
    $lookup{$_} = -1 for @line2;

    for (@line1) {
        print if !++$lookup{$_};
    }

      Again, many thanks. And yes, everyone is correct. The numbers I gave won't affect my memory now, but I'm working on a preliminary model, which means I will probably run into some memory issues later. I know for the actual model I am dealing with at least two files that are 3.6GiB and 5.9GiB. Others could potentially be larger. I will heed and take everyone's advice! Thanks!

      Jeri

        I know for the actual model I am dealing with at least two files that are 3.6GiB and 5.9GiB.

        Totally irrelevant to your question. The sum of the number of elements in @line1 and @line2 is what matters. You said the product of their sizes is going to be 10,000, so the sum of their sizes is going to be between 200 and 10,001, rather small numbers.

        If you can fit 10GiB into memory, don't fret about using less than 100KB to loop through the data.

Re: Iterating through Two Arrays. Is there a better use of memory?
by dd-b (Pilgrim) on Oct 13, 2011 at 17:57 UTC

    I'm accepting your analysis that memory use may be an issue, but CPU use probably isn't (though that's a rare enough state that I'm a bit skeptical internally).

    You can store only the smaller file in memory; then read through the larger file, checking each entry against the smaller one in memory just after reading it. That should cut your memory use by half or better (depending how asymmetrical the real pairs of files are), and doesn't cost any performance. (But doesn't leave you with the data in memory, which may be valuable to a later stage of processing.) Write your results as you go to another file. (But this may not fit the overall workflow of your problem. However, in general, if you're approaching memory size limits, adopting designs that stream data through rather than holding it all in memory at once is your winning strategy.)
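    A rough sketch of that layout, with made-up file names (small.txt, big.txt, matches.txt) and assuming one value per line:

    use strict;
    use warnings;

    # Hold only the smaller file in memory.
    open my $small_fh, '<', 'small.txt' or die "small.txt: $!";
    my %seen;
    while (my $line = <$small_fh>) {
        chomp $line;
        $seen{$line} = 1;
    }
    close $small_fh;

    # Stream the larger file, writing matches out as we go.
    open my $big_fh, '<', 'big.txt'     or die "big.txt: $!";
    open my $out_fh, '>', 'matches.txt' or die "matches.txt: $!";
    while (my $line = <$big_fh>) {
        chomp $line;
        print {$out_fh} "$line\n" if $seen{$line};
    }
    close $big_fh;
    close $out_fh;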

Re: Iterating through Two Arrays. Is there a better use of memory?
by Anonymous Monk on Oct 13, 2011 at 22:22 UTC

    If you're working with strings, then work with strings. A string uses less memory than an array of characters, and string operations are fast:

     my $line1 = 'CABGFE';
     my $line2 = 'DBF';

     say for $line1 =~ /[$line2]/g;
    Even if you're working with arrays of characters, joining into strings will probably be faster:
     my @line1 = qw(C A B G F E);
     my @line2 = qw(D B F);

     my $line1 = join '', @line1;
     my $line2 = join '', @line2;

     say for $line1 =~ /[$line2]/g;

    If you've only simplified your example to arrays of characters, but they're really arrays of something else, then use a hash, as others have already suggested.

      Hi, thanks for replying. I've decided to work with strings instead of arrays. Thanks for the advice. However, I can't get your code to work properly.

      say for $line1 =~ /[$line2]/g;

      it complains...

      Bareword "say" not allowed while "strict subs" in use at evan.pl line +19.

      when I delete say, it complains...

      syntax error at evan.pl line 19, near "$line1 =~"

      What's the trick to this string comparison?

      thanks!

Re: Iterating through Two Arrays. Is there a better use of memory?
by JavaFan (Canon) on Oct 13, 2011 at 20:47 UTC
    "What method of comparison is the fastest and least straining on memory if each of the strings were at least 1,000 unique values in length?"
    As with many problems, you will have to make a trade-off. Do you go for speed, or do you go for less memory? More often than not, there isn't an algorithm that is both the fastest and uses the least memory.

    It's entirely possible to use an algorithm that uses O(1) memory (not counting disk space). It will be the least straining on memory - but it's not the fastest. Fastest is to put the strings of one set in a hash and check the other set against it. But that uses more memory.

    So, make your pick. The least straining on memory, or the fastest. You cannot have both.
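    For the low-memory end of that trade-off, one common shape is to sort both files on disk first (the system sort is fine for that) and then merge-scan them, holding only one line from each file in memory at a time. A sketch, assuming sorted1.txt and sorted2.txt each hold one sorted, unique value per line:

    use strict;
    use warnings;

    open my $a_fh, '<', 'sorted1.txt' or die "sorted1.txt: $!";
    open my $b_fh, '<', 'sorted2.txt' or die "sorted2.txt: $!";

    my $a = <$a_fh>;
    my $b = <$b_fh>;
    while (defined $a and defined $b) {
        chomp(my $x = $a);
        chomp(my $y = $b);
        if    ($x lt $y) { $a = <$a_fh>; }    # advance whichever file is behind
        elsif ($x gt $y) { $b = <$b_fh>; }
        else {                                # common element
            print "$x\n";
            $a = <$a_fh>;
            $b = <$b_fh>;
        }
    }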