Re: Iterating through Two Arrays. Is there a better use of memory?
by AR (Friar) on Oct 13, 2011 at 16:56 UTC
|
That is probably the least efficient sane way to do it. I recommend using the elements of the shorter array as the keys of a hash, then iterating over the longer array and checking whether each element has been seen.
Hash lookups are generally O(1) (if there are few to no collisions), so this reduces your complexity from O(m*n) to O(m+n).
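A minimal sketch of that approach (the @short/@long names and the sample data are illustrative, not from the thread):
use strict;
use warnings;

my @short = qw(D B F);         # shorter array: becomes the hash keys
my @long  = qw(C A B G F E);   # longer array: scanned once

my %seen = map { $_ => 1 } @short;
for my $elem (@long) {
    print "$elem\n" if $seen{$elem};   # O(1) hash lookup per element
}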
| [reply] |
|
|
Thanks, O(1) couldn't be more perfect.
| [reply] |
Re: Iterating through Two Arrays. Is there a better use of memory?
by jethro (Monsignor) on Oct 13, 2011 at 17:02 UTC
|
The fastest approach is a hash. For very structured data you might, in a few cases, find slightly faster methods, but really, nothing normally beats hashes in this discipline.
But please be more specific: you talk about string lengths, but you show us arrays. Strings are not arrays outside of the C language, and qw() produces arrays, not strings.
| [reply] |
|
|
I worry about hashes because of some previous work I've done and some points made in this article.
HashMemoryBlog
Especially where it says "Perl's hashes are inefficient. You need A LOT of memory if you intend to hash tens of millions of keys."
This is a concern for me, because the datasets I'm using are really, really gigantic.
And I meant to say, 1,000 elements in each array.
| [reply] |
|
|
"Tens of millions of keys" sounds like a big problem, but it's not a problem that's unique to hashes. If there's the potential of having tens of millions of units of anything (stored in arrays, hashes, whatever), then there's the potential for even more. And suddenly we're talking about a solution that just won't scale well.
It seems that you're getting deep into the database domain. The data set is growing (or will grow) to a point where holding it all in memory at once becomes a bad design decision.
How did we get from 1,000 elements in the original post to tens of millions of elements? Just because some article points out the obvious (that holding tens of millions of items in memory at once isn't a good solution) doesn't mean that holding thousands is a problem.
| [reply] |
Re: Iterating through Two Arrays. Is there a better use of memory?
by ikegami (Patriarch) on Oct 13, 2011 at 17:42 UTC
|
If memory really is a problem (and I doubt it from the numbers you gave), you could use
for (0..$#line2) {
    my $i = $line2[$_];
    for (0..$#line1) {
        my $j = $line1[$_];
        print $i if $i eq $j;
    }
}
If memory isn't a problem, you can use a much faster solution:
my %lookup;
++$lookup{$_} for @line2;
for (@line1) {
    print if $lookup{$_};
}
Finally, you might want to get rid of duplicate messages.
my %lookup;
$lookup{$_} = -1 for @line2;
for (@line1) {
    print if !++$lookup{$_};
}
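For illustration, the last variant wrapped into a self-contained script (the sample data is invented, not from the thread):
use strict;
use warnings;

my @line1 = qw(C A B G F B);   # invented sample data
my @line2 = qw(D B F);

my %lookup;
$lookup{$_} = -1 for @line2;
for (@line1) {
    print "$_\n" if !++$lookup{$_};   # prints B and F once each
}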
| [reply] [d/l] [select] |
|
|
Again, many thanks. And yes, everyone is correct. The numbers I gave won't affect my memory now, but I'm working on a preliminary model, which means I will probably run into some memory issues later. I know that for the actual model I am dealing with at least two files that are 3.6GiB and 5.9GiB; others could potentially be larger. I will heed everyone's advice! Thanks!
Jeri
| [reply] |
|
|
I know for the actual model I am dealing with at least two files that are 3.6GiB and 5.9GiB.
Totally irrelevant to your question. The sum of the number of elements in @line1 and @line2 is what matters. You said the product of their sizes is going to be 10,000, so the sum of their sizes is going to be between 200 and 10,001, rather small numbers.
If you can fit 10GiB into memory, don't fret about using less than 100KB to loop through the data.
| [reply] [d/l] [select] |
Re: Iterating through Two Arrays. Is there a better use of memory?
by dd-b (Pilgrim) on Oct 13, 2011 at 17:57 UTC
|
I'm accepting your analysis that memory use may be an issue, but CPU use probably isn't (though that's a rare enough state that I'm a bit skeptical internally).
You can store only the smaller file in memory, then read through the larger file, checking each entry against the in-memory copy as soon as it's read. That should cut your memory use by half or better (depending on how asymmetrical the real pairs of files are) and doesn't cost any performance. (It does mean the full data set isn't left in memory, which may be valuable to a later stage of processing.) Write your results to another file as you go.
This may not fit the overall workflow of your problem. In general, though, if you're approaching memory size limits, adopting designs that stream data through rather than holding it all in memory at once is your winning strategy.
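A rough sketch of that streaming pattern (the file names, and the assumption that each line of a file is one entry, are illustrative):
use strict;
use warnings;

# Hold only the smaller file in memory, as a lookup hash.
open my $small, '<', 'smaller.txt' or die "smaller.txt: $!";
my %seen;
while (my $line = <$small>) {
    chomp $line;
    $seen{$line} = 1;
}
close $small;

# Stream the larger file; never hold it in memory all at once.
open my $large, '<', 'larger.txt'  or die "larger.txt: $!";
open my $out,   '>', 'matches.txt' or die "matches.txt: $!";
while (my $line = <$large>) {
    chomp $line;
    print {$out} "$line\n" if $seen{$line};   # write matches as we go
}
close $large;
close $out;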
| [reply] |
Re: Iterating through Two Arrays. Is there a better use of memory?
by Anonymous Monk on Oct 13, 2011 at 22:22 UTC
|
If you're working with strings, then work with strings. A string uses less memory than an array of characters, and string operations are fast:
use feature 'say';   # say must be enabled (Perl 5.10+), e.g. this or use v5.10;

my $line1 = 'CABGFE';
my $line2 = 'DBF';
say for $line1 =~ /[$line2]/g;
Even if you're working with arrays of characters, joining into strings will probably be faster:
my @line1 = qw(C A B G F E);
my @line2 = qw(D B F);
my $line1 = join '', @line1;
my $line2 = join '', @line2;
say for $line1 =~ /[$line2]/g;
If you've only simplified your example to arrays of characters, but they're really arrays of something else, then use a hash, as others have already suggested. | [reply] [d/l] [select] |
|
|
Bareword "say" not allowed while "strict subs" in use at evan.pl line
+19.
when I delete say, it complains...
syntax error at evan.pl line 19, near "$line1 =~"
What's the trick to this sting comparison?
thanks!
| [reply] [d/l] [select] |
Re: Iterating through Two Arrays. Is there a better use of memory?
by JavaFan (Canon) on Oct 13, 2011 at 20:47 UTC
|
"What method of comparison is the fastest and least straining on memory if each of the strings were at least 1,000 unique values in length?"
With many problems, you have to make a trade-off: do you go for speed, or do you go for less memory? More often than not, there isn't an algorithm that is both the fastest and uses the least memory.
It's entirely possible to use an algorithm that uses O(1) memory (not counting disk space). It will be the least straining on memory, but it's not the fastest. The fastest is to put the strings of one set in a hash and check the other set against it. But that uses more memory.
So make your pick: the least straining on memory, or the fastest. You cannot have both.
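For illustration only (JavaFan doesn't name a specific algorithm): one way to get memory use down to roughly O(1) is a merge-style scan over two files that are assumed to be pre-sorted, one entry per line. The file names here are hypothetical.
use strict;
use warnings;

# Assumes set_a.txt and set_b.txt are already sorted, one entry per line.
open my $fh_a, '<', 'set_a.txt' or die "set_a.txt: $!";
open my $fh_b, '<', 'set_b.txt' or die "set_b.txt: $!";

my $x = <$fh_a>;
my $y = <$fh_b>;
while (defined $x and defined $y) {
    chomp $x;
    chomp $y;
    if    ($x lt $y) { $x = <$fh_a> }
    elsif ($x gt $y) { $y = <$fh_b> }
    else {
        print "$x\n";      # common element
        $x = <$fh_a>;
        $y = <$fh_b>;
    }
}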
| [reply] [d/l] |