Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
Is there any way to get around the out-of-memory problem caused by one of the functions in List::MoreUtils?

Given this snippet within a subroutine
# within a function
print "FINDNB_BEFORE_UNIQ: ", scalar(@very_big_array), "\n";
my @uvery_big_array = uniq @very_big_array;
print "FINDNB_AFTER: ", scalar(@uvery_big_array), "\n";
# return \@uvery_big_array;
It breaks here:
FINDNB_BEFORE_UNIQ: 20905984
Out of memory!
I'm also wondering whether the fact that the snippet is located within a subroutine is what causes the memory problem.

Replies are listed 'Best First'.
Re: Howto Avoid Memory Problem in List::MoreUtils
by salva (Canon) on May 04, 2006 at 09:33 UTC
    The algorithm used by List::MoreUtils::uniq internally is fast (O(N)) but uses a lot of memory.
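
    (Roughly, the hash-based approach amounts to the sketch below; it is not the module's literal source, but it shows why memory grows with the number of distinct elements:)

    my %seen;
    my @unique = grep { !$seen{$_}++ } @very_big_array;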

    You can use an alternative algorithm such as sort-and-remove-consecutive-duplicates, which is slower (O(N log N)) but doesn't require additional memory:

    @array = sort @array; # this is optimized by perl as a sort
                          # in place operation that doesn't
                          # require additional memory
    my $last = $array[0];
    my $i = 1;
    while ($i < @array) {
        if ($array[$i] eq $last) {
            splice @array, $i, 1;
        }
        else {
            $last = $array[$i];
            $i++;
        }
    }
      well, the algorithm in my previous post is actually O(N^2) in the worst case because of the splice.

      A better version is:

      @array = sort @array;
      my $j = 0;
      my $last;
      for my $e (@array) {
          if (!$j or $last ne $e) {
              $array[$j] = $last = $e;
              $j++;
          }
      }
      $#array = $j - 1;
        Hi salva,
        Thanks so much for your reply. Just a minor question. What's the meaning of this line:
        $array[$j] = $last = $e;
        Why not simply:
        $array[$j] = $e;
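
        (For what it's worth: the chained assignment keeps $last in sync with the last element written, which the "$last ne $e" test on the next iteration relies on.)

        $array[$j] = $last = $e;
        # ...is shorthand for:
        $last      = $e;
        $array[$j] = $last;
        # if $last were not refreshed here, the next iteration's
        # duplicate test would compare against a stale (or undefined)
        # value and duplicates could survive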
      this is optimized by perl as a sort in place operation

      In which version of Perl did that become true?

      Unless I'm doing it wrong, it doesn't seem to be true in 5.8.8?


        In which version of Perl did that become true?
        Since 5.8.4.
        $ perl583 -le '@a=(1,2); $r1 = \$a[0]; @a = sort @a; $r2 = \$a[0]; print "yes" if $r1 == $r2'
        $ perl584 -le '@a=(1,2); $r1 = \$a[0]; @a = sort @a; $r2 = \$a[0]; print "yes" if $r1 == $r2'
        yes
        $

        Dave.

        well, you are right, I forgot that mergesort needs additional memory to perform the merge; still, @a = sort @a uses half the memory required by other sorting expressions, for instance:
        use Memchmark 'cmpthese';
        # use sort '_quicksort';

        cmpthese(
            none  => sub { my @a = map { int rand 100 } 1..1000000 },
            sort0 => sub { my @a = map { int rand 100 } 1..1000000;
                           @a = sort { $a <=> $b } @a },
            sort1 => sub { my @a = map { int rand 100 } 1..1000000;
                           @a = sort { $a <=> $b } @a, 1 },
        );

        __END__
        test: none, memory used: 3997696 bytes
        test: sort0, memory used: 7999488 bytes
        test: sort1, memory used: 12390400 bytes
        And if the quicksort algorithm is selected, then the claim in my previous post really does hold:
        use sort '_quicksort';
        ...
        __END__
        test: none, memory used: 3997696 bytes
        test: sort0, memory used: 3997696 bytes
        test: sort1, memory used: 12390400 bytes
Re: Howto Avoid Memory Problem in List::MoreUtils
by graff (Chancellor) on May 05, 2006 at 04:04 UTC
    Just curious... would you get the "Out of memory!" result if you simply did this instead of the "uniq" call (other things being equal up to this point):
    my @uvery_big_array = @very_big_array;
    In any case, since one way or another you'll need to sacrifice some runtime to economize on memory usage, a last resort might be to simply save @very_big_array to a disk file, run unix "sort -u" on that, and read it back into the same array variable. (Obviously not an attractive option if portability is an issue, but if this is just an "in-shop" process, unix "sort" has been ported to every popular OS.)
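    A minimal sketch of that approach (the temp-file handling is mine, and it assumes the elements contain no newlines):

    use File::Temp 'tempfile';

    my ($fh, $file) = tempfile();
    print $fh "$_\n" for @very_big_array;    # one element per line
    close $fh or die "close: $!";

    @very_big_array = ();                    # release the in-memory copy first
    open my $sorted, '-|', 'sort', '-u', $file
        or die "can't run sort: $!";
    while (<$sorted>) {
        chomp;
        push @very_big_array, $_;
    }
    close $sorted;
    unlink $file;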
      Sort::External could be a good option for that also.
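
      Something along these lines, if memory serves about its interface (the mem_threshold value is illustrative):

      use Sort::External;

      my $sortex = Sort::External->new( mem_threshold => 2**24 );
      $sortex->feed($_) for @very_big_array;
      $sortex->finish;

      # items come back sorted, so duplicates are consecutive
      my @unique;
      my $prev;
      while ( defined( my $item = $sortex->fetch ) ) {
          push @unique, $item unless defined $prev and $item eq $prev;
          $prev = $item;
      }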

      Or, coming back to the hash approach used by List::MoreUtils::uniq, using an on-disk tree as provided by DB_File could be a faster solution, especially if the number of duplicates is high.
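
      For example (the file name is mine; a real program should use File::Temp):

      use DB_File;
      use Fcntl;

      # a BTREE-backed tie keeps the keys sorted on disk instead of in RAM
      tie my %seen, 'DB_File', '/tmp/uniq.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
          or die "tie: $!";
      $seen{$_} = 1 for @very_big_array;
      my @unique = keys %seen;    # already sorted, courtesy of the BTREE
      untie %seen;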

      Hi graff,
      Sorry for coming back to you again. I guess I have to resort to your way of doing it. As for your suggestion below:
      save @very_big_array to a disk file
        How could one implement that efficiently, directly from the existing non-unique array (@very_big_array)?
        use DB_File;
        tie my %uniq, "DB_File", "/tmp/uniq"; # you should use File::Temp here!
        $uniq{$_} = 1 for @data;
        my @uniq = keys %uniq;
Re: Howto Avoid Memory Problem in List::MoreUtils
by codeacrobat (Chaplain) on May 05, 2006 at 06:27 UTC
    Just a reminder. You can prevent the addition of duplicates with Array::Unique.
    use Array::Unique;

    tie @a, 'Array::Unique';
    @a = qw(a b c a d e f);
    push @a, qw(x b z);
    print "@a\n"; # a b c d e f x z
      Array::Unique uses the same approach as List::MoreUtils::uniq to eliminate duplicates, so it is not going to solve the out of memory problem.
Re: Howto Avoid Memory Problem in List::MoreUtils
by explorer (Chaplain) on May 05, 2006 at 17:38 UTC