Match speed of R in array procesing

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Match speed of R in array procesing by choroba (Cardinal) on Mar 28, 2012 at 15:38 UTC
Use hashes: `#!/usr/bin/perl use warnings; use strict; my @Array_1 = qw(a1 a2 a3 a4 a5 a6); my @Array_2 = qw(a1 b2 c3 a4 f5 a6); my @Array_3 = qw( 1 2 3 4 5 6); # convert to hashes my (%h1, %h2, %h3); undef @h1{@Array_1}; my @idxs = 1 .. @Array_1; @h2{@Array_2} = @idxs; @h3{@idxs} = @Array_3; # do the calculations my %h_tmp = %h2; delete @h_tmp{@Array_1}; delete @h2{keys %h_tmp}; # output ($,, $\) = (' ', "\n"); print keys %h2; print @h3{values %h2};` [download]	[reply] [d/l]
Re: Match speed of R in array procesing by BrowserUk (Patriarch) on Mar 28, 2012 at 16:34 UTC
You don't say how big your arrays are, but once the data is generated, this script takes just over 3/4 of a second to index 1,000,000 items (array_1) and then to process 100,000 items (array_1) against that index; deleting 198 items from the two arrays: #! perl -slw use strict; use Time::HiRes qw[ time ]; our $A //= 1e6; our $B //= 1e5; my @alphas = 'aaa'..'zzz'; my @numers = '0'..'9'; print "Gening data..."; my @a = map{ $alphas[ rand @alphas] . $numers[rand @numers] } 1 .. $A; my @b = map{ $alphas[ rand @alphas] . $numers[rand @numers] } 1 .. $B; my @c = 1 .. @b; print "Starting..."; my $start = time; print "Indexing..."; ## index array @a my %a; undef $a{ $_ } for @a; print "Removing..."; for my $i ( reverse 0 .. $#b ) { exists $a{ $b[ $i ] } and delete $b[$i] and delete $c[$i]; } printf "Took %.6f seconds to check %d (\@b) against %d (\@a) and remov +e %d items\n", time() - $start, $B, $A, $B - @b __END__ C:\test>962209 -A=1e6 -B=1e5 Gening data... Starting... Indexing... Removing... Took 0.771435 seconds to check 100000 (@b) against 1000000 (@a) and re +move 198 items [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply] [d/l]
Re^2: Match speed of R in array procesing by Anonymous Monk on Mar 28, 2012 at 17:41 UTC
The arrays have couple of milions elements but the scripts above are a big imporvement in speed over what I have been using so far. Thanks for help to all of you.	[reply]
Re: Match speed of R in array procesing by tobyink (Canon) on Mar 28, 2012 at 16:00 UTC
R is specifically optimized for this sort of thing. Perl is not. As choroba says, you'll probably be able to do it faster using hashes. You might also want to look at PDL which is a Perl extension for working on large, N-dimensional numeric arrays. It takes a bit of getting used to, but is fast. `perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`	[reply]
Re: Match speed of R in array procesing by davido (Cardinal) on Mar 28, 2012 at 16:21 UTC
Using a hash as a lookup table you end up iterating over @array_1 once (creating the hash), and the indices of @array_2 once. That's O(m+n) `my @Array_1 = ("a1","a2","a3","a4","a5","a6"); my @Array_2 = ("a1","b2","c3","a4","f5","a6"); my @Array_3 = ("1","2","3","4","5","6"); my %crossref; @crossref{@array_1} = (); my( @result_2, @result_3 ); foreach my $ix ( 0 .. $#array_2 ) { next if ! exists $crossref{$array_2[$ix]}; push @result_2, $array_2[$ix]; push @result_3, $array_3[$ix]; }` [download] Dave	[reply] [d/l]
Re: Match speed of R in array procesing by moritz (Cardinal) on Mar 28, 2012 at 15:22 UTC
Please show your code. Maybe there's a way to speed it up, but we can't know that if we can't see your code. Perl 6 - second systems done right	[reply]
Re^2: Match speed of R in array procesing by Anonymous Monk on Mar 28, 2012 at 15:40 UTC
I haven't tried it yet but I would usually do it like this; `foreach $i (@Array_1) { $n = grep {$Array_2[$_] == "$i"} 0 .. $#Array_2; push(@m, $n); }` [download] Once I have the indexes I just delete all the elements with that index in Array 2 and 3. I haven't used perl for a while but as far as I remember I was shocked how fast R does that sort of things. But that must be my not knowing perl enough.	[reply] [d/l]
Re^3: Match speed of R in array procesing by moritz (Cardinal) on Mar 28, 2012 at 16:08 UTC
One problem with your code is that it scales as O(m * n), where `m == scalar @Array_1` and `n == scalar @Array_2` Here's a solution that runs in O(m + n) instead, which should be much faster for large arrays: `use strict; use warnings; use 5.010; # only needed for say() my @Array_1 = ("a1","a2","a3","a4","a5","a6"); my @Array_2 = ("a1","b2","c3","a4","f5","a6"); my @Array_3 = ("1","2","3","4","5","6"); my %seen; @seen{@Array_2} = undef; my @idx = grep exists $seen{$Array_1[$_]}, 0..$#Array_1; @Array_1 = @Array_1[@idx]; @Array_3 = @Array_3[@idx]; say "@Array_1"; say "@Array_3";` [download] There are several other ways to write that same algorithm (for example you could use splice to delete array elements one by one in-place, or push onto two new arrays in parallel), but only a benchmark shows which one is fastest. Perl 6 - second systems done right	[reply] [d/l] [select]