Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have two arrays. I want to delete all elements in Array 2 that are not in Array 1, and also delete the elements of Array 3 that have the same indices as those deleted from Array 2. This is very easy, but the problem is speed: in R I can do it in seconds, whereas doing it the normal loop way in Perl takes ages. Is there a method to match R's speed for this operation? A short example is below:

@Array_1 = ("a1","a2","a3","a4","a5","a6");
@Array_2 = ("a1","b2","c3","a4","f5","a6");
@Array_3 = ("1","2","3","4","5","6");

# desired output
@Array_2 = ("a1","a4","a6");
@Array_3 = ("1","4","6");

If I am doing this operation on very large arrays, is there a faster method than explicit loops (I usually use foreach)? Thanks for sharing your wisdom.

Replies are listed 'Best First'.
Re: Match speed of R in array processing
by choroba (Cardinal) on Mar 28, 2012 at 15:38 UTC
    Use hashes:
    #!/usr/bin/perl
    use warnings;
    use strict;

    my @Array_1 = qw(a1 a2 a3 a4 a5 a6);
    my @Array_2 = qw(a1 b2 c3 a4 f5 a6);
    my @Array_3 = qw( 1  2  3  4  5  6);

    # convert to hashes
    my (%h1, %h2, %h3);
    undef @h1{@Array_1};
    my @idxs = 1 .. @Array_1;
    @h2{@Array_2} = @idxs;
    @h3{@idxs}    = @Array_3;

    # do the calculations
    my %h_tmp = %h2;
    delete @h_tmp{@Array_1};
    delete @h2{keys %h_tmp};

    # output
    ($,, $\) = (' ', "\n");
    print keys %h2;
    print @h3{values %h2};
Re: Match speed of R in array processing
by BrowserUk (Patriarch) on Mar 28, 2012 at 16:34 UTC

    You don't say how big your arrays are, but once the data is generated, this script takes just over 3/4 of a second to index 1,000,000 items (@a) and then to check 100,000 items (@b) against that index, deleting 198 items from the two arrays:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    our $A //= 1e6;
    our $B //= 1e5;

    my @alphas = 'aaa' .. 'zzz';
    my @numers = '0' .. '9';

    print "Gening data...";
    my @a = map { $alphas[ rand @alphas ] . $numers[ rand @numers ] } 1 .. $A;
    my @b = map { $alphas[ rand @alphas ] . $numers[ rand @numers ] } 1 .. $B;
    my @c = 1 .. @b;

    print "Starting...";
    my $start = time;

    print "Indexing...";
    ## index array @a
    my %a;
    undef $a{ $_ } for @a;

    print "Removing...";
    ## walk @b backwards so splicing does not disturb pending indices;
    ## drop elements of @b (and the parallel @c) that are not in the index
    for my $i ( reverse 0 .. $#b ) {
        unless ( exists $a{ $b[ $i ] } ) {
            splice @b, $i, 1;
            splice @c, $i, 1;
        }
    }

    printf "Took %.6f seconds to check %d (\@b) against %d (\@a) and remove %d items\n",
        time() - $start, $B, $A, $B - @b;

    __END__
    C:\test>962209 -A=1e6 -B=1e5
    Gening data...
    Starting...
    Indexing...
    Removing...
    Took 0.771435 seconds to check 100000 (@b) against 1000000 (@a) and remove 198 items


      The arrays have a couple of million elements, but the scripts above are a big improvement in speed over what I have been using so far. Thanks to all of you for the help.

Re: Match speed of R in array processing
by tobyink (Canon) on Mar 28, 2012 at 16:00 UTC

    R is specifically optimized for this sort of thing. Perl is not. As choroba says, you'll probably be able to do it faster using hashes.

    You might also want to look at PDL which is a Perl extension for working on large, N-dimensional numeric arrays. It takes a bit of getting used to, but is fast.

Re: Match speed of R in array processing
by davido (Cardinal) on Mar 28, 2012 at 16:21 UTC

    Using a hash as a lookup table, you end up iterating over @Array_1 once (creating the hash) and over the indices of @Array_2 once. That's O(m+n).

    my @Array_1 = ("a1","a2","a3","a4","a5","a6");
    my @Array_2 = ("a1","b2","c3","a4","f5","a6");
    my @Array_3 = ("1","2","3","4","5","6");

    my %crossref;
    @crossref{@Array_1} = ();

    my( @result_2, @result_3 );
    foreach my $ix ( 0 .. $#Array_2 ) {
        next if ! exists $crossref{$Array_2[$ix]};
        push @result_2, $Array_2[$ix];
        push @result_3, $Array_3[$ix];
    }

    Dave

Re: Match speed of R in array processing
by moritz (Cardinal) on Mar 28, 2012 at 15:22 UTC

      I haven't tried it yet, but I would usually do it like this:

      foreach my $i (@Array_1) {
          # list-context grep returns the matching indices; use eq, not ==, for strings
          push @m, grep { $Array_2[$_] eq $i } 0 .. $#Array_2;
      }

      Once I have the indexes I just delete all the elements with that index in Array 2 and 3. I haven't used perl for a while but as far as I remember I was shocked how fast R does that sort of things. But that must be my not knowing perl enough.
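      For the record, deleting by a list of indices only works cleanly if you splice from the highest index downwards; otherwise each splice shifts the positions of everything after it. A minimal sketch, with the index list hard-coded for illustration:

```perl
use strict;
use warnings;

my @Array_2    = qw(a1 b2 c3 a4 f5 a6);
my @Array_3    = qw(1 2 3 4 5 6);
my @delete_idx = (1, 2, 4);    # positions of b2, c3 and f5

# splice from the highest index down, so the remaining indices stay valid
for my $i ( sort { $b <=> $a } @delete_idx ) {
    splice @Array_2, $i, 1;
    splice @Array_3, $i, 1;
}

print "@Array_2\n";    # a1 a4 a6
print "@Array_3\n";    # 1 4 6
```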

        One problem with your code is that it scales as O(m * n), where m == scalar @Array_1 and n == scalar @Array_2.

        Here's a solution that runs in O(m + n) instead, which should be much faster for large arrays:

        use strict;
        use warnings;
        use 5.010;    # only needed for say()

        my @Array_1 = ("a1","a2","a3","a4","a5","a6");
        my @Array_2 = ("a1","b2","c3","a4","f5","a6");
        my @Array_3 = ("1","2","3","4","5","6");

        my %seen;
        @seen{@Array_1} = ();
        my @idx = grep exists $seen{ $Array_2[$_] }, 0 .. $#Array_2;
        @Array_2 = @Array_2[@idx];
        @Array_3 = @Array_3[@idx];
        say "@Array_2";
        say "@Array_3";

        There are several other ways to write that same algorithm (for example you could use splice to delete array elements one by one in-place, or push onto two new arrays in parallel), but only a benchmark shows which one is fastest.
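        To actually run such a benchmark, the core Benchmark module's cmpthese is the usual tool. A minimal sketch comparing a grep-plus-slice version against a push-into-new-arrays loop; the data pool, array sizes and sub names here are invented for illustration:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# throwaway test data (sizes chosen arbitrarily)
my @pool    = map { "id$_" } 1 .. 100_000;
my @Array_1 = @pool[ 0 .. 49_999 ];
my @Array_2 = map { $pool[ int rand @pool ] } 1 .. 100_000;
my @Array_3 = 1 .. @Array_2;

# build the lookup hash once, outside the timed subs
my %seen;
@seen{@Array_1} = ();

cmpthese( -1, {
    grep_slice => sub {
        my @idx = grep exists $seen{ $Array_2[$_] }, 0 .. $#Array_2;
        my @r2  = @Array_2[@idx];
        my @r3  = @Array_3[@idx];
    },
    push_loop => sub {
        my ( @r2, @r3 );
        for my $i ( 0 .. $#Array_2 ) {
            next unless exists $seen{ $Array_2[$i] };
            push @r2, $Array_2[$i];
            push @r3, $Array_3[$i];
        }
    },
} );
```

        cmpthese(-1, ...) runs each sub for about one CPU second and prints a table of rates and relative percentages, so you can see directly which variant wins on your data.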