Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

I'm trying to figure out the fastest way to search over an array a number of times looking for different things. Once I've found what I want, I wont look for them again. e.g. first iteration I look for all elements with A in them, second iteration I look for all elements with B in them, and so on.

I thought grepping the array for what I was looking for then ! grepping the same search and putting those results back into my original array would be fastest, but on paper its probably slower than leaving the array intact and grepping the entire thing every time.

Can you grep and remove simultaneously? Is there another way to do this faster than grepping the entire array every time (my array is around 200M?

Thanks!

  • Comment on Fastest way to grep multiple time on some array

Replies are listed 'Best First'.
Re: Fastest way to grep multiple time on some array
by GrandFather (Saint) on Feb 19, 2009 at 07:48 UTC

    If you want good answers then you need to ask good questions. "best", "fastest" and similar ...st criteria generally depend on the details of what you want to do. Very often there is no '...st' answer, even in a well defined context. Taking that on board, would you care to rephrase your question providing a little more detail concerning the matches you wish to perform? Some important details are:

    • Are you matching fixed strings?
    • How many different match strings are there?
    • How many elements do you need to match against (is 200M 200e6 elements, 200 MB, something else)?
    • Is this a one time task, or do you need to run the matching process many times?
    • What are the actual time constraints?
    • Is there existing code that this needs to work with?
    • Does the data come from something like a file or do they come from a database?
    • Is it possible to put the data into a database if it isn't already?

    True laziness is hard work
Re: Fastest way to grep multiple time on some array
by CountZero (Bishop) on Feb 19, 2009 at 07:21 UTC
    It is not really fully clear what you want:
    • grep through the list and then further grep within the results from the first grep: use a chain of grep:
      @final_result = grep /first condition/, grep /second condition/, @orig +inal_array;
      With a little bit of effort you can probably "fold" both conditions into one. That would be more efficient.
    • grep through the list and then grep through the same list excluding already found elements: I suggest that you work with an array of arrays. Make each element into an anonymous array, with element 0 (zero) the original data and element 1 a "flag" indicating that it is an element that was already selected in a previous search (1 would mean found in the first search; 2 means found in the second search; ...). Only you should use map rather than search in this case as you will loop over the same list each time. Finally you have to loop over the whole list once more to put each found element in its own "bin" based upon the value of the flag.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Fastest way to grep multiple time on some array
by grizzley (Chaplain) on Feb 19, 2009 at 07:03 UTC
    Use hash to store under key 'A' values matching A, under key 'B' values matching B, etc.
Re: Fastest way to grep multiple time on some array
by Porculus (Hermit) on Feb 19, 2009 at 09:26 UTC

    From the description, it sounds like you're taking a big array and trying to split it into a number of smaller arrays. I'd have expected the most efficient approach to be to do everything in a single pass, rather than looping over the array several times.

    That is to say, instead of

    my @As = grep { /A/ } @bigarray; my @Bs = grep { /B/ } @bigarray; my @Cs = grep { /C/ } @bigarray;

    I would write

    my (@As, @Bs, @Cs); for my $item (@bigarray) { if ($item =~ /A/) { push @As, $item } elsif ($item =~ /B/) { push @Bs, $item } elsif ($item =~ /C/) { push @Cs, $item } }
      or even
      my %H; for my $item (@bigarray) { push @{$H{$1}}, $item if ($item =~ /(A|B|C)/); }


      holli

      When you're up to your ass in alligators, it's difficult to remember that your original purpose was to drain the swamp.
        I think this would fail if one of the items in bigarray was "AB", as it would only add this item to $H{"A"} and not $H{"B"}.

        Alas we don't know for sure whether this would be a problem for the OP or not, but re-reading the post suggests maybe not.

        --
        use JAPH;
        print JAPH::asString();

Re: Fastest way to grep multiple time on some array
by ikegami (Patriarch) on Feb 19, 2009 at 18:22 UTC

    Can you grep and remove simultaneously?

    sub grep_in_place(&\@) { my ($cb, $array) = @_; my $src = -1; my $dst = -1; while (++$src < @$array) { my $keep; $keep = $cb->() for $array->[$src]; $array->[++$dst] = $array->[$src] if $keep; } $#$array = $dst; } grep_in_place { ... } @array;

    Uses O(1) memory

    This could be done faster using XS since the the elements could be moved instead of copied.

    Update: Fixed the proto. Added disappeared if $keep.

      It doesn't quite work. grep_and_remove() should do the trick:
      use warnings; use strict; use Data::Dumper; sub grep_in_place(&@) { my ($cb, $array) = @_; my $src = -1; my $dst = -1; while (++$src < @$array) { my $keep; $keep = $cb->() for $array->[$src]; $array->[++$dst] = $array->[$src]; } $#$array = $dst; } sub grep_and_remove (&\@) { my ($cb, $array_ref) = @_; my @found; my $i = 0; GREP: while ($i < @{ $array_ref }) { $cb->() and push(@found, splice(@{ $array_ref }, $i, 1)) and next GREP for $array_ref->[$i]; ++$i; } return @found; } my @arr1 = (1,2,3,4,5,6,7,8); my @arr2 = grep_in_place { $_ > 4 } @arr1; my @arr3 = (1,2,3,4,5,6,7,8); my @arr4 = grep_and_remove { $_ > 4 } @arr3; print Dumper(\@arr1, \@arr2, \@arr3, \@arr4); __END__ $VAR1 = [ 1, 2, 3, 4, 5, 6, 7, 8 ]; $VAR2 = [ '-1' ]; $VAR3 = [ 1, 2, 3, 4 ]; $VAR4 = [ 5, 6, 7, 8 ];

        You used it wrong, but I see I used the wrong prototype, and my "if $keep" disappeared.

        use strict; use warnings; sub grep_in_place(&\@) { my ($cb, $array) = @_; my $src = -1; my $dst = -1; while (++$src < @$array) { my $keep; $keep = $cb->() for $array->[$src]; $array->[++$dst] = $array->[$src] if $keep; } $#$array = $dst; } my @a = (1,2,3,4,5,6,7,8); print("Before: @a\n"); grep_in_place { $_ > 4 } @a; print("After: @a\n");
        Before: 1 2 3 4 5 6 7 8 After: 5 6 7 8

        To get your (and probably the OP's) semantics:

        sub grep_and_remove(&\@) { my ($cb, $array) = @_; my $src = -1; my $dst = -1; my @deleted; while (++$src < @$array) { my $keep; $keep = $cb->() for $array->[$src]; ($keep ? $deleted[@deleted] : $array->[++$dst] ) = $array->[$src]; } $#$array = $dst; return @deleted; } my @a = (1,2,3,4,5,6,7,8); print("Before: @a\n"); my @b = grep_and_remove { $_ > 4 } @a; print("After: @a\n"); print("Returned: @b\n");
        Before: 1 2 3 4 5 6 7 8 After: 1 2 3 4 Returned: 5 6 7 8

        No crazy splice!

Re: Fastest way to grep multiple time on some array
by gone2015 (Deacon) on Feb 19, 2009 at 17:15 UTC
    Can you grep and remove simultaneously?

    I don't think so... I tried:

    my @A = grep m/A/ && !defined($_ = undef), @foo ;
    and ended up with @A containing 'n' undef values...

    The following at least works:

    use strict ; use warnings ; my @foo = ('A', 'B', 'C', 'A', 'B', 'A', 'D') ; my $q ; my @A = map defined($_) && m/A/ ? scalar($q = $_, $_ = undef, $q) : +(), @foo ; my @B = map defined($_) && m/B/ ? scalar($q = $_, $_ = undef, $q) : +(), @foo ; my @C = map defined($_) && m/C/ ? scalar($q = $_, $_ = undef, $q) : +(), @foo ; @foo = grep defined($_), @foo ; print "A: @A\n" ; print "B: @B\n" ; print "C: @C\n" ; print " @foo\n" ;
    BUT I'm making no claim for it efficiency-wise. (Also, it's ugly as sin.)

Re: Fastest way to grep multiple time on some array
by johngg (Canon) on Feb 20, 2009 at 00:29 UTC

    If you work your way forwards from the end of the array, you can splice out the elements that match as you go, thus removing them from the array so they are not examined again. I store the results in a HoA keyed by letter.

    use strict; use warnings; use Data::Dumper; my @arr; push @arr , ( q{A} .. q{Z}, q{0} .. q{9} )[ int rand 36 ] . sprintf( q{%04d}, int rand 10000 ) for 1 .. 25; print Data::Dumper ->Dumpxs( [ \ @arr ], [ qw{ *arr } ] ); my %letters = (); foreach my $letter ( q{A} .. q{Z} ) { last unless @arr; my $rxLetter = qr{$letter}; foreach my $idx ( reverse 0 .. $#arr ) { next unless $arr[ $idx ] =~ $rxLetter; push @{ $letters{ $letter } }, splice @arr, $idx, 1; } } print Data::Dumper ->new( [ \ %letters, \ @arr ], [ qw{ *letters *arr } ] ) ->Sortkeys( 1 ) ->Dumpxs();

    The results.

    I hope this is of interest. I'm not sure what the performance will be like, I gather that splice can involve a lot of work behind the scenes.

    Cheers,

    JohnGG