Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Fast Way to find one array in second array

by melmoth (Acolyte)
on Aug 14, 2016 at 05:53 UTC ( #1169736=perlquestion: print w/replies, xml ) Need Help??

melmoth has asked for the wisdom of the Perl Monks concerning the following question:

I have an array of array references or two dimensional array called paths. It can contain anywhere from 2 - 163888 array references, and each reference points to an array with anywhere from 2 - 12 elements. The code below works but when paths contains a lot of references it is too slow. My purpose is to remove arrays that have all their elements in another array. For example given

@array1 = [A,B,F,G] @array2 = [A, B, C, D, E, F, G]

then array1 should be removed because all of its elements are in array2. My strategy was to sort paths by the size of the arrays ( smallest to largest ) and then for each array loop over all the other arrays to check if the smaller array is contained in a larger array. As soon as we find that it is contained in another array we remove it, and then check the next smallest array, and so on. But 32488 arrays is a lot to go over like this. I need a faster way. If someone knows how to do this I'd really appreciate it. thanks. - Robert

my @filtered; @paths = sort { @$a <=> @$b } @paths; LINE: for ( my $i = 0; $i < scalar @paths; $i++ ) { my $path = $paths[$i]; my %nodes; @nodes{@$path} = (); for ( my $j = $i + 1; $j < scalar @paths; $j++ ) { my $path_b = $paths[$j]; my $c = grep { exists $nodes{$_} } @$path_b; next LINE if $c == scalar @$path; } push @filtered, $path; }

Replies are listed 'Best First'.
Re: Fast Way to find one array in second array (Update: < 35 seconds.)
by BrowserUk (Patriarch) on Aug 14, 2016 at 07:35 UTC

    Running the inner loop forward ensure earliest short-circuit and cut time to 1/4.

    This will work regardless of the data and takes just under 2 minutes 35 seconds for a maximum sized AoA.

    #! perl -slw use strict; use Time::HiRes qw[ time ]; use Data::Dump qw[ pp ]; ## Generate test data my @paths = map[ map chr(65+rand 26), 2 .. 2+rand(11) ], 1 .. 32448; my $start = time; ## Sort by length of subarray; shortest last to facilitate deletion in + a loop @paths = sort{ $#{ $b } <=> $#{ $a } } @paths; #pp \@paths; ## Build an index to the elements in all the subarrays my( $i, %index ) = 0; for my $ref ( @paths ){ $index{ $_ } //= ++$i for @$ref; } #pp \%index; ## Use the index to build bitmaps representing the subarrays my @bitmaps = ( '' ) x @paths; for my $n ( 0 .. $#paths ) { vec( $bitmaps[ $n ], $index{ $_ }, 1 ) = 1 for @{ $paths[ $n ] }; } #pp \@bitmaps; ## bitwise OR the bitmaps to discover wholly contained subarrays and r +emove them OUTER:for my $i ( reverse 0 .. $#paths ) { for my $j ( 0 .. $i-1 ) { ## Remove rev +erse; short-circuit earlier. if( ( $bitmaps[ $j ] | $bitmaps[ $i ] ) eq $bitmaps[ $j ] ) { # warn "deleting $i"; delete $paths[ $i ]; next OUTER; } } } ## remove undefs @paths = grep defined, @paths; printf "Deleted %u subarrays\n", 32448 - @paths; printf "Took %.6f seconds\nEnter to see results:", time() - $start; <STDIN>; pp \@paths; __END__ C:\test>1169736.pl Deleted 22228 subarrays Took 107.267549 seconds Enter to see results:Terminating on signal SIGINT(2) C:\test>1169736.pl Deleted 22024 subarrays Took 108.820184 seconds Enter to see results:Terminating on signal SIGINT(2)

    It works by building an index to the items in the subarrays; uses that index to build bitmaps representing the subarrays; the cross compares the bitmaps using bitwise-OR to determine wholly contained subarrays and deletes them.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Interesting.
      I ran this on my ancient Win32 XP laptop and got:
      C:\Projects_Perl\testing>perl browseruk.pl Deleted 22240 subarrays Took 39.828125 seconds Enter to see results:
      this code is "fast".

      this solution is definitely one that works as some of the other solutions do not preserve the order of the input arrays. this solution both removes the arrays whose set of elemetns are continaed within a second array, and does so without changing the order the elements in the filtered arrays.

Re: Fast Way to find one array in second array
by Athanasius (Archbishop) on Aug 14, 2016 at 07:13 UTC

    Hello melmoth,

    Here’s a more Perlish version of your code, using Set::Tiny for the convenience of its is_subset method:

    use strict; use warnings; use Set::Tiny 'set'; my @paths = ( [ qw(A B F G) ], [ qw(A B) ], [ qw(A B X) ], [ qw(A B C D E F G) ], [ qw(I J K L M) ], ); my @sets; push @sets, set(@$_) for @paths; @sets = sort { $a->size <=> $b->size } @sets; # UPDATED my @filtered; LINE: for my $i (0 .. $#sets) { my $set_i = $sets[$i]; for my $j ($i + 1 .. $#sets) { next LINE if $set_i->is_subset($sets[$j]); } push @filtered, $set_i; } print $_->as_string, "\n" for @filtered;

    <Update> Added sort line marked # UPDATED as per BrowserUk’s post below. </Update>

    Now (other things being equal), the larger the set J, the more likely it is that J is a superset of a given set I. So an obvious speedup to try is to test I against the larger sets first, by reverseing the inner loop:

    for my $j (reverse($i + 1 .. $#sets))

    Of course, you’ll need to Benchmark the code to determine whether this makes a significant difference for your input data.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      That doesn't seem to produce the correct results. Given this input:

      C:\test>1169736-st.pl [ ["S", "N", "H", "A", "C", "J", "D", "E", "Z", "L"], ["S", "Y", "Y", "O", "D", "M", "G", "W", "F", "U"], ["Z", "Z", "P", "K", "G", "H", "V", "A", "J", "C"], ["P", "E", "E", "V", "M", "E", "N", "T", "K", "H"], ["I", "L", "H", "Z", "H", "T", "O", "F", "T", "V"], ["I", "Y", "H", "I", "D", "T", "V", "S", "P"], ["G", "D", "A", "B", "U", "W", "F", "D", "O"], ["L", "J", "B", "P", "U", "U", "N", "H"], ["B", "A", "X", "H", "H", "P", "R", "V"], ["M", "F", "T", "M", "L", "Y", "T", "C"], ["L", "C", "Y", "X", "O", "I", "M", "J"], ["K", "T", "P", "O", "J", "D", "F"], ["R", "T", "S", "M", "D", "J", "V"], ["D", "Y", "D", "X", "S", "H"], ["N", "O", "P", "P", "E"], ["U", "N", "Z", "T", "I"], ["B", "Z", "R", "D", "W"], ["N", "X", "A", "Z", "O"], ["T", "R", "O", "B", "L"], ["X", "V", "T", "E"], ["V", "P", "M"], ["V", "Q", "R"], ["V", "D", "C"], ["S", "H", "L"], ["P", "N", "Q"], ["A", "A"], ["R", "M"], ["O"], ["S"], ["N"], ["N"], ["C"], ]

      It produces:

      [ bless({ A => undef, C => undef, D => undef, E => undef, H => undef, +J => undef, L => undef, N => undef, S => undef, Z => undef }, "Set::T +iny"), bless({ D => undef, F => undef, G => undef, M => undef, O => undef, +S => undef, U => undef, W => undef, Y => undef }, "Set::Tiny"), bless({ A => undef, C => undef, G => undef, H => undef, J => undef, +K => undef, P => undef, V => undef, Z => undef }, "Set::Tiny"), bless({ E => undef, H => undef, K => undef, M => undef, N => undef, +P => undef, T => undef, V => undef }, "Set::Tiny"), bless({ F => undef, H => undef, I => undef, L => undef, O => undef, +T => undef, V => undef, Z => undef }, "Set::Tiny"), bless({ D => undef, H => undef, I => undef, P => undef, S => undef, +T => undef, V => undef, Y => undef }, "Set::Tiny"), bless({ A => undef, B => undef, D => undef, F => undef, G => undef, +O => undef, U => undef, W => undef }, "Set::Tiny"), bless({ B => undef, H => undef, J => undef, L => undef, N => undef, +P => undef, U => undef }, "Set::Tiny"), bless({ A => undef, B => undef, H => undef, P => undef, R => undef, +V => undef, X => undef }, "Set::Tiny"), bless({ C => undef, F => undef, L => undef, M => undef, T => undef, +Y => undef }, "Set::Tiny"), bless({ C => undef, I => undef, J => undef, L => undef, M => undef, +O => undef, X => undef, Y => undef }, "Set::Tiny"), bless({ D => undef, F => undef, J => undef, K => undef, O => undef, +P => undef, T => undef }, "Set::Tiny"), bless({ D => undef, J => undef, M => undef, R => undef, S => undef, +T => undef, V => undef }, "Set::Tiny"), bless({ D => undef, H => undef, S => undef, X => undef, Y => undef } +, "Set::Tiny"), bless({ E => undef, N => undef, O => undef, P => undef }, "Set::Tiny +"), bless({ I => undef, N => undef, T => undef, U => undef, Z => undef } +, "Set::Tiny"), bless({ B => undef, D => undef, R => undef, W => undef, Z => undef } +, "Set::Tiny"), bless({ A => undef, N => undef, O => undef, X => undef, Z => undef } +, "Set::Tiny"), bless({ B => undef, L => undef, O => undef, R => undef, T => undef } +, "Set::Tiny"), bless({ E => undef, T => undef, V => undef, X => undef }, "Set::Tiny +"), bless({ M => undef, P => undef, V => undef }, "Set::Tiny"), bless({ Q => undef, R => undef, V => undef }, "Set::Tiny"), bless({ C => undef, D => undef, V => undef }, "Set::Tiny"), bless({ H => undef, L => undef, S => undef }, "Set::Tiny"), bless({ N => undef, P => undef, Q => undef }, "Set::Tiny"), bless({ A => undef }, "Set::Tiny"), bless({ M => undef, R => undef }, "Set::Tiny"), bless({ O => undef }, "Set::Tiny"), bless({ S => undef }, "Set::Tiny"), bless({ N => undef }, "Set::Tiny"), bless({ C => undef }, "Set::Tiny"), ]

      At least the last 6 subsets should not have made it into @filtered.

      It should produce (+-the blessing):

      [ ["S", "N", "H", "A", "C", "J", "D", "E", "Z", "L"], ["S", "Y", "Y", "O", "D", "M", "G", "W", "F", "U"], ["Z", "Z", "P", "K", "G", "H", "V", "A", "J", "C"], ["P", "E", "E", "V", "M", "E", "N", "T", "K", "H"], ["I", "L", "H", "Z", "H", "T", "O", "F", "T", "V"], ["I", "Y", "H", "I", "D", "T", "V", "S", "P"], ["G", "D", "A", "B", "U", "W", "F", "D", "O"], ["L", "J", "B", "P", "U", "U", "N", "H"], ["B", "A", "X", "H", "H", "P", "R", "V"], ["M", "F", "T", "M", "L", "Y", "T", "C"], ["L", "C", "Y", "X", "O", "I", "M", "J"], ["K", "T", "P", "O", "J", "D", "F"], ["R", "T", "S", "M", "D", "J", "V"], ["D", "Y", "D", "X", "S", "H"], ["N", "O", "P", "P", "E"], ["U", "N", "Z", "T", "I"], ["B", "Z", "R", "D", "W"], ["N", "X", "A", "Z", "O"], ["T", "R", "O", "B", "L"], ["X", "V", "T", "E"], ["V", "Q", "R"], ["V", "D", "C"], ["P", "N", "Q"], ]

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
      In the absence of evidence, opinion is indistinguishable from prejudice.
      U, Science is about questioning the status quo. Questioning authority

        You’re right, thanks! I forgot to include the OP’s code for sorting the path arrays according to number of elements. Node updated:

        ... @paths = sort { @$a <=> @$b } @paths; my @sets; push @sets, set(@$_) for @paths; my @filtered; ...

        Now the output for your input data is as follows:

        Update: Since sets contain no duplicate elements, it will be better to sort on number of elements after the arrays have been converted into sets:

        ... my @sets; push @sets, set(@$_) for @paths; @sets = sort { $a->size <=> $b->size } @sets; my @filtered; ...

        Output:

        Cheers,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Fast Way to find one array in second array
by Laurent_R (Canon) on Aug 14, 2016 at 07:50 UTC
    I know this does not answer your immediate question, but rather tries to address the underlying problem.

    If it is too slow, then I presume that you are repeatedly traversing sequentially the outer array over and over again to look up for data. Depending on what your inner data elements really are, you might consider another data structure.

    One possibility is to use a hash instead of an array for your outer "paths" data structure. The idea would be to sort the inner arrays, stringify them (with an appropriate data separator) and use that stringified version as keys for a hash (hash values would probably be insignificant, or they could contain array refs, depending on how you are using your data structure).

    The structure might look like this:

    my %paths = ( "A B F G" => 1, "A B" => 1, "A B X" => 1, "A B C D E F G" => 1, "I J K L M" => 1 );
    Then, when you want to do a look up for some data elements, you just need to sort and stringify your data element list and you'll have instant access to the relevant data.

    This may or may not be an appropriate solution to your performance problem, it really depends on what you are normally doing with your data structure, on the nature of your data items, and quite a few other things that you did not describe or specify. So much more details would be needed to figure out whether this solution might be workable. If not, there may still be a variation on that solution which would work for your process.

      Yes, with a nested loop the problem is that you have to complete the inner loop before iterating the outer loop. So the inner loop is the problem. I've had time to look more closely at the data and have discovered some cases where there's a directed acyclic graph ( no multiedges ) with one source vertex, 21 vertices, 206 edges, and 165888 different paths!! The outer loop will traverse all those different arrays once whereas the inner loop will traverse them repeatedly doing list comparisons each time. Some of the suggestions here involve traversing the arrays just once and converting the data into a matrix that can be easily used to do list comparisons. Each of those 21 vertices will have a coordinate xy or list of coordinates giving its location in the matrix and so we can skip all the travsering. Makes sense to me so I'll try that one first. Thanks everyone for your help some good ideas.

Re: Fast Way to find one array in second array
by BrowserUk (Patriarch) on Aug 14, 2016 at 06:47 UTC

    Are the inner elements single characters as shown; or is that a placeholder for longer strings or numbers or ...?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.

      not sure what your asking, or if it's for me, but in the problem I raised each vertex is a string of alphanumeric characters

        or if it's for me

        I replied to your post; it is for you.

        not sure what your asking,

        Your data showed single characters: @array1 = [A,B,F,G]; if that was representative of your real data, there is an optimisation that could be used.

        each vertex is a string of alphanumeric characters

        The solution I posted will work for strings (or any other kind of data) and is the fastest solution possible. Over 50 times faster than 1169738; and correct.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fast Way to find one array in second array
by LanX (Sage) on Aug 14, 2016 at 18:28 UTC
    Please edit your node and put code tags around your arrays , besides @array=[] is broken syntax.

    Concerning your question, the fastest way would be to mimic the algorithm index is using in its implementation. (Forgot the name...)

    For this you need to prepare jump tables for each array.

    edit

    A pragmatic approach would be to generate a string for each array, something like concatenation of the first character.

    If the first-character string isn't found with index you can proceed, otherwise index gives you a hint where to compare both arrays!

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

    ) here it is Boyer-Moore_string_search_algorithm

      right so if one string is: ABCDFG and the second is ABCDEFG then index will return -1 and I won't know that the first array represented by ABCDFG should be deleted. That would work if I were looking for paths that are segments of larger paths.

Re: Fast Way to find one array in second array (updated)
by LanX (Sage) on Aug 14, 2016 at 21:20 UTC
    Let X < Y denote set X included in set Y.

    Instead of speeding up the process to check X < Y , let's reduce the potential number of such checks.

    At the moment you are ordering all sets S and testing for each X if you find a bigger Y with X < Y, and then you go on with the next X'.

    This means a worst case of n(n-1)/2 tests for n sets (no inclusions found)

    And at some point you start checking all sets bigger Y to find a Z with Y < Z.

    BUT

    X < Y < Z   =>   X < Z

    So instead of stopping after finding Y record all X* := { S | X < S } and push X into Y := { S | S < Y }.

    Once you reach Y you'll only have to check Y against the sets S in the intersection of all X* for X in Y.

    The worst case of this approach is the same, but I'm sure it'll be faster on average.

    Such algorithms can certainly be found in literature, but I don't have the time to check now.

    HTH

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

    ) I suppose this approach is more efficient if you stop growing a X* after it reached a certain size - like n/2 or so - and take X out of consideration.

    ) probably looking for keyword "posets" = partially ordered sets

    update

    On second thought, restricting to calculate X* only for those X ( atoms ) which don't have a lower neighbor, i.e. there is no X' < X  <=>   X is empty so far , should be good enough.

    Then - visually speaking - those X with an X* form "axes" of the poset and the Y are the "coordinates" of Y - like when embedded in a n* dimensional cube.

    I'll implement this after YAPC, feel free to do earlier...

    update

    In hindsight for this particular task it might be more efficient to apply this to the dual poset. That is starting with the biggest sets and on the fly calculating the coatoms Y, which have no superset. For any S with S < X there must be a coatom with S < X <= Y.

      Let X < Y denote set X included in set Y.

      This is called a subset and there is already a character used for it so you don't actually need to overload the < character. Even better, it is a named entity in HTML so even unicode-unfriendly sites can use it. The entity is called "sub" and here it is in use:

      X ⊂ Y

      If you use this to mean "X is a subset of Y" then the meaning should be clear to anyone familiar with set theory. HTH.

        If you use this to mean "X is a subset of Y" then the meaning should be clear to anyone familiar with set theory.

        All of which will make his diatribe more 'correct' in certain circles but just as meaningless.

        Set theory is as much help to anyone implementing an efficient practical algorithm as Bernoulli principle is to someone constructing a model plane.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.
        In mathematics it's common to redefine symbols if needed.

        Like when you can't easily access unicode characters ... :)

        There are also already other symbols in use for X and X*

        edit

        Anyway < is the common symbol for the binary relation of posets ...

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: Fast Way to find one array in second array
by Anonymous Monk on Aug 15, 2016 at 20:23 UTC

    Across the "arrays", the problem size is around 160k. What about the other dimension? How many different "path elements"?

    In case the dimension of path elements is large, then the (bit)vectors to represent your sets are going to be very sparse (with up to 12 elements chosen). See Comparing two arrays for a treatment of a similar problem.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1169736]
Approved by Athanasius
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (6)
As of 2023-05-29 19:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?