Fast Way to find one array in second array

melmoth has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Fast Way to find one array in second array (Update: < 35 seconds.)
by BrowserUk (Patriarch) on Aug 14, 2016 at 07:35 UTC

Running the inner loop forward ensure earliest short-circuit and cut time to 1/4.

This will work regardless of the data and takes just under ~~2 minutes~~ 35 seconds for a maximum sized AoA.

#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use Data::Dump qw[ pp ];

## Generate test data
my @paths = map[ map chr(65+rand 26), 2 .. 2+rand(11) ], 1 .. 32448;

my $start = time;

## Sort by length of subarray; shortest last to facilitate deletion in
+ a loop
@paths = sort{ $#{ $b } <=> $#{ $a } } @paths;
#pp \@paths;

## Build an index to the elements in all the subarrays
my( $i, %index ) = 0;
for my $ref ( @paths ){
    $index{ $_ } //= ++$i for @$ref;
}
#pp \%index;

## Use the index to build bitmaps representing the subarrays
my @bitmaps = ( '' ) x @paths;
for my $n ( 0 .. $#paths ) {
    vec( $bitmaps[ $n ], $index{ $_ }, 1 ) = 1 for @{ $paths[ $n ] };
}
#pp \@bitmaps;

## bitwise OR the bitmaps to discover wholly contained subarrays and r
+emove them
OUTER:for my $i ( reverse 0 .. $#paths ) {
    for my $j ( 0 .. $i-1 ) {                            ## Remove rev
+erse; short-circuit earlier.
        if( ( $bitmaps[ $j ] | $bitmaps[ $i ] ) eq $bitmaps[ $j ] ) {
#            warn "deleting $i";
            delete $paths[ $i ];
            next OUTER;
        }
    }
}

## remove undefs
@paths = grep defined, @paths;

printf "Deleted %u subarrays\n", 32448 - @paths;
printf "Took %.6f seconds\nEnter to see results:", time() - $start;

<STDIN>;
pp \@paths;

__END__
C:\test>1169736.pl
Deleted 22228 subarrays
Took 107.267549 seconds
Enter to see results:Terminating on signal SIGINT(2)

C:\test>1169736.pl
Deleted 22024 subarrays
Took 108.820184 seconds
Enter to see results:Terminating on signal SIGINT(2)
[download]

It works by building an index to the items in the subarrays; uses that index to build bitmaps representing the subarrays; the cross compares the bitmaps using bitwise-OR to determine wholly contained subarrays and deletes them.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)

In the absence of evidence, opinion is indistinguishable from prejudice.

[reply]
[d/l]

Re^2: Fast Way to find one array in second array (Update: < 35 seconds.)

by Marshall (Canon) on Aug 14, 2016 at 23:08 UTC

C:\Projects_Perl\testing>perl browseruk.pl
Deleted 22240 subarrays
Took 39.828125 seconds
Enter to see results:
[download]

[reply]
[d/l]

Re^2: Fast Way to find one array in second array (Update: < 35 seconds.)

by melmoth (Acolyte) on Oct 12, 2016 at 14:48 UTC

this solution is definitely one that works as some of the other solutions do not preserve the order of the input arrays. this solution both removes the arrays whose set of elemetns are continaed within a second array, and does so without changing the order the elements in the filtered arrays.

[reply]

Re: Fast Way to find one array in second array
by Athanasius (Archbishop) on Aug 14, 2016 at 07:13 UTC

Hello melmoth,

Here’s a more Perlish version of your code, using Set::Tiny for the convenience of its is_subset method:

use strict;
use warnings;
use Set::Tiny 'set';
 
my @paths =
(
    [ qw(A B F G)       ],
    [ qw(A B)           ],
    [ qw(A B X)         ],
    [ qw(A B C D E F G) ],
    [ qw(I J K L M)     ],
);

my   @sets;
push @sets, set(@$_) for @paths;
     @sets = sort { $a->size <=> $b->size } @sets;    # UPDATED
my   @filtered;

LINE: for my $i (0 .. $#sets)
{
    my $set_i = $sets[$i];

    for my $j ($i + 1 .. $#sets)
    {
        next LINE if $set_i->is_subset($sets[$j]);
    }

    push @filtered, $set_i;
}

print $_->as_string, "\n" for @filtered;
[download]

<Update> Added sort line marked # UPDATED as per BrowserUk’s post below. </Update>

Now (other things being equal), the larger the set J, the more likely it is that J is a superset of a given set I. So an obvious speedup to try is to test I against the larger sets first, by reverseing the inner loop:

for my $j (reverse($i + 1 .. $#sets))
[download]

Of course, you’ll need to Benchmark the code to determine whether this makes a significant difference for your input data.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re^2: Fast Way to find one array in second array

by BrowserUk (Patriarch) on Aug 14, 2016 at 16:50 UTC

That doesn't seem to produce the correct results. Given this input:

C:\test>1169736-st.pl
[
  ["S", "N", "H", "A", "C", "J", "D", "E", "Z", "L"],
  ["S", "Y", "Y", "O", "D", "M", "G", "W", "F", "U"],
  ["Z", "Z", "P", "K", "G", "H", "V", "A", "J", "C"],
  ["P", "E", "E", "V", "M", "E", "N", "T", "K", "H"],
  ["I", "L", "H", "Z", "H", "T", "O", "F", "T", "V"],
  ["I", "Y", "H", "I", "D", "T", "V", "S", "P"],
  ["G", "D", "A", "B", "U", "W", "F", "D", "O"],
  ["L", "J", "B", "P", "U", "U", "N", "H"],
  ["B", "A", "X", "H", "H", "P", "R", "V"],
  ["M", "F", "T", "M", "L", "Y", "T", "C"],
  ["L", "C", "Y", "X", "O", "I", "M", "J"],
  ["K", "T", "P", "O", "J", "D", "F"],
  ["R", "T", "S", "M", "D", "J", "V"],
  ["D", "Y", "D", "X", "S", "H"],
  ["N", "O", "P", "P", "E"],
  ["U", "N", "Z", "T", "I"],
  ["B", "Z", "R", "D", "W"],
  ["N", "X", "A", "Z", "O"],
  ["T", "R", "O", "B", "L"],
  ["X", "V", "T", "E"],
  ["V", "P", "M"],
  ["V", "Q", "R"],
  ["V", "D", "C"],
  ["S", "H", "L"],
  ["P", "N", "Q"],
  ["A", "A"],
  ["R", "M"],
  ["O"],
  ["S"],
  ["N"],
  ["N"],
  ["C"],
]
[download]

It produces:

[
  bless({ A => undef, C => undef, D => undef, E => undef, H => undef, 
+J => undef, L => undef, N => undef, S => undef, Z => undef }, "Set::T
+iny"),
  bless({ D => undef, F => undef, G => undef, M => undef, O => undef, 
+S => undef, U => undef, W => undef, Y => undef }, "Set::Tiny"),
  bless({ A => undef, C => undef, G => undef, H => undef, J => undef, 
+K => undef, P => undef, V => undef, Z => undef }, "Set::Tiny"),
  bless({ E => undef, H => undef, K => undef, M => undef, N => undef, 
+P => undef, T => undef, V => undef }, "Set::Tiny"),
  bless({ F => undef, H => undef, I => undef, L => undef, O => undef, 
+T => undef, V => undef, Z => undef }, "Set::Tiny"),
  bless({ D => undef, H => undef, I => undef, P => undef, S => undef, 
+T => undef, V => undef, Y => undef }, "Set::Tiny"),
  bless({ A => undef, B => undef, D => undef, F => undef, G => undef, 
+O => undef, U => undef, W => undef }, "Set::Tiny"),
  bless({ B => undef, H => undef, J => undef, L => undef, N => undef, 
+P => undef, U => undef }, "Set::Tiny"),
  bless({ A => undef, B => undef, H => undef, P => undef, R => undef, 
+V => undef, X => undef }, "Set::Tiny"),
  bless({ C => undef, F => undef, L => undef, M => undef, T => undef, 
+Y => undef }, "Set::Tiny"),
  bless({ C => undef, I => undef, J => undef, L => undef, M => undef, 
+O => undef, X => undef, Y => undef }, "Set::Tiny"),
  bless({ D => undef, F => undef, J => undef, K => undef, O => undef, 
+P => undef, T => undef }, "Set::Tiny"),
  bless({ D => undef, J => undef, M => undef, R => undef, S => undef, 
+T => undef, V => undef }, "Set::Tiny"),
  bless({ D => undef, H => undef, S => undef, X => undef, Y => undef }
+, "Set::Tiny"),
  bless({ E => undef, N => undef, O => undef, P => undef }, "Set::Tiny
+"),
  bless({ I => undef, N => undef, T => undef, U => undef, Z => undef }
+, "Set::Tiny"),
  bless({ B => undef, D => undef, R => undef, W => undef, Z => undef }
+, "Set::Tiny"),
  bless({ A => undef, N => undef, O => undef, X => undef, Z => undef }
+, "Set::Tiny"),
  bless({ B => undef, L => undef, O => undef, R => undef, T => undef }
+, "Set::Tiny"),
  bless({ E => undef, T => undef, V => undef, X => undef }, "Set::Tiny
+"),
  bless({ M => undef, P => undef, V => undef }, "Set::Tiny"),
  bless({ Q => undef, R => undef, V => undef }, "Set::Tiny"),
  bless({ C => undef, D => undef, V => undef }, "Set::Tiny"),
  bless({ H => undef, L => undef, S => undef }, "Set::Tiny"),
  bless({ N => undef, P => undef, Q => undef }, "Set::Tiny"),
  bless({ A => undef }, "Set::Tiny"),
  bless({ M => undef, R => undef }, "Set::Tiny"),
  bless({ O => undef }, "Set::Tiny"),
  bless({ S => undef }, "Set::Tiny"),
  bless({ N => undef }, "Set::Tiny"),
  bless({ C => undef }, "Set::Tiny"),
]
[download]

At least the last 6 subsets should not have made it into @filtered.

It should produce (+-the blessing):

[
  ["S", "N", "H", "A", "C", "J", "D", "E", "Z", "L"],
  ["S", "Y", "Y", "O", "D", "M", "G", "W", "F", "U"],
  ["Z", "Z", "P", "K", "G", "H", "V", "A", "J", "C"],
  ["P", "E", "E", "V", "M", "E", "N", "T", "K", "H"],
  ["I", "L", "H", "Z", "H", "T", "O", "F", "T", "V"],
  ["I", "Y", "H", "I", "D", "T", "V", "S", "P"],
  ["G", "D", "A", "B", "U", "W", "F", "D", "O"],
  ["L", "J", "B", "P", "U", "U", "N", "H"],
  ["B", "A", "X", "H", "H", "P", "R", "V"],
  ["M", "F", "T", "M", "L", "Y", "T", "C"],
  ["L", "C", "Y", "X", "O", "I", "M", "J"],
  ["K", "T", "P", "O", "J", "D", "F"],
  ["R", "T", "S", "M", "D", "J", "V"],
  ["D", "Y", "D", "X", "S", "H"],
  ["N", "O", "P", "P", "E"],
  ["U", "N", "Z", "T", "I"],
  ["B", "Z", "R", "D", "W"],
  ["N", "X", "A", "Z", "O"],
  ["T", "R", "O", "B", "L"],
  ["X", "V", "T", "E"],
  ["V", "Q", "R"],
  ["V", "D", "C"],
  ["P", "N", "Q"],
]
[download]

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)

In the absence of evidence, opinion is indistinguishable from prejudice.

[reply]
[d/l]
[select]

Re^3: Fast Way to find one array in second array

by Athanasius (Archbishop) on Aug 15, 2016 at 06:20 UTC

You’re right, thanks! I forgot to include the OP’s code for sorting the path arrays according to number of elements. Node updated:

...
     @paths = sort { @$a <=> @$b } @paths;
my   @sets;
push @sets, set(@$_) for @paths;
my   @filtered;
...
[download]

Now the output for your input data is as follows:

Read more... (581 Bytes)

Cheers,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re: Fast Way to find one array in second array
by Laurent_R (Canon) on Aug 14, 2016 at 07:50 UTC

If it is too slow, then I presume that you are repeatedly traversing sequentially the outer array over and over again to look up for data. Depending on what your inner data elements really are, you might consider another data structure.

One possibility is to use a hash instead of an array for your outer "paths" data structure. The idea would be to sort the inner arrays, stringify them (with an appropriate data separator) and use that stringified version as keys for a hash (hash values would probably be insignificant, or they could contain array refs, depending on how you are using your data structure).

The structure might look like this:

my %paths =
(
    "A B F G" => 1,
    "A B" => 1,
    "A B X" => 1,
    "A B C D E F G" => 1,
    "I J K L M" => 1
);
[download]

This may or may not be an appropriate solution to your performance problem, it really depends on what you are normally doing with your data structure, on the nature of your data items, and quite a few other things that you did not describe or specify. So much more details would be needed to figure out whether this solution might be workable. If not, there may still be a variation on that solution which would work for your process.

[reply]
[d/l]

Re^2: Fast Way to find one array in second array

by melmoth (Acolyte) on Aug 14, 2016 at 16:46 UTC

Yes, with a nested loop the problem is that you have to complete the inner loop before iterating the outer loop. So the inner loop is the problem. I've had time to look more closely at the data and have discovered some cases where there's a directed acyclic graph ( no multiedges ) with one source vertex, 21 vertices, 206 edges, and 165888 different paths!! The outer loop will traverse all those different arrays once whereas the inner loop will traverse them repeatedly doing list comparisons each time. Some of the suggestions here involve traversing the arrays just once and converting the data into a matrix that can be easily used to do list comparisons. Each of those 21 vertices will have a coordinate x y or list of coordinates giving its location in the matrix and so we can skip all the travsering. Makes sense to me so I'll try that one first. Thanks everyone for your help some good ideas.

[reply]

Re: Fast Way to find one array in second array
by BrowserUk (Patriarch) on Aug 14, 2016 at 06:47 UTC

Are the inner elements single characters as shown; or is that a placeholder for longer strings or numbers or ...?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)

In the absence of evidence, opinion is indistinguishable from prejudice.

[reply]

Re^2: Fast Way to find one array in second array

by melmoth (Acolyte) on Aug 14, 2016 at 16:51 UTC

not sure what your asking, or if it's for me, but in the problem I raised each vertex is a string of alphanumeric characters

[reply]

Re^3: Fast Way to find one array in second array

by BrowserUk (Patriarch) on Aug 14, 2016 at 21:51 UTC

or if it's for me

I replied to your post; it is for you.

not sure what your asking,

Your data showed single characters: @array1 = [A,B,F,G]; if that was representative of your real data, there is an optimisation that could be used.

each vertex is a string of alphanumeric characters

The solution I posted will work for strings (or any other kind of data) and is the fastest solution possible. Over 50 times faster than 1169738; and correct.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)

In the absence of evidence, opinion is indistinguishable from prejudice.

[reply]
[d/l]

Re^4: Fast Way to find one array in second array

by melmoth (Acolyte) on Aug 14, 2016 at 23:04 UTC

Re: Fast Way to find one array in second array (updated)
by LanX (Saint) on Aug 14, 2016 at 21:20 UTC

Instead of speeding up the process to check X < Y , let's reduce the potential number of such checks.

At the moment you are ordering all sets S and testing for each X if you find a bigger Y with X < Y, and then you go on with the next X'.

This means a worst case of n(n-1)/2 tests for n sets (no inclusions found)

And at some point you start checking all sets bigger Y to find a Z with Y < Z.

BUT

X < Y < Z => X < Z

So instead of stopping after finding Y record all X* := { S | X < S } and push X into Y� := { S | S < Y }.

Once you reach Y you'll only have to check Y against the sets S in the intersection of all X* for X in Y�.

The worst case of this approach is the same, but I'm sure it'll be faster on average.�

Such algorithms can certainly be found in literature, but I don't have the time to check now. �

HTH

Cheers Rolf
_{(addicted to the Perl Programming Language and ☆☆☆☆ :)

Je suis Charlie!}

�) I suppose this approach is more efficient if you stop growing a X* after it reached a certain size - like n/2 or so - and take X out of consideration.

�) probably looking for keyword "posets" = partially ordered sets

update

On second thought, restricting to calculate X* only for those X ( atoms ) which don't have a lower neighbor, i.e. there is no X' < X <=> X� is empty so far , should be good enough.

Then - visually speaking - those X with an X* form "axes" of the poset and the Y� are the "coordinates" of Y - like when embedded in a n* dimensional cube.

I'll implement this after YAPC, feel free to do earlier...

update

In hindsight for this particular task it might be more efficient to apply this to the dual poset. That is starting with the biggest sets and on the fly calculating the coatoms Y, which have no superset. For any S with S < X there must be a coatom with S < X <= Y.

[reply]

Re^2: Fast Way to find one array in second array

by hippo (Bishop) on Aug 15, 2016 at 08:25 UTC

Let X < Y denote set X included in set Y.

This is called a subset and there is already a character used for it so you don't actually need to overload the < character. Even better, it is a named entity in HTML so even unicode-unfriendly sites can use it. The entity is called "sub" and here it is in use:

X ⊂ Y

If you use this to mean "X is a subset of Y" then the meaning should be clear to anyone familiar with set theory. HTH.

[reply]

Re^3: Fast Way to find one array in second array

by BrowserUk (Patriarch) on Aug 15, 2016 at 09:15 UTC

If you use this to mean "X is a subset of Y" then the meaning should be clear to anyone familiar with set theory.

All of which will make his diatribe more 'correct' in certain circles but just as meaningless.

Set theory is as much help to anyone implementing an efficient practical algorithm as Bernoulli principle is to someone constructing a model plane.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)

In the absence of evidence, opinion is indistinguishable from prejudice.

[reply]

Re^3: Fast Way to find one array in second array

by LanX (Saint) on Aug 15, 2016 at 11:30 UTC

Like when you can't easily access unicode characters ... :)

There are also already other symbols in use for X� and X*

edit

Anyway < is the common symbol for the binary relation of posets ...

Cheers Rolf
_{(addicted to the Perl Programming Language and ☆☆☆☆ :)

Je suis Charlie!}

[reply]

Re: Fast Way to find one array in second array
by LanX (Saint) on Aug 14, 2016 at 18:28 UTC

@array=[]

Concerning your question, the fastest way would be to mimic the algorithm index is using in its implementation. (Forgot the name...)�

For this you need to prepare jump tables for each array.

edit

A pragmatic approach would be to generate a string for each array, something like concatenation of the first character.

If the first-character string isn't found with index you can proceed, otherwise index gives you a hint where to compare both arrays!

Cheers Rolf
_{(addicted to the Perl Programming Language and ☆☆☆☆ :)

Je suis Charlie!}

�) here it is Boyer-Moore_string_search_algorithm

[reply]
[d/l]

Re^2: Fast Way to find one array in second array

by melmoth (Acolyte) on Aug 14, 2016 at 18:45 UTC

right so if one string is: ABCDFG and the second is ABCDEFG then index will return -1 and I won't know that the first array represented by ABCDFG should be deleted. That would work if I were looking for paths that are segments of larger paths.

[reply]

Re^3: Fast Way to find one array in second array

by LanX (Saint) on Aug 14, 2016 at 18:52 UTC

edit

In this case using hash slices should do.

update

See Using hashes for set operations...

Cheers Rolf
_{(addicted to the Perl Programming Language and ☆☆☆☆ :)

Je suis Charlie!}

[reply]

Re^4: Fast Way to find one array in second array

by melmoth (Acolyte) on Aug 14, 2016 at 19:20 UTC

Re: Fast Way to find one array in second array
by Anonymous Monk on Aug 15, 2016 at 20:23 UTC

Across the "arrays", the problem size is around 160k. What about the other dimension? How many different "path elements"?

In case the dimension of path elements is large, then the (bit)vectors to represent your sets are going to be very sparse (with up to 12 elements chosen). See Comparing two arrays for a treatment of a similar problem.

[reply]


Perl-Sensitive Sunglasses
	PerlMonks