BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:
Given four arrays, each containing a sorted (ascending) sequence of non-contiguous, unique integers (unique within each array and across all arrays), search those arrays to find sequence(s) (one or more) of 4 contiguous values.
Example:
my @a1 = ( 100, 204, 312 ); my @a2 = ( 102, 313, 409 ); my @a3 = ( 205, 206, 315 ); my @a4 = ( 207, 210, 314 ); ## Two contiguous sequences of 4 values: 204 .. 207 & 312 .. 315
Restructure as AoA or HoA or any other way that retains set/position information.
Easy if you can just sort them all into a single array, but I need to retain knowledge of which set each value in each contiguous sequence comes from, and it's position?
The four sets could be quite large, but not so large that memory for 2 or 3 copies would be a problem.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Searching parallel arrays.
by jbert (Priest) on Dec 08, 2006 at 11:22 UTC | |
Sort them into a single array and retain knowledge of where they came from. i.e. use a hash (or even an array) to map number -> [ $ref_to_source_array, $index_in_source_array ]. Then you can look for a sequence in the sorted array and, for each number in the sequence, look up where it came from. No code, but it's straightforward I think. Update: If the memory consumption is excessive, you could encode an identifier for the source array and the index in a single scalar, e.g. "a1:1", "a3:4". You should then be using about 4 times your original memory, if that's too much you could destroy your original arrays and recreate them afterwards from the lookup info :-) | [reply] [d/l] |
by jbert (Priest) on Dec 08, 2006 at 11:52 UTC | |
which prints: Update: Limbic~Region++ noted that the sort in the above (more mediocre than I thought :-) code should be numeric, rather than the default sort. Update 2:And there's another bug, found by Not_a_Number, fixes below. | [reply] [d/l] [select] |
by Not_a_Number (Prior) on Dec 08, 2006 at 16:19 UTC | |
Not sure why, but I get no output from your code when I use these test data:
| [reply] [d/l] |
by jbert (Priest) on Dec 08, 2006 at 16:35 UTC | |
|
Re: Searching parallel arrays.
by grinder (Bishop) on Dec 08, 2006 at 14:17 UTC | |
Here's make take on the problem. I haven't looked at the other solutions, so it will be interesting to see how the others compare. Since the numbers are I then take the sequence numbers from an array, and look for sequences of four. When I find one, I see how far it can be extended whilst keeping the sequence contiguous.
(update: erm, and here's the output:)
• another intruder with the mooring in the heart of the Perl | [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Dec 08, 2006 at 21:53 UTC | |
Very neat++. Took me a while to grok the indexing, but it's really cool. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
Re: Searching parallel arrays.
by Melly (Chaplain) on Dec 08, 2006 at 11:34 UTC | |
Hmm... I'd be tempted to copy them to a single array, but as anon. arrays of value + source - e.g.:
Then you can sort on @all->[1] and still have access to the original array name (oh, and thinking about it, you could just create references to the original array elements, rather than copying the values). <update>mediocre minds think alike...</update> Tom Melly, pm@tomandlu.co.uk | [reply] [d/l] [select] |
|
Re: Searching parallel arrays.
by Corion (Patriarch) on Dec 08, 2006 at 14:09 UTC | |
I have the idea of implementing something like a modified Boyer-Moore search on your data structure, but your description and data don't match: The rule is each array contains a sorted (ascending) sequence of non-contiguous, unique integers, but my @a3 = ( 205, 206, 315 ); contains the two contiguous numbers 205 and 206. If the data is wrong and the description is right, my algorithm would be to I think this algorithm optimizes discarding non-matches over finding matches. | [reply] [d/l] |
by Not_a_Number (Prior) on Dec 08, 2006 at 15:46 UTC | |
I'm not sure I fully understand, but wouldn't that approach fail for a configuration like this:
| [reply] [d/l] |
by Corion (Patriarch) on Dec 08, 2006 at 15:48 UTC | |
Indeed - my approach will fail for such configurations, thanks for pointing that out! | [reply] |
by BrowserUk (Patriarch) on Dec 08, 2006 at 20:54 UTC | |
Corion++ You're right. My description was lacking in clarity. I was trying to indicate that any sequence of four values found would definitely be spread across 2 or more of the arrays. In this case, the example is king. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
Re: Searching parallel arrays.
by Limbic~Region (Chancellor) on Dec 08, 2006 at 13:47 UTC | |
For reasons I don't want to go into, I don't currently have the ability run or test any code so these are ideas you would have to implement yourself:
Idea 1:
Idea 2:
Cheers - L~R | [reply] |
by BrowserUk (Patriarch) on Dec 08, 2006 at 21:50 UTC | |
Both these are along the lines of what I was thinking of when I posted, but couldn't wrap my brain around the implementation. Per your /msg, I agree that as the individual arrays are already sorted, there is no need for the binary search, and so a merge algorithm with a small buffer can step through the arrays picking out the next highest value and tracking where it's got to in each array. In theory, this can be extended to handle any number of arrays, though this isn't (currently) a requirement, but my code below does do that. It struck me that this is probably an algorithm to which tye's Algorithm::Loops::loops code would lend itself, and I was hoping that someone who groks that, (which still defeats me for some reason?), would post some code. I understand your frustration at your current Perless state. The problem I had with implementing it revolved around how to detect the end of the main loop. Ie. How to detect when the indexes into the individual arrays had reached the end--across all of the arrays. I eventually came up with a grep in a scalar context solution that works, but I'm still not sure that there isn't a better way? I keep thinking about initialising the array indexes to $#array and stepping them backwards, so that I can use their transition to zero to detect the ends... Anyway, this is what I came up with so far. I'm not sure how it compares efficiency-wise with the sort them all together solutions, but it does avoid the need for a large AoA or AoCompositeTags. In theory it's O(n), but as we have both meditated on in the past, O(n) Perl code isn't necessarially quicker than O(n log n) C-via-builtins code. The tricky bits are: Update: Added fixes 1 & 2 found by Not a Number++ below
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
by Limbic~Region (Chancellor) on Dec 09, 2006 at 00:32 UTC | |
I haven't had a chance to optimize this or even to test it thoroughly but I wanted to share. It is commented but if you need further explanation, please don't hesitate. Read more... (3 kB)
I didn't use List::Util's reduce() because apparently it can handle multiple return values. The docs (Returns the result of the last call to BLOCK.) should probably be clarified on this point Cheers - L~R | [reply] [d/l] |
by BrowserUk (Patriarch) on Dec 09, 2006 at 02:12 UTC | |
by Limbic~Region (Chancellor) on Dec 09, 2006 at 02:42 UTC | |
by Not_a_Number (Prior) on Dec 09, 2006 at 19:29 UTC | |
Hmmm, here are some test cases where your code appears to fail.
I'm off out now, so I've got no time to look for a pattern (or to attempt to understand your code :-)... Quick update: Limbic~Region's code, above, gets these right. | [reply] [d/l] |
by BrowserUk (Patriarch) on Dec 09, 2006 at 22:20 UTC | |
|
Re: Searching parallel arrays.
by johngg (Canon) on Dec 08, 2006 at 14:10 UTC | |
The Dumper output is
I hope this is of use. Cheers, JohnGG Update: Removed superfluous variable $elemNo left over from development | [reply] [d/l] [select] |
|
Re: Searching parallel arrays.
by pajout (Curate) on Dec 08, 2006 at 15:13 UTC | |
| [reply] |
by BrowserUk (Patriarch) on Dec 08, 2006 at 19:27 UTC | |
Good question++. Longer sequences are not expected but could occur. They can be reported as a single long sequence, or as multiple sequences of 4, whichever is easier for the algorithm. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
by Not_a_Number (Prior) on Dec 08, 2006 at 20:34 UTC | |
Thanks for that clarification, but you still haven't answered Corion's equally good objection++ above, namely: The rule is each array contains a sorted (ascending) sequence of non-contiguous, unique integers, but my @a3 = ( 205, 206, 315 ); contains the two contiguous numbers 205 and 206. Should we be reasoning based on your descripition of the problem, or on your supplied dataset? | [reply] |
by pajout (Curate) on Dec 09, 2006 at 13:46 UTC | |
updated: output is
| [reply] [d/l] [select] |
|
Re: Searching parallel arrays. (just merge)
by tye (Sage) on Dec 09, 2006 at 06:37 UTC | |
I'd just do a standard merge but throw away the results as you go unless the next output is 1+previous.
- tye | [reply] [d/l] |
by BrowserUk (Patriarch) on Dec 09, 2006 at 09:48 UTC | |
I don't quite follow how you track which array to read from next (yet), but godamn it's fast. The following times yours, Limbic~Region's and mine. The data is 1 .. $N (minus anything with a 9 to stop it being a single contiguous sequence), randomly distributed across 4 arrays. The only tweaks I've made are to accumulate the results into an array rather than printing them out.
By way of comparison, I also tested one of the 'sort them all together' solutions (grinder's, with apologies to the other contributors!) against yours, though I had to drop $N to stop it from blowing the memory:
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
by tye (Sage) on Dec 09, 2006 at 16:35 UTC | |
@next is a linked list of the lists sorted by their first remaining item. So $p is the (index of the) list to pick from (with the smallest item). $next[$p] is the (index of the) next list, the one with the next smallest first remaining item. $next[$next[$p]] is the list after that. And -1 marks the end of the linked list. After processing one item, we move that list's index in the linked list by moving down the linked list until we find the spot where its new first remaining item falls between two other lists' first remaining items. The linked list means that we don't need to shift a whole block of array elements to do this "delete and re-insert later in the list" operation. Of course, if you used a plain array, then you could use a binary search for finding the next place to insert. And, since this is Perl, the fastest solution will likely be the one that involves the fewest opcodes instead of the one with the smallest "big O" complexity. If I were writing this in C, I'd use a heap; but those are rarely the fastest solution in Perl because slow opcode dispatch makes the more complex algorithm too heavy of a speed penalty -- you'd need a huge number of lists for a heap to pay off speed-wise in Perl. - tye | [reply] [d/l] [select] |
|
Re: Searching parallel arrays.
by NiJo (Friar) on Dec 08, 2006 at 20:35 UTC | |
The input phase reads the arrays to switch on the N-th bit of a vector. After a run length encoding you are looking for every 2nd value (that represents a block of 1's) >3 . Depending on the expected hit density it pays to loose the relation to the input array and 'grep' it again. | [reply] |
by BrowserUk (Patriarch) on Dec 08, 2006 at 21:14 UTC | |
That's a really interesting idea. In this case I'm not sure its workable as, unlike my example, the integers in question can potentially take the full (unsigned) range of values--they are in fact addresses--so the bit vectors could get pretty large. Even if I scale the values (the addresses will always be at least 4-byte aligned), the range is still potentially 2**29 on a 32-bit cpu, which would make for some pretty sparse bitvectors. RLE might help, but then you loose the advantage of being able to use bitstring logical operations on them. Later on I might be able to fursther reduce the range by excluding a large chunks at the bottom (and maybe top) of memory, and I might be able to rely on 12 or even 16 byte alignment. That would hugely reduce the size of the bitvectors and coud bring this into the realms of feasability. Thanks. If anyone knows of how to obtain the lower and upper bounds of the dynamic memory of the current Perl process, I'd love to hear from them. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
by NiJo (Friar) on Dec 09, 2006 at 19:35 UTC | |
http://www.perl.com/pub/a/2004/04/08/bloom_filters.html You need to loop through the arrays once to create the vector and a second time for the test [N-2..N+2]. | [reply] |
by BrowserUk (Patriarch) on Dec 10, 2006 at 02:33 UTC | |
|
Re: Searching parallel arrays.
by Anonymous Monk on Dec 11, 2006 at 10:35 UTC | |
| [reply] [d/l] |