PerlMonks  

Algorithm inspiration required.

by BrowserUk (Patriarch)
on Jun 18, 2018 at 11:16 UTC

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking to detect a repeating sequence in a stream of data.

That is, given an essentially endless source of data that is known to eventually repeat, detect the start and length of the repeating sequence.

In simplistic terms, I want to program the equivalent of the algorithm that solves this regex: $str =~ m[(.+).*?(\1)]g;

Except that the values aren't characters or anything easily translated to characters and I cannot see the entire stream at any given time.
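For concreteness, here is that regex doing its job on a toy string where the whole 'stream' is visible -- exactly the luxury the real problem denies me:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy demonstration only: the whole "stream" fits in one string,
# so the regex engine can see all of it at once.
my $str = 'xyzABCDEFqqABCDEFxyz';
my ( $rep, $at );
if ( $str =~ m[(.+).*?(\1)] ) {
    ( $rep, $at ) = ( $1, $-[1] );    # captured repeat, and where it starts
    print "repeat '$rep' first seen at offset $at\n";
}
```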

And the data stream and the length of the repeated sequence can be big. Very big. Likely bigger than I can hold in memory, so all of those Lempel-Ziv windowed and sliding-buffer compression algorithms go out of the window; as do any complex tree structures (trees/tries/graphs et al.).

The thought in my mind at the moment is (an adaptation of) the rolling checksum algorithm used by rsync; but I'm not seeing how to apply it to my problem.

Any thoughts on that, or pointers to any other algorithms that might be adaptable to this problem, gratefully received.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

Re: Algorithm inspiration required.
by hdb (Monsignor) on Jun 18, 2018 at 12:30 UTC

    Just some random thoughts and questions:

    • You do have the stream stored somehow, even if not in memory, ie you can replay it?
    • How large is your alphabet? Even if it is not characters are there 100s, 1000s or how many values?
    • How do you distinguish short from long repeated sequences? In other words, what is your minimum length for a successful repeat? There could be a repeat of 5 items, is that a success?
    • If you have a minimum length, try to do running checksums or such. Obviously, you would need to store position as well, otherwise you cannot go back and retrieve the sequence corresponding to that checksum.
    • Do you need perfect matches or is it ok if there are a few deviations? This might be used in the construction of the checksum.
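The running-checksum bullet can be made concrete with a minimal sketch; window length, base and modulus here are arbitrary illustrative choices, and a real version would verify the underlying data whenever two window hashes collide:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $W    = 4;            # window length
my $BASE = 257;
my $MOD  = 1_000_003;
my $POW  = 1;            # BASE**($W-1) % $MOD, for removing the oldest value
$POW = ( $POW * $BASE ) % $MOD for 2 .. $W;

my @stream = map { ord } split //, 'qqABCDqqABCDqqABCD';
my $h = 0;
my ( %seen, @hits );

for my $i ( 0 .. $#stream ) {
    $h = ( $h * $BASE + $stream[$i] ) % $MOD;
    next if $i < $W - 1;
    my $start = $i - $W + 1;                # window covers $start .. $i
    if ( exists $seen{$h} ) {
        push @hits, [ $seen{$h}, $start ];  # candidate repeat: same hash twice
    }
    else {
        $seen{$h} = $start;                 # remember where we first saw it
    }
    # slide the window: subtract the contribution of the oldest value
    $h = ( $h - $stream[$start] * $POW % $MOD + $MOD ) % $MOD;
}
print "window at $_->[0] recurs at $_->[1]\n" for @hits;
```

Storing a position alongside each checksum, as the bullet says, is what lets you go back and confirm a candidate against the real data.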

    Not very helpful, I am afraid.

      First, let me clarify (perhaps I should add this to the root node): I'm not trying to solve one specific problem right now, but rather I'm looking to come up with a generic solution (something I rarely try) to a class of problems I've encountered many times over the last 30-odd years. In most of those cases I've solved the immediate need using some more or less ad-hoc solution that allowed me to get back to the real problem I wanted to solve.

      When this came up again just recently, I again knocked up an ad-hoc solution based on one I've used before. I store the 'things' -- the 'values' that make up the sequence, which are generally more complex than a single value -- in a hash and allocate a number or letter to represent each. I then construct a string from those letters and numbers and use the regex engine to detect repeating sequences.
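A stripped-down sketch of that encode-then-regex trick; the values here are simple stand-ins, where real ones could be arbitrarily complex hash keys:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# Map each distinct value to a single character, then let the regex
# engine hunt for a repeated run in the encoded string.
my @values = ( 101, 202, 303, 101, 202, 303, 101, 202, 303 );

my %sym;
my $next = 'a';                 # 'a'..'z': only scales to a tiny alphabet
my $str  = join '', map { $sym{$_} //= $next++ } @values;

my ($run) = $str =~ m[(.+).*?\1];
say "encoded: $str  repeating run: $run" if defined $run;
```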

      That works (if done with care) for streams that consist of 256 values or fewer. It might be adaptable to larger numbers of values by using Unicode 'characters'; but there are several problems with this approach:

      • Unicode!

        Just trying to visually inspect (debug, trace) Unicode-encoded data is a pain. Even if you find a font that can handle all possible values, trying to 'see' stuff when you might have a mix of double-wide Kanji, Urdu, Cyrillic, runes etc. will give you a headache fast. And then there's trying to get to grips with the behemoth that is the "Unicode API". Pointless.

      • Memory.

        The need to store the entire dataset in memory to be able to run regex on it.

      • Control.

        Once you set the regex engine in motion, you lose all control until it stops. And sometimes it never stops; or would take so long it amounts to the same thing.

      • It is still limited to somewhat less than 2^31 values, which wouldn't be big enough for some problems.

      So, when the problem came up again, I started to think about the possibilities for a 'generic' solution.

      1. You do have the stream stored somehow, even if not in memory, ie you can replay it?

        It would certainly be possible to store the sequence to disk as it is read.

      2. How large is your alphabet? Even if it is not characters are there 100s, 1000s or how many values?

        It depends upon the specific problem, but for the latest potential application I encountered for something like this, the 'alphabet' would need to be at least 2^32; with 2^48 an ultimate goal (for now).

        As soon as you find a longer repeat sequence, any shorter ones are of no interest.

        How do you decide when you have found the longest possible sequence? I'm not yet sure that you ever can!

        For example, if you get pqrABCDExyzABCDEpqr then you might conclude that 'ABCDExyzABCDEpqr' is probably the longest sequence and stop.

        And if you went on to get pqrABCDExyzABCDEpqrABCDExyzABCDEpqr then the odds are strongly in your favour;

        but if you continued, you might get pqrABCDExyzABCDEpqrABCDExyzABCDEpqrABCDExyzABCDEpqrABCDExyzABCDEpqrs which would mean that this is part of a still larger sequence.

        At some point, you have to decide how long you are prepared to keep looking.

      3. How do you distinguish short from long repeated sequences? In other words, what is your minimum length for a successful repeat? There could be a repeat of 5 items, is that a success?

        No minimum length. If a longer repeat is there, shorter ones have no purpose.

      4. If you have a minimum length, try to do running checksums or such. Obviously, you would need to store position as well, otherwise you cannot go back and retrieve the sequence corresponding to that checksum.

        If the sequence repeats endlessly (as it has for the applications I've encountered where this would be useful) then having determined the checksum for the repeat sequence, you can just read on and store until you've stored the full repeat.

      5. Do you need perfect matches or is it ok if there are a few deviations? This might be used in the construction of the checksum.

        Perfect matches are a must; the whole idea doesn't make sense otherwise I think.

      For most of the applications I've encountered, the sequence consists entirely of full repeats with no intervening junk. So if you find a repeat that isn't immediately followed by the characters at its start -- i.e. in ...abcde*** if '...' <> '***' then you aren't done.

      But just because '...' == '***', it doesn't mean you are done. It might be: ...ABCDE***ABCDEF... or ...ABCDE***ABCDE...PQR...ABCDE***ABCDE...ABCDE***ABCDE...PQR...ABCDE***ABCDE


Re: Algorithm inspiration required. (Further thoughts on rolling checksums)
by BrowserUk (Patriarch) on Jun 18, 2018 at 12:18 UTC

    Imagine the repeat sequence we are trying to detect is:

    Abcdeabcdefabcdefg

    So the input stream will look something like:

    ...AbcdeabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefg...

    but we may start sampling at any point. eg.

    deabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefg
    1 2 3 4 5 6
    • We won't see the start of a full sequence repeat until 2, but we won't have enough information to know that.
    • We will have seen a full-but-disjoint sequence at 3, but will not have enough information to know that.
    • We will have seen a full repeat -- start to end -- at point 4, but probably won't have enough info to know that?
    • At point 5 we will have seen two, full-but-disjoint repeats; we should be able to detect this?
    • At point 6 we will have seen two full, complete repeats and we should definitely be able to detect this.

    If we maintain rolling checksums, then:

    • At point 3 we will have a checksum for a full-but-disjoint sequence, but will not be able to match it until point 5.
    • At point 4 we will have a checksum for a full sequence, but will not be able to match it until point 6.

    But this would require us to store checksums covering every start-point and every end-point, for 2 full-but-disjoint repeats. That's N^2 checksums, which is more expensive than storing the data.

    Unless there is some way to determine that we can discard those checksums that cannot be the one we want.

    I.e. if, when we read the 3rd value, we calculate the 2-value checksum for that value and its immediate predecessor and it doesn't match the checksum for the first two values, we know that our sequence must be longer than 2, and can discard both 2-value checksums.

    Ditto, when we read the fourth value, and calculate the checksum for the last 3 values and it does not match that of the first 3, we can discard both.

    But, when we calculated the 3-value checksum having read the fourth value, we needed the 2-value checksums; and when we come to calculate the 4-value checksums having read the 5th value, we need the 3-value checksums.

    So the discarding of earlier checksums has to be deferred by one iteration cycle.
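One way to make the grow-and-discard loop concrete is to compare a checksum of the first k values against one of the *next* k values -- a direct period-k test, a slight rearrangement of the scheme above -- growing k until both the checksums and (on a checksum hit) the actual data agree. A minimal in-memory sketch, with an additive checksum standing in for something stronger:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

my @v = map { ord } split //, 'abcdXabcdXabcdXabcdX';    # period 5
my $period;

for my $k ( 1 .. int( @v / 2 ) ) {
    # cheap test first: checksum of v[0..k-1] vs checksum of v[k..2k-1]
    next if sum( @v[ 0 .. $k - 1 ] ) != sum( @v[ $k .. 2 * $k - 1 ] );

    # checksum hit: verify against the data before trusting it
    if ( "@v[ 0 .. $k - 1 ]" eq "@v[ $k .. 2 * $k - 1 ]" ) {
        $period = $k;
        last;
    }
}
print "candidate period: $period\n" if defined $period;
```

Both checksums can be maintained incrementally as k grows, so each step is O(1) apart from the rare verification on a hit.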

      Your regex:  $str =~ m[(.+).*?(\1)]g would accept "abcd" as a repeating pattern in "abcdefabcd", but your example shows no extra data between the repetitions (like in your similar question here). Which interpretation is correct?

      If your data is only one pattern repeated without intermediate data, any subsequence in that pattern can be interpreted as the start of the pattern (eg, in your example above it would be deab, rather than Abcd). You could keep the checksums for the value, or subpattern, with the lower occurrence. Eg, 'A' has a ratio of 1/18, so only keeping a checksum for every instance of 'A' would divide the number of checksums to keep in memory by 18. (Ie: that's trying to fit Boyer-Moore as much as possible into your problem)

      Also, if you have a good idea of the ratio for each value beforehand, or can preprocess the input data, n-ary Huffman coding might reduce the size greatly, while choosing n = 16 (with hex representation) or 256 would still keep value aligned on a byte (so convenient for processing).

      Edited my first link, I wanted to link to the thread, not my post.

        Your regex: $str =~ m[(.+).*?(\1)]g would accept "abcd" as a repeating pattern in "abcdefabcd", but your example shows no extra data between the repetitions (like in your similar question here). Which interpretation is correct?

        (For now :) ) NO intervening junk is the correct interpretation. (The regex is a throw-back to early attempts to solve similar problems non-generically.)

        (And yes; that earlier node is one of the many times I've encountered the find-repeats problem.)

        If your data is only one pattern repeated without intermediate data, any subsequence in that pattern can be interpreted as the start of the pattern (eg, in your example above it would be deab, rather than Abcd).

        Mmmm. Yes, but is that useful to know?

        What I mean is, wherever you start in the sequence can be considered the start of the repeat, and is in fact indistinguishable from it. That is, unless you happened to witness the start of the sequence or the end -- and know you had done so -- any possible start point in a cyclic sequence is also an end point.

        Please note: I'm not saying you're wrong, only that I'm not sure how useful it is. My nomenclature above was just me trying to describe the problem.

        You could keep the checksums for the value, or subpattern, with the lower occurrence. Eg, 'A' has a ratio of 1/18, so only keeping a checksum for every instance of 'A' would divide the number of checksums to keep in memory by 18. (Ie: that's trying to fit Boyer-Moore as much as possible into your problem)

        I'm not sure what you mean by "'A' has a ratio of 1/18". Ratio of what to what? The penny dropped! That's actually a really good idea :)

        I could (for example), sample some number of values and choose the value with the lowest occurrence as my anchor point.

        Or take it further, and count pairs of values, and use the lowest occurrence.

        Or triples ....

        But where is the ultimate limit to this idea?

        Accumulate counts of single values until you have one that is less than any other; then discard all the others and start counting occurrences of that character paired with other characters until one pair distinguish themselves as being lesser occurring; then discard the other counts and start counting that pair tripled with other characters...

        I'm not sure where that idea goes?

        Of course, as soon as you've decided that 'A' occurs least often and discarded the other counts, Sod's Law says that the next N characters in the stream will all be 'A's until it has become the most frequent. But then, any sequence of just 'A's would be a pretty good anchor itself, so Sod's Law says that it won't be followed by a sequence of contiguous 'A's, but rather a sequence of 'A.'s, where '.' represents every other value in the stream!

        Hm. More thought required! Later :)
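A rough sketch of the anchor idea, assuming we can afford to count a sample first (toy data; the tie-break on equal counts is arbitrary):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @stream = map { ord } split //, 'bananaXbananaXbananaXbananaX';

# 1) count occurrences in a sample (here: the whole toy stream)
my %count;
$count{$_}++ for @stream;

# 2) pick the rarest value as the anchor (ties broken numerically)
my ($anchor) = sort { $count{$a} <=> $count{$b} || $a <=> $b } keys %count;

# 3) only remember positions where the anchor occurs
my @pos = grep { $stream[$_] == $anchor } 0 .. $#stream;

# 4) for a pure repeat, the gaps between anchor positions expose the period
my %gap;
$gap{ $pos[$_] - $pos[ $_ - 1 ] }++ for 1 .. $#pos;
my ($period) = sort { $gap{$b} <=> $gap{$a} } keys %gap;

print "anchor ", chr($anchor), " at @pos; likely period $period\n";
```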


Re: Algorithm inspiration required.
by talexb (Chancellor) on Jun 18, 2018 at 12:30 UTC

    Take smaller bites of your data, and checksum those bites. Then see if you can see a pattern in at least some of those bites.

    Say you take four-byte checksums at every byte, and the pattern is fourteen bytes long and repeats every 30 bytes. You start collecting at offsets 0, 1, 2 .. and collect checksums for those spots. By the time you get to offset 30, you'll start to see the same checksums. Now, only three of these will match, but that positive result would be enough to backtrack to 0 and compare the stream byte by byte from there and from 30.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      The problem is that the sequence being sought could be huge -- bigger than available memory -- and retaining all the checksums for every position will require even more memory than the sequence itself. So the problem becomes: how to recognise which checksums must be retained and which can be discarded.


Re: Algorithm inspiration required.
by bliako (Monsignor) on Jun 18, 2018 at 13:36 UTC

    Make a Finite State Automaton (what a mouthful) as a chain of nodes where, in this particular case, each node can only move to the next node in the chain. The length of the chain is equal to the length of the pattern you try to detect. Also, the i-th node knows about the i-th data point (DP) in your pattern, in order to compare it with whatever data it is holding and respond if they are the same.

    Generally this is how it works:

    The first DP in the stream enters the 1st node. That node checks if the entered DP is the same as the DP from your pattern that it knows about. Depending on yes/no, it turns a bit on/off in a global message buffer (the buffer is of length equal to your pattern's). Then it passes the DP to the next node of the chain and loads the next DP from the stream.

    As a new DP is extracted from the stream, it enters the 1st node; that node does the check and propagates the DP to the second node. The second node does the same, and so on until the DP comes out from the last node.

    In the meantime, someone keeps an eye on the message buffer and if/when all bits are on, you got a pattern detected.

    Sometimes you will get 90% of the bits on; this is when a partial pattern match occurred. Set a minimum match percentage to trigger a response.

    I think you are right to mention regex, I have just described how a regex works.

    If you go that way then of course there are plenty of modules on CPAN; see Finite State Automata / Transducers.

    Forgot to say, this is also common in electronics, where data moves on a clock pulse. Do not do any asynchronous message-buffer checking; instead follow a clock. Data moves, bits are flipped, and the message buffer is checked on each new clock pulse.

    hope that helps, bliako
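For a pattern that is known in advance, the node chain described above collapses into the classic shift-and (bitap) matcher: bit i of a state word is on exactly when node i has just seen a match, and every node updates in a single shift per incoming data point. A sketch, with an illustrative alphabet and pattern:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

my @pattern = ( 0, 1, 2, 3, 4, 5 );

# For each value, a bitmask of the pattern slots holding that value.
my %mask;
$mask{ $pattern[$_] } |= 1 << $_ for 0 .. $#pattern;
my $accept = 1 << $#pattern;                  # the last node's bit

my @stream = ( 9, 0, 1, 0, 1, 2, 3, 4, 5, 7 );
my ( $state, $found_at ) = ( 0, undef );

for my $i ( 0 .. $#stream ) {
    # shift every node's bit along the chain, admit a new attempt at
    # node 0, and keep only the nodes whose value matched
    $state = ( ( $state << 1 ) | 1 ) & ( $mask{ $stream[$i] } // 0 );
    if ( $state & $accept ) { $found_at = $i; last }
}
print "pattern ends at stream index $found_at\n" if defined $found_at;
```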

      This is not suitable for this problem because we have an "infinite" and unknown pattern to detect. The above is useful only when the pattern to detect is known in advance.

Re: Algorithm inspiration required.
by kennethk (Abbot) on Jun 18, 2018 at 21:40 UTC
    So here's my thinking. What you are really trying to figure out is how long your sequence is; the concept of start is arbitrary. Therefore, what you should be doing is a rolling bitwise xor; whenever your checksum is zero, you have potentially passed through two cycles. I don't know what your alphabet looks like, but presumably you could do better than ASCII encoding to lower your collision rate.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use 5.10.0;

    my $seq = 'deabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefg';
    my %val = do { my $cnt = 0; map { $_ => ++$cnt } split //, 'Aabcdefg' };
    my $xor  = 0;
    my $loop = 0;
    for ( split //, $seq ) {
        $loop++;
        $xor ^= $val{$_};
        say $loop if !$xor;
    }
    outputs
    9 36 45 72 81
    Note that the only candidate that has both 2*x and 4*x in the list is the correct answer, 18. Setting the %val hash to one-hots instead of consecutive integers gets rid of the noise, but also chews up your bit vector pretty quickly for large alphabets.
    my %val = do { my $cnt = 0; map { $_ => 2**(++$cnt) } split //, 'Aabcdefg' };

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: Algorithm inspiration required.
by hexcoder (Curate) on Jun 19, 2018 at 13:20 UTC
    Hi,

    the best method I know of is using a suffix array (and longest common prefix array) with external memory. Here is a link to a research paper and the website. According to the paper it can handle a text size of 80 GiB using only 4 GiB of RAM. Since endless streams may contain infinite information, this algorithm cannot handle that. It only can work on a snapshot of the stream limited by memory (internal and external).

    I would be surprised if that could be beaten by a different algorithm.

    hexcoder

      Thank you for this.

      As a first pass (of what is perhaps the most opaque algorithm description I've yet encountered), this does far more work than I see being necessary for my particular problem; as such I think it would be necessary to adapt it quite heavily to only do that which I need.

      My, currently fragile, implementation of the running checksum algorithm works (so far tested on a very limited set of cases):

      [15:15:42.17] C:\test>rollingChksum -MAX=1e6 -BUF=1e3 -WRAP=9.123e5 6891 910000
      Found repeat starting at value:912300
      [15:15:48.65] C:\test>
      [15:16:17.32] C:\test>rollingChksum -MAX=1e7 -BUF=1e3 -WRAP=9.123e6 6891 9120000
      Found repeat starting at value:9123000
      [15:17:22.49] C:\test>

      Needs a lot of work, and a lot more testing, but it is currently processing around 6M values (16-bit values, not bytes) per second; and I know that can be sped up by at least 10 times by optimising the Perl; and much further still once I recode the algorithm into C.


        ok, I see checksumming works quite well. I had expected more false positives, because different sequences can lead to the same checksum. The suffix array method (from bioinformatics) described above is, OTOH, exact, but admittedly more difficult to implement efficiently. And if you periodically need to throw out old values and feed new values in, that is a requirement for the algorithm that is probably not yet implemented.
Re: Algorithm inspiration required.
by pryrt (Abbot) on Jun 18, 2018 at 22:08 UTC

    My thought process on this: assume your stream was the lyrics for the Twelve Days of Christmas, on infinite repeat. You come in on "four calling birds" -- at this point, you don't know whether you've just started the Fourth Day, or whether you've nearly finished the Twelfth Day... and I believe (like the others have mentioned) that it doesn't matter whether it's the "true" start, since that's rather arbitrary. (If it does matter, once you've found a huge repeating sequence, you can spawn something else to find the official "start" of that huge sequence). At some point, you know you'll get back to the _same_ "four calling birds" in the next repeat of the song, rather than just one of the nine from the fourth to twelfth day.

    Using a hash function that easily allows streamed input (rather than requiring the whole file up front: I think most of the SHA family and related accept a stream, though I don't remember off the top of my head whether most of the Perl implementations do), start hashing from wherever you came in as $huge_hash; also, keep a count of the number of bytes hashed in $huge_hash as $huge_bytes. After you've grabbed the first phrase (in this instance, I picked 18 characters), use that as your $trigger. If you find this $trigger again, you _might_ have started over; if you haven't found this trigger, you know you haven't started over yet.

    So, in my example, you continue reading the stream: "... on the fifth day... four calling birds".... and you've found the $trigger: set $archive_hash = $huge_hash; $archive_bytes = $huge_bytes; start a $repeat_hash from this instance of $trigger, and count $repeat_bytes, but continue with the $huge_hash and $huge_bytes (just in case you haven't found the biggest length yet). When you get to "... the sixth day ... four calling birds", you've found $trigger again, but your $repeat_hash and $repeat_bytes won't match the $archive_hash and $archive_bytes. Thus, you'll know that you haven't really found a repeating sequence. Set $archive_hash = $huge_hash; $archive_bytes = $huge_bytes; reset $repeat_hash and $repeat_bytes starting on this instance of $trigger, and continue.

    At some point, you might want to decide "there have been too many false triggers, so I want a longer $trigger to avoid false triggers": maybe double the length of the trigger to look for. Unfortunately, we've lost the initial trigger information. Since you know you are always somewhere within the largest repeated sequence, and you don't (yet) care what the official "start" is, just start over from the beginning: reset $huge_hash from here, clear the $archive_hash, and set $trigger to 36 bytes instead of 18 bytes (for example).

    Eventually, you'll hit a trigger that really is the start of the next time through, so your hashes and lengths will match.
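A toy, in-memory rendering of this scheme; a streaming version would feed Digest::SHA->add() value by value, and the phrase and lengths here are made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

my $stream  = 'FourCallingBirds-FiveGoldRings-' x 4;    # repeat length 31
my $tlen    = 8;
my $trigger = substr $stream, 0, $tlen;                 # first phrase seen

# every sighting of the trigger is a *potential* start-of-repeat
my @starts;
for my $pos ( 0 .. length($stream) - $tlen ) {
    push @starts, $pos if substr( $stream, $pos, $tlen ) eq $trigger;
}

# two successive trigger-to-trigger segments with the same digest and
# the same length signal a genuine full repeat
my $period;
for my $i ( 2 .. $#starts ) {
    my $a = substr $stream, $starts[ $i - 2 ], $starts[ $i - 1 ] - $starts[ $i - 2 ];
    my $b = substr $stream, $starts[ $i - 1 ], $starts[ $i ] - $starts[ $i - 1 ];
    if ( length($a) == length($b) && sha256_hex($a) eq sha256_hex($b) ) {
        $period = length $b;
        last;
    }
}
print "repeat length: $period\n" if defined $period;
```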

    Another sequence that you might want to experiment with (assume all whitespace equal -- and could be collapsed to no whitespace at all):

    1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 12345678 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 12345678 123456789 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 
1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 12345678 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 1 1 12 1 1 12 123 1 1 12 1 1 12 123 1234 1 1 12 1 1 12 123 1 1 12 1 1 +12 123 1234 12345 123456 1234567 12345678 123456789 1234567890

    Then try starting randomly in that sequence somewhere, and see if the algorithm I presented (or someone else presented) works.

    (rereading some of Eily's suggestions, I think maybe this is the same idea... but I'm not sure I fully understood, so maybe mine's a little different. It's presented differently, either way, so maybe will help or will trigger a better thought process in you or someone else.)

    edit 1: rephrase first paragraph for clarity; also, fix the spelling of "twelfth"

Proposed algorithm for the 'simple' version of the problem (Verification sought.)
by BrowserUk (Patriarch) on Jun 19, 2018 at 00:12 UTC

    For the simple version (sequence endlessly repeats with no intervening junk).

    If I read a (manageable) sized buffer of data from the stream, there are 3 possibilities:

    1. The entire sequence is exactly contained within the buffer, with no excess. (Unlikely, but needs to be considered.) [start...end]start...endstart...

      This is easily discounted by checking the next value from the stream. Or verified by checking the next <buffersize> values from the stream.

      Of course, it is possible that the buffer contents are an exact copy of a repeating subsequence within the larger repeating sequence, but for a buffer of anything over (say) 1000 values, this is so unlikely as to be next to impossible.

    2. The entire sequence is contained within the buffer, but it does not constitute the entire contents of the buffer.

      This possibility has 3 sub-possibilities:

      1. [start...endstart...]...endstart...endstart...

        A sequence of values at the end of the buffer exactly matches those at the start. Easily detected.

      2. [...endstart...end]start...endstart...

        A sequence of values at the start of the buffer matches a sequence ending at the end of the buffer.

      3. [...endstart...endstart...]...endstart...endstart...

        As any position in an endlessly repeating sequence constitutes a possible start position, this means I would have at least two full repeats, plus a partial at the end that would match the start of the buffer.

        This can be differentiated from 2.1) above by checking whether the residual appears twice in the buffer.

    3. The entire buffer constitutes only a part of the repeat. (This is likely to be the common case.) start...[...]...end

      But of course, as any point in the sequence can be considered a start, this is really just [start...]...endstart...endstart...end

    So: I choose a manageable number of values (say 100) at the start of the buffer as my anchor, and calculate a checksum for those 100 values (CHK100START).

    Then calculate a running checksum of the full buffer(CHKFULL).

    Now read from the stream, and maintain a 100 value FIFO.

    Calculate a 100 value running checksum (CHK100RUNNING) as I put the values into the FIFO; And add the value's distinctiveness to the collective (CHKFULL) as I discard them.

    Whenever CHK100RUNNING matches CHK100START, I have potentially found the start of the next repeat; and the value of CHKFULL (lagging CHK100RUNNING by 100) represents the checksum for the full sequence.

    At this point, I save a copy of CHKFULL, and start a new checksum (CHKFULL2) from the values exiting the FIFO.

    Now I read on again until CHK100RUNNING once again matches CHK100START.

    At this point, if CHKFULL2 == CHKFULL then (with a very high degree of probability*) I have traversed the full sequence twice.

    If not, then the CHK100* match is a false one -- quite possible with a fast running checksum algorithm -- so I add CHKFULL2 back to CHKFULL, and continue through the stream trying to match CHK100START.

    *To validate the highly probable, but still not 100% certain, conclusion, I read the (now known) length of the potential sequence twice more, writing each copy to a separate file and then compare the files.
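A toy, in-memory walk-through of the above: an anchor of 4 values instead of 100, additive checksums, an array standing in for the stream, and no final re-read/verify step:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

my @stream = ( map { ord } split //, 'abcdefghij' ) x 6;    # period 10

my $A          = 4;                              # anchor length
my $chk_anchor = sum @stream[ 0 .. $A - 1 ];     # CHK100START analogue

my @fifo;
my ( $chk_run, $chk_full, $chk_full2 ) = ( 0, 0, 0 );
my ( $saved_full, $last_hit, $period );

for my $i ( 0 .. $#stream ) {
    push @fifo, $stream[$i];
    $chk_run += $stream[$i];
    if ( @fifo > $A ) {
        my $old = shift @fifo;
        $chk_run   -= $old;            # CHK100RUNNING slides along
        $chk_full  += $old;            # values leaving the FIFO feed CHKFULL
        $chk_full2 += $old if defined $saved_full;
    }
    next unless @fifo == $A && $i > $A - 1;    # ignore the anchor matching itself
    next unless $chk_run == $chk_anchor;       # potential start of next repeat

    if ( defined $saved_full && $chk_full2 == $saved_full ) {
        $period = $i - $last_hit;              # traversed the full sequence twice
        last;
    }
    # first (or false) hit: save CHKFULL and restart CHKFULL2 from here
    ( $saved_full, $chk_full2, $last_hit ) = ( $chk_full, 0, $i );
}
print "repeat length: $period\n" if defined $period;
```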


Re: Algorithm inspiration required.
by bliako (Monsignor) on Jun 18, 2018 at 16:26 UTC

    This is my take on solving this (if I understand the spec right, that is), using a finite state automaton to look for a specific arbitrary-length sequence of characters from a stream. It does not need to store any data at all except the string to look for and an array of bits of the same length.

    As an example, the input stream is a seemingly infinite sequence of random numbers between 0 and 9. The sequence to detect is '0,1,2,3,4,5'.

    For a seed of '123' this is encountered on the 151436th rand() call!

    ./detector.pl
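    (Not bliako's actual code, which isn't reproduced here -- this is a sketch of the same idea: a KMP-style automaton that scans a stream for a fixed sequence while keeping only the pattern, its failure table, and a single integer of match state between values. The short hard-coded @stream stands in for the random source so the result is deterministic.)

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my @pat = ( 0, 1, 2, 3, 4, 5 );    # sequence to detect

    # Precompute the KMP failure function: for each prefix of @pat, the
    # length of the longest proper prefix that is also a suffix.
    my @fail = (0) x @pat;
    my $k = 0;
    for my $i ( 1 .. $#pat ) {
        $k = $fail[ $k - 1 ] while $k > 0 && $pat[$i] != $pat[$k];
        $k++ if $pat[$i] == $pat[$k];
        $fail[$i] = $k;
    }

    # A short deterministic stand-in for the random stream.
    my @stream = ( 9, 0, 1, 0, 1, 2, 3, 4, 5, 7 );

    my $state    = 0;    # number of pattern values matched so far
    my $found_at = 0;
    for my $n ( 1 .. @stream ) {
        my $v = $stream[ $n - 1 ];
        $state = $fail[ $state - 1 ] while $state > 0 && $v != $pat[$state];
        $state++ if $v == $pat[$state];
        if ( $state == @pat ) {
            $found_at = $n;    # pattern ends at the $n-th value read
            last;
        }
    }
    print "pattern ends at stream position $found_at\n";    # prints 9
    ```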

      Sorry, but I think you may have misunderstood the problem. Or perhaps I am misunderstanding your code.

      What you appear to be doing is searching the stream for a known 5-value sequence, using a FSM that uses 24,664 bytes of memory to represent those 5 characters and an extra 1056 bytes of memory for every extra byte; and takes (on my machine) 202 seconds to locate those 5 bytes 4,312,277 bytes into the stream.

      The actual problem is that the sequence I am looking for is in the stream, and can be huge; as in greater than memory. eg 100GB.

      For the 'simple' version of the problem, the sequence to find starts at the beginning of the stream, but extends to a size greater than memory, and once complete, will repeat immediately, (and obviously extend as far again). The problem is how to retain enough information about a huge, and ever growing sequence to recognise when you have seen it twice.

      It is only once you have seen it twice that it is possible to know how long the sought-after sequence is.

      The more complex version can have additional junk existing between occurrences of the repetition(s), and you don't know whether the first characters you see constitute a part of the repetition, or junk. (But that's for another day :) ).



        I COMPLETELY misunderstood what you wanted to do; my take is for detecting a finite, known pattern. Sorry.

Re: Algorithm inspiration required.
by taint (Chaplain) on Jun 19, 2018 at 17:05 UTC
    OK, seems I'm coming to this party late.

    But as I ponder this problem, it feels like a fraction -- an attempt to find the lowest common denominator. So in an attempt to make the problem even easier (more simplistic), it occurred to me that the easiest (least complex) solution could probably be found by converting said stream to binary. I couldn't quite gather from your OP whether the data was strictly limited to alphanumeric, or not. This solved that for me. IMHO, and off the top of my pointy head: only needing to deal with exactly 2 known characters should make finding a pattern near trivial, and should even reduce the record size you would need to maintain to find that pattern.

    Again, these are just my initial thoughts on the matter, and while I would normally have waited until I had produced a workable example, I thought I might throw this out there, in case I was actually on to something and it lit a light in someone else's head (maybe yours) before mine. :-)
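    For what it's worth, Perl's unpack makes the bit view cheap to produce. A minimal sketch (the sample string is just an illustration); note the trade-off: the alphabet shrinks to exactly {0,1}, but a period of N bytes becomes 8*N bit positions:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # View any byte stream chunk as a string of '0'/'1' characters.
    my $chunk = 'abcabc';
    my $bits  = unpack 'B*', $chunk;    # MSB-first bits of each byte
    print "$bits\n";
    # 'a' is 0x61 = 01100001, so $bits begins "01100001" and is 48 bits long
    ```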

    --Chris

    Evil is good, for without it, Good would have no value
    ¡λɐp ʇɑəɹ⅁ ɐ əʌɐɥ puɐ ʻꜱdləɥ ꜱᴉɥʇ ədoH

Node Type: perlquestion [id://1216847]
Approved by Discipulus
Front-paged by haukex