Re: Algorithm inspiration required. (Further thoughts on rolling checksums)

by BrowserUk (Patriarch)
on Jun 18, 2018 at 12:18 UTC


in reply to Algorithm inspiration required.

Imagine the repeat sequence we are trying to detect is:

Abcdeabcdefabcdefg

So the input stream will look something like:

...AbcdeabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefg...

but we may start sampling at any point. eg.

deabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefg...
1              2  3              4  5              6
  • We won't see the start of a full sequence repeat until point 2, but we won't have enough information to know that.
  • We will have seen a full-but-disjoint sequence at point 3, but will not have enough information to know that.
  • We will have seen a full repeat -- start to end -- at point 4, but probably won't have enough info to know that?
  • At point 5 we will have seen two full-but-disjoint repeats; we should be able to detect this?
  • At point 6 we will have seen two full, complete repeats, and we should definitely be able to detect this.

If we maintain rolling checksums, then:

  • At point 3 we will have a checksum for a full-but-disjoint sequence, but will not be able to match it until point 5.
  • At point 4 we will have a checksum for a full sequence, but will not be able to match it until point 6.
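For concreteness, here is a minimal sketch of what "maintaining a rolling checksum" means, using a Rabin-Karp-style polynomial hash in Python (the constants BASE and MOD are arbitrary choices of mine, not anything from the thread): sliding a fixed-width window one value to the right is an O(1) update.

```python
# Rabin-Karp-style rolling checksum: the checksum of a window can be
# updated in O(1) when the window slides by one value.
BASE = 257
MOD = (1 << 61) - 1          # a large prime keeps collisions unlikely

def checksum_of(values):
    """Checksum of a whole window, computed from scratch: O(len)."""
    c = 0
    for v in values:
        c = (c * BASE + v) % MOD
    return c

def roll(csum, outgoing, incoming, top):
    """Slide the window right by one: drop `outgoing`, append `incoming`.
    `top` is pow(BASE, width - 1, MOD), precomputed once per width."""
    csum = (csum - outgoing * top) % MOD
    return (csum * BASE + incoming) % MOD

data = [ord(ch) for ch in "deabcdefabcdefgAbcde"]
width = 5
top = pow(BASE, width - 1, MOD)
c = checksum_of(data[:width])
for i in range(width, len(data)):
    c = roll(c, data[i - width], data[i], top)
# c now equals the checksum of the last `width` values
```

The catch discussed below is not the per-window cost but how many such windows (start-point/width pairs) have to be kept alive at once.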

But this would require us to store checksums covering every start-point and every end-point across 2 full-but-disjoint repeats. That's O(N^2) checksums, which is more expensive than storing the data itself.

Unless there is some way to determine that we can discard those checksums that cannot be the one we want.

Ie. when we read the 3rd value and calculate the 2-value checksum for that value and its immediate predecessor, and it doesn't match the checksum for the first two values, we know that our sequence must be longer than 2, and can discard both 2-value checksums.

Ditto when we read the fourth value: if the checksum for the last 3 values does not match that of the first 3, we can discard both.

But, when we calculated the 3-value checksum having read the fourth value, we needed the 2-value checksums; and when we come to calculate the 4-value checksums having read the 5th value, we need the 3-value checksums.

So the discarding of earlier checksums has to be deferred by one iteration cycle.
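That one-cycle deferral can be seen directly in how a polynomial checksum grows: the width-(k+1) checksum over the same start point is a one-step extension of the width-k checksum, so the old checksum is needed for exactly one more iteration after it fails to match. A tiny Python illustration (the checksum scheme is my assumption, not from the post):

```python
BASE, MOD = 257, (1 << 61) - 1

def extend(csum, value):
    """Extend a checksum by one value to the right: O(1)."""
    return (csum * BASE + value) % MOD

vals = [7, 3, 9, 4]
c2 = extend(extend(0, vals[0]), vals[1])   # checksum of the first 2 values
c3 = extend(c2, vals[2])                   # first 3 values, built FROM c2
# c2 can only be discarded once c3 has been derived from it -- hence the
# one-cycle deferral described above.
```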

Re^2: Algorithm inspiration required. (Further thoughts on rolling checksums)
by Eily (Monsignor) on Jun 18, 2018 at 13:58 UTC

    Your regex:  $str =~ m[(.+).*?(\1)]g would accept "abcd" as a repeating pattern in "abcdefabcd", but your example shows no extra data between the repetitions (like in your similar question here). Which interpretation is correct?
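The two readings can be checked mechanically: Python's re engine backtracks the same way as Perl does for this pattern, so a quick sketch (string chosen to exercise the intervening-junk case):

```python
import re

# Greedy (.+) shrinks until a later copy of the capture is found, so the
# pattern happily tolerates junk between the two copies.
m = re.search(r'(.+).*?(\1)', 'abcdefabcd')
# m.group(1) is 'abcd': the regex accepts "abcd", "ef", "abcd"
```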

    If your data is only one pattern repeated without intermediate data, any subsequence in that pattern can be interpreted as the start of the pattern (eg, in your example above it would be deab, rather than Abcd). You could keep the checksums for the value, or subpattern, with the lowest occurrence. Eg, 'A' has a ratio of 1/18, so only keeping a checksum for every instance of 'A' would divide the number of checksums to keep in memory by 18. (Ie: that's trying to fit Boyer-Moore as much as possible into your problem)
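A Python sketch of this anchor idea (the names are mine, not Eily's): sample the stream, pick the rarest value, and only record positions at its occurrences. In this toy stream the gap between anchor positions is already the repeat length.

```python
from collections import Counter

def pick_anchor(sample):
    """Rarest value in the sample; ties broken arbitrarily."""
    counts = Counter(sample)
    return min(counts, key=counts.get)

stream = "deabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefg"
anchor = pick_anchor(stream)                 # 'A' occurs least often here
positions = [i for i, v in enumerate(stream) if v == anchor]
# positions are [15, 33]: only these points need checksums, and the
# 18-value gap between them is exactly the repeat length.
```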

    Also, if you have a good idea of the ratio for each value beforehand, or can preprocess the input data, n-ary Huffman coding might reduce the size greatly, while choosing n = 16 (with hex representation) or 256 would still keep values aligned on a byte (so convenient for processing).

    Edited my first link, I wanted to link to the thread, not my post.

      Your regex: $str =~ m[(.+).*?(\1)]g would accept "abcd" as a repeating pattern in "abcdefabcd", but your example shows no extra data between the repetitions (like in your similar question here). Which interpretation is correct?

      (For now :) ) NO intervening junk is the correct interpretation. (The regex is a throw-back to early attempts to solve similar problems non-generically.)

      (And yes; that earlier node is one of the many times I've encountered the find repeats problem.)

      If your data is only one pattern repeated without intermediate data, any subsequence in that pattern can be interpreted as the start of the pattern (eg, in your example above it would be deab, rather than Abcd).

      Mmmm. Yes, but is that useful to know?

      What I mean is, wherever you start in the sequence can be considered the start of the repeat, and is in fact indistinguishable from it. That is, unless you happened to witness the start of the sequence or the end -- and know you had done so -- any possible start point in a cyclic sequence is also an end point.

      Please note: I'm not saying you're wrong, only that I'm not sure how useful it is. My nomenclature above was just me trying to describe the problem.

      You could keep the checksums for the value, or subpattern with the lower occurence. Eg, 'A' has a ratio of 1/18, so only keeping a checksum for every instance of 'A' would divide the number of checksums to keep in memory by 18. (Ie: that's trying to fit Boyer Moore as much as possible into your problem)

      I'm not sure what you mean by "'A' has a ratio of 1/18". Ratio of what to what? The penny dropped! That's actually a really good idea :)

      I could (for example), sample some number of values and choose the value with the lowest occurrence as my anchor point.

      Or take it further, and count pairs of values, and use the lowest occurrence.

      Or triples ....

      But where is the ultimate limit to this idea?

      Accumulate counts of single values until you have one that is less than any other; then discard all the others and start counting occurrences of that character paired with other characters until one pair distinguishes itself as occurring less often; then discard the other counts and start counting that pair tripled with other characters...
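One possible reading of that escalation, sketched in Python (the helper name and the tie-breaking are my assumptions): each stage keeps only the sub-patterns that start with the current anchor, so the number of live counters stays small.

```python
from collections import Counter

def escalate_anchor(stream, max_len=3):
    """Grow an anchor one value at a time: rarest single value, then the
    rarest pair starting with it, then the rarest triple, and so on."""
    anchor = ""
    for width in range(1, max_len + 1):
        counts = Counter(
            stream[i:i + width]
            for i in range(len(stream) - width + 1)
            if stream[i:i + width].startswith(anchor)
        )
        anchor = min(counts, key=counts.get)
    return anchor

stream = "deabcdefabcdefg" + "Abcdeabcdefabcdefg" * 3
# escalate_anchor(stream) grows the anchor 'A' -> 'Ab' -> 'Abc'
```

This sketch rescans the whole stream per stage for clarity; a streaming version would count incrementally, which is exactly where the Sod's Law worry below bites.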

      I'm not sure where that idea goes?

      Of course, as soon as you've decided that 'A' occurs least often and discarded the other counts, Sod's Law says that the next N characters in the stream will all be 'A's until it has become the most frequent. But then, any sequence of just 'A's would be a pretty good anchor itself, so Sod's Law says that it won't be followed by a sequence of contiguous 'A's, but rather a sequence of 'A.'s, where '.' represents every other value in the stream!

      Hm. More thought required! Later :)



        But where is the ultimate limit to this idea?
        I didn't find an answer at first, but for me you would want the anchor to be at most as long as the repeating pattern (because a longer anchor will have the same ratio in the data). So you could keep as an anchor either the longest non-repeating sequence so far, or as big a sequence as memory will allow.

        Actually, if your anchor isn't too uniform, rather than compute checksums at each successive position you could then use Boyer-Moore to "jump" to the next occurrence of the pattern.

        If the candidate pattern is too small, let's say N values, just read N values at a time in a loop until you find something that's different from your candidate. That something is probably a good choice of anchor (or as you said /($pattern)*($something)/).
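A sketch of the jump idea in Python (str.find stands in for a real Boyer-Moore scan, and the function name is mine): skip from one anchor occurrence to the next and treat the gaps as period candidates.

```python
def anchor_gaps(stream, anchor):
    """Distances between successive occurrences of `anchor`."""
    gaps, pos = [], stream.find(anchor)
    while pos != -1:
        nxt = stream.find(anchor, pos + 1)
        if nxt != -1:
            gaps.append(nxt - pos)
        pos = nxt
    return gaps

stream = "deabcdefabcdefg" + "Abcdeabcdefabcdefg" * 3
# anchor_gaps(stream, "A") is [18, 18]: a constant gap is a strong period
# candidate, found without checksumming every position.
```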

        NO intervening junk is the correct interpretation

        What about the presence or absence of preceding junk? If it's absent, i.e. the sequence starts at 0, and if the stream allows re-reading previous data, then read 2 bytes at 2N and look back at the byte at N. Make N = N + 1 now. Update the checksums for the 0..(N-1) and N..(2N-1) halves (rolling hash). The byte read at the "previous N" adds to the checksum of half1, and is removed from the checksum of half2. The 2 new bytes add to the checksum for half2. Compare the 2 checksums. And so on. Only 2 checksums are stored at any time. I hope I didn't misunderstand the task as too simple, nor say anything too trivial.
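A runnable Python reading of the two-checksum scheme just described (an additive checksum is assumed for simplicity, and a match is verified by direct comparison, since equal sums do not guarantee equal halves):

```python
def find_repeat_length(stream):
    """Smallest N with stream[0:N] == stream[N:2N], or None.
    Grows N one step at a time, keeping only two checksums."""
    h1 = h2 = 0                       # checksums of halves [0,N) and [N,2N)
    n = 0
    while 2 * (n + 1) <= len(stream):
        boundary = ord(stream[n])     # byte that crosses from half2 to half1
        h1 += boundary
        h2 -= boundary
        h2 += ord(stream[2 * n]) + ord(stream[2 * n + 1])   # 2 bytes at 2N
        n += 1
        if h1 == h2 and stream[:n] == stream[n:2 * n]:      # verify the match
            return n
    return None
```

On the thread's example, find_repeat_length("Abcdeabcdefabcdefg" * 2) returns 18.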
