Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Imagine the repeat sequence we are trying to detect is:

Abcdeabcdefabcdefg

So the input stream will look something like:

...AbcdeabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefg...

but we may start sampling at any point. eg.

deabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcdefgAbcdeabcdefabcd +efgAbcdeabcdefabcdefgAbcdeabcdefabcdefg 1 2 3 4 5 6
  • We won't see the start of a full sequence repeat until 2, but we won't have enough information to know that.
  • We will have seen a full-but-disjoint sequence at 3, but will not have enough information to know that.
  • We will have seen a full repeat -- start to end -- and point 4, but probably won't have enough info to know that?
  • At point 5 we will have seen two, full-but-disjoint repeats; we should be able to detect this?
  • At point 6 we will have seen two full, complete repeats and we should definitely be able to detect this.

If we maintain rolling checksums, then:

  • At point 3 we will have a checksum for a full-but-disjoint sequence, but will not be able to match it until point 5.
  • At point 4 we will have a checksum for a full sequence, but will not be able to match it until point 6.

But this would require us to store checksums covering every start-point and every end-point, for 2 full-but-disjoint repeats. That's N^2 checksums, which is more expensive than storing the data.

Unless there is some way to determine that we can discard those checksums that cannot be the one we want.

Ie. If when we read the 3rd value, and calculate the 2-value checksum for that value and its immediate predecessor and it doesn't match the checksum for the first two values, we know that our sequence must be longer than 2, and can discard both 2-value checksums.

Ditto, when we read the fourth value, and calculate the checksum for the last 3 values and it does not match that of the first 3, we can discard both.

But, when we calculated the 3-value checksum having read the fourth value, we needed the 2-value checksums; and when we come to calculate the 4-value checksums having read the 5 value, we need the 3-value checksums;

So the discarding of earlier checksums has to be deferred by one iteration cycle.


In reply to Re: Algorithm inspiration required. (Further thoughts on rolling checksums) by BrowserUk
in thread Algorithm inspiration required. by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (5)
As of 2024-04-24 11:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found