in reply to Algorithm inspiration required.

For the simple version (sequence endlessly repeats with no intervening junk).

If I read a (manageable) sized buffer of data from the stream, there are 3 possibilities:

  1. The entire sequence is exactly contained within the buffer, with no excess. (Unlikely, but needs to be considered.) [start...end]start...endstart...

    This is easily discounted by checking the next value from the stream. Or verified by checking the next <buffersize> values from the stream.

    Of course, it is possible that the buffer contents are an exact copy of a repeating subsequence within the larger repeating sequence, but for a buffer of anything over (say) 1000 values, this is so unlikely as to be next to impossible.

  2. The entire sequence is contained within the buffer, but it does not constitute the entire contents of the buffer.

    This possibility has 3 sub-possibilities:

    1. [start...endstart...]...endstart...endstart...

      A sequence of values at the end of the buffer exactly matches those at the start. Easily detected.

    2. [...endstart...end]start...endstart...

      A sequence of values at the start of the buffer matches a sequence ending at the end of the buffer.

    3. [...endstart...endstart...]...endstart...endstart... < >< ><

    As any position in a endlessly repeating sequence constitutes a possible start position, this means I would have at least two full repeats plus a partial at the end that would match the start of the buffer.

    This can be differentiated from 2.1) above by checking if the residual appears twice in the buffer. 3) The entire buffer constitutes only a part of the repeat. (This is likely to be the common case.) start...[...]...end

    But of course, as any point in the sequence can be considered a start, this is really just [start...]...endstart...endstart...end

So, if I chose a manageable number of values (say 100) at the start of the buffer as my anchor, and calculate a checksum for those 100 values (CHK100START).

Then calculate a running checksum of the full buffer(CHKFULL).

Now read from the stream, and maintain a 100 value FIFO.

Calculate a 100 value running checksum (CHK100RUNNING) as I put the values into the FIFO; And add the value's distinctiveness to the collective (CHKFULL) as I discard them.

Whenever CHK100RUNNING matches CHK100START, then I have potentially found the start of the next repeat; and the value of CHKFULL (lagging CH100RUNNING by 100) represents the checksum for the full sequence.

At this point, I save a copy of CHKFULL, and start a new checksum (CHKFULL2) from the values exiting the FIFO.

Now I read on again until CHK100RUNNING once again matches CH100START.

At this point, if CHKFULL2 == CHKFULL then (with a very high degree of probability*) I have traversed the full sequence twice.

If not, then the CHK100* match is a false one -- quite possible with a fast running checksum algorithm -- so I add CHKFULL2 back to CHKFULL, and continue through the stream trying to match CHK100START.

*To validate the highly probable, but still not 100% certain, conclusion, I read the (now known) length of the potential sequence twice more, writing each copy to a separate file and then compare the files.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit