in reply to Algorithm inspiration required.
For the simple version (sequence endlessly repeats with no intervening junk).
If I read a (manageable) sized buffer of data from the stream, there are 3 possibilities:
This is easily discounted by checking the next value from the stream. Or verified by checking the next <buffersize> values from the stream.
Of course, it is possible that the buffer contents are an exact copy of a repeating subsequence within the larger repeating sequence, but for a buffer of anything over (say) 1000 values, this is so unlikely as to be next to impossible.
This possibility has 3 sub-possibilities:
A sequence of values at the end of the buffer exactly matches those at the start. Easily detected.
A sequence of values at the start of the buffer matches a sequence ending at the end of the buffer.
[...endstart...endstart...]...endstart...endstart... < >< ><
As any position in a endlessly repeating sequence constitutes a possible start position, this means I would have at least two full repeats plus a partial at the end that would match the start of the buffer.
This can be differentiated from 2.1) above by checking if the residual appears twice in the buffer. 3) The entire buffer constitutes only a part of the repeat. (This is likely to be the common case.) start...[...]...end
But of course, as any point in the sequence can be considered a start, this is really just [start...]...endstart...endstart...end
So, if I chose a manageable number of values (say 100) at the start of the buffer as my anchor, and calculate a checksum for those 100 values (CHK100START).
Then calculate a running checksum of the full buffer(CHKFULL).
Now read from the stream, and maintain a 100 value FIFO.
Calculate a 100 value running checksum (CHK100RUNNING) as I put the values into the FIFO; And add the value's distinctiveness to the collective (CHKFULL) as I discard them.
Whenever CHK100RUNNING matches CHK100START, then I have potentially found the start of the next repeat; and the value of CHKFULL (lagging CH100RUNNING by 100) represents the checksum for the full sequence.
At this point, I save a copy of CHKFULL, and start a new checksum (CHKFULL2) from the values exiting the FIFO.
Now I read on again until CHK100RUNNING once again matches CH100START.
At this point, if CHKFULL2 == CHKFULL then (with a very high degree of probability*) I have traversed the full sequence twice.
If not, then the CHK100* match is a false one -- quite possible with a fast running checksum algorithm -- so I add CHKFULL2 back to CHKFULL, and continue through the stream trying to match CHK100START.
*To validate the highly probable, but still not 100% certain, conclusion, I read the (now known) length of the potential sequence twice more, writing each copy to a separate file and then compare the files.
|
---|