comment on

First, let me clarify (perhaps I should at this to the root node), I'm not trying to solve one specific problem right now, but rather I'm looking to come up with a generic solution (something I rarely try), to a class of problems I've encountered many times over the last 30 odd years. In most of those cases I've solved the immediate need using some more or less ad-hoc solution that allowed me to get back to the real problem I wanted to solve.

When this came up again just recently, I again knocked up an ad-hoc solution based on one I've used before. I store the 'things' -- the 'values' that make up the sequence that are generally more complex than a single value -- in a hash and allocate a number or letter to represent it. I then construct a string from those letters and numbers and use the regex engine to detect repeating sequences.

That works (if done with care) for streams that consist of 256 values or less. It might be adaptable to larger numbers of values by using unicode 'characters'; but there are several problems with this approach:

Unicode!
Just trying to visually inspect (debug, trace) unicode encoded data is a pain. Even if you find a font that can handle all possible values, trying to 'see' stuff when you might have a mix of double-wide Kanji, urdu, cerillic, runes etc. will give you a headache fast. And then there's trying to get to grips with teh behemoth that is the "Unicode api". Pointless.
Memory.
The need to store the entire dataset in memory to be able to run regex on it.
Control.
Once you set the regex engine in motion, you loose any control until it stops. And sometimes it never stops; or would take so long it amounts to the same thing.
It is still limited to somewhat less than 2^31 which wouldn't be big enough for some problems.

So, when the problem came up again, I started to think about the possibilities for a 'generic' solution.

You do have the stream stored somehow, even if not in memory, ie you can replay it?

It would certainly be possible to store the sequence to disk as it is read.
How large is your alphabet? Even if it is not characters are there 100s, 1000s or how many values?

It depends upon the specific problem, but for the latest potential application I encountered for something like this, the 'alphabet' would need to be at least 2^32; with 2^48 an ultimate goal (for now).
As soon as you find a longer repeat sequence, any shorter ones are of no interest.
How do you decide when you have found the longest possible sequence? I'm not yet sure that you ever can!
For example, if you get: pqrABCDExyzABCDEpqr then you might conclude that 'ABCDExyzABCDEpqr' is probably the longest sequence and stop.
And if you went on to get:pqrABCDExyzABCDEpqrABCDExyzABCDEpqr then the odds are strongly in your favour;
but if you continued, you might get:pqrABCDExyzABCDEpqrABCDExyzABCDEpqrABCDExyzABCDEpqrABCDExyzABCDEpqrs which would mean that this is part of a larger still sequence.
At some point, you have to how long you are prepared to keep looking.
How do you distinguish short from long repeated sequences? In other words, what is your minimum length for a successful repeat? There could be a repeat of 5 items, is that a success?

No minimum length. If a longer repeat is there, shorter ones have no purpose.
If you have a minimum length, try to do running checksums or such. Obviously, you would need to store position as well, otherwise you cannot go back and retrieve the sequence corresponding to that checksum.

If the sequence repeats endlessly (as it has for the applications I've encountered where this would be useful) then having determined the checksum for the repeat sequence, you can just read on and store until you've stored the full repeat.
Do you need perfect matches or is it ok if there are a few deviations? This might be used in the construction of the checksum.
Perfect matches are a must; the whole idea doesn't make sense otherwise I think.

For most of the applications I've encountered, the sequence consists entirely of full-repeats with no intervening junk. So if you find a repeat that isn't immediately followed, by those characters at it start -- ie. in ...abcde*** if '...' <> '***' then you aren't done.

But just because '...' == '***', it doesn't mean you are. It might be: ...ABCDE***ABCDEF... or ...ABCDE***ABCDE...PQR...ABCDE***ABCDE...ABCDE***ABCDE...PQR...ABCDE***ABCDE

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.

In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

In reply to Re^2: Algorithm inspiration required. by BrowserUk
in thread Algorithm inspiration required. by BrowserUk

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


"be consistent"
	PerlMonks