Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
Your regex: $str =~ m(.+).*?(\1)g would accept "abcd" as a repeating pattern in "abcdefabcd", but your example shows no extra data between the repetitions (like in your similar question here). Which interpretation is correct?

(For now :) ) NO intervening junk is the correct interpretation. (The regex is a throw-back to early attempts to solve similar problems non-generically.

(And yes; that earlier node is one of the many times I've encountered the find repeats problem.

If your data is only one pattern repeated without intermediate data, any subsequence in that pattern can be interpreted as the start of the pattern (eg, in your example above it would be deab, rather than Abcd).

Mmmm. Yes, but is that useful to know?

What I mean is, wherever you start in the sequence can be considered the start of the repeat, and is in fact indistinguishable from it. That is, unless you happened to witness the start of the sequence or the end -- and know you had done so -- any possible start point in a cyclic sequence is also an end point.

Please note: I'm not saying you're wrong, only that I'm not sure how useful it is. My nomenclature above was just me trying to describe the problem.

You could keep the checksums for the value, or subpattern with the lower occurence. Eg, 'A' has a ratio of 1/18, so only keeping a checksum for every instance of 'A' would divide the number of checksums to keep in memory by 18. (Ie: that's trying to fit Boyer Moore as much as possible into your problem)

I'm not sure what you mean by "'A' has a ratio of 1/18". Ratio of what to what? The penny dropped! That's actually a really good idea :)

I could (for example), sample some number of values and choose the value with the lowest occurrence as my anchor point.

Or take it further, and count pairs of values, and use the lowest occurrence.

Or triples ....

But where is the ultimate limit to this idea?

Accumulate counts of single values until you have one that is less than any other; then discard all the others and start counting occurrences of that character paired with other characters until one pair distinguish themselves as being lesser occurring; then discard the other counts and start counting that pair tripled with other characters...

I'm not sure where that idea goes?

Of course, as soon as you've decided that 'A' occurs least often and discarded the other counts, Sod's Law says that the next N characters in the stream will all be 'A's until it has become the most frequent. But then, any sequence of just 'A's would be a pretty good anchor itself, so Sod's Law says that it won't be followed by a sequence of contiguous 'A's, but rather a sequence of 'A.'s, where '.' represents every other value in the stream!

Hm. More thought required! Later :)


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

In reply to Re^3: Algorithm inspiration required. (Further thoughts on rolling checksums) by BrowserUk
in thread Algorithm inspiration required. by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (8)
As of 2024-03-28 10:53 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found