QM has asked for the wisdom of the Perl Monks concerning the following question:

This comes from a real-world, icky problem I'm facing, but I'm simplifying it to avoid the mucky details.

If I had the string:

# 10 quoted chars my $string = 'ABABCABCAC';
I could save some space by noting that 'ABCABC' could be changed to 'ABC'x2: 1,2
# 7 quoted chars my $string = 'AB'.'ABC'x2.'AC';
Note that if I had just taken the first repeated string, 'ABAB', I would have this:
# 8 quoted chars my $string = ('AB'x2).'CABCAC'x2;
which is not optimal.

Additionally, there is a limit to the number of x$N elements allowed, $Max_Repeats. So the final solution would have to weigh the number of repeats used against the character savings.

The approach I'm considering first walks through the string, building a table of substrings seen. Each entry would be a substring, along with each starting position where it was found. 3 Then walk the table, making a 2nd table of all repeated, concatenated subsequences, which would have the base sequence, and a substructure for each base sequence indicating starting position(s), repeat count(s), and characters saved. Then sorting by characters saved, I should be able to minimize the number of characters without exceeding the repeat code limit. 4

But I can't seem to get past "build a table of substrings seen". If the real problem were small enough that this wouldn't run out of memory, then I could probably do it with paper and pencil anyway.

Update: One point I didn't make is that "chars" in the example above translate to statements in a simple language in the real problem. These statements are easily matched with a regex, but they are somewhat complicated. Trying to use BrowserUK's approach directly forces the regex engine to backtrack mercilessly through all of the subexpressions in a single statement match (and not spend much time on the repeat search.]

One obvious reduction is to translate the real input to a fundamental form that doesn't need a fancy regex to match each element. Perhaps something like a comma-separated list of a hash function in hex format.


1 I don't really have "just a string", and there is a magic way to indicate repeated concatenated items that doesn't take up "item space", but does takes up "repeat space".

2 I can't use the coding table approach of ordinary compression algorithms to, for instance, substitute a code for each occurrence of 'AB'. All reductions must consist of a repeated, concatenated sequence.

3 Obviously I have a complexity flaw. I've played with a regex on toy input, but it was just real enough that it wouldn't have finished before the next blackout.

4 Flaws 2 and 3 -- overlapping sequences; and the bin packing problem.


-QM
--
Quantum Mechanics: The dreams stuff is made of

Replies are listed 'Best First'.
Re: String Compression Optimization (repeated concatenated subsequences)
by Aristotle (Chancellor) on Sep 11, 2004 at 23:07 UTC

    Finding the optimal partitioning is going to be at least O(n2) no matter what. That is the reason realworld compression dictionary compression algorithms don't try to find the optimal partitioning. LZ77 works with a fixed-size window into the input that is usually relatively small. LZ78 works greedily (and takes a while to build up a usefully populated dictionary).

    In general, your description sounds like something close to the LZ77 family. Typically, achieving speed with these algorithms is done by building a binary tree out of the window's contents, which allows rapid lookup of possible string matches and can be maintained very cheaply with a few tricks. It also means you can grow the buffer to decent sizes (typically 64k or so) because time spent on lookups only grows by log2 n in relation to the buffer size, which easily amortizes the memory and cycles for the tree.

    What constraints are you working within? They have a huge impact on solutions, and you haven't told us much yet. Is your input size bounded in any way? What ballpark is the maximum repetition counter you can store, and the repeated string's maximum length?

    Makeshifts last the longest.

      Is your input size bounded in any way?
      Input size bound is usually in the million "char" range, though it could be 10M or so.
      What ballpark is the maximum repetition counter you can store, and the repeated string's maximum length?

      Maximum repeat counter is 64K. Maximum repeated string length is ~10M, though typically less than 10K.

      The maximum number of distinct repeat groups is ~4K, but I think that's only for sequential repeated groups. What actually matters is repeated groups plus transitions to "normal" groups. So it's data dependent, with worst case being ~2K.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: String Compression Optimization (repeated concatenated subsequences)
by BrowserUk (Patriarch) on Sep 12, 2004 at 01:56 UTC

    If I've understood the problem, this is much less complex than the LZ case as your looking to replace sequential repeats rather sub-sequences anywhere in the string.

    I won't swear the results of the below are optimal in all cases-- that's a huge task to verify, and damned hard to analyse--but on the various tests I've run it seems to a fair job pretty quickly.

    Some results (I used an input range of 'a'..'b' so as to produce a large number of overlapps for testing).

    [ 2:44:41.87] P:\test>390333 -L=70 -RANGE='a'..'b' abaabbaaaabbaaaabaabaabaaaababaababbbbbaababbaabbaaaaabbbbaabbaababbba + AABBAAaabbaa AABaabaab AABABaabab Bbb ABBAabba Aaa BBAAbbaa Bbb [ 2:44:42.69] P:\test>390333 -L=70 -RANGE='a'..'b' abbbaaabbaabbbbbbbbbaaabbaaaaabbbabaaaabaaababaababbabbaaaaaabbbbbabbb + Bbb AABBaabb BBAAAbbaaa Aa Bbb AABABaabab Bb AAAaaa BBbb Bbb [ 2:44:43.55] P:\test>390333 -L=70 -RANGE='a'..'b' aabbbaaaabaabbbbbbbabaaabaaaaaabaabbbabaabbaabbaababbaaabaabaaabbababb + Aa Bbb AABaab BBbb BAAAbaaa AABaab BAABbaabbaab Bb AABaab Aaa BAba Bb [ 2:44:44.37] P:\test>390333 -L=70 -RANGE='a'..'b' aabbbbaaabbbabababbbabbbbbababababbaabaabaaaabaababbaabbabbabbbbaabbbb + Aa BBbb Aaa BAbaba BBbb BABAbaba BAAbaabaa AABaab ABBabbabb Bb Aa BBbb

    UPDATED: Here's a somewhat improved version that recursively processes the string so that sub-runs rejected because they overlapped are reprocessed.

    Some output showing the encoded and decoded versions match:

    [ 6:10:08.93] P:\test>390333 -L=70 -RANGE='a'..'b' Sun Sep 12 06:11:42 2004 : aabbbabababaaabaaaabababbabaaabaaabaabbabbbabababaabbaaabbaabbbabaaaab + Sun Sep 12 06:11:42 2004 : (2:a)(2:b)(2:baba)(2:a)b(3:a)(3:ab)ba(2:baaa)ba(2:abb)(3:ba)b(2:aabba) +a(3:b)ab(2:aa)b Sun Sep 12 06:11:42 2004 : aabbbabababaaabaaaabababbabaaabaaabaabbabbbabababaabbaaabbaabbbabaaaab + [ 6:11:42.57] P:\test>390333 -L=70 -RANGE='a'..'b' Sun Sep 12 06:11:46 2004 : bbaaaabbabbbbababaabbbaaababbababbbbbbbabbbbabaaaabaababaaaabbaabbbaaa + Sun Sep 12 06:11:46 2004 : (2:b)(3:a)(2:abb)b(3:ba)a(3:b)(2:a)(2:ababb)b(2:bbbba)b(2:a)(2:aab)ab( +2:a)(2:aabb)b(3:a) Sun Sep 12 06:11:46 2004 : bbaaaabbabbbbababaabbbaaababbababbbbbbbabbbbabaaaabaababaaaabbaabbbaaa

    A longer (10,000 random 'a's & 'b's) string processed in around 11 seconds.

    Funky formatting courtesy of PM's codewrap "feature"!


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      Thanks for your code. It seems to work as advertised, but I'll have to run more tests ;)

      I've listed some bounds above, and an update to the original node on a detail I forgot to mention, which may complicate it further.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        Those bounds (or perhaps the near lack of them:) will require some optimisations to be applied in order to achieve reasonable runtimes--depending upon what you designate as reasonable?

        One difficulty with looking at a problem that is as open spec'd as this is knowing where to concetrate your efforts. Do you have any "typical" samples of the datastreams that you could make accessible? It's an interesting problem that I'd enjoy pursuing a bit further.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: String Compression Optimization (repeated concatenated subsequences)
by demerphq (Chancellor) on Sep 12, 2004 at 08:18 UTC

    Have a look at Data::Dump. In its stringification routine (quote()) it does this:

    if (length($_) > 20) { # Check for repeated string if (/^(.{1,5}?)(\1*)$/s) { my $base = quote($1); my $repeat = length($2)/length($1) + 1; return "($base x $repeat)"; } }

    I imagine you could do something similar.


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi

      Flux8


Re: String Compression Optimization (repeated concatenated subsequences)
by hv (Prior) on Sep 12, 2004 at 01:34 UTC

    I don't get much feel for your real world problem from this description. Perhaps it would be worth devising a little test harness with test strings ranging from 'toy' to 'realistic', that also encapsulates realistic limits and shows both the optimal result and the range of suboptimal solutions that would be good enough to be useful.

    That would also give you a consistent basis from which to evaluate any solutions you or the assembled monks might devise.

    Hugo

Re: String Compression Optimization (repeated concatenated subsequences)
by ihb (Deacon) on Sep 12, 2004 at 08:18 UTC

    I'm not sure this is relevant, but perhaps this solution of mine to a similar problem could give you some ideas: Re: Minimal password check, again.

    Here's the sample use and output from Re: Minimal password check, again:

    check_reps('xattattttatty'); __END__ att repeated 2 times, covers 46% ttatt repeated 2 times, covers 76% tt repeated 2 times, covers 30% t repeated 3 times, covers 23% t repeated 2 times, covers 15% t repeated 2 times, covers 15%

    Hope this helps,
    ihb

    Read argumentation in its context!

      Thanks for the suggestion. I've tried something essentially similar. The problem is one of scaling, and the fact that the real problem doesn't use chars and strings (see Update 1).

      I'm not sure you need the lookahead, but I'm more worried about how it affects execution time. Maybe I'll benchmark something on that while I'm taking a break from this problem ;)

      Update: I see that the lookahead captures overlaps, while the other doesn't.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: String Compression Optimization (repeated concatenated subsequences)
by Caron (Friar) on Sep 12, 2004 at 23:02 UTC