in reply to String Compression Optimization (repeated concatenated subsequences)

Finding the optimal partitioning is going to be at least O(n2) no matter what. That is the reason realworld compression dictionary compression algorithms don't try to find the optimal partitioning. LZ77 works with a fixed-size window into the input that is usually relatively small. LZ78 works greedily (and takes a while to build up a usefully populated dictionary).

In general, your description sounds like something close to the LZ77 family. Typically, achieving speed with these algorithms is done by building a binary tree out of the window's contents, which allows rapid lookup of possible string matches and can be maintained very cheaply with a few tricks. It also means you can grow the buffer to decent sizes (typically 64k or so) because time spent on lookups only grows by log2 n in relation to the buffer size, which easily amortizes the memory and cycles for the tree.

What constraints are you working within? They have a huge impact on solutions, and you haven't told us much yet. Is your input size bounded in any way? What ballpark is the maximum repetition counter you can store, and the repeated string's maximum length?

Makeshifts last the longest.

  • Comment on Re: String Compression Optimization (repeated concatenated subsequences)

Replies are listed 'Best First'.
Re^2: String Compression Optimization (repeated concatenated subsequences)
by QM (Parson) on Sep 14, 2004 at 00:48 UTC
    Is your input size bounded in any way?
    Input size bound is usually in the million "char" range, though it could be 10M or so.
    What ballpark is the maximum repetition counter you can store, and the repeated string's maximum length?

    Maximum repeat counter is 64K. Maximum repeated string length is ~10M, though typically less than 10K.

    The maximum number of distinct repeat groups is ~4K, but I think that's only for sequential repeated groups. What actually matters is repeated groups plus transitions to "normal" groups. So it's data dependent, with worst case being ~2K.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of