in reply to Re^2: String Compression Optimization (repeated concatenated subsequences)
in thread String Compression Optimization (repeated concatenated subsequences)

Those bounds (or perhaps the near-lack of them :) will require some optimisations in order to achieve reasonable runtimes -- depending upon what you designate as reasonable?

One difficulty with looking at a problem as openly spec'd as this is knowing where to concentrate your efforts. Do you have any "typical" samples of the datastreams that you could make accessible? It's an interesting problem that I'd enjoy pursuing a bit further.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Replies are listed 'Best First'.
Re^4: String Compression Optimization (repeated concatenated subsequences)
by QM (Parson) on Sep 14, 2004 at 01:33 UTC
    I'll try and post something here tomorrow, under appropriate readmore tags...

    Update: It's tomorrow, and here's a toy example:

    and a more realistic piece (200 lines):

    The full file is 400MB -- I don't know where I would upload that, or if it would matter.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Maybe your scratchpad and a /msg?



      For comparison purposes:

u: 1 wallclock secs ( 0.53 usr + 0.00 sys = 0.53 CPU) @ 112.78/s (n=60)

The length and complexity of the real data elements are the cause of all your backtracking woes. So, do away with them :)

      What I did was to extract an array of the significant bits of the data ("L 1 0 0 0 L 0 L L H 0 L 0 L 0 L 1 L 1 0 1 0" etc.), then uniq those using a hash and build a substitution or placeholder table using 'aa;', 'ab;', 'ac;' etc. I then concatenate these into a single string and use my findRuns() (with appropriately modified regexes).
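The placeholder step described above might look something like this minimal sketch (the element strings, and the %fwd and $next names, are invented for illustration; the actual code is not shown here):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map each unique data element to a short, regular placeholder
# ('aa;', 'ab;', ...) so the regex engine works on a compact,
# uniform string instead of the long raw elements.
my @elements = ( 'L 1 0 0 0', 'L 0 L L H 0', 'L 1 0 0 0', 'L 0 L 0 L 1' );

my %fwd;                 # element -> placeholder
my $next = 'aa';         # 'aa', 'ab', ... via Perl's magic string increment
for my $elem (@elements) {
    $fwd{$elem} //= ( $next++ ) . ';';
}

# Concatenate the placeholders into the single string to be compressed.
my $encoded = join '', map { $fwd{$_} } @elements;
print "$encoded\n";      # aa;ab;aa;ac;
```

The 'aa'..'zz' scheme gives 676 fixed-width tokens, which keeps the run-finding regexes simple.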

      Once the placeholder string has been compressed, I use a reverse lookup table to put back the original data elements into the compressed string.
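Roughly like this (findRuns() itself isn't shown, so a simple repeat-folding substitution stands in for it here, and the "count*placeholder" notation is invented for the example):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Forward table as built in the previous step; reverse it for restoration.
my %fwd = ( 'L 1 0 0 0' => 'aa;', 'L 0 L L H 0' => 'ab;' );
my %rev = reverse %fwd;    # placeholder -> original element

my $encoded = 'aa;aa;aa;ab;aa;';

# Stand-in for findRuns(): fold immediate repeats of one placeholder
# into "count*placeholder" (the real repeat format is not shown).
( my $compressed = $encoded ) =~
    s{(([a-z]{2};)\2+)}{ ( length($1) / length($2) ) . "*$2" }gex;
print "$compressed\n";     # 3*aa;ab;aa;

# Reverse lookup: put the original data elements back into the result.
( my $restored = $compressed ) =~ s/([a-z]{2};)/[$rev{$1}]/g;
print "$restored\n";       # 3*[L 1 0 0 0][L 0 L L H 0][L 1 0 0 0]
```

Because the placeholders are fixed-width and drawn from a tiny alphabet, the engine never has to backtrack through the long raw elements.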

As you can see below, the placeholder sequence is much shorter, and the regular nature of the placeholders means the regex engine has a much simpler task finding the sequences.

      I haven't bothered to re-integrate the compressed sequences with the header and trailer information, as you haven't said what format would be used for the repeat sequences.

      I've also injected some newlines and tabs to make visual verification easier.


Funny, I was working on the same problem when I read your node. This will compress the 200-line example six times a second:

      Search, Ask, Know