in reply to Re^2: String Compression Optimization (repeated concatenated subsequences)
in thread String Compression Optimization (repeated concatenated subsequences)

Those bounds (or perhaps the near-lack of them :) will require some optimisations in order to achieve reasonable runtimes -- depending upon what you designate as reasonable?

One difficulty with looking at a problem as openly spec'd as this is knowing where to concentrate your efforts. Do you have any "typical" samples of the datastreams that you could make accessible? It's an interesting problem that I'd enjoy pursuing a bit further.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Replies are listed 'Best First'.
Re^4: String Compression Optimization (repeated concatenated subsequences)
by QM (Parson) on Sep 14, 2004 at 01:33 UTC
    I'll try and post something here tomorrow, under appropriate readmore tags...

    Update: It's tomorrow, and here's a toy example:

    and a more realistic piece (200 lines):

    The full file is 400MB -- I don't know where I would upload that, or if it would matter.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Maybe your scratchpad and a /msg?



      For comparison purposes:

u: 1 wallclock secs ( 0.53 usr + 0.00 sys = 0.53 CPU) @ 112.78/s (n=60)

The length and complexity of the real data elements are the cause of all your backtracking woes. So, do away with them :)

      What I did was to extract an array of the significant bits of the data ("L 1 0 0 0 L 0 L L H 0 L 0 L 0 L 1 L 1 0 1 0" etc.), then uniq those using a hash and build a substitution or placeholder table using 'aa;', 'ab;', 'ac;' etc. I then concatenate these into a single string and use my findRuns() (with appropriately modified regexes).
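The placeholder step described above might look something like this minimal sketch (the element strings, and the %fwd and $next names, are invented for illustration; the actual code is not shown here):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map each unique data element to a short, regular placeholder
# ('aa;', 'ab;', ...) so the regex engine works on a compact,
# uniform string instead of the long raw elements.
my @elements = ( 'L 1 0 0 0', 'L 0 L L H 0', 'L 1 0 0 0', 'L 0 L 0 L 1' );

my %fwd;                 # element -> placeholder
my $next = 'aa';         # 'aa', 'ab', ... via Perl's magic string increment
for my $elem (@elements) {
    $fwd{$elem} //= ( $next++ ) . ';';
}

# Concatenate the placeholders into the single string to be compressed.
my $encoded = join '', map { $fwd{$_} } @elements;
print "$encoded\n";      # aa;ab;aa;ac;
```

The 'aa'..'zz' scheme gives 676 fixed-width tokens, which keeps the run-finding regexes simple.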

      Once the placeholder string has been compressed, I use a reverse lookup table to put back the original data elements into the compressed string.
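Roughly like this (findRuns() itself isn't shown, so a simple repeat-folding substitution stands in for it here, and the "count*placeholder" notation is invented for the example):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Forward table as built in the previous step; reverse it for restoration.
my %fwd = ( 'L 1 0 0 0' => 'aa;', 'L 0 L L H 0' => 'ab;' );
my %rev = reverse %fwd;    # placeholder -> original element

my $encoded = 'aa;aa;aa;ab;aa;';

# Stand-in for findRuns(): fold immediate repeats of one placeholder
# into "count*placeholder" (the real repeat format is not shown).
( my $compressed = $encoded ) =~
    s{(([a-z]{2};)\2+)}{ ( length($1) / length($2) ) . "*$2" }gex;
print "$compressed\n";     # 3*aa;ab;aa;

# Reverse lookup: put the original data elements back into the result.
( my $restored = $compressed ) =~ s/([a-z]{2};)/[$rev{$1}]/g;
print "$restored\n";       # 3*[L 1 0 0 0][L 0 L L H 0][L 1 0 0 0]
```

Because the placeholders are fixed-width and drawn from a tiny alphabet, the engine never has to backtrack through the long raw elements.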

As you can see below, the placeholder sequence is much shorter, and the regular nature of the placeholders means the regex engine has a much simpler task finding the sequences.

      I haven't bothered to re-integrate the compressed sequences with the header and trailer information, as you haven't said what format would be used for the repeat sequences.

      I've also injected some newlines and tabs to make visual verification easier.


Funny, I was working on the same problem when I read your node. This will compress the 200-line example six times a second:

      Search, Ask, Know