I've put this here, coz I'm not sure which section to post in....

Just spent the worst part of an hour on this, only to spot an obvious problem - chances of matching n bytes is 256^n, not 256^(n-1). D'oh ;-)

So this is NOT A SUGGESTED SOLUTION!, just my thought processes before I went wrong...

The challenge, discussed on /.:

    I will attach a prize of $5,000 to anyone who successfully meets this
    challenge.  First, the contestant will tell me HOW LONG of a data file to
    generate.  Second, I will generate the data file, and send it to the
    contestant.  Last, the contestant will send me a decompressor and a
    compressed file, which will together total in size less than the original
    data file, and which will be able to restore the compressed file to the
    original state.

    With this offer, you can tune your algorithm to my data.  You tell me the
    parameters of size in advance.  All I get to do is arrange the bits within
    my file according to the dictates of my whim.  As a processing fee, I will
    require an advance deposit of $100 from any contestant.  This deposit is
    100% refundable if you meet the challenge.

Although it would never get paid (see this :), I couldn't help but wonder. My thoughts...

Compressor:

parse the file starting at byte $i=0

search for $n byte repetitions for $n>1, existing when $i is between 256**($n-2) and 256**($n-1) - 1 inclusive

splice these out of the data and store a placeholder (ie $i) to show where to resplice bytes to.

So, the matches reduce storage space as follows:

               repeat 
position       match    storage

0-255          2 bytes  1 byte
256-65535      3 bytes  2 bytes
65536-16777215 4 bytes  3 bytes
etc

delimit matches of different byte size with a single delimiter, ie:

single byte matches=double byte matches=etc

determine the delimiter based on lowest number of occurences in matches (as these would have to be deleted)

So for a file of length $i and $m matches (not containing delimiter),

bytes saved = $m - (int(log(base 256)$i)

possibly minus another one - can't remember my logs :)

Only problem is that this saves a negative amount on average...

Decompressor:

Take the values stored, and parse. Here's a wordy example to illustrate how I'd decompress... my @matches = split '=', 'single byte matches=double byte matches=etc'; my $i=0; for (@matches) { while(substr($_,0,$i+1,'')) { ... (un?)pack to decimal ... add repetion of length $i+1 at this position in the string } $i++; }

Obviously, this is pseudocode. we'd be dealing with 2 bytes representing a number between 256 and 65535 etc, but I hope you see the idea.


Damn, and I was so sure I was on to something...

Ah well, I really must get back to work, but your thoughts/code welcomed on this...

cLive ;-)


In reply to compression challenge by cLive ;-)

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.