comment on

BrowserUk,
The closest I have had to deal to that problem is where the first few bytes of a message told how long the message was but had to include itself in the length calculation. For illustration purposes, lets say it would just be the number of bytes followed by a capital X followed by the payload.

If the payload was 98 bytes and we add the letter X we get 99 so we start out by saying '99X..' but by adding 99 to the message we added two more bytes which made it '101X..' but because 101 is longer than 99 by 1 byte, the final value was 102X. This was a simple problem to solve though I never did understand why the length had to include itself.

Getting back to the OP, here is the approach I have been playing with in my head. There will be 96 characters (Unix newline plus ASCII 32 - ASCII 126). This means a 1/8 reduction simply by going to 7 bit characters and leaves room for 32 multi-character tuples.

Rather than keeping track of each tuple separately, keep track of data structures of size 3 (originally I thought 4 could fit into memory but I am not so sure). Single characters are ignored because they will be encoded using the 7 bit method.

HER (count 137)
    HE
    ER
    Points To
        ERE  (count 100)
        ERM  (count 27)
        RED  (count 18)
    Pointed From
        IHE (count 11)
        HEH (count 2)
[download]

Once the data structure is created, get the frequency of all the 2 character tuples (requires traversing the data structure). With this data in hand, use a heuristic algorithm to pick an approach. You could limit how long you look based on time or number of iterations or what not. Pick the tuple (either 2 or 3) that saves the most space. There should be enough data in the tree to measure the impact of every choice and re-calculate frequencies. Once 32 have been chosen - remember the choices and the total size reduction and restart by picking the 2nd highest choice first.

Does this make sense? It does in my head but I have been struggling with articulating ideas lately.

Cheers - L~R

In reply to Re^4: Dynamically Updating Frequency Analysis by Limbic~Region
in thread Dynamically Updating Frequency Analysis by Limbic~Region

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.