in reply to Compressing a set of integers
since someone else has already posted a reasonable solution, I'll add a couple of notes:
When discussing numbers, it is important to note the boundary. You said integer, but are negatives possible? is the range outside the 16bit signed limit? 32bit signed limit? is there any limit at all?
When looking for performance solutions, you should also provide information on the platform. Do you need portability? are you on a win32 or unix platform? are you capable of understanding C code? (an Inline::C function would make short work of your problem)
And if possible, providing context is valuable. There have been numerous times when people have posted a very limited description, asking a very targetted question, and it was found that the entire operation could be avoided by use of a different overall strategy. Context helps experienced monks identify those situations.
More details on compressing a set of integers
by toma (Vicar) on Jan 26, 2003 at 05:10 UTC
|
The numbers represent an index into another array.
The numbers are positive. I could offset the numbers
by a fixed amount if it would help somehow.
I have about 50,000 numbers today, but I don't want
the code to have a 16 bit limit since I expect the
dataset to grow.
I want to be able to store the data with Storable,
and I think that it will not support a custom
Inline::C representation. If there is another
high-performance way to store the data that can
take advantage of Inline::C, then that would be
a candidate.
My platform is somewhat crufty. It is HP-UX 10.20
for the production web server and linux as the
data generator. For generality and future-proofing,
I prefer portable code, but this is not a hard
requirement.
So far, my main ideas are:
- Translate the numbers to base64 to make them
more compact in string form.
- Reorder the data so that more common numbers
are smaller and are therefore shorter in string form.
- Create and optimize a state machine to make a
real compressor tuned for this special set of characters.
I have been reluctant to use our database for this
task because the lists of numbers can get quite
long. I have a VARCHAR limit of 255 characters, and
I don't want to get into BLOBs. I could use another
database such as SQLite which uses strings as the
datatype, but I have almost no experience with this
type of solution. I do plan on implementing a
solution (for the larger problem) in SQLite for
comparison purposes.
The larger problem is that I am investigating
speed versus memory tradeoffs in on-line queries
involving a many-to-many relationship.
I have an *extremely*
fast solution using Table::ParentChild,
but it cannot be stored using the Storable module.
This is because Table::ParentChild uses an XS data
structure, which represents each parent-child relationship
in four 32 bit integers. I can sacrifice a little
speed and use less memory. To this end
I have coded another very fast solution using hashes
and arrays, and most of the memory it uses are in
these strings of numbers. I think that the strings
are still wasteful, since they represent only about 3 bits
of information per digit and one bit or so of information
per delimiter character.
I hope to write more about the larger problem in the
future, but for now I find that the core of my
solution relies on these strings of numbers.
The compression problem sounded familiar, so I hoped that
someone would have some ideas.
I have run across this problem before, where I
wanted compression, but I have a line-oriented
file and I want to be able to decompress just
a single line at a time, when this line is
located anywhere in the file.
It seems like an ordinary problem in compression,
but after searching for a few hours I didn't find anything,
so I asked first in chat and then here.
Update:
I have found the classification of the larger problem
that I am working on. It is a graph problem. My graphs
are sparse and large, so I am using an adjacency list.
The sets that I am compressing are rows in the adjacency
list. This is described in chapter 8 of
Mastering Algorithms with Perl,
where the Graph modules are described.
It should work perfectly the first time! - toma | [reply] |
|
Its worth pointing out that attempting to compress a string containing a list of comma delimited ascii numbers using Base64, will make the string grow in size, not shrink.
Eg. A string containing 1000 5 digit numbers separated by commas, is 6000 bytes. Encode that base64 and you get a string nearly 8K in length.
Pack the same string using 16-bits and you get 2000 bytes.
If you really need to make it smaller still, then you might consider using Compress::Zlib::memGzip which compresses the same ascii list into 1800 bytes.
There are caveats associated with this 10% reduction in size. A) your compression will vary with the data set. B) Compress::Zlib::memGunzip is 10 to 15 times slower than unpack.
If your main concern is that using pack & unpack with format 'S*', will limit you to 16-bit numbers, then move to 32-bit using 'L*'. This will pack the same 1000 numbers into 4000 bytes which makes it look not such a good option until you think about what happens when you start using 6-digit numbers.
The string containing 100,000 .. 101,000, comma delimited ascii is 7000 bytes. Base64:9467, zipped:1817, packed 32-bit 4000. Again, Zlib is looking good, but now the figures for a random mixed of 1 to 6-digit numbers
ascii:6458, base64:8726, zipped:3169, pack 32-bit:4000.
Suddenly the trade for decompression speed looks less valuable, and if you move to a random mix of 8-digit numbers, Zlib starts averaging 4500 bytes for 1000 numbers, pack remains at 4000 bytes.
Finally, with Zlib, there is a heavy penalty for small lists. Lists of 1 x 5-digits average around 25 bytes; 4 x 5-digits numbers around 45 bytes and 4 x 8-digit numbers around 75 bytes. The corresponding pack'd lists come out at 2/8 for 'S*' and 4/16/16 for 'L*'.
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
| [reply] |
|
Sorry, I should have mentioned that I didn't mean to
use the Base64 module, I was considering writing the
numbers in base 64, so that when I run out of 0-9 I
go through a-z then A-Z, etc, so I use more of the
bits of the characters, and use fewer digits to
represent the same number.
The penalty for compression on small strings is the
dictionary that gets set up for the string. If I use
a fixed dictionary, this shouldn't be a problem. So
I need a customized version of the compression code.
With compression, I shouldn't need to use base 64,
and I'm guessing that I will get about a 60% reduction in
size.
Another optimization I thought of was to change the
numbers to represent the differences between the numbers
instead of the numbers themselves. So instead of
1,10,45,50,120
I would have
1,9,35,5,70
It should work perfectly the first time! - toma | [reply] [d/l] [select] |
|
|