in reply to Re^4: URL string compression?
in thread URL string compression?

Sorry, I wasn't clear.

You have two problems. The first is compression. The second is encoding the compressed result into safe characters. I was addressing the latter problem.

If you use my suggested encoding method, you first need to compress your data down to 28 bytes. Base32 will convert your compressed data into 45 (ceil(28*(8/log2(32)))) safe characters.
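The length arithmetic above can be checked with a few lines of Perl. This is only a sketch: `base32_length` is a hypothetical helper name, and it counts data characters only, ignoring any `=` padding a Base32 library might append.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Number of Base32 characters needed for $bytes bytes of data.
# Each Base32 character carries log2(32) = 5 bits.
sub base32_length {
    my ($bytes) = @_;
    return ceil($bytes * 8 / 5);
}

printf "%d bytes -> %d characters\n", 28, base32_length(28);
```

Run the other way, the same relation says that a 45-character budget leaves room for at most 28 bytes of compressed data.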

What information is contained in the original file name? Consistently compressing by 78% (1 - 28/125) will be hard, and will only be possible with intimate knowledge of the data to compress.

Re^6: URL string compression?
by punch_card_don (Curate) on Feb 14, 2006 at 00:42 UTC
    OH, OK. Encode is for AFTER we've compressed to make sure that the compressed filename is made up of only "safe" characters.

    OK, but now that compression...

    The filename is made up of codes that tell what is in the file. Fortunately, the structure is well defined. The order of the codes is fixed. But they may be of varying lengths and composition.

    For example:

    2006-asdf-qwerty-123_456_789-

    where we know in advance that the filename field separator is a dash, the first code is the year the file was created, the second is a code for the subject, the third is a code for the author, and the fourth will be references to chemicals mentioned in the work, but this list can have a varying number of members. For example.
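A minimal sketch of taking such a name apart, assuming exactly four dash-separated fields and an underscore between chemical references, as in the sample name above. `parse_file_name` is a hypothetical helper, not something from the thread.

```perl
use strict;
use warnings;

# Split a name such as "2006-asdf-qwerty-123_456_789-" into its fields.
# Assumes four dash-separated fields, the last being an
# underscore-separated list of chemical references (possibly empty).
sub parse_file_name {
    my ($name) = @_;
    my ($year, $subject, $author, $chem_list) = split /-/, $name;
    my @chemicals = defined $chem_list ? split(/_/, $chem_list) : ();
    return ($year, $subject, $author, @chemicals);
}

my ($year, $subject, $author, @chemicals) =
    parse_file_name('2006-asdf-qwerty-123_456_789-');
print "year=$year subject=$subject author=$author chemicals=@chemicals\n";
```

Having the fields as separate values is what makes a field-by-field compression scheme possible.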




    Forget that fear of gravity,
    Get a little savagery in your life.

      Can you use a number for the subject code and the author code? If you used the following packing:

      Year:         2 bytes    (0..65535)
      Subject code: 2 bytes(?) (0..65535)
      Author code:  2 bytes(?) (0..65535)
      Chemicals:    2 bytes per chemical (0..65535)

      That allows you to have up to 11 chemicals: of the 28 bytes, 6 go to the first three fields, leaving 22 bytes at 2 bytes per chemical. It's possible to do better (by using field widths that aren't a whole number of bytes), but you lose efficiency.
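To illustrate the "do better" remark, here is a rough sketch of packing fields whose widths are not whole bytes. The bit widths in the example (12 bits for the year, 10 bits for every other code) are assumptions made up for illustration, and `pack_bits` / `unpack_bits` are hypothetical helpers.

```perl
use strict;
use warnings;

# Pack (value, width-in-bits) pairs into as few bytes as possible.
sub pack_bits {
    my @pairs = @_;
    my $bits = '';
    while (my ($value, $width) = splice @pairs, 0, 2) {
        die "value $value does not fit in $width bits" if $value >= 2 ** $width;
        $bits .= sprintf '%0*b', $width, $value;
    }
    return pack 'B*', $bits;    # the last byte is padded with zero bits
}

# Reverse of pack_bits: the caller must supply the same widths, in order.
sub unpack_bits {
    my ($data, @widths) = @_;
    my $bits = unpack 'B*', $data;
    # 4-argument substr removes and returns the leading $_ bits each time.
    return map { oct('0b' . substr($bits, 0, $_, '')) } @widths;
}

# Assumed widths: year 12 bits, subject 10, author 10, three chemicals at 10.
my $packed = pack_bits(2006, 12, 17, 10, 42, 10, 123, 10, 456, 10, 789, 10);
my @values = unpack_bits($packed, 12, 10, 10, 10, 10, 10);
printf "%d bytes: @values\n", length $packed;
```

With these assumed widths the six values take 62 bits, so 8 bytes instead of the 12 that two bytes per field would need. The price is the extra string handling on every encode and decode.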

      If the above is ok, let me know and I'll code it tonight.

      If the above is not sufficient, let me know more precise ranges for the year, the subject codes, the author codes and chemical codes.

      Update: Promised code:

      use Carp;

      sub encode_base32 { ... based on encode_base64 ... }
      sub decode_base32 { ... based on decode_base64 ... }

      sub compress_data {
          my ($year, $subject, $author, @chemicals) = @_;
          carp(...) if $year < 0 || $year > 65535;
          carp(...) if $subject < 0 || $subject > 65535;
          carp(...) if $author < 0 || $author > 65535;
          carp(...) if @chemicals > 11;
          # 65535 is reserved as the "empty slot" marker, so it is not a valid code.
          carp(...) if grep { $_ == 65535 } @chemicals;
          # Pad the chemical list to exactly 11 entries.
          push(@chemicals, (65535)x(11-@chemicals));
          # 'n*' packs every value as a 16-bit big-endian unsigned integer.
          return pack('n*', $year, $subject, $author, @chemicals);
      }

      sub decompress_data {
          my ($data) = @_;
          my ($year, $subject, $author, @chemicals) = unpack('n*', $data);
          # Drop the padding entries added by compress_data.
          @chemicals = grep { $_ != 65535 } @chemicals;
          return ($year, $subject, $author, @chemicals);
      }

      $file_name = encode_base32(compress_data(...));
      (...) = decompress_data(decode_base32($file_name));

      Untested.
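For anyone who wants something to run, below is one possible way to fill in the elided Base32 routines and exercise the round trip. It is a sketch under assumptions, not the code the post had in mind: the Base32 functions use the RFC 4648 alphabet without `=` padding and are built on bit strings rather than on encode_base64, invalid input dies via `croak` instead of warning via `carp`, and the numeric subject and author codes (17 and 42) are invented sample values.

```perl
use strict;
use warnings;
use Carp qw(croak);

# RFC 4648 Base32 alphabet; '=' padding is omitted in this sketch.
my @ALPHABET = ('A' .. 'Z', '2' .. '7');
my %VALUE_OF;
@VALUE_OF{@ALPHABET} = (0 .. 31);

my $PAD           = 65535;   # marks an unused chemical slot
my $MAX_CHEMICALS = 11;      # (28 bytes - 3 * 2 bytes) / 2 bytes each

sub encode_base32 {
    my ($data) = @_;
    my $bits = unpack 'B*', $data;
    # Pad with zero bits to a multiple of 5 so every group is complete.
    $bits .= '0' x ((5 - length($bits) % 5) % 5);
    return join '', map { $ALPHABET[oct("0b$_")] } $bits =~ /(.{5})/g;
}

sub decode_base32 {
    my ($text) = @_;
    my $bits = '';
    for my $char (split //, uc $text) {
        croak "invalid Base32 character '$char'" unless exists $VALUE_OF{$char};
        $bits .= sprintf '%05b', $VALUE_OF{$char};
    }
    # Drop the trailing pad bits that do not fill a whole byte.
    $bits = substr $bits, 0, length($bits) - length($bits) % 8;
    return pack 'B*', $bits;
}

sub compress_data {
    my ($year, $subject, $author, @chemicals) = @_;
    for my $value ($year, $subject, $author) {
        croak "value out of range: $value" if $value < 0 || $value > 65535;
    }
    croak "too many chemicals" if @chemicals > $MAX_CHEMICALS;
    croak "chemical code out of range"
        if grep { $_ < 0 || $_ >= $PAD } @chemicals;
    # Always emit 14 big-endian 16-bit values, i.e. 28 bytes.
    return pack 'n*', $year, $subject, $author,
        @chemicals, ($PAD) x ($MAX_CHEMICALS - @chemicals);
}

sub decompress_data {
    my ($data) = @_;
    my ($year, $subject, $author, @chemicals) = unpack 'n*', $data;
    return ($year, $subject, $author, grep { $_ != $PAD } @chemicals);
}

my $file_name = encode_base32(compress_data(2006, 17, 42, 123, 456, 789));
my @fields    = decompress_data(decode_base32($file_name));
print length($file_name), " characters: $file_name\n";
print "@fields\n";
```

Because the chemical list is always padded to 11 slots, every name comes out at the same 45 characters regardless of how many chemicals are present.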

        Be aware that packing 0 .. 65535 into 2 bytes is going to result in filenames containing all manner of control characters and 8-bit characters that, even if the filesystem accepts them, will be nigh impossible for any user to type.

        You'll probably need to hexify the numbers to make them acceptable for use, with the possible resultant greater length.
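To put a number on the "greater length", here is a small sketch comparing the raw packed size with its hex form. The sample values are arbitrary; only the lengths matter.

```perl
use strict;
use warnings;

# 28 bytes of packed data: fourteen 16-bit big-endian values.
my $packed = pack 'n*', 2006, (65535) x 13;

# Hex encoding uses two characters (0-9, a-f) per byte.
my $hex = unpack 'H*', $packed;

printf "packed: %d bytes, hex: %d characters\n",
    length($packed), length($hex);
```

Hex turns 28 bytes into 56 characters, against the 45 characters quoted for Base32 earlier in the thread. Both alphabets are safe to type.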


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      If any of the fields or codes are fixed length, you could get rid of some separators.

      Also, if any of the fields are completely numeric, I think you would get a better result by compressing them separately.
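As a sketch of the fixed-length suggestion: if the year and the subject code were always 4 characters each (an assumption made only for this example), the dashes after them could be dropped and the fields recovered by position. `split_fixed` is a hypothetical helper.

```perl
use strict;
use warnings;

# Assumption for illustration: year and subject are always 4 characters,
# so only the variable-length author still needs a dash after it.
sub split_fixed {
    my ($name) = @_;
    my ($year, $subject, $rest) = unpack 'A4 A4 A*', $name;
    my ($author, $chem_list)    = split /-/, $rest, 2;
    return ($year, $subject, $author, split /_/, $chem_list // '');
}

my @fields = split_fixed('2006asdfqwerty-123_456_789');
print "@fields\n";
```

This saves one character per removed separator, which is modest next to what numeric packing of the fields would gain.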