in reply to Re^5: URL string compression?
in thread URL string compression?

OH, OK. Encode is for AFTER we've compressed to make sure that the compressed filename is made up of only "safe" characters.

OK, but now that compression...

The filename is made up of codes that tell what is in the file. Fortunately, the structure is well defined. The order of the codes is fixed. But they may be of varying lengths and composition.

For example:

2006-asdf-qwerty-123_456_789-

where we know in advance that the filename field separator is a dash, the first code is the year the file was created, the second is a code for the subject, the third is a code for the author, and the fourth is a list of references to chemicals mentioned in the work; this last list can have a varying number of members.
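A minimal sketch of splitting such a filename into its fixed-order fields (Python used for illustration, since the field layout, not the language, is the point; field names are assumed):

```python
# Split a filename of the form YEAR-SUBJECT-AUTHOR-CHEM_CHEM_CHEM-
# into its fixed-order fields (names assumed for illustration).
def parse_filename(name):
    year, subject, author, chemicals = name.rstrip('-').split('-')
    return {
        'year': int(year),
        'subject': subject,
        'author': author,
        'chemicals': chemicals.split('_'),  # varying number of members
    }

print(parse_filename('2006-asdf-qwerty-123_456_789-'))
```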




Forget that fear of gravity,
Get a little savagery in your life.

Replies are listed 'Best First'.
Re^7: URL string compression?
by ikegami (Patriarch) on Feb 14, 2006 at 01:16 UTC

    Can you use a number for the subject code and the author code? If you used the following packing:

    Year: 2 bytes (0..65535)
    subject code: 2 bytes(?) (0..65535)
    author code: 2 bytes(?) (0..65535)
    chemicals: 2 bytes each (0..65535)

    That allows you to have up to 11 chemicals. It's possible to do better (by using field widths that aren't a whole number of bytes), but you lose simplicity and processing efficiency.
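    The arithmetic behind that limit can be checked with a quick sketch (Python for illustration; struct's '>H' is the analogue of Perl's pack template 'n'): three fixed 2-byte fields plus eleven 2-byte chemical slots come to 28 bytes.

```python
import struct

# 3 fixed fields + 11 chemical slots, each an unsigned 16-bit
# big-endian integer ('>H', the analogue of Perl's pack 'n').
fields = 3 + 11
packed = struct.pack('>%dH' % fields, *([0] * fields))
print(len(packed))  # 28 bytes
```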

    If the above is ok, let me know and I'll code it tonight.

    If the above is not sufficient, let me know more precise ranges for the year, the subject codes, the author codes and chemical codes.

    Update: Promised code:

    sub encode_base32 { ... based on encode_base64 ... }
    sub decode_base32 { ... based on decode_base64 ... }

    sub compress_data {
        my ($year, $subject, $author, @chemicals) = @_;
        carp(...) if $year    < 0 || $year    > 65535;
        carp(...) if $subject < 0 || $subject > 65535;
        carp(...) if $author  < 0 || $author  > 65535;
        carp(...) if @chemicals > 11;
        carp(...) if grep { $_ == 65535 } @chemicals;
        push(@chemicals, (65535) x (11 - @chemicals));
        return pack('n*', $year, $subject, $author, @chemicals);
    }

    sub decompress_data {
        my ($data) = @_;
        my ($year, $subject, $author, @chemicals) = unpack('n*', $data);
        @chemicals = grep { $_ != 65535 } @chemicals;
        return ($year, $subject, $author, @chemicals);
    }

    $file_name = encode_base32(compress_data(...));
    (...) = decompress_data(decode_base32($file_name));

    Untested.
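    The scheme above can be sanity-checked with a Python translation (a sketch, not the author's code; the standard base64 module's b32encode stands in for the elided encode_base32). As in the Perl version, the sentinel value 65535 pads unused chemical slots; stripping the '=' padding from the encoded 28 bytes leaves the 45 "safe" characters mentioned later in the thread.

```python
import base64
import struct

SENTINEL = 65535  # pads unused chemical slots, as in the Perl sketch

def compress_data(year, subject, author, chemicals):
    # Range and count checks mirror the carp(...) guards above.
    assert 0 <= year <= 65535 and 0 <= subject <= 65535 and 0 <= author <= 65535
    assert len(chemicals) <= 11 and SENTINEL not in chemicals
    slots = list(chemicals) + [SENTINEL] * (11 - len(chemicals))
    return struct.pack('>14H', year, subject, author, *slots)

def decompress_data(data):
    year, subject, author, *slots = struct.unpack('>14H', data)
    return year, subject, author, [c for c in slots if c != SENTINEL]

data = compress_data(2006, 17, 42, [123, 456, 789])
name = base64.b32encode(data).decode().rstrip('=')
print(len(data), len(name))  # 28 bytes packed, 45 base32 characters

# Round trip: re-pad to a multiple of 8 characters before decoding.
padded = name + '=' * (-len(name) % 8)
assert decompress_data(base64.b32decode(padded)) == (2006, 17, 42, [123, 456, 789])
```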

      Be aware that packing 0 .. 65535 into 2 bytes is going to result in filenames containing all manner of control characters and 8-bit characters that, even if the filesystem accepts them, will be nigh impossible for any user to type.

      You'll probably need to hexify the numbers to make them acceptable for use, with the possible resultant greater length.
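      For comparison (a quick Python check, not from the thread): hexifying the 28 packed bytes costs 56 characters, versus 45 for a base-32 encoding, so the extra length is real but modest.

```python
import base64

data = bytes(28)  # stands in for any 28-byte packed value
hex_name = data.hex()
b32_name = base64.b32encode(data).decode().rstrip('=')
print(len(hex_name), len(b32_name))  # 56 vs 45 characters
```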


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        You should reread the thread. In short, first you compress the data to 28 bytes (at which point "all manner of control characters and 8-bit characters" are perfectly acceptable), then you encode it into 45 safe characters using a base-32 variant of Base64.
Re^7: URL string compression?
by blahblahblah (Priest) on Feb 14, 2006 at 02:59 UTC
    If any of the fields or codes are fixed length, you could get rid of some separators.

    Also, if any of the fields are completely numeric, I think you would get a better result by compressing them separately.