Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a script that fires off a process. This process generates an HTML-formatted report of varying size, which gets stuffed into a scalar for submission to a database. I have a satisfactory strip() procedure that removes unnecessary content prior to submission. What I would like to do is compress this even further if possible, so that the information in the scalar is compressed, submitted, and, upon querying, uncompressed back to its original state.

Intuitively I think of gzip or some such algorithm, but the data is in a scalar and, as far as I know, there is no "file" data type in Postgres. Alternatively, there are several modules on CPAN that operate on scalar strings, such as Compress::LZF, in which case my question is: which is best?

All in all, what would people suggest?
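
To make this concrete, here is roughly what I have in mind (assuming something like Compress::Zlib, whose memGzip/memGunzip operate directly on scalars; Compress::LZF's compress/decompress look similar):

    use Compress::Zlib;

    my $report = '...the stripped HTML report...';   # built earlier by the script

    my $compressed = memGzip($report)
        or die "memGzip failed\n";

    # ... INSERT $compressed into the database, SELECT it back later ...

    my $restored = memGunzip($compressed)
        or die "memGunzip failed\n";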

Replies are listed 'Best First'.
Re: compressing text
by chromatic (Archbishop) on Apr 11, 2003 at 18:50 UTC

    I would suggest not storing this file in the database; I prefer to store a file path there instead.

    What benefits do you get from storing it there? Full-text searching, if your database supports it -- are you using that? An existing permissions model for the data -- this is a good reason. An existing locking model for reads and writes -- this is a good reason.

    What drawbacks does it have? It may be hard to model accurately -- what column type do you need? It's really hard to edit in the database. It's a lot of data to send back and forth.

      The nice thing about having files in the database is that you can set up constraints to keep files from being orphaned. For example, imagine we were implementing SourceForge in Perl and Postgres. SourceForge has to deal with a lot of files, with projects and files being created and deleted all the time. If everything is in the database and you delete a project, a cascaded delete can remove all of that project's files at the same time; and even if you don't set that up, a simple query will tell you which files are orphaned.

      Moreover, your database backups, if done right (check the pg_dump manpage), will include the appropriate version of every file, so you never have to worry about the files getting out of sync with your regular backups. Finally, you don't need a separate access path to manage the files referenced in your database -- if you can get at it with SQL, you can upload, update, and delete the files (subject to database ACLs, of course), with no need to also allow ssh, ftp, or whatever else to manage this stuff. In sum, it's easier to get all the smarts your database has to cover your files if the files are in the database.
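
      As a rough sketch of the cascaded-delete idea (table and column names here are made up, and a bytea column is just one way to hold the file contents):

          use DBI;

          my $dbh = DBI->connect('dbi:Pg:dbname=sf', 'user', 'pass',
                                 { RaiseError => 1, AutoCommit => 1 });

          # Deleting a project takes its files with it
          $dbh->do(q{
              CREATE TABLE project (
                  id   serial PRIMARY KEY,
                  name text   NOT NULL
              )
          });
          $dbh->do(q{
              CREATE TABLE project_file (
                  id         serial  PRIMARY KEY,
                  project_id integer NOT NULL
                             REFERENCES project(id) ON DELETE CASCADE,
                  filename   text    NOT NULL,
                  contents   bytea   NOT NULL
              )
          });

          my $project_id = 42;    # some doomed project
          $dbh->do('DELETE FROM project WHERE id = ?', undef, $project_id);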
        And the reason not to put files in the database is that all of your file-based utilities (rsync, tar, less, grep, mv, vim, etc.) become useless and have to be replaced.

        I think of it like this: the files are the master data, and the database info about them is just a handy index for locating them based on particular attributes. You can delete the index and rebuild it any time, since the master info is really the files themselves.

Re: compressing text
by perrin (Chancellor) on Apr 11, 2003 at 19:10 UTC
    Bzip2 generally compresses better than gzip, at the cost of speed. You can use the Compress::Bzip2 module on scalars.
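
    A rough sketch, assuming a version of Compress::Bzip2 that provides memBzip/memBunzip (older releases expose the same thing as compress/decompress -- check the module's docs):

        use Compress::Bzip2;

        my $report = '...the stripped HTML report...';   # the scalar to be stored

        my $squeezed = memBzip($report)
            or die "bzip2 compression failed\n";

        # ... store $squeezed, fetch it back from the database ...

        my $restored = memBunzip($squeezed)
            or die "bzip2 decompression failed\n";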
Re: compressing text
by Improv (Pilgrim) on Apr 11, 2003 at 18:49 UTC
    You have 2 primary options for data types, BLOBs and stringlikes. For info on using the stringlike types, see here in the postgres documentation. For BLOBs, you'll actually end up using postgres utility functions to store the data in a system table, and you'll store a reference (called an oid) to it there in your normal table(s). I recently went through this, and learned how to do it. You can read up on it, and grab some code fragments, here. The solution was here. Also, check up on the Pg manpage -- it can be helpful.