I would assume that you should explicitly encode your text to UTF-8 when compressing and explicitly decode from UTF-8 when decompressing. IO::Compress::Gzip most likely works only on octets and expects octets. I wonder why it doesn't scream bloody murder...
Also, maybe you need to explicitly set sqlite_unicode if you are reading/storing UTF-8 data in SQLite.
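For reference, a minimal sketch of what that flag looks like in the connect call ($dbfile is a placeholder, not something from your post):

use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", "", "", {
    RaiseError     => 1,
    sqlite_unicode => 1,   # decode TEXT columns to Perl character strings
});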
Thanks. I did have sqlite_unicode => 1, and the db worked with non-ASCII text if I didn't try to compress it.
This seems to fix the problem:
use Encode qw(encode decode);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

sub compressor {
    my $in = shift;
    # Encode the character string to UTF-8 octets before compressing.
    $in = encode('utf8', $in);
    my $out;
    gzip \$in => \$out
        or die "gzip failed: $GzipError";
    return $out;
}

sub uncompressor {
    my $in = shift;
    my $out;
    gunzip \$in => \$out
        or die "gunzip failed: $GunzipError";
    # Decode the octets back to a Perl character string.
    return decode('utf8', $out);
}
I tested it with some real-life sample data and the compression isn't doing too well: the source data is a 9.4MB text file that compresses down to a 2.4MB zip file. When I import it without compression, I get a 19.7MB db file. With compression, the db file is 17.0MB. That's a little smaller than the uncompressed db, but not by enough to make it worth it. I was hoping for something in the 10MB range (~50% compression). I imagine it's because each string is compressed separately, so repetition across strings can't be exploited during compression. Is this a lost battle? If not, I would be grateful for suggestions on a better algorithm.
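One idea that might address exactly this: zlib supports a preset dictionary, i.e. up to 32KB of representative text that every stream can match against, which gives the compressor the shared context that per-row gzip is missing. Below is a minimal sketch using Compress::Raw::Zlib (the low-level module the IO::Compress family builds on, not something from the posts above); $dict is a hypothetical string of substrings that occur frequently in your data, and the exact same dictionary must be supplied when inflating:

use strict;
use warnings;
use Encode qw(encode decode);
use Compress::Raw::Zlib;   # exports Z_OK, Z_STREAM_END, etc.

# Hypothetical: up to 32KB of text that is representative of your rows.
my $dict = '... common words and phrases from your corpus ...';

sub compress_with_dict {
    my $octets = encode('utf8', shift);
    my ($d, $status) = Compress::Raw::Zlib::Deflate->new(
        -Dictionary   => $dict,
        -AppendOutput => 1,
    );
    die "deflate init failed: $status" unless $status == Z_OK;
    my $out = '';
    $d->deflate($octets, $out) == Z_OK or die 'deflate failed';
    $d->flush($out)            == Z_OK or die 'flush failed';
    return $out;
}

sub uncompress_with_dict {
    my $in = shift;
    my ($i, $status) = Compress::Raw::Zlib::Inflate->new(
        -Dictionary   => $dict,
        -AppendOutput => 1,
    );
    die "inflate init failed: $status" unless $status == Z_OK;
    my $out = '';
    $status = $i->inflate($in, $out);
    $status == Z_STREAM_END or die "inflate failed: $status";
    return decode('utf8', $out);
}

This is a sketch, not a drop-in fix: how much it helps depends entirely on how well $dict matches your actual data.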
Greetings, elef
Have you tried any of the other compression formats IO::Compress offers? My personal experience when creating archives seems to indicate that the xz algorithm provides better results more often than not, and I notice IO::Compress also offers IO::Compress::Xz. Of course, all the algorithms give different results depending on the type of input data, but I thought it worth mentioning.
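For what it's worth, here is a minimal sketch of the same pair of subs using xz instead of gzip, assuming IO::Compress::Xz and IO::Uncompress::UnXz (from the IO-Compress-Lzma distribution) are installed:

use Encode qw(encode decode);
use IO::Compress::Xz qw(xz $XzError);
use IO::Uncompress::UnXz qw(unxz $UnXzError);

sub compressor {
    my $in = encode('utf8', shift);
    my $out;
    xz \$in => \$out
        or die "xz failed: $XzError";
    return $out;
}

sub uncompressor {
    my $in = shift;
    my $out;
    unxz \$in => \$out
        or die "unxz failed: $UnXzError";
    return decode('utf8', $out);
}

One caveat: xz streams carry more per-stream header overhead than gzip, so on very short strings they can come out larger; xz tends to win on bigger inputs.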
Best Wishes.
--Chris
#!/usr/bin/perl -Tw
use Perl::Always or die;
my $perl_version = (5.12.5);
print $perl_version;