IO::Compress::Gzip and unicode

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:


use strict;
use FileHandle;
use IO::Compress::Gzip;

my $unicode_string = "Smiley Face: \x{263A}\n";

writefile({
    'filename'  => "/tmp/out",
    'gzip'      => 0,
    'data'      => $unicode_string,
});

writefile({
    'filename'  => "/tmp/out.gz",
    'gzip'      => 1,
    'data'      => $unicode_string,
});

sub writefile {
    my($opts) = @_;

    my $fh = ($opts->{'gzip'}) ?
        IO::Compress::Gzip->new(
            FileHandle->new("> $opts->{'filename'}"),
        ) :
        FileHandle->new("> $opts->{'filename'}");

    binmode($fh, ':utf8');

    print $fh $opts->{'data'};
    $fh->close;
}

__DATA__

The first subroutine call succeeds and produces a /tmp/out file with the expected content.

The second subroutine call fails with the message:
Wide character in IO::Compress::Gzip::write: at <program_name> line 29.

Line 29 is the 'print' statement.

Documentation suggests IO::Compress::Gzip::binmode is a no-op.

Using "Encode::decode_utf8($opts->{'data'});" doesn't work either.

The second subroutine call does produce a valid, correctly compressed /tmp/out.gz file, but only as long as $unicode_string doesn't actually contain any wide (non-Latin-1) characters.

How do I make this work? I'd much prefer to compress in perl rather than gzip files after writing them, as the real-world code with the issue demonstrated by this minimum-reproducible test case deals with large data volumes and performance is a concern.

2018-03-03 Athanasius removed the question text from the main code block and added paragraph tags


Replies are listed 'Best First'.
Re: IO::Compress::Gzip and unicode by salva (Canon) on Mar 02, 2018 at 08:49 UTC
You can use PerlIO layers to do that. PerlIO::via::gzip provides on-the-fly data compression (and it uses IO::Compress::Gzip under the hood).

    # untested!
    sub writefile {
        my($opts) = @_;

        open my $fh, '>', $opts->{filename} or die $!;
        binmode($fh, ':via(gzip)') if $opts->{'gzip'};
        binmode($fh, ':utf8');

        print $fh $opts->{'data'};
        $fh->close;
    }
Re: IO::Compress::Gzip and unicode by Corion (Patriarch) on Mar 02, 2018 at 08:35 UTC
The easy approach would be to use Encode::encode to convert your string to octets before writing it to the file:

    use Encode qw(encode);

    my $unicode_string = "Smiley Face: \x{263A}\n";
    my $bytes = encode('UTF-8', $unicode_string);
    binmode $fh, ':raw';
    print {$fh} $bytes;

But I think that the `binmode ':utf8'` should already do that. Maybe there is a difference between `:utf8` and `:encoding(UTF-8)`, so maybe try this in your code instead:

    binmode $fh, ':encoding(UTF-8)';

But as you have already looked at the documentation of IO::Compress::Gzip and it doesn't have a proper `binmode` implementation, I fear you will need to do the encoding yourself.
Re: IO::Compress::Gzip and unicode by pmqs (Friar) on Mar 02, 2018 at 23:35 UTC
As things stand, you need to explicitly encode the data to UTF-8. To do that, you need to use `encode_utf8` rather than `decode_utf8`. Change this line

    print $fh $opts->{'data'};

to this

    print $fh Encode::encode_utf8($opts->{'data'});
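
For anyone landing here later, this is a minimal sketch of how the `encode_utf8` suggestion might fit into the original `writefile` routine. It is not the original poster's final code. It assumes the data is encoded once up front, so both the plain and the gzip branch receive the same octets and no `:utf8` layer is needed. The temporary directory, the `out.gz` file name and the read-back check with IO::Uncompress::Gunzip are illustrative additions.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);
use File::Temp qw(tempdir);
use IO::Compress::Gzip qw($GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

sub writefile {
    my ($opts) = @_;

    my $fh;
    if ($opts->{gzip}) {
        # IO::Compress::Gzip accepts a file name directly.
        $fh = IO::Compress::Gzip->new($opts->{filename})
            or die "Cannot open $opts->{filename}: $GzipError";
    }
    else {
        # Raw handle: we write already-encoded octets ourselves.
        open $fh, '>:raw', $opts->{filename}
            or die "Cannot open $opts->{filename}: $!";
    }

    # Encode characters to UTF-8 octets before they reach the handle.
    print {$fh} encode_utf8($opts->{data});
    close $fh or die "Cannot close $opts->{filename}: $!";
}

# Illustrative round trip in a scratch directory (hypothetical path).
my $dir            = tempdir(CLEANUP => 1);
my $gz_path        = "$dir/out.gz";
my $unicode_string = "Smiley Face: \x{263A}\n";

writefile({ filename => $gz_path, gzip => 1, data => $unicode_string });

# Decompress and decode to confirm the characters survived.
my $octets;
gunzip($gz_path => \$octets) or die "gunzip failed: $GunzipError";
my $round_trip = decode_utf8($octets);
```

Encoding before the `print` keeps the compression inside Perl, which is what the question asked for, and avoids relying on `binmode`, which the thread notes IO::Compress::Gzip does not implement.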