I have an issue with reading a gzipped UTF-8 encoded file.

Here is an example:

Preparation: put an umlaut into a file and gzip it, keeping the uncompressed original as well:

echo ü > umlaut
gzip -k umlaut

Now check the difference :(

perl -e '
  use IO::Uncompress::Gunzip;
  binmode(STDOUT, ":utf8");
  open my $in, "<:utf8", "umlaut";
  $_ = <$in>;
  print "Uncompressed: $_ ", ord($_), "\n";
  my $gin = IO::Uncompress::Gunzip->new("umlaut.gz");
  binmode($gin, ":utf8");
  $_ = <$gin>;
  print "  Compressed: $_ ", ord($_), "\n";
'

Output

Uncompressed: ü
 252
  Compressed: ü
 195

In theory there shouldn't be a difference between the outputs :(
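For anyone wondering where the two numbers come from: 252 is the code point of "ü" (U+00FC), while 195 is the first byte (0xC3) of its two-byte UTF-8 encoding. So the compressed handle is returning undecoded bytes. A minimal sketch using only the core Encode module (the variable names are just for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode);

# The character "ü" is code point U+00FC (decimal 252).
my $char = "\x{FC}";

# Its UTF-8 encoding is the two bytes 0xC3 0xBC (decimal 195 188).
my $bytes = encode("UTF-8", $char);

printf "decoded character: ord = %d\n", ord($char);
printf "raw UTF-8 bytes:   %s\n", join " ", map { ord($_) } split //, $bytes;
printf "first raw byte:    ord = %d\n", ord($bytes);
```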

Update: I learned that "binmode" won't do anything to the IO::Uncompress::Gunzip filehandle.

Handling the decode myself, not relying on an IO-layer, gives the expected result:

perl -e '
  use IO::Uncompress::Gunzip;
  use Encode;
  binmode(STDOUT, ":utf8");
  my $gin = IO::Uncompress::Gunzip->new("umlaut.gz");
  $_ = <$gin>;
  $_ = Encode::decode("UTF-8", $_);
  print "  Compressed: $_ ", ord($_), "\n";
'
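The same manual decode can be applied to every line of a file. What follows is only a sketch under a few assumptions: it builds its own small test file in a temporary directory with IO::Compress::Gzip so that it runs on its own, and the file name and sample text are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Build a small gzipped UTF-8 test file (hypothetical name) in a temp dir.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/umlaut.gz";
my $raw  = encode("UTF-8", "\x{FC}\n\x{E4}\x{F6}\n");
gzip(\$raw => $file) or die "gzip failed: $GzipError\n";

# The Gunzip handle returns raw bytes, so decode each line by hand.
my $gin = IO::Uncompress::Gunzip->new($file)
    or die "Can't read $file: $GunzipError\n";

my @lines;
while (defined(my $line = <$gin>)) {
    push @lines, decode("UTF-8", $line);
}
$gin->close;

binmode(STDOUT, ":encoding(UTF-8)");
for my $line (@lines) {
    (my $text = $line) =~ s/\n\z//;    # strip the newline for display only
    printf "%s: %d character(s), first ord %d\n",
        $text, length($text), ord($text);
}
```

Splitting on "\n" before decoding should be safe for UTF-8, because the byte 0x0A never occurs inside a multi-byte sequence.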

Update: As suggested by Corion, I'm now using PerlIO::gzip. My original code, not the test example shown here, now is:

my $encoding = ":utf8";
if ( $filename =~ /\.gz$/ ) {
    $encoding = ":gzip$encoding";
}
open $in, "<$encoding", $filename
    or die "Can't read $filename: $!\n";
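The layer selection above could also be pulled into a small helper. This is a sketch, not what my real code does: read_mode_for is a hypothetical name, and it uses ":encoding(UTF-8)" instead of ":utf8" on the assumption that validating the input is wanted. The ":gzip" layer itself comes from PerlIO::gzip (a CPAN module, not core), so the sketch only builds the mode string and does not open anything.

```perl
use strict;
use warnings;

# Hypothetical helper: choose the open() mode from the file name.
# The ":gzip" layer is provided by PerlIO::gzip, which must be installed
# before the returned mode can be used with open() on a .gz file.
sub read_mode_for {
    my ($filename) = @_;
    my $mode = "<";
    $mode .= ":gzip" if $filename =~ /\.gz\z/;
    return $mode . ":encoding(UTF-8)";
}

print read_mode_for("umlaut"),    "\n";
print read_mode_for("umlaut.gz"), "\n";
```

It would then be used as open my $in, read_mode_for($filename), $filename or die ...; just like the snippet above.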


In reply to issue with reading IO::Uncompress:Gunzip and utf-8 by Skeeve
