PerlMonks  

Lost in compressed encodings

by Skeeve (Parson)
on Apr 06, 2020 at 08:22 UTC ( [id://11115120]=perlquestion )

Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

This is a follow-up to my Lost in encodings question.

Thanks to the answers given there I was able to solve that issue, but now I'm stuck again: I need to read gzip-compressed UTF-8 files, and I do not know how to convince perl to read them as UTF-8.

My code for opening the files is:

open my $in, '<:utf8', $filename or die "Can't read $filename: $!\n";
if ($filename=~/\.gz$/) {
    $in= new IO::Uncompress::Gunzip $in, { AutoClose => 1 };
}

Reading uncompressed data works fine, as I could verify with the help given in my previous thread by setting the debugger to UTF-8: the umlaut "ü" is correctly displayed as "ü". When reading the same data from a compressed file, the "ü" is displayed as "Ã¼".

How can I make perl treat the compressed data as UTF-8?


s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Replies are listed 'Best First'.
Re: Lost in compressed encodings
by Corion (Patriarch) on Apr 06, 2020 at 08:41 UTC

    The order of decompressing and decoding matters. You want to first uncompress and then decode. If you want to cheat, you can use PerlIO::gzip:

    my $in;
    my $open_mode = '<:raw';
    if ($filename=~/\.gz$/) {
        $open_mode .= ':gzip';
    }
    $open_mode .= ':utf8';
    open $in, $open_mode, $filename or die "Can't read $filename: $!\n";

    If you want to stay with IO::Uncompress::Gunzip, I think the following should work, but I don't know if ->binmode() also applies other encodings properly:

    my $in;
    if ($filename=~/\.gz$/) {
        # Pass the filename here: $in is still undefined at this point
        $in = IO::Uncompress::Gunzip->new($filename, { AutoClose => 1 });
    } else {
        open $in, '<:raw', $filename or die "Can't read $filename: $!\n";
    };
    binmode $in, ':utf8';

      Thanks, Corion. I already suspected that the order was somehow the issue.

      Unfortunately, calling binmode after IO::Uncompress::Gunzip did not help.

      My changed code:

      open my $in, '<:raw', $filename or die "Can't read $filename: $!\n";
      if ($filename=~/\.gz$/) {
          $in= new IO::Uncompress::Gunzip $in, { AutoClose => 1 };
      }
      binmode $in, ':utf8';

      It still works with uncompressed data, but not with compressed data.

      Seems I will have to manually decode each line…
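
      Decoding each line by hand could look roughly like the sketch below. It uses only core modules (Encode, IO::Compress::Gzip, IO::Uncompress::Gunzip) and builds its gzip sample in memory as a stand-in for a real file; the sample text and variable names are assumptions made for illustration.

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode encode);
      use IO::Compress::Gzip qw(gzip $GzipError);
      use IO::Uncompress::Gunzip qw($GunzipError);

      # Build a small gzip-compressed UTF-8 sample in memory
      # (a stand-in for a real .gz file).
      my $text  = "Gr\x{fc}\x{df}e\nM\x{fc}nchen\n";   # characters, not bytes
      my $bytes = encode('UTF-8', $text);
      gzip(\$bytes => \my $gz) or die "gzip failed: $GzipError\n";

      # Gunzip hands back raw bytes, so decode every line explicitly.
      my $z = IO::Uncompress::Gunzip->new(\$gz)
          or die "gunzip failed: $GunzipError\n";
      my @lines;
      while (defined(my $raw = $z->getline)) {
          push @lines, decode('UTF-8', $raw);
      }
      $z->close;
      ```

      After the loop, @lines holds character strings, so length and regex matching count "ü" as one character.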


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

        I think that IO::Uncompress::Gunzip only understands ->binmode() and not ->binmode(':utf8'). The documentation (now that I read it ...) even says:

        This is a noop provided for completeness.

        If you are able to install PerlIO::gzip, that one should work with stacking other decoding mechanisms on top of it.

        If you have a gzip binary available, you can use that to decompress:

        my $in;
        if( $filename =~ /\.gz$/ ) {
            # List-form pipe open: no shell involved, so no quoting issues
            open $in, '-|', 'gzip', '-cd', $filename
                or die "Can't read from gzip $filename: $!/$?";
        } else {
            open $in, '<:raw', $filename or die "Can't read $filename: $!\n";
        };
        binmode $in, ':utf8';
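
        If neither PerlIO::gzip nor a gzip binary is available, a core-only alternative might be to decompress into memory first and then open a decoding handle on that buffer. This is only a sketch: the helper name open_utf8 and the temporary demo file are assumptions, and it holds the whole decompressed file in memory.

        ```perl
        use strict;
        use warnings;
        use File::Temp qw(tempfile);
        use IO::Compress::Gzip qw(gzip $GzipError);
        use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

        # Hypothetical helper: returns a handle that yields decoded characters.
        sub open_utf8 {
            my ($filename) = @_;
            if ($filename =~ /\.gz$/) {
                # Decompress the whole file into a byte buffer first ...
                gunzip($filename => \my $buffer)
                    or die "Can't gunzip $filename: $GunzipError\n";
                # ... then decode while reading from the in-memory handle.
                open my $fh, '<:encoding(UTF-8)', \$buffer
                    or die "Can't open buffer for $filename: $!\n";
                return $fh;
            }
            open my $fh, '<:encoding(UTF-8)', $filename
                or die "Can't read $filename: $!\n";
            return $fh;
        }

        # Demo with a temporary .gz file holding the UTF-8 bytes of "Grüße\n".
        my ($tmp, $path) = tempfile(SUFFIX => '.gz', UNLINK => 1);
        close $tmp;
        my $bytes = "Gr\xC3\xBC\xC3\x9Fe\n";
        gzip(\$bytes => $path) or die "gzip failed: $GzipError\n";

        my $in   = open_utf8($path);
        my $line = <$in>;
        close $in;
        ```

        Decompression happens before decoding here, which is the order that matters; for very large files a streaming approach would be preferable.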
Re: Lost in compressed encodings
by jo37 (Deacon) on Apr 06, 2020 at 16:23 UTC

    You are overwriting the file handle $in with the one newly created by IO::Uncompress::Gunzip. I can't tell what the consequences are, but it does not look sane to me.
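
    A sketch of how the two handles could be kept in separate variables. The helper name open_raw and the temporary demo file are made up for illustration, and the returned handle still yields undecoded bytes.

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);
    use IO::Compress::Gzip qw(gzip $GzipError);
    use IO::Uncompress::Gunzip qw($GunzipError);

    # Hypothetical helper: returns a handle yielding raw (undecoded) bytes.
    sub open_raw {
        my ($filename) = @_;
        open my $raw, '<:raw', $filename
            or die "Can't read $filename: $!\n";
        return $raw unless $filename =~ /\.gz$/;
        # $raw is not overwritten; the decompressor gets its own variable.
        my $z = IO::Uncompress::Gunzip->new($raw, AutoClose => 1)
            or die "Can't gunzip $filename: $GunzipError\n";
        return $z;
    }

    # Demo with a temporary .gz file.
    my ($tmp, $path) = tempfile(SUFFIX => '.gz', UNLINK => 1);
    close $tmp;
    my $payload = "line one\nline two\n";
    gzip(\$payload => $path) or die "gzip failed: $GzipError\n";

    my $in = open_raw($path);
    my @lines;
    while (defined(my $line = <$in>)) {
        push @lines, $line;
    }
    close $in;
    ```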

    Greetings,
    -jo

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Re: Lost in compressed encodings
by Anonymous Monk on Apr 07, 2020 at 16:46 UTC

    Nitpick: you should probably be using :encoding(utf-8) rather than just :utf8. See perlunifaq for the details.
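
    A small sketch of the stricter variant, using an in-memory handle so it runs without any files. The sample byte strings are assumptions chosen for illustration; the second half calls Encode directly with FB_CROAK to show that the strict UTF-8 decoder rejects malformed input.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode FB_CROAK);

    # Valid UTF-8 bytes for "ü\n", read through the strict encoding layer.
    my $good = "\xC3\xBC\n";
    open my $fh, '<:encoding(UTF-8)', \$good
        or die "Can't open in-memory handle: $!\n";
    my $line = <$fh>;
    close $fh;
    chomp $line;

    # The strict decoder, called directly, dies on malformed input.
    my $bad = "\xFF\xFE";
    my $ok  = eval { decode('UTF-8', $bad, FB_CROAK); 1 };
    ```

    Here $line ends up as the single character U+00FC, and $ok stays false because decode() dies on the invalid bytes.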
