in reply to Can't decode ill-formed UTF-8 octet sequence

How do I read such malformed strings in perl with unicode support enabled without having it dying?

If you don't like how the decoding layer handles errors, you are free to perform the decoding yourself. The third argument of Encode's decode controls how it behaves on error.
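As a rough sketch of what that third argument does (the input bytes are made up for illustration, not taken from the torrent in question), here is the same ill-formed input decoded twice: once with FB_DEFAULT, which substitutes U+FFFD, and once with FB_CROAK, which dies. LEAVE_SRC is added so that decode leaves the source string alone.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical input: "ab", one byte that can't appear in UTF-8 (0xFF), then "cd".
my $bytes = "ab\xFFcd";

# FB_DEFAULT: replace the malformed byte with U+FFFD and carry on.
my $lenient = decode( 'UTF-8', $bytes, Encode::FB_DEFAULT | Encode::LEAVE_SRC );

# FB_CROAK: die on the first malformed sequence. Trap it with eval.
my $strict = eval { decode( 'UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC ) };
my $died   = !defined $strict;

printf "%vX\n", $lenient;                  # 61.62.FFFD.63.64
print $died ? "strict decode died\n" : "strict decode succeeded\n";
```

Which behaviour you want depends on whether silently losing the bad bytes is acceptable for your data.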

But the solution to the problem is to avoid generating the garbage in the first place.

$_->{info}{name} is a string consisting of the characters 43.72.79.73.69.73.AE (hex). It's apparently a string of decoded text (a string of Unicode Code Points).

But file handles can only transmit bytes. You need to encode the Unicode Code Points into bytes to write them to a file handle.
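To make the distinction concrete, here is a small sketch that builds a string from the code points listed above and encodes it explicitly with Encode's encode. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode );

# The code points quoted above: 43.72.79.73.69.73 ("Crysis") followed by U+00AE.
my $name = "Crysis\x{AE}";

# Encoding turns each code point into one or more bytes.
my $bytes = encode( 'UTF-8', $name );

printf "%vX\n", $name;    # 43.72.79.73.69.73.AE     (7 code points)
printf "%vX\n", $bytes;   # 43.72.79.73.69.73.C2.AE  (8 bytes)
```

The lone AE of the unencoded string is not a valid UTF-8 sequence on its own. The encoded form carries it as C2.AE.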

One way of doing this is to add an encoding layer to the file handle.

perl -gne'
   use v5.36;
   use utf8::all;
   use Bencode qw( bdecode );
   say bdecode( $_ )->{info}{name};
'

Re^2: Can't decode ill-formed UTF-8 octet sequence
by ikegami (Patriarch) on Jul 17, 2024 at 15:31 UTC

    The above is incorrect, as it "decodes" the torrent file too. You want binary mode (:raw) for that, and an encoding layer (:encoding(UTF-8)) on the output.
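A minimal sketch of that combination on explicitly opened handles, using in-memory files and made-up data in place of a real torrent so that it can be run anywhere:

```perl
use strict;
use warnings;

# Hypothetical stand-in for file contents: "Crysis" plus the single byte 0xAE.
my $file_bytes = "Crysis\xAE";

# Input side: :raw hands the bytes back untouched, one character per byte.
open( my $in, '<:raw', \$file_bytes ) or die "open in: $!";
my $data = do { local $/; <$in> };
close $in;

# Output side: the encoding layer turns each character into UTF-8 bytes.
my $out_bytes = '';
open( my $out, '>:encoding(UTF-8)', \$out_bytes ) or die "open out: $!";
print {$out} $data;
close $out;

printf "%vX\n", $out_bytes;   # 43.72.79.73.69.73.C2.AE
```

With real files, the same two layers would go on the open calls (or binmode) for the torrent and for STDOUT.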

    Unfortunately, while we can set a default encoding layer for the files read via ARGV, there's no way to make ARGV use binary mode, so the code gets messy.

    perl -e'
       use v5.36;
       use utf8::all;
       use Bencode qw( bdecode );
       use File::Slurper qw( read_binary );

       binmode STDIN;

       sub process_torrent {
          say bdecode( $_[0] )->{info}{name};
       }

       if ( @ARGV ) {
          process_torrent( read_binary( $_ ) ) for @ARGV;
       } else {
          process_torrent( do { local $/; <STDIN> } );
       }
    '

    Outside of Windows, where the default :crlf layer would otherwise alter the data, you can probably get away with not using binary mode.

    perl -gne'
       use v5.36;
       use Bencode qw( bdecode );
       binmode STDOUT, ":encoding( UTF-8 )";
       binmode STDERR, ":encoding( UTF-8 )";
       say bdecode( $_ )->{info}{name};
    '