In the original program output, every character (including the 'centered dot', chr(0xb7)) is encoded as a single byte, except the hyphen-like character you ask about, which is encoded as 3 bytes: e2 80 93.
To me, that suggests the output is utf-8. Update: Corion points out that text containing both single bytes > 0x7f and 3-byte chars isn't utf-anything, but rather a mixed(-up) encoding.
I suspect that the 'wrongness' the OP perceives when he treats the Perl input stream as utf-8 and writes his output file as utf-8 has more to do with how he subsequently inspects that output than with Perl's handling of the data, but I am insufficiently versed in the subject to confirm that suspicion.
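To make the point about mixed encodings concrete, here is a minimal sketch that checks whether a byte string is well-formed UTF-8 using Encode's strict decoding. The sample byte strings and the helper name is_valid_utf8 are made up for illustration; they are not actual dumptorrent output.

```perl
use strict;
use warnings;
use Encode ();

# Returns 1 if the byte string is well-formed UTF-8, 0 otherwise.
# LEAVE_SRC keeps decode() from modifying its input.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# Hypothetical sample lines (assumed for illustration only).
my $en_dash = "a \xE2\x80\x93 b";    # EN DASH encoded as UTF-8: e2 80 93
my $dot     = "a \xB7 b";            # centered dot as a single byte: b7
my $mixed   = $dot . $en_dash;       # both in the same stream

for my $sample ([ 'en dash', $en_dash ], [ 'dot', $dot ], [ 'mixed', $mixed ]) {
    my ($name, $bytes) = @$sample;
    printf "%-8s %s\n", $name, is_valid_utf8($bytes) ? 'valid UTF-8' : 'not valid UTF-8';
}
```

A lone 0xb7 byte is a UTF-8 continuation byte with no lead byte, so any string containing it this way fails the strict check even though the e2 80 93 sequence on its own passes.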
Upon further investigation, I found that dumptorrent is capable of dumping more than one torrent with a single command. Dumping both a torrent containing a centered dot and a torrent containing the EN DASH to a single file does indeed yield a mixed encoding. At this point, I don't know if this is due to different encodings used in different torrent files, a problem with dumptorrent, a problem with the bencode library, or a problem with the Windows APIs. I found that I could read the data correctly by trying to decode the raw input as UTF-8 and, if that fails, decoding it as cp1252. This seems to work for all torrent files I have encountered so far:
use strict;
use warnings;
use Try::Tiny;
use Regexp::Common qw /number/;
use Encode;

my $linecount = 0;

# Read dumptorrent's output as raw bytes; each line is decoded individually.
open(my $dump, '-|', qq(dumptorrent.exe "$ARGV[0]"))
    or die "Cannot run dumptorrent.exe: $!";
binmode($dump, ":raw");
binmode(STDOUT, ":encoding(UTF-8)");

while (my $line = <$dump>)
{
    $line = DecodeRawLine($line);
    # Discard the 5 lines of output before the list of filenames
    if (++$linecount > 5)
    {
        last if (length($line) == 0);
        $line =~ s/^ *//;
        $line =~ s/ +\(${RE{num}{real}}[KMG]\)$//; # e.g. " (8.22M)"
        my $sizeindex = rindex($line, ' ') + 1;
        my $filesize = substr($line, $sizeindex);
        # Ignore zero-length files in torrents.
        if (defined $filesize && length($filesize) > 0 && $filesize > 0)
        {
            # Remove the extra spaces between the file name and size
            (my $filekey = $line) =~ s/ +(\d+)$/ $1/;
            print "$filekey\n";
        }
    }
}
close($dump);

sub DecodeRawLine
{
    # IMPORTANT! Any arguments in @_ that are needed in 'try' must be COPIED!
    # Copy the argument and strip the trailing CR-LF (or bare LF).
    (my $line = $_[0]) =~ s/\r?\n\z//;
    try
    {
        # Strict UTF-8 decode; dies on malformed input.
        # LEAVE_SRC keeps $line intact for the fallback.
        $line = Encode::decode('UTF-8', $line, Encode::FB_CROAK | Encode::LEAVE_SRC);
    }
    catch
    {
        # Not valid UTF-8, so assume the line is cp1252.
        $line = Encode::decode('cp1252', $line);
    };
    return $line;
}
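As a quick sanity check of the fallback idea, the sketch below runs the same UTF-8-then-cp1252 logic on two made-up byte strings. It uses core eval instead of Try::Tiny only to stay dependency-free, and decode_with_fallback is a hypothetical name, not part of the script above.

```perl
use strict;
use warnings;
use Encode ();

# Try strict UTF-8 first; if that dies, fall back to cp1252.
sub decode_with_fallback {
    my ($bytes) = @_;
    my $text = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $text ? $text : Encode::decode('cp1252', $bytes);
}

# Sample inputs are assumptions for illustration, not real dumptorrent output.
my $dash = decode_with_fallback("\xE2\x80\x93");   # valid UTF-8 sequence
my $dot  = decode_with_fallback("\xB7");           # invalid as UTF-8, valid cp1252

printf "U+%04X\n", ord($dash);   # prints U+2013 (EN DASH)
printf "U+%04X\n", ord($dot);    # prints U+00B7 (MIDDLE DOT)
```

One caveat with this approach: a cp1252 line whose bytes happen to form valid UTF-8 (for example c3 a9) will be decoded as UTF-8, so the heuristic can misread such input.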