What is the code for this '–', both in the original and changed file?
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
My blog: Imperial Deltronics
In the original program output, every character (including the 'centered dot', chr(0xb7)) is encoded as a single byte, except the specific hyphen-like character you ask about, which is encoded as 3 bytes: e2 80 93.
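To see where those bytes come from, here is a minimal sketch using the core Encode module. It assumes the hyphen-like character is U+2013 EN DASH (which is what the sequence e2 80 93 decodes to) and that the centered dot is U+00B7 MIDDLE DOT. The helper name hex_bytes is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

my $en_dash    = "\x{2013}";   # EN DASH (assumed to be the character asked about)
my $middle_dot = "\x{b7}";     # MIDDLE DOT (the 'centered dot')

print "EN DASH    as UTF-8:  ", hex_bytes(encode('UTF-8',  $en_dash)),    "\n";  # e2 80 93
print "EN DASH    as cp1252: ", hex_bytes(encode('cp1252', $en_dash)),    "\n";  # 96
print "MIDDLE DOT as UTF-8:  ", hex_bytes(encode('UTF-8',  $middle_dot)), "\n";  # c2 b7
print "MIDDLE DOT as cp1252: ", hex_bytes(encode('cp1252', $middle_dot)), "\n";  # b7
```

A single b7 byte next to a three-byte e2 80 93 therefore looks like a cp1252 (or Latin-1) dot sitting beside a UTF-8 dash.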
Which to me suggests that the output is UTF-8. Update: Corion points out that text containing both single bytes > 0x7f and 3-byte characters isn't UTF-anything, but rather a mixed(-up) encoding.
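Corion's point can be checked with a strict UTF-8 decode. The following is only a sketch: the sample byte strings are invented for illustration and are not taken from the actual dumptorrent output, and is_valid_utf8 is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the byte string decodes as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps $bytes unmodified.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $utf8_only = "a \xe2\x80\x93 b";          # EN DASH encoded as UTF-8
my $mixed     = "a \xb7 b \xe2\x80\x93 c";   # lone 0xb7 byte plus a UTF-8 EN DASH

printf "utf8_only: %s\n", is_valid_utf8($utf8_only) ? 'valid' : 'invalid';  # valid
printf "mixed:     %s\n", is_valid_utf8($mixed)     ? 'valid' : 'invalid';  # invalid
```

The lone 0xb7 is a UTF-8 continuation byte with no lead byte in front of it, so the mixed string cannot be valid UTF-8 as a whole.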
I suspect that the 'wrongness' the OP perceives when he treats the Perl input stream as UTF-8 and writes his output file as UTF-8 has more to do with how he subsequently inspects that output than with Perl's handling of the data, but I am insufficiently versed in the subject to be able to confirm that suspicion.
Upon further investigation, I found that dumptorrent is capable of dumping more than one torrent with a single command. Dumping both a torrent containing a centered dot and a torrent containing the EN DASH to a single file does indeed yield a mixed encoding. At this point, I don't know if this is due to different encodings used in different torrent files, a problem with dumptorrent, a problem with the bencode library, or a problem with the Windows APIs. I found that I could read the data correctly by trying to decode the raw input as UTF-8 and, if that fails, decoding it as cp1252. This seems to work for all torrent files I have encountered so far:
use strict;
use warnings;
use Try::Tiny;
use Regexp::Common qw /number/;
use Encode;

my $linecount = 0;
# Read dumptorrent's output as raw bytes; each line is decoded individually below.
open(my $dump, qq(dumptorrent.exe "$ARGV[0]"|)) or die "Cannot run dumptorrent.exe: $!";
binmode($dump, ":raw");
binmode(STDOUT, ":encoding(UTF-8)");
while (my $line = <$dump>)
{
    $line = DecodeRawLine($line);
    # Discard the 5 lines of output before the list of filenames
    if (++$linecount > 5)
    {
        last if (length($line) == 0);
        $line =~ s/^ *//;
        $line =~ s/ +\(${RE{num}{real}}[KMG]\)$//; # e.g. " (8.22M)"
        my $sizeindex = rindex($line, ' ') + 1;
        my $filesize = substr($line, $sizeindex);
        # Ignore zero-length files in torrents.
        if (defined $filesize && length($filesize) > 0 && $filesize > 0)
        {
            # Remove the extra spaces between the file name and size
            (my $filekey = $line) =~ s/ +(\d+)$/ $1/;
            print "$filekey\n";
        }
    }
}
close($dump);

sub DecodeRawLine
{
    # IMPORTANT! Any arguments in @_ that are needed in 'try' must be COPIED!
    (my $line = $_[0]) =~ s/\r?\n\z//; # remove the trailing CR-LF (or bare LF)
    try
    {
        # LEAVE_SRC keeps $line intact so the fallback sees the original bytes.
        $line = Encode::decode('UTF-8', $line, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    }
    catch
    {
        $line = Encode::decode('cp1252', $line);
    };
    return $line;
}
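The fallback can also be tried without dumptorrent. The sketch below is a stand-in, not the code above: it uses a plain eval instead of Try::Tiny so that it needs only core modules, the name decode_utf8_or_cp1252 is hypothetical, and the sample byte strings are made up rather than taken from real torrent output.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for DecodeRawLine: strict UTF-8 first, cp1252 on failure.
sub decode_utf8_or_cp1252 {
    my ($bytes) = @_;
    # decode() dies under FB_CROAK if $bytes is not valid UTF-8, leaving $text undefined.
    my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC()) };
    return defined $text ? $text : decode('cp1252', $bytes);
}

my $dot  = decode_utf8_or_cp1252("A \xb7 B");           # cp1252 centered dot
my $dash = decode_utf8_or_cp1252("A \xe2\x80\x93 B");   # UTF-8 EN DASH

printf "U+%04X\n", ord(substr($dot,  2, 1));   # U+00B7
printf "U+%04X\n", ord(substr($dash, 2, 1));   # U+2013
```

Keep in mind that this is a heuristic: a cp1252 line whose high bytes happen to form a valid UTF-8 sequence (for example c2 b7) would be decoded as UTF-8 rather than as cp1252.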