Re^4: What is the proper way to read non-ANSI data

In the original program output, every character (including the 'centered dot', chr(0xb7)) is encoded as a single byte, except the specific hyphen-like character you ask about, which is encoded as 3 bytes: e2 80 93.

~~Which to me suggests that the output is utf-8.~~ Update: Corion points out that text containing both single bytes > 0x7f and 3-byte chars isn't utf-anything, but rather a mixed(-up) encoding. In valid UTF-8, a byte above 0x7f never stands alone as a character.
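To make the point concrete, here is a minimal sketch that checks the two byte sequences mentioned above against a strict UTF-8 decoder. The helper name is_valid_utf8 and the sample variables are made up for illustration; only the Encode calls are real.

```perl
use strict;
use warnings;
use Encode ();

# Sample bytes taken from the discussion: a centered dot as the single
# byte 0xB7, and the hyphen-like character as the three bytes E2 80 93.
my $dot_bytes  = "\xB7";
my $dash_bytes = "\xE2\x80\x93";

# Hypothetical helper: true if $bytes is well-formed UTF-8.
# LEAVE_SRC stops decode() from modifying the caller's string.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

printf "0xB7 alone is valid UTF-8: %s\n", is_valid_utf8($dot_bytes)  ? 'yes' : 'no';
printf "E2 80 93 is valid UTF-8:   %s\n", is_valid_utf8($dash_bytes) ? 'yes' : 'no';
printf "E2 80 93 as UTF-8 is U+%04X\n", ord(Encode::decode('UTF-8',  $dash_bytes));
printf "0xB7 as cp1252 is U+%04X\n",    ord(Encode::decode('cp1252', $dot_bytes));
```

The lone 0xB7 is rejected as UTF-8 but decodes to U+00B7 as cp1252, while E2 80 93 decodes to U+2013 (EN DASH) as UTF-8, so no single encoding covers both.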

I suspect that the 'wrongness' the OP perceives when he treats the Perl input stream as utf-8 and writes his output file as utf-8 has more to do with how he subsequently inspects that output than with Perl's handling of the data; but I am insufficiently versed in the subject to be able to confirm that suspicion.
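One way to test that suspicion is to look at the bytes actually written rather than at how a viewer renders them. A rough sketch, assuming the text has already been decoded into Perl characters; utf8_hex is a hypothetical helper name, and it uses an in-memory filehandle in place of a real output file:

```perl
use strict;
use warnings;

# Push a character string through a :encoding(UTF-8) layer, the kind of
# layer a script would put on its output handle, and return the bytes
# that come out as space-separated hex.
sub utf8_hex {
    my ($text) = @_;
    my $buffer = '';
    open(my $out, '>:encoding(UTF-8)', \$buffer)
        or die "Cannot open in-memory handle: $!";
    print {$out} $text;
    close($out) or die "Cannot close in-memory handle: $!";
    return join ' ', unpack('(H2)*', $buffer);
}

# MIDDLE DOT, space, EN DASH as Perl characters.
print utf8_hex("\x{B7} \x{2013}"), "\n";    # c2 b7 20 e2 80 93
```

In correct UTF-8 output the centered dot becomes the two bytes c2 b7, so a viewer that assumes a single-byte code page will show two characters there even though the file itself is fine.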

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)

In the absence of evidence, opinion is indistinguishable from prejudice.

I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!


Replies are listed 'Best First'.

Re^5: What is the proper way to read non-ANSI data
by freonpsandoz (Beadle) on Oct 04, 2015 at 00:39 UTC

Upon further investigation, I found that dumptorrent is capable of dumping more than one torrent with a single command. Dumping both a torrent containing a centered dot and a torrent containing the EN DASH to a single file does indeed yield a mixed encoding. At this point, I don't know if this is due to different encodings used in different torrent files, a problem with dumptorrent, a problem with the bencode library, or a problem with the Windows APIs. I found that I could read the data correctly by trying to decode the raw input as UTF-8 and, if that fails, decoding it as cp1252. This seems to work for all torrent files I have encountered so far:

use strict;
use warnings;
use Try::Tiny;
use Regexp::Common qw /number/;
use Encode;

my $linecount = 0;
        
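# Run dumptorrent and read its output as raw bytes; each line is decoded later.
open(my $dump, '-|', qq(dumptorrent.exe "$ARGV[0]"))
    or die "Cannot run dumptorrent.exe: $!";
binmode($dump, ":raw");
binmode(STDOUT, ":encoding(UTF-8)");
while (my $line = <$dump>)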
{
    $line = DecodeRawLine($line);
    # Discard the 5 lines of output before the list of filenames
    if (++$linecount > 5)
    {
            last if (length($line) == 0);
            $line =~ s/^ *//;
            $line =~ s/ +\(${RE{num}{real}}[KMG]\)$//; # e.g. " (8.22M)"
            my $sizeindex = rindex($line, ' ') + 1;
            my $filesize = substr($line, $sizeindex);
            # Ignore zero-length files in torrents. 
            if (defined $filesize && length($filesize) > 0 && $filesize > 0)
            {
                # Remove the extra spaces between the file name and size
                (my $filekey = $line) =~ s/ +(\d+)$/ $1/;
                print "$filekey\n";
            }
    }
}

sub DecodeRawLine
{
    # IMPORTANT! Any arguments in @_ that are needed in 'try' must be COPIED!
    (my $line = $_[0]) =~ s/\r?\n\z//; # remove the trailing CR-LF (or bare LF)
    try
    {
        $line = Encode::decode('UTF-8', $line, Encode::FB_CROAK);
    }
    catch
    {
        $line = Encode::decode('cp1252', $line);
    };
    return $line;
}
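
For anyone who wants to try the fallback without dumptorrent or Try::Tiny, here is a self-contained sketch of the same idea using a plain eval. The name decode_mixed_line and the sample lines are made up for illustration; they are not real dumptorrent output.

```perl
use strict;
use warnings;
use Encode ();

# Decode one raw line: strict UTF-8 first, cp1252 as the fallback.
# decode_mixed_line is a hypothetical name, not part of the script above.
sub decode_mixed_line {
    my ($raw) = @_;
    (my $bytes = $raw) =~ s/\r?\n\z//;    # strip CR-LF or LF if present

    # LEAVE_SRC keeps decode() from modifying $bytes, so the fallback
    # below still sees the original data.
    my $text = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    # eval yields undef when decode() croaks on malformed UTF-8.
    $text = Encode::decode('cp1252', $bytes) unless defined $text;
    return $text;
}

# Made-up sample lines imitating the mixed output described above.
my @raw_lines = (
    "Artist \xB7 Title.mp3 1234\r\n",            # cp1252 centered dot
    "Artist \xE2\x80\x93 Title.mp3 5678\r\n",    # UTF-8 EN DASH
);

binmode(STDOUT, ':encoding(UTF-8)');
for my $raw (@raw_lines) {
    my $text = decode_mixed_line($raw);
    my @non_ascii = map { sprintf('U+%04X', ord($_)) }
                    grep { ord($_) > 0x7F } split(//, $text);
    print "$text  [@non_ascii]\n";
}
```

Note that this is a heuristic: a cp1252 line whose high bytes happen to form a valid UTF-8 sequence will be decoded as UTF-8, so it can guess wrong on unusual input.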