Re^2: What is the proper way to read non-ANSI data

http://qs1969.pair.com?node_id=1142001

in reply to Re: What is the proper way to read non-ANSI data
in thread What is the proper way to read non-ANSI data

That also works for '·' but not for '–'.


Replies are listed 'Best First'.
Re^3: What is the proper way to read non-ANSI data by CountZero (Bishop) on Sep 15, 2015 at 06:15 UTC
What is the code for this '–', both in the original and in the changed file?

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics
Re^4: What is the proper way to read non-ANSI data by BrowserUk (Patriarch) on Sep 15, 2015 at 07:35 UTC
In the original program output, every character (including the 'centered dot', chr(0xb7)) is encoded as a single byte, except the specific hyphen-like character you ask about, which is encoded as 3 bytes: e2 80 93. ~~Which to me suggests that the output is utf-8.~~

Update: Corion points out that text containing both single bytes > 0x7f and 3-byte characters isn't utf-anything, but rather a mixed(-up) encoding.

I suspect that the 'wrongness' the OP perceives when he treats the Perl input stream as utf-8 and writes his output file as utf-8 has more to do with how he subsequently inspects that output than with Perl's handling of the data, but I am insufficiently versed in the subject to confirm that suspicion.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm with torvalds on this | Agile (and TDD) debunked | I told'em LLVM was the way to go. But did they listen!
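For readers who want to check the byte values discussed above, here is a minimal sketch using only the core Encode module. The helper name hex_bytes and the sample string are made up for illustration; they are not taken from the OP's program or from dumptorrent's real output.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Return the bytes of a string as space-separated lowercase hex.
# (hex_bytes is a hypothetical helper name, used only in this sketch.)
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $middle_dot = "\x{00B7}";   # MIDDLE DOT, the 'centered dot'
my $en_dash    = "\x{2013}";   # EN DASH

print "middle dot as cp1252: ", hex_bytes(encode('cp1252', $middle_dot)), "\n";  # b7
print "middle dot as UTF-8:  ", hex_bytes(encode('UTF-8',  $middle_dot)), "\n";  # c2 b7
print "en dash as cp1252:    ", hex_bytes(encode('cp1252', $en_dash)),    "\n";  # 96
print "en dash as UTF-8:     ", hex_bytes(encode('UTF-8',  $en_dash)),    "\n";  # e2 80 93

# An invented line that mixes a cp1252 middle dot with a UTF-8 en dash,
# like the mixed output described in this thread.
my $mixed = "a\xB7b\xE2\x80\x93c";

# Strict decoding dies on the lone 0xb7 byte, so the eval returns false.
# LEAVE_SRC keeps decode() from modifying $mixed in place.
my $ok = eval { decode('UTF-8', $mixed, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 };
print "mixed line is valid UTF-8: ", ($ok ? "yes" : "no"), "\n";  # no
```

A lone b7 byte next to an e2 80 93 sequence is therefore consistent with a cp1252 middle dot and a UTF-8 en dash ending up in the same stream.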
Re^5: What is the proper way to read non-ANSI data by freonpsandoz (Beadle) on Oct 04, 2015 at 00:39 UTC
Upon further investigation, I found that dumptorrent is capable of dumping more than one torrent with a single command. Dumping both a torrent containing a centered dot and a torrent containing the EN DASH to a single file does indeed yield a mixed encoding. At this point, I don't know if this is due to different encodings used in different torrent files, a problem with dumptorrent, a problem with the bencode library, or a problem with the Windows APIs.

I found that I could read the data correctly by trying to decode raw input as UTF-8 and, if that fails, decoding it as cp1252. This seems to work for all torrent files I have encountered so far:

    use strict;
    use warnings;
    use Try::Tiny;
    use Regexp::Common qw /number/;
    use Encode;

    my $linecount = 0;
    open(DATA, qq(dumptorrent.exe "$ARGV[0]"\|));
    binmode(DATA, ":raw");
    binmode(STDOUT, ":encoding(UTF-8)");
    foreach my $line (<DATA>) {
        $line = DecodeRawLine($line);
        # Discard the 5 lines of output before the list of filenames
        if (++$linecount > 5) {
            last if (length($line) == 0);
            $line =~ s/^ *//;
            $line =~ s/ +\(${RE{num}{real}}[KMG]\)$//;  # e.g. " (8.22M)"
            my $sizeindex = rindex($line, ' ') + 1;
            my $filesize = substr($line, $sizeindex);
            # Ignore zero-length files in torrents.
            if (defined $filesize && length($filesize) > 0 && $filesize > 0) {
                # Remove the extra spaces between the file name and size
                (my $filekey = $line) =~ s/ +(\d+)$/ $1/;
                print "$filekey\n";
            }
        }
    }

    sub DecodeRawLine {
        # IMPORTANT! Any arguments in @_ that are needed in 'try' must be COPIED!
        my $line = substr($_[0], 0, length($_[0])-2);  # remove CR-LF
        try {
            $line = Encode::decode('UTF-8', $line, Encode::FB_CROAK);
        }
        catch {
            $line = Encode::decode('cp1252', $line);
        };
        return $line;
    }
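The same "try UTF-8, fall back to cp1252" idea can be tried in isolation with only core modules. The sketch below is a variant for experimentation, not the code posted above. It assumes the sub name decode_line and two invented sample byte strings, swaps Try::Tiny for a plain eval, and adds Encode::LEAVE_SRC so that a failed UTF-8 attempt cannot alter the bytes before the cp1252 attempt.

```perl
use strict;
use warnings;
use Encode ();

# Decode raw bytes as strict UTF-8; if they are not valid UTF-8, decode
# them as cp1252 instead. (decode_line is a hypothetical name.)
sub decode_line {
    my ($bytes) = @_;   # work on a copy of the caller's data
    my $text = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text if defined $text;
    return Encode::decode('cp1252', $bytes);
}

# Invented sample lines, not real dumptorrent output.
my $utf8_line   = "Artist \xE2\x80\x93 Title";   # EN DASH as UTF-8 bytes
my $cp1252_line = "Artist \xB7 Title";           # MIDDLE DOT as a cp1252 byte

binmode(STDOUT, ':encoding(UTF-8)');
for my $raw ($utf8_line, $cp1252_line) {
    my $text = decode_line($raw);
    printf "%s (%d characters)\n", $text, length $text;
}
```

Pure ASCII lines decode identically either way. The fallback is a heuristic: a cp1252 byte sequence that happens to form valid UTF-8 would be read as UTF-8, which is unlikely for typical file names but not impossible.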

In Section Seekers of Perl Wisdom