Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Re^3: What is the proper way to read non-ANSI data

by CountZero (Bishop)
on Sep 15, 2015 at 06:15 UTC ( [id://1142004]=note: print w/replies, xml ) Need Help??


in reply to Re^2: What is the proper way to read non-ANSI data
in thread What is the proper way to read non-ANSI data

What is the code for this '–', both in the original and changed file?

CountZero

A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics

Replies are listed 'Best First'.
Re^4: What is the proper way to read non-ANSI data
by BrowserUk (Patriarch) on Sep 15, 2015 at 07:35 UTC

    In the original program output, every character (including the 'centered dot' chr(0xb7) ) is encoded as a single byte, except the specific hyphen like character your ask about, which is encoded as 3 bytes: e2 80 93.

    Which to me suggests that the output is utf-8. Update: Corion points out that text containing single bytes > 0x7f and 3-byte chars isn't utf-anything; but rather a mixed(-up) encoding.

    I suspect that the 'wrongness' the OP perceives when he treats the perl input stream as utf-8 and writes his output file as utf-8, has more to do with how he subsequently is inspecting that output than it does with Perl's handing of the data; but am insufficiently versed in the subject to be able to confirm that suspicion.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

      Upon further investigation, I found that dumptorrent is capable of dumping more than one torrent with a single command. Dumping both a torrent containing a centered dot and torrent containing the EN DASH to a single file does indeed yield a mixed encoding. At this point, I don't know if this is due to different encodings used in different torrent files, a problem with dumptorrent, a problem with the bencode library, or a problem with the Windows APIs. I found that I could read the data correctly by trying to decode raw input as UTF-8 and, if that fails, decoding it as cp1252. This seems to work for all torrent files I have encountered so far:

      use strict; use warnings; use Try::Tiny; use Regexp::Common qw /number/; use Encode; my $linecount = 0; open(DATA, qq(dumptorrent.exe "$ARGV[0]"|)); binmode(DATA, ":raw"); binmode(STDOUT, ":encoding(UTF-8)"); foreach my $line (<DATA>) { $line = DecodeRawLine($line); # Discard the 5 lines of output before the list of filenames if (++$linecount > 5) { last if (length($line) == 0); $line =~ s/^ *//; $line =~ s/ +\(${RE{num}{real}}[KMG]\)$//; # e.g. " (8.22M +)" my $sizeindex = rindex($line, ' ') + 1; my $filesize = substr($line, $sizeindex); # Ignore zero-length files in torrents. if (defined $filesize && length($filesize) > 0 && $filesiz +e > 0) { # Remove the extra spaces between the file name and si +ze (my $filekey = $line) =~ s/ +(\d+)$/ $1/; print "$filekey\n"; } } } sub DecodeRawLine { # IMPORTANT! Any arguments in @_ that are needed in 'try' must be +COPIED! my $line = substr($_[0], 0, length($_[0])-2); # remove CR-LF try { $line = Encode::decode('UTF-8', $line, Encode::FB_CROAK); } catch { $line = Encode::decode('cp1252', $line); }; return $line; }

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1142004]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (5)
As of 2024-03-29 09:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found