http://qs1969.pair.com?node_id=1141800

freonpsandoz has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to read data produced by another program (dumptorrent) through a pipe. If I set binmode ":encoding(UTF-8)" on the pipe, data containing the character '–' is read correctly, but the character '·' is not. If I omit the binmode call, the '·' character is read correctly, but the '–' character is not. How do I read this data correctly?

Thanks

use strict;
use warnings;

my $linecount = 0;
open(DATA, qq(dumptorrent.exe "$ARGV[0]"|));
#binmode(DATA, ":encoding(UTF-8)");
binmode(STDOUT, ":encoding(UTF-8)");
foreach my $line (<DATA>) {
    # Discard the 5 lines of output before the list of filenames
    if (++$linecount > 5) {
        chomp $line;
        last if (length($line) == 0);
        $line =~ s/^ *//;
        $line =~ s/ +\([\d.]+[KMG]\)$//; # e.g. (8.22M) present for files larger than ~1k; discard it
        my $sizeindex = rindex($line, ' ') + 1;
        my $filesize = substr($line, $sizeindex);
        # Ignore zero-length files in torrents.
        if (defined $filesize && length($filesize) > 0 && $filesize > 0) {
            (my $filename = $line) =~ s/ +\d+$//;
            my $filekey = "$filename $filesize";
            print "$filekey\n";
        }
    }
}

Replies are listed 'Best First'.
Re: What is the proper way to read non-ANSI data
by BrowserUk (Patriarch) on Sep 13, 2015 at 05:09 UTC
    If I set binmode ":encoding(UTF-8)" on the pipe, data containing the character '–' is read correctly, but the character '·' is not. If I omit the binmode call, the '·' character is read correctly,

    Then the output being produced by dumptorrent.exe must not be encoded as either ascii or utf8.

    Typically, 8-bit text on Windows systems is in an ANSI code page such as cp1250; so try that.
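A minimal sketch of what trying a code page looks like: the layer name given to open or binmode can be any encoding that Encode knows. An in-memory filehandle stands in for the dumptorrent pipe here, and the sample bytes (0x96 and 0xB7) are an assumption for illustration, not actual dumptorrent output.

```perl
use strict;
use warnings;

# Hypothetical sample: bytes 0x96 and 0xB7, which cp1250 maps to
# EN DASH (U+2013) and MIDDLE DOT (U+00B7).
my $bytes = "a \x96 b \xB7 c\n";

# An in-memory filehandle stands in for the pipe from dumptorrent.exe.
open(my $in, '<:encoding(cp1250)', \$bytes) or die "Cannot open: $!";
my $line = <$in>;
close($in);
chomp $line;

# Collect the code point of every non-ASCII character that was decoded.
my @codepoints = map  { sprintf('U+%04X', ord($_)) }
                 grep { ord($_) > 127 }
                 split //, $line;

binmode(STDOUT, ':encoding(UTF-8)');
print "$line\n";
print "@codepoints\n";   # prints "U+2013 U+00B7"
```

If the real data decoded this cleanly, both characters would come through intact, so a failure on one of them suggests that the input is not plain cp1250.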


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
      cp1250 seems to do the same thing as omitting the binmode call: '·' is read correctly, but '–' isn't.

        Then you'll need to work out what encoding the program is outputting. (That's the problem with Unicode; it doesn't self identify.)

        Try posting a few lines of the output from running the following command:

        dumptorrent.exe the.file | perl -nle"print; print unpack 'H*', $_"

        Perhaps the output will allow someone to recognise the encoding being used.
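To help interpret that hex output, the sketch below prints the bytes that '–' (U+2013) and '·' (U+00B7) produce under a few candidate encodings. The list of candidates is an assumption; the thread does not establish which encoding dumptorrent actually uses.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the hex bytes of $char when encoded as $encoding.
sub hex_bytes {
    my ($char, $encoding) = @_;
    return unpack 'H*', encode($encoding, $char);
}

for my $encoding (qw(UTF-8 cp1252 cp1250 UTF-16LE)) {
    printf "%-8s  en dash: %-6s  middle dot: %s\n",
        $encoding,
        hex_bytes("\x{2013}", $encoding),
        hex_bytes("\x{B7}",   $encoding);
}
```

In the dump, `e28093` for the dash or `c2b7` for the dot points to UTF-8, while a lone `96` or `b7` points to a single-byte Windows code page.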


Re: What is the proper way to read non-ANSI data
by CountZero (Bishop) on Sep 13, 2015 at 12:12 UTC
    What do you mean by "is not read correctly"?

    How did you find out?

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
      I redirect the output to a utf-8 file and compare it to the redirected output of dumptorrent.exe in Notepad++. The characters display as – instead of – for example.
        Notepad++. The characters display as –

        Notepad++, like all other programs, can only guess the encoding of plain text files. Some other file formats, like HTML, may contain more information about the encoding used. Other file formats are always encoded as UTF-8, like Java sources (IIRC).

        So, Notepad++ may just guess wrong. Check in the status bar which encoding Notepad++ guessed (probably ANSI). Use the Encoding menu to switch (not convert!) the encoding.

        A trick that works quite often is to write a Byte Order Mark ("\x{FEFF}") as the first character of any file that is encoded in some Unicode encoding, including UTF-8. It is not strictly required for UTF-8, but it helps most programs, including Notepad++, to guess the encoding correctly.

        In most cases, the BOM does not hurt. One exception is any kind of Unix script, which must start with "#!" and not with a BOM. A BOM makes the script unrecognisable to the kernel, leading to bizarre error messages.
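A minimal sketch of the BOM trick in Perl, assuming the output goes to a file opened with a UTF-8 encoding layer. The file name `filelist.txt` is hypothetical and the sample line is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical output file name, used only for this illustration.
my $path = 'filelist.txt';

open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} "\x{FEFF}";                      # BOM; written as bytes EF BB BF
print {$out} "dash \x{2013} dot \x{B7}\n";    # sample content
close($out) or die "Cannot close $path: $!";

# Read the file back as raw bytes to confirm what was written.
open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
my $raw = do { local $/; <$in> };
close($in);
print unpack('H*', substr($raw, 0, 3)), "\n"; # prints "efbbbf"
```

Opening the resulting file in Notepad++ should then show a UTF-8 encoding with BOM in the status bar rather than a guessed ANSI encoding.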

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: What is the proper way to read non-ANSI data
by CountZero (Bishop) on Sep 14, 2015 at 13:09 UTC
    What about windows-1252?

    CountZero


      That also works for '·' but not for '–'.
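The observations so far would be consistent with output that mixes encodings, for example '–' arriving as the UTF-8 sequence E2 80 93 and '·' as the single byte B7. That is only a hypothesis; nothing in the thread confirms it. Under that assumption, a rough sketch is to decode well-formed multi-byte UTF-8 sequences as UTF-8 and every remaining byte as cp1252. The helper name `decode_mixed` is made up for this example, and the UTF-8 pattern is simplified (it does not reject overlong forms or surrogates).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a byte string that may mix UTF-8 and cp1252.
sub decode_mixed {
    my ($bytes) = @_;
    my $text = '';
    # Walk the byte string from left to right. Runs of multi-byte UTF-8
    # sequences land in $1; any other single byte lands in $2.
    while ($bytes =~ /\G(?:((?:[\xC2-\xDF][\x80-\xBF]
                              |[\xE0-\xEF][\x80-\xBF]{2}
                              |[\xF0-\xF4][\x80-\xBF]{3})+)
                          |(.))/gsx) {
        my ($utf8_run, $other_byte) = ($1, $2);
        $text .= defined $utf8_run
            ? decode('UTF-8',  $utf8_run)
            : decode('cp1252', $other_byte);
    }
    return $text;
}

# Assumed sample: a UTF-8 en dash followed by a single-byte middle dot.
my $decoded = decode_mixed("a \xE2\x80\x93 b \xB7 c");

binmode(STDOUT, ':encoding(UTF-8)');
print join(' ', map { sprintf('U+%04X', ord($_)) }
                grep { ord($_) > 127 } split //, $decoded), "\n";
# prints "U+2013 U+00B7"
```

To use this with the pipe, the handle would be read with `binmode($fh, ':raw')` and each line passed through the helper. A heuristic like this can misread genuine cp1252 text whose bytes happen to form a valid UTF-8 sequence, so the hex dump remains the better way to confirm what the program really emits.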

        What is the code for this '–', both in the original and changed file?

        CountZero
