manzico has asked for the wisdom of the Perl Monks concerning the following question:

Here's the situation. I've got some data. It contains a text header and then one long string of binary data that represents numerous samples. I have a script that pulls off the header and then begins to read the binary data. Depending on the type of data, each data point is n bytes long, which I take into account. I read in, say, 2 bytes, split them, ord each, and reassemble the two numbers so that I have the value for the data point. All of this works fine in Linux.

However, when I run the same script in DOS (I'm using ActiveState Perl for the DOS part), it inevitably finds a byte that seems to indicate EOF. I've looked at the byte values in Linux and DOS and they match until they hose up. The last byte that DOS gets is FDh; the one it seems to trip up on is 1Ah, which is actually SUB (substitute). It's my understanding that if you open a file in DOS you have to specify that it is a binary file, or this type of control-character error will occasionally (inevitably?) occur.

So my question is this: how do I tell my filehandle, in the DOS implementation, that the file it points to is a binary file, not a text file? Or is my problem something else I haven't thought of?

Replies are listed 'Best First'.
Re: perl in DOS woes
by BrowserUk (Patriarch) on Nov 04, 2002 at 23:56 UTC

    See perlfunc:binmode for how to set a filehandle into binary mode.

    While you're at it, you should look at unpack for a better way of processing your binary data than doing it with ord.
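    A minimal sketch of both suggestions together (the filename is made up): write a few 16-bit samples to disk, then read them back with binmode set so the 0x1A (^Z) byte in the data cannot masquerade as EOF on DOS, and decode them with unpack rather than split-and-ord.

    ```perl
    use strict;
    use warnings;

    my $file = 'samples.bin';
    open my $out, '>', $file or die $!;
    binmode $out;                          # binary mode for writing, too
    print $out pack 'n3', 0xFD1A, 0x0001, 0x8000;
    close $out;

    open my $in, '<', $file or die $!;
    binmode $in;                           # no ^Z-as-EOF, no CRLF translation
    read $in, my $buf, 6;
    my @samples = unpack 'n*', $buf;       # (64794, 1, 32768)
    close $in;
    unlink $file;
    ```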


Re: perl in DOS woes
by scholar (Acolyte) on Nov 05, 2002 at 00:12 UTC
    DOS uses 1Ah (^Z) to indicate EOF for a text file -- which means that if the value exists somewhere in your file, you can't use text file semantics to read it. You'll get the same sorts of problems if there are any null bytes in the file. What I usually do in this situation is:
    open FH, '<', $filename or die $!; binmode FH;
    Then of course you can't use the usual file operators to read your data (but you shouldn't be doing that with a binary file anyway, even on *nix -- hopefully someone with more *nix experience can point out the pitfalls there). You would need to use the read function to get chunks of data from the file and then parse out the resultant buffer. Something like:
    while (read(FH, $buf, 512)) { do something; }
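    A fleshed-out version of that loop (the chunk size and test data are arbitrary): read fixed-size chunks from a binmode'd handle and tally the bytes, so a stray ^Z in the data cannot end the read early.

    ```perl
    use strict;
    use warnings;

    my $file = 'chunks.bin';
    open my $out, '>', $file or die $!;
    binmode $out;
    print $out "\x1A" x 1000;              # 1000 ^Z bytes
    close $out;

    open my $fh, '<', $file or die $!;
    binmode $fh;
    my $total = 0;
    while ( my $n = read $fh, my $buf, 512 ) {
        $total += $n;                      # "do something" with $buf here
    }
    close $fh;
    unlink $file;
    # $total is 1000: the ^Z bytes did not stop the read
    ```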
    Oh ... and thanks for finding a question I could answer so soon ... My apologies if this first attempt is not as helpful as it might be :)
      Then of course you can't use the usual file operators to read your data (but you shouldn't be doing that with a binary file anyway, even on *nix -- hopefully someone with more *nix experience can point out the pitfalls there).

      You can use the usual file operators. There is nothing wrong with using them to read binary data. In fact, it can be quite handy as long as you understand what you are doing.

      Usually, you'll want to change $/, the input record separator, before you do so. By default, $/ is a newline which is unlikely to be a meaningful delimiter in your binary data. A null, "\0", is often used to delimit variable length records though. If you have fixed length records or even if you just want to read chunks of a specific length in, you can do that by setting $/ to a reference. For example,

      $/ = \16384;
      would result in 16 KB block reads. Another useful value to set $/ to for binary data (as well as for text) is undef. Setting it to undef results in the whole file being slurped in a single read.
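      A quick sketch of both tricks (the filename and sizes are invented): a reference in $/ gives fixed-size reads, and undef slurps everything that is left.

      ```perl
      use strict;
      use warnings;

      my $file = 'records.bin';
      open my $out, '>', $file or die $!;
      binmode $out;
      print $out 'A' x 40;
      close $out;

      open my $fh, '<', $file or die $!;
      binmode $fh;
      my ( $len_record, $len_rest );
      {
          local $/ = \16;                  # 16-byte "records"
          my $record = <$fh>;
          $len_record = length $record;    # 16
          local $/ = undef;                # slurp mode
          my $rest = <$fh>;
          $len_rest = length $rest;        # the remaining 24 bytes
      }
      close $fh;
      unlink $file;
      ```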

      -sauoq
      "My two cents aren't worth a dime.";
      
        Oh, I just knew I wasn't being as helpful as I thought.

        "Of course you can use regular file operators on a binmode file" I said as I slapped myself silly reading your reply. "Why, I've done it a hundred times myself!" And I'd forgotten about how useful the input record separator can be, especially setting it to undef. I think that using

        $/=\<a number>;
        would have made things a lot easier sometimes too.

        I have fallen prey to the fallacy of assuming that my habit is actually a recommended practice. When I'm dealing with a "binary file" I almost always use read, which makes it clear to me later what I was intending to do with it. On that somewhat flimsy notion, I beg forgiveness for misleading comments.

        I have spent a couple of hours attempting to recreate problems I could swear I have had with null bytes ("\x00") in files, but have been unable to do so -- please ignore that part of my answer as well.

Re: perl in DOS woes
by bart (Canon) on Nov 05, 2002 at 01:24 UTC
    binmode. The premature EOF will not be your only problem, conversion of CR+LF to bare LF will be another one. binmode() fixes both.

    And use $n = unpack 'v', $twobytes; or, though less likely, $n = unpack 'n', $twobytes; to extract the 2-byte integer. 'v' is for little-endian, the default on Intel, while 'n' is for big-endian.
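    The same two bytes decoded both ways, to make the difference concrete:

    ```perl
    use strict;
    use warnings;

    my $twobytes = "\x01\x02";
    my $le = unpack 'v', $twobytes;    # little-endian: 0x0201 == 513
    my $be = unpack 'n', $twobytes;    # big-endian:    0x0102 == 258
    ```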

Re: perl in DOS woes
by manzico (Initiate) on Nov 05, 2002 at 05:52 UTC
    I used read, and then switched to sysread. It was the binmode that escaped me. I'll also have to try the other stuff, although the data isn't delimited at all. It's one big block of contiguous samples packed together. I have to know the size of each sample and handle the parsing from there (I didn't create the format... just have to deal with it). Thanks for the help. I'll give it a try in the morning.

      From your description, unpack would be perfect for your purposes.

      For example, if your file has 3 16-bit integers, 2 32-bit integers and 4 1-byte ascii fields, getting them into separate scalars would be as simple as

      my ($int16_1, $int16_2, $int16_3, $int32_1, $int32_2, $c_1, $c_2, $c_3, $c_4) = unpack 'S3 L2 C4', $scalar_18_bytes;

      And there are format specifiers for just about every conceivable sort of datatype, size, and endianness that you are likely to encounter.
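      A round trip of that template, with made-up values: pack builds the 18-byte string, and unpack takes it apart again. (Note that 'S' and 'L' are in native byte order.)

      ```perl
      use strict;
      use warnings;

      my $scalar_18_bytes = pack 'S3 L2 C4',
          10, 20, 30, 100_000, 200_000, 65, 66, 67, 68;
      my ( $int16_1, $int16_2, $int16_3, $int32_1, $int32_2,
           $c_1, $c_2, $c_3, $c_4 ) = unpack 'S3 L2 C4', $scalar_18_bytes;
      # length($scalar_18_bytes) is 18: 3*2 + 2*4 + 4*1 bytes
      ```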


Re: perl in DOS woes
by manzico (Initiate) on Nov 05, 2002 at 20:28 UTC
    I probably could have been a little clearer. The data is coming from an A/D test set. The A/D has so many bits per sample, let's say 10 bits/sample. The setup takes the data and stores it in LabVIEW. To save, you hit a save button and it presents you with a menu to add header information, which is simply placed at the beginning of the file. Then it takes the samples, which are two's complement, sign extends them to fill up the MSByte (in this case resulting in two bytes), and saves all of the samples as one long string of binary data: MSbyte0, LSbyte0, MSbyte1, LSbyte1, MSbyte2, LSbyte2, ..., MSbyten, LSbyten. Do you still recommend unpack as opposed to ord? I suppose it would be more compact with unpack.

      Do you still recommend unpack as opposed to ord.

      The answer is a qualified "yes". I'll get to the qualification in a bit.

      Ostensibly, once you have stripped the header from your binary data, extracting your 16-bit, big-endian values from the string would be as simple as

      my @ADsamples = unpack 'n*', $binaryData;

      Which will 'do the right thing' with big-endian 16 bit data regardless of the architecture that it runs on.
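      An end-to-end sketch (the header format here is invented, since yours comes from LabVIEW): skip past the text header, then unpack every remaining big-endian 16-bit sample in one go.

      ```perl
      use strict;
      use warnings;

      my $raw = "HDR demo\n" . pack 'n*', 0, 1, 513, 65535;
      my $headerLen  = index( $raw, "\n" ) + 1;     # header ends at the newline
      my $binaryData = substr $raw, $headerLen;
      my @ADsamples  = unpack 'n*', $binaryData;    # (0, 1, 513, 65535)
      ```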

      However, there is a caveat, as I mentioned at the top. The 'n' unpack format specifier is for unsigned 16-bit values. There is no equivalent for big-endian, signed 16-bit data.

      The upshot of this is that if the msb of the original sample is set, then once it is sign extended, the result will be a negative value. Treating this as an unsigned value will result in large positive values!

      This may not be a problem if you are masking out the bits that you need and ignoring the machinations of the sign extension. Otherwise, getting the negative values back is fairly trivial.

      @signed = map{ $_ > 32767 ? $_ - 65536 : $_ } @unsigned;

      So long as you're aware of it, no problem.
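      That sign fix applied to a few concrete samples: 0x8000 is 32768 unsigned, but as a sign-extended 16-bit two's complement value it is -32768.

      ```perl
      use strict;
      use warnings;

      my @unsigned = unpack 'n*', pack 'n*', 1, 0x8000, 0xFFFF;
      my @signed   = map { $_ > 32767 ? $_ - 65536 : $_ } @unsigned;
      # @signed is (1, -32768, -1)
      ```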

      That does bring up one other matter, though, related to your use of ord. Depending on how you're breaking out the values from your string, and which version of Perl you are using, ord has a trap waiting for the unwary using it to manipulate non-character data, namely UTF-8.

      In recent versions of Perl (>5.6.1, I think, but I'm not certain), ord can return values >255. Additionally, depending on what function you use and what regex (if any) you use, attempting to break a scalar into chars will not necessarily render bytes. split //, $binaryData; is going to attempt to treat your data as variable-length chars if it discovers anything in the data that looks like a UCS character. This also holds true for my @bytes = $binaryData =~ /./g; for instance, unless you take precautions to prevent it.

      You may already be aware of this and taking the appropriate steps, but as I fell into the trap myself very recently, I thought it worth mentioning.
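      One byte-safe way around the trap, as a sketch: unpack 'C*' always yields octet values in 0..255, so there is no chance of the data being reinterpreted as variable-length characters.

      ```perl
      use strict;
      use warnings;

      my $binaryData = "\x00\x1A\xFD\xFF";
      my @bytes = unpack 'C*', $binaryData;    # (0, 26, 253, 255)
      ```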
