Remember to binmode text files

This is just a very short FYI for anyone who is interested in why read() would refuse to read the "right" number of bytes from a file. In the following snippet I am slurping an entire file and verifying that I read the expected number of bytes. I was getting the error message "Read 3157 but expected 3158 bytes from 002971DD46D2CB2286256BAC002C26FB.xml", and it didn't immediately occur to me that, because of the implicit newline handling on text files, a single \r character had been removed, so my $got value differed from $expected.

One application of binmode() later and my code worked. This isn't a revelation; perlfunc already documents this. But since I had to think about it for a moment today, I figured I'd share and let this be a reminder to anyone else who might get something out of it.

sub readfile {
    my $fname = shift;
    my $fdata;
    open XML, "<", $fname or die "Couldn't open $fname: $!";
    binmode XML or die "Couldn't binmode $fname: $!";
    
    my $expected = -s XML;
    my $got = read XML, $fdata, $expected;
    $expected == $got or
      die "Read $got but expected $expected bytes from $fname: $!";
    close XML or die "Couldn't close $fname: $!";

    return \ $fdata;
}
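
The effect is easy to reproduce on any platform by asking for the translation explicitly. The following is only a sketch, not the code from my script: `slurp_length` is a made-up helper name, the temporary file stands in for the XML file, and the `:crlf` layer stands in for whatever text-mode layer your platform applies by default.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small file with explicit CRLF line endings, byte for byte.
my ($tmp, $fname) = tempfile(UNLINK => 1);
binmode $tmp or die "Couldn't binmode $fname: $!";
print {$tmp} "line one\r\nline two\r\n" or die "Couldn't write $fname: $!";
close $tmp or die "Couldn't close $fname: $!";

# Hypothetical helper: slurp $fname through the given PerlIO layer and
# return the number of characters that came back.
sub slurp_length {
    my ($fname, $layer) = @_;
    open my $fh, "<$layer", $fname or die "Couldn't open $fname: $!";
    my $data = do { local $/; <$fh> };
    close $fh or die "Couldn't close $fname: $!";
    return length $data;
}

my $on_disk = -s $fname;                      # size reported by the file system
my $as_text = slurp_length($fname, ':crlf');  # CRLF pairs collapse to "\n"
my $as_raw  = slurp_length($fname, ':raw');   # bytes exactly as stored

print "on disk: $on_disk, text mode: $as_text, binary mode: $as_raw\n";
```

The file holds two CRLF pairs, so the text-mode count comes out two lower than what `-s` reports, while the raw count matches it.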

Re: Remember to binmode text files (wrong test/conclusion) by tye (Sage) on Jun 10, 2003 at 19:13 UTC
Rather than binmode text files, you should instead learn that "file size" only equals "number of bytes when the file is read into memory" when the file is a simple stream of bytes. Although that is very common on Unix, it is much less common outside of Unix.

For another example where this isn't true, consider Unix directories. They are files, but they don't simply contain a stream of bytes, and the "size" won't match the number of bytes you get back when you "read" them (some Unix systems will let you read a directory as a stream of bytes, but that isn't what you are supposed to do with them).

A great many types of systems don't routinely store files as simple streams of bytes (and even some that support that won't report file size to match your expectations). It is quite common to have files recorded as a series of records. And record separators can have a length of 0 (for fixed-length records, for example), or a longer length (such as preceding the record by the length of the record), or even a variable length (such as when records are indexed).

Now, Unix takes a minimalist approach (which I think turned out to be a really good idea) and implements any of the above schemes on top of the file system's idea that all files are simply a stream of bytes. So when you read an ordinary file on Unix, you just get that same stream of bytes.

But these other systems track record boundaries "outside" of the data of the file (which allows you to put a "\n" inside your record, which probably doesn't seem like a big deal to you since you've spent your entire computing lifespan thinking about files as streams of bytes). This file metadata may or may not be included in the "size" that `-s` gives back to you. Whether it is or not is really a matter of convenience/efficiency. Even non-ordinary files on Unix don't store simple streams of bytes.

In Unix, the file isn't actually stored as a stream of bytes. It is probably stored as a bunch of sectors thrown willy-nilly about the disk. But the Unix file system presents these to the program/programmer as a stream of bytes. So even when a Unix file has a chunk missing from the middle that is not recorded to disk, Unix zero-fills it when the file is read and also shows the "file size" as the number of bytes that you'd have after this has been done, so your comparison still succeeds in this case.

So please, just stop comparing "number of bytes read" to what `-s` says. It isn't portable. Even if you use binmode, you'll run into (somewhat rare) cases where this doesn't work. Even when you have an ordinary file on Unix, there are race conditions to consider.

binmode on text files is usually a bad idea. Comparing `-s` to number of bytes read is always a bad idea in my book.

- tye
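
One way to slurp a file without ever consulting `-s` is to keep calling read until it reports end of file. This is a minimal sketch of that idea rather than anything tye posted: `slurp_until_eof`, the 65536-byte chunk size, and the temporary test file are all assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file in fixed-size chunks until EOF,
# never comparing against -s. Pass a true second argument to binmode.
sub slurp_until_eof {
    my ($fname, $binary) = @_;
    open my $fh, '<', $fname or die "Couldn't open $fname: $!";
    if ($binary) {
        binmode $fh or die "Couldn't binmode $fname: $!";
    }

    my $data = '';
    while (1) {
        # Append at the current end of $data. read returns the number of
        # characters read, 0 at end of file, and undef on error.
        my $got = read $fh, $data, 65536, length $data;
        die "Couldn't read $fname: $!" unless defined $got;
        last if $got == 0;
    }
    close $fh or die "Couldn't close $fname: $!";
    return \$data;
}

# Usage: build a file larger than one chunk, then slurp it back.
my ($tmp, $fname) = tempfile(UNLINK => 1);
binmode $tmp or die "Couldn't binmode $fname: $!";
print {$tmp} "x" x 200_000 or die "Couldn't write $fname: $!";
close $tmp or die "Couldn't close $fname: $!";

my $ref = slurp_until_eof($fname, 1);
print "Read ", length($$ref), " bytes from $fname\n";
```

Errors are still detected, because read returns undef when something goes wrong, but a short count is no longer treated as a failure in itself.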
Re: Remember to binmode text files by zentara (Cardinal) on Jun 11, 2003 at 21:06 UTC
I get uniform results with either text files or binary files with a "byte count". Devel::Size gives different results though???

#!/usr/bin/perl
use warnings;
use Devel::Size qw(size total_size);

my $file = shift || $0;
my $statfilesize = -s $file;
print "statfilesize $file ->$statfilesize\n";

my $count=0;
open (FH,"<$file") or die "Couldn't open: $!";
my $buf = do {local $/; <FH>};
close FH;

foreach my $byte (unpack "C*", $buf) { $count++ }
print "count->$count -> total size of slurped string $file-> ",
    size(\$buf),"\n";