in reply to Re: Accessing individual bytes of a binary file as numerical values
in thread Accessing individual bytes of a binary file as numerical values

I am quite astonished at the apparent requirement to checksum a "window" of the input file. Normally one would calculate a checksum on the whole file, or on the whole file minus its last 16 bits, and then compare the checksums. Skipping X bytes of offset at the beginning seems odd to me. The code below only went through the bare minimum of "kick the tires"... but I did run it, at least on the first 2 bytes of the source file (not much testing, ha!).

Some suggestions:

  1. Change the order of the interface to put the noun (the file name), which must always be there, first; then put the adjectives, i.e. the window offset and window size parameters. It may be that suitable default values can be worked out for those (see the sketch just after this list).
  2. I just used the normal read and seek functions instead of sysread and friends.
  3. A seek operation will cause any buffers to be flushed (if needed). A seek to the current position is often used to force write data out to disk with certain types of files. Seek can be a very expensive operation - be careful with that.
  4. Disk files are normally written in increments of 512 bytes. Each 512-byte sector is too small for the file system to track individually, so 8 of them get amalgamated into a 4096-byte block; the file system tracks chunks of 4096 bytes (usually, nowadays). In general, try to read at least 4096-byte chunks from a hard disk - that is an increment likely to "make the file system happy". Also keep in mind that an OS call makes your process eligible for re-scheduling, which can slow the execution of your program quite a bit.
  5. I found some of your coding constructs confusing - you may like the way I did it or not...
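Regarding suggestion 1, here is a minimal sketch of what the reordered interface with defaulted window parameters might look like (the sub name, the chosen defaults, and the use of //= - which needs Perl 5.10+ - are my assumptions, not anything from the OP's spec); the real work would be done by the full code below:

use strict;
use warnings;

# Sketch only: the file name (the noun) comes first; the window offset and
# size default to "the whole file" when the caller omits them.
sub Checksum_defaults {
    my ($FileName, $Start_byte, $Size) = @_;
    $Start_byte //= 0;                               # default: start of file
    $Size       //= (-s $FileName) - $Start_byte;    # default: rest of the file
    # ... open, seek, read and sum the window as in the full code below ...
    return ($Start_byte, $Size);                     # placeholder return for the sketch
}

my ($start, $size) = Checksum_defaults('BinaryCheckSum.pl');
print "window is $size bytes starting at byte $start\n";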
#!/usr/bin/perl
use strict;
use warnings;

my $BUFFSIZE = 4096 * 1;

sub Checksum
{
    my ($FileName, $Start_byte, $Size) = @_;

    open (my $fh, '<', $FileName) or die "unable to open $FileName for read $!";
    binmode $fh;

    # This is truly bizarre! Checksum does not start at beginning of file!
    seek ($fh, $Start_byte, 0) or die "Cannot seek to $Start_byte on $FileName $!";

    my $check_sum = 0;

    # Allow for checksum only on a "window" of the input file, i.e. $Size may be
    # much smaller than size_of_file - start_byte!  Another Bizarre requirement!!
    while ($Size > 0)
    {
        my $n_byte_request = ($BUFFSIZE > $Size) ? $Size : $BUFFSIZE;
        my $n_bytes_read   = read($fh, my $buff, $n_byte_request);
        die "file system error binary read for $FileName" unless defined $n_bytes_read;
        die "premature EOF on $FileName checksum block size too big for actual file"
            if ($n_bytes_read < $n_byte_request);

        my @bytes = unpack('C*', $buff);   # input string of data are 8 bit unsigned ints

        # check_sum is at least a 32 bit signed int. Masking to 16 bits
        # after every add is probably not needed, but maybe.
        $check_sum += $_ for @bytes;
        $Size      -= $n_bytes_read;
    }
    close $fh;

    $check_sum &= 0xFFFF;   # Truncate to 16 bits, probably have to do this more often...
    return $check_sum;
}

my $chk = Checksum('BinaryCheckSum.pl', 0, 2);
print $chk;   # prints 68 decimal, 0x23 + 0x21, "#!"
Minor Update: I thought a bit about truncating the checksum. The maximum value of 8 unsigned bits is 0xFF, or 255 in decimal. The maximum positive value of a 32-bit signed int is 0x7FFFFFFF, or 2,147,483,647 in decimal. If every byte were the maximum 8-bit unsigned value, how many bytes would it take to overflow a 32-bit signed int? 2,147,483,647 / 255 ~ 8.4 million. At that size of file a simple checksum is absolutely worthless anyway, so I conclude that truncating $check_sum once, after the calculation, is good enough. If the OP is using this on 4-8 MB files, though, truncating only at the end could be a VERY bad idea.
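If files anywhere near that size were ever in play, one possible tweak (my assumption, not something the OP asked for) is to mask the running sum after every buffer instead of only once at the end; per-buffer masking does not change the final 16-bit result, and it keeps the accumulator far below the 32-bit limit. A tiny self-contained demonstration:

use strict;
use warnings;

# Mask the running sum to 16 bits after every buffer: the intermediate total
# then never exceeds 0xFFFF plus one buffer's worth of 0xFF bytes, so a 32-bit
# accumulator cannot overflow no matter how large the window gets.
my $check_sum = 0;
for my $buff ("\xFF" x 4096, "\xFF" x 4096) {    # two worst-case 4K buffers
    $check_sum += $_ for unpack('C*', $buff);
    $check_sum &= 0xFFFF;                        # truncate to 16 bits each pass
}
print "$check_sum\n";    # 0xFF * 8192 = 2,088,960, which is 57344 modulo 2**16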

Replies are listed 'Best First'.
Re^3: Accessing individual bytes of a binary file as numerical values
by AnomalousMonk (Archbishop) on Apr 25, 2019 at 06:38 UTC

    For some reason, Chris01234 seems resolutely committed to byte-by-byte access to the data block, but if you're going to use a block-by-block approach, why not just
        $check_sum += unpack('%16C*', $buff);
    (see unpack) and then bitwise-and mask to 16 bits before returning? (I'd also use the biggest buffer I could get away with. 10MB? 100MB?)
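    For what it's worth, a minimal sketch of that block-by-block idea (the sub name, the buffer size, and the test file/window values are only examples of mine, not anything from the OP):

        use strict;
        use warnings;

        # Block-by-block 16-bit checksum of a window of a file, using unpack's
        # '%' checksum feature; per-block 16-bit sums are accumulated and the
        # total is masked to 16 bits at the end.
        sub checksum_window {
            my ($file, $start, $size, $bufsize) = @_;
            $bufsize ||= 1024 * 1024;                  # as big a buffer as you like
            open my $fh, '<:raw', $file or die "can't open $file: $!";
            seek $fh, $start, 0 or die "can't seek to $start in $file: $!";
            my $sum = 0;
            while ($size > 0) {
                my $want = $size < $bufsize ? $size : $bufsize;
                my $got  = read $fh, my $buff, $want;
                die "read error on $file: $!" unless defined $got;
                die "premature EOF on $file"  unless $got == $want;
                $sum  += unpack '%16C*', $buff;        # 16-bit checksum of this block
                $size -= $got;
            }
            close $fh;
            return $sum & 0xFFFF;                      # final mask to 16 bits
        }

        print checksum_window('BinaryCheckSum.pl', 0, 2), "\n";   # e.g. 68 for "#!"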


    Give a man a fish:  <%-{-{-{-<

      Well, for one thing, we aren't sure (or at least I'm not sure) that the OP's application uses files in increments of 16 bits. Maybe this is a download to an 8-bit uP? The OP's spec appears to be a bizarre thing where only a "window" of the binary file is check-summed. I've never seen anything like that before.

      I am not at all sure that performance is an issue here at all! I think the goal should be clear, understandable code and then work on performance optimization later.

      I just made a post to try to explain a simple way to do what needs to be done.
      If that doesn't work, I at least feel like I tried.

      BTW: increasing the multiple of 4K bytes may not help all that much in terms of execution performance. It depends upon the OS and file system AND upon how many other disk-intensive processes are running at the same time. Depending upon the OS and file system, those 4K units can wind up scattered around the disk surface. Your process is not calculating while it is waiting for the disk to finish some very large request. Sometimes there is an optimal "quantum" of info to request. I think that discussion is well beyond the scope of the original question.

        Well, for one thing, we aren't sure (or at least I'm not sure) that the OP's application uses files in increments of 16 bits. Maybe this is a download to an 8-bit uP?

        I don't understand the point you're making here. To me, the OPer seems to want a 16-bit | more precisely, I guess, an unsigned 16-bit | sum of unsigned bytes ('C' data elements in terms of pack) from a subsection of a file. The OS or underlying hardware doesn't seem to matter.

        The OP's spec appears to be a bizarre thing where only a "window" of the binary file is check-summed. I've never seen anything like that before.

        I once worked with an application that used data files that had several subsections that each had a simple 32-bit checksum of bytes. We had to generate these files, and subsequently extract and verify the subsections for use. And yes, we recognized that a simple 32-bit checksum offered little "security", and no, we had no option to change anything, so I have some sympathy for what Chris01234 may be facing.

        I am not at all sure that performance is an issue here at all! I think the goal should be clear, understandable code and then work on performance optimization later.

        I agree, and that's the sort of solution I tried to offer here, with Chris01234's byte-by-byte code being altered as little as possible to achieve what seemed to me a significant increase in clarity – and no pack/unpack need apply! But if you're going to introduce a block-processing approach, the use of unpack's '%' checksumming feature is, to me, ideally simple and clear, and it's also documented; see unpack and Doing Sums in perlpacktut. If this simple, clear code also performs well, so much the better.

        I don't know anything about Chris01234's personal odyssey among the Perlish islands of sysread and pack/unpack and the demons he or she met there, but I'll bet it was "interesting", and now all Chris01234 wants to do is get back home and start coding C# again.


        Give a man a fish:  <%-{-{-{-<