Chris01234 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, this is my first post here as a novice Perl user. I am attempting to compute a checksum for sections of a binary file; each checksum is a 16-bit sum of 8-bit values. I've got the overflowing 16-bit summation working with pack and unpack, but I'm struggling to access the bytes of the file. I've tried various combinations of packing, unpacking, using the buffer value directly, etc. Could anyone point me in the right direction? Thanks.
    sub Checksum {
        my @Arguments = @_ ;
        my $Start = $Arguments[0] ;
        my $Size = $Arguments[1] ;
        my $FileHandle = $Arguments[2] ;
        my $BufferByte = 0 ;
        my $Checksum = 0 ;
        my $ChecksumLoopCounter = 0 ;
        my $ChecksumByteOffset = $Start ;
        while ( $ChecksumLoopCounter < $Size ) {
            unless ( sysread ( $FileHandle, $BufferByte, 1, $ChecksumByteOffset ) == 1 ) {
                die "sysread size error" ;
            }
            my $Byte = pack("C",unpack("C",$BufferByte) );
            print "\nByte: ".$Byte;#." BufferByte: ".$BufferByte ;
            $Checksum = pack("S", unpack ("S", $CheckSum) + $Byte ) ; # This line implements the looping 16 bit sum
            $ChecksumLoopCounter += 1 ;
            $ChecksumByteOffset += 1 ;
        }
        return $Checksum;
    }

Replies are listed 'Best First'.
Re: Accessing individual bytes of a binary file as numerical values
by Fletch (Bishop) on Apr 24, 2019 at 12:10 UTC

    Perhaps ord and chr? Or maybe not turn the unpack'd value right back into a character?

        $ perl -lE 'say join( q{,}, unpack( "C*", qq{Hello} ) )'
        72,101,108,108,111
        $ perl -lE 'say join( q{,}, map ord, split(q{},qq{Hello}) )'
        72,101,108,108,111
        $ perl -lE 'say pack( "C*", 72, 101, 108, 108, 111 )'
        Hello

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: Accessing individual bytes of a binary file as numerical values
by vr (Curate) on Apr 24, 2019 at 15:31 UTC

    (1) Perl can calculate checksum for you, see unpack:

    In addition to fields allowed in pack, you may prefix a field with a %<number> to indicate that you want a <number>-bit checksum of the items instead of the items themselves. Default is a 16-bit checksum. The checksum is calculated by summing numeric values of expanded values

    As:

        >perl -wE "say unpack '%8C*', qq(\x{01}\x{02}\x{ff})"
        2

    Here, I calculate 8-bit checksum only to demonstrate that "it works", you probably should write something like:

        sysread ( $FileHandle, $Buffer, $Size, $ChecksumByteOffset ) or die;
        $Checksum = unpack '%C*', $Buffer;

    Unless you need to read really huge chunks. But even then, read larger pieces than just byte by byte.
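A minimal sketch of the chunked variant hinted at above, assuming a throwaway temp file as input. The sub name window_checksum and the 64 KiB default chunk size are made up for illustration. It positions the file pointer with sysseek, because the fourth argument of sysread is an offset into the buffer rather than into the file:

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY SEEK_SET);
use File::Temp qw(tempfile);

# Hypothetical helper: 16-bit sum of $size bytes starting at file byte $start.
sub window_checksum {
    my ( $fh, $start, $size, $chunk ) = @_;
    $chunk ||= 65536;    # bytes requested per sysread; an arbitrary choice
    defined sysseek( $fh, $start, SEEK_SET ) or die "sysseek failed: $!";
    my $sum = 0;
    while ( $size > 0 ) {
        my $want = $size < $chunk ? $size : $chunk;
        my $got  = sysread( $fh, my $buffer, $want );
        die "sysread failed: $!"     unless defined $got;
        die "unexpected end of file" if $got == 0;
        # %16C* sums the byte values of this chunk modulo 2**16
        $sum = ( $sum + unpack( '%16C*', $buffer ) ) & 0xFFFF;
        $size -= $got;
    }
    return $sum;
}

# Demo on a throwaway file holding the bytes 0x00 .. 0xFF.
my ( $out, $name ) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} pack( 'C*', 0 .. 255 );
close $out;

sysopen( my $in, $name, O_RDONLY ) or die "sysopen failed: $!";
print window_checksum( $in, 1, 3 ), "\n";    # bytes 1, 2, 3 sum to 6
close $in;
```

Masking after every chunk keeps the running total small regardless of how large the window is.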

    (2) Do you run your code under use warnings;? You should, to catch if not this, then possibly other, more dangerous errors.

        >perl -wE "say 1 + unpack 'S', 0"
        Use of uninitialized value in addition (+) at -e line 1.
        1

    Though the expected net result is correct, the initial value of your accumulator is 0, i.e. a string ("0") that is too short for the first call to unpack to produce anything but undef in scalar context.

    Hm-m, documentation says actually:

    If there are more pack codes or if the repeat count of a field or a group is larger than what the remainder of the input string allows, the result is not well defined: the repeat count may be decreased, or unpack may produce empty strings or zeros, or it may raise an exception.

    (3) Unless you use recipe (1) (or use it in modified form, reading chunk by chunk) I'd suggest using modulo arithmetic directly instead of tricks with packing:

        use strict;
        use warnings;
        use Benchmark 'cmpthese';

        cmpthese -3, {
            packing => sub { my $sum = 0; $sum = unpack("S", pack("S", $sum + $_)) for 0..1e6 },
            modulo  => sub { my $sum = 0; $sum = ($sum + $_) % 65536 for 0..1e6 },
        }
        __END__
                  Rate packing  modulo
        packing 1.67/s      --    -71%
        modulo  5.79/s    247%      --

    Here, I modified your code so that $sum stores checksum in human readable form. Do you really need your subroutine to return result in machine representation? Looks strange to me. Especially since you initialize accumulator with 0, not "\x{00}\x{00}".
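A minimal sketch of the modulo approach applied to byte data, assuming the checksum is kept as an ordinary number and packed only if some caller needs the 2-byte machine form. The sub name checksum16 is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: 16-bit wrapping sum kept as an ordinary number.
sub checksum16 {
    my ($bytes) = @_;
    my $sum = 0;
    $sum = ( $sum + $_ ) % 65536 for unpack 'C*', $bytes;
    return $sum;
}

my $sum = checksum16( "\xff" x 258 );    # 258 * 255 = 65790, which wraps to 254
print "$sum\n";

# Convert to the 2-byte machine representation only when it is needed.
my $packed = pack 'S', $sum;
print length($packed), "\n";             # 2
```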

Re: Accessing individual bytes of a binary file as numerical values
by roboticus (Chancellor) on Apr 24, 2019 at 12:36 UTC

    Chris01234:

    It looks like Fletch gave you the info you need to decode the bytes and make your checksum.

    I wanted to mention that simply adding bytes together is a weak method of detecting file changes. If you have two bytes transposed, for example, you'll get the same checksum. Similarly, if there are extra zeroes in the section you're checksumming, that would also give you the same checksum. So if you're trying to detect accidental changes, I'd suggest you at least look at a simple CRC. If you want to detect possible malicious changes, though, you'd want something even stronger, as it's easy enough to modify a file to generate any CRC you'd like.
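The two weaknesses described above can be checked in a few lines. This is only a sketch; the sample byte strings and the sub name sum16 are made up for illustration:

```perl
use strict;
use warnings;

# 16-bit additive checksum of a byte string (the scheme from the question).
sub sum16 { return unpack '%16C*', $_[0] }

my $original   = "\x01\x02\x03\x04";
my $transposed = "\x02\x01\x03\x04";            # first two bytes swapped
my $padded     = "\x01\x02\x00\x00\x03\x04";    # two zero bytes inserted

printf "%d %d %d\n", sum16($original), sum16($transposed), sum16($padded);
# prints 10 10 10
```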

    Finally, there are modules out there that can help you. If you go to http://cpan.org and put Digest or Checksum in the search bar, you'll find various modules you could use to generate your checksum. (One I've used, for example, is Digest::MD5.)
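As a sketch of what using such a module looks like, here is Digest::MD5 (shipped with the core distribution) applied to the transposition case described above. The sample byte strings are made up for illustration:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $data    = "\x01\x02\x03\x04";
my $swapped = "\x02\x01\x03\x04";    # same bytes, first two transposed

# md5_hex returns the digest as a 32-character hexadecimal string.
print md5_hex($data),    "\n";
print md5_hex($swapped), "\n";
print "digests differ\n" if md5_hex($data) ne md5_hex($swapped);
```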

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Accessing individual bytes of a binary file as numerical values
by Chris01234 (Novice) on Apr 24, 2019 at 16:34 UTC
    Thanks for the input, I now have something that works. Using ord was very helpful, I have also modified it to use sysopen and sysseek, for some reason using sysread with the offset parameter doesn't work for me hence the sysseek. As for using CRCs, I know the checksum is weak but I have just been tasked to automate an existing process, not to change the process so I don't get to choose the method. I will look at the modulo arithmetic approach to see if I can get that working rather than the mess of packing and unpacking.
        sub Checksum {
            my @Arguments = @_ ;
            my $Start = $Arguments[0] ;
            my $Size = $Arguments[1] ;
            my $FileName = $Arguments[2] ;
            my $FileHandle ;
            unless ( sysopen $FileHandle, $FileName, 0 ) {
                die "\nsysopen failed";
            }
            my $BufferByte = 0 ;
            my $Checksum = 0 ;
            my $ChecksumLoopCounter = 0 ;
            my $ChecksumByteOffset = $Start ;
            unless ( sysseek($FileHandle, $Start, 0) == $Start ) {
                die "\nsysseek error" ;
            }
            while ( $ChecksumLoopCounter < $Size ) {
                unless ( sysread ( $FileHandle, $BufferByte, 1) == 1 ) {
                    die "sysread size error" ;
                }
                my $Byte = ord($BufferByte);
                $Checksum = unpack("S" , pack("S", $Checksum + $Byte));
                $ChecksumLoopCounter++ ;
                $ChecksumByteOffset++ ;
            }
            close $FileHandle ;
            return $Checksum;
        }
      I have also modified it to use sysopen and sysseek, for some reason using sysread with the offset parameter doesn't work for me hence the sysseek.
          sysread FILEHANDLE,SCALAR,LENGTH,OFFSET
          sysread FILEHANDLE,SCALAR,LENGTH
              ... An OFFSET may be specified to place the read data at some place in the string other than the beginning. A negative OFFSET specifies placement at that many characters counting backwards from the end of the string. A positive OFFSET greater than the length of SCALAR results in the string being padded to the required size with "\0" bytes before the result of the read is appended.

      In other words, OFFSET determines where inside $BufferByte the data is placed, not where the data is read from in the file.
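A small sketch of that difference, assuming a throwaway temp file with known contents (the file and variable names are made up for illustration). With a nonzero OFFSET the byte just read is no longer the first character of the buffer, which would be consistent with the symptom reported above:

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY SEEK_SET);
use File::Temp qw(tempfile);

# Throwaway file containing the ten bytes "0123456789".
my ( $out, $name ) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} '0123456789';
close $out;

sysopen( my $fh, $name, O_RDONLY ) or die "sysopen failed: $!";

# OFFSET = 3 says where the data lands in $buffer, not where it comes from.
my $buffer = 'abc';
defined sysread( $fh, $buffer, 2, 3 ) or die "sysread failed: $!";
print "$buffer\n";    # abc01 (file bytes 0 and 1, placed at index 3)

# To read from file position 5, move the file pointer first.
defined sysseek( $fh, 5, SEEK_SET ) or die "sysseek failed: $!";
defined sysread( $fh, my $window, 2 ) or die "sysread failed: $!";
print "$window\n";    # 56
close $fh;
```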

          while ( $ChecksumLoopCounter < $Size ) {
              unless ( sysread ( $FileHandle, $BufferByte, 1) == 1 ) {
                  die "sysread size error" ;
              }
              my $Byte = ord($BufferByte);
              $Checksum = unpack("S" , pack("S", $Checksum + $Byte));
              $ChecksumLoopCounter++ ;
              $ChecksumByteOffset++ ;
          }
          close $FileHandle ;
          return $Checksum;

      It goes without saying that I don't understand the requirements and constraints of your task. For some reason, you are constrained to read the file byte-by-byte with sysread et al. I must agree that trying to use pack and unpack in this situation is awkward. I would be inclined toward something like (untested):

          use integer;
          my $Checksum = 0;
          while ($Size--) {
              unless ( sysread ( $FileHandle, $BufferByte, 1) == 1 ) {
                  die "sysread size error" ;
              }
              $Checksum += ord($BufferByte);
          }
          close $FileHandle ;
          return $Checksum & 0xffff;
      See integer.


      Give a man a fish:  <%-{-{-{-<

      My brain hurts with all this pack, unpack gibberish.
      You have a simple situation.
      I will attempt to explain: my @bytes = unpack ('C*',$buf); is the key thing you need to understand.
          #!/usr/bin/perl
          use strict;
          use warnings;

          #                         #     !     /     u     s     r     /     b     i     n     /
          my $buf = pack ('C*', (0x23, 0x21, 0x2F, 0x75, 0x73, 0x72, 0x2f, 0x62, 0x69, 0x6E, 0x2f));
          # $buf is now a sequence of "characters" which are 8 bit unsigned numbers
          # Don't go all crazy with me about multi-byte characters
          # Here "C" means 8 bits, one byte
          # The ASCII characters corresponding to those numbers are:
          # #!/usr/bin/

          # to get a particular character or a group of characters
          # from the $buf string, use substr()
          #
          # substr EXPR,OFFSET,LENGTH,REPLACEMENT
          # substr() is often used in conjunction with unpack to generate
          # a particular numeric value with different byte ordering of say
          # a 16 or 32 or 64 bit value

          print substr($buf,3,3); #prints: usr
          print "\n";
          print substr($buf,7,4); #prints: bin/
          print "\n";

          # Now translate each byte in $buf into an array..
          # Each 8 bit character will be represented on my
          # computer as a 64 bit signed value in an array of
          # what are named "bytes"
          my @bytes = unpack ('C*',$buf);

          # @bytes is now an array of numbers!
          # in decimal values:
          print "Decimal Values of \@bytes\n";
          print "$_ " for @bytes;
          print "\n";

          # Now to print those bytes in character context:
          print "Character values of \@bytes\n";
          print chr($_) for @bytes;
          print "\n";
          __END__
          usr
          bin/
          Decimal Values of @bytes
          35 33 47 117 115 114 47 98 105 110 47
          Character values of @bytes
          #!/usr/bin/

      Update:
      I looked back over some code from 2001:
      This modifies the header of a wave file.
      - creates byte strings in the correct order ("little endian")
      - then replaces, in 2 places, the 32 bit values in the buffer with new values
      - then writes the buffer with the new size parameters

      Hope this helps give you some ideas...
      How to replace a 32 bit value in a binary buffer with a new value...
      I might do it differently now, but after 18 years, this code still makes "sense"

          my $rsize = pack("V4", $new_riff_size); # "V4" means Vax or Intel "little endian"
          substr($buff,4,4) = substr($rsize,0,4);

          my $data_size = pack("V4", $new_data_size);
          substr($buff,54,4)= substr($data_size,0,4);

          print OUTBIN substr($buff,0,58);
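A self-contained sketch of the same idea, assuming a made-up 12-byte buffer in place of a real wave file header and a hypothetical new size value. It uses 'V' without a repeat count, which packs a single 32-bit little-endian value, together with 4-argument substr:

```perl
use strict;
use warnings;

# A made-up 12-byte buffer standing in for the start of a RIFF/WAVE header.
my $buff = 'RIFF' . pack( 'V', 36 ) . 'WAVE';

my $new_riff_size = 0x01020304;    # hypothetical new size value

# pack 'V' yields exactly 4 bytes, least significant byte first;
# 4-argument substr overwrites bytes 4..7 of the buffer in place.
substr( $buff, 4, 4, pack( 'V', $new_riff_size ) );

print join( ' ', map { sprintf '%02x', $_ } unpack 'C*', substr( $buff, 4, 4 ) ), "\n";
# prints 04 03 02 01
print unpack( 'V', substr( $buff, 4, 4 ) ), "\n";    # 16909060
```

Because the packed string is already 4 bytes long, no trimming with a second substr is needed.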
      I am quite astonished at the apparent requirement to checksum a "window" of the input file. Normally one would calculate a checksum on the whole file, or the whole file minus 16 bits, and then compare the checksums. Skipping X bytes of offset from the beginning seems odd to me. The code below only went through the bare minimum of "kick the tires", but I did run it at least on the first 2 bytes of the source file (not much testing, ha!).

      some suggestions:

      1. Change the order of the interface to put the noun (the file name), which always must be there, first. Then put the adjectives, like the window offset and window size parameters. It could be that suitable default values can be worked out for those?
      2. I just used the normal read and seek functions instead of the sysread, etc. functions.
      3. A seek operation will cause any buffers to be flushed (if needed). Seek to current position is often used to cause flush of write data to the disk in certain types of files. Seek can be a very expensive thing - be careful with that.
      4. Disk files are normally written in increments of 512 bytes. Each 512 byte chunk is too small for the filesystem to keep track of, so 8 of these things get amalgamated together as 4096 bytes. The file system tracks chunks of 4096 bytes (..usually nowadays...). In general, try to read chunks of at least 4096 bytes from a hard disk. That is an increment that is likely to make the file system "happy". In general, an OS call makes your process eligible for re-scheduling. This can slow the execution time of your program quite a bit.
      5. I found some of your coding constructs confusing - you may like the way I did it or not...
          #!/usr/bin/perl
          use strict;
          use warnings;

          my $BUFFSIZE = 4096 *1;

          sub Checksum {
              my ($FileName, $Start_byte, $Size) = @_;

              open (my $fh, '<', $FileName) or die "unable to open $FileName for read $!";
              binmode $fh;

              #This is truly bizarre! Checksum does not start at beginning of file!
              #
              seek ($fh, $Start_byte, 0) or die "Cannot seek to $Start_byte on $FileName $!";

              my $check_sum =0;

              # Allow for checksum only on a "window" of the input file, i.e. $Size may be
              # much smaller than size_of_file - start_byte! Another Bizarre requirement!!

              while ($Size >0) {
                  my $n_byte_request = ($BUFFSIZE > $Size) ? $Size : $BUFFSIZE;

                  my $n_bytes_read = read($fh, my $buff, $n_byte_request);

                  die "file system error binary read for $FileName" unless defined $n_bytes_read;
                  die "premature EOF on $FileName checksum block size too big for actual file"
                      if ($n_bytes_read < $n_byte_request);

                  my @bytes = unpack('C*', $buff); #input string of data are 8 bit unsigned ints

                  # check_sum is at least a 32 bit signed int. masking to 16 bits
                  # after every add probably not needed, but maybe.
                  $check_sum += $_ for @bytes;

                  $Size -= $n_bytes_read;
              }
              close $fh;
              $check_sum &= 0xFFFF; #Truncate to 16 bits, probably have to do this more often...
              return $check_sum;
          }

          my $chk = Checksum('BinaryCheckSum.pl', 0,2);
          print $chk; #prints 68 decimal, 0x23 + 0x21, "#!"
      Minor Update: I thought a bit about truncating the checksum. The max value of 8 unsigned bits is 0xFF or 255 in decimal. Max positive value of a 32 bit signed int is 0x7FFFFFFF or decimal 2,147,483,647. If every byte was the maximum 8 bit unsigned value, how many bytes would it take to overflow a 32 bit signed int? 2,147,483,647 / 255 ~ 8.4 million. At that size of file, a checksum is absolutely worthless. I conclude that truncating the $check_sum after the calculation is good enough. If the OP is using this on 4-8MB files, that is a VERY bad idea.
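The arithmetic above can be checked with a short sketch. The variable names and the 1000-byte sample are made up for illustration, and the 32-bit limit is only the assumed worst case (many Perl builds use 64-bit integers):

```perl
use strict;
use warnings;

my $max_byte  = 0xFF;          # 255, largest 8-bit unsigned value
my $max_int32 = 0x7FFFFFFF;    # 2_147_483_647, largest 32-bit signed value

# Largest count of 0xFF bytes whose plain sum still fits in 32 signed bits.
my $safe_bytes = int( $max_int32 / $max_byte );
print "$safe_bytes\n";         # 8421504

# Below that size, masking once at the end matches masking after every add.
my ( $mask_last, $mask_each ) = ( 0, 0 );
for my $byte ( (0xFF) x 1000 ) {
    $mask_last += $byte;
    $mask_each = ( $mask_each + $byte ) & 0xFFFF;
}
$mask_last &= 0xFFFF;
print "$mask_last $mask_each\n";    # 58392 58392
```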

        For some reason, Chris01234 seems resolutely committed to byte-by-byte access to the data block, but if you're going to use a block-by-block approach, why not just
            $check_sum += unpack('%16C*', $buff);
        (see unpack) and then bitwise-and mask to 16 bits before returning? (I'd also use the biggest buffer I could get away with. 10MB? 100MB?)
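A sketch comparing the approaches discussed in this thread on the same in-memory data; the test data and the 4096-byte block size are arbitrary choices for illustration. All three should produce the same 16-bit value:

```perl
use strict;
use warnings;

# Arbitrary test data: 100_000 bytes cycling through 0x00 .. 0xFF.
my $data  = join '', map { chr( $_ % 256 ) } 0 .. 99_999;
my $block = 4096;    # hypothetical block size

# (a) byte by byte with ord, masking at the end
my $by_ord = 0;
$by_ord += ord($_) for split //, $data;
$by_ord &= 0xFFFF;

# (b) block by block with unpack's built-in 16-bit checksum
my $by_block = 0;
for ( my $pos = 0 ; $pos < length($data) ; $pos += $block ) {
    $by_block += unpack '%16C*', substr( $data, $pos, $block );
}
$by_block &= 0xFFFF;

# (c) one unpack over the whole buffer
my $whole = unpack '%16C*', $data;

print "$by_ord $by_block $whole\n";
```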


        Give a man a fish:  <%-{-{-{-<

Re: Accessing individual bytes of a binary file as numerical values
by karlgoethebier (Abbot) on Apr 25, 2019 at 12:40 UTC
    "...quite astonished at the apparent requirement to checksum a 'window'...calculate a checksum on the whole file..." (Marshall)

    #MeToo. Probably this is yet another sick assignment by some sick professor?

    For a more serious/practical use of checksums you might be happier using digest from Path::Tiny:

        my $object = path($file)->digest($algorithm);

    The default algorithm is SHA-256.

    Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help