in reply to Accessing individual bytes of a binary file as numerical values

Thanks for the input; I now have something that works. Using ord was very helpful. I have also modified it to use sysopen and sysseek, for some reason using sysread with the offset parameter doesn't work for me hence the sysseek. As for using CRCs, I know the checksum is weak, but I have just been tasked to automate an existing process, not to change the process, so I don't get to choose the method. I will look at the modulo arithmetic approach to see if I can get that working rather than the mess of packing and unpacking.
sub Checksum
{
    my @Arguments = @_ ;
    my $Start = $Arguments[0] ;
    my $Size = $Arguments[1] ;
    my $FileName = $Arguments[2] ;
    my $FileHandle ;

    unless ( sysopen $FileHandle, $FileName, 0 )
    {
        die "\nsysopen failed";
    }

    my $BufferByte = 0 ;
    my $Checksum = 0 ;
    my $ChecksumLoopCounter = 0 ;
    my $ChecksumByteOffset = $Start ;

    unless ( sysseek($FileHandle, $Start, 0) == $Start )
    {
        die "\nsysseek error" ;
    }

    while ( $ChecksumLoopCounter < $Size )
    {
        unless ( sysread ( $FileHandle, $BufferByte, 1) == 1 )
        {
            die "sysread size error" ;
        }
        my $Byte = ord($BufferByte);
        $Checksum = unpack("S" , pack("S", $Checksum + $Byte));
        $ChecksumLoopCounter++ ;
        $ChecksumByteOffset++ ;
    }
    close $FileHandle ;
    return $Checksum;
}

Re^2: Accessing individual bytes of a binary file as numerical values
by jwkrahn (Abbot) on Apr 24, 2019 at 19:03 UTC
    I have also modified it to use sysopen and sysseek, for some reason using sysread with the offset parameter doesn't work for me hence the sysseek.
    sysread FILEHANDLE,SCALAR,LENGTH,OFFSET
    sysread FILEHANDLE,SCALAR,LENGTH
        ... An OFFSET may be specified to place the read data at some place in the string other than the beginning. A negative OFFSET specifies placement at that many characters counting backwards from the end of the string. A positive OFFSET greater than the length of SCALAR results in the string being padded to the required size with "\0" bytes before the result of the read is appended.

    In other words, OFFSET determines where inside $BufferByte the data is placed, not where the data is read from in the file.
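The difference can be seen in a minimal sketch. Everything here is made up for illustration (the temporary file, its eight-byte content and the variable names); it only assumes the core File::Temp and Fcntl modules. The first read uses OFFSET and still takes its data from the start of the file; the second uses sysseek to move the file position before reading.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY SEEK_SET);
use File::Temp qw(tempfile);

# Hypothetical scratch file holding eight known bytes.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "ABCDEFGH";
close $tmp_fh or die "close failed: $!";

sysopen(my $fh, $tmp_name, O_RDONLY) or die "sysopen failed: $!";
binmode $fh;

# OFFSET (4th argument) positions the data inside the buffer.
# The file is still read from its current position, which is byte 0.
my $buffer = 'xy';
sysread($fh, $buffer, 2, 4) == 2 or die "sysread failed";
# $buffer now holds "xy", two "\0" padding bytes, then "AB".

# To read from a different place in the file, move the file position instead.
defined sysseek($fh, 4, SEEK_SET) or die "sysseek failed: $!";
sysread($fh, my $second, 2) == 2 or die "sysread failed";
# $second now holds "EF", the bytes at file offsets 4 and 5.

close $fh;
```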

Re^2: Accessing individual bytes of a binary file as numerical values
by AnomalousMonk (Archbishop) on Apr 24, 2019 at 17:57 UTC
    while ( $ChecksumLoopCounter < $Size )
    {
        unless ( sysread ( $FileHandle, $BufferByte, 1) == 1 )
        {
            die "sysread size error" ;
        }
        my $Byte = ord($BufferByte);
        $Checksum = unpack("S" , pack("S", $Checksum + $Byte));
        $ChecksumLoopCounter++ ;
        $ChecksumByteOffset++ ;
    }
    close $FileHandle ;
    return $Checksum;

    It goes without saying that I don't understand the requirements and constraints of your task. For some reason, you are constrained to read the file byte-by-byte with sysread et al. I must agree that trying to use pack and unpack in this situation is awkward. I would be inclined toward something like (untested):

    use integer;

    my $Checksum = 0;
    while ($Size--) {
        unless ( sysread ( $FileHandle, $BufferByte, 1) == 1 ) {
            die "sysread size error" ;
        }
        $Checksum += ord($BufferByte);
    }
    close $FileHandle ;
    return $Checksum & 0xffff;
    See integer.
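As a rough check that the three ways of keeping the sum to 16 bits agree, here is a small sketch that works on an in-memory string rather than a file. The sub names and the sample data are hypothetical; the pack/unpack version mirrors the original loop, the other two use modulo arithmetic and a final mask.

```perl
use strict;
use warnings;

# Wrap after every add via pack/unpack, as in the original loop.
sub checksum_pack {
    my ($data) = @_;
    my $sum = 0;
    $sum = unpack('S', pack('S', $sum + ord $_)) for split //, $data;
    return $sum;
}

# Wrap after every add with modulo arithmetic.
sub checksum_modulo {
    my ($data) = @_;
    my $sum = 0;
    $sum = ($sum + ord $_) % 65536 for split //, $data;
    return $sum;
}

# Add everything first, then mask once at the end.
sub checksum_mask {
    my ($data) = @_;
    use integer;
    my $sum = 0;
    $sum += ord $_ for split //, $data;
    return $sum & 0xffff;
}

# 300 bytes of 0xFF: 300 * 255 = 76500, and 76500 - 65536 = 10964.
my $sample = "\xFF" x 300;
print join(' ', checksum_pack($sample),
                checksum_modulo($sample),
                checksum_mask($sample)), "\n";
```

Masking once at the end gives the same result as wrapping on every add only while the running sum itself does not overflow.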


    Give a man a fish:  <%-{-{-{-<

Re^2: Accessing individual bytes of a binary file as numerical values
by Marshall (Canon) on Apr 25, 2019 at 06:43 UTC
    My brain hurts with all this pack, unpack gibberish.
    You have a simple situation.
    I will attempt to explain: my @bytes = unpack ('C*',$buf); is the key thing you need to understand.
    #!/usr/bin/perl
    use strict;
    use warnings;

    #                      #     !     /     u     s     r     /     b     i     n     /
    my $buf = pack ('C*', (0x23, 0x21, 0x2F, 0x75, 0x73, 0x72, 0x2f, 0x62, 0x69, 0x6E, 0x2f));

    # $buf is now a sequence of "characters" which are 8 bit unsigned numbers
    # Don't go all crazy with me about multi-byte characters
    # Here "C" means 8 bits, one byte
    # The ASCII characters corresponding to those numbers are:
    # #!/usr/bin/

    # to get a particular character or a group of characters
    # from the $buf string, use substr()
    #
    # substr EXPR,OFFSET,LENGTH,REPLACEMENT
    # substr() is often used in conjunction with unpack to generate
    # a particular numeric value with different byte ordering of say
    # a 16 or 32 or 64 bit value

    print substr($buf,3,3);  #prints: usr
    print "\n";
    print substr($buf,7,4);  #prints: bin/
    print "\n";

    # Now translate each byte in $buf into an array..
    # Each 8 bit character will be represented on my
    # computer as a 64 bit signed value in an array of
    # what are named "bytes"

    my @bytes = unpack ('C*',$buf);

    # @bytes is now an array of numbers!
    # in decimal values:
    print "Decimal Values of \@bytes\n";
    print "$_ " for @bytes;
    print "\n";

    # Now to print those bytes in character context:
    print "Character values of \@bytes\n";
    print chr($_) for @bytes;
    print "\n";
    __END__
    usr
    bin/
    Decimal Values of @bytes
    35 33 47 117 115 114 47 98 105 110 47
    Character values of @bytes
    #!/usr/bin/

    Update:
    I looked back over some code from 2001:
    This modifies the header of a wave file.
    - creates byte strings in the correct order ("little endian")
    - then replaces, in 2 places, the 32 bit values in the buffer with new values
    - then writes the buffer with the new size parameters

    Hope this helps give you some ideas...
    How to replace a 32 bit value in a binary buffer with a new value...
    I might do it differently now, but after 18 years, this code still makes "sense"

    my $rsize = pack("V4", $new_riff_size);  # "V4" means Vax or Intel "little endian"
    substr($buff,4,4) = substr($rsize,0,4);
    my $data_size = pack("V4", $new_data_size);
    substr($buff,54,4)= substr($data_size,0,4);
    print OUTBIN substr($buff,0,58);
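The same patching idea can be tried on a small in-memory buffer. This is only an illustration, not the 2001 code: the 12-byte buffer, the offset 4 and the value 0x12345678 are made up, and the sketch uses a plain "V" (one 32-bit little-endian value) with the 4-argument form of substr, which is one way it might be written today.

```perl
use strict;
use warnings;

# Hypothetical 12-byte buffer standing in for the start of a file header.
my $buff = "RIFF" . ("\0" x 4) . "WAVE";

# Hypothetical new value for the 32-bit field at byte offset 4.
my $new_size = 0x12345678;

# "V" packs one unsigned 32-bit value, least significant byte first.
my $packed = pack('V', $new_size);

# 4-argument substr replaces 4 bytes in place; the buffer length is unchanged.
substr($buff, 4, 4, $packed);

# Show the patched bytes in buffer order, then read the value back.
my $hex       = join ' ', map { sprintf '%02x', $_ } unpack('C4', substr($buff, 4, 4));
my $read_back = unpack('V', substr($buff, 4, 4));

print "$hex\n";                  # bytes appear as 78 56 34 12
printf "0x%08x\n", $read_back;   # 0x12345678
```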
Re^2: Accessing individual bytes of a binary file as numerical values
by Marshall (Canon) on Apr 25, 2019 at 03:15 UTC
    I am quite astonished at the apparent requirement to checksum a "window" of the input file. Normally one would calculate a checksum on the whole file, or on the whole file minus 16 bits, and then compare the checksums. Skipping X bytes from the beginning seems odd to me. The code below only went through the bare minimum of "kick the tires" testing, but I did at least run it on the first 2 bytes of the source file (not much testing, ha!).

    some suggestions:

    1. Change the order of the interface to put the noun (the file name), which must always be there, first. Then put the adjectives, like the window offset and window size parameters. It could be that suitable default values can be worked out for those?
    2. I just used the normal read and seek functions instead of the sysread, etc. functions.
    3. A seek operation will cause any buffers to be flushed (if needed). Seeking to the current position is often used to flush write data to the disk in certain types of files. Seek can be a very expensive thing - be careful with that.
    4. Disk files are normally written in increments of 512 bytes. Each 512 byte chunk is too small for the filesystem to keep track of, so 8 of these things get amalgamated together as 4096 bytes. The file system tracks chunks of 4096 bytes (..usually nowadays...). In general, try to read chunks of at least 4096 bytes from a hard disk. That is an increment that is likely to make the file system "happy". In general, an OS call makes your process eligible for re-scheduling. This can slow the execution time of your program quite a bit.
    5. I found some of your coding constructs confusing - you may like the way I did it or not...
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $BUFFSIZE = 4096 *1;

    sub Checksum
    {
        my ($FileName, $Start_byte, $Size) = @_;

        open (my $fh, '<', $FileName) or die "unable to open $FileName for read $!";
        binmode $fh;

        #This is truly bizarre! Checksum does not start at beginning of file!
        #
        seek ($fh, $Start_byte, 0) or die "Cannot seek to $Start_byte on $FileName $!";

        my $check_sum =0;

        # Allow for checksum only on a "window" of the input file, i.e. $Size may be
        # much smaller than size_of_file - start_byte!  Another Bizarre requirement!!

        while ($Size >0)
        {
            my $n_byte_request = ($BUFFSIZE > $Size) ? $Size : $BUFFSIZE;
            my $n_bytes_read = read($fh, my $buff, $n_byte_request);

            die "file system error binary read for $FileName" unless defined $n_bytes_read;
            die "premature EOF on $FileName checksum block size too big for actual file"
                if ($n_bytes_read < $n_byte_request);

            my @bytes = unpack('C*', $buff);   #input string of data are 8 bit unsigned ints

            # check_sum is at least a 32 bit signed int. masking to 16 bits
            # after every add probably not needed, but maybe.
            $check_sum += $_ for @bytes;

            $Size -= $n_bytes_read;
        }
        close $fh;
        $check_sum &= 0xFFFF;  #Truncate to 16 bits, probably have to do this more often...
        return $check_sum;
    }

    my $chk = Checksum('BinaryCheckSum.pl', 0,2);
    print $chk;  #prints 68 decimal, 0x23 + 0x21, "#!"
    Minor Update: I thought a bit about truncating the checksum. The max value of 8 unsigned bits is 0xFF, or 255 in decimal. The max positive value of a 32 bit signed int is 0x7FFFFFFF, or decimal 2,147,483,647. If every byte were the maximum 8 bit unsigned value, how many bytes would it take to overflow a 32 bit signed int? 2,147,483,647 / 255 ~ 8.4 million. At that size of file, a checksum is absolutely worthless. I conclude that truncating the $check_sum after the calculation is good enough. If the OP is using this on 4-8MB files, that is a VERY bad idea.
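The arithmetic behind that estimate can be reproduced in a few lines. This is just a worked calculation under the same worst-case assumption (every byte is 0xFF, accumulator limited to a signed 32 bit value); the variable names are made up for the example.

```perl
use strict;
use warnings;

my $max_byte  = 0xFF;          # largest 8 bit unsigned value, 255
my $max_int32 = 0x7FFFFFFF;    # largest signed 32 bit value, 2,147,483,647

# Number of worst-case bytes that can be summed without passing the limit.
my $bytes_before_overflow = int($max_int32 / $max_byte);

print "$bytes_before_overflow bytes\n";                        # 8421504 bytes
printf "%.2f MiB\n", $bytes_before_overflow / (1024 * 1024);   # 8.03 MiB
```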

      For some reason, Chris01234 seems resolutely committed to byte-by-byte access to the data block, but if you're going to use a block-by-block approach, why not just
          $check_sum += unpack('%16C*', $buff);
      (see unpack) and then bitwise-and mask to 16 bits before returning? (I'd also use the biggest buffer I could get away with. 10MB? 100MB?)
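A minimal sketch of that block-by-block variant follows. To keep it self-contained it reads from an in-memory filehandle instead of a real file, and the sub name, the 4096-byte buffer size and the sample data are all assumptions for illustration. The '%16C*' template makes unpack return the sum of the bytes in the block, reduced to 16 bits.

```perl
use strict;
use warnings;

# Sum all bytes readable from $fh, one block at a time, and return the low 16 bits.
sub block_checksum {
    my ($fh, $buffsize) = @_;
    my $check_sum = 0;
    while (1) {
        my $n_bytes_read = read($fh, my $buff, $buffsize);
        die "read error: $!" unless defined $n_bytes_read;
        last if $n_bytes_read == 0;              # end of data
        $check_sum += unpack('%16C*', $buff);    # 16 bit sum of this block
    }
    return $check_sum & 0xFFFF;                  # mask the total before returning
}

# Hypothetical test data: 10000 bytes of 0xFF, read through an in-memory handle.
my $data = "\xFF" x 10000;
open(my $fh, '<', \$data) or die "cannot open in-memory handle: $!";
binmode $fh;

# 10000 * 255 = 2550000, and 2550000 - 38 * 65536 = 59632.
my $sum = block_checksum($fh, 4096);
print "$sum\n";
close $fh;
```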


      Give a man a fish:  <%-{-{-{-<

        Well, for one thing, we aren't sure (or at least I'm not sure) that the OP's application uses files in increments of 16 bits. Maybe this is a download to an 8-bit uP? The OP's spec appears to be a bizarre thing where only a "window" of the binary file is check-summed. I've never seen anything like that before.

        I am not at all sure that performance is an issue here at all! I think the goal should be clear, understandable code and then work on performance optimization later.

        I just made a post to try to explain a simple way to do what needs to be done.
        If that doesn't work, I at least feel like I tried.

        BTW: increasing the multiple of 4K bytes may not help all that much in terms of execution performance. This depends upon the O/S and file system AND how many other disk intensive processes are running at the same time. Depending upon the OS and file system, these 4K units can wind up being scattered around over the disk surface. Your process is not calculating while it is waiting for the disk to finish some very large request. Sometimes there is an optimal "quantum" of info to request. I think that discussion is well beyond the scope of the original question.