I am quite astonished at the apparent requirement to checksum a "window" of the input file. Normally one would calculate a checksum over the whole file, or the whole file minus 16 bits, and then compare the checksums. Skipping X bytes from the beginning seems odd to me. The code below has only had the bare minimum of tire-kicking, but I did at least run it on the first 2 bytes of the source file (not much testing, ha!).
Some suggestions:
- Change the order of the interface to put the noun (the file name), which must always be there, first. Then put the adjectives, like the window offset and window size parameters. Perhaps suitable default values can be worked out for those?
- I just used the normal read and seek functions instead of the sysread, etc. functions.
- A seek operation will cause any buffers to be flushed (if needed). Seeking to the current position is often used to force buffered write data out to disk for certain types of files. A seek can be a very expensive thing, so be careful with that.
- Disk files are normally written in increments of 512 bytes. A 512 byte chunk is too small for the file system to keep track of, so 8 of them get amalgamated together as 4096 bytes, and the file system tracks chunks of 4096 bytes (usually, nowadays). In general, try to read chunks of at least 4096 bytes from a hard disk; that is an increment that is likely to make the file system "happy". Also, an OS call makes your process eligible for re-scheduling, so many small reads can slow the execution time of your program quite a bit.
- I found some of your coding constructs confusing - you may like the way I did it or not...
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $BUFFSIZE = 4096 * 1;

    sub Checksum
    {
        my ($FileName, $Start_byte, $Size) = @_;

        open (my $fh, '<', $FileName) or die "unable to open $FileName for read $!";
        binmode $fh;

        # This is truly bizarre! Checksum does not start at beginning of file!
        seek ($fh, $Start_byte, 0) or die "Cannot seek to $Start_byte on $FileName $!";

        my $check_sum = 0;

        # Allow for checksum only on a "window" of the input file, i.e. $Size may be
        # much smaller than size_of_file - start_byte! Another bizarre requirement!!
        while ($Size > 0)
        {
            my $n_byte_request = ($BUFFSIZE > $Size) ? $Size : $BUFFSIZE;
            my $n_bytes_read   = read($fh, my $buff, $n_byte_request);

            die "file system error binary read for $FileName" unless defined $n_bytes_read;
            die "premature EOF on $FileName checksum block size too big for actual file"
                if ($n_bytes_read < $n_byte_request);

            my @bytes = unpack('C*', $buff);  # input string of data are 8 bit unsigned ints

            # check_sum is at least a 32 bit signed int. Masking to 16 bits
            # after every add probably not needed, but maybe.
            $check_sum += $_ for @bytes;
            $Size -= $n_bytes_read;
        }
        close $fh;

        $check_sum &= 0xFFFF;  # Truncate to 16 bits, probably have to do this more often...
        return $check_sum;
    }

    my $chk = Checksum('BinaryCheckSum.pl', 0, 2);
    print $chk;  # prints 68 decimal, 0x23 + 0x21, "#!"
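As for the first suggestion (file name first, defaults for the window), here is a minimal sketch of how that interface might look. The name checksum_window and the chosen defaults (offset 0, size "rest of the file") are my own assumptions, not anything from the original code. It also uses unpack's built-in checksum prefix ('%32C*') to sum each chunk instead of building a list of bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $CHUNK = 4096;

# Hypothetical wrapper: file name first, then optional offset and size.
sub checksum_window {
    my ($file_name, $offset, $size) = @_;

    my @st = stat($file_name) or die "cannot stat $file_name: $!";
    my $file_size = $st[7];

    $offset //= 0;                       # assumed default: start of file
    $size   //= $file_size - $offset;    # assumed default: rest of file
    die "window does not fit inside $file_name"
        if $offset < 0 || $size < 0 || $offset + $size > $file_size;

    open(my $fh, '<:raw', $file_name)
        or die "unable to open $file_name for read: $!";
    seek($fh, $offset, 0) or die "cannot seek to $offset on $file_name: $!";

    my $sum = 0;
    while ($size > 0) {
        my $want = $size < $CHUNK ? $size : $CHUNK;
        my $got  = read($fh, my $buff, $want);
        die "read error on $file_name: $!" unless defined $got;
        die "premature EOF on $file_name"  if $got < $want;
        $sum  += unpack('%32C*', $buff); # 32 bit sum of this chunk's bytes
        $size -= $got;
    }
    close $fh;
    return $sum & 0xFFFF;                # truncate to 16 bits
}

# Demo on a throwaway file containing the five bytes "#!abc".
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "#!abc";
close $tmp_fh;

print checksum_window($tmp_name), "\n";        # whole file
print checksum_window($tmp_name, 0, 2), "\n";  # first two bytes, 0x23 + 0x21
print checksum_window($tmp_name, 2), "\n";     # offset 2 to end of file
```

With this ordering the common case is just checksum_window($file), and the window parameters only show up when they are actually needed.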
Minor Update: I thought a bit about truncating the checksum. The max value of 8 unsigned bits is 0xFF, or 255 in decimal. The max positive value of a 32 bit signed int is 0x7FFFFFFF, or decimal 2,147,483,647. If every byte were the maximum 8 bit unsigned value, how many bytes would it take to overflow a 32 bit signed int? 2,147,483,647 / 255 ~ 8.4 million. At that file size, a simple 16 bit checksum is absolutely worthless anyway. I conclude that truncating $check_sum once after the calculation is good enough. If the OP is using this on 4-8MB files, that is a VERY bad idea.
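The arithmetic above can be checked with a few lines of Perl. This is only a sketch: the 10,000 byte test string and the variable names are made up for illustration. It also shows that masking to 16 bits after every chunk gives the same result as masking once at the end, as long as the running sum never overflows (on typical 64 bit perls the integer range is far larger than 32 bits, so there is even more headroom than the worst case computed here).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Worst case: every byte is 0xFF. How many bytes until the running sum
# no longer fits in a signed 32 bit int?
my $max_byte          = 0xFF;
my $max_i32           = 0x7FFFFFFF;
my $bytes_to_overflow = int($max_i32 / $max_byte) + 1;
print "$bytes_to_overflow bytes of 0xFF exceed a signed 32 bit sum\n";

# Made-up sample data: 10,000 bytes cycling through all byte values.
my $data = join '', map { chr($_ % 256) } 1 .. 10_000;

my ($mask_at_end, $mask_each_chunk) = (0, 0);
for (my $pos = 0; $pos < length $data; $pos += 4096) {
    # '%32C*' asks unpack for a 32 bit sum of the bytes in the chunk.
    my $chunk_sum = unpack('%32C*', substr($data, $pos, 4096));

    $mask_at_end    += $chunk_sum;                                # mask later
    $mask_each_chunk = ($mask_each_chunk + $chunk_sum) & 0xFFFF;  # mask now
}
$mask_at_end &= 0xFFFF;

print $mask_at_end == $mask_each_chunk
    ? "both masking strategies agree\n"
    : "masking strategies differ\n";
```

Since addition followed by a mask is just arithmetic modulo 65536, the two strategies can only diverge once the unmasked sum has lost bits, which is the overflow case discussed above.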