in reply to Re^2: Accessing individual bytes of a binary file as numerical values
in thread Accessing individual bytes of a binary file as numerical values

For some reason, Chris01234 seems resolutely committed to byte-by-byte access to the data block, but if you're going to use a block-by-block approach, why not just
    $check_sum += unpack('%16C*', $buff);
(see unpack) and then bitwise-and mask to 16 bits before returning? (I'd also use the biggest buffer I could get away with. 10MB? 100MB?)
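As a rough sketch of what I mean (the file name and 64K block size are placeholders, not anything from the OP):
    use strict;
    use warnings;

    my $file = 'data.bin';    # placeholder file name
    open my $fh, '<:raw', $file or die "Can't open '$file': $!";

    my $check_sum = 0;
    while (sysread $fh, my $buff, 65536) {      # 64K blocks
        $check_sum += unpack '%16C*', $buff;    # 16-bit checksum of this block
    }
    close $fh;

    $check_sum &= 0xFFFF;    # mask the accumulated sum back to 16 bits
    printf "checksum: 0x%04X\n", $check_sum;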


Give a man a fish:  <%-{-{-{-<

Re^4: Accessing individual bytes of a binary file as numerical values
by Marshall (Canon) on Apr 25, 2019 at 07:13 UTC
    Well, for one thing, we aren't sure (or at least I'm not sure) that the OP's application uses files in increments of 16 bits. Maybe this is a download to an 8-bit uP? The OP's spec appears to be a bizarre thing where only a "window" of the binary file is check-summed. I've never seen anything like that before.

    I am not at all sure that performance is an issue here! I think the goal should be clear, understandable code first, with performance optimization later.

    I just made a post to try to explain a simple way to do what needs to be done.
    If that doesn't work, I at least feel like I tried.

    BTW: increasing the multiple of 4K bytes may not help all that much in terms of execution performance. This depends upon the OS and file system AND how many other disk-intensive processes are running at the same time. Depending upon the OS and file system, these 4K units can wind up being scattered around over the disk surface. Your process is not calculating while it is waiting for the disk to finish some very large request. Sometimes there is an optimal "quantum" of info to request. I think that discussion is well beyond the scope of the original question.

      Well, for one thing, we aren't sure (or at least I'm not sure) that the OP's application uses files in increments of 16 bits. Maybe this is a download to an 8-bit uP?

      I don't understand the point you're making here. To me, the OPer seems to want a 16-bit (more precisely, I guess, an unsigned 16-bit) sum of unsigned bytes ('C' data elements in terms of pack) from a subsection of a file. The OS or underlying hardware doesn't seem to matter.

      The OP's spec appears to be a bizarre thing where only a "window" of the binary file is check-summed. I've never seen anything like that before.

      I once worked with an application that used data files that had several subsections that each had a simple 32-bit checksum of bytes. We had to generate these files, and subsequently extract and verify the subsections for use. And yes, we recognized that a simple 32-bit checksum offered little "security", and no, we had no option to change anything, so I have some sympathy for what Chris01234 may be facing.

      I am not at all sure that performance is an issue here! I think the goal should be clear, understandable code first, with performance optimization later.

      I agree, and that's the sort of solution I tried to offer here, with Chris01234's byte-by-byte code altered as little as possible to achieve what seemed to me a significant gain in clarity (and no pack/unpack need apply!). But if you're going to introduce a block-processing approach, the use of unpack's '%' checksumming feature is, to me, ideally simple and clear, and it's also documented; see unpack and Doing Sums in perlpacktut. If this simple, clear code also performs well, so much the better.
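      Just to sketch what that block-wise '%' approach might look like over a subsection of a file (the subroutine name, block size, and error handling are my own assumptions, not anything from Chris01234's spec):
          use strict;
          use warnings;

          # 16-bit sum of unsigned bytes over a window of $length bytes at offset $start
          sub window_checksum {
              my ($file, $start, $length) = @_;
              open my $fh, '<:raw', $file or die "Can't open '$file': $!";
              sysseek $fh, $start, 0 or die "Can't seek: $!";    # 0 == SEEK_SET

              my $sum = 0;
              while ($length > 0) {
                  my $want = $length < 65536 ? $length : 65536;
                  my $got  = sysread $fh, my $buff, $want;
                  die "Read error: $!" unless defined $got;
                  last if $got == 0;    # premature EOF
                  $sum    += unpack '%16C*', $buff;
                  $length -= $got;
              }
              close $fh;
              return $sum & 0xFFFF;     # mask to 16 bits before returning
          }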

      I don't know anything about Chris01234's personal odyssey among the Perlish islands of sysread and pack/unpack and the demons he or she met there, but I'll bet it was "interesting", and now all Chris01234 wants to do is get back home and start coding C# again.


      Give a man a fish:  <%-{-{-{-<

        Well, for one thing, we aren't sure (or at least I'm not sure) that the OP's application uses files in increments of 16 bits. Maybe this is a download to an 8-bit uP?

        I did not express that very well. I was looking at some solutions that generated a 16-bit checksum but were processing the data 16 bits at a time. I think that here the sum should be a 16-bit value, but each addition to that sum should be only 8 bits (one unsigned byte), as in the snippet below. The code should allow an odd number of bytes: 51, or whatever.
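        A minimal sketch of that intent, assuming $buff already holds the bytes to be summed ('C*' is indifferent to odd lengths):
            my $sum = 0;
            # add one unsigned byte (8 bits) at a time, keeping the sum to 16 bits
            $sum = ($sum + $_) & 0xFFFF for unpack 'C*', $buff;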

        I agree that there appear to be plenty of workable solutions presented in this thread. Even the byte-by-byte approach will work if you don't have to do it very often and how long it takes doesn't really matter.

        Update: A few comments about buffer size... Bigger is not always "better". In my experience, increasing the buffer size helps up until a certain point; after that, no gain is apparent. I recommend increments of 4K bytes (4096) because that is likely to be a "unit" that the file system deals with most naturally (as explained above). I suspect that the "sweet spot" in terms of buffer size is likely to be 16 or 32 Kbytes. Typically, going way bigger than that won't hurt, but it won't help either. When I really care, I make the buffer size a variable and do some benchmarking, as sketched below. My advice is the result of my benchmarking experience on the OSes and systems that I commonly use; mileage certainly does vary. Note that with a truly huge buffer, it is possible to see a dramatic slowdown if the buffer causes swapping back and forth to the disk.
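        Something along these lines is what I mean by benchmarking (Time::HiRes is core; the file name is a placeholder, and note that OS file caching will flatter every pass after the first):
            use strict;
            use warnings;
            use Time::HiRes qw(gettimeofday tv_interval);

            my $file = 'data.bin';    # placeholder test file
            for my $buf_size (4096, 16_384, 32_768, 65_536, 1_048_576) {
                open my $fh, '<:raw', $file or die "Can't open '$file': $!";
                my $t0  = [gettimeofday];
                my $sum = 0;
                while (sysread $fh, my $buff, $buf_size) {
                    $sum += unpack '%16C*', $buff;
                }
                close $fh;
                printf "%9d bytes/read: %.3f s (checksum 0x%04X)\n",
                    $buf_size, tv_interval($t0), $sum & 0xFFFF;
            }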