in reply to Count byte/character occurrence (quickly)

While your version took 12 seconds for 16MB on my system:

C:\test>dir 1123355.bin
14/04/2015  16:50        16,777,216 1123355.bin

C:\test>1159245 1123355.bin
Took 12.897612 secs

(That was the third run so the cache was primed.)

This version took:

C:\test>1159245 1123355.bin
Took 3.832763 secs
  : 3762666
☺ : 46120
☻ : 43642
♥ : 44106
♦ : 43878
...

The code:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

my $start = time;

open I, '<:raw', $ARGV[ 0 ];

my @seen;
while( read( I, my $buf, 16384 ) ) {
    ++$seen[$_] for unpack 'C*', $buf;
}

printf "Took %f secs\n", time() - $start;
printf "%c : %u\n", $_, $seen[$_] for 0 .. 255;
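For anyone who hasn't used unpack 'C*' before, here is a minimal standalone sketch of the same tallying idea applied to an in-memory string; the sample string and the printable-character formatting are purely illustrative:

use strict;
use warnings;

my $buf = "Hello, PerlMonks!";          # any byte string
my @seen;
++$seen[ $_ ] for unpack 'C*', $buf;    # one unpack call yields every byte value

for my $byte ( grep { $seen[ $_ ] } 0 .. 255 ) {
    printf "%3u (%s) : %u\n",
        $byte,
        ( $byte >= 32 && $byte < 127 ) ? chr $byte : '.',
        $seen[ $byte ];
}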

Re^2: Count byte/character occurrence (1/4)
by hippo (Archbishop) on Apr 01, 2016 at 10:46 UTC
    while( read( I, my $buf, 16484  ) ) {

    The choice of buffer size here is intriguing. I would have guessed that the ideal would be some exact multiple of the block size. What is the thinking behind the 100 byte excess?

      What is the thinking behind the 100 byte excess?

      A typo :) Now corrected. (It didn't make much difference in terms of speed, which surprises me.)



        That makes sense. For a 16MB file the script is only doing around a thousand reads at 16K each. If each read pulls in only one extra block, and that extra block is always the consecutive one, it should make only a small difference to the run time. Anyway, thanks for the clarification - I thought I must have missed some clever trick.
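        (For anyone curious, a quick sketch along these lines would let you compare a few buffer sizes on the same file; the sizes below are only illustrative.)

        use strict;
        use warnings;
        use Time::HiRes qw[ time ];

        my $file = $ARGV[ 0 ];                         # e.g. the 16MB test file

        for my $size ( 4096, 16384, 16484, 65536 ) {
            open my $fh, '<:raw', $file or die "open '$file': $!";
            my $start = time;
            my @seen;
            while( read( $fh, my $buf, $size ) ) {
                ++$seen[ $_ ] for unpack 'C*', $buf;
            }
            printf "buffer %6u : %f secs\n", $size, time() - $start;
            close $fh;
        }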

Re^2: Count byte/character occurrence (1/4)
by james28909 (Deacon) on Apr 01, 2016 at 15:58 UTC
    Wow! That's a lot faster. I didn't even think about unpack. That code runs in ~2.6 seconds on my machine. Thanks!
Re^2: Count byte/character occurrence (1/4)
by james28909 (Deacon) on Apr 04, 2016 at 20:58 UTC
    Away from my pc right now, but would something with a for loop and substr be faster? I don't necessarily have to unpack any bytes, do I?
      would something with a for loop and substr be faster? I don't necessarily have to unpack any bytes, do I?

      That requires a call into C (substr) for every byte, whereas using unpack requires a single call for the entire string.

      The cardinal rule for optimising Perl code is to get perl's built-ins to do as much of the work as you can.

      Using this loop:

      ++$seen[ ord chop $buf ] while length $buf;
      in place of the unpack loop is almost but not quite as fast.

      It trades two built-in calls per byte against the cost of building the unpack return list on the stack, and loses by a hair.
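      For the curious, a rough Benchmark sketch along these lines would compare the three approaches on a single in-memory buffer; the random test data and the buffer length are only illustrative, and the relative numbers will vary by machine:

      use strict;
      use warnings;
      use Benchmark qw[ cmpthese ];

      # One buffer's worth of random bytes (illustrative only).
      my $buf = join '', map chr( int rand 256 ), 1 .. 16384;

      cmpthese( -3, {
          'unpack' => sub {
              my @seen;
              ++$seen[ $_ ] for unpack 'C*', $buf;
          },
          'chop' => sub {
              my @seen;
              my $copy = $buf;                       # chop is destructive
              ++$seen[ ord chop $copy ] while length $copy;
          },
          'substr' => sub {
              my @seen;
              ++$seen[ ord substr $buf, $_, 1 ] for 0 .. length( $buf ) - 1;
          },
      } );

      I'd expect unpack to come out in front and the substr-per-byte version well behind, for exactly the reason above, though the margins will depend on the perl build and the machine.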

