in reply to Perl Hashes in C?

This drops into Inline::C to explore the bitmap and construct the hash. It takes 23 seconds to process a 125 megapixel image:

#! perl -slw
use strict;
use Inline C => Config => BUILD_NOISY => 1;
use Inline C => <<'END_C', NAME => '_1138218', CLEAN_AFTER_BUILD => 0;

typedef unsigned __int64 U64;
typedef unsigned int     U32;
typedef unsigned char    U8;

HV* countColors( SV *img ) {
    STRLEN l;
    U32 *rgba = (U32*)SvPVx( img, l ), i;
    HV* counts = newHV();
    l /= 4;
    for( i = 0; i < l; ++i ) {
        if( hv_exists( counts, (char*)&rgba[ i ], 4 ) ) {
            SV **val = hv_fetch( counts, (char*)&rgba[ i ], 4, 0 );
            SvIV_set( *val, SvIV( *val ) + 1 );
        }
        else {
            SV *val = newSViv( 1 );
            hv_store( counts, (char*)&rgba[ i ], 4, val, 0 );
        }
    }
    return counts;
}
END_C

use Time::HiRes qw[ time ];
use Data::Dump qw[ pp ];
use List::Util qw[ sum ];
use GD;

GD::Image->trueColor(1);

my $i  = GD::Image->new( $ARGV[ 0 ] ) or die $!;
my $gd = $i->gd;
my( undef, $width, $height, undef, undef ) = unpack 'nnnCV', substr( $gd, 0, 11, '' );
printf "Width:%u height:%u pixels:%u\n", $width, $height, length( $gd ) / 4;

my $start = time;
my $href  = countColors( $gd );
printf "Took %f seconds\n", time() - $start;

printf "Found %u colors in %u pixels\n",
    scalar( keys %{ $href } ), sum( values %{ $href } );
__END__
C:\test>1138218 mid.png
Width:12800 height:10240 pixels:131072000
Took 23.391081 seconds
Found 81814 colors in 131072000 pixels

The handling of the hash could be made more efficient by only calculating the hash-values once instead of several times; and by pre-extending the hash.
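Purely as a sketch (not code from the thread), in the same Inline::C setting as above: an lvalue hv_fetch gives one lookup per pixel, and hv_ksplit (the call behind Perl's keys(%h) = N) pre-extends the bucket array. The 100000 below is only an illustrative guess at the colour count, not a tuned value:

HV* countColorsPresized( SV *img ) {
    STRLEN l;
    U32 *rgba = (U32*)SvPVx( img, l ), i;
    HV* counts = newHV();
    l /= 4;
    /* Pre-extend the hash; 100000 is just an illustrative guess at the colour count. */
    hv_ksplit( counts, 100000 );
    for( i = 0; i < l; ++i ) {
        /* lval fetch creates an undef entry for a new key, so each pixel needs
           only one hash lookup instead of exists + fetch (or exists + store). */
        SV **val = hv_fetch( counts, (char*)&rgba[ i ], 4, 1 );
        sv_setiv( *val, SvOK( *val ) ? SvIV( *val ) + 1 : 1 );
    }
    return counts;
}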


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

Re^2: Perl Hashes in C? (Twice as fast)
by BrowserUk (Patriarch) on Aug 12, 2015 at 19:56 UTC

    A small change doubled the performance:

    HV* countColors2( SV *img ) {
        STRLEN l;
        U32 *rgba = (U32*)SvPVx( img, l ), i;
        HV* counts = newHV();
        l /= 4;
        for( i = 0; i < l; ++i ) {
            SV **val;
            if( val = hv_fetch( counts, (char*)&rgba[ i ], 4, 0 ) ) {
                SvIV_set( *val, SvIV( *val ) + 1 );
            }
            else {
                SV *val = newSViv( 1 );
                hv_store( counts, (char*)&rgba[ i ], 4, val, 0 );
            }
        }
        return counts;
    }
    C:\test>1138218 mid.png
    Width:12800 height:10240 pixels:131072000
    Took 12.642578 seconds
    Found 81814 colors in 131072000 pixels

Re^2: Perl Hashes in C?
by BrianP (Acolyte) on Aug 12, 2015 at 05:10 UTC

    These are TrueColor. I am dealing with 281 TRILLION possible colors: full 16 bits/channel, 48 bits/pixel.

    "Found 81814 colors in 131072000 pixels" -> 1602 Pixels per color.

    The 216MB Photoshop RAW/16 file had 27 MILLION unique colors out of 36M: "Pixels=36152321, unique Colors=27546248=76.19%".

    76% of the pixels have unique colors! This makes your hashing algorithm rehash everything when it lands on a dup.

    I am monkeying with the MAX_UNSORTED parameter, which determines when a sort has to be done after so many new, random colors have been piled on top of the lookup table.

    I had it set at a way, way too low 200. I wrote a Perl script to run the C program with varying MAX_UNSORTED values and am seeing vastly better performance; 3805 is the best so far. The linear searches on top of the pile are pretty cheap compared to QSorting and merging.

    With a 1-in-3 sampling (12M of 36M), I have it down to < 46 seconds with 88.55% unique colors.

    The larger the number of unique colors, the more it pays to leave a pile of unsorted colors on top.

    The one I did before was a ColorMatch colorspace and it had ~76% unique colors. This one is ProPhoto and is over 85%! Same NEF file, same ACR settings, no Photoshop other than to import from ACR and save as RAW.

    It looks like I need to work on the Sort_Merge. QSort on the entire 27-million-tall stack, 99% already sorted, was taking 98% of the program time. The shuffle_merge is 100 times faster on this problem.
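    For anyone following along, a minimal Perl sketch of that idea (the names MAX_UNSORTED, add_color and merge_tail are made up here, and the counting/duplicate handling of the real program is left out): keep the main table sorted, let new colours pile up unsorted and be searched linearly, and only sort-and-merge that small tail once it crosses the threshold, instead of re-sorting the whole 99%-sorted array every time:

    use strict;
    use warnings;

    my $MAX_UNSORTED = 3805;    # threshold value from the post; purely illustrative here
    my @sorted   = ();          # main lookup table, kept in ascending order
    my @unsorted = ();          # new colours piled on top, searched linearly

    sub add_color {
        my ( $color ) = @_;
        push @unsorted, $color;
        merge_tail() if @unsorted > $MAX_UNSORTED;
    }

    sub merge_tail {
        # Sort only the small tail, then merge it into the already-sorted table.
        my @tail = sort { $a <=> $b } @unsorted;
        my @merged;
        my ( $i, $j ) = ( 0, 0 );
        while ( $i < @sorted && $j < @tail ) {
            push @merged, $sorted[ $i ] <= $tail[ $j ] ? $sorted[ $i++ ] : $tail[ $j++ ];
        }
        push @merged, @sorted[ $i .. $#sorted ], @tail[ $j .. $#tail ];
        @sorted   = @merged;
        @unsorted = ();
    }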

      The 216MB Photoshop RAW/16 file had 27 MILLION unique colors out of 36M

      This is Perl creating a hash with 27 million keys:

      [0] Perl> $t = time; ++$h{$_} for 1 .. 27e6; print time() - $t;;
      52.9575479030609

      53 seconds!

      76% of the pixels have unique colors! This makes your hashing algorithm rehash everything when it lands on a dup.

      Sorry, but if you mean "rehash every preexisting key", you are wrong. Why do you believe that?

      (If you mean something else by the highlighted phrase above, you're gonna have to explain yourself better.)

      The beauty of hashing for this application is that it doesn't matter what the range is, only the actual total.

      For each pixel in your image you either need to add a new key, or increment an existing value. Either takes approximately the same amount of time: circa 0.000000457763671875 seconds on my rather ancient hardware.
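      A rough way to check that on any machine (just a sketch; the numbers will vary): add 27e6 new keys, then increment those same 27e6 existing keys, and compare the two timings:

      use strict;
      use warnings;
      use Time::HiRes qw[ time ];

      my %h;
      my $t = time;
      ++$h{ $_ } for 1 .. 27e6;     # every iteration adds a new key
      printf "insert new keys:    %.2f s\n", time() - $t;

      $t = time;
      ++$h{ $_ } for 1 .. 27e6;     # every iteration increments an existing key
      printf "increment existing: %.2f s\n", time() - $t;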

      Indexing your 48-bit values (as opposed to my 32-bit ones) will take ~50% longer; so perhaps 40 seconds to count the colours in my 125 mega-pixel image.

      I have it down to < 46 seconds with 88.55% unique colors

      If you've already trimmed your OP time of "2.17 hours" to 48 seconds, why have you wasted our time by asking this question?

      Another monk that I won't waste my time reading and producing solutions for in future.


        ++$h{$_} for 1 .. 27e6; print time() - $t;;
        This is the trivial hash, guaranteed to never have a collision with sequential integers. And it has no data.

        I wrote a test script to see how fast the suggested Perl hashing is with Big Data and millions of collisions:

        sub rgb_hash() {
            $debug = 0;
            $raw   = "pf-2015.0531-249465.pprgb.7360x4912.raw";
            %e2at  = %rgb2c = ();                  # Init Event -> Accumulated Time hash; RGB -> Count hash
            &time_event('init', \%e2at, $debug*0); # Initialize timing structures

            &time_event('Read_Raw', \%e2at, $debug*0);
            $rsize = -s $raw;
            open(IN, "<$raw");
            binmode IN;
            $read = read(IN, $buf, $rsize);
            close IN;                              # Use it n cluse it!
            die("Read ERROR!: Read $read != file $rsize\n") unless($read == $rsize);
            printf("Read %1.6f MB in raw file $raw\n", $read/1.0E6);

            &time_event('# Pre-extend array', \%e2at, $debug*0);
            $ushort[108456960+1] = 0x0000FFFF0000; # Pre-extend array

            &time_event('Unpack', \%e2at, $debug*0);
            @ushort = unpack("S*", $buf);
            $su = scalar(@ushort) || -1;
            printf("Ushort[$su], length(buf)=%1.6f, Bytes/ushort %1.9f\n",
                ($bl = length($buf)) / 1.0E6, $bl/$su);

            for($ii = 0; $ii < scalar @ushort; $ii += 3) {
                # Extract the R, G and B values
                ($rr, $gg, $bb) = @ushort[$ii .. $ii+2];    # Array slice
                $quad  = pack('Q', $rr, $gg, $bb, 0x0000);
                $mult  = $rr*$gg*$bb;                       # Titanic 281 Trillion size number
                $white = $rr + $gg*65535 + $bb*65535*65535; # 281 Trillion size number
                $bs    = $rr | ($gg << 16) | ($bb << 32);   # << WORKS!
                $rgb2c{$bs}++;                              # Increment count for this color
                if(!($ii % 6E6)) {                          # Document time on every 100k unit
                    $jj++;
                    printf("R=0x%04hx, g=0x%04hx, b=0x%04hx, bs=0x%016qx, q=0x%016qx, m=0x%016qx, W=0x%016qx\n",
                        $rr, $gg, $bb, $bs, $quad, $mult, $white);
                }
            }

            &time_event("Count distinct colors", \%e2at, $debug*0);
            printf("RGB hash has %d distinct colors.\n", scalar keys %rgb2c);

            &time_event('term', \%e2at, $debug*0);  # Finalize timing structures
            &show_event(\%e2at, $debug*0);          # Automatically stops last event timing
        }
        No snide comments about my backwoodsy writing style, please :)

        This thing is 3 times as fast as the C program I wrote.
        And we get exactly the same number.

        It has to be the hashing function calculating an address. The C program, bless its little heart, had to do a Binary Search over an ever growing lookup table.

        There was one severe problem with getting the 48-bit hash key right. I packed three verified unsigned shorts into a "Q".

        ($rr, $gg, $bb) = @ushort[$ii .. $ii+2];   # Array slice
        $quad = pack('Q', $rr, $gg, $bb, 0x0000);
        The R, G and B printed perfectly and agreed with the Bitshift/OR value, but the QUAD was always zero:
        R=0x0892, g=0x0b8c, b=0x0672, bs=0x000006720b8c0892, q=0x0000000000000000, R=0x0892 g=0x0b8c b=0x0672,
        It's kind of hard to see, but they all agree except for the #(*#@%) QUAD!
        I followed written documentation:

        From: >> http://perldoc.perl.org/functions/pack.html

        Formula: pack TEMPLATE,LIST
        Q        An unsigned quad value.
        Example: $foo = pack('nN', 42, 4711);   # pack big-endian 16- and 32-bit unsigned integers
        I had to do a bitshift, <<16 for the GREEN and <<32 for the BLUE, and logically OR them together to get a single 48-bit key that worked.
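        For what it's worth, a single "Q" in a pack template consumes only one number from the list, so pack('Q', $rr, $gg, $bb, 0x0000) packs just $rr and ignores the rest; and printing that packed (binary) string with a numeric %qx format numifies it, which is likely why the quad printed as zero above. A small sketch (assuming a 64-bit perl) of ways to build the 48-bit key:

        my( $rr, $gg, $bb ) = ( 0x0892, 0x0b8c, 0x0672 );

        # Shift-and-OR into one 48-bit integer (the approach that worked above):
        my $bs = $rr | ( $gg << 16 ) | ( $bb << 32 );

        # Pack that combined value if a quad string is really wanted:
        my $quad = pack( 'Q', $bs );

        # Or skip integers entirely: a 6-byte packed string works fine as a hash key.
        my $key = pack( 'vvv', $rr, $gg, $bb );   # three little-endian unsigned shorts

        printf "bs = 0x%012x\n", $bs;             # prints bs = 0x06720b8c0892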

        The performance is MOST IMPRESSIVE!!

        Thanks for the Pointers (References?)

        ================================================
        Boring Report: Running c:/bin/bb.pl Wed Aug 12 20:05:10 2015
        Read 216.913920 MB in raw file pf-2015.0531-249465.pprgb.7360x4912.raw
        Ushort[108456960], length(buf)=216.913920, Bytes/ushort 2.000000000
        R=0x1746, g=0x1910, b=0x35ee, bs=0x000035ee19101746 BAD quad=0x0000000000000000, m=0x0000007ae0ad0540, W=0x000035edad343424
        ...
        RGB hash has 27645898 distinct colors.    <<<<<< The good part
        SE: Sum of Event Times = 1.11464685 min, now-init = 1.11464715 min
        Event '# Pre-extend array'      elapsed time =  0.003 min =  0.24%.
        Event 'Count distinct colors'   elapsed time = -0.017 min = -1.50%.
        Event 'Loop'                    elapsed time =  1.012 min = 90.81%.
        Event 'Read_Raw'                elapsed time =  0.002 min =  0.22%.
        Event 'Unpack'                  elapsed time =  0.097 min =  8.73%.
        90.81% -> Loop
         8.73% -> Unpack
         0.24% -> # Pre-extend array
         0.22% -> Read_Raw
         0.00% -> Count distinct colors
        Elapsed time = 1.11 min
        >> 76% of the pixels have unique colors! This makes your hashing algorithm rehash everything when it lands on a dup.

        >Sorry, but if you mean "rehash every preexisting key", you are wrong. Why do you believe that?

        As I recall, when a hash algorithm is selected, there is a tradeoff between performance and probability of uniqueness. Some fraction of the data, up to and including the entire key, may be used to compute the hash value.

        If all of the distinct keys tried actually give non-colliding hash values, then the tradeoff worked. Otherwise, another algorithm must be selected, one which uses either more of the key or a different hashing scheme.

        That triggers a total recalc. No?

        Not even the Perl Gurus can know in advance that your data would have 8000 distinct colors and that mine would have 27 million.

        And, hashing 48-bit, quantumly random data has to be much more than 2.0 times as hard on a hashing algorithm as 24-bit data. There are 3300 times more buckets to keep track of.

        Working in a 16E6 color space does not in any way seem like it should be half as hard as a 281,474,976,710,656 color space.

        What I was seeing in my ill-fated C attempt was dramatically longer run times with modest increases in data volume, from the ever-expanding lookup table. That is why I was looking for a hashing formula!

        >> If you've already trimmed your OP time of "2.17 hours" to 48 seconds, why have you wasted our time

        When I asked the question, the run-time was hours. Way beyond my nano-scale attention span. While The Monks were busy writing up many questions, I was beaverishly instrumenting my code to find out where all the time was being squandered. Qsort was taking 98% of the time!

        Replacing <dumb old> QSort with a brilliantly conceived "Shuffle Merge" (TM :) and increasing my MAX_UNSORTED value from 200 (D'oh!) to a more workable 7920, I was able to realize the astonishingly better run time of roughly a minute. 120 X Faster? Dang!

        Make your first draft the klunkiest, cloddiest, most horrendously heinous code possible because then you have no place to go but UP!

        Either be here to help people, as you plainly very much can, or do not be here, and keep such comments to yourself.