in reply to Re^3: Perl Hashes in C?
in thread Perl Hashes in C?

++$h{$_} for 1 .. 27e6; print time() - $t;;
This is the trivial hash, guaranteed to never have a collision with sequential integers. And it has no data

I wrote a test script to see how fast the suggested Perl hashing is with Big Data and millions of collisions:

sub rgb_hash() { $debug=0; $raw="pf-2015.0531-249465.pprgb.7360x4912.raw"; %e2at=%rgb2c=(); # Init Event -> Accululated Time hash; RGB -> Co +unt hash &time_event('init', \%e2at, $debug*0); # Initialize timing struct +ures &time_event('Read_Raw', \%e2at, $debug*0); $rsize = -s $raw; open(IN, "<$raw"); binmode IN; $read = read(IN, $buf, $rsize); close IN; # Use it n cluse it! die("Read ERROR!: Read $read != file $rsize\n") unless($read==$rs +ize); printf("Read %1.6f MB in raw file $raw\n", $read/1.0E6); &time_event('# Pre-extend array', \%e2at, $debug*0); $ushort[108456960+1] = 0x0000FFFF0000; # Pre-extend array &time_event('Unpack', \%e2at, $debug*0); @ushort=unpack("S*", $buf); $su = scalar(@ushort) || -1; printf("Ushort[$su], length(buf)=%1.6f, Bytes/ushort %1.9f\n", ($bl=length($buf)) / 1.0E6, $bl/$su); for($ii=0; $ii < scalar @ushort; $ii+=3) { # Extract the R, G and B values ($rr, $gg, $bb) = @ushort[$ii .. $ii+2]; # Array slice $quad=pack('Q', $rr, $gg, $bb, 0x0000); $mult=$rr*$gg*$bb; # Titanic 281 Trillion size number $white=$rr + $gg*65535 + $bb*65535*65535; # 281 Trillion size + number $bs=$rr | ($gg << 16) | ($bb << 32); # << WORKS! $rgb2c{$bs}++; # Increment count for this color if(!($ii%6E6)) { # Document time on every 100k unit $jj++; printf("R=0x%04hx, g=0x%04hx, b=0x%04hx, bs=0x%016qx, q=0x +%016qx, m=0x%016qx, W=0x%016qx\n", $rr, $gg, $bb, $bs, $quad, $mult, $white); } } &time_event("Count distinct colors", \%e2at, $debug*0); printf("RGB hash has %d distinct colors.\n", scalar keys %rgb2c); &time_event('term', \%e2at, $debug*0); # Finalize timing structur +es &show_event(\%e2at, $debug*0); # Automatically stops last event t +iming }
No snide comments about my backwoodsy writing style, please :)

This thing is 3 times as fast as the C program I wrote.
And we get exactly the same number.

I has to be the hashing function calculating an address. The C program, bless its little heart, had to do a Binary Search over an ever growing lookup table.

There was one severe problem with getting the 48bit hash key right. I packed 3 verified unsigned short into a "Q"

($rr, $gg, $bb) = @ushort[$ii .. $ii+2]; # Array slice $quad=pack('Q', $rr, $gg, $bb, 0x0000);
The R, G and B printed perfectly and agreed with the Bitshift/AND value and the QUAD was always zero:
R=0x0892, g=0x0b8c, b=0x0672, bs=0x000006720b8c0892, q=0x0000000000000000, R=0x0892 g=0x0b8c b=0x0672,
It's kind of hard to see, buy they all agree except for the
#(*#@%) QUAD!
I followed written documentation:

From: >> http://perldoc.perl.org/functions/pack.html

Formula: pack TEMPLATE,LIST; Q An unsigned quad value. Example: $foo = pack('nN', 42, 4711); # pack big-endian 16- and 32-bit unsigned integers
I had to do a bitshift, <<16 for the GREEN and <<32 for the BLUE and logically AND them together to get a UINT24_T that worked.

The performance is MOST IMPRESSIVE!!

Thanks for the Pointers (References?

================================================ Boring Report: Running c:/bin/bb.pl Wed Aug 12 20:05:10 2015 Read 216.913920 MB in raw file pf-2015.0531-249465.pprgb.7360x4912.raw Ushort[108456960], length(buf)=216.913920, Bytes/ushort 2.000000000 R=0x1746, g=0x1910, b=0x35ee, bs=0x000035ee19101746 BAD quad=0x0000000000000000, m=0x0000007ae0ad0540, W=0x000035edad3434 +24 ... RGB hash has 27645898 distinct colors. <<<<<< The good part SE: Sum of Event Times = 1.11464685 min, now-init = 1.11464715 min Event '# Pre-extend array' elapsed time = 0.003 min = 0.24%. Event 'Count distinct colors' elapsed time = -0.017 min = -1.50%. Event 'Loop' elapsed time = 1.012 min = 90.81%. Event 'Read_Raw' elapsed time = 0.002 min = 0.22%. Event 'Unpack' elapsed time = 0.097 min = 8.73%. 90.81% -> Loop 8.73% -> Unpack 0.24% -> # Pre-extend array 0.22% -> Read_Raw 0.00% -> Count distinct colors Elapsed time = 1.11 min

Replies are listed 'Best First'.
Re^5: Perl Hashes in C?
by BrowserUk (Patriarch) on Aug 13, 2015 at 05:08 UTC

    FWIW, this constructs a 216 MB file of random bytes in memory; then indexes it is 6-byte chunks to find 28 million "colors" in 37 million pixels in 56 seconds (and I'm guessing your machine is a lot faster than mine):

    #! perl -slw use strict; use Inline C => Config => BUILD_NOISY => 1; use Inline C => <<'END_C', NAME => '_1138218', CLEAN_AFTER_BUILD =>0; typedef unsigned __int64 U64; typedef unsigned int U32; typedef unsigned char U8; #define TAG printf( "%s[%u]\n", __FILE__, __LINE__ ) HV* countColors( SV *img ) { STRLEN l; U8 *p = SvPVx( img, l ); HV* counts = newHV(); U32 i; for( i = 0; i < l; i += 6 ) { SV **val; if( val = hv_fetch( counts, p+i, 6, 0 ) ) { ++SvUVX( *val ); } else { hv_store( counts, p+i, 6, newSVuv( 1 ), 0 ); } } return counts; } END_C use List::Util qw[ shuffle sum ]; my $data = ''; my @bytes = map chr, 0 .. 255; { local $/; open O, '>', \$data; seek O, 226492416, 0; print O chr(26 ); seek O, 0, 0; print O join '', ( shuffle( @bytes ) ) x 4 for 1 .. 221184; close O; } print length $data; my $start = time; my $href = countColors( $data ); printf "Took %f seconds\n", time() -$start; printf "Found %u colors in %u pixels\n", scalar( keys %{ $href } ), su +m( values %{ $href } ); __END__ C:\test>1138218-2 226713600 Took 56.000000 seconds Found 28532475 colors in 37785600 pixels

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
Re^5: Perl Hashes in C?
by BrowserUk (Patriarch) on Aug 13, 2015 at 01:40 UTC