++$h{$_} for 1 .. 27e6; print time() - $t;;
This is the trivial hash, guaranteed to never have a collision with
sequential integers. And it has no data
I wrote a test script to see how fast the suggested Perl hashing is
with Big Data and millions of collisions:
sub rgb_hash() {
$debug=0;
$raw="pf-2015.0531-249465.pprgb.7360x4912.raw";
%e2at=%rgb2c=(); # Init Event -> Accululated Time hash; RGB -> Co
+unt hash
&time_event('init', \%e2at, $debug*0); # Initialize timing struct
+ures
&time_event('Read_Raw', \%e2at, $debug*0);
$rsize = -s $raw;
open(IN, "<$raw");
binmode IN;
$read = read(IN, $buf, $rsize);
close IN; # Use it n cluse it!
die("Read ERROR!: Read $read != file $rsize\n") unless($read==$rs
+ize);
printf("Read %1.6f MB in raw file $raw\n", $read/1.0E6);
&time_event('# Pre-extend array', \%e2at, $debug*0);
$ushort[108456960+1] = 0x0000FFFF0000; # Pre-extend array
&time_event('Unpack', \%e2at, $debug*0);
@ushort=unpack("S*", $buf); $su = scalar(@ushort) || -1;
printf("Ushort[$su], length(buf)=%1.6f, Bytes/ushort %1.9f\n",
($bl=length($buf)) / 1.0E6, $bl/$su);
for($ii=0; $ii < scalar @ushort; $ii+=3) {
# Extract the R, G and B values
($rr, $gg, $bb) = @ushort[$ii .. $ii+2]; # Array slice
$quad=pack('Q', $rr, $gg, $bb, 0x0000);
$mult=$rr*$gg*$bb; # Titanic 281 Trillion size number
$white=$rr + $gg*65535 + $bb*65535*65535; # 281 Trillion size
+ number
$bs=$rr | ($gg << 16) | ($bb << 32); # << WORKS!
$rgb2c{$bs}++; # Increment count for this color
if(!($ii%6E6)) { # Document time on every 100k unit
$jj++;
printf("R=0x%04hx, g=0x%04hx, b=0x%04hx, bs=0x%016qx, q=0x
+%016qx, m=0x%016qx, W=0x%016qx\n",
$rr, $gg, $bb, $bs, $quad, $mult, $white);
}
}
&time_event("Count distinct colors", \%e2at, $debug*0);
printf("RGB hash has %d distinct colors.\n", scalar keys %rgb2c);
&time_event('term', \%e2at, $debug*0); # Finalize timing structur
+es
&show_event(\%e2at, $debug*0); # Automatically stops last event t
+iming
}
No snide comments about my backwoodsy writing style, please :)
This thing is 3 times as fast as the C program I wrote.
And we get exactly the same number.
I has to be the hashing function calculating an address. The C program,
bless its little heart, had to do a Binary Search over an ever growing
lookup table.
There was one severe problem with getting the 48bit hash key right.
I packed 3 verified unsigned short into a "Q"
($rr, $gg, $bb) = @ushort[$ii .. $ii+2]; # Array slice
$quad=pack('Q', $rr, $gg, $bb, 0x0000);
The R, G and B printed perfectly and agreed with the Bitshift/AND
value and the QUAD was always zero:
R=0x0892, g=0x0b8c, b=0x0672,
bs=0x000006720b8c0892, q=0x0000000000000000,
R=0x0892
g=0x0b8c
b=0x0672,
It's kind of hard to see, buy they all agree except for the
#(*#@%) QUAD!
I followed written documentation:
From: >> http://perldoc.perl.org/functions/pack.html
Formula: pack TEMPLATE,LIST; Q An unsigned quad value.
Example: $foo = pack('nN', 42, 4711);
# pack big-endian 16- and 32-bit unsigned integers
I had to do a bitshift, <<16 for the GREEN and <<32 for the BLUE
and logically AND them together to get a UINT24_T that worked.
The performance is MOST IMPRESSIVE!!
Thanks for the Pointers (References?
================================================
Boring Report:
Running c:/bin/bb.pl Wed Aug 12 20:05:10 2015
Read 216.913920 MB in raw file pf-2015.0531-249465.pprgb.7360x4912.raw
Ushort[108456960], length(buf)=216.913920, Bytes/ushort 2.000000000
R=0x1746, g=0x1910, b=0x35ee, bs=0x000035ee19101746
BAD quad=0x0000000000000000, m=0x0000007ae0ad0540, W=0x000035edad3434
+24
...
RGB hash has 27645898 distinct colors. <<<<<< The good part
SE: Sum of Event Times = 1.11464685 min, now-init = 1.11464715 min
Event '# Pre-extend array' elapsed time = 0.003 min = 0.24%.
Event 'Count distinct colors' elapsed time = -0.017 min = -1.50%.
Event 'Loop' elapsed time = 1.012 min = 90.81%.
Event 'Read_Raw' elapsed time = 0.002 min = 0.22%.
Event 'Unpack' elapsed time = 0.097 min = 8.73%.
90.81% -> Loop
8.73% -> Unpack
0.24% -> # Pre-extend array
0.22% -> Read_Raw
0.00% -> Count distinct colors
Elapsed time = 1.11 min
|