Perl Hashes in C?

BrianP has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Perl Hashes in C? by BrowserUk (Patriarch) on Aug 11, 2015 at 20:28 UTC
This drops into Inline::C to explore the bitmap and construct the hash. It takes 23 seconds to process a 125 megapixel image: #! perl -slw use strict; use Inline C => Config => BUILD_NOISY => 1; use Inline C => <<'END_C', NAME => '_1138218', CLEAN_AFTER_BUILD =>0; typedef unsigned __int64 U64; typedef unsigned int U32; typedef unsigned char U8; HV* countColors( SV img ) { STRLEN l; U32 rgba = (U32)SvPVx( img, l ), i; HV counts = newHV(); l /= 4; for( i = 0; i < l; ++i ) { if( hv_exists( counts, (char)&rgba[ i ], 4 ) ) { SV val = hv_fetch( counts, (char)&rgba[ i ], 4, 0 ); SvIV_set( val, SvIV( val ) + 1 ); } else { SV val = newSViv( 1 ); hv_store( counts, (char)&rgba[ i ], 4, val, 0 ); } } return counts; } END_C use Time::HiRes qw[ time ]; use Data::Dump qw[ pp ]; use List::Util qw[ sum ]; use GD; GD::Image->trueColor(1); my $i = GD::Image->new( $ARGV[ 0 ] ) or die $!; my $gd = $i->gd; my( undef, $width, $height, undef, undef ) = unpack 'nnnCV', substr( $ +gd, 0, 11, '' ); printf "Width:%u height:%u pixels:%u\n", $width, $height, length( $gd +) / 4; my $start = time; my $href = countColors( $gd ); printf "Took %f seconds\n", time() -$start; printf "Found %u colors in %u pixels\n", scalar( keys %{ $href } ), su +m( values %{ $href } ); __END__ C:\test>1138218 mid.png Width:12800 height:10240 pixels:131072000 Took 23.391081 seconds Found 81814 colors in 131072000 pixels [download] The handling of the hash could be made more efficient by only calculating the hash-values once instead of several times; and by pre-extending the hash. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :) In the absence of evidence, opinion is indistinguishable from prejudice. I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!	[reply] [d/l]
Re^2: Perl Hashes in C? (Twice as fast) by BrowserUk (Patriarch) on Aug 12, 2015 at 19:56 UTC
A small change doubled the performance: `HV* countColors2( SV img ) { STRLEN l; U32 rgba = (U32)SvPVx( img, l ), i; HV counts = newHV(); l /= 4; for( i = 0; i < l; ++i ) { SV *val; if( val = hv_fetch( counts, (char)&rgba[ i ], 4, 0 ) ) { SvIV_set( val, SvIV( val ) + 1 ); } else { SV val = newSViv( 1 ); hv_store( counts, (char)&rgba[ i ], 4, val, 0 ); } } return counts; }` [download] `C:\test>1138218 mid.png Width:12800 height:10240 pixels:131072000 Took 12.642578 seconds Found 81814 colors in 131072000 pixels` [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :) In the absence of evidence, opinion is indistinguishable from prejudice. I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!	[reply] [d/l] [select]
Re^2: Perl Hashes in C? by BrianP (Acolyte) on Aug 12, 2015 at 05:10 UTC
These are TrueColor. I am dealing with 281 TRILLION, full 16 bits/channel, 48 bits/pixel. "Found 81814 colors in 131072000 pixels" -> 1602 Pixels per color. The 216MB Photoshop RAW/16 file had 27 MILLION unique colors out of 36M "Pixels=36152321, unique Colors=27546248=76.19%" 76% of the pixels have unique colors! This makes your hashing algorithm rehash everything when it lands on a dup. I am monkeying with the MAX_UNSORTED parameter which determines when a sort has to be done after so many new, random colors have been piled on top of the lookup table. I had it set at a way, way too low 200. I wrote a Perl script to run the C program with varying MAX_UNSORT numbers and are seeing vastly better performance with 3805 is the best so far. The linear searches on top of the pile are pretty cheap compared to QSorting and merging. With a 1 in 3 sampling (12M of 36M), I have it down to < 46 seconds with 88.55% unique colors The larger the number of unique colors, the more it pays to leave a pile of unsorted colors on top. The one I did before was a ColorMAtch colorspace and it had ~76% unique colors. This one is ProPhoto and is over 85%! Same NEF file, same ACR settings, no photoshop other than to import from ACR and save as RAW. It looks like I need to work on the Sort_Merge. QSort on the entire 27 million tall stack, 99% already sorted was taking 98% of the program time. The shuffle_merge is 100 times faster on this problem	[reply]
Re^3: Perl Hashes in C? by BrowserUk (Patriarch) on Aug 12, 2015 at 05:36 UTC
The 216MB Photoshop RAW/16 file had 27 MILLION unique colors out of 36M This is a perl creating a hash with 27 million keys: `[0] Perl> $t = time; ++$h{$_} for 1 .. 27e6; print time() - $t;; 52.9575479030609` [download] 53 seconds! 76% of the pixels have unique colors! This makes your hashing algorithm rehash everything when it lands on a dup. Sorry, but if you mean "rehash every preexisting key", you are wrong. Why do you believe that? (If you mean something else by the highlighted phrase above, you're gonna have to explain yourself better.) The beauty of hashing for this application is that it doesn't matter what the range is, only the actual total. For each pixel in your image you either need to add a new key; or increment an existing value. Either takes approximately the same amount of time: circa: 0.000000457763671875 of a second on my rather ancient hardware. Indexing your 48-bit values (as opposed to my 32-bit ones) will take ~50% longer; so perhaps 40 seconds to count the colours in my 125 mega-pixel image. I have it down to < 46 seconds with 88.55% unique colors If you've already trimmed your OP time of "2.17 hours" to 48 seconds, why have you wasted our time by asking this question? Another monk that I won't waste my time reading and producing solutions for in future. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :) In the absence of evidence, opinion is indistinguishable from prejudice. I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!	[reply] [d/l]
Re^4: Perl Hashes in C? by Anonymous Monk on Aug 13, 2015 at 01:26 UTC
Re^5: Perl Hashes in C? by BrowserUk (Patriarch) on Aug 13, 2015 at 05:08 UTC
Re^5: Perl Hashes in C? by BrowserUk (Patriarch) on Aug 13, 2015 at 01:40 UTC
Re^4: Perl Hashes in C? by Anonymous Monk on Aug 13, 2015 at 01:58 UTC
Re^5: Perl Hashes in C? by BrowserUk (Patriarch) on Aug 13, 2015 at 02:09 UTC
Re^4: Perl Hashes in C? by Anonymous Monk on Aug 13, 2015 at 03:29 UTC
Re^4: Perl Hashes in C? by Anonymous Monk on Aug 12, 2015 at 15:13 UTC
Re^5: Perl Hashes in C? by BrowserUk (Patriarch) on Aug 12, 2015 at 15:29 UTC
Some notes below your chosen depth have not been shown here
Re: Perl Hashes in C? (Updated) by flexvault (Monsignor) on Aug 11, 2015 at 21:02 UTC
Welcome BrianP, On a Linux 32 bit system, Perl still allows 253 integers if you compile Perl for 64 bit integers: (Note: If Perl is a 64 bit version, then the problem goes away!) `rperl -e 'use integer;$i=((217)3);$m=253;print "$i\n$m\n";' 2251799813685248 9007199254740992` [download] As you can see the maximum integer is larger than what you need. So you can generate integers directly by multiplying the 3 16 bit <UPDATE> To make sure nobody just multiplies the RGB together please remember that to represent the number 307 in decimal use: `perl -e '$n = (3(102)) + (0(10*1)) + (9(10*0)); print "$n\n";'` [download] So the formula is ($B is the base): `perl -e '$B = 10; $n = (3($B*2)) + (0($B*1)) + (9($B0)); print +"$n\n";'` [download] So to generate 48 bit RGB integers you need: `perl -e '$B = 216; $n = ($R($B2)) + ($G($B*1)) + ($B($B0)); +print "$n\n";'` [download] Clarification complete! </UPDATE> values and then use that value as a hash key. Then each time you add a count just do: `$hash{$key}++; ## $key is 48 bit RGB as a single integer ## The value $hash{$key} is sum of the time +s found` [download] Regards...Ed "Well done is better than well said." - Benjamin Franklin**	[reply] [d/l] [select]
Re: Perl Hashes in C? by FreeBeerReekingMonk (Deacon) on Aug 11, 2015 at 19:30 UTC
Stick to Perl, use the underlying C routines, and the hash capabilities of perl (unless you run out of memory) Once you binary read a pixel and have 48 bits of data, it can be unpacked to a large integer. However, for speed, you should not use 3 times a 16 bit value, then shift them together, but use quad values, which only are available if your perl is 64 bit. Check Pack/Unpack Tutorial (aka How the System Stores Data) But still, in perl, it would be better to unpack to hex strings, which are much shorter (thus faster), and use those as keys in a hash counter, like so: `my %SEEN; my $key; open( FH, '<', "image.raw" ) or die $!; binmode(FH); my $startpos = 10; # skip header seek(FH, $startpos, 0); $/ = \42; # set record size while( <FH> ) { $key = join("", unpack 'H42', $_ ); $SEEN{$key}++; # not sure how to calculate stop } for my $key (keys %SEEN){ print "key $key seen $SEEN{$key} times\n"; } # Add method to go from hex key to numeric rgb values if you want` [download]	[reply] [d/l]
Re^2: Perl Hashes in C? by Anonymous Monk on Aug 12, 2015 at 05:46 UTC
I have PerlMagick read the file into a blob Unpack the blob into a very long array of unsigned shorts Calc the compound, multi-channel values for Yellow, Purple Cyan and White and hash them all The array slice makes it a great deal faster than segmenting it into pixels $im = Image::Magick->new(size=>$xyr, type=>'TrueColor', depth=>16); $err=$im->Read(filename=>"RGB:$raw"); warn "$err" if $err; # Read RA +W ($blob) = $im->ImageToBlob(); $bl=length $blob; @ushort=unpack("S", $blob); $blen=length $blob; $ulen=scalar @ushort; printf("CR: $lbl Blob len $blen unpacks to $ulen uint16s -> %d pix + -> %4.6f B/us\n", $blen / 6, $blen/$ulen); for($ii=0; $ii < scalar @ushort; $ii+=3) { ($rr, $gg, $bb) = @ushort[$ii .. $ii+2]; # Array slice # Show full integers in 4Gig range for Yellow, purple, cyan $yy=sprintf("%10d", $rr$gg); # White=4294967296 (4,294,967,29 +6) 10 dig $pp=sprintf("%10d", $rr$bb); $cc=sprintf("%10d", $gg$bb); # (2^16)^3 = 2^48 = 2.815E14, 15 sig figs MAX. # Doubles have 52/53 bits of Significand $tt=sprintf("%15d", $rr$gg$bb); # TRUE color, RGB??? $c2v2c{r}{$rr}++; # Count Red channel points with this value $c2v2c{g}{$gg}++; # Count Gre channel points with this value $c2v2c{b}{$bb}++; # Count Blu channel points with this value $c2v2c{y}{$yy}++; # Count Blu channel points with this value $c2v2c{p}{$pp}++; # Count Blu channel points with this value $c2v2c{c}{$cc}++; # Count Blu channel points with this value $c2v2c{t}{$tt}++; # Count Blu channel points with this value } $vt=0; # Values Total; sum of all values foreach $chan (@leg) { # sort keys %c2v2c $clr = $leg{$chan}; # y -> Yellow %v2c = %{$c2v2c{$chan}}; # Values -> Count_of_this_value hash $vc = scalar keys %v2c; # Distinct values for this channel $vt += $vc; # Value Count accumulator push @vc, "$clr $vc"; } printf("CR: $lbl Chan Value Counts: %s, tot=$vt\n", join(", ", @vc +)); return $vc; # Last ValueCount should be for T, TrueRGB } # End Count_Rgb(). [download] Looking at the log file: `3), R pf-2015.0531-249465.srgb.7360x4912.raw , crop_wh=577,428, blob= +1481736 B -> 6.000000 B/p CR: srgb Blob len 1481736 unpacks to 740868 uint16s -> 246956 pix -> 2 +.000000 B/us CR: srgb Chan Value Counts: Red 18955, Green 16919, Blue 10862, Yellow + 236778, Purple 221564, Cyan 222509, TrueRGB 225558, tot=953145 Distinct RGB colors=225558 = 91.335% -> 1 dup clr / 11.54 pix Elapsed time = 9.67 sec` [download] IT does 4 of these in almost 10 seconds, but look at the pixel count; < 1/4 MegaPix The hash only has to deal with 91% of 1/4 MB or 225,558 226 Thousand unique colors. Scale this to 30 MILLION unique colors and the time factor goes from 9 seconds to 9 DAYS! I have the C version down to 3 minutes flat with a 7920 Unsorted limit and 2.77 Million unique colors!	[reply] [d/l] [select]
Re^3: Perl Hashes in C? by BrianP (Acolyte) on Aug 15, 2015 at 03:30 UTC
This was the code example from FreeBeerReekingMonk: ----------------------------- my %SEEN; my $key; open( FH, '<', "image.raw" ) or die $!; binmode(FH); my $startpos = 10; # skip header seek(FH, $startpos, 0); $/ = \42; # set record size while( <FH> ) { $key = join("", unpack 'H42', $_ ); $SEEN{$key}++; # not sure how to calculate stop } for my $key (keys %SEEN){ print "key $key seen $SEEN{$key} times\n"; } # Add method to go from hex key to numeric rgb values if you want ================================= I have no idea where the \42 came from. I changed it to a record size=6. Then changed the Unpack from 'H42' to 'H12' 12 hex = 6 bytes. It works: # The RIGHT ANSWER IS: RGB hash has 27645898 distinct colors. sub rgb_hash2() { my %SEEN; my $key; # 27645898 colors 216913920 B 36152320 pixels! $raw="pf-2015.0531-249465.pprgb.7360x4912.raw"; # $fs=-s $raw; # RAW file size in BYTES open( FH, '<', "$raw" ) or die $!; binmode(FH); # RGB 16 BITS/CHANNEL = 48BITs/PIXEL = 6 BYTE QUANTA -> 12 HEX digits $/ = \6; # set record size. REF 2 INT, "\6" == 6 B #$/ = \42; # set record size while( <FH> ) { # Smash each 12 Hex rec into 48 BIT UINT48? #$key = join("", unpack 'S3', $_ ); $key = join("", unpack 'H12', $_ ); $SEEN{$key}++; } close FH; $sk=scalar keys %SEEN; # Scalar Keys. How many COLORS? printf("Seen{%d}, file_size=%2.6fMB, -> %1.6f B/key\n", $sk, $fs/1.0E6, $fs/$sk); return $sk; } I also tried it with 3 Unsigned Shorts too. ===================================================== ===================================================== Monk Ed suggested something at a bit lower level with a sysread into a tiny 4096 * 6 byte buffer and nibbling the buffer substr. If you monkey with the buffer size and try a size not evenly divisible by 6, it breaks. D'oh! =============================================== Ed's ~ code (may be partially mangled by me). %Image = (); keys %Image = 4096 * 128; $rdamt = 4194304 unless $rdamt; # 4 MByte! 4096 * 6; $compraw = -s $file; $pixels = int ( $compraw / 6 ); open ( $in, "<", "$file") or die "$!\n"; while( 1 ) { $size = sysread( $in, $buffer, $rdamt ); if ( $size == 0 ) { last; } while( $buffer ) { if ( length( $buffer ) < 6 ) { last; } $key = substr( $buffer, 0, 6, '' ); $Image{$key}++; } } close $in; $uniq = keys %Image; =============================================== PERFORMANCE SYSREAD with buffer size Event 'Read_Raw_16384' elapsed time = 0.520 min = 10.43%. Event 'Read_Raw_24576' elapsed time = 0.519 min = 10.41%. Event 'Read_Raw_4096' elapsed time = 0.525 min = 10.54%. 10.54% -> Read_Raw_4096 10.43% -> Read_Raw_16384 10.41% -> Read_Raw_24576 << Ed's Magic Number 4096 * 6 Average ~= 0.5 min Record read, unpack to HEX Event 'Read_Raw_Hex0' elapsed time = 0.781 min = 15.68%. Event 'Read_Raw_Hex1' elapsed time = 0.711 min = 14.26%. Event 'Read_Raw_Hex2' elapsed time = 0.704 min = 14.13%. Event 'Read_Raw_Hex3' elapsed time = 0.707 min = 14.18%. 15.68% -> Read_Raw_Hex0 14.26% -> Read_Raw_Hex1 14.18% -> Read_Raw_Hex3 14.13% -> Read_Raw_Hex2 Average ~= 0.7 min Ed's magic number beat out the others tried by an eyelash! Tried Unpacking 3 UShorts also. Seems like it should be faster to unpack 3, 2 byte ints than 12 hex: Event 'Read_Raw_UINT16x3Pack0' elapsed time=0.959 min=25.82% Event 'Read_Raw_UINT16x3Pack1' elapsed time=0.919 min=24.76% Event 'Read_Raw_UINT16x3Pack2' elapsed time=0.918 min=24.73% Event 'Read_Raw_UINT16x3Pack3' elapsed time=0.917 min=24.70% 25.82% -> Read_Raw_UINT16x3Pack0 24.76% -> Read_Raw_UINT16x3Pack1 24.73% -> Read_Raw_UINT16x3Pack2 24.70% -> Read_Raw_UINT16x3Pack3 NO! Hex Unpack is faster that UINT16 on the same data and by a sizable margin, The low level sysreads without buffering Gets the GOLD MEDAL! There was another INLINE C method from BrowserUk that looked perfect, but it used GD with "TRUECOLOR" which on my system means 24 bit and my data are all 16 so I didn't try it. What started as a 2 hour problem has been whittled down to 20 seconds. Ye Ha! Thanks for all the great ideas! Brian ======================================================== Inline C version from BrowserUk: by BrowserUk on Aug 11, 2015 at 20:28 UTC This drops into Inline::C to explore the bitmap and construct the hash +. It takes 23 seconds to process a 125 megapixel image: #! perl -slw use strict; use Inline C => Config => BUILD_NOISY => 1; use Inline C => <<'END_C', NAME => '_1138218', CLEAN_AFTER_BUILD =>0; typedef unsigned __int64 U64; typedef unsigned int U32; typedef unsigned char U8; HV* countColors( SV img ) { STRLEN l; U32 rgba = (U32)SvPVx( img, l ), i; HV counts = newHV(); l /= 4; for( i = 0; i < l; ++i ) { if( hv_exists( counts, (char)&rgba[ i ], 4 ) ) { SV val = hv_fetch( counts, (char)&rgba[ i ], 4, 0 ); SvIV_set( val, SvIV( val ) + 1 ); } else { SV val = newSViv( 1 ); hv_store( counts, (char)&rgba[ i ], 4, val, 0 ); } } return counts; } END_C use Time::HiRes qw[ time ]; use Data::Dump qw[ pp ]; use List::Util qw[ sum ]; use GD; GD::Image->trueColor(1); my $i = GD::Image->new( $ARGV[ 0 ] ) or die $!; my $gd = $i->gd; my( undef, $width, $height, undef, undef ) = unpack 'nnnCV', substr( $ +gd, 0, 11, '' ); printf "Width:%u height:%u pixels:%u\n", $width, $height, length( $gd +) / 4; my $start = time; my $href = countColors( $gd ); printf "Took %f seconds\n", time() -$start; printf "Found %u colors in %u pixels\n", scalar( keys %{ $href } ), su +m( values %{ $href } ); [download]	[reply] [d/l]
Re: Perl Hashes in C? by pme (Monsignor) on Aug 11, 2015 at 19:01 UTC
Have you considered C-hash?	[reply]
Re: Perl Hashes in C? by flexvault (Monsignor) on Aug 12, 2015 at 14:52 UTC
BrianP, Is there a sample image on the Internet that demonstrates your requirements? The mathematics of your problem can't be duplicated easily, but an image that demonstrates your actually problem could give those of us with a math background the ability to test for a better solution. BrowserUk solution is excellent, but without data representing your problem, we're just guessing! Regards...Ed "Well done is better than well said." - Benjamin Franklin	[reply]
Re^2: Perl Hashes in C? by Anonymous Monk on Aug 13, 2015 at 04:20 UTC
Ed, I am exploring alternate color spaces, bizarre processes and different printing options. I like the ProPhoto colorspace and find that I have many fewer problem spots when I use it; less burnouts, fewer jet black holes and better skin tones. A large percentage of the photo establishment insisting that no camera, monitor, printer or eyeball can see anything other than sRGB. It's easier to print, everything looks the same everywhere, the least common denominator makes your life easier. HOGWASH! I suspect that these people are keeping MacDonalds in business because it tastes the same everywhere. I am building analytic tools to measure the degradation that occurs with 20-year-old color spaces from a formerly large, low quality software empire. I don't know if you will be able to see this but it is a fairly colorful picture minimally processed in both ProPhoto and srgb, converted to 16bit/channel raw, subtracted by the uint16 and normalized. There is a mountain of difference. <<Link>> https://picasaweb.google.com/lh/photo/u4_oeKwuv6vXjOQesjBQri_P6Kt2i5G4X9d1GVZOYtg?feat=directlink This is one of the reasons I wanted this tool. I need to do zillions of these and 2 hours each was objectionable. These Perl hashes are astonishing in their raw power on big data. Crusading for Better Color, BrianP	[reply]
Re: Perl Hashes in C? by anonymized user 468275 (Curate) on Aug 12, 2015 at 18:59 UTC
Long before Perl and C++ were invented, a hashing algorithm meant mapping the first so many bytes of the key to a starting address, these being designed to be spread evenly across the total available addressing space, and then adding the remainder of the key and the value in sequentially (this needs an end-pointer to be stored at the landing point). Old school hash lookup expects therefore to land immediately on the full match value most of the time and worst case have to trawl along sequentially to find it from there, but not too far to have performance impact. Somehow it looks like this problem would tempt me into going that far old school to achieve all the theoretical benefits, especially if straight C is an option. One world, one people	[reply]
Re: Perl Hashes in C? by flexvault (Monsignor) on Aug 13, 2015 at 11:33 UTC
BrianP I downloaded the file, but it was in '.jpg' format and not a raw file. I used it in the following script, and then I copied it until the file size was larger than 217MB. The result of the 2 runs are below. I may be wrong but I don't think you need to convert to an integer and then use pack. The result will be the same as the raw 6 bytes ( 48 bits ) for the pixel. use strict; use warnings; my %Image = (); keys %Image = 4096 * 128; my $rdamt = 4096 * 6; my $buffer; # my $file = '249465.pprgb-srgb.absv.4000.cs.jpg'; # Original + File downloaded my $file = '249465.pprgb-srgb.absv.4000.cs.jpg2'; # New File + multiple copies my $compraw = -s $file; my $pixels = int ( $compraw / 6 ); open ( my $in, "<", "./$file") or die "$!\n"; while( 1 ) { my $size = sysread( $in, $buffer, $rdamt ); # No bufferi +ng if ( $size == 0 ) { last; } while( $buffer ) { if ( length( $buffer ) < 6 ) { last; } # Throw away + odd number of pixels my $key = substr( $buffer, 0, 6, '' ); $Image{$key}++; } } close $in; my $uniq = keys %Image; my $factor = sprintf("%.2f", $uniq / $pixels ); print "Found $uniq colors in $pixels pixels $factor\%\n"; __END__ [download] This is the results of run1, almost all unique pixels. About 1 second. `> time pyrperl uniqcolors.plx Found 153601 colors in 153606 pixels 1.00% ( File size: 921,638 ) real 0m0.098s user 0m0.092s sys 0m0.008s` [download] This is using the larger file. We got some more uniques since the '.jpg' file was not a multiple of 6 bytes. About 15-16 seconds. `> time pyrperl uniqcolors.plx Found 460775 colors in 41473710 pixels 0.01% ( File size: 248,842,260 + ) real 0m15.015s user 0m14.937s sys 0m0.080s` [download] Try it on your real data and let us know if it helps. Regards...Ed "Well done is better than well said." - Benjamin Franklin	[reply] [d/l] [select]
Re^2: Perl Hashes in C? by Anonymous Monk on Aug 13, 2015 at 17:32 UTC
Ed, Your hard core, direct, laser focused code got the answer spot on the first go in the delightfully brief time of < 37 seconds: `Running C:\bin\bb.pl Thu Aug 13 11:06:26 2015 Found 27645898 colors in 36152320 pixels 0.76% Elapsed time = 36.82 sec` [download] Franklin would have marveled at the design efficiency. I have been attempting to arrive at the same 6 byte Quanta size for hours using (!_))#% pack/unpack. UINT16s work perfectly as do UINT64s. How is it possible that nobody has ever thought of a UINT48? 32 bits is too wimpy; 4.3GB is not enough. But 4.3G4.3G BYTES is clearly OverTheTop! 18446744073709551616 ???? Wouldn't 65536 * 4294967296 Bytes be just about right? Surely 281474976710656 B "is more memory than anybody will ever need"?? 281 TER! Has a ring to it. A 24 bit processor would be ideally suited for GRAPHICS PROCESSING. I hate to be a pest (but it does not usually stop me). While I still have a residual amount of hair left, might I ask you if you could point out my obvious comprehensional deficit with UNPACK? I have all 217MB in memory. 8 byte quanta are too large and 4 byte are too small so I am stuck with 2 byte type "S", uint16_t. The inane method it all I can get to work, BIT shifting and ANDing: `@ushort=unpack("S", $buf); # < Ushort[108456960] == RIGHT Number! for($ii=0; $ii < scalar @ushort; $ii+=3) { ($rr, $gg, $bb) = @ushort[$ii .. $ii+2]; # Array slice MORONIC> $bs = $rr \| ($gg << 16) \| ($bb << 32); # << WORKS ;( $rgb2c{$bs}++; # Increment count for this color }` [download] This works, but as another Monk pointed out, finely slicing and dicing then bit shifting the diminutive chunks and ORing them back together is hardly as satisfying as using 6 byte, native Quanta. I usually need the individual colors so I need this type of code, just not here. This is a case of wanting to find my error rather than fixing a blocking bug. How hard can it be to unpack 3 of the units from my array and smash them into the $RGB key I need with NO MONKEY BUSINESS? I tried every permutation of type S I could think of. Type Q worked fine except that it gave 1 1/3 pixel at a time. Is there a way to Unpack 3 UINT16s at a time with UNPACK()?? `WORKS!> @q =unpack("Q", $buf); $sq = scalar(@q) \|\| -1; FAIL! @uint48=unpack("S3", $buf); $s48 = scalar(@uint48) \|\| -1; FAIL! @uint48=unpack("S@3", $buf); $s48 = scalar(@uint48) \|\| -1; FAIL! @uint48=unpack("S[3]", $buf); $s48 = scalar(@uint48) \|\| -1 +; FAIL! @uint48=unpack("(SSS)", $buf); $s48 = scalar(@uint48) \|\| +-1;` [download] And other, Quixotic attempts at 48 BITness! If you can't UNPACK 'em, PACK THEM! I tried packing 3 shorts, a quad with a pair of NULLs chasing and many other schemes: `#$quad=pack('Q', $rr, $gg, $bb, 0x0000); #$q2=pack('Q1', $rr, $gg, $bb, 0x0000); # Q2=0x0000000000000000 #$q4=pack('S4', $rr, $gg, $bb, 0x0000); # #$q5=pack("SSSS", $rr, $gg, $bb, 0x0000); # #$q3=pack('Q', $rr, $gg, $bb, 0x0000); # Q3=0x0000000000000000 #$q4=pack("Q", $rr, $gg, $bb, 0x0000); # Q4=0x0000000000000000 #$q5=pack("S", $rr, $gg, $bb, 0x0000); # Q5=0x0000000000000000 #$q5=pack("Q", @ushort[$ii .. $ii+2]);` [download] I always got zero or some error or something unprintable. Obviously reading a buffer-full and carving 6 byte slices works. And, reading 3 uint16s and clumsily bit-stitching them together gets the job done. But reading the whole file and unpacking an entire array of finished products in 1 line would be the most elegant and likely the fastest. Where is DATATYPE "G"? `@UINT48=unpack("G*", $buf); # NATIVE, 48BIT UNSIGNED GRAPHIC INTS!` [download] It is unlikely that either K or R had digital cameras offering RAW file output so they can be forgiven for overlooking the obvious utility of UINT48. Perhaps what K&R missed the Wall Gank can substantiate? Thank you, Brian	[reply] [d/l] [select]
Re^3: Perl Hashes in C? by flexvault (Monsignor) on Aug 13, 2015 at 19:02 UTC
BrianP, 32 bits is too wimpy; 4.3GB is not enough. But 4.3G4.3G BYTES is clearly OverTheTop! 18446744073709551616 ???? You have to go back to the math. 8bit or 128bit machines can get the same answer, it's knowing how the bits need to be put together :-) Wouldn't 65536 4294967296 Bytes be just about right? For you: Yes, for me, 32bits are fine for 98% of my work. All of my servers have at least 16GB, and many have many times that amount. But I can use 32bit Perl for 98% of the work (smaller footprint), and 64 bit Perl for the rest. I also have 32bit Perl with 64bit Integers. You are used to working with decimal numbers, but pack/unpack can be used to convert between binary, octal, decimal and hexadecimal. To use 48bit RGB, just think of the 6 octets as 3 16 bit numbers. Then this works: `my $myNum = 65000 * (232); my ( $R,$G,$B ) = unpack("nnn", $myNum ); print "\$myNum: $myNum\n(\$R,\$G,\$B): $R\t$G\t$B\n";` [download] A lot of monks here are better at the math than I, but I can hold my own most of the time! For the future, ask specific questions that show your problem and when possible show the code that's demonstrating the problem. For your initial problem, you didn't have to worry about endianess, but you may have to consider it if your working with different architectures. Good Luck...Ed Regards...Ed "Well done is better than well said." - Benjamin Franklin**	[reply] [d/l]
Re^4: Perl Hashes in C? by Anonymous Monk on Aug 14, 2015 at 00:05 UTC
Re: Perl Hashes in C? by flexvault (Monsignor) on Aug 14, 2015 at 16:54 UTC
BrianP, Maybe this helps? use strict; use warnings; # my $file = '249465.pprgb-srgb.absv.4000.cs.jpg'; # Original + File downloaded my $file = '249465.pprgb-srgb.absv.4000.cs.jpg2'; # New File + multiple copies my $compraw = -s $file; my $pixels = int ( $compraw / 6 ); my @UINT48; $UINT48[$pixels] = ''; ## Allocate size of array ( faster ) open ( my $in, "<", "./$file") or die "$!\n"; if ( 1 ) { my $loc = -1; my $size = sysread( $in, my $buffer, $compraw ); # No bu +ffering if ( $size ne $compraw ) { die "Disk error: $!\n" } while( $buffer ) { $UINT48[$loc++] = substr( $buffer, 0, 6, '' ); } } close $in; ## $buffer is gone! ## @UINT48 is populated with 6 byte RG +B strings my $uint48 = scalar @UINT48; print "\nImage has $uint48 pixels\n"; ## How to get the values back! my ( $R,$G,$B ) = unpack( "SSS", $UINT48[0] ); print "\t$R,$G,$B\n"; ( $R,$G,$B ) = unpack( "SSS", $UINT48[$pixels] ); print "\t$R,$G,$B\n"; __END__ > time pyrperl uniqcolors.plx Image has 41473711 pixels 17994,17993,256 55551,57599,4096 real 0m13.745s user 0m12.905s sys 0m0.840s > [download] For you earlier question about reading from disk, Linux reads/writes in 4096 byte blocks. So I multiply the pagesize * 6 (since 4096 is not exactly divisible by 6). Use you system page size and multiply by a factor of 6 ( 6,12,...). If you want '$buffer' around after the populate cycle, just remove the "if ( 1 ){ }" or define '$buffer' before the loop, Regards...Ed "Well done is better than well said." - Benjamin Franklin	[reply] [d/l]
Re: Perl Hashes in C? (just sort) by Anonymous Monk on Aug 15, 2015 at 13:26 UTC
Counting unique pixel values is the equivalent of `sort \| uniq`. No hashing, no associative maps are necessary. A good solution involves picking the most suitable sort algorithm and implementation. One might bucket by one color, then mergesort and count on the 32bit values. Anyway, 30 million items single-threaded — this ought to be ~1 sec job.	[reply] [d/l]
Re^2: Perl Hashes in C? (just sort) by Your Mother (Archbishop) on Aug 15, 2015 at 13:35 UTC
This stuff is absolutely not my forte but surely sorting is more expensive and slower than hashing unless memory use is thrashing... In any case, whatever it ought to be, others have given code and timings. :P	[reply]
Re^3: Perl Hashes in C? (just sort) by Anonymous Monk on Aug 15, 2015 at 15:03 UTC
Well... see Integer sorting. Sort can do better than n log n in specific circumstances. Whether it's sorting or hashing, the items are stashed in some location and the more you know about the distribution, the less you need to shuffle them around. Underneath, similar data structures may be used, indeed it may be hard to tell sorting and hashing apart here. (And remember, computer memory is linearly addressed which matches the single-dimensional sort ordering.) Mergesort is well amenable to parallel processing and SIMD streaming. It is practical and robust. Think of the same counting problem broken down sufficiently: you have a small strip of values, say couple hundred. Do you start hashing them or do you just run them through the pipeline?	[reply]