BrianP has asked for the wisdom of the Perl Monks concerning the following question:
I have a small Perl script that takes a crop spec with the XY position, width and height of a box, grabs the 48 bit RGB as a hash key and increments the data. It takes just a second on a ~100x100 pixel rectangle.
I use it to measure the quality of color spaces and various photo processes.
I wrote something much more clunky in C for use on entire, 216 MB pictures. But it takes 2.17 HOURS to run on a D800E pic with 27 Million unique colors. The lookup table made up of 64bit long longs gets to be 27E6 long toward the end and takes ~25 bsearch comparisons per lookup.
What I need is a good algorithm to take a 48 bit integer with quantomly random values and spit out an integer in the range 0-36E6. Perl has some pretty slick Hashing capabilities. I wonder whether any of them could be leveraged here.
GORY DETAILS: ----------
I have a C program which uses an array of 36 Million uint64s to hold 48 bits of RGB color data (16bits/channel) and 16 bits for a counter for the distinct colors found when scanning a 216 MB .RAW file.
For extremely colorful pictures, the length of this lookup table gets as high as 27 Million items. Even with a bsearch, it takes over 2 hours to count the number of pixels sharing each distinct color.
How does one HASH 48 bits of ~random data RGB data and return an integer in the range 36E6 to as high as many GB with 32GB of system RAM?
CURRENT PROCEDURE: -------------------------------
BSearch the known colors for the new color and increment the 16 bit counter if found. Else, append the new, unique colors to the end of the array and periodically QSort the new ones. Then I do a shuffle merge on the millions of old, sorted colors and the hundreds of new, colors (after qsorting the new) appending as many as possible from each array onto a merge array and go back and forth until old and new are all merged.
QSorting an array with millions of sorted and a few hundred unsorted seems like a waste. If bsearch fails on the sorted bottom, I do a linear search on the unsorted top and append the color if not found.
The BSearch comparison function masks off the highest 16 bits (0x0000FFFFFFFFFFFF) of *A and *B and returns 1 if *A & MASK > *B & MASK, -1 if <, else 0. Can't think of a faster way to compare 2 long longs from pointers.
What I need is a Perl Hash (just do it) and C speed; Cerl Language? tags hash raw rgb 48bit
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Perl Hashes in C?
by BrowserUk (Patriarch) on Aug 11, 2015 at 20:28 UTC | |
This drops into Inline::C to explore the bitmap and construct the hash. It takes 23 seconds to process a 125 megapixel image:
The handling of the hash could be made more efficient by only calculating the hash-values once instead of several times; and by pre-extending the hash. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
| [reply] [d/l] |
by BrowserUk (Patriarch) on Aug 12, 2015 at 19:56 UTC | |
A small change doubled the performance:
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
| [reply] [d/l] [select] |
by BrianP (Acolyte) on Aug 12, 2015 at 05:10 UTC | |
These are TrueColor. I am dealing with 281 TRILLION, full 16 bits/channel, 48 bits/pixel. "Found 81814 colors in 131072000 pixels" -> 1602 Pixels per color. The 216MB Photoshop RAW/16 file had 27 MILLION unique colors out of 36M "Pixels=36152321, unique Colors=27546248=76.19%" 76% of the pixels have unique colors! This makes your hashing algorithm rehash everything when it lands on a dup. I am monkeying with the MAX_UNSORTED parameter which determines when a sort has to be done after so many new, random colors have been piled on top of the lookup table. I had it set at a way, way too low 200. I wrote a Perl script to run the C program with varying MAX_UNSORT numbers and are seeing vastly better performance with 3805 is the best so far. The linear searches on top of the pile are pretty cheap compared to QSorting and merging. With a 1 in 3 sampling (12M of 36M), I have it down to < 46 seconds with 88.55% unique colors The larger the number of unique colors, the more it pays to leave a pile of unsorted colors on top. The one I did before was a ColorMAtch colorspace and it had ~76% unique colors. This one is ProPhoto and is over 85%! Same NEF file, same ACR settings, no photoshop other than to import from ACR and save as RAW. It looks like I need to work on the Sort_Merge. QSort on the entire 27 million tall stack, 99% already sorted was taking 98% of the program time. The shuffle_merge is 100 times faster on this problem | [reply] |
by BrowserUk (Patriarch) on Aug 12, 2015 at 05:36 UTC | |
The 216MB Photoshop RAW/16 file had 27 MILLION unique colors out of 36M This is a perl creating a hash with 27 million keys:
53 seconds! 76% of the pixels have unique colors! This makes your hashing algorithm rehash everything when it lands on a dup. Sorry, but if you mean "rehash every preexisting key", you are wrong. Why do you believe that? (If you mean something else by the highlighted phrase above, you're gonna have to explain yourself better.) The beauty of hashing for this application is that it doesn't matter what the range is, only the actual total. For each pixel in your image you either need to add a new key; or increment an existing value. Either takes approximately the same amount of time: circa: 0.000000457763671875 of a second on my rather ancient hardware. Indexing your 48-bit values (as opposed to my 32-bit ones) will take ~50% longer; so perhaps 40 seconds to count the colours in my 125 mega-pixel image. I have it down to < 46 seconds with 88.55% unique colors If you've already trimmed your OP time of "2.17 hours" to 48 seconds, why have you wasted our time by asking this question? Another monk that I won't waste my time reading and producing solutions for in future. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
| [reply] [d/l] |
by Anonymous Monk on Aug 13, 2015 at 01:26 UTC | |
by BrowserUk (Patriarch) on Aug 13, 2015 at 05:08 UTC | |
by BrowserUk (Patriarch) on Aug 13, 2015 at 01:40 UTC | |
by Anonymous Monk on Aug 13, 2015 at 01:58 UTC | |
by BrowserUk (Patriarch) on Aug 13, 2015 at 02:09 UTC | |
by Anonymous Monk on Aug 13, 2015 at 03:29 UTC | |
by Anonymous Monk on Aug 12, 2015 at 15:13 UTC | |
by BrowserUk (Patriarch) on Aug 12, 2015 at 15:29 UTC | |
| |
|
Re: Perl Hashes in C? (Updated)
by flexvault (Monsignor) on Aug 11, 2015 at 21:02 UTC | |
Welcome BrianP, On a Linux 32 bit system, Perl still allows 2**53 integers if you compile Perl for 64 bit integers: (Note: If Perl is a 64 bit version, then the problem goes away!) As you can see the maximum integer is larger than what you need. So you can generate integers directly by multiplying the 3 16 bit <UPDATE> To make sure nobody just multiplies the RGB together please remember that to represent the number 307 in decimal use: So the formula is ($B is the base): So to generate 48 bit RGB integers you need: Clarification complete! </UPDATE> values and then use that value as a hash key. Then each time you add a count just do:
Regards...Ed "Well done is better than well said." - Benjamin Franklin | [reply] [d/l] [select] |
|
Re: Perl Hashes in C?
by FreeBeerReekingMonk (Deacon) on Aug 11, 2015 at 19:30 UTC | |
Stick to Perl, use the underlying C routines, and the hash capabilities of perl (unless you run out of memory)
| [reply] [d/l] |
by Anonymous Monk on Aug 12, 2015 at 05:46 UTC | |
Unpack the blob into a very long array of unsigned shorts Calc the compound, multi-channel values for Yellow, Purple Cyan and White and hash them all The array slice makes it a great deal faster than segmenting it into pixels Looking at the log file: IT does 4 of these in almost 10 seconds, but look at the pixel count; < 1/4 MegaPix The hash only has to deal with 91% of 1/4 MB or 225,558 226 Thousand unique colors. Scale this to 30 MILLION unique colors and the time factor goes from 9 seconds to 9 DAYS! I have the C version down to 3 minutes flat with a 7920 Unsorted limit and 2.77 Million unique colors!
| [reply] [d/l] [select] |
by BrianP (Acolyte) on Aug 15, 2015 at 03:30 UTC | |
| [reply] [d/l] |
|
Re: Perl Hashes in C?
by pme (Monsignor) on Aug 11, 2015 at 19:01 UTC | |
| [reply] |
|
Re: Perl Hashes in C?
by flexvault (Monsignor) on Aug 12, 2015 at 14:52 UTC | |
Is there a sample image on the Internet that demonstrates your requirements? The mathematics of your problem can't be duplicated easily, but an image that demonstrates your actually problem could give those of us with a math background the ability to test for a better solution. BrowserUk solution is excellent, but without data representing your problem, we're just guessing! Regards...Ed "Well done is better than well said." - Benjamin Franklin | [reply] |
by Anonymous Monk on Aug 13, 2015 at 04:20 UTC | |
Ed, I am exploring alternate color spaces, bizarre processes and different printing options. I like the ProPhoto colorspace and find that I have many fewer problem spots when I use it; less burnouts, fewer jet black holes and better skin tones. A large percentage of the photo establishment insisting that no camera, monitor, printer or eyeball can see anything other than sRGB. It's easier to print, everything looks the same everywhere, the least common denominator makes your life easier. HOGWASH! I suspect that these people are keeping MacDonalds in business because it tastes the same everywhere. I am building analytic tools to measure the degradation that occurs with 20-year-old color spaces from a formerly large, low quality software empire. I don't know if you will be able to see this but it is a fairly colorful picture minimally processed in both ProPhoto and srgb, converted to 16bit/channel raw, subtracted by the uint16 and normalized. There is a mountain of difference. <<Link>> https://picasaweb.google.com/lh/photo/u4_oeKwuv6vXjOQesjBQri_P6Kt2i5G4X9d1GVZOYtg?feat=directlink This is one of the reasons I wanted this tool. I need to do zillions of these and 2 hours each was objectionable. These Perl hashes are astonishing in their raw power on big data. Crusading for Better Color, BrianP | [reply] |
|
Re: Perl Hashes in C?
by anonymized user 468275 (Curate) on Aug 12, 2015 at 18:59 UTC | |
One world, one people | [reply] |
|
Re: Perl Hashes in C?
by flexvault (Monsignor) on Aug 13, 2015 at 11:33 UTC | |
I downloaded the file, but it was in '.jpg' format and not a raw file. I used it in the following script, and then I copied it until the file size was larger than 217MB. The result of the 2 runs are below. I may be wrong but I don't think you need to convert to an integer and then use pack. The result will be the same as the raw 6 bytes ( 48 bits ) for the pixel.
This is the results of run1, almost all unique pixels. About 1 second.
This is using the larger file. We got some more uniques since the '.jpg' file was not a multiple of 6 bytes. About 15-16 seconds.
Try it on your real data and let us know if it helps. Regards...Ed "Well done is better than well said." - Benjamin Franklin | [reply] [d/l] [select] |
by Anonymous Monk on Aug 13, 2015 at 17:32 UTC | |
Ed, Your hard core, direct, laser focused code got the answer spot on the first go in the delightfully brief time of < 37 seconds:
Franklin would have marveled at the design efficiency. I have been attempting to arrive at the same 6 byte Quanta size for hours using (*!_))#% pack/unpack. UINT16s work perfectly as do UINT64s. How is it possible that nobody has ever thought of a UINT48? 32 bits is too wimpy; 4.3GB is not enough. But 4.3G*4.3G BYTES is clearly OverTheTop! 18446744073709551616 ???? Wouldn't 65536 * 4294967296 Bytes be just about right? Surely 281474976710656 B "is more memory than anybody will ever need"?? 281 TER! Has a ring to it. A 24 bit processor would be ideally suited for GRAPHICS PROCESSING. I hate to be a pest (but it does not usually stop me). While I still have a residual amount of hair left, might I ask you if you could point out my obvious comprehensional deficit with UNPACK? I have all 217MB in memory. 8 byte quanta are too large and 4 byte are too small so I am stuck with 2 byte type "S", uint16_t. The inane method it all I can get to work, BIT shifting and ANDing:
This works, but as another Monk pointed out, finely slicing and dicing then bit shifting the diminutive chunks and ORing them back together is hardly as satisfying as using 6 byte, native Quanta. I usually need the individual colors so I need this type of code, just not here. This is a case of wanting to find my error rather than fixing a blocking bug. How hard can it be to unpack 3 of the units from my array and smash them into the $RGB key I need with NO MONKEY BUSINESS? I tried every permutation of type S I could think of. Type Q worked fine except that it gave 1 1/3 pixel at a time. Is there a way to Unpack 3 UINT16s at a time with UNPACK()??
And other, Quixotic attempts at 48 BITness! If you can't UNPACK 'em, PACK THEM! I tried packing 3 shorts, a quad with a pair of NULLs chasing and many other schemes:
I always got zero or some error or something unprintable. Obviously reading a buffer-full and carving 6 byte slices works. And, reading 3 uint16s and clumsily bit-stitching them together gets the job done. But reading the whole file and unpacking an entire array of finished products in 1 line would be the most elegant and likely the fastest. Where is DATATYPE "G"?
It is unlikely that either K or R had digital cameras offering RAW file output so they can be forgiven for overlooking the obvious utility of UINT48. Perhaps what K&R missed the Wall Gank can substantiate? Thank you, Brian | [reply] [d/l] [select] |
by flexvault (Monsignor) on Aug 13, 2015 at 19:02 UTC | |
You are used to working with decimal numbers, but pack/unpack can be used to convert between binary, octal, decimal and hexadecimal. To use 48bit RGB, just think of the 6 octets as 3 16 bit numbers. Then this works: A lot of monks here are better at the math than I, but I can hold my own most of the time! For the future, ask specific questions that show your problem and when possible show the code that's demonstrating the problem. For your initial problem, you didn't have to worry about endianess, but you may have to consider it if your working with different architectures. Good Luck...Ed Regards...Ed "Well done is better than well said." - Benjamin Franklin | [reply] [d/l] |
by Anonymous Monk on Aug 14, 2015 at 00:05 UTC | |
|
Re: Perl Hashes in C?
by flexvault (Monsignor) on Aug 14, 2015 at 16:54 UTC | |
Maybe this helps? For you earlier question about reading from disk, Linux reads/writes in 4096 byte blocks. So I multiply the pagesize * 6 (since 4096 is not exactly divisible by 6). Use you system page size and multiply by a factor of 6 ( 6,12,...). If you want '$buffer' around after the populate cycle, just remove the "if ( 1 ){ }" or define '$buffer' before the loop, Regards...Ed "Well done is better than well said." - Benjamin Franklin | [reply] [d/l] |
|
Re: Perl Hashes in C? (just sort)
by Anonymous Monk on Aug 15, 2015 at 13:26 UTC | |
Counting unique pixel values is the equivalent of sort | uniq. No hashing, no associative maps are necessary. A good solution involves picking the most suitable sort algorithm and implementation. One might bucket by one color, then mergesort and count on the 32bit values. Anyway, 30 million items single-threaded — this ought to be ~1 sec job. | [reply] [d/l] |
by Your Mother (Archbishop) on Aug 15, 2015 at 13:35 UTC | |
This stuff is absolutely not my forte but surely sorting is more expensive and slower than hashing unless memory use is thrashing... In any case, whatever it ought to be, others have given code and timings. :P | [reply] |
by Anonymous Monk on Aug 15, 2015 at 15:03 UTC | |
Well... see Integer sorting. Sort can do better than n log n in specific circumstances. Whether it's sorting or hashing, the items are stashed in some location and the more you know about the distribution, the less you need to shuffle them around. Underneath, similar data structures may be used, indeed it may be hard to tell sorting and hashing apart here. (And remember, computer memory is linearly addressed which matches the single-dimensional sort ordering.) Mergesort is well amenable to parallel processing and SIMD streaming. It is practical and robust. Think of the same counting problem broken down sufficiently: you have a small strip of values, say couple hundred. Do you start hashing them or do you just run them through the pipeline? | [reply] |