Re^2: Counting keys with defined or undefined elements in a hash
by grinder (Bishop)
on Jun 05, 2003 at 16:59 UTC
My aesthetic sense is somewhat offended by scanning the list twice using grep. I would rather scan the list once and count both the defined and undefined elements in a single pass. Not to mention the fact that grep constructs a list, which you then throw away after having counted its elements.
The code below uses the result of defined to index an array. The array contains two elements, the number of undefined and the number of defined elements.
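That code didn't survive the cut-and-paste into this node; here is a sketch of the idiom (the hash contents are mine, for illustration). defined returns Perl's canonical false or true, which numify to 0 and 1, so the result can index a two-counter array directly:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %h = (a => 1, b => undef, c => 3, d => undef);

# One pass over the values: the boolean result of defined()
# numifies to 0 or 1 and selects which counter to bump.
my @count = (0, 0);
$count[defined $_]++ for values %h;

print "undefined: $count[0], defined: $count[1]\n";
```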
(A short while later)...
Note that in both broquaint's code and mine, you are still iterating over the hash. There really is no way around that. You could, as the AM says, keep track of what you insert as you go along, but that approach offers far more ways to introduce bugs and makes for messier code. Or, if you are brave and insist on this approach, you could try some fancy footwork with tie.
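A minimal sketch of what that tie footwork might look like — a hash class that maintains running counts as values are stored and deleted, so no scan is ever needed. The class and field names are mine, not anything from the thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Tie::CountDefined;

sub TIEHASH { bless { data => {}, defined => 0, undef => 0 }, shift }

sub STORE {
    my ($self, $key, $val) = @_;
    # Retire the old value's count before recording the new one.
    $self->_uncount($key) if exists $self->{data}{$key};
    defined $val ? $self->{defined}++ : $self->{undef}++;
    $self->{data}{$key} = $val;
}

sub DELETE {
    my ($self, $key) = @_;
    return unless exists $self->{data}{$key};
    $self->_uncount($key);
    delete $self->{data}{$key};
}

sub _uncount {
    my ($self, $key) = @_;
    defined $self->{data}{$key} ? $self->{defined}-- : $self->{undef}--;
}

sub FETCH    { $_[0]{data}{$_[1]} }
sub EXISTS   { exists $_[0]{data}{$_[1]} }
sub CLEAR    { %{ $_[0]{data} } = (); $_[0]{defined} = $_[0]{undef} = 0 }
sub FIRSTKEY { keys %{ $_[0]{data} }; each %{ $_[0]{data} } }
sub NEXTKEY  { each %{ $_[0]{data} } }

package main;

tie my %h, 'Tie::CountDefined';
%h = (a => 1, b => undef, c => 3);
my $counts = tied %h;
print "defined: $counts->{defined}, undef: $counts->{undef}\n";
```

The price, of course, is that every hash access now goes through a method call, which is exactly the sort of overhead that makes "just iterate" the sane default.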
Just iterate and be done with it.
...(some time later)...
Just for kicks, I rolled a benchmark out of the two approaches:
except I screwed up
Be careful! As it stands, this code creates a hash with 10 million elements. On my machine it soaks up 800Mb of RAM. The results show that the scalar grep, somewhat to my surprise, consistently outperforms the for approach:
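The benchmark code itself didn't survive into this node; here is a sketch of how the two approaches might be lined up with Benchmark's cmpthese, scaled down from 10 million elements to a thousand so it finishes quickly (hash contents and label names are mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Scaled down from the 10 million elements used in the real run.
my %h = map { $_ => ($_ % 3 ? $_ : undef) } 1 .. 1_000;

cmpthese(2_000, {
    grep => sub {
        # Two scans, but grep in scalar context only returns a count.
        my $def   = grep defined,  values %h;
        my $undef = grep !defined, values %h;
    },
    for => sub {
        # One scan, using the result of defined() as an array index.
        my @count = (0, 0);
        $count[defined $_]++ for values %h;
    },
});
```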
Still, iterating over a 10_000_000 element list more than 160 thousand times a second isn't too shabby a performance...
...(and then quite some time later)...
Pondering the results of my benchmark, I realised something was flawed. Comparing the amount of memory involved with the number of iterations performed implied a bus speed tending towards the exabyte-to-zettabyte-per-second range... which meant that the benchmark wasn't testing what I thought it was testing. Indeed, jsprat arrived at the same conclusion independently.
Here, then, is a corrected benchmark that produces far more reasonable results. I just changed the cmpthese call as follows:
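(The snippet itself didn't make it into this node; given the next sentence, the change was presumably to the first argument of cmpthese — a negative count asks Benchmark to run each candidate for at least that many CPU seconds, rather than a fixed number of iterations. A sketch, scaled down to one second so it runs quickly:)

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my %h = map { $_ => ($_ % 2 ? $_ : undef) } 1 .. 500;

# A negative count means "run for at least that many CPU seconds".
# The real run used -300; -1 keeps this sketch quick.
cmpthese(-1, {
    grep => sub { my $def = grep defined, values %h },
    for  => sub { my @c = (0, 0); $c[defined $_]++ for values %h },
});
```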
(Yes, I had to up this to 300 seconds of CPU time to get good results)... which gives:
This is running on a 12 million element hash, which consumes 900Mb on my machine. (I forgot I had compiled the kernel to prevent processes from consuming more than a gig, otherwise I could have pushed the size of the hash up further.)
I should also note that this was run on 5.8.0, so I'm getting the benefit of the free scalar grep.
...(and yet still later, after I read some more in the thread)... I included the difference algorithm (and I'm kicking myself for not having thought of it -- one more reason why this site is such a great place), which pulls way ahead. The changes are:
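(The diff itself was lost in the cut-and-paste; the idea is that scalar keys is O(1), so a single grep for the defined values suffices and the undefined count falls out by subtraction. A sketch with an illustrative hash of my own:)

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %h = (a => 1, b => undef, c => 3, d => undef, e => 5);

# scalar keys costs nothing; one scalar-context grep counts the
# defined values; the undefined count is the difference.
my $total     = scalar keys %h;
my $defined   = grep defined, values %h;
my $undefined = $total - $defined;

print "defined: $defined, undefined: $undefined\n";
```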
So there you have it. At this late stage of the game, if performance is still a worry, you should start thinking about writing in C, or using a database.