in reply to Micro optimisations can pay off, and needn't be a maintenance problem
Abigail
Re: Re: Micro optimisations can pay off, and needn't be a maintenance problem
by pilgrim (Monk) on Nov 12, 2002 at 17:51 UTC
by Abigail-II (Bishop) on Nov 13, 2002 at 09:52 UTC
Abigail
by pilgrim (Monk) on Nov 13, 2002 at 19:40 UTC
That, I can't help you with; perhaps dws is onto something. I couldn't run the benchmark and haven't the time to tweak it and break out "get" versus "set" at the moment. (Nor have I the time to speculate on why it might not be a linear improvement.) Apparently, I've got just enough time to be no help at all. At least my keen grasp of the obvious is intact.
Re: Re: Micro optimisations can pay off, and needn't be a maintenance problem
by dws (Chancellor) on Nov 13, 2002 at 10:29 UTC
The benchmark test cases include overhead for object creation. That might skew the test results a bit. Getters and setters aren't benchmarked individually. Instead, the benchmark tests call getters and setters an equal number of times. But in the Z Buffer Demo getters seem to predominate (based on a quick read of the code). Intuitively, getters are going to be faster than setters, perhaps enough so to explain a factor of 6 improvement.
by pilgrim (Monk) on Nov 14, 2002 at 23:28 UTC
Breaking it out further, and accessing the previously-set $thing1 through $thing4 rather than creating a new object each iteration, I see that the individual "get" and "set" are actually about 37% faster. So object creation does factor in, and getters are faster. It's still "only" a 37% improvement rather than sixfold, but it isn't a uniform 37% across the board.
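For anyone who wants to reproduce this kind of breakdown, here is a minimal sketch of the approach pilgrim describes; the Point class and its accessor are made up for illustration and are not taken from the Z Buffer Demo. The object is built once, outside the timed code, so only the accessor calls themselves are measured, with "get" and "set" timed separately and object creation kept as a third case for contrast.

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Illustrative class; the real Z Buffer Demo classes differ.
    package Point;
    sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub x   { my $self = shift; $self->{x} = shift if @_; return $self->{x} }

    package main;
    my $pt = Point->new( x => 42 );    # built once, outside the timed subs

    cmpthese( -3, {
        get => sub { my $v = $pt->x },          # getter alone
        set => sub { $pt->x(7) },               # setter alone
        new => sub { Point->new( x => 7 ) },    # object creation, for contrast
    } );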
Re^2: Micro optimisations can pay off, and needn't be a maintenance problem (I don't believe it)
by tye (Sage) on Dec 27, 2002 at 23:59 UTC
I had been hoping to eventually see a response to explain this impossibility. This thread just now came up in the chatterbox, and I'm glad that BrowserUk took some time to help me understand his perspective on the issue.

The title of the node seems to clearly state that micro optimizations can "pay off". The node clearly indicates that about a 6-fold speed-up was obtained via micro optimization (and BrowserUk confirms in the chatterbox that this was the case). Some people who responded in this thread seem to think that it is possible for a 25% speed-up via a micro optimization to cause a 6-fold speed-up. Even more people have indicated that they think this node made exactly that claim (but BrowserUk explains that this was not his intent -- more on this in a bit). I can just see this thread being a huge boost to the problematic micro optimization craze. So I think it is important to strongly counter the impression this thread has been giving some people.

First, let's address the part that BrowserUk did not intend to convey:

"[...], modifying these to do [...] has a substantial effect on the performance of the code. Reducing the draw time of the cube, from around 2 minutes (on my box) to around 20 secs."

To me, this very clearly implies that this one simple micro optimization results in about a 6-fold speed-up. BrowserUk explains that just a few lines up, the sentence

"One of the major causes of the slow performance is his [...] use of [...]"

was meant to indicate that several micro optimizations combined to give the 6-fold speed-up. So we do not have an example of a single, less-than-100% speed-up producing a total speed-up of over 500%. That would be impossible.

If you have a micro optimization in which Benchmark.pm is capable of discerning a 50% speed-up, you will be unlikely to see even a 25% total speed-up. Benchmark.pm goes to extreme lengths to try to isolate the item being timed. Even if your code does nearly nothing but that same bit over and over, there will be overhead that Benchmark.pm tries to eliminate from the calculations. But this overhead cannot be eliminated from the running of your script and will not be sped up by your optimization, so it will dilute the total speed-up.

In practice, a 50% speed-up via a micro optimization is pretty rare and is likely to have only a marginal impact on the total run time of your script, since most scripts don't spend even 50% of their time doing just one tiny thing. The smaller the operation you optimize, the smaller the part it is likely to play in the total run time of your script. It is a very unusual script that has even close to 50% of its total run time tied up in a "micro" operation. If your script spends 20% of its time doing the operation, then a 40% speed-up in that operation can't give you more than an 8% total speed-up.

Measuring these percentages is certainly tricky, but the math is not. Let's make the math really simple and even give the optimizers a big advantage. We'll interpret a "40% speed-up" as meaning that the new run time is 40% less than the original run time (new run time = 60% of original run time). Benchmark.pm would actually report this as "-40%" (the original being 40% slower) and "+71%" (the new being 71% faster), so the person doing the optimization would surely claim a "71% speed-up". The simplified math would be:

    1 - ( 0.80 + 0.20*(1-0.40) ) = 1 - 0.92 = 0.08 = 8%

but in reality that represents a 71% operation speed-up producing a total speed-up of 7.4% (when the operation is 20% of the total run time).
And having seen lots of micro optimization attempts, I feel confident in saying that micro optimizations are most likely to produce less than a 20% speed-up and to consist of less than 10% of the total run time, and so are very unlikely to give you even a 2% total speed-up.

"So, if you have 100 such optimizations, you could expect a 200% speed-up."

No. If you have 100 optimizations, then they can't total more than 100% of the run time. If you have 100 micro optimizations, then I'd bet they don't total more than 50% of the run time (because Perl has quite a bit of overhead). Even if all 100 of those micro optimizations made each of their operations 10,000 times faster, the total speed-up would be less than a 50% reduction in the total run time.

So I don't even believe BrowserUk's claim that several micro optimizations together resulted in a 6-fold speed-up. To get a 5-fold speed-up, you need to eliminate 4/5ths (80%) of the run time. So, for example, if you amazingly found several micro operations that accounted for 90% of the run time, you'd have to speed each of those operations up 9-fold in order to get a 5-fold total speed-up. Well, BrowserUk chose a "major cause" of the slowness and managed to speed it up 25% (not 900%)! It just doesn't add up.

I'm sure BrowserUk believes that micro optimizations made the code 6 times as fast. But I can't believe that conclusion at all based on the evidence presented. I can only guess that a non-micro optimization made a more-than-6-fold improvement to a huge percentage of the total run time, leading to the overall 6-fold speed-up, and that BrowserUk did not realize/recognize this. So I don't believe "micro optimizations can pay off" with a 6-fold speed-up.

- tye
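The percentages worked through above follow directly from the fraction of the run time that the optimised operation accounts for. Here is a minimal sketch of that arithmetic; the helper name total_saving is chosen for illustration and does not come from the thread.

    use strict;
    use warnings;

    # Best-case reduction in total run time, given the fraction of the
    # run time spent in the optimised operation and the factor by which
    # that operation was sped up (Amdahl-style arithmetic).
    sub total_saving {
        my ( $fraction, $op_speedup ) = @_;
        return $fraction * ( 1 - 1 / $op_speedup );
    }

    # 20% of the run time, new operation time = 60% of the old (a "40% speed-up"):
    printf "total saving: %.0f%%\n", 100 * total_saving( 0.20, 1 / 0.60 );    # 8%

    # 90% of the run time sped up 9-fold, as in the 5-fold example above:
    printf "total speed-up: %.1fx\n", 1 / ( 1 - total_saving( 0.90, 9 ) );    # 5.0x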
by BrowserUk (Patriarch) on Dec 29, 2002 at 16:42 UTC
Okay Tye, as requested.

I started with the version of the code as currently posted at Z Buffer Demo, which I believe is a somewhat different version from the original, but close enough that the differences made by rbc are not an issue. I looked at trying to reverse all the non-trivial modifications I made to the original after I posted Micro optimisations can pay off, and needn't be a maintenance problem, but this proved beyond my memory's capability after all this time.

The two sets of (crude) comparative timings below show that on my machine, the original code takes 2 minutes and 5 seconds. This was measured by installing print scalar localtime; at the top and bottom of the sub main::Doit. Crude, but more accurate than the "watch the clock" method I was using when I did this the first time around. My optimised version of the code below shows that without making any changes to the structure of either the code or the data, this timing has been reduced to under 4 seconds. I tried to stop when I achieved the original "around 20 seconds", but couldn't work out how.

Now the debate will centre on whether the changes I have made constitute micro-optimisations or not. Another reason for avoiding the debate in the first place. I would characterise all the changes I have made under the heading "re-factoring", as no functionality has been added, changed, or removed, nor been moved. I believe all these types of changes have, at various times and by various people, been described as "micro-optimisations", and in that way been frowned upon.

I did not set out to disprove this advice. I set out to assist a fellow monk with his question, and discovered that the edit/run cycle took so long that it discouraged me (and others, if you look at the original thread). So, as I had an interest in seeing how the Z-buffer algorithm had been implemented, I set about trying to reduce the cycle time using simple changes.

During this process, I found many places in the code where a single line would call several accessor subs. Each of these subs was shifting the argument list into named vars. Often these named vars were then passed as parameters to nested calls, where they again were being shifted into another set of named vars. I originally recoded these accesses using the $_[n] nomenclature, but that rapidly became confusing, especially in subs where there were 4, 5 or 6 parameters. I then hit upon the idea of using constant subs to give meaningful names to the parameters. I had recently read somewhere that constant subs are inlined by the compiler and therefore have no more overhead than coding the constant itself, and when I benchmarked a small test case, this proved to be the case. I felt that as I hadn't seen this "discovery" described anywhere else, it was worth sharing with the community, and I made my post.

The controversy, such as it is, centres on the (obviously mathematically impossible) idea that a 25% reduction in attribute access could in some way be responsible for a "speed up factor of 6". (That phrase is in quotes because I never made that claim.) The obvious answer is "It cannot", and the reason I left the original question unanswered is that the answer is so obvious that I assumed it to be rhetorical. In hindsight, the phrase "Reducing the draw time of the cube, from around 2 minutes (on my box) to around 20 secs." from my original post should have read "Contributing to a reduction in the draw time of the cube, from around 2 minutes (on my box) to around 20 secs.".
Had that read "from exactly 125.2 seconds to exactly 20.7", then I could have seen the source of the misunderstanding and the need for this belated re-examination.

As it stands, if anyone feels that they were fooled by an obvious mathematical impossibility, or that the idea of 'Constantising parameter access' was unworthy of a wider audience, then I apologise. Likewise, if anyone believes that understanding the costs and benefits of TIMTOWTDI, and using them to achieve a reduction in the runtime of one non-trivial, real-world application from 125 seconds to under 4, is not worthy of consideration, I again apologise. Finally, if any monk feels that I profited unduly (in the oft-described, 'meaningless', XP stakes) from my original post (150 currently), then I invite them to downvote this post in recompense. Or, if sufficient numbers believe that to be the case, I am quite sure that it is possible to have that number or any arbitrary "fine" of XP deducted from my tally. I will not argue the point regardless.

Please find attached my crude timings of the original and modified programs, along with the modified code. I did set out to diff the two and highlight each of the changes and give my justification for why each change was, in and of itself, a micro-optimisation, but I got bored with the whole laborious process and gave up. Sorry.

It might be worth noting that I did go on to make several more, decidedly non-trivial modifications when I was doing this originally. Each of these further steps had an additional benefit on performance, but the main blockage had moved to the Tk code, and that was beyond my (self-assigned) brief to modify. I think that doing the drawing into a bitmap and blitting this to the canvas would allow the draw time to be reduced by (possibly) another order of magnitude, as the overhead inherent in drawing using discrete, hit-testable lines does nothing for the application as it stands.

Original code - timing and profile: Read more... (42 kB)
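The "constantising parameter access" idea described above can be sketched roughly as follows; the sub and field names are invented for illustration and are not taken from the Z Buffer Demo. Because constant subs are inlined at compile time, indexing @_ through them costs no more than a literal subscript, while avoiding the copy into lexicals that a conventional accessor pays on every call.

    use strict;
    use warnings;
    use constant { SELF => 0, X => 1, Y => 2, Z => 3 };    # parameter positions

    # Conventional style: every call copies @_ into lexicals.
    sub set_point {
        my ( $self, $x, $y, $z ) = @_;
        @{$self}{qw(x y z)} = ( $x, $y, $z );
    }

    # Micro-optimised style: index @_ directly, with readable names.
    # The constants are inlined, so this is as cheap as @_[1, 2, 3].
    sub set_point_fast {
        @{ $_[SELF] }{qw(x y z)} = @_[ X, Y, Z ];
    }

    # The same idea for a getter: no shift, no copy.
    sub get_x { return $_[SELF]{x} }

    my %point;
    set_point_fast( \%point, 1, 2, 3 );    # %point is now ( x => 1, y => 2, z => 3 )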
by tye (Sage) on Dec 30, 2002 at 05:48 UTC
Thank you. That clears things up quite nicely to my mind.

A "micro optimization" is where you take a small operation (hence "micro") and make it run a bit faster (in hopes of making your program run faster). This is nearly doomed to be quite disappointing (as far as that hope is concerned). So let's look at your optimizations (not in quite the same order as you listed them).

Yes, these are micro optimizations.

No, this is not a micro optimization. Instead of making a tiny operation a bit faster, you are reducing how many times you run that tiny operation. So this still has the disadvantage of the operations being tiny (so the percentage of the total run time being affected is likely quite small), but it has the advantage that instead of reducing this run time by something like 20%, you are cutting it by some large fraction. If the loop was being iterated 100 times, then you've cut that run time 100-fold. The effect on the total run time is still likely to be relatively small, simply because the tiny operation is unlikely to constitute a large fraction of the total run time. If it was, for example, 10% of the total run time, then you might speed up the run by nearly a full 10%.

I'm not quite sure what you mean by this. It might be a micro optimization or it might be similar to the previous item.

(Emphasis mine.) Here we have a mix of micro optimizations and very non-micro optimizations. Changing the nesting of loops is an important optimization technique. It is the type of thing where you can cause the entire body of a loop to be run a small fraction of the number of times that it used to be run. The bodies of the loops could account for a huge percentage of the original total run time, and you can reduce how many times they get run by a huge fraction. This is the kind of thing that can make programs run 10s or 100s of times faster.

I'm sorry that I can't take the time right now to analyze the changes in detail. For now I'll assume that this explains things. If others take the time to study this in more detail, I will be happy to see it and interested in their findings. Thank you very much, BrowserUk, for going through the effort to present this information.

- tye
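As a rough illustration of the distinction drawn above: moving a loop-invariant accessor call out of a loop does not make the accessor itself any faster, it just runs it far fewer times. The Shape class and names below are invented for the sketch and are not the Z Buffer Demo's own.

    use strict;
    use warnings;

    package Shape;
    sub new    { return bless { colour => $_[1] }, $_[0] }
    sub colour { return $_[0]{colour} }

    package main;
    my $shape = Shape->new('red');
    my @pixels;

    # Before: one accessor call per pixel (10,000 calls).
    for my $y ( 0 .. 99 ) {
        for my $x ( 0 .. 99 ) {
            push @pixels, [ $x, $y, $shape->colour ];
        }
    }

    # After: the loop-invariant value is fetched once (1 call).
    @pixels = ();
    my $colour = $shape->colour;
    for my $y ( 0 .. 99 ) {
        for my $x ( 0 .. 99 ) {
            push @pixels, [ $x, $y, $colour ];
        }
    }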
by Aristotle (Chancellor) on Dec 29, 2002 at 18:23 UTC
Just wanted to comment on the individual changes. Update: tye uses a precise definition of "micro optimization" in his post below and arrives at better points. Read that post first; it obsoletes most of mine. Read more... (3 kB)
Makeshifts last the longest.
by BrowserUk (Patriarch) on Dec 28, 2002 at 00:57 UTC
So pronounce Judge and Jury. Examine what is said, not who speaks.
by tye (Sage) on Dec 28, 2002 at 01:52 UTC
I'm sorry you took this personally and so hard. I was making a strong attack on the impression that I felt the node was giving people. This was certainly not meant as an attack on you. But you took it that way, so I'm sorry for giving you that impression. I hope you still find the strength to post the before/after code (like you mentioned in the chatterbox that you would) so that the community can look at the changes and decide where the huge speed gain came from. - tye
by BrowserUk (Patriarch) on Dec 28, 2002 at 02:26 UTC
by tye (Sage) on Dec 28, 2002 at 22:46 UTC