
Bitten by the worst case (or why it pays to know whats inside the black box)

by demerphq (Chancellor)
on Jun 26, 2004 at 12:49 UTC

Yesterday I had an interesting problem arise. I was working on some code that was supposed to cache some costly-to-produce data structures generated from a DB query. Since these data structures are both large in memory and slow to create, caching them represented a reasonable win on both counts. Unfortunately, due to an oversight on my part, the caching was ineffective: I was never reusing the old data and was always regenerating the same data structures.

After a bit of review I realized that the problem was in the way I had implemented the cache lookup. The process was essentially this: produce the union of four queries, each of which returned a variable number of two-field records (the first field being a telephone prefix and the second an ID representing its pricing model), then join that data together to form a string key that could be used in a hash lookup to determine whether the data was already in the cache. For technical reasons I couldn't do the union in a single query, so the union was performed by simply pushing the result sets into an array.
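
In rough terms the lookup looked something like this (a simplified sketch rather than the real code; the variable names, the sample data and the build_structures() helper are made up for illustration):

use strict;
use warnings;

# Each query returns [ prefix, pricing_id ] pairs; the four result
# sets are pushed end to end and then joined into one string key.
my @result_sets = (
    [ [ '4412', 7 ], [ '4420', 3 ] ],   # query 1
    [ [ '331',  5 ] ],                  # query 2 (and so on for 3 and 4)
);

my %cache;
my @union;
push @union, @$_ for @result_sets;      # the simple end-to-end merge

# If equivalent records arrive in a different order, this key differs
# and the cache lookup misses even though the data is the same.
my $key = join ';', map { "$_->[0]=$_->[1]" } @union;

my $data = $cache{$key} ||= build_structures(\@union);

sub build_structures { my ($records) = @_; return { records => [ @$records ] } }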

The problem was that, for various reasons, two actually equivalent data sets might be returned in slightly different orders. Because the order was different virtually every time, the cache lookup always failed and I always regenerated the data. This meant the broken caching was actually far worse than no caching at all, as the overall memory consumption (and thus swapping, etc.) was vastly higher because I was storing multiple copies of the resulting large data structures. Naturally my first thought was "dummy, just sort the union array before you build the lookup key". So a sort call exactly like

@union=sort { $a->[0] cmp $b->[0] } @union;
was added to the code. I then reran it, expecting the caching to kick in and a nice little speedup of the process involved.

It didn't work! In fact, despite the fact that the caching was now clearly happening and previously created data structures were being reused, the overall process was actually much slower. As in at least 10 times slower!

This perplexed me. Why on earth would working caching be slower than no caching at all? It made no sense. I spent some time reviewing the code, and adding debug print statements to see where the extra time was coming from. I soon realized that the slowdown was in the sort statement, and after boggling at it for a bit something occurred to me: each of the four queries was returned sorted by the prefix. These were then merged end to end with a simple push. The result was that the @union array was nearly sorted. This was especially true as one of the queries was responsible for about 80% of the records and a second for about 15%, with the remaining 5% coming from the other two queries. In fact it was the makeup of the three smaller queries that resulted in the cache misses, as the same record could be in any one of the three depending on circumstance.

So I was sorting an almost-sorted array. Which in perl 5.6 (it's not clear to me if this is still true in 5.8) is a very bad plan, as that version of perl uses a quicksort implementation. (Actually I had to consult the CB to verify this, and thankfully Zaxo and tye confirmed that this was in fact the case.) Quicksort, as those with a compsci background will recall, is O(N log N) on average, but it has the nasty property that its worst case is O(N^2), and that worst case occurs on nearly or totally sorted data. Sorting ~8000 records under these circumstances was taking about 17-20 seconds! (And on a fairly high-powered machine!)

So the worst case of quicksort had just bitten me right on the bum!

After Zaxo and tye supported my suspicion that the slowdown was probably due to falling into the worst case of the sort algorithm, I removed the "order by" clauses from the queries and, for good measure, threw in a call to shuffle as well, which left me with something like the following code:

use List::Util 'shuffle';
@union=shuffle @union;
@union=sort { $a->[0] cmp $b->[0] } @union;

Presto: the ~20 second delay disappeared and the run time of the shuffle-plus-sort dropped below the resolution I could easily measure. (I didn't bother with hires timing.) Essentially the combined shuffle/sort took negligible time compared to the original sort alone. Case closed, and a nice little speedup for the program. :-)
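
For the curious, a benchmark along these lines should show the effect on a quicksort-based perl (a hedged sketch: the record layout and sizes are made up; on 5.6 quicksort is simply the default, on 5.8 through 5.26 the pragma below requests it, and newer perls dropped the algorithm subpragmas entirely, so the sketch skips the pragma there):

use strict;
use warnings;
use Benchmark qw(timethese);
use List::Util qw(shuffle);

# Ask for the quicksort implementation where the sort pragma still
# supports algorithm selection (5.8 through 5.26); elsewhere skip it.
BEGIN {
    if ( $] >= 5.008 && $] < 5.027 ) {
        require sort;
        sort->import('_quicksort');
    }
}

# Fake prefix records, mostly sorted with a small unsorted tail,
# roughly like the @union described above.
my @base   = map { [ sprintf( '%08d', $_ ), $_ % 10 ] } 1 .. 8000;
my @nearly = ( @base[ 0 .. 7599 ], shuffle @base[ 7600 .. 7999 ] );

timethese( -3, {
    sort_nearly_sorted => sub {
        my @s = sort { $a->[0] cmp $b->[0] } @nearly;
    },
    shuffle_then_sort => sub {
        my @r = shuffle @nearly;
        my @s = sort { $a->[0] cmp $b->[0] } @r;
    },
} );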

I think the moral of this story is that the black-box mentality one often encounters in certain areas of development (and IMO particularly often in the perl programming community) is a bad thing. If you don't have at least a reasonable idea of how the black box performs its tasks, and thus its strengths and weaknesses, then solving this kind of problem would be very difficult. Why should two operations (and two traversals of the array) be faster than one? A lot of programmers, I suspect, would have just shrugged and said "well, sort knows best" and left it at that. Because I knew enough about the insides of the black box that is the perl sort function, I was able to fairly easily identify the cause and then decide on what I consider a counter-intuitive (at face value anyway) solution to the problem.

My advice to folks out there is that when you work with black boxes (databases, indexing, modules, perl functions, etc.) do a bit of homework. Find out how they are implemented, at least in abstract, high-level terms. Don't just assume that the black box works and that's all you need to know. Because one day the black box won't work as it should, and if you don't understand at least the rudiments of its internals you'll probably never even know that something is wrong. I'm not saying you shouldn't use a DB or a module or a perl builtin without reading the source code (although with CPAN modules there are other good reasons to do exactly that), but you should understand how the item works well enough to be able to debug its use when it does go pear-shaped.

Anyway, I hope this posting motivates some to inform themselves of the principles by which their regular tools do their jobs. Simply because something is a black box doesn't mean you can't get to grips with how it works at a high level. And that knowledge is very likely to pay off in spades one day.

Cheers.


---
demerphq

    First they ignore you, then they laugh at you, then they fight you, then you win.
    -- Gandhi


Replies are listed 'Best First'.
Re: Bitten by the worst case (or why it pays to know whats inside the black box)
by Zaxo (Archbishop) on Jun 26, 2004 at 16:10 UTC
    it's not clear to me if this is still true in 5.8

    Perl 5.8 defaults to mergesort on most platforms, though quicksort is still available through use sort '_quicksort';.

    The mergesort algorithm does not have the nasty worst case performance which bit demerphq. It guarantees to run in O(N log N) time. It is also stable, meaning that items which compare equal in the sort retain their order in the result. Its disadvantage is that it uses O(N) memory for temporary storage.
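
    A quick illustration of the stability point (a small made-up example: on 5.8 the pragma below explicitly requests a stable sort, and on current perls sort is always stable, so it is simply a no-op there):

    use strict;
    use warnings;
    use sort 'stable';    # 5.8+: explicitly request a stable sort

    # Sorting on the first field only makes the two 'A' rows key-equivalent;
    # a stable sort keeps them in their original relative order.
    my @rows   = ( [ 'A', 1 ], [ 'B', 2 ], [ 'A', 3 ] );
    my @sorted = sort { $a->[0] cmp $b->[0] } @rows;

    # Prints "[A,1] [A,3] [B,2]"; an unstable sort would be free to
    # emit [A,3] before [A,1].
    print join( ' ', map { "[$_->[0],$_->[1]]" } @sorted ), "\n";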

    After Compline,
    Zaxo

      Note also that for 5.8 perldoc sort says that "large arrays" are shuffled before sorting if you tell perl to use quicksort (although it doesn't specify how large "large" is . . .).

        If I understand the code correctly (not guaranteed) then the minimum size after which a pre-shuffle is performed before quicksorting is 255.

        /* QSORT_PLAY_SAFE is the size of the largest partition we're willing
           to go quadratic on.  We innoculate larger partitions against
           quadratic behavior by shuffling them before sorting.  This is not
           an absolute guarantee of non-quadratic behavior, but it would take
           staggeringly bad luck to pick extreme elements as the pivot
           from randomized data. */
        #ifndef QSORT_PLAY_SAFE
        #define QSORT_PLAY_SAFE 255
        #endif

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
Re: Bitten by the worst case (or why it pays to know whats inside the black box)
by dws (Chancellor) on Jun 26, 2004 at 16:58 UTC

    Unfortunately, due to an oversight on my part, the caching was ineffective: I was never reusing the old data and was always regenerating the same data structures.

    My brain has now been so warped by doing Test-Driven Development that when I hear of problems like this I immediately think "missing unit test!" and "what test could I have written while implementing caching to verify that the cache is effective?"

    One simple approach is to expose some sort of "cache load" counter, then test to make sure that it is or isn't incrementing, as appropriate. Something like the following is quick to write:

    my $old_cache_count = $db->cache_count();
    my $data = $db->fetch('known key 1');
    ok( $data, 'expected value for key 1' );

    # we expect that to have caused a cache load
    my $new_cache_count = $db->cache_count();
    ok( $new_cache_count, $old_cache_count + 1 );

    $old_cache_count = $new_cache_count;
    $data = $db->fetch('known key 2');
    ok( $data, 'expected value for key 2' );

    # the cache shouldn't have reloaded
    ok( $db->cache_count(), $old_cache_count );
    You might need a way to load known test data, but if you're developing repeatable unit tests, that's a problem you'll need to tackle at some point.

Re: Bitten by the worst case (or why it pays to know whats inside the black box)
by thor (Priest) on Jun 26, 2004 at 14:12 UTC
    The only observation I have is that if I were a maintenance programmer and saw the shuffle in there, I'd say "WTF?" I'm not saying that it's the wrong thing to do, it's just counterintuitive to shuffle an array before sorting it.

    thor

      it's just counterintuitive to shuffle an array before sorting it.

      And yet this is what many quicksort implementations do by default in order to avoid the worst case. But you are correct; I'll add a comment explaining the purpose of the shuffle so that it doesn't cause any confusion later on. Thanks for the suggestion.


      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi


Re: Bitten by the worst case (or why it pays to know whats inside the black box)
by mpeppler (Vicar) on Jun 26, 2004 at 14:28 UTC
    My advice to folks out there is that when you work with black boxes (databases, indexing, modules, perl functions, etc.) do a bit of homework. Find out how they are implemented, at least in abstract, high-level terms.
    Absolutely. In my case this has meant understanding how the query optimizer works in Sybase - because although it will usually do a good job, sometimes (as with the quicksort issue shown here) you can get pathological queries that you think should work just fine, but that due to some optimizer decision actually go off the deep end... If you understand how the optimizer works it is then much easier to re-write the query to be more efficient.

    Michael

      you can get pathological queries that you think should work just fine, but that due to some optimizer decision actually go off the deep end...

      Agreed. I think with DBs in particular a sound understanding of how they do their business is very useful, both in DB design and in writing efficient queries. In fact the number of times I've heard people say "I don't need to know how DBs work, it's a black box" was a motivation for writing this node.

      Also, this type of understanding has on at least one occasion earned me significant brownie points. The Sybase DBA I work with the most (who is now my boss) once wondered aloud why dropping a clustered index and rebuilding it didn't always result in the same record ordering. The reason, of course, was that internally Sybase was using an unstable sort, so two records that were key-equivalent might be reordered. The fact that I was able to explain this to the man who has taught me practically all of the Sybase stuff I know earned me a lot of respect. So sometimes the payoff isn't even in the code domain, but in real life.


      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi


        Agreed. I think with DBs in particular a sound understanding of how they do their business is very useful, both in DB design and in writing efficient queries. In fact the number of times I've heard people say "I don't need to know how DBs work, it's a black box" was a motivation for writing this node.
        As is my current situation of helping people who write code that uses Apple's WebObjects Java framework. WO presents objects to the programmer that represent database rows - but the programmer has no control over how those rows are fetched. In general it works OK, but when it doesn't it's a major PITA to try to find a way to make it work at a reasonable speed.

        Michael

Re: Bitten by the worst case (or why it pays to know whats inside the black box)
by duff (Parson) on Jun 26, 2004 at 23:23 UTC

    Nice meditation!

    After a bit of review I realized that the problem was in the way I had implemented the cache lookup. The process was essentially this: produce the union of four queries, each of which returned a variable number of two-field records (the first field being a telephone prefix and the second an ID representing its pricing model), then join that data together to form a string key that could be used in a hash lookup to determine whether the data was already in the cache.

    Some other ideas:

    • use the queries themselves as the key into the cache instead of the results
    • since most of the data is sorted already, instead of just pushing onto the end, insert them at the correct place (a merge sketch follows this list)
    • and, inspired by a recent QOTW ... it shouldn't matter that the data are sorted asciibetically, just that they're sorted the same way each time. So you can do something like this:
      use List::Util qw(shuffle);

      # do this only once and save it to a file for later reading
      my $alphabet = join '', shuffle 'a'..'z';

      my $t;
      @union = map  { $_->[1] }
               sort { $a->[0] cmp $b->[0] }
               map  { eval "(\$t = lc \$_) =~ tr/$alphabet/a-z/"; [ $t, $_ ] }
               @union;
      Not sure if there's a win there or not, but it's interesting anyway :)
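
    A sketch of what that second idea might look like (hedged: the sample data, the record layout and the assumption that each result set arrives already sorted by prefix are all illustrative, not from the original code):

      use strict;
      use warnings;

      # Each result set is assumed already ordered by prefix, so a simple
      # k-way merge yields a sorted @union in one pass, with no sort call.
      my @result_sets = (
          [ [ '331',  5 ], [ '4412', 7 ] ],
          [ [ '4420', 3 ] ],
          [ [ '1',    9 ], [ '4412', 7 ] ],
      );

      my @union;
      my @pos = (0) x @result_sets;          # cursor into each sorted set
      while (1) {
          my ( $which, $rec );
          for my $s ( 0 .. $#result_sets ) {
              next if $pos[$s] > $#{ $result_sets[$s] };
              my $cand = $result_sets[$s][ $pos[$s] ];
              ( $which, $rec ) = ( $s, $cand )
                  if !defined $rec or $cand->[0] lt $rec->[0];
          }
          last unless defined $rec;          # every set exhausted
          push @union, $rec;
          $pos[$which]++;
      }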

      Hmm. Some interesting thoughts here. I can't use the queries as the key, because part of the problem is that the results of different queries can be the same. Although it's only implied, these are lists of Prefix => Zone assignments for customers, so two customers may have the same assignments. Instead of merging the lists (which isn't that terrible an idea, BTW), I think I might actually rework the way the queries are executed so I can do the union of the four in a single query and let the DB sort it. The last idea is certainly an interesting approach to avoiding the shuffle.

      Thanks for the brainfood. :-)


      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi


        In general, if you can let the database do the sorting/merging of the four queries it is likely to be faster (after all, that's what a database server is optimized to do!).

        There may of course be situations where trying to get the database to do the right thing requires too much shoe-horning of the data to be worth it...

        Michael

Re: Bitten by the worst case (or why it pays to know whats inside the black box)
by tilly (Archbishop) on Jun 27, 2004 at 02:13 UTC
    Life is about trade-offs.

    Thinking of things as black boxes makes it easier to bring someone to the point where they can be productive, and makes development far easier.

    Knowing what is inside the box allows you to solve obscure problems when they arise.

    The vast majority of the time, thinking in terms of black boxes saves you time and energy. I agree that it can be worth it to learn how the box works a bit, though, because when it goes wrong it can be very useful to know that. But the payoff doesn't come that frequently, and often isn't all that large. Besides, as long as someone on your team knows, and you know who to delegate to when confused, the effect isn't all that different from actually knowing it yourself. (Except that the other person gets the credit. Also ignore this if you don't work in a team.)

    Perhaps what I'm saying is that while in an ideal world we would all know everything about every topic, in the real one we can't expect to. So learn what you can, make black boxes of many things, and fill in some of your ignorance as time and opportunity present themselves. But accept that there will be ignorance.

      Obviously there is a point where the cost of looking into the black box exceeds the benefit of doing so. If you rarely use the black box, then there is little if any point in investigating how it works, and as you say this is all the more true if you have access to people who do understand its innards when you get stuck on some bizarre boundary issue.

      However, when one uses a tool regularly, I believe the benefit of learning the basics of how it works far exceeds the cost. Databases are a good example. Most programmers end up having to use some form of relational database regularly, yet I think a lot refuse to learn even the rudiments of how it operates. This to me is just plain silly. It's not just a matter of dealing with obscure problems, but of a more holistic approach: understanding a bit of how referential integrity and indexes work can and should be a guiding force in how you design queries, table relationships, and so on. Consider that fully normalizing a database is good for RI purposes but may be very bad for performance, as it will typically mean far more joins to get the same data. And since a join's cost increases with the size of the data involved, this can be a very bad design decision despite its academic idealness.

      A different example would be filesystems on your favourite operating system. A programmer will be using and interacting with the filesystem in just about every program they write. Understanding the basics of how a filesystem works enables the programmer to make basic design decisions that pay off big time in the long term. For instance, consider the three-level directory structure in CPAN. Originally this structure was not employed; had the programmers considered the scaling potential of CPAN in conjunction with the scaling properties of a flat directory structure, they would have avoided the considerable effort required to restructure the data midway. I'm fairly sure the programmers knew both points, but I suspect the black-box mentality was at least partly behind them not making the connection between the two.

      So I guess my point is that if you use a black box regularly, you need to apply some kind of multiplier to the immediate benefit you gain from the investigation before weighing it against the cost. Doing your homework earlier will help you avoid costly and time-consuming changes later.


      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi


Re: Bitten by the worst case (or why it pays to know whats inside the black box)
by bluto (Curate) on Jun 26, 2004 at 15:44 UTC
    And that knowledge is very likely to pay off in spades one day.

    Agreed. "Tranparency or performance, pick one" only bothers me occasionally since I don't usually care about performace much anymore. When it does bother me, it bites hard. Back when I was using C a lot I found I would try to think ahead about potential problems like this. Now that I use Perl, I find I get lulled by the "good laziness" it fosters and start incorporating some of my own "bad laziness" (i.e. not thinking at all)...

Re: Bitten by the worst case (or why it pays to know whats inside the black box)
by l3nz (Friar) on Jun 27, 2004 at 09:14 UTC
    Hello, first of all, my congratulations on figuring it out. That's very good, and you're right: people tend to use components or structures without understanding them at all.

    The only thing that seems suspicious to me is that you have to use an 8000-value key to query your cache. Are you sure this is the only way to go? Maybe there is a shorter input query with fewer independent parameters you could use as a cache key. You could have a rough hit and then check afterwards whether the entry is appropriate or not.
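
    Something along those lines might look like this (purely illustrative: the customer id as the short key and the stand-in structure builder are made up, not from the original code):

    use strict;
    use warnings;

    my %cache;

    # Rough-hit scheme: key the cache on something short and cheap, keep
    # the full join-key alongside the cached structure, and confirm it on
    # a hit before reusing the data.
    sub fetch_structures {
        my ( $customer_id, $union ) = @_;    # $union: [ [ prefix, pricing_id ], ... ]
        my $full_key = join ';', map { "$_->[0]=$_->[1]" } @$union;

        my $entry = $cache{$customer_id};
        unless ( $entry && $entry->{full_key} eq $full_key ) {
            $entry = $cache{$customer_id} = {
                full_key => $full_key,
                data     => { records => [@$union] },   # stand-in for the real build
            };
        }
        return $entry->{data};
    }

    my $data = fetch_structures( 42, [ [ '4412', 7 ], [ '4420', 3 ] ] );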

    Also, since you're trading against a complex and costly structure-building process, you could probably store the cached results on disk or in a database and read them back when needed. That way you don't use RAM for them, and you could have a very simple cron job purging stale cache entries.

    From what I read, the process itself might be taking a couple of seconds, as you say it was 10x worse when sorting. I don't think reading a cached entry from disk should take more than 0.2 seconds on average (and likely much less), so that's still a 10x improvement without the risks and hassles of exhausting physical RAM or finding yourself in swap space.

    Just my $0.02

      Actually you may have a point regarding the key. However, I'm not so sure about the on-disk versus in-RAM issue. I'm on a box with a lot of RAM, and virtually all of the complex data structures will be reused several times, so my thinking is that I will just let the OS decide whether the data should be swapped out or not. Only one of the large structures is used at a time, and it is used for a reasonable amount of time, so I suspect that avoiding swapping isn't a win at all, and possibly even a loss. BTW, the 8000x2 data elements end up being about 100k+ of RAM, but even there it's more the construction time than the memory that I'm worried about; it takes a while to build that 100k structure. (I've actually contemplated rewriting Tie::Array::PackedC in C, or recoding not to use it at all, to reduce the run time, but at this point I'm satisfied with the overall performance.)

      Anyway, more brainfood to digest. Thanks.


      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi


Re: Bitten by the worst case (or why it pays to know whats inside the black box)
by dragonchild (Archbishop) on Jun 28, 2004 at 03:22 UTC
    Excellent post! I do have a semantic quibble, though . . .

    Anyway, I hope this posting motivates some to inform themselves of the principles by which their regular tools do their jobs.

    In my mind, if you use something more than occasionally, you should have a good idea of its strengths and weaknesses. This, however, has nothing to do with cracking open the black box. It has to do with knowing the specs of the black box. That's just doing your homework.

    The perfect example, in my mind, is a motherboard. I have an extremely basic, almost monkey-level, understanding of the circuitry my livelihood is based on. But, I do know how to read specs. So, I know what machine makes a better DB vs. web server. Yet, I can't draw a logic gate to save my life.

    Another example would be a choice of webserver - Apache vs. IIS. There are very valid reasons to pick one over the other. Yet, I have never looked at a line of source code for either. In fact, I don't know .NET or ASP beyond a week's worth of playing around. But, I can confidently recommend IIS over Apache, when it's warranted - because I know the specs.

    Buried deep in the docs for 5.6.x is a statement saying that Perl's sort is implemented over qsort(). Back in CS 210, we learned that qsort() is pathological over nearly-sorted lists. That's just knowing the specs. I'm pretty sure you didn't read the code for the implementation of sort and work out how the box does things; you read the specs.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      How do you feel about this example of cracking open the black box?


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
        Again semantics, but I would argue that the only time you ever need to crack open the black box is when you feel that the black box's specifications aren't what you need. File::Temp has a certain set of CPU and RAM specifications along with its feature specifications. tachyon obviously felt that the hardware specs for File::Temp were too high for his needs. So, cracking open the black box to find out how it works is appropriate.

        My point is that figuring out how the black box does its work is different from analyzing the low-level specs for the black box. We should all, as we improve in skill, have a greater understanding for the low-level specs. I still argue that I don't need to read the source in order to understand how bucket brigades work in Apache.

        ------
        We are the carpenters and bricklayers of the Information Age.

        Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

        I shouldn't have to say this, but any code, unless otherwise stated, is untested
