After recently getting involved in the always-new-but-ancient recurring thread about compiled languages and efficiency, I thought I'd share an example of how the first-pass guess at 'efficient programming' often isn't.

I have a database (MySQL) with five tables of up to 1.2M rows each. The entire db is just over 1.2 GB on the filesystem. This particular mule has MySQL, Apache, and Mozilla all running on the same FreeBSD 4 platform with 256 MB of RAM. The deployment system will have more headroom, but it's still important to be careful, because swapping out the httpd or mysqld processes would be unacceptable.

The goal is to build (as a background task) a file which will contain the select box options for dropdown menus for the various tables and web pages.

My first approach was to build a single query that would return an array reference containing all ten fields, which I would then process separately. MySQL crashed on the second table, telling me it needed 512 MB of RAM to process the query.

Okay, fine. My next pass at a solution had me selecting for one field at a time, but it still crashed about halfway through the third table.

top(1) is your friend on a BSD system, and my good friend showed me that both my swap usage and my Perl process size were growing.

My next pass had me using system() to run a separate process for each table, but this still slowed to a crawl when it got near the live RAM limit. My program was spending most of its time in the swread state, pushing stuff back and forth to swapspace.

My final pass has me separating each MySQL query into a separate perl process spawned by system(). (Yes, I could have fork()ed, but this is easier and more maintainable.) This seems counterintuitive, but it actually is much easier on the system, and the overhead of process management is negligible in the big picture.
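A minimal sketch of that final structure, assuming a hypothetical helper script named dump_options.pl that runs the query for one table; the stand-in one-liner below replaces it so the sketch runs on its own:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parent loop: one short-lived child per table, so each query's memory
# is returned to the OS when that child exits.
my @tables = qw(customers orders products vendors invoices);  # hypothetical names

for my $table (@tables) {
    # The real call would be something like:
    #   system($^X, 'dump_options.pl', $table);
    # where dump_options.pl (hypothetical) runs the SELECT for one table
    # and writes its option list to a file. The stand-in below just
    # exercises the spawning pattern; $^X is the current perl binary.
    my $status = system($^X, '-e', 'exit 0');
    die "child for $table failed: $?" unless $status == 0;
}
print "all children finished\n";
```

The parent stays tiny the whole time; only the children ever grow, and they give everything back when they exit.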

Why did my memory usage go over the top? Even when you undefine variables, the memory used is not returned to the operating system until the process terminates. Thus, a solution with many small queries and processes runs much faster and is much more "ecologically" sound than a solution which would be faster in a "perfect" world with infinite RAM. :D

Re: The Long Way 'Round...
by samtregar (Abbot) on May 02, 2005 at 20:59 UTC
    My final pass has me separating each MySQL query into a separate perl process spawned by system(). (Yes, I could have fork()ed, but this is easier and more maintainable.)

    It might be easier (if you don't grok fork()) but I doubt it's more maintainable. With system() you have to keep track of the path to the programs to run in sub-processes. With fork() everything stays in one script and you don't need to track a dependency.

    Krang uses fork() to deal with a similar problem - memory usage in large publishing jobs. It's definitely a technique that every *nix hacker should have in his toolbox.
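    For comparison, a minimal fork()-based sketch of the same one-child-per-table pattern (table names hypothetical; the per-table query is elided):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @tables = qw(customers orders products vendors invoices);  # hypothetical

for my $table (@tables) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: run the query for $table and write its options file
        # here (elided), then exit so its memory goes back to the OS.
        exit 0;
    }
    waitpid($pid, 0);    # run the children one at a time
    die "child for $table exited non-zero" if $? != 0;
}
```

    Everything stays in one script, and the children inherit the parent's environment for free.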

    -sam

Re: The Long Way 'Round...
by Anonymous Monk on May 02, 2005 at 21:20 UTC
    Even when you undefine variables, the memory used is not returned to the operating system until the process terminates
    I'm not a FreeBSD user (so this may or may not be true), but usually the OS can reclaim memory that's munmap'ed. Of course you'd also need a special version of perl that used mmap instead of malloc, and each variable would take a VM page of storage (4k on x86), but that's another story.

      I believe if you use the system malloc(3) on FreeBSD rather than Perl's then memory does get returned to the OS when free(3) gets called.

      Update: Well, it's not working on my FreeBSD 5.0 box but it does on a Tiger box; both have usemymalloc='n'. Strange . . .

Re: The Long Way 'Round...
by Anonymous Monk on May 02, 2005 at 21:00 UTC
    Why did my memory usage go over the top?
    Because of a race condition in MySQL on BSD.