Recently I tried to get Apache::SizeLimit to work on some FreeBSD-based web servers that some of you might be familiar with. It turned into a humorously frustrating safari (as too often happens) and I'd like to share it.

The point of Apache::SizeLimit is to get your mod_perl (for example) processes to exit after servicing a request that leaves them consuming too much memory, thus preventing a shortage of real ("physical") memory1 that would cause your whole server to slow down.

1 As opposed to virtual memory, which is usually mostly limited, in aggregate, by swap space.

Unlike what I just did above, my experience is that the vast majority of places that talk about memory fail to make clear the distinction between physical memory and virtual memory, much less make clear which sub-measure of these two broad categories they are talking about.

For example, quoting the Apache::SizeLimit documentation:

This module allows you to kill off Apache httpd processes if they grow too large. You can make the decision to kill a process based on its overall size, by setting a minimum limit on shared memory, or a maximum on unshared memory.

My immediate reaction was "Ah, once again someone who doesn't bother to say whether they are talking about physical memory or virtual memory." Well, shortly after that paragraph there was a disclaimer about it being "highly platform dependent" so I went to the referenced section to see if, for my platform (FreeBSD), I could get a clearer answer. The answer for my platform was, in total, thus:

Uses BSD::Resource::getrusage() to determine process size. This is pretty efficient (a lot more efficient than reading it from the /proc fs anyway).

Okay, a quick look at BSD::Resource::getrusage() showed me that it returns a lot of numbers, that quite a few of them have to do with memory usage, and that none of them appear to match what was described in the first paragraph I quoted. I was a bit annoyed that the Apache::SizeLimit documentation didn't deign to document which of getrusage()'s values it used. So I went digging for the relevant source code.
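If you want to poke at these values yourself, Python's standard resource module exposes the same struct rusage fields that BSD::Resource::getrusage() returns as a list in Perl. A sketch using that analogue (not the Perl module discussed here):

```python
import resource

# The rusage fields relevant to memory; their meanings and units are
# platform-dependent, which is exactly the trouble described above.
ru = resource.getrusage(resource.RUSAGE_SELF)
for name in ("ru_utime", "ru_stime", "ru_maxrss",
             "ru_ixrss", "ru_idrss", "ru_isrss"):
    print(name, getattr(ru, name))
```

On many platforms ru_ixrss, ru_idrss, and ru_isrss will simply be zero, which is another hint that these are not portable measures.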

The copy of SizeLimit.pm installed in the web servers in question had:

    sub bsd_size_check {
        return (&BSD::Resource::getrusage())[2,3];
    }
    # ...
    $HOW_BIG_IS_IT = \&bsd_size_check;
    # ...
    my($size, $share) = &$HOW_BIG_IS_IT();
    if (($MAX_PROCESS_SIZE && $size > $MAX_PROCESS_SIZE)
        || ($MIN_SHARE_SIZE && $share < $MIN_SHARE_SIZE)
        || ($MAX_UNSHARED_SIZE && ($size - $share) > $MAX_UNSHARED_SIZE))
    {

(not in that order, though).

So BSD::Resource reports that

2 maxrss
maximum shared memory or current resident set
3 ixrss
integral shared memory

were being used (the first as "total process size" and the second for "shared size"). The "or" was less than enlightening and the short, simple descriptions made me suspect there was a good chance that something more complex was hiding here.

So I still didn't have much confidence in coming up with reasonable values to give to Apache::SizeLimit (I had ideas about what to set a maximum size to according to 'top' output, which I'd reviewed a ton of over the previous months). So I entered the expression getrusage() into a restricted-access page on the web server so I could see some real values that it returns for the actual processes that I was interested in. I compared those values to the output from /usr/bin/top and found that none of them matched the process size, which is what I wanted to set a limit for. The first number was quite a bit too small, while the second looked like it was in bytes, not kilobytes, but still didn't match. It certainly looked inappropriate to do the simple subtraction from the code also quoted above (because it resulted in a large, negative number):

    ($MAX_UNSHARED_SIZE && ($size - $share) > $MAX_UNSHARED_SIZE))

Next, I followed the BSD::Resource documentation's advice:

For a detailed description about the values returned by getrusage() please consult your usual C programming documentation about getrusage()

Googling for man getrusage and selecting the first link to mention BSD got me:

ru_maxrss
the maximum resident set size utilized (in kilobytes).
ru_ixrss
an integral value indicating the amount of memory used by the text segment that was also shared among other processes. This value is expressed in units of kilobytes * ticks-of-execution.
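To illustrate the unit mismatch with made-up numbers: ru_maxrss is a plain size in kilobytes, while ru_ixrss is an integral in kilobytes * ticks, so it has to be divided by the number of ticks executed before it means anything as a size. A sketch (all values hypothetical):

```python
# ru_maxrss is plain kilobytes; ru_ixrss accumulates kilobytes * ticks.
# Subtracting one from the other mixes units. To turn the integral into
# an average size in kB, divide by the ticks executed.
ixrss_kBt = 512_000       # hypothetical integral shared size, kB * ticks
cpu_seconds = 40.0        # hypothetical user + system CPU time consumed
ticks_per_second = 128    # FreeBSD's tick rate, per /usr/include
ticks = cpu_seconds * ticks_per_second
avg_shared_kB = ixrss_kBt / ticks
print(avg_shared_kB)      # → 100.0 kB average shared size
```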

Since the two values are said to be measured in quite different units, that explained some of why the simple subtraction in:

    ($MAX_UNSHARED_SIZE && ($size - $share) > $MAX_UNSHARED_SIZE))

was inappropriate.

I found the units "kilobytes * ticks-of-execution" curious enough that I expected to find some more explanation of what that really meant. The description fairly clearly identified which kilobytes were being measured but made no mention of which "ticks of execution" were being multiplied in. I think the formula for deriving the appropriate "ticks of execution" value should have been included. Frankly, I was stumped at what that value might be.

Unfortunately, despite searching for and finding quite a few different places that documented these values, I never found one that expanded on that explanation.

So I switched to searching for code that made use of ru_ixrss, to figure out how to use it properly. The first thing that I noticed was that most people made no use of it other than reporting it verbatim for the viewer to try to interpret.

I did eventually find several uses of it, most of which involved adding the user CPU time to the system CPU time, multiplying by 100, and dividing the result into ru_ixrss. Trying that on my running httpd processes did not produce any results that I was able to recognize or line up with anything reported by 'top' or 'ps'. So I was still unsure.

Well, I could at least set a limit on "total process size" and see how that worked out, so I had done that earlier. It turned out not to be working too well, however: too many httpd processes were becoming too large, causing too much paging and even quite a bit of swapping, and so the web servers and web pages were slow to respond. (Paging is where parts of a process are moved out of physical memory so that they reside only in a swap file, while swapping is where an entire process has all of its pages removed from physical memory -- though these terms are often interchanged.)

Eventually this all sank in and I realized what both of those numbers from getrusage() meant.

The first number is the number of kB of physical memory that is currently allocated to the process. This makes it rather inappropriate for use for something like Apache::SizeLimit.

Both 'ps' and 'top' report the virtual (total) size of each process followed by how much of that size is resident in physical memory ('ps' calls them VSZ and RSS while 'top' calls them SIZE and RES). At any particular moment, usually some fraction of the pages of virtual memory that make up a process are not stored in any physical page of memory. Having about 1/2 of the pages "resident" (in physical memory) is fairly typical, for example.

As the total size of all of the virtual memory for all processes (and the kernel) grows to be much more than the amount of physical memory installed in the computer, the moving of less-recently-used pages of memory out of physical memory becomes more aggressive. So most processes have a smaller percentage of their virtual pages in physical memory and they are more likely to try to use a page and find it not resident. That causes a page fault which blocks the process from doing work until the kernel can find a free page of physical memory and read the appropriate data into it (usually from the swap space though sometimes from some other "backing storage" such as the executable where that page was originally loaded from).

So if all of your processes are growing too large (in virtual memory size), then the system spends more time servicing page faults and so runs less efficiently. It also means that a smaller part of each process's virtual memory is resident in physical memory.

So (on FreeBSD) the first number used by Apache::SizeLimit (as "total process size") is only the size of the part that is resident in physical memory. So, as the system gets overloaded because all of your httpd processes are slowly using more and more (virtual) memory over time, the percentage of those growing (virtual) sizes that is resident shrinks correspondingly. So the resident size doesn't increase as much and may even shrink. The resident size is therefore the wrong place to look to determine whether your processes are getting too big: the processes getting too big will just push the resident sizes relatively smaller.

The second number is [and two others from getrusage() are] useful for determining the average amount of (different types of) virtual memory used, over a given period. It appears that the kernel keeps a simple count of kB of shared pages and adds this to ru_ixrss each time the process finishes one time slice of execution.

So the numbers I computed earlier that didn't match anything were actually part of the total (virtual) process size, but they didn't match for two reasons. First, they were only the shared part. Second, they were the average size over the life of the process.

I actually used two other sizes from getrusage() in my computations:

ru_idrss
an integral value of the amount of unshared memory residing in the data segment of a process (expressed in units of kilobytes * ticks-of-execution).
ru_isrss
an integral value of the amount of unshared memory residing in the stack segment of a process (expressed in units of kilobytes * ticks-of-execution).

and even added those together and still got a number much smaller than the process size from 'top' and 'ps'. But adding all three should get me the average size of the process over its life. So it finally makes sense why it was about 1/2 of the current total process size (if the growth in size is close to linear, then the average size will be half the current size).
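A quick simulation (with hypothetical numbers) of why the integral-derived average for a linearly growing process comes out at about half its current size:

```python
# A process grows linearly from 0 to 100 (units don't matter) over
# 1000 ticks; each tick, the current size is added to an integral,
# the way the kernel accumulates ru_ixrss/ru_idrss/ru_isrss.
ticks = 1000
final_size = 100.0
integral = sum(final_size * t / ticks for t in range(1, ticks + 1))
average = integral / ticks
print(average)  # → 50.05, i.e. about half the final size
```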

So even though the kernel appears to have efficient access to the current value for these sizes, getrusage() doesn't give us those. In order to get something close to those from getrusage(), we need to call it twice. And we have to separate these two calls far enough so that at least one clock tick of CPU is used by the process. Then we can do some subtraction and division and get average sizes over that (short) period and have something close to the current process size (at the start / end of that period).

So Apache::SizeLimit is using two wrong values. In writing this node, I checked the latest version's source code and found this rather funny:

    # rss is in KB but ixrss is in BYTES.
    # This is true on at least FreeBSD, OpenBSD, & NetBSD - Phil Gollucci
    sub _bsd_size_check {
        my @results   = BSD::Resource::getrusage();
        my $max_rss   = $results[2];
        my $max_ixrss = int( $results[3] / 1024 );
        return ( $max_rss, $max_ixrss );
    }

That was a guess I tried as well, but I quickly realized that it wasn't right and did more digging.

BTW, most of the code that I found that actually came close to correctly using these values assumed that there are always 100 ticks each second. But /usr/include on the FreeBSD web servers showed that a clock tick on them is actually 1/128th of a second.
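Rather than hard-coding 100 (or 128), the tick rate can be queried at runtime. In Python, as an analogue of the C call sysconf(_SC_CLK_TCK), that looks like:

```python
import os

# Query the scheduler's ticks-per-second value instead of assuming
# the common-but-not-universal 100 (FreeBSD here uses 128).
clk_tck = os.sysconf("SC_CLK_TCK")
print(clk_tck, "ticks per second")
```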

So the correct way to get the values that Apache::SizeLimit wants is closer to:

    sub _bsd_size_check {
        # kB = kilobytes, kBt = kilobytes * ticks (of CPU):
        my( $userSecs, $sysSecs, $res_kB, $shared_kBt, $data_kBt, $stack_kBt )
            = BSD::Resource::getrusage();
        my @next;
        do {    # keep refetching until one clock tick passes
            @next = BSD::Resource::getrusage();
        } while( $data_kBt == $next[4] );
        # Compute the differences between the two results:
        for( $userSecs, $sysSecs, $res_kB, $shared_kBt, $data_kBt, $stack_kBt ) {
            $_ = shift(@next) - $_;
        }
        my $ticks       = ( $userSecs + $sysSecs ) * 128;
        my $shared_kB   = $shared_kBt / $ticks;
        my $unshared_kB = ( $data_kBt + $stack_kBt ) / $ticks;
        return( $unshared_kB + $shared_kB, $shared_kB );
    }

Except that experimenting with real FreeBSD processes shows that the "kBt" values get sizes added in discrete units while the CPU utilization numbers get values added in other than whole "ticks". So, when measuring the difference between such "close" snapshots, you just set $ticks to 1 even though the calculated number of ticks might be 0.01 or 1.38.

So this means that it can be quite hard to keep your calculation of the number of "ticks" from being off by one, which can make a big difference in your resulting numbers.

And even after all of that, I was still missing some small amount from the total process size. Not that it surprised me that the total process size was more than just the sum of those three rather specific measures.

Also, the above code is no longer that efficient, so it'd probably be better to rework it to call getrusage() at the start of servicing a web page request and then again at the end, when the process size is checked. (It usually took a few hundred iterations to get across a "tick" boundary in the above code, but once took only 7, for example.)

Which makes me want to use /proc or qx(ps -ovsz= -p$$) instead of the perverse getrusage(). But not even my administrative login has access to /proc on these web servers and it seems rather extreme to fork+exec to an external program just for a process to find out how big itself is.
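For what it's worth, the qx(ps -ovsz= -p$$) trick sketched in Python (purely illustrative; the fork+exec per check is exactly the cost that makes it seem extreme):

```python
import os
import subprocess

# Ask ps for this process's own virtual size in kB -- the same trick
# as qx(ps -ovsz= -p$$) in Perl, with the same fork+exec overhead.
out = subprocess.run(
    ["ps", "-o", "vsz=", "-p", str(os.getpid())],
    capture_output=True, text=True, check=True,
)
vsz_kB = int(out.stdout.strip())
print(vsz_kB, "kB virtual size")
```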

So I'll probably work with calling getrusage() at the start and end of each web page request. Though now I've got to learn the proper way to tell an Apache child to exit cleanly after finishing this request, since I won't be using Apache::SizeLimit, for some obvious reasons2.

2 I'm actually currently using Apache::SizeLimit to place a limit on the resident memory size. It turns out to be quite fragile in that going slightly higher than the current value allowed processes to grow much larger and cause lots of paging and swapping while a slightly lower value caused each httpd to exit after servicing hardly any web page requests. But at least it, along with a bunch of other recent changes, is keeping PerlMonks response times much more reasonable in the short term until I can replace it with something much more robust.

I'll probably also do a little bit of looking into where the kernel finds these simple sizes that it adds into the rusage structure every tick. Just accessing those directly should be just as efficient and would be tons less confusing and hugely less error-prone. But I suspect that road will end at the "not allowed" sign.

And I haven't even talked about how I threw up my hands and decided to instead have an external watchdog that would monitor 'top' output and just tell any httpd that was getting too big to just "finish your current request and then exit", how the Apache documentation was mute on that point, and then the task of looking into the Apache source code and running experiments to finally prove that such just isn't possible either3.

Heh, I was just watching 'top' output and noticed the SIZE and RES values for one particular process were slowly growing except every so often they would suddenly jump down a large amount and then jump back up. I think I may be witnessing 'top' doing its best to calculate the number of "ticks" and every so often calculating one too many ticks. I had actually scanned the 'top' source code in hopes of finding how to use getrusage() correctly or even to see where 'top' was getting process size information. I never did find it and I decided that 'top' likely didn't use getrusage() [since it was not looking at its own process size and since I saw notes about it needing /proc mounted], but it may well be (even indirectly) doing the same type of calculations. I find that a little bit sad... but funny.

- tye        

3 But it'd be nice if some future version of Apache would, instead of having children just ignore SIGUSR1 when in the middle of servicing a request, they would make note of having received a SIGUSR1 and just exit when they are done with their current request after that. Heck, that might even allow the Apache parent process code to become slightly simpler. But, to be clear, the important reason to do that is so that people can easily write external watchdogs that gracefully clean up Apache children.

Replies are listed 'Best First'.
Re: Memory, Apache::SizeLimit, BSD::Resource::getrusage(), and the long road
by clinton (Priest) on Jul 15, 2007 at 11:36 UTC

    Very interesting read, tye, and if you search the modperl list archives, you'll see that the question of how to read real memory usage has long lacked a satisfactory answer.

    You may take a look at Stas Bekman's article on Calculating Real Memory Usage, which tries hard to give an answer (via GTop) which is good enough, if not completely accurate, or fast.

    Though now I've got to learn the proper way to tell an Apache child to exit cleanly after finishing this request

    You can do this by calling $r->child_terminate(), which will cause the child to exit cleanly after finishing the request. If you were to implement this as an external watchdog, (and I'm not sure how signal handlers would work if the child is not running Perl code at the moment you send the signal), you could just add a process at the end of each request (or every N requests) which would check for a file apache/exit/$PID and call $r->child_terminate() at that point.

    Clint

Re: Memory, Apache::SizeLimit, BSD::Resource::getrusage(), and the long road
by perrin (Chancellor) on Jul 15, 2007 at 16:09 UTC

    As one of the people who has maintained Apache::SizeLimit over the years, I can tell you that BSD support has always been dependent on patches from BSD users. Periodically, someone would come along and say that there was something incorrect about it, and offer a suggestion or patch, like the one that Phil Gollucci put in to the most recent release. If you think you have a solution, the mod_perl devs would be glad to accept a patch.

    Linux support has been pretty solid, but we had an issue a couple of years back when we discovered our attempts to tell how much of the process was being shared by copy-on-write did not work. We discovered that Linux was not making that information available and the differences we saw were accounted for by other things. (That, by the way, is what the docs are talking about when they refer to shared and unshared memory.) A recent change on the Linux side allowed us to correct this, with some cost to performance.

    It's definitely tricky to get the right settings. If you tune it so that apache won't start swapping under heavy load, that can lead to perverse behavior under light loads, when processes exit without needing to. My usual approach is to tune MaxClients lower and let the processes get bigger. I figure I'm trading some concurrency for improved speed in the lower number of children (because they won't be spawning new procs all the time).

    Your idea of using an external watchdog should work. Send a SIGUSR2 and make a perl handler for it that calls $r->child_terminate.

Re: Memory, Apache::SizeLimit, BSD::Resource::getrusage(), and the long road
by almut (Canon) on Jul 15, 2007 at 18:12 UTC

    Thank you VERY much for digging into this, tye!

    (Yes, I know, this node of mine doesn't add anything of technical value, but I felt it just needed to be said...)