http://qs1969.pair.com?node_id=1112666


in reply to HELP: Does anyone here routinely, deliberately, overcommit their servers? Or know anyone who does? Or heard tell of same? Anyone?
in thread RAM: It isn't free . . .

The working-set sizes are consistently smaller than the virtual-memory allotment that the applications need to be able to request.
You live in a very peculiar world...

In my experience, the situation I quoted is almost always the case, and it is extremely rare for it not to be. It is far from "peculiar".

But I would agree that operating hosts in ways that make heavy use of this fact used to be the rule, whereas now it is more likely to be something carefully (even obsessively) avoided, especially on Production hosts.

That said, I read that making heavy use of this type of memory over-commitment (virtual-memory allotment vs. working-set size) is the rule in cloud architectures. In the cloud (or other systems of many virtual machines), though, the term "over-commit" is usually reserved for when the accumulated (desired) working sets exceed the total physical memory, rather than for when the total virtual-memory allocations exceed it.
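The distinction between those two definitions can be made concrete with a toy calculation (all numbers here are hypothetical, purely to illustrate the two thresholds):

```python
# Hypothetical per-application numbers, in GiB, for one host.
# "allotment" is the virtual memory the app requests; "working_set"
# is what it actually touches at peak.
apps = {
    "web":   {"allotment": 16, "working_set": 6},
    "db":    {"allotment": 32, "working_set": 20},
    "cache": {"allotment": 24, "working_set": 10},
}
physical_ram = 48  # GiB on the host

total_allotment = sum(a["allotment"] for a in apps.values())      # 72
total_working_set = sum(a["working_set"] for a in apps.values())  # 36

# Over-committed by the allocation-based definition (72 > 48)...
allocation_overcommit = total_allotment > physical_ram
# ...but not by the working-set definition usually meant in cloud
# discussions (36 <= 48).
working_set_overcommit = total_working_set > physical_ram
```

A host in this state runs fine day to day; it only gets into trouble if the working sets grow toward the allotments.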

But that isn't always the case. In our Production VM racks we try to discourage even the strict (allocation-based) form of over-commit. In some ways, avoiding it matters more there than it did for racks of separate physical hosts.

A large collection of virtual machines working together as an organization's Production environment should be carefully built so that each VM has enough resources to handle not just the peak load, but some percentage more than that. That way, an unprecedented peak in traffic or (more likely) some load-increasing problem that hits during peak is very unlikely to overload the system. At worst, the system should still be able to shed or refuse enough load that the requests it doesn't reject complete satisfactorily, while the rest "fail fast" (after a reasonable number of retries, probably at several layers), hopefully being registered for another retry significantly later.
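The "peak plus headroom, then shed the rest" approach can be sketched as a tiny admission-control rule (the headroom fraction and request numbers are assumptions for illustration, not a recommendation):

```python
# Toy admission-control sketch: size the VM for its observed peak
# plus a headroom margin, and shed load beyond that hard cap rather
# than let every in-flight request degrade.
PEAK = 100              # concurrent requests handled satisfactorily at peak
HEADROOM = 0.25         # extra margin beyond the observed peak (assumed)
LIMIT = int(PEAK * (1 + HEADROOM))  # hard cap: 125

def admit(in_flight):
    """Accept a request only while under the hard cap; beyond it,
    refuse immediately so the accepted requests still complete,
    while the rejected ones "fail fast" and can retry later."""
    return in_flight < LIMIT
```

The point of the cap is that rejecting the 126th request cheaply is far better than letting all 126 grind along and time out together.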

But those problems at peak often cause many virtual machines to use significantly more resources than they normally do at peak. If your VMs have not been configured with low enough hard limits on resources (memory size, number/speed of CPU cores), or your VM hosting platform is over-committed relative to the sum of those hard limits (add in I/O throughput, a resource that I don't think VM platforms partition well), then you risk the whole VM platform becoming significantly degraded. Now your 20 web servers having the problem, and the 30 related servers suffering knock-on effects, are also impacting many of the servers you need in order to actually fix the problem, more than they would if you used separate physical hosts.
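One sanity check that follows from this: even if every guest hits its hard limit at once, the host should still fit. A minimal sketch (hypothetical VM names and limits, not any particular platform's API):

```python
# Hypothetical per-VM hard memory limits (GiB) on one hosting platform.
vm_hard_limits = {"web01": 8, "web02": 8, "db01": 32, "util01": 4}
host_physical_gib = 64

committed = sum(vm_hard_limits.values())  # 52

# Not over-committed against the hard limits: even if every VM runs
# up to its cap simultaneously, the host itself cannot be driven
# into the ground, so the platform (and the tooling you need to fix
# the problem) stays responsive.
safe = committed <= host_physical_gib
```

The same check would apply per resource (CPU cores, and I/O throughput where the platform can cap it at all).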

Our Production VM hosting platforms have the majority of their resources idle, even at peak.

A real cloud, by contrast, has enough diversity that a significant percentage of its virtual machines going into overload at the same time is extremely unlikely, so unlikely that the expense of strictly avoiding over-commit would make the cloud's pricing uncompetitive.

- tye