Re^2: RFC: Abusing "virtual" memory (failure) by tye (Sage)
on Nov 27, 2007 at 05:35 UTC
When a machine starts using virtual memory, its performance drops substantially.
Either the machine is using virtual memory from before it finished booting or it is running in "real mode" and thus probably not running a modern, multi-user operating system. That may sound like nit-picking since you probably meant something more like "using paging space", but even that doesn't make much sense to me (and the rest of what you said makes the distinction important beyond a nit-picking correction). "Making heavy use of paging space" would be more accurate, or just "paging heavily". But, yes, running out of physical memory can make a system extremely inefficient: it has a hard time getting much of anything done and even a hard time recovering. I remember our ancient VMS system had a kernel trap for this: if it was spending more time futzing around trying to figure out how to get the next thing done than it was spending actually doing things, it would just give up, flush buffers, and reboot.
Therefore if you have an expensive server, think about getting extra RAM and disabling swap. Or even just disabling swap.
I've never seen that option. I guess it makes sense for it to exist given that something like Linux is able to run on tiny systems lacking an appropriate resource to hold paging space.
By contrast if that machine had no virtual memory,
I think you mean "had no paging space" (a.k.a. "swap space", but I try not to say "swap" when I mean "paging", and "swap space" is mostly used for paging, not for swapping entire processes out of memory). A modern multi-user operating system, with its protections, is based around virtual memory, so I doubt you are running without virtual memory, just with virtual memory that is not allowed to grow larger than the size of physical memory.
the failures are much more obvious. Plus there is a good chance that the offending memory hog will die fast, and the server is likely to be able to continue doing everything else it is supposed to do.
My experience is that running out of virtual memory means that there will likely be something somewhere other than the "one hog" that runs into a case of malloc() or realloc() failing. And my experience is that it is extremely rare for code to be written to deal well with malloc() or realloc() failing. So if we have a system that has run out of virtual memory, then we schedule it for a reboot ASAP. Often, the corruption caused by the virtual memory exhaustion isn't obvious in the short term so most often the system appears to continue on rather normally. But in the cases when the reboot wasn't done, eventually something about the system became flaky.
AIX had an interesting take on this problem. It would notice that it was coming close to running out of memory and so would pick a process to kill based on some heuristics that I don't recall ever seeing documented. (It also didn't care how much virtual memory a process allocated, just how much virtual memory the process used, thus malloc() would never fail.) My experience was that AIX's heuristics in this area almost always picked the wrong victim. I don't know if that is an indication of the problem being significantly harder than it might at first appear or if it is just IBM implementing something stupid (likely a combination).
Sad to say, but it is too easy to write C code that doesn't bother to check whether malloc() returned NULL. So having one or more processes have an internal failure at random, some of them silently, sounds much worse to me than AIX's idea of having one process die obviously and in a relatively controlled manner. And I recall AIX's solution not being well liked.
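To make the malloc()/realloc() point concrete, here is a minimal sketch of what "dealing well" with a failed allocation can look like. The `struct buf` type and the `buf_append()` / `buf_free()` names are hypothetical, made up for this illustration. The idea is only that the result of realloc() is checked before the old pointer is overwritten, so a failure leaves the data intact and is reported to the caller. Note that on a system that overcommits the way AIX is described above, the allocator may never report failure, so this check helps only where it does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer, used only for illustration. */
struct buf {
    char  *data;
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes allocated */
};

/* Append n bytes from src to b.
 * Returns 0 on success, -1 if the memory could not be obtained.
 * On failure b is left unchanged and still valid. */
int buf_append(struct buf *b, const char *src, size_t n)
{
    assert(b != NULL);               /* caller bug, not a runtime condition */

    if (n == 0)
        return 0;
    if (n > SIZE_MAX - b->len)       /* requested size would overflow */
        return -1;

    size_t need = b->len + n;
    if (need > b->cap) {
        size_t new_cap = b->cap ? b->cap : 64;
        while (new_cap < need) {
            if (new_cap > SIZE_MAX / 2) {   /* doubling would overflow */
                new_cap = need;
                break;
            }
            new_cap *= 2;
        }
        /* Use a temporary: writing realloc()'s result straight into
         * b->data would lose (and leak) the old block on failure. */
        char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap  = new_cap;
    }

    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Release the buffer and reset it to the empty state. */
void buf_free(struct buf *b)
{
    free(b->data);
    b->data = NULL;
    b->len  = 0;
    b->cap  = 0;
}
```

A caller would start from `struct buf b = {0};`, test every `buf_append()` return value, and decide for itself whether to retry, shed load, or exit cleanly. That last decision is the hard part, and it is exactly what most code never gets around to.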
So I advise caution to anyone planning on taking your advice.
Unfortunately, I have not seen a magic bullet for dealing with mis-behaved memory hogs. And my experience says that this approach isn't a magic bullet, either. Our arsenal of weapons against this problem is comprised of such various and diverse elements as testing, load boxing, monitoring, post mortem analysis, isolation, ... The problem gets really hard when your "production systems" are the ones used by a bunch of users and programmers (the best single tool I've seen from the side-lines there is having a daemon that suspends any process that makes a nuisance of itself and notifies the "watchers" for intervention).
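For what it's worth, the core of such a suspending daemon can be sketched in a few lines. This is not the tool I saw, just an illustration under stated assumptions: the `proc_sample` struct, the `select_hogs()` and `suspend_and_report()` names, and the choice of resident memory over a fixed limit as the definition of "nuisance" are all hypothetical, and gathering the samples and notifying the watchers are reduced to a caller-supplied array and a message on stderr. The only real system call used is POSIX kill() with SIGSTOP, which suspends a process without destroying it.

```c
#define _POSIX_C_SOURCE 200809L   /* for kill() under strict compilers */

#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* One measurement of a process. How it is gathered is left out here. */
struct proc_sample {
    pid_t pid;
    long  rss_kb;   /* resident memory in KiB (assumed nuisance metric) */
};

/* Copy the PIDs of all samples above limit_kb into out, writing at most
 * max entries. Returns the number of PIDs written. */
size_t select_hogs(const struct proc_sample *s, size_t n, long limit_kb,
                   pid_t *out, size_t max)
{
    assert(out != NULL);

    size_t found = 0;
    for (size_t i = 0; i < n && found < max; i++) {
        if (s[i].rss_kb > limit_kb)
            out[found++] = s[i].pid;
    }
    return found;
}

/* Suspend (not kill) each PID and report it so a human can intervene.
 * Returns the number of processes actually stopped. */
size_t suspend_and_report(const pid_t *pids, size_t n)
{
    size_t stopped = 0;
    for (size_t i = 0; i < n; i++) {
        if (kill(pids[i], SIGSTOP) == 0) {
            fprintf(stderr, "suspended pid %ld; send SIGCONT to resume\n",
                    (long)pids[i]);
            stopped++;
        } else {
            perror("kill");
        }
    }
    return stopped;
}
```

Suspending rather than killing is the point of the design: the process keeps its state, a watcher can inspect it, and a SIGCONT lets it carry on if it turns out to have been the wrong victim.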