Upon profiling the memory usage of tasks on a cluster, it can be seen that
there are two distinct types of tasks: well behaved and poorly behaved.
Well behaved tasks consume nearly all of the resources which they request.
Such applications generally have large caches or memory mapped files.
Further, they generally account for the total amount of memory they have
allocated to ensure they do not allocate more than they have reserved. It
may be the case that such a process allocates a large block of memory and
then proceeds to waste it internally, but this is a degenerate case which I
will not deal with, as it is not a problem with the cluster but rather a
problem with the application author. Figure 3 depicts a process that is
relatively well behaved. The average memory usage is approximately 1100MB
and the maximum usage is approximately 1350MB (a difference of only 250MB).
This is all about 5% below the allocation (shown as peak in the graph). As
can be seen from the 90th percentile line, memory spikes generally stay
well below the maximum and the peak.
Figure 4 depicts a poorly behaved task where the average memory usage
(1375MB) is well below the maximum (2000MB). Even though the 90th percentile
memory usage is only around 1500MB, the task clearly has memory usage spikes
which approach the allocation of 2000MB. Specifically, this task wastes
about 500MB of memory on average.
Clearly, if all tasks were well behaved we could schedule them to maximize
memory usage because we can accurately predict how much memory they will
use. However, inspection of jobs submitted to the cluster shows that most
jobs consist largely of poorly behaved tasks. There is also another
interesting correlation: poorly behaved tasks tend to be non-interactive.
This gives us the freedom to slow, or in the limit, pause such a task
without adversely affecting its intended result. As an example, we are free to
pause an off-line scientific computation though we are not allowed to
pause a web server which is serving live traffic. Note that it is quite
simple to pause a running task; all that needs to be done is to have the
OS scheduler stop scheduling that particular process. The UNIX command
nice(1) could be used to slow such a task by lowering its priority, but to
pause it outright, the standard job-control stop/continue signals or a
modified kernel scheduler would probably be a better fit.
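As a concrete illustration, a user-space monitor can do this with the
standard POSIX stop/continue signals. The helper below is a minimal sketch
of my own (not part of any existing cluster software); the scheduling sketch
later in this section assumes something like it exists:

    import os
    import signal

    def pause_task(pid):
        """Stop scheduling the process; its memory stays resident but it
        cannot allocate more or make further progress."""
        os.kill(pid, signal.SIGSTOP)

    def resume_task(pid):
        """Let the OS scheduler run the process again from where it stopped."""
        os.kill(pid, signal.SIGCONT)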
Given the ability to pause a running task, and slightly more information
about the resource usage profile of that task, we can tightly bound the
resource usage on a given node by sacrificing task running time. A simple
approach would be to define the normal and peak resource consumption of a
given task; these values can be measured from previous runs of that task. We
then say that a task is approaching its peak if its usage exceeds 110% of
its normal usage. Then, we can bound memory usage by only allowing one
task which is approaching its peak usage to run at a time. This bounds
total resource usage at sum(normal * 110%) + max(peak - normal): every task
but one is held at or below 110% of its normal usage, and the single task
allowed to peak adds at most (peak - normal) on top of that. Further, if we
impose a total order on tasks residing
on a given node, we can ensure that we fairly service paused jobs by always
having the scheduler consider them in order.
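As a sketch only (under my own assumptions, not a production scheduler),
the policy amounts to a small monitoring loop: visit tasks in their fixed
total order, let the first task found above 110% of its normal usage keep
running through its peak, and pause any other task that crosses the
threshold until the slot frees up. This reuses the hypothetical pause_task
and resume_task helpers above and assumes usage_mb maps each pid to its
current resident memory:

    from dataclasses import dataclass

    @dataclass
    class Task:
        pid: int
        normal_mb: float      # measured from previous runs of this task
        peak_mb: float        # measured from previous runs of this task
        paused: bool = False

    def memory_bound(tasks):
        """Upper bound on node memory under the one-peak-at-a-time policy:
        sum(normal * 110%) + max(peak - normal)."""
        return (sum(1.1 * t.normal_mb for t in tasks)
                + max(t.peak_mb - t.normal_mb for t in tasks))

    def schedule_tick(tasks, usage_mb):
        """One monitoring pass. Tasks are visited in a fixed total order so
        paused tasks are always considered in order; only the first task found
        above 110% of its normal usage keeps running through its peak."""
        peak_slot_taken = False
        for t in tasks:                                  # fixed total order
            approaching = usage_mb[t.pid] > 1.1 * t.normal_mb
            if approaching and not peak_slot_taken:
                peak_slot_taken = True                   # this task owns the peak slot
                if t.paused:
                    resume_task(t.pid)
                    t.paused = False
            elif approaching:                            # peak slot already taken
                if not t.paused:
                    pause_task(t.pid)
                    t.paused = True
            else:                                        # back at normal usage
                if t.paused:
                    resume_task(t.pid)
                    t.paused = False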
Figure 5 depicts a memory model of a typical poorly behaved, non-interactive
process running in my simulation environment. This is the sort of process
we would be willing to pause if its resource usage began to peak. I found,
through measuring processes in the wild, that an average memory usage of
500MB and a peak usage of 1200MB is fairly reasonable. Further, in an
attempt to model reality, the peaks occur with a probability that varies
with time. Figure 6 shows a simulated cluster node running 6 poorly
behaved tasks whose properties are identical to the process depicted in
Figure 5.
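My simulator is not reproduced here, but a comparable model can be written
in a few lines. The 500MB normal level and 1200MB peak come from the
measurements above; the sinusoidally drifting peak probability is an
illustrative choice of mine, not the exact parameters behind the figures:

    import math
    import random

    def simulated_memory_trace(steps, normal_mb=500.0, peak_mb=1200.0):
        """Memory trace for a poorly behaved, non-interactive task: usage
        hovers around normal_mb, with spikes toward peak_mb whose probability
        drifts over time so that peaks tend to cluster."""
        trace = []
        for t in range(steps):
            p_peak = 0.05 * (1 + math.sin(2 * math.pi * t / 100))  # time-varying
            if random.random() < p_peak:
                usage = random.uniform(0.8 * peak_mb, peak_mb)      # a spike
            else:
                usage = random.gauss(normal_mb, 0.05 * normal_mb)   # near normal
            trace.append(max(usage, 0.0))
        return trace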
Based on this task's average usage, we know that the node will have roughly
3000MB of memory in use at essentially all times. However, at times when
multiple tasks begin to peak, as much as 6000MB of memory may be in use. On
a machine with more limited memory, such as the average cluster node which
has only 4500MB of memory, such tasks could not coexist if scheduled in the
current manner.
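For reference, these figures line up with the bound derived above; the
quick arithmetic check below is my own calculation, not output from the
simulation:

    normal_mb, peak_mb, n_tasks = 500, 1200, 6

    baseline   = n_tasks * normal_mb     # 3000MB in use essentially all the time
    worst_case = n_tasks * peak_mb       # 7200MB if all six peaked at once
                                         # (the simulation saw about 6000MB)
    bound = 1.1 * normal_mb * n_tasks + (peak_mb - normal_mb)  # one peak at a time

    print(baseline, worst_case, bound)   # 3000 7200 4000.0

This roughly 4000MB ceiling is what the proposed scheduler is designed to
enforce.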
Figure 7 shows the memory usage of a node implementing the proposed
scheduler which only allows a single job to peak at a time. In this case,
we have successfully bounded total memory usage at approximately 4000MB,
which would fit on the average cluster node. However, we pay a price in
running time for a given task. In a typical run of my simulated poorly
behaved process, its resource usage peaks about 10 times. For simplicity,
we can then assume that this corresponds to roughly 10 units of work. In
the case of the aggressively scheduled processes, each process's memory
usage peaks only 3 times in the same simulated interval, which corresponds
to a roughly 3x longer running time for each
task. Clearly this multiplier is heavily dependent on the number of
processes on the machine and how often their resource usage peaks.
However, as long as a task's normal and peak memory usage are well defined,
this multiplier will be a small constant. Further, measuring normal and
peak usage for a task can easily be automated, and thus these values can be
determined quite accurately.
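To make the multiplier explicit, it is simply the ratio of peaks (units of
work) completed in the same simulated interval with and without the policy;
the helper below just restates the arithmetic used above:

    def slowdown_factor(unconstrained_peaks, constrained_peaks):
        """How much longer a task runs under the one-peak-at-a-time policy,
        assuming each peak corresponds to one unit of work and both counts
        are taken over the same simulated interval."""
        return unconstrained_peaks / constrained_peaks

    print(slowdown_factor(10, 3))   # ~3.3, in line with the roughly 3x slowdown above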