Overload Protection in SailFin: What's New in 2.0
The overload protection (OLP) feature has been part of earlier releases of SailFin. Let's start by describing what could be improved in the earlier implementation.
- In earlier releases, the algorithm declared an overload when a certain number of consecutive samples remained above the configured threshold. This was simple to implement and configure, but the resulting behavior was spiky, because the overload was cleared as soon as a single sample dropped below the threshold. So once an overload was detected and an alarm raised, the alarm could be cleared and raised over and over again within a relatively short period when CPU usage was oscillating heavily.
- Apart from this, the only way to report the overload condition was to log it in the server log file, which is not a good mechanism for alerting the user.
- On the HTTP side, even under maximum load we made an effort to send a 503 response back to the client, which may not be the right thing to do when the system is starved of resources.
- OLP was configured through properties, which is not a standard way of exposing configuration to the user.
The implementation in 2.0 tries to address the problems described above.
Overload Detection Algorithms
The overload detection algorithm has been enhanced to provide two different modes: CONSECUTIVE and MEDIAN. CONSECUTIVE is the same as the option in earlier releases, with the addition that samples below the threshold are also counted before the alarm is cleared. In earlier releases the alarm was cleared as soon as one sample fell below the threshold, which made it extremely sensitive.
The two algorithms for detecting an overload situation (and eventually clearing it) are:
CONSECUTIVE – the configured number of samples all have to be above (for activating the ALARM) or below (for clearing the ALARM) the threshold.
MEDIAN – the median value of the configured number of samples has to be above (for activating the ALARM) or below (for clearing the ALARM) the threshold. If the number of samples is even, the median value is computed as the mean of the two middle values.
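To make the difference between the two modes concrete, here is a minimal sketch in Java. It is not the SailFin implementation: the class name `OverloadDetector`, its constructor parameters, and the use of a sliding window over the most recent samples are all assumptions made for illustration. The sketch also assumes strict comparisons, so a value exactly equal to the threshold leaves the alarm state unchanged.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of the two detection modes; names are not from the SailFin source. */
final class OverloadDetector {
    enum Mode { CONSECUTIVE, MEDIAN }

    private final Mode mode;
    private final double threshold;   // e.g. CPU usage in percent
    private final int sampleCount;    // the "configured number of samples"
    private final Deque<Double> window = new ArrayDeque<>();
    private boolean overloaded;

    OverloadDetector(Mode mode, double threshold, int sampleCount) {
        if (sampleCount < 1) {
            throw new IllegalArgumentException("sampleCount must be >= 1");
        }
        this.mode = mode;
        this.threshold = threshold;
        this.sampleCount = sampleCount;
    }

    /** Feeds one sample and returns the alarm state after evaluating it. */
    boolean addSample(double sample) {
        window.addLast(sample);
        if (window.size() > sampleCount) {
            window.removeFirst();             // keep only the most recent samples
        }
        if (window.size() < sampleCount) {
            return overloaded;                // not enough samples to decide yet
        }
        if (mode == Mode.CONSECUTIVE) {
            // Every sample in the window must agree before the state changes.
            if (window.stream().allMatch(s -> s > threshold)) {
                overloaded = true;
            } else if (window.stream().allMatch(s -> s < threshold)) {
                overloaded = false;
            }
        } else {
            // Only the median of the window is compared with the threshold.
            double median = median();
            if (median > threshold) {
                overloaded = true;
            } else if (median < threshold) {
                overloaded = false;
            }
        }
        return overloaded;
    }

    /** Median of the window; for an even count, the mean of the two middle values. */
    private double median() {
        double[] sorted = window.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

With a threshold of 70 and three samples in CONSECUTIVE mode, the sequence 80, 90, 85 raises the alarm, and a following 60 and 50 do not clear it. Only once the window holds 60, 50, 40 (all below the threshold) is the alarm cleared, which is the damping that the earlier single-sample clearing lacked.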
The overload mechanism has been separated into a detection unit and a reporter, which notifies all listeners of an overload event when an overload is raised or cleared. The event includes the type of algorithm that caused the overload and the traffic type (SIP, HTTP, etc.). The action taken is up to the implementation of each listener: the rejection listener rejects or drops traffic, and the logging listener logs warning statements. Other listeners are possible; for example, a JMX notification listener could send JMX notifications. A default JMX notifier MBean ("olpjmxnotifier") for OLP is registered under 'com.sun.appserv'.
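The split between detection, reporting and listeners can be sketched roughly as follows. This is only an illustration of the pattern described above: `OverloadEvent`, `OverloadListener`, `OverloadReporter`, `LoggingListener` and `RejectionListener` are hypothetical names, not the actual SailFin classes, and the real rejection listener acts on traffic rather than just holding a flag.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.logging.Logger;

/** Hypothetical event carrying the details mentioned in the text. */
final class OverloadEvent {
    enum State { RAISED, CLEARED }

    final State state;
    final String algorithm;    // e.g. "CONSECUTIVE" or "MEDIAN"
    final String trafficType;  // e.g. "SIP" or "HTTP"

    OverloadEvent(State state, String algorithm, String trafficType) {
        this.state = state;
        this.algorithm = algorithm;
        this.trafficType = trafficType;
    }
}

/** Each listener decides for itself what to do with an event. */
interface OverloadListener {
    void onOverloadEvent(OverloadEvent event);
}

/** The reporter fans each event out to every registered listener. */
final class OverloadReporter {
    private final List<OverloadListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(OverloadListener listener) {
        listeners.add(listener);
    }

    void report(OverloadEvent event) {
        for (OverloadListener listener : listeners) {
            listener.onOverloadEvent(event);
        }
    }
}

/** Logs a warning whenever an overload is raised or cleared. */
final class LoggingListener implements OverloadListener {
    private static final Logger LOG = Logger.getLogger("olp.sketch");

    @Override
    public void onOverloadEvent(OverloadEvent event) {
        LOG.warning("Overload " + event.state + " for " + event.trafficType
                + " traffic (algorithm: " + event.algorithm + ")");
    }
}

/** Remembers whether traffic should currently be rejected or dropped. */
final class RejectionListener implements OverloadListener {
    private volatile boolean rejecting;

    @Override
    public void onOverloadEvent(OverloadEvent event) {
        rejecting = event.state == OverloadEvent.State.RAISED;
    }

    boolean shouldReject() {
        return rejecting;
    }
}
```

A detection unit would call `report(...)` only when the alarm state changes. Because the reporter knows nothing about what listeners do, a new reaction (such as a JMX notifier) can be added without touching the detection code.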
The overload protection configuration has been moved into a separate section, "overload-protection-service", which has a set of attributes that can be configured to tune the OLP behavior. More on OLP configuration is coming in the next blog.