[ZEST:Developing a web-based application which you want ISDA to deploy and monitor?  You should read this entire document, but note especially the bits in red!] 

General ISDA Application Server Setup:

We have a number of servers which require monitoring.  Our current standard model for deploying these has a few distinct pieces:

  • the server (usually a linux box running RHEL; whether it's hardware or virtual is immaterial)
  • apache (plain and/or SSL), front-ending for one or more instances of...
  • apache-tomcat, our servlet container of choice
  • the applications themselves

Monitoring Requirements:

Web-based applications deployed under ISDA monitoring services can expect a stable, monitored servlet container (currently Tomcat) for their use.  Each servlet container must contain a special web application specifically designed to monitor all other applications deployed to the container.  This special web application will be referred to as the MONITOR for the remainder of this document.  The MONITOR must serve a standard page retrievable at the URL path /<container_id>-monitor, where <container_id> is the servlet container identifier.  The content of the response page must satisfy the following criteria:

  • if the MONITOR detects that an application has failed, the text of the response page must contain the word FAILURE.  It is also recommended (but not required) that the response page contain brief text as to the nature of the failure.
  • retrieval must complete within the 30-second response window.

Note that, if you only want the operations staff to monitor the availability of the servlet container, the MONITOR application need only return a static page containing nothing more than "SUCCESS".  However, you are encouraged to check application state and connectivity to any backend services you use, so long as the MONITOR's response time stays within the expected response window.

In any case, note that while verbose error messages (example: "FAILURE: backend data server lotsobits.mit.edu is offline") can be helpful for debugging purposes, operational staff response beyond ensuring a functional and healthy servlet container, restarting the application, and notifying the application developer is on a discretionary and workload-permitting basis.
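The decision the MONITOR page embodies can be sketched in shell terms as follows.  This is only an illustration of the SUCCESS/FAILURE contract, not a real MONITOR (which must be a web application deployed inside the container); the backend list and the nc-based probe are assumptions.

```shell
#!/bin/sh
# Sketch of the MONITOR's decision logic: probe each backend dependency,
# print FAILURE plus a brief reason if any probe fails, otherwise print
# SUCCESS.  The backend host:port list below is hypothetical.
BACKENDS="lotsobits.mit.edu:5432"

monitor_body() {
    for b in $BACKENDS; do
        host=${b%:*}; port=${b#*:}
        if ! probe "$host" "$port"; then
            # per the spec, the word FAILURE must appear; brief detail
            # about the failure is recommended but optional
            echo "FAILURE: backend $host:$port unreachable"
            return 0
        fi
    done
    echo "SUCCESS"
}

probe() {
    # placeholder connectivity probe; a real MONITOR might run an
    # application-level query instead of a bare TCP connect
    nc -z -w 5 "$1" "$2" >/dev/null 2>&1
}
```

Keeping each probe's timeout small matters: the sum of all probes must fit inside the 30-second response window.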

Critical Monitoring: ("Urgent Response")

Our most critical monitoring involves knowing when things have crashed hard and are in need of immediate attention.  (This includes OIS Server Operations, which may be handling initial response to critical problems.)  Our current mechanisms for this are handled by the OIS Nagios installation, which checks:

  • per-server, basic: ICMP ping
  • per-server, apache health monitoring: a simple Nagios HTTP fetch test of a non-tomcat-served URL, /ping.html [ZEST:optional if server is SSL-only]
  • per-server, apache SSL health monitoring: a simple Nagios HTTPS fetch test of a non-tomcat-served URL, /ping.html [ZEST:optional for non-SSL servers]
  • application state monitoring: Nagios HTTP[ZEST:S] fetch of /<container_id>-monitor (see Monitoring Requirements above)

Notes:

  • Tomcat and the applications served therefrom are fairly inseparable, rely on the same jvm, and thus share a test, but each server will generally have multiple separate containers for different application groups
  • Ideally, this critical reporting functionality will remain unused (see Proactive Status Monitoring below).
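Taken together, the per-server checks above amount to a sweep like the following sketch.  The real checks are independent Nagios plugins, not one script; the host name and container id in the example are hypothetical.

```shell
#!/bin/sh
# Sketch of the per-server critical-monitoring sweep: ICMP ping, plain
# and SSL apache fetches of /ping.html, then the application-state fetch
# of /<container_id>-monitor.  Each fetch honors the 30-second window.
nagios_sweep() {
    host=$1; container=$2; status=0
    ping -c 1 "$host" >/dev/null 2>&1 || { echo "PING FAIL"; status=1; }
    curl -sf --max-time 30 "http://$host/ping.html"  >/dev/null \
        || { echo "HTTP FAIL"; status=1; }
    curl -sf --max-time 30 "https://$host/ping.html" >/dev/null \
        || { echo "HTTPS FAIL"; status=1; }
    page=$(curl -sf --max-time 30 "http://$host/${container}-monitor") \
        || page="FAILURE: no response"
    case $page in
        *FAILURE*) echo "MONITOR FAIL"; status=1 ;;
    esac
    [ "$status" -eq 0 ] && echo "ALL OK"
    return $status
}

# Example invocation (hypothetical host and container id):
#   nagios_sweep appserver.mit.edu prod
```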

Proactive Status Monitoring: ("Impending Doom")

Ideally, critical reporting functionality will remain unused, because nothing will ever crash.  Much of our infrastructure is reliable, but Tomcat is a notable exception, and we've found a number of ways for it to go south.  There are also some system parameters we'd like to track.  So, more monitoring:

  • Tomcat: resident size for each running jvm
  • system memory statistics
  • system disk usage statistics
  • Tomcat crash detection and restart (no, this isn't just status monitoring)

Much of this will probably be packaged with scripts of our own devising, with daily reporting via email; perhaps more frequently for alarm states.  We're investigating tools from SourceLabs for Tomcat instance management.
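The self-packaged portion of that reporting could start as small as the sketch below, which gathers the three statistics listed above.  The report layout and the way a Tomcat jvm is identified are assumptions.

```shell
#!/bin/sh
# Sketch of a daily status collection script: jvm resident sizes,
# system memory, and disk usage.
status_report() {
    echo "== jvm resident sizes (KB) =="
    # [j]ava keeps grep from matching its own command line
    ps -eo rss,args 2>/dev/null | grep '[j]ava' || echo "(no jvms found)"
    echo "== system memory =="
    free -k 2>/dev/null || echo "(free not available)"
    echo "== disk usage =="
    df -kP
}

# A cron job might pipe this to mail(1) daily, or more often when a
# threshold is exceeded.
```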

Since Tomcat crashes due to transient problems or resource exhaustion have been all-too-common, we plan to have a crashed Tomcat automatically and immediately restarted once any crash state has been saved, though not ad nauseam.

Application-Specific End-to-End Tests:

These are designed as supplementary tests which are capable of making end-user-like calls to the web applications, thus testing their entire range of function without simply relying on self-reporting.  In a few cases these are also performable via Nagios, but generally speaking we use a separate server (isda-console1.mit.edu) with its own custom-coded test array.

IPS cannot set up end-to-end testing for every application we host or support.  We have constrained this requirement to web services for which we have assumed primary support.

These are notes on easy HTTP-based tests which will perform a basic functionality check of not just an application but also any backend servers it's relying upon.

  • thalia: "GET /libraries" from a valid domain; needs both Alfresco and the database running to return a library list.
  • mitid: isda-console1 currently has end-to-end testing implemented for this service.
  • geocodes: isda-console1 currently has end-to-end testing implemented for this service.
  • UA: isda-console1 currently has end-to-end testing implemented for this service.
  • roles: TBD
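For the thalia entry above, an isda-console1-style probe might look like this sketch: "GET /libraries" should return a library list only when both Alfresco and the database are up.  The non-empty-body heuristic and the example host are assumptions.

```shell
#!/bin/sh
# Sketch of an end-to-end probe for thalia.
e2e_thalia() {
    body=$(curl -sf --max-time 30 "http://$1/libraries") \
        || { echo "FAIL: no response"; return 2; }
    # an empty body suggests a backend problem even though the HTTP
    # request itself succeeded
    [ -n "$body" ] && echo "PASS" \
        || { echo "FAIL: empty library list"; return 2; }
}

# Example invocation (hypothetical host):
#   e2e_thalia thalia.mit.edu
```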