\[Developing a web-based application which you want ISDA to deploy and monitor? You should read this entire document, but note especially the bits in red\!\]
General ISDA Application Server Setup:
We have a number of servers which require monitoring. Our current standard model for deploying these has a few distinct pieces:
Monitoring Requirements:
Web-based applications deployed for ISDA monitoring services are provided with a stable, monitored servlet container (currently Tomcat). Each servlet container must contain a special web application that is specifically designed to monitor all other applications deployed to the container; this application will be referred to as the MONITOR for the remainder of this document. The MONITOR must serve a standard page retrievable at the URL path /<container_id>-monitor, where <container_id> is the servlet container identifier. The content of the response page must satisfy the following criteria:
Note that if you only want the operations staff to monitor the availability of the servlet container, the MONITOR application need only return a static page containing nothing more than "SUCCESS". However, you are encouraged to check application state and connectivity to any backend services you utilize, so long as the response time of the MONITOR stays within the expected response window.
In any case, note that while verbose error messages (example: "FAILURE: backend data server lotsobits.mit.edu is offline") can be helpful for debugging purposes, operations staff response beyond ensuring a functional and healthy servlet container, restarting the application, and notifying the application developer is on a discretionary, workload-permitting basis.
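As an illustration, here is a minimal sketch of a MONITOR servlet. The backend host, port, and timeout are hypothetical placeholders, and the single socket probe stands in for whatever state and backend-connectivity checks make sense for your application; a container-availability-only MONITOR could instead be just a static SUCCESS page.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.net.InetSocketAddress;
import java.net.Socket;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch of a MONITOR servlet; map it to /<container_id>-monitor in web.xml.
public class MonitorServlet extends HttpServlet {

    // Hypothetical backend dependency -- substitute your own services.
    private static final String BACKEND_HOST = "lotsobits.mit.edu";
    private static final int BACKEND_PORT = 8080;
    private static final int TIMEOUT_MS = 2000; // stay well inside the response window

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setContentType("text/plain");
        PrintWriter out = resp.getWriter();
        try (Socket socket = new Socket()) {
            // Cheap connectivity probe of one backend service.
            socket.connect(new InetSocketAddress(BACKEND_HOST, BACKEND_PORT), TIMEOUT_MS);
            out.println("SUCCESS");
        } catch (IOException e) {
            // Verbose failure text is for the developer's benefit; operations
            // staff only need to see that the page did not report SUCCESS.
            out.println("FAILURE: backend data server " + BACKEND_HOST + " is offline");
        }
    }
}
```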
Critical Monitoring: ("Urgent Response")
Our most critical monitoring involves knowing when things have crashed hard and are in need of immediate attention. (This includes OIS Server Operations, which may be handling initial response to critical problems.) Our current mechanisms for this are handled by the OIS Nagios installation, which checks:
- per-server Apache health monitoring: a simple Nagios HTTP fetch test of a non-Tomcat-served URL, /ping.html \[optional if server is SSL-only\]
- per-server Apache SSL health monitoring: a simple Nagios HTTPS fetch test of a non-Tomcat-served URL, /ping.html \[optional for non-SSL servers\]
- application state monitoring: Nagios HTTP\[S\] fetch of */<container_id>-monitor* <span style="color: #ff3300">(see Monitoring Requirements above)</span>
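For reference, these checks correspond roughly to invocations of the standard Nagios check_http plugin along the following lines. The host name is a placeholder, and the exact command definitions in our Nagios configuration may differ:

```
# Apache health (plain HTTP); hypothetical host name
check_http -H appserver1.mit.edu -u /ping.html

# Apache SSL health
check_http -H appserver1.mit.edu -S -u /ping.html

# Application state: fetch the MONITOR page and require "SUCCESS" in the body
check_http -H appserver1.mit.edu -u /<container_id>-monitor -s SUCCESS
```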
Notes:
Proactive Status Monitoring: ("Impending Doom")
Ideally, critical reporting functionality will remain unused, because nothing will ever crash. Much of our infrastructure is reliable, but Tomcat is a notable exception, and we've found a number of ways for it to go south. There are also some system parameters we'd like to track. So, more monitoring:
Much of this will probably be packaged as scripts of our own devising, with daily reporting via email, and perhaps more frequent reporting for alarm states. We're investigating tools from SourceLabs for Tomcat instance management.
Since Tomcat crashes due to transient problems or resource exhaustion have been all too common, we plan to have a crashed Tomcat restarted automatically and immediately once any crash state has been saved, though not ad nauseam.
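To make the "not ad nauseam" part concrete, here is a rough sketch of the sort of watchdog logic involved. It is an illustration only: the monitor URL, restart script path, polling interval, and retry cap are all hypothetical, and the mechanism we actually deploy (likely scripts, possibly SourceLabs tooling) may look quite different.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative watchdog: probe the MONITOR page, restart Tomcat on failure,
// but give up after a fixed number of consecutive restarts.
public class TomcatWatchdog {

    private static final String MONITOR_URL = "http://localhost:8080/tomcat1-monitor"; // hypothetical
    private static final String RESTART_CMD = "/usr/local/tomcat/bin/restart-tomcat.sh"; // hypothetical
    private static final int MAX_RESTARTS = 3;

    public static void main(String[] args) throws Exception {
        int consecutiveRestarts = 0;
        while (consecutiveRestarts < MAX_RESTARTS) {
            if (monitorResponds()) {
                consecutiveRestarts = 0; // healthy again, so reset the counter
            } else {
                // A real implementation would save crash state (thread dumps,
                // logs, etc.) before restarting.
                new ProcessBuilder(RESTART_CMD).inheritIO().start().waitFor();
                consecutiveRestarts++;
            }
            Thread.sleep(60_000); // poll once a minute
        }
        System.err.println("Tomcat restarted " + MAX_RESTARTS
                + " times without recovering; leaving it down for a human.");
    }

    private static boolean monitorResponds() {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(MONITOR_URL).openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            return conn.getResponseCode() == 200;
        } catch (IOException e) {
            return false;
        }
    }
}
```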
Application-Specific End-to-End Tests:
These are supplementary tests capable of making end-user-like calls to the web applications, testing their entire range of function rather than simply relying on self-reporting. In a few cases these can also be performed via Nagios, but generally speaking we use a separate server (isda-console1.mit.edu) with its own custom-coded test array.
IPS cannot set up end-to-end testing for every application we host or support. We have constrained this requirement to web services for which we have assumed primary support.
These notes describe easy HTTP-based tests which perform a basic functionality check of not just the application itself but also any backend servers it relies upon.
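As an illustration of such a test, the sketch below fetches an application URL and checks the body for an expected marker string. The URL and marker are placeholders; the actual tests on isda-console1.mit.edu are custom-coded per application.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative end-to-end check: make an end-user-like request and verify
// that the expected content comes back, exercising both the application and
// the backends it depends on rather than trusting self-reporting.
public class EndToEndTest {
    public static void main(String[] args) throws Exception {
        // Hypothetical application URL and expected marker string.
        URL url = new URL("https://someapp.mit.edu/app/search?q=test");
        String expected = "Search results";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);

        int code = conn.getResponseCode();
        if (code != 200) {
            System.out.println("FAIL: HTTP " + code + " from " + url);
            System.exit(1); // non-zero exit so a test harness can raise an alarm
        }

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }

        if (body.toString().contains(expected)) {
            System.out.println("OK");
        } else {
            System.out.println("FAIL: expected content missing from " + url);
            System.exit(1);
        }
    }
}
```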