General Setup:

We have a number of servers which require monitoring. Our current standard model for deploying these has a few distinct pieces:

the server (usually a Linux box running RHEL; whether it's hardware or virtual is immaterial)
apache, front-ending for one or more instances of...
apache-tomcat, our servlet container of choice
the applications themselves

Critical Monitoring:

We need to know when things have crashed hard and need to be restarted. ("We" here includes Server Operations, which will be handling initial response to critical problems.) Current mechanisms to detect problems are:

server: ping
apache: simple Nagios HTTP fetch test of a non-tomcat-served URL (perhaps optional given the tomcat tests; it's not generally apache that crashes, it's the much-abused tomcat jvm)
tomcat & applications: simple Nagios HTTP fetch of an application-served URL
application-specific testing: something (likely HTTP-based) which will check functionality for the application and the backend services upon which it relies
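
As a rough illustration of what the simple HTTP fetch tests amount to, here is a minimal sketch of a Nagios-style check in Python. It is not the check we actually run (Nagios ships its own HTTP plugin); the function name and the timeout are assumptions made for the example. The only fixed convention it relies on is the standard plugin exit codes: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_http(url, timeout=10.0):
    """Fetch url and return (exit_code, one-line status message).

    Hypothetical helper for illustration; timeout is an assumed default.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return CRITICAL, f"HTTP CRITICAL - {url} returned {exc.code}"
    except OSError as exc:
        # Connection refused, timeout, DNS failure (URLError is an OSError).
        return CRITICAL, f"HTTP CRITICAL - {url} unreachable: {exc}"
    return OK, f"HTTP OK - {url} returned {status}"
```

A wrapper script would print the message and exit with the returned code. Pointing it at an application-served URL exercises apache, tomcat, and the application in one fetch, which is why the separate apache test may be optional.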

Notes:

tomcat and the applications served therefrom are fairly inseparable, rely on the same jvm, and can share a test, but each server is expected to have multiple separate containers for different application groups
a basic apache test may be superfluous
in-depth application-specific testing is not covered here

Ideally, this critical reporting functionality will remain unused.

Status Monitoring:

Ideally, critical reporting functionality will remain unused, because nothing will ever crash. Much of our infrastructure is reliable, but tomcat is a notable exception, and we've found a number of ways for it to go south. There are also some system parameters we'd like to track. So, more monitoring:

tomcat: resident size for each running jvm
system memory statistics
system disk usage statistics
tomcat crash detection and restart (no, this isn't just status monitoring)
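
The first three items above could be gathered on a Linux box roughly as follows. This is only a sketch of one approach, assuming the usual /proc layout (the VmRSS line in /proc/<pid>/status, and MemTotal and MemFree in /proc/meminfo); all function names are hypothetical, and how each tomcat jvm's pid is found (a pid file, say) is left open.

```python
import re
import shutil


def parse_kb_fields(text):
    """Parse 'Name:   1234 kB' lines into a {name: kilobytes} dict.

    Both /proc/meminfo and /proc/<pid>/status use this line format;
    lines without a kB value are skipped.
    """
    fields = {}
    for line in text.splitlines():
        match = re.match(r"([^:\s]+):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def jvm_resident_kb(pid):
    """Resident set size of one process in kB, or None if not reported."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_kb_fields(fh.read()).get("VmRSS")


def memory_summary():
    """Total and free system memory in kB, read from /proc/meminfo."""
    with open("/proc/meminfo") as fh:
        info = parse_kb_fields(fh.read())
    return {"total_kb": info["MemTotal"], "free_kb": info["MemFree"]}


def disk_used_percent(path="/"):
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

The numbers from these functions would feed the daily email report, with thresholds for alarm states to be decided.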

Much of this will probably be packaged with scripts of our own devising, with daily reporting via email; perhaps more frequently for alarm states. We're investigating tools from SourceLabs for tomcat instance management.

Since tomcat crashes due to transient problems or resource exhaustion have been all too common, we plan to have a crashed tomcat restarted automatically and immediately once any crash state has been saved, with a cap on repeated restarts so that a persistently failing instance is not restarted ad nauseam.
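
One way to keep automatic restarts from looping forever is a sliding-window limit. The sketch below is an illustration only: the class name and the defaults (3 restarts per 10 minutes) are assumptions, not decided policy, and saving crash state and actually restarting tomcat are left to the caller.

```python
import time
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts restarts within any window_seconds span.

    Hypothetical helper; the default limits are placeholder values.
    """

    def __init__(self, max_restarts=3, window_seconds=600.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.clock = clock
        self._recent = deque()  # timestamps of restarts still inside the window

    def allow_restart(self):
        """Return True and record the restart, or False if over the limit."""
        now = self.clock()
        # Forget restarts that have aged out of the window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_restarts:
            return False  # too many recent restarts; leave it for a human
        self._recent.append(now)
        return True
```

A watchdog script would call allow_restart() each time it finds a tomcat instance down, restart the instance when it returns True, and raise an alarm instead when it returns False.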

Application-Specific Tests:

These are notes on easy HTTP-based tests which will perform a basic functionality check of not just an application but also any backend servers it's relying upon.

thalia: "GET /libraries" from a valid domain; needs both Alfresco and the database running to return a library list
mitid: use built-in ping service which checks database connectivity (details?)
geocodes: TBD (just make some trivial query?)
roles: TBD