General Setup:
We have a number of servers which require monitoring. Our current standard model for deploying these has a few distinct pieces:
Critical Monitoring:
We need to know when things have crashed hard and need to be restarted. ("We" here includes Server Operations, which will be handling initial response to critical problems.) Current mechanisms to detect problems are:
Notes:
Ideally, this critical reporting functionality will remain unused.
Status Monitoring:
Ideally, critical reporting functionality will remain unused, because nothing will ever crash. Much of our infrastructure is reliable, but Tomcat is a notable exception, and we've found a number of ways for it to fail. There are also some system parameters we'd like to track. So, more monitoring:
Much of this will probably be packaged with scripts of our own devising, with daily reporting via email and more frequent reports for alarm states. We're investigating tools from SourceLabs for Tomcat instance management.
Since Tomcat crashes due to transient problems or resource exhaustion have been all too common, we plan to have a crashed Tomcat restarted automatically and immediately once any crash state has been saved. Automatic restarts will be capped, so that an instance which keeps crashing is escalated rather than restarted indefinitely.
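As a rough illustration of the "restart, but not ad nauseam" policy, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names RestartLimiter and handle_crash, the limit of 3 restarts per 10 minutes, and the save_crash_state, restart, and escalate callbacks (which stand in for whatever our own scripts end up doing). It shows only the ordering (save state first, then restart) and the cap, not how a crash is detected.

```python
import time
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts within a sliding window of window_seconds.

    The default limits are illustrative assumptions, not agreed policy.
    """

    def __init__(self, max_restarts=3, window_seconds=600.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.clock = clock
        self._stamps = deque()  # times of recent automatic restarts

    def allow(self):
        now = self.clock()
        # Forget restarts that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False
        self._stamps.append(now)
        return True


def handle_crash(limiter, save_crash_state, restart, escalate):
    """Save crash state first, then restart unless the cap is exhausted.

    All three callbacks are hypothetical hooks supplied by the caller.
    """
    save_crash_state()
    if limiter.allow():
        restart()
        return "restarted"
    # Too many recent restarts: stop retrying and hand off to a human.
    escalate()
    return "escalated"
```

With a sliding window, an instance that crashes once a day is always restarted, while one that crashes repeatedly within a few minutes is left down and escalated once the cap is reached.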
Application-Specific Tests:
These are notes on easy HTTP-based tests which perform a basic functionality check of not just an application but also any backend servers it relies on.
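A test of this kind can be sketched in a few lines of Python using only the standard library. The function name check_http and the idea of looking for a marker string in the response body are assumptions for illustration. The assumption is that each application exposes some page which only renders the marker when its backends have answered. The sketch bypasses any configured HTTP proxy so that the check talks to the server directly.

```python
import urllib.error
import urllib.request

# Opener with proxies disabled, so the check hits the server itself.
_direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def check_http(url, expected_text, timeout=10.0):
    """Fetch url and return (ok, detail).

    ok is True only if the server answers without an HTTP error and the
    body contains expected_text (a hypothetical marker that the page
    prints only when its backend calls succeeded).
    """
    try:
        with _direct.open(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, timeout, and similar.
        return False, f"unreachable: {exc}"
    if expected_text not in body:
        return False, "marker text missing from response"
    return True, f"HTTP {status}"
```

A caller would pass the application's test URL and its marker string, and treat any False result as a failed check to be reported.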