General Setup:

We have a number of servers which require monitoring.  Our current standard model for deploying these has a few distinct pieces:

  • the server (usually a Linux box running RHEL; whether it's hardware or virtual is immaterial)
  • apache, front-ending for one or more instances of...
  • apache-tomcat, our servlet container of choice
  • the applications themselves

Critical Monitoring:

We need to know when things have crashed hard and need to be restarted.  ("We" here includes Server Operations, which will be handling initial response to critical problems.)  Current mechanisms to detect problems are:

  • server: ping
  • apache: simple Nagios HTTP fetch test of a non-tomcat-served URL (perhaps optional given the tomcat tests; it's not generally apache that crashes, it's the much-abused tomcat jvm)
  • tomcat & applications: simple Nagios HTTP fetch of an application-served URL
  • application-specific testing: something (likely HTTP-based) which will check functionality for the application and the backend services upon which it relies
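
The HTTP fetch tests above boil down to "request a URL, report OK or CRITICAL". Nagios ships its own HTTP check plugin, which is probably what we'd use in practice; the sketch below only illustrates the logic, using the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The function name `check_http`, the `fetch` parameter, and the message wording are assumptions made for illustration, not part of any existing setup.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_http(url, timeout=10.0, fetch=urllib.request.urlopen):
    """Fetch url once and return (exit_code, message).

    `fetch` is a hypothetical hook so the check can be exercised
    without a network; by default it is urllib's urlopen.
    """
    try:
        with fetch(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return CRITICAL, f"HTTP CRITICAL - {url} returned {err.code}"
    except OSError as err:
        # URLError, timeouts, and refused connections are all OSError subclasses.
        return CRITICAL, f"HTTP CRITICAL - {url} unreachable: {err}"
    if 200 <= status < 300:
        return OK, f"HTTP OK - {url} returned {status}"
    return WARNING, f"HTTP WARNING - {url} returned {status}"
```

A wrapper script would print the message and exit with the returned code; pointing it at a non-tomcat-served URL gives the apache test, and at an application-served URL gives the tomcat test.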

Notes:

  • tomcat and the applications served therefrom are fairly inseparable, rely on the same jvm, and can share a test, but each server is expected to have multiple separate containers for different application groups
  • a basic apache test may be superfluous
  • in-depth application-specific testing is not covered here


Status Monitoring:

Ideally, critical reporting functionality will remain unused, because nothing will ever crash.  Much of our infrastructure is reliable, but tomcat is a notable exception, and we've found a number of ways for it to go south.  There are also some system parameters we'd like to track.  So, more monitoring:

  • tomcat: resident size for each running jvm
  • system memory statistics
  • system disk usage statistics
  • tomcat crash detection and restart (no, this isn't just status monitoring)

Much of this will probably be packaged as scripts of our own devising, with daily reports via email and more frequent reports for alarm states.  We're investigating tools from SourceLabs for tomcat instance management.
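
As a starting point for those scripts, here is a minimal sketch of how the numbers could be collected on a Linux box. It assumes the usual `/proc` layout (`/proc/<pid>/status` for a process's resident size, `/proc/meminfo` for system memory) and that we already know each tomcat jvm's pid, for example from a pid file; how the pids are found is left open. All function names are hypothetical.

```python
import shutil


def parse_kb_fields(text):
    """Parse 'Name:   123 kB' lines into {name: kilobytes}.

    Both /proc/meminfo and /proc/<pid>/status use this layout;
    lines without a kB value are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def jvm_resident_kb(pid):
    """Resident set size (VmRSS) of one process in kB, or None if absent."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_kb_fields(handle.read()).get("VmRSS")


def system_memory_kb():
    """System memory statistics from /proc/meminfo, in kB."""
    with open("/proc/meminfo") as handle:
        return parse_kb_fields(handle.read())


def disk_percent_used(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total
```

A daily report would call these for each jvm pid and each mounted filesystem, and the same values could be compared against thresholds to decide when to send an alarm sooner.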

Application-Specific Tests:

These are notes on easy HTTP-based tests which will perform a basic functionality check of not just an application but also any backend servers it's relying upon.

  • thalia: "GET /libraries" from a valid domain; needs both Alfresco and the database running to return a library list
  • mitid: use built-in ping service which checks database connectivity (details?)
  • geocodes: TBD (just make some trivial query?)
  • roles: TBD
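
An application-specific test differs from the plain fetch in that it also inspects what came back. The sketch below shows one possible shape, using the thalia entry as the example. The host name, the function names, and the health rule (a non-empty body counts as a library list) are placeholders; the real response format and the "valid domain" requirement are not modelled here, and the TBD applications are deliberately left out.

```python
import urllib.request


def check_application(base_url, path, looks_healthy, timeout=10.0,
                      fetch=urllib.request.urlopen):
    """GET base_url + path and apply an app-specific rule to the body.

    Returns (ok, message). `fetch` is a hypothetical hook for testing
    without a network; by default it is urllib's urlopen.
    """
    url = base_url.rstrip("/") + path
    try:
        with fetch(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError as err:
        # Covers HTTP error statuses, timeouts, and refused connections.
        return False, f"{url}: request failed: {err}"
    if not looks_healthy(body):
        return False, f"{url}: response did not look healthy"
    return True, f"{url}: ok"


def thalia_looks_healthy(body):
    """Placeholder rule: any non-blank body is treated as a library list."""
    return bool(body.strip())
```

For thalia this would be called as `check_application("http://thalia.example.invalid", "/libraries", thalia_looks_healthy)`, where the host is a stand-in. Keeping the rule as a separate function means each application can supply its own once its test is defined.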