[Developing a web-based application which you want ISDA to deploy and monitor?  You should read this entire document, but note especially the bits in red!]

General ISDA Application Server Setup:

We have a number of servers which require monitoring.  Our current standard model for deploying these has a few distinct pieces:

  • the server (usually a linux box running RHEL; whether it's hardware or virtual is immaterial)
  • apache (plain and/or SSL), front-ending for one or more instances of...
  • apache-tomcat, our servlet container of choice
  • the applications themselves

Critical Monitoring: ("Urgent Response")

We need to know when things are in need of immediate attention.  ("We" here includes OIS Server Operations, which may be handling initial response to critical problems.)  Our current mechanisms for this are handled by the OIS Nagios installation, which checks:

  • per-server, basic: ICMP ping
  • per-server, apache health monitoring: a simple Nagios HTTP fetch test of a non-tomcat-served URL, /ping.html [optional if server is SSL-only]
  • per-server, apache SSL health monitoring: a simple Nagios HTTPS fetch test of a non-tomcat-served URL, /ping.html [optional for non-SSL servers]
  • per-application state monitoring: Nagios HTTP[S] fetch of <application-root>/ping.jsp <span style="color: #ff3300">(see application requirements below)</span>
  • per-replicated-service state monitoring: as above (ping.jsp), but accessed via an application root URL which is handled by the f5 load balancer
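
As a rough illustration of what an HTTP fetch check of this kind boils down to, here is a minimal Python sketch. It is not the OIS Nagios configuration or plugin; it only assumes the standard Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name check_url, the injectable opener parameter, and the 30-second default timeout are assumptions made for this example.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_url(url, timeout=30.0, opener=urllib.request.urlopen):
    """Fetch url once and return (exit_code, message), Nagios-plugin style.

    `opener` is injectable so the function can be exercised without a
    network; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return CRITICAL, f"HTTP CRITICAL - {url} returned {exc.code}"
    except OSError as exc:
        # Covers URLError, refused connections, and socket timeouts.
        return CRITICAL, f"HTTP CRITICAL - {url} unreachable: {exc}"
    if status == 200:
        return OK, f"HTTP OK - {url} returned 200"
    return WARNING, f"HTTP WARNING - {url} returned {status}"
```

A wrapper script would call check_url with something like "http://<server>/ping.html" (server name left to the reader), print the message, and exit with the returned code.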

Notes:

  • tomcat and the applications served therefrom are fairly inseparable, rely on the same jvm, and can thus share a test, but each server will generally have multiple separate containers for different application groups
  • a basic apache test may be superfluous given the tomcat tests; it's not generally apache that crashes, it's the much-abused tomcat jvm
  • in-depth application-specific testing is not covered here
  • Ideally, this critical reporting functionality will remain unused; see "Proactive Status Monitoring" below.

Application Requirements for Deployment:

Web-based applications deployed for ISDA monitoring services can enjoy the expectation of a stable and monitored servlet container (currently tomcat) for their use.  All applications must provide at least minimal self-reporting, however.  Specifically, every application must serve a standard page retrievable from <application-root>/ping.jsp.  This page's content must satisfy the following criteria:

  • page text contains exactly one of "SUCCESS" or "FAILURE"
  • retrieval completes within 30 seconds
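
The text criterion above can be read as a simple rule on the fetched page body. The following Python sketch shows one possible interpretation, in which "exactly one of" means that one keyword is present and the other is absent. The function name classify_ping_page and the "INVALID" result are assumptions for illustration, not part of any ISDA tooling.

```python
def classify_ping_page(text):
    """Classify a ping page body as "SUCCESS", "FAILURE", or "INVALID".

    A body is valid only if it contains one of the two keywords and not
    the other; anything else (neither, or both) is treated as invalid.
    """
    has_success = "SUCCESS" in text
    has_failure = "FAILURE" in text
    if has_success and not has_failure:
        return "SUCCESS"
    if has_failure and not has_success:
        return "FAILURE"
    return "INVALID"
```

Under this reading, a verbose body such as "FAILURE: backend data server lotsobits.mit.edu is offline" still classifies cleanly as a failure, while an empty page or a page containing both keywords does not satisfy the requirement.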

Note that, if you just want operations staff to notice that your application has completely shut down, you as an application developer are welcome to make this page static content containing nothing more than "SUCCESS".  You are also welcome to make checks of application state and connectivity to any backend services you may utilize, so long as such checks do not impact the retrievability of ping.jsp well within the expected window.  If your test process may take an extended period which could exceed the ping.jsp retrieval window without representing an error condition, your application must instead make any required checks asynchronously, with ping.jsp used only to report on the results of those checks.
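
The asynchronous pattern described above can be sketched as a background checker that caches its latest result, so that serving the page never waits on a slow backend. This is a language-neutral illustration written in Python rather than JSP; the class name CachedHealth, the 60-second interval, and the convention that the check callable returns None when healthy (or a short problem description otherwise) are all assumptions of this sketch.

```python
import threading


class CachedHealth:
    """Run a possibly slow check in the background; serve the cached result."""

    def __init__(self, check, interval_seconds=60.0):
        self._check = check              # callable: None if healthy, else str
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._status = "FAILURE: no check has completed yet"
        self._stop = threading.Event()

    def refresh(self):
        """Run the check once and store the outcome (may be slow)."""
        try:
            problem = self._check()
            status = "SUCCESS" if problem is None else f"FAILURE: {problem}"
        except Exception as exc:
            status = f"FAILURE: check raised {exc!r}"
        with self._lock:
            self._status = status

    def start(self):
        """Start a daemon thread that refreshes the status periodically."""
        def loop():
            while not self._stop.is_set():
                self.refresh()
                self._stop.wait(self._interval)
        threading.Thread(target=loop, daemon=True).start()

    def stop(self):
        self._stop.set()

    def page_text(self):
        """Return the cached status immediately; never runs the check."""
        with self._lock:
            return self._status
```

The page handler only ever calls page_text(), which returns at once regardless of how long the underlying check takes. Problem descriptions should avoid the word "SUCCESS" so the page still contains exactly one of the two keywords.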

In any case, note that while verbose error messages (example: "FAILURE: backend data server lotsobits.mit.edu is offline") can be helpful for debugging purposes, operational staff response beyond ensuring a functional and healthy servlet container, restarting the application, and notifying the application developer is on a discretionary and workload-permitting basis.

Proactive Status Monitoring: ("Impending Doom")

Ideally, critical reporting functionality will remain unused, because nothing will ever crash.  Much of our infrastructure is reliable, but tomcat is a notable exception, and we've found a number of ways for it to go south.  There are also some system parameters we'd like to track.  So, more monitoring:

...

Since tomcat crashes due to transient problems or resource exhaustion have been all-too-common, we plan to have a crashed tomcat automatically and immediately restarted once any crash state has been saved, though not ad nauseam.
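
One way to express "restart automatically, though not ad nauseam" is a sliding-window cap on restarts. The Python sketch below is only an illustration of that idea, not the planned ISDA mechanism; the class name RestartLimiter and the defaults (at most 3 restarts per hour) are assumptions chosen for the example.

```python
import time
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts restarts within any window_seconds span."""

    def __init__(self, max_restarts=3, window_seconds=3600.0,
                 clock=time.monotonic):
        self._max = max_restarts
        self._window = window_seconds
        self._clock = clock              # injectable for testing
        self._recent = deque()           # timestamps of allowed restarts

    def allow_restart(self):
        """Return True and record the restart if under the cap, else False."""
        now = self._clock()
        # Forget restarts that have aged out of the window.
        while self._recent and now - self._recent[0] >= self._window:
            self._recent.popleft()
        if len(self._recent) >= self._max:
            return False
        self._recent.append(now)
        return True
```

A supervising script would save the crash state first, then restart tomcat only if allow_restart() returns True, and otherwise leave the container down and escalate to a human.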

Application-Specific End-to-End Tests:

These are designed as supplementary tests which are capable of making end-user-like calls to the web applications, thus testing their entire range of function without simply relying on self-reporting.  In a few cases these are also performable via Nagios, but generally speaking we use a separate server (isda-console1.mit.edu) with its own custom-coded test array.

These are notes on easy HTTP-based tests which will perform a basic functionality check of not just an application but also any backend servers it's relying upon.

  • thalia: "GET /libraries" from a valid domain; needs both Alfresco and the database running to return a library list
  • mitid: use built-in ping service which checks database connectivity (details?); isda-console1 currently has end-to-end testing implemented for this service
  • geocodes: TBD (just make some trivial query?)
  • UA: TBD
  • roles: TBD