This page is under construction

The Shibboleth IdP uses terracotta for clustering (IdP session state, etc.). The terracotta configuration file is /usr/local/shibboleth-idp/conf/tc-config.xml. The terracotta server is a separate Java program that runs on each node in the cluster; tomcat (the container for the IdP web application) runs as the terracotta client.
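
To confirm that both pieces are running on a node, you can check for the two JVMs and the terracotta JMX listener. A minimal sketch (the grep patterns are assumptions and may need adjusting to match the process command lines on your installation):

No Format
# ps -ef | grep -i '[t]erracotta'      # the terracotta server JVM
# ps -ef | grep '[c]atalina'           # tomcat, which runs as the terracotta client
# netstat -tlnp | grep 9520            # the terracotta JMX port should be listening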

Unfortunately, the terracotta software is not robust in all situations. It can have problems recovering if both nodes are restarted at the same time (e.g. following a power outage). Also, if the server loses contact with a client, e.g. if the client takes too long doing a garbage collection, the server can declare the client dead, and then refuse its subsequent reconnection attempts. In the latter case, the client (i.e. tomcat) will need to be restarted to rejoin the cluster.

Checking the cluster health

The terracotta distribution includes /usr/local/terracotta/bin/server-stat.sh, which performs a JMX query to obtain a server's status, including whether it is the active or the passive node. Without arguments, it connects to the server on localhost; you can also provide the -f <configfile> option to query all servers in the cluster. (The standard terracotta JMX TCP port is 9520, so when querying a server other than localhost, that machine's firewall must allow connections to the port from the peer node(s).) In a healthy cluster, one server will have the ACTIVE role (state ACTIVE-COORDINATOR), and the other(s) will have the PASSIVE role (state PASSIVE-STANDBY). For example, here is the output displayed on the active node:

No Format
# /usr/local/terracotta/bin/server-stat.sh
localhost.health: OK
localhost.role: ACTIVE
localhost.state: ACTIVE-COORDINATOR
localhost.jmxport: 9520

Here is the output displayed on a passive node:

No Format
# /usr/local/terracotta/bin/server-stat.sh
localhost.health: OK
localhost.role: PASSIVE
localhost.state: PASSIVE-STANDBY
localhost.jmxport: 9520
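
To query all servers in the cluster at once, point server-stat.sh at the cluster configuration file with -f. A hypothetical invocation and output (the per-host output labels are an assumption based on the localhost output above; JMX port 9520 must be reachable on the remote node):

No Format
# /usr/local/terracotta/bin/server-stat.sh -f /usr/local/shibboleth-idp/conf/tc-config.xml
idp-1.mit.edu.health: OK
idp-1.mit.edu.role: PASSIVE
idp-1.mit.edu.state: PASSIVE-STANDBY
idp-1.mit.edu.jmxport: 9520
idp-2.mit.edu.health: OK
idp-2.mit.edu.role: ACTIVE
idp-2.mit.edu.state: ACTIVE-COORDINATOR
idp-2.mit.edu.jmxport: 9520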

The server-stat.sh tool is limited, however, in that it only displays the status of the server itself; it says nothing about which clients, if any, are connected to it. To address that, we have a homegrown Java tool, /usr/local/lib/tc-status-exe.jar, which displays information about the client(s) connected to the server, as well as the server status; for example:

No Format
# $JAVA_HOME/bin/java -jar /usr/local/lib/tc-status-exe.jar -c /usr/local/shibboleth-idp/conf/tc-config.xml
idp-1.mit.edu.health=OK
idp-1.mit.edu.role=PASSIVE
idp-1.mit.edu.state=PASSIVE-STANDBY
idp-1.mit.edu.clientcount=0
idp-2.mit.edu.health=OK
idp-2.mit.edu.role=ACTIVE
idp-2.mit.edu.state=ACTIVE-COORDINATOR
idp-2.mit.edu.clientcount=2
idp-2.mit.edu.clientlist=idp-2.mit.edu:36209,idp-1.mit.edu:48462

In this example, idp-2.mit.edu is the active node, idp-1.mit.edu is passive, and both clients are properly connected to the active server. (The number following the colon in each client entry is the client's TCP port.) If a client is missing from the active node's clientlist, that client's tomcat should be restarted; a scripted check is sketched below.
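
This check lends itself to a small script. Here is a sketch (not part of our standard tooling) that assumes the key=value output format shown above and the two hostnames used in this example:

No Format
#!/bin/sh
# Verify that both tomcat clients appear in the terracotta clientlist.
# Assumes JAVA_HOME is set and the paths/hostnames used above.
STATUS=$("$JAVA_HOME"/bin/java -jar /usr/local/lib/tc-status-exe.jar \
    -c /usr/local/shibboleth-idp/conf/tc-config.xml)
CLIENTS=$(echo "$STATUS" | grep 'clientlist=' | cut -d= -f2)
for HOST in idp-1.mit.edu idp-2.mit.edu; do
    case "$CLIENTS" in
        *"$HOST"*) echo "$HOST: client connected" ;;
        *)         echo "$HOST: client MISSING - restart its tomcat" ;;
    esac
done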

Restarting the cluster

Determine which node is active by checking the server state with either the /usr/local/terracotta/bin/server-stat.sh script or the homegrown tc-status-exe.jar tool (see above). The active server node will be in the ACTIVE-COORDINATOR state, e.g.:

No Format
# /usr/local/terracotta/bin/server-stat.sh
localhost.health: OK
localhost.role: ACTIVE
localhost.state: ACTIVE-COORDINATOR
localhost.jmxport: 9520

A passive node should be in the PASSIVE-STANDBY state:

No Format
# /usr/local/terracotta/bin/server-stat.sh
localhost.health: OK
localhost.role: PASSIVE
localhost.state: PASSIVE-STANDBY
localhost.jmxport: 9520

The active node should be restarted first; the passive node should detect this and take over the active role. Wait for the restarted node to enter the PASSIVE-STANDBY state before proceeding, and check its server log for errors. If the node has a problem recovering its state, it is likely due to corrupted data: it will not reach the STANDBY state, and manual intervention will be needed. First, try removing the terracotta server's "dirty" saved object data, e.g.:

No Format
# /etc/init.d/terracotta stop
# rm -rf /usr/local/shibboleth-idp/cluster/server/data/dirty-objectdb-backup/*
# /etc/init.d/terracotta start

If it still fails to recover to STANDBY state, stop the server again, and remove the object data, e.g.:

No Format
# /etc/init.d/terracotta stop
# rm -rf /usr/local/shibboleth-idp/cluster/server/data/objectdb/*
# /etc/init.d/terracotta start

Once this node reaches the PASSIVE-STANDBY state, you can proceed to restart the other node (newly active).
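
One way to wait for the restarted node to reach the PASSIVE-STANDBY state is to poll it; a minimal sketch, run on that node, assuming the server-stat.sh output format shown above:

No Format
# until /usr/local/terracotta/bin/server-stat.sh | grep -q 'state: PASSIVE-STANDBY'; do sleep 10; done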

If both nodes are restarted at the same time, e.g. after a power failure, it is likely that manual intervention will be required to clean up the data directories, as described above.

Log Files

On our IdPs, the terracotta server log is /usr/local/shibboleth-idp/cluster/server/logs/terracotta-server.log. Normally, this contains regular memory-usage statistics, as well as exceptional events, including problems communicating with clients. We also capture the server's standard output/error in /usr/local/terracotta/logs/terracotta.log.

The terracotta client log is in /usr/local/shibboleth-idp/cluster/client/logs-127.0.0.1/terracotta-client.log.
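
When diagnosing cluster problems or watching a restart, it can be convenient to follow the server and client logs together, e.g.:

No Format
# tail -F /usr/local/shibboleth-idp/cluster/server/logs/terracotta-server.log \
          /usr/local/terracotta/logs/terracotta.log \
          /usr/local/shibboleth-idp/cluster/client/logs-127.0.0.1/terracotta-client.log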