This page is under construction
The Shibboleth IdP uses terracotta for clustering (idp session state, etc.). The terracotta configuration file is /usr/local/shibboleth-idp/conf/tc-config.xml. The terracotta server is a separate (Java) program that runs on each node in the cluster; tomcat (the container for the idp web application) is the terracotta client.
Unfortunately, the terracotta software does not recover cleanly from every failure. It can have problems recovering if both nodes are restarted at the same time (e.g. following a power outage). Also, if the server loses contact with a client, for example because the client spends too long in a garbage collection, the server can declare the client dead and then refuse its subsequent reconnection attempts. In that case, the client (i.e. tomcat) must be restarted to rejoin the cluster.
Checking the cluster health
The terracotta distribution includes /usr/local/terracotta/bin/server-stat.sh, which performs a JMX query to obtain a server's status, including whether it is the active or the passive node. Without arguments, it connects to the server on localhost; you can also provide the -f <configfile> option to query all servers in the cluster. (The standard terracotta JMX TCP port is 9520, so when querying a server other than localhost, that server's firewall must allow connections to this port from the peer node(s).) In a healthy cluster, one server will have the ACTIVE role (state ACTIVE-COORDINATOR), and the other(s) will have the PASSIVE role (state PASSIVE-STANDBY). For example, here is the output displayed on the active node:
# /usr/local/terracotta/bin/server-stat.sh
localhost.health: OK
localhost.role: ACTIVE
localhost.state: ACTIVE-COORDINATOR
localhost.jmxport: 9520
Here is the output displayed on a passive node:
# /usr/local/terracotta/bin/server-stat.sh
localhost.health: OK
localhost.role: PASSIVE
localhost.state: PASSIVE-STANDBY
localhost.jmxport: 9520
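For scripted checks (for example from a cron or monitoring job), the server-stat.sh output can be parsed line by line. This is a minimal sketch, assuming the output is one "host.field: value" pair per line as in the examples above; the function names tc_stat_field and tc_server_healthy are hypothetical and are not part of the terracotta distribution.

```bash
#!/bin/bash
# Hypothetical helper (not a terracotta tool): reads server-stat.sh output on
# stdin and prints the value of one field (health, role, state, jmxport).
tc_stat_field() {
    local host="$1" field="$2"
    # Lines look like "localhost.role: ACTIVE"; match the exact key, print the value.
    awk -v key="${host}.${field}:" '$1 == key { print $2 }'
}

# Hypothetical helper: succeeds only if health is OK and the state is one of
# the two states seen in a healthy cluster.
tc_server_healthy() {
    local out health state
    out=$(cat)
    health=$(printf '%s\n' "$out" | tc_stat_field localhost health)
    state=$(printf '%s\n' "$out" | tc_stat_field localhost state)
    [ "$health" = "OK" ] || return 1
    case "$state" in
        ACTIVE-COORDINATOR|PASSIVE-STANDBY) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a cluster node:
#   /usr/local/terracotta/bin/server-stat.sh | tc_server_healthy && echo healthy
```

Matching the whole key ("localhost.state:") rather than grepping for a substring avoids confusing, say, the role and state lines.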
The server-stat.sh tool is limited, however, in that it only displays the status of the server itself; it says nothing about which clients, if any, are connected to the server. To address that, we have a homegrown Java tool, /usr/local/lib/tc-status-exe.jar, which displays the client(s) connected to each server as well as the server status; for example:
# $JAVA_HOME/bin/java -jar /usr/local/lib/tc-status-exe.jar -c /usr/local/shibboleth-idp/conf/tc-config.xml
idp-1.mit.edu.health=OK
idp-1.mit.edu.role=PASSIVE
idp-1.mit.edu.state=PASSIVE-STANDBY
idp-1.mit.edu.clientcount=0
idp-2.mit.edu.health=OK
idp-2.mit.edu.role=ACTIVE
idp-2.mit.edu.state=ACTIVE-COORDINATOR
idp-2.mit.edu.clientcount=2
idp-2.mit.edu.clientlist=idp-2.mit.edu:36209,idp-1.mit.edu:48462
In this example, idp-2.mit.edu is the active node, idp-1.mit.edu is passive, and both clients are properly connected to the active server. (The number following the colon in each clientlist entry is the client's TCP port number.) If a client is missing from the active node's clientlist, restart that client's tomcat.
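The clientlist check can be scripted the same way. This is a minimal sketch, assuming tc-status-exe.jar prints one "host.key=value" pair per line as shown above; the function names are hypothetical. tc_missing_clients prints each expected client host that does not appear in the active server's clientlist and returns non-zero if any are missing, i.e. if some tomcat needs restarting. The list of expected clients has to be supplied by the caller.

```bash
#!/bin/bash
# Hypothetical helper: reads tc-status-exe.jar output on stdin and prints the
# name of the server whose role is ACTIVE.
tc_active_server() {
    # Lines look like "idp-2.mit.edu.role=ACTIVE"; strip the ".role" suffix.
    awk -F'=' '$1 ~ /\.role$/ && $2 == "ACTIVE" { sub(/\.role$/, "", $1); print $1 }'
}

# Hypothetical helper: prints the host part of each entry in the active
# server's clientlist, one per line (the ":port" suffix is dropped).
tc_active_clients() {
    local out active
    out=$(cat)
    active=$(printf '%s\n' "$out" | tc_active_server)
    [ -n "$active" ] || return 1
    printf '%s\n' "$out" |
        awk -F'=' -v key="${active}.clientlist" '$1 == key { print $2 }' |
        tr ',' '\n' |
        sed 's/:[0-9]*$//'
}

# Hypothetical helper: prints every expected client host (given as arguments)
# that is missing from the active server's clientlist. Returns 1 if any are
# missing, 2 if no active server was found, 0 otherwise.
tc_missing_clients() {
    local out clients host missing=0
    out=$(cat)
    clients=$(printf '%s\n' "$out" | tc_active_clients) || { echo "no active server" >&2; return 2; }
    for host in "$@"; do
        if ! printf '%s\n' "$clients" | grep -qxF "$host"; then
            echo "$host"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, with the two IdP nodes from the example as the expected clients:
#   $JAVA_HOME/bin/java -jar /usr/local/lib/tc-status-exe.jar \
#       -c /usr/local/shibboleth-idp/conf/tc-config.xml |
#       tc_missing_clients idp-1.mit.edu idp-2.mit.edu
```

Each host printed by tc_missing_clients identifies a node whose tomcat should be restarted.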
Restarting the cluster
Determine which node is active by checking each node's server state, using either the /usr/local/terracotta/bin/server-stat.sh script or the homegrown tc-status-exe.jar tool (see above). The active server node will be in the ACTIVE-COORDINATOR state, e.g.:
# /usr/local/terracotta/bin/server-stat.sh
localhost.health: OK
localhost.role: ACTIVE
localhost.state: ACTIVE-COORDINATOR
localhost.jmxport: 9520
A passive node should be in the PASSIVE-STANDBY state:
# /usr/local/terracotta/bin/server-stat.sh
localhost.health: OK
localhost.role: PASSIVE
localhost.state: PASSIVE-STANDBY
localhost.jmxport: 9520
Restart the active node first; the passive node should detect this and take over the active role. Wait for the restarted node to enter the PASSIVE-STANDBY state before proceeding, and check its server log for errors. If the node has trouble recovering its state, the likely cause is corrupted saved data; in that case it will not reach the PASSIVE-STANDBY state, and manual intervention is needed. First, try removing the terracotta server's "dirty" saved object data, e.g.:
# /etc/init.d/terracotta stop
# rm -rf /usr/local/shibboleth-idp/cluster/server/data/dirty-objectdb-backup/*
# /etc/init.d/terracotta start
If it still fails to reach the PASSIVE-STANDBY state, stop the server again and remove the object data, e.g.:
# /etc/init.d/terracotta stop
# rm -rf /usr/local/shibboleth-idp/cluster/server/data/objectdb/*
# /etc/init.d/terracotta start
Once this node reaches the PASSIVE-STANDBY state, you can proceed to restart the other (now active) node.
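The "wait for PASSIVE-STANDBY" step can be automated with a simple polling loop. This is a sketch under assumptions, not part of the terracotta tools: wait_for_tc_state is a hypothetical name, and the number of attempts and the polling interval in the usage line are arbitrary placeholders. It only waits; checking the server log and cleaning up the data directories still have to be done by hand as described above.

```bash
#!/bin/bash
# Hypothetical helper: runs a status command (server-stat.sh, or a stub for
# testing) up to <tries> times, <interval> seconds apart, until it reports
# "localhost.state: <want>". Prints "reached <want>" and returns 0 on success;
# returns 1 if the state is never reached.
wait_for_tc_state() {
    local want="$1" tries="$2" interval="$3" state="" i=1
    shift 3
    while [ "$i" -le "$tries" ]; do
        # Extract the state value from lines like "localhost.state: PASSIVE-STANDBY".
        state=$("$@" 2>/dev/null | awk '$1 == "localhost.state:" { print $2 }')
        if [ "$state" = "$want" ]; then
            echo "reached $want"
            return 0
        fi
        # Don't sleep after the final attempt.
        if [ "$i" -lt "$tries" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
    echo "gave up waiting for $want (last state: ${state:-unknown})" >&2
    return 1
}

# Usage after restarting the previously active node (placeholder values:
# up to 60 checks, 10 seconds apart):
#   wait_for_tc_state PASSIVE-STANDBY 60 10 /usr/local/terracotta/bin/server-stat.sh
```

Passing the status command as arguments keeps the loop independent of the exact server-stat.sh invocation (e.g. with or without -f).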
If both nodes are restarted at the same time, e.g. after a power failure, manual intervention to clean up the data directories will likely be required.
Log Files
On our IdPs, the terracotta server log is /usr/local/shibboleth-idp/cluster/server/logs/terracotta-server.log. Normally it contains periodic memory-usage statistics, but it also records exceptional events, including problems communicating with clients. We also capture the server's standard output and standard error in /usr/local/terracotta/logs/terracotta.log.
The terracotta client log is in /usr/local/shibboleth-idp/cluster/client/logs-127.0.0.1/terracotta-client.log.