Issue-Response Playbook

IPS team members should consult this playbook every time they participate in issue-resolution activity on behalf of ISDA.

Issue Definition

Our primary point of contact for support issues is the isda-ops mailing list. Everyone on isda-ops@mit.edu should be monitoring this mailing list and the ISDA:Admin RT queue on help.mit.edu (https://help.mit.edu/Search/Results.html?Order=DESC&Query=(%20Queue%20%3D%20'ISDA%3A%3AAdmin'%20)%20and%20(%20Status%20%3D%20'new'%20or%20Status%20%3D%20'open'%20or%20Status%20%3D%20'stalled'%20)&Rows=50&OrderBy=id&Page=1&Format=%0A%20%20%20'%3Cb%3E%3Ca%20href%3D%22%2FTicket%2FDisplay.html%3Fid%3D__id__%22%3E__id__%3C%2Fa%3E%3C%2Fb%3E%2FTITLE%3A%23'%2C%0A%20%20%20'%3Cb%3E%3Ca%20href%3D%22%2FTicket%2FDisplay.html%3Fid%3D__id__%22%3E__Subject__%3C%2Fa%3E%3C%2Fb%3E%2FTITLE%3ASubject'%2C%0A%20%20%20Status%2C%0A%20%20%20QueueName%2C%20%0A%20%20%20OwnerName%2C%20%0A%20%20%20Priority%2C%20%0A%20%20%20'__NEWLINE__'%2C%0A%20%20%20''%2C%20%0A%20%20%20'%3Csmall%3E__Requestors__%3C%2Fsmall%3E'%2C%0A%20%20%20'%3Csmall%3E__CreatedRelative__%3C%2Fsmall%3E'%2C%0A%20%20%20'%3Csmall%3E__ToldRelative__%3C%2Fsmall%3E'%2C%0A%20%20%20'%3Csmall%3E__LastUpdatedRelative__%3C%2Fsmall%3E'%2C%0A%20%20%20'%3Csmall%3E__TimeLeft__%3C%2Fsmall%3E').
1. Email representing immediate operational issues should be forwarded to RT (via the isda-admin-rt list), while email representing bug reports should be filed in Jira.
2. If someone makes a support request through another channel, we email the issue to isda-ops (and file it appropriately in RT or Jira) to make sure the rest of the team is notified.
3. Automated responses from monitoring software come through to isda-ops@mit.edu, where they can be monitored and forwarded to the RT queue if necessary.
All team members in an Operations role who are on the floor must meet to discuss the issue. A person is selected to lead the resolution. This is the Tech Lead.
1. If possible, they must include non-operational staff in the discussion who are assigned to, or conversant with, the system in question.
2. The Tech Lead is not necessarily operations personnel. This assignment is up to staff available at the time of the issue report.
The Tech Lead responds to the initiators of the message, alerting those parties that resolution is underway.
The Tech Lead must define the type of issue and then proceed accordingly. Be sure to file the issue appropriately (Jira or RT) depending on type:
1. Urgent Response: A system, whether that is a whole server or a particular application, is unresponsive.
2. Bug Report: An application is not doing the right thing in some particular case, but it is not generally broken; a system is not down.
3. N.B.: for Operations staff, both kinds of issues are priority #1. It is acceptable for all other responsibilities to be on hold until resolution (system down) or handoff (bug report).
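
The classification and filing rule above can be sketched in a few lines of Python. This is only an illustration of the decision, not a tool the team runs; the names `IssueType`, `classify`, and `tracker_for` are hypothetical, and the single yes/no input is an assumed simplification of the Tech Lead's judgment.

```python
from enum import Enum


class IssueType(Enum):
    URGENT_RESPONSE = "Urgent Response"  # a server or application is unresponsive
    BUG_REPORT = "Bug Report"            # wrong behavior in some case, nothing is down


# Where each issue type is filed, per the playbook:
# operational issues go to RT, bug reports go to Jira.
TRACKER = {
    IssueType.URGENT_RESPONSE: "RT",
    IssueType.BUG_REPORT: "Jira",
}


def classify(system_unresponsive: bool) -> IssueType:
    """Return the issue type based on whether a system is unresponsive (assumed input)."""
    if system_unresponsive:
        return IssueType.URGENT_RESPONSE
    return IssueType.BUG_REPORT


def tracker_for(system_unresponsive: bool) -> str:
    """Return the name of the system in which the issue should be filed."""
    return TRACKER[classify(system_unresponsive)]


if __name__ == "__main__":
    print(tracker_for(True))
    print(tracker_for(False))
```

In practice the Tech Lead makes this call by hand; the sketch only records that the issue type, and nothing else, determines where the ticket lives.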

Urgent Response (Resolution)

For an unresponsive component, either the Server Operations team or an automated process should have attempted to restart the component.
1. Someone familiar with the system must check to see if restart procedures occurred and if that temporarily resolved the problem.
2. If not, the Tech Lead or a designee restarts the component manually and determines whether this resolves the issue or a more persistent problem exists.
Notification: preliminary problem description (and resolution, if applicable) sent to Recipient List:
initiator of problem ticket, isda-leaders@mit.edu, isda-integrators@mit.edu, isda-ops@mit.edu, and if any end-user applications could have been affected, computing-help@mit.edu
In conjunction with managers currently present, Tech Lead forms Team to troubleshoot issue.
1. It is the IPS expectation that Emergency Response takes precedence over other project work.
SCRUM: No resolution work should proceed until SCRUM is performed with available resources to discuss process and possible resolutions.
1. Tech Lead is project manager for duration of issue resolution. Tech Lead is final arbiter for delegation of tasks, priorities, and timing.
Notification: If resolution is lengthy, the Tech Lead will update the Recipient List at least once per day on the status of the resolution.
Post Mortem: Tech Lead reviews response with IPS team lead. If emergency response offers the opportunity for improvement of process, Tech Lead calls a post-mortem with parties who participated in the resolution.
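
The two Notification steps in this section can be sketched as follows. The addresses are the ones listed for the Recipient List; the function names `build_recipient_list` and `update_overdue` are hypothetical, and treating "at least once per day" as "no more than 24 hours between updates" is an assumption made for the example.

```python
from datetime import datetime, timedelta

# Lists that always receive Urgent Response notifications, per the playbook.
BASE_RECIPIENTS = [
    "isda-leaders@mit.edu",
    "isda-integrators@mit.edu",
    "isda-ops@mit.edu",
]
# Added only when end-user applications could have been affected.
END_USER_HELP = "computing-help@mit.edu"


def build_recipient_list(initiator: str, end_users_affected: bool) -> list[str]:
    """Assemble the Recipient List: ticket initiator first, then the standing lists."""
    recipients = [initiator, *BASE_RECIPIENTS]
    if end_users_affected:
        recipients.append(END_USER_HELP)
    # Drop duplicates while preserving order (dict keys keep insertion order).
    return list(dict.fromkeys(recipients))


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """True if more than 24 hours have passed since the last status update (assumed rule)."""
    return now - last_update > timedelta(days=1)


if __name__ == "__main__":
    # "requestor@example.com" is a placeholder initiator address.
    for address in build_recipient_list("requestor@example.com", end_users_affected=True):
        print(address)
```

Keeping the initiator first and removing duplicates means the person who reported the problem is never dropped, even when the report came from one of the standing lists.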

Bug Reports-Handoff

Tech Lead notifies a team leader or manager responsible for each tier of the system affected.
1. Tech Lead collects information from managers on which mailing lists should receive notification of the issue. This is the Recipient List for this issue. Note it in the ticket.
2. Tech Lead and managers determine staff members responsible for issue. This is the Team.
Tech Lead sends message to Recipient List notifying them of the issue.
Contact the Team and do preliminary troubleshooting to determine the nature of the issue and, therefore, the staff responsible for remedying it.
Transfer ticket information to system of record for the issue resolvers.
Notify the Recipient List of the transfer and the managers now responsible for the issue.
