IPS team members should consult this playbook every time they participate in issue-resolution activity on behalf of ISDA.
Issue Definition
- All issue reports must come through isda-ops@mit.edu. Automated responses from monitoring software come through this address, too.
- If someone contacts IPS through another channel, we email the issue to this list ourselves, to make sure the rest of the team is notified.
- Team members in an Operations role who are on the floor discuss the issue and choose who will lead the issue-resolution cycle. This is the Tech Lead.
- The Tech Lead responds to the initiators of the message, alerting those parties that resolution is underway.
- The Tech Lead enters the issue into the Request Tracker database.
- The Tech Lead must define the type of issue and then proceed accordingly. Be sure to flag the issue in Request Tracker as the type that you have determined:
  - Emergency Response: A system, whether that is a whole server or a particular application, is unresponsive.
  - Bug Report: An application is not doing the right thing in some particular case, but it is not generally broken; a system is not down.
- For Operations staff, both kinds of issues are priority #1. It is acceptable for all other responsibilities to be on hold until resolution (Emergency Response) or handoff (Bug Report).
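The classification rule above can be sketched in a few lines of Python. This is only an illustration of the decision, not part of Request Tracker or any existing IPS tooling; the names `IssueType`, `classify_issue`, and `operations_exit_condition` are hypothetical.

```python
from enum import Enum


class IssueType(Enum):
    # The two issue types defined in this playbook.
    EMERGENCY_RESPONSE = "Emergency Response"
    BUG_REPORT = "Bug Report"


def classify_issue(system_unresponsive: bool) -> IssueType:
    """Pick the type to flag on the ticket (hypothetical helper).

    A whole server or a particular application being unresponsive is an
    Emergency Response; anything else is treated as a Bug Report.
    """
    if system_unresponsive:
        return IssueType.EMERGENCY_RESPONSE
    return IssueType.BUG_REPORT


def operations_exit_condition(issue_type: IssueType) -> str:
    """Return the point at which Operations staff may resume other work."""
    if issue_type is IssueType.EMERGENCY_RESPONSE:
        return "resolution"
    return "handoff"
```

For example, `classify_issue(False)` yields `IssueType.BUG_REPORT`, for which Operations staff stay on the issue only until handoff.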
Emergency Response-Resolution
- For an unresponsive component, either the Server Operations team or an automated process should have attempted to restart the component.
- Check to see if restart procedures occurred and if that temporarily resolved the problem.
- If not, Tech Lead restarts the component manually and determines whether this resolves the issue or whether a more persistent problem exists.
- Notification: preliminary problem description sent to Recipient List: initiator of problem ticket, zips@mit.edu, isda-leaders@mit.edu, isda-integrators@mit.edu, ops@mit.edu, and, if any end-user applications could have been affected, computing-help@mit.edu.
- In conjunction with managers currently present, Tech Lead forms Team to troubleshoot issue.
- ZIPS expects Emergency Response to take precedence over other project work.
- SCRUM: No resolution work should proceed until SCRUM is performed with available resources to discuss process and possible resolutions.
- Tech Lead is project manager for duration of issue resolution. Tech Lead is final arbiter for delegation of tasks, priorities, and timing.
- Notification: If resolution is lengthy, Tech Lead will update the Recipient List on the status of the resolution at least once per day.
- Post Mortem: Tech Lead reviews response with IPS team lead. If emergency response offers the opportunity for improvement of process, Tech Lead calls a post-mortem with parties who participated in the resolution.
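The two notification rules for Emergency Response (who is on the Recipient List, and when a status update is due) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, "at least once per day" is read as "no more than 24 hours since the last update", and nothing here sends mail or talks to Request Tracker. The addresses are the ones listed in this playbook.

```python
from datetime import datetime, timedelta

# Standing recipients named in the Emergency Response notification step.
STANDING_RECIPIENTS = [
    "zips@mit.edu",
    "isda-leaders@mit.edu",
    "isda-integrators@mit.edu",
    "ops@mit.edu",
]
# Added only when end-user applications could have been affected.
END_USER_HELP = "computing-help@mit.edu"


def emergency_recipient_list(initiator: str, end_user_apps_affected: bool) -> list[str]:
    """Build the Recipient List for an Emergency Response (hypothetical helper)."""
    recipients = [initiator, *STANDING_RECIPIENTS]
    if end_user_apps_affected:
        recipients.append(END_USER_HELP)
    # Drop duplicates while preserving order, in case the initiator
    # is already one of the standing lists.
    return list(dict.fromkeys(recipients))


def status_update_due(last_update: datetime, now: datetime) -> bool:
    """Assumption: an update is due once 24 hours have passed since the last one."""
    return now - last_update >= timedelta(days=1)
```

A Tech Lead's checklist script could call `emergency_recipient_list(initiator, True)` when drafting the preliminary problem description, then use `status_update_due` as a reminder during a lengthy resolution.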
Bug Reports-Handoff
- Tech Lead notifies a team leader or manager responsible for each tier of the system affected.
- Tech Lead collects information from managers about which mailing lists should receive notification of the issue. This is the Recipient List for this issue. Note it in the ticket.
- Tech Lead and managers determine staff members responsible for issue. This is the Team.
- Tech Lead sends message to Recipient List notifying them of the issue.
- Contact the Team and do preliminary troubleshooting to determine the nature of the issue and, therefore, the staff responsible for remedying the issue.
- Transfer the ticket information to the issue resolvers' system of record.
- Notify the Recipient List of the transfer and the managers now responsible for the issue.