
IPS team members should consult this playbook every time they participate in issue-resolution activity on behalf of ISDA.

Issue Definition

  1. All issue reports must come through isda-ops@mit.edu. Automated responses from monitoring software come through this address, too.
    1. If someone contacts IPS through another channel, we email the issue to this list ourselves, to make sure the rest of the team is notified.
  2. Team members in an Operations role who are on the floor discuss the issue and choose who will lead the issue-resolution cycle. This is the Tech Lead.
  3. The Tech Lead responds to the initiators of the message, alerting those parties that resolution is underway.
  4. The Tech Lead enters the issue into the Request Tracker database.
  5. The Tech Lead must define the type of issue and then proceed accordingly. Be sure to flag the issue in Request Tracker as the type that you have determined:
    1. Emergency Response: A system, whether that is a whole server or a particular application, is unresponsive.
    2. Bug Report: An application behaves incorrectly in a particular case, but it is not generally broken and no system is down.
    3. For Operations staff, both kinds of issues are priority #1. It is acceptable for all other responsibilities to be on hold until resolution (Emergency Response) or handoff (Bug Report).
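
The triage rule in step 5 can be sketched in a few lines of Python. This is only an illustration: the names `IssueType`, `Issue`, `system_unresponsive`, and `classify` are hypothetical and are not part of Request Tracker or any ISDA tooling.

```python
from dataclasses import dataclass
from enum import Enum


class IssueType(Enum):
    """The two issue types named in this playbook."""
    EMERGENCY_RESPONSE = "Emergency Response"
    BUG_REPORT = "Bug Report"


@dataclass
class Issue:
    # Hypothetical record; field names are assumptions for illustration.
    summary: str
    system_unresponsive: bool  # True if a whole server or an application is unresponsive


def classify(issue: Issue) -> IssueType:
    """An unresponsive system is an emergency; anything else is a bug report."""
    if issue.system_unresponsive:
        return IssueType.EMERGENCY_RESPONSE
    return IssueType.BUG_REPORT
```

Whatever type this returns is the one the Tech Lead would flag on the ticket before proceeding.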

Emergency Response-Resolution

  1. For an unresponsive component, either the Server Operations team or an automated process should have attempted to restart the component.
    1. Check to see if restart procedures occurred and if that temporarily resolved the problem.
    2. If not, the Tech Lead restarts the component manually and determines whether this resolves the issue or whether a more persistent problem exists.
  2. Notification: The Tech Lead sends a preliminary problem description (and resolution, if applicable) to the Recipient List:
    the initiator of the problem ticket, zips@mit.edu, isda-leaders@mit.edu, isda-integrators@mit.edu, ops@mit.edu, and, if any end-user applications could have been affected, computing-help@mit.edu
  3. In conjunction with the managers currently present, the Tech Lead forms a Team to troubleshoot the issue.
    1. ZIPS expects Emergency Response to take precedence over other project work.
  4. SCRUM: No resolution work should proceed until a SCRUM has been held with the available resources to discuss process and possible resolutions.
    1. The Tech Lead is the project manager for the duration of issue resolution and is the final arbiter for delegation of tasks, priorities, and timing.
  5. Notification: If resolution is lengthy, the Tech Lead will update the Recipient List on the status of the resolution at least once per day.
  6. Post Mortem: The Tech Lead reviews the response with the IPS team lead. If the emergency response offers an opportunity to improve the process, the Tech Lead calls a post-mortem with the parties who participated in the resolution.
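
The Recipient List in step 2 can be assembled mechanically. The sketch below shows one way to do it; the function name `emergency_recipients` and its parameters are hypothetical, and only the addresses are taken from this playbook.

```python
# Addresses that always receive emergency notifications, per step 2.
BASE_RECIPIENTS = [
    "zips@mit.edu",
    "isda-leaders@mit.edu",
    "isda-integrators@mit.edu",
    "ops@mit.edu",
]

# Added only when end-user applications could have been affected.
END_USER_HELP = "computing-help@mit.edu"


def emergency_recipients(initiator: str, end_users_affected: bool) -> list[str]:
    """Build the Recipient List for an Emergency Response notification."""
    recipients = [initiator, *BASE_RECIPIENTS]
    if end_users_affected:
        recipients.append(END_USER_HELP)
    # Drop duplicates (e.g. the initiator is already one of the lists),
    # keeping the first occurrence of each address.
    return list(dict.fromkeys(recipients))
```

The same list would be reused for the daily status updates in step 5, so it is worth recording it in the ticket once it is built.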

Bug Reports-Handoff

  1. The Tech Lead notifies a team leader or manager responsible for each tier of the system affected.
    1. The Tech Lead asks the managers which mailing lists should receive notification of the issue. This is the Recipient List for this issue. Note it in the ticket.
    2. The Tech Lead and the managers determine which staff members are responsible for the issue. This is the Team.
  2. The Tech Lead sends a message to the Recipient List notifying them of the issue.
  3. Contact the Team and do preliminary troubleshooting to determine the nature of the issue and, therefore, which staff are responsible for remedying it.
  4. Transfer the ticket information to the issue resolvers' system of record.
  5. Notify the Recipient List of the transfer and the managers now responsible for the issue.
