Define the Problem
- ops-help@mit.edu is the primary communications conduit for service problems.
- Sending mail to ops-help generates a request tracker ticket. Use this ticket thread for the whole problem resolution (urgent response) or hand-off (bugs).
- Email representing urgent problems should be forwarded to the appropriate business owner, service owner, and the help desk.
- Support personnel (the service owner and/or the help desk) handle issue resolution, meaning customer or community communications, through their own channels.
- If a support request requiring a technical response came through another channel, forward it to ops-help.
- Automated alerts from monitoring software arrive through various service support lists. Assume the list that carried the alert is part of the communications thread.
Choose a Tech Lead
- Operations and Infrastructure must select a Tech Lead to see the problem resolution through to completion.
- The Tech Lead responds to the internal initiators of the problem report, alerting those parties that resolution is underway.
- The Tech Lead defines the type of issue: Urgent or Hand-Off.
- For Operations staff, both kinds of issues are pri-1. It is acceptable for all other responsibilities to be on hold until resolution (system down) or handoff (bug report).
Urgent Response (Resolution)
- For an unresponsive component, either the Server Operations team or an automated process should have attempted to restart the component.
- Someone familiar with the system must check whether restart procedures ran and whether they temporarily resolved the problem.
- If not, Tech Lead or designate restarts the component manually and determines whether this resolves the issue or a more persistent problem exists.
- Notification: preliminary problem description (and resolution, if applicable) is sent to the Recipient List:
- initiator of problem ticket, srstaff@mit.edu, the appropriate "announce" list for the service, and, if any end-user applications could have been affected, computing-help@mit.edu
- In conjunction with managers currently present, Tech Lead forms Team to troubleshoot issue.
- Emergency Response takes precedence over other project work.
- Tech Lead is project manager for duration of issue resolution.
- Tech Lead is final arbiter for delegation of tasks, priorities, and timing.
- Notification: If resolution is lengthy, Tech Lead will update the Recipient List on the status of the resolution at least once per day.
- Post Mortem: Tech Lead reviews the response. If the emergency response reveals an opportunity to improve the process, Tech Lead calls a post-mortem with the parties who participated in the resolution.
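The notification rules in this section can be sketched in code. This is only an illustration under stated assumptions, not an existing tool: the helper names `build_recipient_list` and `status_update_due` are hypothetical, and the daily-update rule is read here as "no more than 24 hours between updates". Only the srstaff and computing-help addresses come from the procedure itself.

```python
from datetime import datetime, timedelta

# Addresses named in the procedure above.
SRSTAFF = "srstaff@mit.edu"
COMPUTING_HELP = "computing-help@mit.edu"


def build_recipient_list(initiator, announce_list, end_user_apps_affected):
    """Assemble the Recipient List for an urgent response (hypothetical helper)."""
    recipients = [initiator, SRSTAFF, announce_list]
    # computing-help is added only when end-user applications could be affected.
    if end_user_apps_affected:
        recipients.append(COMPUTING_HELP)
    # Drop duplicate addresses while preserving order.
    return list(dict.fromkeys(recipients))


def status_update_due(last_update, now):
    """Return True when at least one day has passed since the last status update."""
    return now - last_update >= timedelta(days=1)
```

A Tech Lead's tooling could call `build_recipient_list` once when the ticket is opened, note the result in the ticket, and check `status_update_due` periodically while the resolution is in progress.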
Bug Reports (Hand-Off)
- Tech Lead notifies a team leader or manager responsible for each tier of the system affected.
- Tech Lead collects information from managers on which mailing lists should receive notification of the issue. This is the Recipient List for this issue. Note it in the ticket.
- Tech Lead and managers determine staff members responsible for issue. This is the Team.
- Tech Lead sends message to Recipient List notifying them of the issue.
- Team performs preliminary troubleshooting to determine the nature of the issue and identify the staff responsible to remedy the issue.
- Ticket information is transferred or linked to the issue resolvers' system of record.
- Recipient List is notified of the transfer and the managers now responsible for the issue.
Terminology
Problem: Requires a technical solution.
Issue: Requires customer communications or training.
Service Owner: IS&T team primarily responsible for a service.
Business Owner: Person or department that sponsors the service.
Urgent: Requires immediate triage, service down.
Hand-Off: Bug in the system, short-term project.
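
The Urgent and Hand-Off definitions can be expressed as a small triage sketch. The names `IssueType`, `classify`, and `completion_condition` are hypothetical, and treating "service down" as the single deciding input is a simplifying assumption. In practice the Tech Lead makes this call.

```python
from enum import Enum


class IssueType(Enum):
    URGENT = "Urgent"      # requires immediate triage, service down
    HAND_OFF = "Hand-Off"  # bug in the system, short-term project


def classify(service_down):
    """Assumed triage rule: a down service is Urgent, anything else is Hand-Off."""
    return IssueType.URGENT if service_down else IssueType.HAND_OFF


def completion_condition(issue_type):
    """Event that ends the pri-1 work for each type, per the procedure."""
    return "resolution" if issue_type is IssueType.URGENT else "handoff"
```

For example, `completion_condition(classify(True))` yields `"resolution"`, matching the rule that urgent work continues until the system is restored.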