Define the Problem

  1. ops-help@mit.edu is the primary communications channel for service problems.
    1. Mail to ops-help generates a request tracker ticket. Use that ticket's thread for the entire problem resolution (urgent response) or hand-off (bugs).
    2. Email reporting an urgent problem should be forwarded to the appropriate business owner, service owner, and the help desk.
    3. Support personnel (the service owner and/or the help desk) handle issue resolution, meaning customer or community communications, through their own channels.
    4. If a support request requiring a technical response arrives through another channel, forward it to ops-help.
    5. Automated messages from monitoring software arrive through various service support lists. Treat the relevant list as part of the communications thread.
  2. Choose a Tech Lead
    1. Operations and Infrastructure must select a Tech Lead to see the problem resolution through to completion.
    2. The Tech Lead responds to the internal initiators of the problem report, alerting those parties that resolution is underway.
    3. The Tech Lead defines the type of issue: Urgent or Hand-Off.
    4. For Operations staff, both kinds of issues are pri-1. It is acceptable for all other responsibilities to be on hold until resolution (system down) or hand-off (bug report).
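
The triage decision above can be sketched in a few lines of Python. This is only an illustration, not part of the procedure: the `Report` fields and the function names are hypothetical, and the rule "service down means Urgent, otherwise Hand-Off" is a simplification taken from the Terminology section.

```python
from dataclasses import dataclass
from enum import Enum


class IssueType(Enum):
    URGENT = "Urgent"      # service down, requires immediate triage
    HAND_OFF = "Hand-Off"  # bug in the system, short-term project


@dataclass
class Report:
    """Hypothetical summary of an ops-help ticket; field names are illustrative."""
    summary: str
    service_down: bool


def classify(report: Report) -> IssueType:
    """Assumed rule: a service-down report is Urgent, anything else is Hand-Off."""
    return IssueType.URGENT if report.service_down else IssueType.HAND_OFF


def hold_other_work_until(issue_type: IssueType) -> str:
    """Both types are pri-1; this names the milestone that releases other work."""
    return "resolution" if issue_type is IssueType.URGENT else "hand-off"
```

For example, `classify(Report("login page times out", service_down=True))` yields `IssueType.URGENT`, and other responsibilities may stay on hold until resolution.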

Urgent Response (Resolution)

  1. For an unresponsive component, either the Server Operations team or an automated process should have attempted to restart the component.
    1. Someone familiar with the system must check to see if restart procedures occurred and if that temporarily resolved the problem.
    2. If not, the Tech Lead or a designee restarts the component manually and determines whether this resolves the issue or whether a more persistent problem exists.
  2. Notification: send a preliminary problem description (and resolution, if applicable) to the Recipient List:
    1. the initiator of the problem ticket, srstaff@mit.edu, the appropriate "announce" list for the service, and, if any end-user applications could have been affected, computing-help@mit.edu.
  3. In conjunction with managers currently present, Tech Lead forms Team to troubleshoot issue.
    1. Emergency Response takes precedence over other project work.
    2. Tech Lead is project manager for duration of issue resolution.
    3. Tech Lead is final arbiter for delegation of tasks, priorities, and timing.
  4. Notification: If resolution is lengthy, the Tech Lead will update the Recipient List on the status of the resolution at least once per day.
  5. Post Mortem: The Tech Lead reviews the response. If the emergency response reveals an opportunity to improve the process, the Tech Lead calls a post-mortem with the parties who participated in the resolution.
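
As a rough sketch of the two notification steps above, the Recipient List and the daily-update rule could be modeled as follows. The addresses srstaff@mit.edu and computing-help@mit.edu come from this page. The function names and the example announce-list address are hypothetical, and the one-day threshold is one reading of "at least once per day".

```python
from datetime import datetime, timedelta
from typing import List


def build_recipient_list(initiator: str, announce_list: str,
                         end_user_apps_affected: bool) -> List[str]:
    """Assemble the Recipient List for an urgent response."""
    recipients = [initiator, "srstaff@mit.edu", announce_list]
    # computing-help is added only if end-user applications could be affected.
    if end_user_apps_affected:
        recipients.append("computing-help@mit.edu")
    return recipients


def status_update_due(last_update: datetime, now: datetime) -> bool:
    """True once a full day has passed since the last status update."""
    return now - last_update >= timedelta(days=1)
```

Calling `build_recipient_list("reporter@example.edu", "service-announce@example.edu", True)` returns four addresses, with computing-help@mit.edu last. Both arguments in that call are placeholder addresses.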

Bug Reports (Hand-Off)

  1. Tech Lead notifies a team leader or manager responsible for each tier of the system affected.
    1. Tech Lead collects information from managers on which mailing lists should be notified of the issue. This is the Recipient List for this issue. Note it in the ticket.
    2. Tech Lead and managers determine staff members responsible for issue. This is the Team.
  2. Tech Lead sends message to Recipient List notifying them of the issue.
  3. Team performs preliminary troubleshooting to determine the nature of the issue and identify the staff responsible for remedying it.
  4. Ticket information is transferred or linked to the system of record used by the issue resolvers.
  5. Recipient List is notified of the transfer and the managers now responsible for the issue.
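
The hand-off steps above amount to a small record that travels with the ticket. The sketch below is illustrative only: the `HandOffTicket` fields, the `transfer` helper, and the wording of the notification are assumptions, not a description of any existing ticketing integration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HandOffTicket:
    """Hypothetical record of what the Tech Lead notes in the ticket."""
    ticket_id: str
    recipient_list: List[str] = field(default_factory=list)
    team: List[str] = field(default_factory=list)
    linked_record: Optional[str] = None  # reference in the resolvers' system of record
    responsible_managers: List[str] = field(default_factory=list)


def transfer(ticket: HandOffTicket, record_ref: str, managers: List[str]) -> str:
    """Link the ticket to the system of record and draft the notice for the Recipient List."""
    ticket.linked_record = record_ref
    ticket.responsible_managers = list(managers)
    return (f"Ticket {ticket.ticket_id} has been transferred to {record_ref}. "
            f"Managers now responsible: {', '.join(managers)}.")
```

The returned string is the text the Tech Lead would send to `ticket.recipient_list` in the final step.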

Terminology

Problem: Requiring a technical solution.

Issue: Requiring customer communications or training.

Service Owner: IS&T team primarily responsible for a service.

Business Owner: Person or department that sponsors the service.

Urgent: Requires immediate triage, service down.

Hand-Off: Bug in the system, short-term project.
