Define the Severity (Identification)

The severity of the issue determines the speed of the response and the process to follow. Severity is a judgment call made jointly by support staff and the service owner.

  • Severity 1: Service(s) functionality completely disrupted. Use urgent response process.
  • Severity 2: Functionality significantly impaired but service(s) available. Use hand-off response process.
  • Severity 3: Functionality slightly impaired. Use hand-off response process.
  • Severity 4: Cosmetic or edge-case defect. Use hand-off response process.
  • Assign staff and respond to Severity 1 issues immediately.
  • Assign staff and respond to Severity 2 issues immediately if service is business critical.
  • Schedule response to Severity 3 issues, accepting impact on project timelines.
  • Schedule response to Severity 4 issues, avoiding impact on project timelines.
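
The severity rules above can be sketched as a small lookup. This is an illustrative sketch only, not an IS&T tool: the names `ResponsePlan` and `plan_response` are hypothetical, and the "scheduled" timing for a Severity 2 issue on a service that is not business critical is an assumption, since the playbook does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResponsePlan:
    process: str                           # "urgent" or "hand-off"
    timing: str                            # "immediate" or "scheduled"
    accept_timeline_impact: Optional[bool]  # None where the playbook is silent


def plan_response(severity: int, business_critical: bool = False) -> ResponsePlan:
    """Map a severity level (1-4) to the response described in the playbook."""
    if severity == 1:
        return ResponsePlan("urgent", "immediate", None)
    if severity == 2:
        # Assumption: non-critical Severity 2 issues are scheduled.
        timing = "immediate" if business_critical else "scheduled"
        return ResponsePlan("hand-off", timing, None)
    if severity == 3:
        return ResponsePlan("hand-off", "scheduled", True)
    if severity == 4:
        return ResponsePlan("hand-off", "scheduled", False)
    raise ValueError(f"unknown severity: {severity}")


print(plan_response(2, business_critical=True))
```

Severity itself remains a judgment call; the sketch only records what follows once a level has been agreed.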

Communication Channels

  • Identify the service owner. This person is the main point of contact for the problem response or names a designate.
  • By whatever channel the problem comes in, the recipient in IS&T must contact the service owner and hdnotify@mit.edu.
  • For Severity 1 issues:
    • The service owner must contact IS&T Senior Staff.
    • The service owner then coordinates a conference call with IS&T Senior Staff and the business owner.
    • The service owner updates 3Down.
    • The service owner can delegate management of triage to specific staff at this point.
  • The service owner defines a recipient list:
    • Track the problem response (technical) thread with the issue-tracking mechanism used by the service owner's team; if none exists, use ops-help@mit.edu and the ops-help Request Tracker queue.
    • Keep the Help Desk in the issue-resolution (customer communications) loop using the hdnotify@mit.edu email list.
    • For non-urgent or longer-term resolutions, the service owner can get a designate from the Help Desk to handle issue resolution, if preferred.
    • Add business owners and known announce and support lists.
  • The service owner should know who has support responsibilities for the service across IS&T teams. If an ad-hoc support team must be formed, this must be escalated to senior staff for assignment of permanent support responsibilities (see Close Out, below).
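
One way to picture the recipient list is as a de-duplicated, ordered list of addresses. The sketch below is a minimal illustration: the function name `build_recipient_list` and the `example.edu` addresses are hypothetical, and only `hdnotify@mit.edu` comes from the playbook.

```python
HELP_DESK_LIST = "hdnotify@mit.edu"  # Help Desk list named in the playbook


def build_recipient_list(business_owners, announce_lists=(), support_lists=()):
    """Combine the Help Desk list, business owners, and known announce and
    support lists into one ordered list without duplicates."""
    seen = set()
    recipients = []
    candidates = (HELP_DESK_LIST, *business_owners, *announce_lists, *support_lists)
    for address in candidates:
        cleaned = address.strip()
        key = cleaned.lower()  # compare addresses case-insensitively
        if key and key not in seen:
            seen.add(key)
            recipients.append(cleaned)
    return recipients


# Hypothetical addresses, for illustration only.
print(build_recipient_list(
    ["owner@example.edu"],
    announce_lists=["service-announce@example.edu"],
    support_lists=["Owner@example.edu"],
))
```

Putting the Help Desk list first reflects the requirement to keep the Help Desk in the customer-communications loop for every problem.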

Urgent Response (Resolution)

  1. For a completely unresponsive component, Operations and Infrastructure staff have license to perform basic system operations tasks (restart, adjust capacity, etc.) to get the service back online while attempting to contact the service owner and others.
  2. The service owner updates 3Down in addition to emailing the recipient list.
  3. When basic system operations are not effective, the service owner (or delegate) takes over as lead of specialist support staff.
    1. Transfer the ticket from the initial queue (computing-help, ops-help, etc.) to the support queue for the service.
    2. From here, no one should perform any task without direction from the lead.
    3. Service owner is final arbiter for delegation of tasks, priorities, and timing.
    4. Service owner (or delegate) and all support staff try to identify the root cause of the problem in addition to restoring service.
  4. The service and business owners must validate the service for the problem to be closed.
  5. If the resolution is temporary, permanent resolution must be addressed. The service owner decides whether to stay in urgent mode or shift to hand-off mode (below).
  6. The service owner communicates with the Recipient List and updates 3Down.
  7. During an extended outage, the service owner or help-desk designate updates the recipient list and 3Down at least daily.
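
The "at least daily" rule for extended outages can be expressed as a simple elapsed-time check. This is a hedged sketch: the function name `status_update_overdue` is hypothetical, the dates are illustrative, and "daily" is read as "no more than 24 hours between updates", which is an assumption.

```python
from datetime import datetime, timedelta


def status_update_overdue(last_update, now, max_gap=timedelta(days=1)):
    """Return True when more than max_gap has passed since the last update
    to the recipient list and 3Down."""
    return now - last_update > max_gap


last = datetime(2011, 1, 24, 9, 0)  # illustrative timestamp
print(status_update_overdue(last, datetime(2011, 1, 25, 10, 0)))
```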

Hand-Off (Bug Reports)

  1. The service owner (or delegate) can move the issue ticket from any initial issue-tracker to the responsible developers' system of record for the service.
  2. Service owner (or delegate) or help-desk designate notifies the recipient list that the problem has moved into a longer-term resolution, with estimates.
    1. Service owner and managers determine the staff members responsible for the issue. This is the team.
    2. Managers and business owners modify team schedules when the problem interferes with project schedules.
    3. This is not a resource request; it is a project-timeline adjustment to compensate for production support.
  3. Service owner (or delegate) and team communicate relevant technical details and workarounds to the Help Desk (hdnotify@mit.edu), out of band from business owners and end users.
  4. As resolution moves to different components of the system, hand off tech-lead role to appropriate persons.
  5. Use Pipeline for additional communications.

Close Out

  1. The service owner documents the following in the internal Hermes site for reference (link TBD):
    1. outages and resolution (including long-term action items)
    2. how well the Problem Response process was followed
    3. follow-up on resolution long-term action items
  2. The service owner follows up with the business owner on long-term items.

The following administrative tasks should happen, if needed, before the problem is closed.

  • If the service owner was unknown or vaguely defined, escalate to senior staff and update Operations and Help Desk records (IS&T Service Portfolio TBD).
  • If specialist development staff with the right skills were undefined or only temporarily assigned, the service owner and appropriate managers must establish permanent support responsibilities. Operations and Help Desk need a record of these assignments.
  • If communications channels to stakeholders (announce or support lists) were ill-defined or incomplete, the service owner must correct those with sponsors and managers for the service(s).

Terminology

Problem: Requiring a technical solution.

Issue: Requiring customer communications or training.

Service Owner: IS&T senior staff member primarily responsible for a service. This should be a named individual who serves as the point of contact.

Business Owner: Person or department that sponsors the service.

Support Staff: Customer Support and Operations & Infrastructure Staff dealing with the current production problem and issue.

Urgent: Requires immediate triage, service down.

Hand-Off: Bug in the system, short-term project.

The final version of the IS&T Problem Response Playbook can be found in our knowledge base at: http://kb.mit.edu/confluence/x/ioK2

Urgent Problem Response Procedure At A Glance

  1. Identification 
    1. Issue is identified.
    2. Recipient of issue in IS&T contacts the service owner and hdnotify@mit.edu.
    3. Service owner and support staff establish severity of the issue.
  2. Communication
    1. The service owner must contact IS&T Senior Staff.
    2. The service owner then coordinates a conference call with IS&T Senior Staff and the business owner.
    3. The service owner updates 3Down.
  3. Resolution 
    1. The service owner defines a recipient list and assembles team to fix the issue.
    2. The service owner can delegate management of team to specific staff at this point.
    3. The service owner tracks the problem response (technical) thread with the issue-tracking mechanism used by the team; if none exists, use ops-help@mit.edu.
    4. The service owner keeps the Help Desk in the issue-resolution (customer communications) loop using the hdnotify@mit.edu email list.
    5. The service owner continues to update 3Down in addition to emailing the recipient list.
    6. The service owner and business owner must validate the service for the problem to be closed.
    7. The service owner communicates resolution to recipient list and updates 3Down.
  4. Close Out
    1. The service owner documents the following in the internal Hermes site for reference (http://kb.mit.edu/confluence/x/3wG9):
      1. outages and resolution (including long-term action items)
      2. how well the Problem Response process was followed
      3. follow-up on resolution long-term action items
    2. The service owner follows up with the business owner on long-term items.

Wiki Markup
h2. Define the Problem
# *ops-help@mit.edu:* Is the primary communications conduit for service problems.
## ops-help generates a request tracker ticket. Use this thread for the whole problem resolution (for urgent response) or hand-off (bugs).
## Email representing urgent problems should be forwarded to the appropriate business owner, service owner, and the help desk.
## Support personnel will handle issue resolution (customer or community communications) through their channels. These are the service owner and/or the help desk.
## If a support request requiring a technical response came through another channel, forward it to ops-help.
## Automated responses from monitoring software come through various service support lists. Assume that list is part of the communications thread.
# *Choose a Tech Lead*
## Operations and Infrastructure must select a Tech Lead to see the problem resolution through to completion.
## The Tech Lead responds to the internal initiators of the problem report, alerting those parties that resolution is underway.
## The Tech Lead defines the type of issue: Urgent or Handoff.
## For Operations staff, both kinds of issues are pri-1. It is acceptable for all other responsibilities to be on hold until resolution (system down) or handoff (bug report).

h2. Urgent Response (Resolution)
# *For an unresponsive component, either the Server Operations team or an automated process should have attempted to restart the component.*
## Someone familiar with the system must check to see if restart procedures occurred and if that temporarily resolved the problem.
## If not, Tech Lead or designate restarts component manually, determines if this resolves issue or if a more persistent problem exists.
# *Notification: preliminary problem description (and resolution, if applicable) sent to Recipient List:*
## initiator of problem ticket, srstaff@mit.edu, the appropriate "announce" list for the service, and if any end-user applications could have been affected, computing-help@mit.edu
# *In conjunction with managers currently present, Tech Lead forms Team to troubleshoot issue.*
## Emergency Response takes precedence over other project work.
## Tech Lead is project manager for duration of issue resolution.
## Tech Lead is final arbiter for delegation of tasks, priorities, and timing.
# *Notification: If resolution is lengthy, Tech Lead will update Recipient List at least once per day of status of resolution.*
# *Post Mortem: Tech Lead reviews response.* If emergency response offers the opportunity for improvement of process, Tech Lead calls a post-mortem with parties who participated in the resolution.

h2. Bug Reports-Handoff
# *Tech Lead notifies a team leader or manager responsible for each tier of the system affected.*
## Tech Lead collects information from managers on which mail lists to send notification of issue. This is the Recipient List for this issue. Note it in the ticket.
## Tech Lead and managers determine staff members responsible for issue. This is the Team.
# *Tech Lead sends message to Recipient List notifying them of the issue.*
# *Team performs preliminary troubleshooting to determine the nature of the issue and identify the staff responsible to remedy the issue.*
# *Ticket information is transferred or linked to system of record for the issue resolvers.*
# *Recipient List is notified of the transfer and the managers now responsible for the issue.*
{column}
{column:width=300}
{panel:borderStyle=solid|borderColor=blue}
h2. Terminology
*Problem:* Requiring a technical solution.
*Issue:* Requiring customer communications or training.
*Service Owner:* IS&T team primarily responsible for a service.
*Business Owner:* Person or department that sponsors the service.
*Urgent:* Requires immediate triage, service down.
*Hand-Off:* Bug in the system, short-term project.
{panel}