Define the Problem

The severity of the issue determines the speed of the response and the process to follow.

  • Severity 1: Service(s) functionality completely disrupted. Use urgent response process.
  • Severity 2: Functionality significantly impaired but service(s) available. Use hand-off response process.
  • Severity 3: Functionality slightly impaired. Use hand-off response process.
  • Severity 4: Cosmetic or edge-case defect. Use hand-off response process.
  • Assign staff and respond to Severity 1 issues immediately.
  • Assign staff and respond to Severity 2 issues immediately if service is business critical.
  • Schedule response to Severity 3 issues, accepting impact on project timelines.
  • Schedule response to Severity 4 issues, avoiding impact on project timelines.
  • If services are completely non-operational, the problem is urgent and requires immediate triage.
  • If services are operating but behavior has changed from a known state, the problem is urgent and requires immediate triage.
  • If services are operating but artifacts are discovered, the problem requires hand-off for repair.
  • In any case, a problem preempts project work. The service owner must escalate the issue if the problem requires development staff who are on a project timeline.
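The severity rules above can be sketched as a small lookup. This is only an illustration under stated assumptions: the names `Severity`, `ResponsePlan`, and `plan_response` are hypothetical, and the sketch follows the four numbered severity levels only, not the operational-state rules in the later bullets.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels as listed in this section (names are hypothetical)."""
    COMPLETE_DISRUPTION = 1     # service functionality completely disrupted
    SIGNIFICANT_IMPAIRMENT = 2  # significantly impaired, service still available
    SLIGHT_IMPAIRMENT = 3       # slightly impaired
    COSMETIC = 4                # cosmetic or edge-case defect


@dataclass(frozen=True)
class ResponsePlan:
    process: str     # "urgent" or "hand-off"
    immediate: bool  # True: assign staff and respond now; False: schedule


def plan_response(severity: Severity, business_critical: bool = False) -> ResponsePlan:
    """Map a severity level to a response process and its timing."""
    if severity == Severity.COMPLETE_DISRUPTION:
        return ResponsePlan(process="urgent", immediate=True)
    if severity == Severity.SIGNIFICANT_IMPAIRMENT:
        # Severity 2 gets an immediate response only for business-critical services.
        return ResponsePlan(process="hand-off", immediate=business_critical)
    # Severity 3 and 4 are scheduled and follow the hand-off process.
    return ResponsePlan(process="hand-off", immediate=False)
```

For example, `plan_response(Severity.SIGNIFICANT_IMPAIRMENT, business_critical=True)` yields a hand-off plan with an immediate response, while the same severity on a non-critical service is scheduled.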

Communication Channels

  • Identify the service owner. This person either serves as Tech Lead for the problem response or names a designate.
  • Whatever channel the problem arrives through, the recipient must contact ops-help and, if known, the service owner.
  • The Tech Lead defines a Recipient List:
    • Track the initial problem response (technical) thread with ops-help@mit.edu and the ops-help Request Tracker queue.
    • Keep the Help Desk in the issue-resolution (customer communications) loop using the hdnotify@mit.edu email list.
    • Add business owners and known announce and support lists.
  • The Tech Lead should know who has support responsibilities for the service. If an ad-hoc support team must be formed, escalate this to senior staff for assignment of permanent support responsibilities (see Close Out, below).
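As a rough sketch of how a Tech Lead might assemble the Recipient List described above: the two fixed addresses come from this section, while the function name `build_recipient_list` and the example addresses are hypothetical placeholders.

```python
# Addresses named in this section; every response thread includes them.
OPS_HELP = "ops-help@mit.edu"   # technical thread and Request Tracker queue
HELP_DESK = "hdnotify@mit.edu"  # keeps the Help Desk in the customer-communications loop


def build_recipient_list(business_owners, service_lists):
    """Return the Recipient List in a stable order with duplicates removed.

    business_owners: addresses of the service's business owners
    service_lists:   known announce and support lists for the service
    """
    recipients = [OPS_HELP, HELP_DESK]
    for address in [*business_owners, *service_lists]:
        if address not in recipients:
            recipients.append(address)
    return recipients


# Hypothetical placeholder addresses, for illustration only.
example = build_recipient_list(
    business_owners=["owner@example.org"],
    service_lists=["service-announce@example.org", "owner@example.org"],
)
```

Keeping the two fixed addresses at the head of the list means ops-help and the Help Desk are never dropped, whatever else is known about the service.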

Urgent Response (Resolution)

  1. For a completely unresponsive component, Operations and Infrastructure staff have license to perform basic system operations tasks (restart, adjust capacity, etc.) to get the service back online while attempting to contact the service owner and others.
  2. The Tech Lead updates 3Down in addition to the Recipient List or requests that the Help Desk do so.
  3. When basic system operations are not effective, the Tech Lead takes over as lead of specialist support staff.
    1. Transfer ticket from initial queue to the support queue for the service.
    2. From here, no one should perform any task without direction from the lead.
    3. Tech Lead is final arbiter for delegation of tasks, priorities, and timing.
    4. The Tech Lead and all support staff try to identify the root cause of the problem in addition to restoring service.
  4. The service and business owners validate the service before the problem is closed.
  5. If the resolution is temporary, permanent resolution must be addressed. The service owner decides whether to stay in urgent mode or shift to hand-off mode (below).
  6. The service owner communicates with the Recipient List and updates 3Down.
  7. During an extended outage, the service owner updates the Recipient List and 3Down at least daily.

Hand-Off (Bug Reports)

  1. The Tech Lead can move the issue ticket from the ops-help queue to the developers' system of record for the service.
  2. Tech Lead notifies the Pipeline representative for their area so resource needs can be taken to Pipeline.
  3. Tech Lead notifies the Recipient List that the problem has moved into a longer-term resolution, with estimates.
    1. Tech Lead and managers determine the staff members responsible for the issue. This is the Team.
    2. Pipeline representatives work with managers and business owners to modify Team schedules when the problem interferes with project schedules.
    3. This is not a resource request; it is a project-timeline adjustment to compensate for production support.
  4. Tech Lead and Team communicate relevant technical details and workarounds to the Help Desk (hdnotify@mit.edu), out of band from business owners and end users.
  5. As resolution moves to different components of the system, hand off the Tech Lead role to the appropriate persons.
  6. Use Pipeline for resolution progress and to manage additional communications.

Close Out

These administrative tasks should happen, if needed, before closing the problem.

  • If the service owner was unknown or vaguely defined, escalate to senior staff and update Operations' records.
  • If specialist development staff with the right skills were undefined or only temporarily assigned, the service owner and appropriate managers must establish permanent support responsibilities. Operations needs a record of these assignments.
  • If communications channels to stakeholders (announce or support lists) were ill-defined or incomplete, the service owner must correct those with sponsors and managers for the service(s).

Terminology

Problem: Requiring a technical solution.

Issue: Requiring customer communications or training.

Service Owner: IS&T team primarily responsible for a service.

Tech Lead: The person assigned to manage a triage.

Business Owner: Person or department that sponsors the service.

Urgent: Requires immediate triage, service down.

Hand-Off: Bug in the system, short-term project.

Open Questions

  1. Where would we record support-team assignments if they are standing (permanent)?