Define the Problem

  • If services are completely non-operational, the problem is urgent and requires immediate triage.
  • If services are operating but behavior has changed from a known state, the problem is urgent and requires immediate triage.
  • If services are operating but artifacts are discovered, the problem requires hand-off for repair.
  • In all cases, a problem preempts project work. The service owner must escalate if the problem requires development staff who are on a project timeline.
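
The triage rules above can be sketched as a small classifier. This is only an illustration of the decision logic, not an existing tool: the names `ServiceState`, `ResponseMode`, `classify`, and `owner_must_escalate` are hypothetical and do not come from any IS&T system.

```python
from enum import Enum


class ServiceState(Enum):
    """Observed condition of the service (hypothetical names)."""
    DOWN = "completely non-operational"
    BEHAVIOR_CHANGED = "operating, but behavior changed from a known state"
    ARTIFACTS_FOUND = "operating, but artifacts were discovered"


class ResponseMode(Enum):
    """How the problem is handled (hypothetical names)."""
    URGENT = "immediate triage"
    HAND_OFF = "hand-off for repair"


def classify(state: ServiceState) -> ResponseMode:
    """Map an observed service state to the response mode listed above."""
    if state in (ServiceState.DOWN, ServiceState.BEHAVIOR_CHANGED):
        return ResponseMode.URGENT
    return ResponseMode.HAND_OFF


def owner_must_escalate(needs_dev_staff: bool, staff_on_project_timeline: bool) -> bool:
    """Escalation applies only when the required developers are committed to a project timeline."""
    return needs_dev_staff and staff_on_project_timeline
```

For example, `classify(ServiceState.ARTIFACTS_FOUND)` returns `ResponseMode.HAND_OFF`, while either of the other two states returns `ResponseMode.URGENT`.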

Communication Channels

  • Regardless of the channel through which the problem arrives, the recipient must contact ops-help and the service owner, if known.
  • Track the initial problem response (technical) thread with the ops-help@mit.edu list and the ops-help Request Tracker queue.
  • Keep the Help Desk in the issue-resolution (customer communications) loop using the (TBD) email list.
  • Identify the service owner. This person, or a designate, is the tech lead for the problem response.
  • The service owner updates the parties noted here, stakeholders of the affected services, and the initiator of the problem report with an initial communication.
  • The service owner should know who has support responsibilities for the service. If an ad-hoc support team must be formed, this must be escalated to senior staff for assignment of permanent support responsibilities.

Urgent Response (Resolution)

  1. For a completely unresponsive component, Operations and Infrastructure staff have license to perform basic system operations tasks (restart, adjust capacity, etc.) to get the service back online while attempting to contact the service owner and others.
  2. The service owner or designate updates 3Down in addition to the communication channels noted above.
  3. When basic system operations are not effective, the service owner takes over as lead of specialist support staff.
  4. In either case, the service owner must validate the service before the problem can be closed.
  5. Once the service is operating again, attempt to identify the cause of the problem.
  6. If the resolution is temporary, permanent resolution must be addressed. The service owner decides whether to stay in urgent mode or shift to hand-off mode (below).
    1. Someone familiar with the system must check whether restart procedures ran and whether they temporarily resolved the problem.
    2. If not, Tech Lead or designate restarts the component manually and determines whether this resolves the problem or a more persistent problem exists.
  1. Notification: preliminary problem description (and resolution, if applicable) sent to Recipient List:
    1. Initiator of the problem ticket, srstaff@mit.edu, the appropriate "announce" list for the service, and, if any end-user applications could have been affected, computing-help@mit.edu.
  2. In conjunction with managers currently present, Tech Lead forms Team to troubleshoot issue.
    1. Emergency Response takes precedence over other project work.
    2. Tech Lead is project manager for duration of issue resolution.
    3. Tech Lead is final arbiter for delegation of tasks, priorities, and timing.
  3. Notification: If resolution is lengthy, Tech Lead updates Recipient List on the status of the resolution at least once per day.
  4. Post Mortem: Tech Lead reviews the response. If the emergency response reveals an opportunity to improve the process, Tech Lead calls a post-mortem with the parties who participated in the resolution.
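
The notification steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `build_recipient_list` and `status_update_due` are hypothetical, the two fixed addresses are the ones named in the list above, and "at least once per day" is read here as "no more than 24 hours between updates".

```python
from datetime import datetime, timedelta
from typing import List

# Addresses named in the notification step above.
SENIOR_STAFF = "srstaff@mit.edu"
END_USER_HELP = "computing-help@mit.edu"


def build_recipient_list(initiator: str, announce_list: str,
                         end_user_apps_affected: bool) -> List[str]:
    """Assemble the Recipient List for a problem (hypothetical helper)."""
    recipients = [initiator, SENIOR_STAFF, announce_list]
    # The end-user help address is added only if end-user applications
    # could have been affected.
    if end_user_apps_affected:
        recipients.append(END_USER_HELP)
    return recipients


def status_update_due(last_update: datetime, now: datetime) -> bool:
    """Return True once 24 hours have passed since the last status update.

    Assumption: 'at least once per day' means a maximum gap of 24 hours.
    """
    return now - last_update >= timedelta(days=1)
```

A Tech Lead's tooling could call `status_update_due` on a timer and prompt for a status message whenever it returns `True`; the initiator and announce-list addresses passed in are whatever the ticket records.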

Bug Reports (Hand-Off)

  1. Tech Lead notifies a team leader or manager responsible for each tier of the system affected.
    1. Tech Lead asks managers which mailing lists should receive notification of the issue. This is the Recipient List for this issue. Note it in the ticket.
    2. Tech Lead and managers determine staff members responsible for issue. This is the Team.
  2. Tech Lead sends message to Recipient List notifying them of the issue.
  3. Team performs preliminary troubleshooting to determine the nature of the issue and identify the staff responsible to remedy the issue.
  4. Ticket information is transferred or linked to system of record for the issue resolvers.
  5. Recipient List is notified of the transfer and the managers now responsible for the issue.

Close Out

Complete the following, where applicable, before closing the problem.

  • If the service owner was unknown or vaguely defined, escalate to senior staff and update Operations' records.
  • If specialist development staff with the right skills were undefined or only temporarily assigned, the service owner and senior staff must establish permanent support responsibilities.
  • If communications channels to stakeholders (announce or support lists) were ill-defined or incomplete, the service owner must correct those with sponsors and managers for the service(s).
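
The close-out conditions above can be expressed as a simple checklist. This is a sketch only: `CloseOutReview` and `outstanding_follow_ups` are hypothetical names, and the action strings paraphrase the bullets above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CloseOutReview:
    """Answers gathered while reviewing a problem before closing it (hypothetical)."""
    owner_clearly_defined: bool
    support_staff_permanently_assigned: bool
    stakeholder_channels_complete: bool


def outstanding_follow_ups(review: CloseOutReview) -> List[str]:
    """Return the follow-up actions that remain before the problem is closed."""
    actions = []
    if not review.owner_clearly_defined:
        actions.append("Escalate to senior staff and update Operations' records")
    if not review.support_staff_permanently_assigned:
        actions.append("Establish permanent support responsibilities with senior staff")
    if not review.stakeholder_channels_complete:
        actions.append("Correct communications channels with sponsors and managers")
    return actions
```

An empty list from `outstanding_follow_ups` means none of the three conditions applies and the problem can be closed.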

Terminology

Problem: Requiring a technical solution.

Issue: Requiring customer communications or training.

Service Owner: IS&T team primarily responsible for a service.

Business Owner: Person or department that sponsors the service.

Urgent: Requires immediate triage, service down.

Hand-Off: Bug in the system, short-term project.

Open Questions

  1. Where would we record support-team assignments if they are standing (permanent)?
  2. Which email address should be used to keep the Help Desk informed?