1.        Introduction

This root cause analysis document provides a brief description of the incident, a summary of events, a discussion of the causal factors, and recommendations for preventing recurrence.

Key terms used in this document are defined below:

Root Cause:     

The underlying reason for the occurrence of a problem, usually made up of a number of causal factors.

Causal Factors:

Individual events contributing to the root cause.

Contributory Factors:            

Components (e.g. people, processes, configurations) which contributed to the impact of a problem but were not a result of, or linked to, its root cause.

Consequential Issues:   

Issues or incidents arising as a direct result of the problem.

Other Observations:

Lessons learnt during the incident or the resolution of the incident.

2.        Incident Summary

Service Failure

On Monday, 21st June at 13:35, we were alerted to a fault on our primary fibre circuit from Dunsfold to London. The Memset network is designed with redundancy to protect against such a failure; in normal circumstances the circuit from Dunsfold to Reading provides this redundancy. In this instance it was discovered that, in addition to the fault on the primary circuit, a previously undetected fault was present on the secondary circuit, rendering it unusable as well. The two separate failures resulted in a network outage for the duration of the incident.

Additionally, the multiple failures impacted our ability to invoke DR procedures for our website and communications platforms. As a result, our ticketing system and phone lines were offline for the duration of the incident.

Workaround implementation / Immediate actions      

Engineers on site were able to ascertain that the infrastructure at the Dunsfold datacentre was functional and that the fault lay upstream of the Dunsfold external routers. Both faults were reported to our vendors and escalated to their senior escalation teams.

BT Openreach engineers attended both the Dunsfold and London sites at around 18:50 and identified that the fault on the Dunsfold to London line was a severed cable in the Clapham area. This enabled them to start work rerouting our circuit.

BT Openreach completed the rerouting of the Dunsfold to London fibre at 00:25, 22nd June.

The fault with the secondary circuit from Dunsfold to Reading was resolved by 11:35, 22nd June. This restored redundancy to the network architecture, returning the network to normal service.

3.        Root Cause Analysis (RCA)

This section examines the underlying cause of the incident and the factors that contributed to its impact and duration.

Root Cause

The root cause of this incident has been identified as two separate faults on the fibre circuits providing network connectivity to the Dunsfold datacentre. These multiple failures resulted in the loss of both the primary network connection and the redundancy that was in place.

Causal Factors

The fault on the Dunsfold to London fibre circuit has been identified as a fibre break in the Clapham area of London.

The fault on the secondary circuit from Dunsfold to Reading has been identified as fibre degradation a few miles from the Dunsfold datacentre. The cause of this fibre degradation is currently unknown.

Contributory Factors

We were unaware that the secondary fibre circuit to Reading had failed until it was required on Monday. Although the circuit remained active, it has not carried day-to-day traffic since our migration of customer servers from the Amito datacentre. It was understood that this circuit was provided on a managed basis by Focus Group; however, investigation after the incident discovered that this was not the case, and Focus Group were not providing monitoring services.

No testing schedule was in place for the secondary circuit to Reading, as it had previously carried day-to-day network traffic continuously.

Consequential Issues

The redundancy in place for our website is reliant on at least one of the fibre circuits being available. Technical limitations meant it was not possible to reroute traffic to our second site at Maidenhead.

Our telephone system resides in our Dunsfold site. Again, this is reliant on at least one of the fibre circuits being functional.

4.        Recommended Actions

Area:         Root Cause
Description:  Establish the time that the fault developed on the secondary circuit
Status:       Underway
Action Plan:  We are working closely with our vendor to establish when this fault developed.

Area:         Root Cause
Description:  Establish the cause of the fibre degradation on the circuit from Dunsfold to Reading
Status:       Underway
Action Plan:  We are working closely with our vendor to establish the cause of this fault.

Area:         Causal Factors
Description:  Implement internal monitoring on the Dunsfold to Reading fibre circuit
Status:       Complete
Action Plan:  Additional monitoring has been put in place to alert on any degradation of service. This monitoring is provided and responded to by Memset/Iomart; an illustrative sketch of such a probe follows this table.

Area:         Contributory Factors
Description:  Reconfigure network routing so that a proportion of network traffic is passed over the Dunsfold to Reading fibre circuit
Status:       Complete
Action Plan:  We are now passing traffic across this circuit.

Area:         Consequential Issues
Description:  Migrate Memset management systems to group systems to provide resilience in the event of an incident
Status:       Underway
Action Plan:  A project to carry out this work was underway prior to this incident.
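The following is a minimal, illustrative sketch of the kind of internal circuit probe referred to in the Causal Factors action above. It is not the actual Memset/Iomart monitoring implementation; the probe target, loss threshold and interval shown are hypothetical placeholders.

    # Illustrative sketch only - not the production monitoring system.
    # Sends a batch of ICMP pings across the Dunsfold to Reading circuit and
    # raises an alert if packet loss exceeds a threshold. All values are placeholders.
    import subprocess
    import time

    PROBE_TARGET = "192.0.2.1"   # hypothetical far-end router address (TEST-NET)
    LOSS_THRESHOLD_PCT = 5.0     # alert if packet loss exceeds this percentage
    INTERVAL_SECONDS = 60        # how often to probe

    def packet_loss(target: str, count: int = 10) -> float:
        """Return the percentage packet loss for `count` pings to `target` (Linux ping)."""
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", "2", target],
            capture_output=True, text=True,
        )
        for line in result.stdout.splitlines():
            if "packet loss" in line:
                # e.g. "10 packets transmitted, 8 received, 20% packet loss, time 9012ms"
                return float(line.split("%")[0].split()[-1])
        return 100.0  # no summary line: treat as total loss

    if __name__ == "__main__":
        while True:
            loss = packet_loss(PROBE_TARGET)
            if loss > LOSS_THRESHOLD_PCT:
                print(f"ALERT: {loss:.0f}% packet loss on Dunsfold to Reading probe")
            time.sleep(INTERVAL_SECONDS)

In practice a production check would also track latency and optical light levels and feed its alerts into the 24/7 monitoring referred to in Section 5, but the principle is the same: probe the circuit continuously rather than relying on day-to-day traffic to expose faults.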

5.        Likelihood of Recurrence

Risk Matrix



                              Impact
                  Minor (1)   Moderate (2)   Major (3)   Critical (4)
Likelihood
  High (3)            3            6              9            12
  Moderate (2)        2            4              6             8
  Low (1)             1            2              3             4

Assessment

The likelihood of recurrence has been categorised as Low (0-25%). This is due to the following mitigating factors:

  • Redundancy is now back in place following the repair of the secondary circuit between Dunsfold and Reading
  • The secondary circuit is now being actively used for day-to-day network traffic, thus increasing fault visibility
  • Additional monitoring has been put in place and is monitored 24/7 by Iomart/Memset

The impact of this incident has been categorised as Critical.

The overall risk score has been calculated as 4 (Low).
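For clarity, the overall score follows directly from the matrix above: every cell is the likelihood value multiplied by the impact value. A minimal worked sketch of that calculation (assuming nothing beyond the values shown in the matrix):

    # Sketch of the risk score calculation implied by the matrix above:
    # overall score = likelihood value x impact value.
    LIKELIHOOD = {"Low": 1, "Moderate": 2, "High": 3}
    IMPACT = {"Minor": 1, "Moderate": 2, "Major": 3, "Critical": 4}

    def risk_score(likelihood: str, impact: str) -> int:
        return LIKELIHOOD[likelihood] * IMPACT[impact]

    # This incident: likelihood Low (0-25%), impact Critical -> 1 x 4 = 4 (Low overall risk)
    print(risk_score("Low", "Critical"))  # prints 4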

APPENDIX A – GLOSSARY OF TERMS

Business as Usual (BAU). How a process, system or situation operates when it is not impacted by a break in service or problem – “normal conditions”.

Causal Factor (CF). Individual events contributing to the root cause.

Consequential Issue (CI).  Issues or incidents arising as a direct result of the problem.

Contributory Factors. Components (e.g. people, processes, configurations) which contributed to the impact of a problem but were not a result of, or linked to, its root cause.

Incident. Any event that is not part of the standard operation of a service and causes, or may cause, an interruption to, or a reduction in, the quality of service.

Key Learnings. Lessons learnt during the incident or the resolution of the incident.

Known error. An incident or problem for which the root cause is known and a temporary workaround or a permanent alternative has been identified. If a business case exists, an RFC will be raised, but—in any event—it remains a known error unless it is permanently fixed by a change.

Major incident. An incident with a high impact, or potentially high impact, which requires a response that is above and beyond that given to normal incidents. Typically, these incidents require cross-company coordination, management escalation, the mobilization of additional resources, and increased communications.

Permanent Fix. A solution to the problem that results in zero chance of issue reoccurrence.

Problem. The undiagnosed root cause of one or more incidents.

Root Cause Analysis (RCA). The process of identifying the underlying reason for the occurrence of a problem, which is usually made up of a number of causal factors.

Root Cause Corrective Action (RCCA). An action or task taken to address the root cause in order to ensure it is permanently fixed and cannot recur, or is at least mitigated as much as possible to reduce the impact of any future occurrence.

Resolver groups. Specialist teams that work to resolve incidents and service requests that initial support cannot resolve themselves. Support team structures vary between organizations, with some using a tiered structure (second, third, and so forth), while others use platform or application-oriented teams (mainframe team, desktop team, network team, or database team).

Service request. Requests for new or altered service. The types of service requests vary between organizations, but common ones include requests for change (RFC), requests for information (RFI), procurement requests, and service extensions.

Solution. Also known as a permanent fix. An identified means of resolving an incident or problem that provides a resolution of the underlying cause.

Workaround. An identified means of resolving a particular incident, which allows normal service to be resumed, but does not actually resolve the underlying cause that led to the incident in the first place.