We walk through the key steps and essential building blocks required to develop a disaster recovery strategy and write a disaster recovery plan.
The main aim of IT disaster recovery planning is to produce a detailed recovery plan that can be executed in the event of an unforeseen outage.
Such a plan should set out the detailed steps needed to recover your IT systems to a state in which they can support the business after a disaster. In so doing, the disaster recovery plan must cover the range of events your organization might face.
But there are tasks that precede the creation of a detailed recovery plan. Key among these is a risk assessment and/or business impact analysis that identifies the IT services critical to the organization’s business activities.
This work allows you to determine recovery time objectives (RTOs) and recovery point objectives (RPOs) for the key infrastructure and applications in your environment. From here, you can move on to development of disaster recovery strategies and actual plans.
In this article we go through the key steps and essential building blocks needed when developing a disaster recovery strategy, including how to write a disaster recovery plan.
Developing a DR strategy
According to ISO/IEC 27031, the international standard for ICT readiness for business continuity: “Disaster recovery strategies should define the approaches to implement the required resilience so that the principles of incident prevention, detection, response, recovery, and restoration are put in place.”
Here, it is important to understand the distinction made between strategies – which define what you need to do when responding to an incident – and plans, which describe how you concretely intend to execute those requirements.
Key steps in outlining strategy and plans are to:
- Identify critical systems. These could be, for example, payments or manufacturing systems, or whatever is critical to your business. Inherent in identifying these systems is assigning each some degree of priority in terms of level of protection or recovery.
- Decide on RTOs and RPOs for each system. The RTO sets how soon a system needs to be recovered: this can range from no downtime being acceptable through to periods of minutes or hours in which to get it up and running again. The RPO sets how much data you can afford to lose, measured in time: must you restore to exactly where things left off, or is there some leeway?
- Identify potential threats to each system, or to groups of systems. These can range from buildings taken out by floods or fire, to incidents that affect individual systems, such as hardware failure.
- Develop a prevention strategy. This addresses the identified threats. Measures could range from better flood defenses and upgraded UPS to improved server/application protection.
- Develop a response strategy. This outlines exactly what needs to happen in case of identified threats causing an outage. This might include failing over to alternative sites or hardware and would be executed with reference to the RPOs and RTOs specified.
- Finally, the response strategy should outline the key tasks required to bring systems back to their primary locations with all protections in place against future outages.
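As a rough illustration of the first two steps above, the sketch below records a handful of critical systems with their RTOs and RPOs and sorts them into a recovery order. The `CriticalSystem` class, the system names and all the figures are hypothetical, chosen only for the example; real values would come out of your risk assessment or business impact analysis.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class CriticalSystem:
    """One entry in a hypothetical inventory of critical systems."""
    name: str
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of data loss


def recovery_order(systems):
    """Sort so the tightest RTO comes first; ties are broken by the tighter RPO."""
    return sorted(systems, key=lambda s: (s.rto, s.rpo))


# Illustrative values only, not recommendations.
inventory = [
    CriticalSystem("email", rto=timedelta(hours=8), rpo=timedelta(hours=4)),
    CriticalSystem("payments", rto=timedelta(minutes=15), rpo=timedelta(0)),
    CriticalSystem("manufacturing", rto=timedelta(hours=1), rpo=timedelta(minutes=30)),
]

for system in recovery_order(inventory):
    print(f"{system.name}: RTO {system.rto}, RPO {system.rpo}")
```

Sorting by RTO is only one possible way to express priority. An organization might equally rank systems by revenue impact or by dependencies between them.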
Other factors to consider in DR strategy
- People: Questions you’ll need to ask in this area might include: What is the availability of staff/contractors that might be needed in key areas when implementing DR plans? What training will we need to arrange for them?
Also, you may need to ensure there’s some duplication of critical skills so there can be a primary and a backup person in key areas.
- Physical premises: Questions to ask here may include: Are there alternate work areas on the same site? Do we need to arrange for the use of different company locations, third-party sites, employees’ homes, or portable buildings?
Following on from this you’ll need to consider site security, staff access, ID badges etc. at the alternate location.
Depending on the options available, and your organization’s needs, you may need to consider access to premises properly configured for IT systems. This may include raised floors, sufficient electrical power, cooling for IT systems and people, and voice and data infrastructure.
- Data: This is primarily an area to look at in terms of preparation. Key areas are backup of data in accordance with specified RTO and RPO requirements and methods of data storage. You should also consider data protection capabilities at the alternate site.
- Suppliers: You should identify and establish contracts with primary and alternate suppliers for all critical systems and processes, including the sourcing of people.
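On the data point above, one simple check is whether each backup schedule can actually satisfy the RPO agreed for that system. The sketch below assumes periodic backups, where a failure just before the next backup loses up to one full interval of data. The function name, systems and intervals are all hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss with periodic backups is one full interval."""
    return backup_interval <= rpo


# Hypothetical schedules: (system, how often data is backed up, agreed RPO)
schedules = [
    ("payments", timedelta(minutes=5), timedelta(minutes=1)),
    ("manufacturing", timedelta(minutes=15), timedelta(minutes=30)),
    ("email", timedelta(hours=24), timedelta(hours=4)),
]

gaps = [name for name, interval, rpo in schedules if not meets_rpo(interval, rpo)]
print("Backup schedules that cannot meet their RPO:", gaps)
```

This simplification ignores how long a backup takes to complete and to reach the alternate site, both of which widen the real exposure.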
Translate disaster recovery strategy into DR plans
Once disaster recovery strategies have been developed, they can be translated into disaster recovery plans.
The main task here is to take the steps outlined earlier that concluded with your response strategy and add a new stage: recovery action steps.
For example, if the issue is failed server hardware, the key recovery action steps will be to verify the cause of the outage, obtain and install a new server, test it and fail systems back to it.
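The failed-server example can be written down as an ordered list of recovery action steps, each with a time estimate, and the total compared with the RTO for the affected system. This is a minimal sketch: the step names follow the example above, but the durations and the RTO are hypothetical, and it assumes the steps run strictly one after another.

```python
from datetime import timedelta

# Steps taken from the failed-server example; the durations are hypothetical estimates.
recovery_steps = [
    ("Verify the cause of the outage", timedelta(minutes=30)),
    ("Obtain and install a new server", timedelta(hours=4)),
    ("Test the new server", timedelta(hours=1)),
    ("Fail systems back to the new server", timedelta(minutes=30)),
]


def estimated_recovery_time(steps):
    """Sum the step estimates, assuming the steps run one after another."""
    return sum((duration for _, duration in steps), timedelta(0))


rto = timedelta(hours=8)  # hypothetical RTO for the affected system
total = estimated_recovery_time(recovery_steps)

for number, (action, duration) in enumerate(recovery_steps, start=1):
    print(f"{number}. {action} (est. {duration})")
print(f"Estimated total {total}; within RTO of {rto}: {total <= rto}")
```

If the estimated total exceeds the RTO, that suggests either the steps need to be shortened (for example, by holding spare hardware) or the response strategy needs a failover option.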
Developing DR plans
A disaster recovery plan should provide a step-by-step description for responding to an unplanned outage. The goal is to have an easy-to-use and repeatable set of steps that allows you to recover IT assets and return them to normal operations within the limits set out by your RPOs and RTOs.
The structure of the DR plan
These are the main sections that should be included in a disaster recovery plan:
- Introduction: The initial pages of the DR plan should describe the overall organization-level response to emergencies and outline the purpose and scope of the IT disaster recovery plan within it. They should also state who has approved the plan and who is authorized to activate it, and list other relevant plans and documents.
- Roles and responsibilities: This section should set out the roles and responsibilities of disaster recovery team members. It should give their contact details, their precise role and responsibility in a disaster situation, and any spending limits that apply if, for example, equipment has to be purchased.
- Incident response: This section should describe an incident response process, the aim of which is to quickly assess situations, determine their severity, contain the incident if possible or appropriate, and notify management and other key stakeholders.
- Plan activation: Based on the findings from incident response activities, the next step is to determine if disaster recovery plans should be invoked, and which elements in particular according to the situation faced.
- Document history: A record should be kept of revisions to the disaster recovery plan document. It should include the date of each revision, what was revised and who approved the changes.
- Procedures: These are the response and recovery activities as specified in the plans – the recovery action steps – and are the core of the document. The more detailed these are, the more likely it is that the IT systems will be recovered to normal operation.
- Appendices: At the end of the plan, these can include systems, networks, and application topologies and inventories, including dependencies, plus contracts and service-level agreements (SLAs), supplier contacts, and any information useful to the recovery process.
The creation of the disaster recovery plan is only the start of the process. For a DR plan to work when it is called upon, it will need to be tested. It will also need employees who are fully aware of your DR plans, who know their responsibilities in case of disaster, and who have been trained to play their part.
The only way to ensure the best chance that DR plans can succeed in bringing systems back from an outage is to test them regularly, and to ensure that they are kept up-to-date with regard to people and physical assets.
Written by: Antony Adshead, Storage Editor for ComputerWeekly.com.