Disaster Recovery and Business Continuity Plan Development and Management

Overview

Although they are distinct, Disaster Recovery and Business Continuity plans overlap considerably during development, especially since much of your Business Continuity plan addresses IT system outages that exceed the capabilities of the Disaster Recovery plan. Both will also overlap with your Information Security program, since a major component of Information Security is protecting your IT system from outages and data loss.

Development of these programs will differ depending on the maturity of your corporate governance, whether any such plans are already in place, and even your company culture. But the process will generally include these steps:

Create an inventory of critical functions supporting the company's objectives and activities.
Get a designation by top management of risk tolerance and thresholds.
Conduct Risk Management for Business Continuity, which will include risks to the IT system, and identify feasible controls to reduce and mitigate risks.
Assess current capabilities and develop a strategy to implement these controls; this will include actually creating the Disaster Recovery and Business Continuity plans.
Monitor and track implementation.

Disaster Recovery Planning

When creating a Disaster Recovery plan, you must devise a way to determine whether a given outage will impact critical business functions beyond acceptable thresholds; that is, whether the outage is a disaster. This is important because Disaster Recovery operations involve other departments besides IT, while recovery from an outage that is not a disaster can and should be handled entirely by IT.

When Disaster Recovery is initiated, designated business managers outside the IT department will monitor the progress of the Disaster Recovery operation. These managers will have the authority to initiate execution of the Business Continuity Plan if they determine that the Disaster Recovery operation is not on track to restore aspects of IT operations within the relevant thresholds.

For outages that are not disasters, the IT department should monitor the IT system, and their own service restoration activities, for Disaster Alert Indicators. These are measures or conditions that indicate when handling of an outage should be escalated to Disaster Recovery operations.
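As a minimal sketch of how a Disaster Alert Indicator check might look, the Python below flags an outage for escalation when the projected total downtime exceeds the allowable threshold, or when restoration attempts keep failing. The class name, field names, indicator wording, and the limit of two failed attempts are all illustrative assumptions, not part of any standard; your own indicators would come out of the planning process.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OutageStatus:
    """Snapshot of an ongoing outage, as tracked by the IT department (hypothetical fields)."""
    asset: str
    hours_down: float                  # time elapsed since the outage began
    estimated_hours_to_restore: float  # IT's current estimate of remaining work
    restore_attempts_failed: int       # restoration attempts that did not succeed


def triggered_indicators(status: OutageStatus,
                         max_allowable_downtime_hours: float,
                         max_failed_attempts: int = 2) -> List[str]:
    """Return the indicators that have fired; an empty list means IT keeps handling it."""
    triggered = []
    projected = status.hours_down + status.estimated_hours_to_restore
    if projected > max_allowable_downtime_hours:
        triggered.append("projected downtime exceeds threshold")
    if status.restore_attempts_failed >= max_failed_attempts:
        triggered.append("repeated restoration failures")
    return triggered


# Example: 1.5 hours down, 3 more hours expected, against a 4-hour threshold.
status = OutageStatus("order-database", 1.5, 3.0, 0)
print(triggered_indicators(status, max_allowable_downtime_hours=4.0))
```

Note that the check compares projected downtime, not just elapsed downtime, against the threshold, so escalation can happen before the threshold is actually breached.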

Given this, it should be clear why you cannot simply direct your IT department to develop a Disaster Recovery plan: your IT department cannot define on its own which kinds of outages are disasters and which are not. Whether an outage or data loss incident rises to the level of a disaster can only be determined if your business managers have done the following in advance:

Performed Risk Management;
Related your IT assets (hardware, communications, applications, and data) to critical business functions; and
Defined the maximum allowable downtime (and other relevant thresholds) for each IT asset, which is derived from understanding how each asset and the IT system as a whole support each critical business function, and to what extent and for how long a degradation or interruption of each critical function can be tolerated.
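The relationship between IT assets, critical functions, and downtime thresholds can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the asset names, function names, and hour values are invented, and the sketch assumes that a function is fully interrupted whenever any asset it depends on is down (no redundancy or manual workaround). Under that assumption, an asset's maximum allowable downtime is the tightest tolerance among the functions it supports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalFunction:
    name: str
    max_tolerable_outage_hours: float  # set by business managers, not by IT


# Hypothetical critical functions and their tolerances.
FUNCTIONS = {
    "order entry": CriticalFunction("order entry", 4.0),
    "invoicing": CriticalFunction("invoicing", 24.0),
    "customer support": CriticalFunction("customer support", 8.0),
}

# Hypothetical mapping of each IT asset to the critical functions it supports.
ASSET_SUPPORTS = {
    "order-database": ["order entry", "invoicing"],
    "email-server": ["customer support"],
}


def max_allowable_downtime(asset: str) -> float:
    """Tightest outage tolerance among the functions this asset supports."""
    supported = ASSET_SUPPORTS[asset]
    return min(FUNCTIONS[name].max_tolerable_outage_hours for name in supported)


for asset in ASSET_SUPPORTS:
    print(asset, max_allowable_downtime(asset))
```

In practice the derivation is rarely this mechanical, since workarounds and partial degradation change how long an interruption can be tolerated, but the structure shows why the thresholds have to originate with the business functions.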

This planning requires input and participation of business managers outside IT. Once all this information has been collected, analyzed, and documented, then you can make a chart of Disaster Alert Indicators for each type of potential service-disrupting incident, to determine when the Disaster Recovery plan must be activated.

Any problem or outage in your IT system has the potential to be a disaster. If you've never performed Disaster Recovery planning, then you may be at tremendous risk of failing to maintain critical business functions that could have been sustained had proper planning been performed. Your IT department may have implemented the best high availability configurations they could with what they have. And if they're really squared away, they have charts and diagrams of how long services will be unavailable given various component failure scenarios, and they know exactly how long data restoration will take in the event of a data loss incident and whether any data may be unrecoverable. Even with all that, business managers must take the initiative and apply the funding and direction to ensure that sufficient resiliency capabilities are in place, validated, and tested to support business requirements. This is only done by proper Disaster Recovery planning.

Business Continuity Planning

Because Disaster Recovery is so focused on IT, and its objective is always to get applications, data, and communications back online, the breadth of your plans for different types of outages is generally limited compared to Business Continuity. Business Continuity is much broader because it covers all manner of events that can disrupt your business. This is why, once all planning is complete, what we generally call a Business Continuity plan will actually be several completely distinct Business Continuity plans.

Emergency Response planning is a separate discipline, typically initiated by Human Resources or Facilities Management under your company's Safety Program. When developing your Business Continuity plans, naturally, you will coordinate with and incorporate any existing Emergency Response plan. This way, in the event of a natural disaster or terrorist attack during business hours that scatters your employees at the same time as it damages your IT system and your office building, the process of accounting for employees' safety and location; establishing communications to decide on when, where, and how employees will return to work; and monitoring Disaster Recovery operations will all be coordinated under your Business Continuity plans.

So, what about non-critical functions of the business? So far, we've ignored these, and there's a good reason. See, Business Continuity planning already consumes substantial time and resources to create and maintain, and it's distracting to routine operations. It's hard enough to get management to see the value, since it doesn't produce any revenue whether it ends up being used or not. And when calamity finds you, even with the best preparation and rehearsal, you can be sure that new unanticipated problems will arise. So, it would be patently irresponsible to include anything that isn't truly critical to business operations in the plan, because this would increase the cost and complexity, and thereby increase the chance the plan will be abandoned by management or that execution will fail.

Here's how it should work. Once the Business Continuity plan has been executed to completion after a business-interrupting event, management may then commence planning to restore any non-critical functions that were interrupted. This planning shouldn't be done before the event, for the reasons given above, and it shouldn't be done until the critical services are restored. See, in an actual Business Continuity event, depending on the severity of the interruption and the scale of loss, your business may be significantly altered for the foreseeable future, or even permanently. If it is, then the required scale and nature of the non-critical functions will have changed as well. Management can then assess the company's financial situation, decide which of the previous functions are still needed, and initiate a program to re-establish them.

Maintenance

On a regular schedule, and whenever significant changes to your business occur that may change your risk tolerance thresholds or the dynamics of your controls, you must review your Disaster Recovery and Business Continuity plans. Reviews can range from the most basic readiness checks to a full dress rehearsal. Some examples:

Sanity check of the written plans by the plan manager. This will confirm that any assumptions made or resources required are still valid, and that named individuals with roles and responsibilities have been trained and will be able to perform their duties when needed.
Checks and inspections of reserved resources. This involves making sure spare equipment, or other special equipment such as a standby laptop with special software to be used for diagnostics, hasn't gone missing, and testing it for readiness and proper function. Your IT department should also power up any cold spare equipment on a schedule and make sure its software is upgraded to the same level as that run by whatever it will replace.
Round-table rehearsal with the team. All those involved do a conceptual walk-through, to reveal any conflicts, problems with assumptions, and choke points. Include service partners and warranty providers in this as well—don't find out when it's too late that a warranty expired, or that an online backup service stopped working because no one got the notification that the credit card information needs to be updated.
Testing automated and manual failover of IT system components in a disruptive manner. This requires coordination with all potentially affected departments.
Performing the steps to restore data or virtual machine images from backup, in as realistic a scenario as possible.
Full company rehearsal of response to a given scenario; this will be driven as much by the managers of the Emergency Response plan as by Disaster Recovery or Business Continuity plan managers.
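Some of the simpler checks above, such as catching an expired warranty or a lapsed payment card before it matters, lend themselves to automation. The following Python is a minimal sketch, assuming you keep a register of reserved resources with expiry dates; the item names, dates, and 30-day warning window are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical register of items the plans depend on, with their expiry dates.
RESERVED_ITEMS = [
    {"item": "standby laptop warranty", "expires": date(2025, 3, 1)},
    {"item": "online backup payment card", "expires": date(2025, 2, 10)},
    {"item": "cold spare switch support contract", "expires": date(2026, 1, 1)},
]


def expiring_items(items, today, warning_days=30):
    """Return names of items already expired or expiring within the warning window."""
    cutoff = today + timedelta(days=warning_days)
    return [entry["item"] for entry in items if entry["expires"] <= cutoff]


# Run as part of a scheduled readiness check.
print(expiring_items(RESERVED_ITEMS, today=date(2025, 2, 15)))
```

A report like this supplements the review by the plan manager; it does not replace physically inspecting and testing the equipment.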

What kind of testing to do, and when, will be determined by management based on risk thresholds, with input from the planning process.
