Risk Management Example
Introduction
This is a simple yet complete illustration of how the J.D. Fox Exec Risk Management Process is applied. It is simple in that it follows the process for just one particular asset, and it covers only one or a few sets of parameters, options, and decision paths we will consider for protecting this asset. It doesn't include systems that interact with the protected asset. Certain analytical methods (such as quantitative analysis and integration with the Information Security Program) are omitted or simplified. But, the example is complete in that the entire process is described from start to finish.
This example, however, is realistic and relevant. The situations described here are similar to what we observe for many real-life businesses, as are the solutions available. You may even recognize some parallels to conditions in your company.
The Network-Attached Storage Device
Let's say your company has a high-capacity network-attached storage device that holds all of your employees' work product and administrative files. You have an IT manager, and he configured the system to back up to remote hosted (cloud) storage to protect against data loss. But, your company has never done any analysis to determine whether this backup is adequate in terms of capacity or speed of recovery.
To address this, we will implement the J.D. Fox Exec Risk Management Process by performing the following steps.
- Establish scope. The scope will be Disaster Recovery, to ensure your company can restore the files on the storage device within acceptable time limits in the event of failure.
- Business managers establish goals and objectives, and describe risk tolerance levels, based on their knowledge of short- and long-term business requirements, routine operations, and financial position.
- Loss or corruption of employees' work product is not acceptable, except in cases of such widespread calamity that all of your clients are out of business.
- E-mail archives, and old project files, should be protected by reasonable efforts. They are there for convenience, but are not used in ordinary business operations.
- Inability to work due to lack of availability should not last more than one day, and absolutely cannot last longer than three business days.
- Employees can reconstruct work if up to a week's changes are lost, by reviewing e-mails and recalling conference calls that drove the creative process. That would take about two days.
- Reconstructing lost work farther back becomes more expensive in terms of time, and increases the risk that the file won't be reconstructed correctly because of loss of memory of what was said over the phone, since the meetings are fast-paced and note-taking is haphazard.
- Identify and classify relevant assets for protection from risk.
- Develop a list of critical functions and dependencies.
- Identify constraints, such as from statute, regulations, or policies.
- Identify existing business processes, practices, and policies from all departments.
- Perform Business Impact Analysis (BIA).
- Recovery Point Objective (RPO) defines the maximum allowable age of our most recent backup, as specified by management. In our case, the RPO is one week. That is, restoring the files as they were up to a week ago would be acceptable, because the prior week's work can be reconstructed. This is not to say we can't recommend a backup solution that offers a shorter RPO; an RPO of one day is very easy to achieve with a local backup. But, if we end up going with remote backups, the price for more frequent backups can be significant.
- Maximum Tolerable Downtime (MTD) is the maximum amount of time the system can be unavailable. As for our storage device, this is one day. We calculated this from management's designation that the inability to work (due to the storage device being offline) cannot last more than three days, and from the fact that if work is lost, it might take two days to recover it. So, we should have the storage device back online in one day to give users the two days they may need to recover lost work, so they can be back on track within three days.
- Recovery Time Objective (RTO) is how long the actual process of restoring data should take. In our example, in case of total data loss, we have to be able to recover the entire backup in the same day to be able to meet MTD. This is a metric that is often overlooked, by the way, and business managers can be unpleasantly surprised when they have to restore all their data and find out it could take a week to pull everything down from an Internet backup.
- Perform Risk Assessment, which involves these steps:
- Determine threats and vulnerabilities.
- Hard drives fail. Note: a hard drive is a box about the size of your hand that magnetically stores data. There are several of these inside our sample storage device, with each bit of data stored in duplicate across two or more drives. We explain this here because, to this day, a surprising number of people refer to a desktop computer as a "hard drive".
- Malware encrypts or corrupts the files.
- Firmware update to the device bricks it.
- Network port failure or other network or performance problems of unknown cause.
- Your building and all equipment in it are destroyed.
- Determine likelihood and impact.
- Identify the potential events that are currently above risk thresholds given their likelihood and impact.
- Develop potential risk reduction controls for these potential events.
- Analyze these proposed controls. Determine the implementation cost of each control, and by how much it will reduce associated potential losses.
- Document the risk that would remain after the control is implemented.
- Define assessment criteria (metrics), collect baseline data, and devise a plan to collect performance data.
- Create a plan for implementation of approved controls, including collection of performance data.
If you don't approach management correctly, they're prone to say, "Well, back up everything. Don't we already back up everything?" And the answer is, yes, you are backing up everything. But, we need to determine if the backup system in place is adequate. Only if management properly defines what needs to be protected, and for how long, and defines how fast data needs to be recovered and how many different revisions of files need to be kept, can we select the proper upgrade path to take, and ensure you get value from your investments in new equipment or services. Even if our analysis reveals the current system is adequate, this is certainly no waste of time. As you will see, this process establishes a method by which we can regularly monitor the equipment, software, and processes in place to protect your critical assets. With that, you can identify and address any misalignment between your technical recovery capabilities and management's requirements, in a timely manner. So, management can properly support this process by establishing risk tolerance formally. To continue with this example, let's say they specify the three tolerance levels listed under the goals-and-objectives step above, covering work product, archives, and availability.
These are the upper limits. Lower or more granular limits may be set by data owners. In this case, the production employees' supervisor offers the two points about reconstructing lost work that are also listed under that step.
For our example, of course, the asset to be protected is the network storage device, on which all files are saved.
Critical functions include the employees' being able to develop their files. These files represent completed work product for your clients, which are e-mailed or uploaded to clients' cloud storage directly from your local storage.
We won't have any constraints in this example. But, as a sidebar example, a common constraint arises when you store credit card information, which makes you subject to PCI DSS security rules. If this were the case, those rules would have to be considered when deciding where and how to store backups, whether on physical media or in the cloud.
Among existing business practices, an example of something to consider is whether users access the files remotely after hours, or whether the files are left idle all night. If users have files open at all hours, then, depending on the types of files, the backup system may never be able to make a proper copy of them. And if it comes time to restore, you could find the files are all corrupt, and have to be treated as essentially lost completely. If we know that users have files open, we can either limit our options for backup solutions to those that can make clean copies of open files, or influence the department manager to establish and enforce rules requiring users to close files and log off prior to the backup time window.
The Business Impact Analysis takes all the information gathered above and defines the Recovery Point Objective (RPO), Maximum Tolerable Downtime (MTD), and Recovery Time Objective (RTO) for protected assets and functions. Each of these is explained under the BIA step above.
Our MTD and RTO are the same here, but that's not always the case. Read the article on this topic to learn more.
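The arithmetic behind these three figures can be written out as a minimal sketch. The function and field names are illustrative assumptions, not part of any J.D. Fox Exec tooling; the day counts are the ones stated by management and the supervisor in this example. Setting RTO equal to MTD reflects this example only, where the restore is the sole recovery activity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo_days: int  # maximum acceptable age of the most recent backup
    mtd_days: int  # maximum time the storage device may be unavailable
    rto_days: int  # time allowed for the restore itself


def derive_objectives(max_outage_days: int, rework_days: int,
                      reconstructable_days: int) -> RecoveryObjectives:
    """Derive RPO, MTD, and RTO from the limits stated by management."""
    # Users may need rework_days to rebuild lost work once the device is
    # back, so the device must return sooner than the hard outage limit.
    mtd = max_outage_days - rework_days
    if mtd <= 0:
        raise ValueError("rework time leaves no time to restore the device")
    # Assumption for this example only: the restore is the whole recovery,
    # so RTO equals MTD. In general RTO can be shorter than MTD.
    return RecoveryObjectives(rpo_days=reconstructable_days,
                              mtd_days=mtd, rto_days=mtd)


# Three-day hard limit, two days of rework, one week of reconstructable work.
objectives = derive_objectives(max_outage_days=3, rework_days=2,
                               reconstructable_days=7)
print(objectives)
```

Changing any one input, such as a longer rework estimate from the supervisor, shows immediately how little time is left for the restore itself.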
Some of the identified threats are listed under the Risk Assessment step above.
Determining likelihood and impact is actually quite involved. We'll just cover one of the threats: hard drive failure. To determine likelihood, we might look at system logs of the device to determine if there are any errors or warnings about the drives. We must also look at the internal redundancy configuration of the device; some configurations are more susceptible than others to losing data in case a hard drive fails. We can look at the device's history, the reliability ratings for the hard drives, the system's age, and other factors such as whether the room where it is stored is properly cooled. This will give us a way to estimate the likelihood of failure.
As for the impact, we need to look at such simple things as whether we have spare hard drives. We must also examine our current backup system, and perform a test restore to measure how long restoring the data would take.
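As one small piece of the likelihood estimate, the reliability ratings for the drives can be turned into a rough yearly probability that at least one drive fails. This is a sketch under a strong assumption, namely that drives fail independently of each other. The drive count and the 2% annual failure rate are hypothetical inputs, not figures from this example.

```python
def prob_any_drive_fails(annual_failure_rate: float, drive_count: int) -> float:
    """Probability that at least one of drive_count drives fails in a year.

    Assumes independent failures, which is optimistic for drives of the
    same batch and age sharing one enclosure.
    """
    if not 0.0 <= annual_failure_rate <= 1.0:
        raise ValueError("annual_failure_rate must be between 0 and 1")
    if drive_count < 0:
        raise ValueError("drive_count must not be negative")
    # Chance that every drive survives, subtracted from certainty.
    return 1.0 - (1.0 - annual_failure_rate) ** drive_count


# Hypothetical inputs: 4 drives, each with a 2% annual failure rate.
p_failure = prob_any_drive_fails(0.02, 4)
print(f"{p_failure:.1%}")  # prints 7.8%
```

Note that this is the chance of a drive failing, not of losing data. Whether a failed drive leads to data loss depends on the redundancy configuration of the device, which has to be examined separately.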
As mentioned, you already have a remote hosted (cloud) backup service active on your storage device. That is, it copies everything to a remote hosted storage provider. It copied the bulk of your data long ago, and every night it copies new and changed files. Since it backs up every night, this easily meets our RPO. But, with a simple test, we find that pulling down all the data from the backup would take about three days! With a limit of one day, neither our MTD nor our RTO can be met.
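A test restore is the real measurement, but a back-of-the-envelope estimate shows why a result like this should not be surprising. The sketch below assumes a data volume of 8 TB and a sustained download rate of 250 Mbit/s; both numbers are invented to illustrate a restore of roughly three days and do not come from this example. Treating "same day" as 24 hours is also an assumption.

```python
def restore_hours(data_terabytes: float, throughput_mbps: float) -> float:
    """Hours needed to download data_terabytes at a sustained rate in Mbit/s."""
    if throughput_mbps <= 0:
        raise ValueError("throughput_mbps must be positive")
    bits = data_terabytes * 1e12 * 8          # decimal terabytes to bits
    seconds = bits / (throughput_mbps * 1e6)  # Mbit/s to bit/s
    return seconds / 3600


RTO_HOURS = 24  # assumption: a same-day restore means within 24 hours

# Hypothetical: 8 TB of files, 250 Mbit/s sustained from the cloud provider.
cloud_hours = restore_hours(8, 250)
print(f"cloud restore: {cloud_hours:.0f} h, meets RTO: {cloud_hours <= RTO_HOURS}")
```

The same function can be pointed at the measured throughput of any candidate backup system to see whether it has a chance of meeting the RTO before money is spent on it.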
One solution is to implement a local backup system that can restore all the data the same day. We might also plan to reduce the likelihood of hard drive failure that could lead to loss of the entire volume of data.
Or, consider migrating everything to remote hosted (cloud) storage, rely on the cloud storage provider to back up your data within their systems, and get rid of your local storage device. This cloud storage would be different from your current cloud backup, in that the cloud backup only archives the files for restoration when needed, whereas cloud storage would enable users to work on the files directly on the provider's servers.
To analyze these proposed controls, we would start with price quotes for purchase and implementation of a local data backup system.
For reducing the likelihood of hard drive failure leading to data loss, we could add more hard drives to the existing storage system, or add another network-attached storage device, so that additional online copies can be available at all times. However, we quickly find that the expense is not justified, because it is impossible to reduce the likelihood of data loss to zero, meaning you will still need a way to restore data within one day.
For the cloud storage strategy, the cost would be more than just the monthly fees and the migration process. We have to consider the soft costs of reduced performance and control over your data that come with moving to remote storage, and see how this might impact productivity. In addition, we would have to involve your security team to a great extent to examine the implications. This is where having discovered and analyzed business processes, policies, and constraints (in an earlier step) will prove useful.
Let's say the local backup looks like the best option. We can demonstrate to management that without the local backup, your system is beyond their risk thresholds. We advise management of the current likelihood of the system going down, and the lost revenue and other costs such an outage would incur when it takes more than three days to recover. We then show how the likelihood of this happening is reduced to near zero if a local backup is put in place, and compare those cost savings with the cost of purchasing, installing, and maintaining the new local backup. Management will then assess whether to invest in the local backup system, or adjust their risk tolerance to accept this risk.
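The comparison presented to management can be sketched with the common annualized loss expectancy calculation: the cost of one event multiplied by how often it is expected per year. This example deliberately omits quantitative analysis, so every figure below is a hypothetical placeholder, and the function names are illustrative.

```python
def annual_loss_expectancy(loss_per_event: float, events_per_year: float) -> float:
    """Expected yearly loss: cost of one event times its yearly frequency."""
    return loss_per_event * events_per_year


def net_annual_benefit(ale_before: float, ale_after: float,
                       annual_control_cost: float) -> float:
    """Reduction in expected loss minus what the control costs per year."""
    return (ale_before - ale_after) - annual_control_cost


# All figures are hypothetical placeholders, not data from this example.
outage_loss = 60_000.0  # lost revenue and other costs of a 3+ day outage
ale_before = annual_loss_expectancy(outage_loss, 0.10)  # once in ten years
ale_after = annual_loss_expectancy(outage_loss, 0.01)   # with local backup
benefit = net_annual_benefit(ale_before, ale_after, annual_control_cost=3_000.0)
print(f"net annual benefit: {benefit:,.0f}")
```

A positive result supports investing in the control. A negative one suggests either looking for a cheaper control or having management formally accept the risk.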
"Assessment criteria" is a list of the measurements we will take to evaluate the reliability and performance of the critical functions of the risk control we put in place. In this case, our criteria would be whether the backups complete every night, and whether restoring the data can be done in the required time.
"Baseline data" is the initial measurement based on the assessment criteria, to be compared to updated measurements over time to identify trends that need to be addressed. In this case, we would measure the time it takes to restore. You would continue to collect performance data, because as your files grow, you will need to adjust the backup strategy once your data can't be restored within the MTD/RTO anymore. By measuring regularly and observing the growth trends, this will allow you to initiate plans to upgrade the backup system with enough time to implement the upgrade before the restore time passes the acceptable threshold.
You could also make operational adjustments to reduce restore time as your files grow, such as working with departmental managers to separate and archive aging files to remove them from the backup set. As you may recall, old files were designated to be preserved with reasonable effort, meaning you don't pay special attention beyond routine backups, and don't need to measure how long it takes to restore. Depending on the cost of the cloud backup system that was in place all along, and the local backup capacity, we could either keep backing up old files to the cloud backup, or move them to the local backup system. On the local backup system, archived files would be in a separate backup set, so their presence would not affect restore time of the current files.
The previous paragraph demonstrates again why Disaster Recovery is not an IT-only function. It works best when business procedures and operations are understood by the risk manager in charge of developing and maintaining your Disaster Recovery plan.
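The growth-trend monitoring described for baseline data can be sketched as a straight-line fit through periodic test-restore measurements, projected forward to the point where restore time would exceed the limit. The monthly measurements and the 8-hour threshold are hypothetical, and a straight line is only an assumption about how your data grows.

```python
from typing import Optional, Sequence


def months_until_threshold(monthly_restore_hours: Sequence[float],
                           threshold_hours: float) -> Optional[float]:
    """Months after the latest measurement until the trend crosses the threshold.

    Fits a least-squares line through measurements taken one month apart.
    Returns None if the fitted trend is flat or decreasing, and 0.0 if the
    trend has already reached the threshold.
    """
    n = len(monthly_restore_hours)
    if n < 2:
        raise ValueError("need at least two measurements")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(monthly_restore_hours) / n
    slope = (sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(xs, monthly_restore_hours))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_month = (threshold_hours - intercept) / slope
    # Count from the most recent measurement, which sits at month n - 1.
    return max(0.0, crossing_month - (n - 1))


# Hypothetical test-restore durations in hours, one per month.
history = [5.0, 5.5, 6.0, 6.5]
print(months_until_threshold(history, threshold_hours=8.0))  # prints 3.0
```

If the projected lead time is shorter than the time needed to purchase and install an upgrade, that is the signal to start planning now.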
Install the backup system, and collect the performance data according to the plan created in the previous step. Schedule test restores to measure restoration time and ensure that the backups are being successfully performed.
Epilogue
It seemed like your cloud backup was sufficient. You logged in to the backup provider's website every so often, and saw all your files were there. But, as you can see, once we consulted with top-level and departmental managers about their operational requirements, and applied some analysis to your technical capabilities, we revealed that your current system was inadequate, and were able to identify a solution. This solution, due to our rigorous analysis, has value that is real and documented. While this value isn't in the form of revenue, management will have a record of their risk position before and after implementation, showing that anticipated lost revenue from potential outages has been reduced by more than the cost of the backup system upgrades.
As mentioned, this is only a fraction of what we need to assess and evaluate for your business, and that's just for Disaster Recovery. Add Information Security and Business Continuity to the mix, and we can find quite a number of potential threats and vulnerabilities, identified and analyzed through each of these disciplines, that will justify investments or procedural changes that will protect your profits. During this process, we will also identify any subscriptions or services related to security or continuity that are not providing the expected value, which can then be discontinued, providing instant cash flow benefit. The Risk Management processes for Disaster Recovery, Business Continuity, and Information Security have significant overlap, which is why you get the best value for this kind of work by performing comprehensive risk management for all three of these disciplines at the enterprise level.
Which is better, finding out you could have done all this after disaster strikes, or getting started now? Contact J.D. Fox Exec today.