MTD, RTO, and RPO Explained
Introduction
As we've discussed elsewhere on this site, Information Security and Disaster Recovery planning require business managers to establish performance objectives based on their intricate knowledge of business operations and goals. The more conspicuous of these objectives relate to availability of IT systems following an outage, and recoverability of data after it's been accidentally deleted or corrupted. The three most important are:
- Recovery Point Objective (RPO)
- Maximum Tolerable Downtime (MTD)
- Recovery Time Objective (RTO)
These objectives function as thresholds for Information Security, Disaster Recovery, and Business Continuity activities, so you might see them referred to as thresholds throughout this site or while working with J.D. Fox Exec. Specifically, as part of your Information Security program, you will implement high-availability configurations (redundant equipment, primarily) to ensure that your IT system remains available within these thresholds in the face of routine problems and equipment failures. In case of a problem that exceeds the capabilities of your high-availability design, Disaster Recovery planning addresses replacing equipment, and/or recovering data from backup, to restore support for critical business functions within these thresholds. Finally, Business Continuity planning addresses how to sustain or recover critical business activities in cases where an IT system failure is so severe that the Disaster Recovery plan cannot meet these same thresholds.
Recovery Point Objective (RPO)
RPO defines the maximum allowable age of the most recent data backup. A data backup is a static copy of files and/or databases that is kept completely separate from the storage that is accessed by users. If a storage device fails, files are accidentally or maliciously deleted, or a database becomes corrupt due to a hardware or software malfunction, the files or databases can be recovered from the backup.
The RPO is defined by business and departmental managers, and any designated data owners. It is described as a period of time. For example, the recovery point may be "one hour", "end of previous business day", or "one week". It is derived based on:
- How often data is updated;
- How much expense and/or effort would be required of your users to reconstruct data created or updated since the last backup, if possible;
- If reconstruction wouldn't be possible, how much recent data your company can tolerate losing permanently, considering the likelihood of a catastrophic data loss event.
Once the RPO is set, you will design and maintain your backup system to meet it.
The primary constraint that may prevent meeting the RPO is how long it takes to make each backup, which is determined by the data size and how fast the data can be copied to the backup system. For example, if it takes more than 24 hours for the daily backup to complete, you won't be able to meet a one-day RPO because the next day's backup won't start on time. This may not be the case when the backup system is initially set up, but the situation may develop as the data grows (which is why all automated backup systems need to be monitored carefully).
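As a rough illustration of this constraint, the following Python sketch estimates how long a full backup takes from the data size and a sustained copy rate, and projects how many months of data growth it would take for the backup to overrun its interval. The function names, the 50 MB/s throughput, and the 5% monthly growth rate are assumptions chosen for illustration, not measurements; incremental, compressed, or deduplicated backups will behave differently.

```python
def backup_duration_hours(data_gb, throughput_mb_per_s):
    """Estimated hours to copy data_gb at a sustained rate (1 GB = 1000 MB)."""
    seconds = (data_gb * 1000) / throughput_mb_per_s
    return seconds / 3600


def fits_interval(data_gb, throughput_mb_per_s, interval_hours):
    """True if one backup finishes before the next one is due to start."""
    return backup_duration_hours(data_gb, throughput_mb_per_s) <= interval_hours


def months_until_overrun(data_gb, monthly_growth, throughput_mb_per_s,
                         interval_hours, horizon_months=120):
    """Months of compound growth before the backup no longer fits.

    Returns None if it still fits at the end of the horizon.
    """
    size = data_gb
    for month in range(horizon_months + 1):
        if not fits_interval(size, throughput_mb_per_s, interval_hours):
            return month
        size *= 1 + monthly_growth
    return None


# Hypothetical office: 2000 GB today, copied at a sustained 50 MB/s.
hours_today = backup_duration_hours(2000, 50)        # about 11.1 hours
ok_today = fits_interval(2000, 50, 24)               # fits a daily schedule
overrun_month = months_until_overrun(2000, 0.05, 50, 24)
```

Running a projection like this on a schedule, using measured rather than assumed figures, is one simple way to monitor for the overrun described above before it happens.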
If a backup system isn't able to meet the RPO, you may not necessarily have to purchase more backup equipment, or a faster Internet connection in the case of remote backups. You can employ other techniques, such as segmenting data, which can often provide a more durable solution than incrementally increasing backup system capacity. For example, you may be unable to back up everything nightly and thus can't meet an RPO of one day, but if you can separate data that might not require a daily RPO from that which does, you can then successfully back up the daily RPO data every night.
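The segmenting approach can be sketched in Python as follows. The data set names, sizes, RPO values, throughput, and the 10-hour nightly window are all hypothetical; the point is only to show that separating out data with a longer RPO can bring the nightly job back within its window.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_gb: float
    rpo_hours: float  # RPO assigned by the data owner


def copy_hours(datasets, throughput_mb_per_s):
    """Hours to copy all given data sets back to back (1 GB = 1000 MB)."""
    total_mb = sum(d.size_gb for d in datasets) * 1000
    return total_mb / throughput_mb_per_s / 3600


def plan_nightly(datasets, throughput_mb_per_s, window_hours):
    """Split data by RPO and report whether the nightly segment fits."""
    nightly = [d for d in datasets if d.rpo_hours <= 24]
    deferred = [d for d in datasets if d.rpo_hours > 24]
    fits = copy_hours(nightly, throughput_mb_per_s) <= window_hours
    return nightly, deferred, fits


# Hypothetical inventory.
inventory = [
    Dataset("accounting_db", 300, 24),
    Dataset("shared_files", 1200, 24),
    Dataset("video_archive", 6000, 168),  # a weekly RPO is acceptable here
]

everything_fits = copy_hours(inventory, 50) <= 10   # whole set in one night?
nightly, deferred, nightly_fits = plan_nightly(inventory, 50, 10)
```

In this made-up inventory, backing up everything takes far longer than the window, while the two daily-RPO data sets alone fit; the archive would then be scheduled separately at a frequency that matches its own RPO.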
If your system is capable of making backups more frequently than the RPO requires, you can either increase the backup frequency, or keep the frequency the same and use the extra capacity to retain more archived copies of older backups, depending on which is more valuable to management.
For a typical business office with ordinary files, the RPO is commonly one day, as a daily backup system is generally inexpensive to implement and maintain, and having to reconstruct one day's work in the unlikely event of a significant data loss is an acceptable risk. You might see an RPO of one hour or less in a business that runs a high-volume transactional database, such as one serving a busy e-commerce website.
Redundant storage is quite common, and might seem like it offers an RPO of "zero". It can be as simple as a storage device having two hard drives inside and saving every bit of data on both drives simultaneously; if one hard drive fails, all data remains available. Or, you could have a complex system like a high-end database server shipping transaction logs in real time to a secondary database server at a remote site; if the primary database server fails, the other one can pick up where it left off. Redundancy is a common and effective strategy to provide for continuous availability where management's requirements justify the investment. However, it is not a replacement for backup. A redundant system faithfully copies deletions and corruption along with everything else, so only with a backup can you recover deleted files, or a database that becomes corrupt due to user error or a software malfunction.
Maximum Tolerable Downtime (MTD)
MTD is the maximum amount of time an application or data can be unavailable to users, as specified by business management. This is based on the impact on business functions, and analysis of anticipated lost revenue and other costs that are incurred for every hour, day, or week a given application or database might be unavailable.
This threshold is used during Disaster Recovery and Business Continuity planning at the executive level. However, because the MTD is an operational determination rather than a technical one, IT systems managers may need to be able to restore technical function well before the MTD expires. This is because, once applications or data are restored after a crash, the system will not be considered fully operational until the users catch up on any work they missed during the outage, and the MTD clock keeps ticking until they do. In that regard, the MTD serves as the basis for computing a more granular threshold, the RTO.
Recovery Time Objective (RTO)
RTO is generally a technical consideration, determined by the IT department. It defines how quickly you should be able to recover a software function, replace equipment, and/or restore lost data from backup, following an outage or data loss event. For every piece of equipment, software application, and database, your IT systems manager should examine the applicable MTDs, make a diagram of IT component dependencies, and work backwards through those dependencies to determine how much time can be allocated to recover each technical component. This diagram must also include the time, mentioned above, that business managers and users will need to validate that the data was recovered correctly, and to catch up on work, all within the MTD.
The term Work Recovery Time (WRT) may be used to refer to this catch-up work. Generally, you'll see it referenced in environments where this catch-up work might be significant. In cases where the data and functions to be recovered are relatively simple, the WRT can be very short, nonexistent, or ill-defined, in which case it doesn't need to be itemized. For example, if the e-mail boxes for users who send and receive only a few e-mails per hour go offline for half a day, the e-mail system will be considered fully operational immediately once they're brought back up.
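One way to picture working backwards from the MTD is the Python sketch below. It subtracts the catch-up allowance from the MTD, then walks the dependency diagram from the business-facing function down to the underlying components, giving each a time by which it must be restored. The component names, hour figures, and function names are hypothetical, and the sketch assumes that a component cannot be recovered until everything it depends on is back, while unrelated components can be recovered in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def restore_deadlines(depends_on, recovery_hours, mtd_hours, wrt_hours):
    """Return component -> hours after the outage by which it must be back.

    depends_on:     component -> components it needs before it can be recovered
    recovery_hours: component -> estimated hours to recover it
    mtd_hours:      business-facing component -> MTD set by management
    wrt_hours:      business-facing component -> catch-up allowance, if itemized
    """
    # static_order() lists dependencies first; reversing it visits every
    # component before the components it depends on.
    order = list(TopologicalSorter(depends_on).static_order())
    deadline = {}
    for comp in reversed(order):
        if comp in mtd_hours:
            budget = mtd_hours[comp] - wrt_hours.get(comp, 0)
            deadline[comp] = min(deadline.get(comp, budget), budget)
        if comp not in deadline:
            continue  # no MTD applies to this component
        start_by = deadline[comp] - recovery_hours[comp]
        for dep in depends_on.get(comp, ()):
            deadline[dep] = min(deadline.get(dep, start_by), start_by)
    return deadline


def cannot_meet(deadline, recovery_hours):
    """Components whose estimated recovery time exceeds their deadline."""
    return sorted(c for c, d in deadline.items() if recovery_hours[c] > d)


# Hypothetical branch office: file access has a 12-hour MTD, 2 hours of catch-up.
depends_on = {
    "file_access": ["file_server", "site_link"],
    "file_server": ["storage"],
    "site_link": ["router"],
}
recovery_hours = {"file_access": 0, "file_server": 2, "storage": 4,
                  "site_link": 1, "router": 4}

deadlines = restore_deadlines(depends_on, recovery_hours,
                              {"file_access": 12}, {"file_access": 2})
problems = cannot_meet(deadlines, recovery_hours)
```

With these made-up figures, the storage must be back within 8 hours and the router within 9 for file access to be fully operational inside the MTD. If `cannot_meet` returns any components, either their recovery times must be shortened (for example with redundancy or a faster warranty response) or management must revisit the MTD.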
Redundancy, repair times, the availability and skill of technicians, and, of course, how long it takes to restore data from your backup system all factor into the RTO. Technical implementation to meet an RTO can be straightforward, like having a 4-hour response warranty on your Internet router. Or, it can be quite involved in the case of a large IT system spread out among different locations, where site-to-site connectivity, routers, switches, physical servers, storage, and software function all have to be accounted for to ensure that a remote branch office can regain access to files within the MTD, should any of these systems fail or malfunction.
On high-volume systems (those with a large number of transactions or users), management may find it valuable to define gradations of the MTD, to avoid over-investing in redundant equipment, software licenses, and on-call technicians. For example, a system may be allowed to be down for four hours once per quarter, but down all day only once per year. Or, 25% of users can be offline for a whole day, but if 75% or more of users are disconnected, access must be restored within four hours. Or, a website can be down two hours during the day, but up to six hours at night. You may see articles on other sites using additional terms for such gradations (such as Maximum Tolerable Period of Degradation). However, you will always find special considerations when managing MTD, and these should be handled as an integral part of determining RTOs as described above, so we won't try to categorize them or assign acronyms unnecessarily.
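A graded MTD of this kind can be written down as a small lookup table. The Python sketch below encodes the users-offline example; the tier values and the function name are hypothetical, and treating every outage below the 75% mark as falling under the whole-day allowance is an assumption, since management would need to define what applies in between.

```python
# (minimum fraction of users offline, MTD in hours). Hypothetical tiers.
TIERS = [
    (0.00, 24.0),  # partial outage: may last a whole day
    (0.75, 4.0),   # 75% or more of users offline: restore within four hours
]


def graded_mtd(fraction_offline, tiers=TIERS):
    """Return the strictest MTD whose threshold the outage has reached."""
    if not 0.0 <= fraction_offline <= 1.0:
        raise ValueError("fraction_offline must be between 0 and 1")
    applicable = [mtd for threshold, mtd in tiers
                  if fraction_offline >= threshold]
    return min(applicable)


partial = graded_mtd(0.25)   # 24.0
severe = graded_mtd(0.80)    # 4.0
```

Each tier's MTD can then feed the same RTO analysis as any other MTD, which keeps the gradations part of one process rather than a separate category.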