Disaster Recovery and Business Continuity Examples
Here are some vignettes showing Disaster Recovery and Business Continuity in action.
This first example will illustrate the progression from non-disaster outages handled by IT, to Disaster Recovery scenarios, to full Business Continuity plan execution.
Let's say your business has dozens of graphic designers, and they use an application that not only stores their design files, but links into a project management database so that files can be organized and located hierarchically or by tags. Through this system, users record project notes for the files or groups of files and communicate with teams assigned to each project.
The servers hosting your database and project files run in virtual machines on a physical application server in your server room, using a network storage device to store the massive amounts of graphic data you have. Your application server is actually part of a high availability cluster along with a similar physical server, meaning that if the primary physical server goes offline, the virtual machines will re-launch on the secondary server. You have other servers in place that perform other functions as well.
In order to avoid significant financial loss or falling behind in deadlines set by clients, management has determined that the servers cannot be offline for more than a day.
So, let's say early one morning the application server's power supply goes haywire and fries the motherboard. In this case, the virtual machines automatically reboot on the other physical server, and services resume after a few minutes (this is called failover). The IT department observes the server crash, and checks the function of the application to ensure the failover went smoothly. They will also set about fixing the failed server. This is an example of a redundant configuration handling a hardware failure without having to initiate Disaster Recovery procedures. Depending on the capabilities of your equipment and software, the failover of the virtual machines may happen so efficiently that the users don't even know there was a problem.
Consider what happens if the secondary server goes down before the failed server is repaired. Or, what if when the virtual machines failed over, they failed to boot, and the entire application and its data were corrupt? Or, what if the shared storage device, which holds your virtual machines and the database, crashes? In any such case, you will need significant intervention to avoid more than one day's downtime. If your business had performed proper Disaster Recovery planning, you would have backups of your database and your virtual machines, you would have identified which virtual machines are required to resume critical business operations, and you would have a plan for restoring those virtual machines on alternate hardware.
If this happens, your IT department will initiate these planned Disaster Recovery procedures. The process might involve restoring the virtual machines and the database from the backup system to one of the other physical servers. Virtual machines not required to support critical business functions will remain off until the hardware repairs are complete. To enable this temporary server to run the critical virtual machines, some or all of the non-essential functions that normally run on this server might have to be shut down, to avoid overloading it. Since this is a Disaster Recovery operation, IT would communicate with designated managers, who would then relay information to the design department, and anyone who normally uses the non-essential services that were shut down, advising of the estimated times that services will be restored.
In addition, when you recover from backup after total loss or corruption, files that were created or edited since the last backup may not be recoverable, which means recent work will be lost. The maximum amount of work that may be acceptably lost will have been determined during Disaster Recovery planning through input from the design team managers, and this information will have determined the budget for the backup system. This again illustrates how input and planning from managers and departments outside IT will give value to your Disaster Recovery plan, and why it is critical this be done before disaster strikes.
Once the virtual machines and database are recovered from backup, and any additional data is reconstructed to the extent possible, IT will focus on repairing the two failed servers, in order to return IT services to full capacity.
What if, though, the database had grown so large that it was going to take three days to restore from the backup system? Or, worse, what if IT discovered after the catastrophic crash described above that the daily backup system hadn't successfully backed up for two months? Of course, these are failures that should be avoided by monitoring and testing the Disaster Recovery plan. But, once a disaster hits and the Disaster Recovery plan fails, you need to face this new challenge. And that, of course, will be addressed by the Business Continuity plan.
The Business Continuity plan, if properly developed, will define steps to salvage the company's reputation and resume work on existing projects in the event the IT systems manager makes the unfortunate report that two months' work has been lost. As you can see, especially after we've described the progression of events to get to this point, this process would be almost entirely outside the realm of IT. Project schedules may have to be adjusted, insurance claims made, financial compensation made to clients, and non-technical recovery of information employed (such as asking clients to send back whatever design files your designers may have uploaded into their systems).
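The two failure modes in this example (a restore that would take longer than the allowed downtime, and a backup job that silently stopped succeeding) are the kind of thing routine monitoring can flag before a disaster. The following is a minimal sketch only; the function names, parameters, and sample figures are hypothetical and not part of any particular backup product.

```python
from datetime import datetime, timedelta


def estimated_restore_hours(data_size_gb: float, restore_rate_gb_per_hour: float) -> float:
    """Estimate how long a full restore would take at a measured restore rate."""
    if restore_rate_gb_per_hour <= 0:
        raise ValueError("restore rate must be positive")
    return data_size_gb / restore_rate_gb_per_hour


def backup_is_stale(last_success: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the last successful backup is older than allowed."""
    return now - last_success > max_age


def check_recovery_readiness(data_size_gb, restore_rate_gb_per_hour,
                             last_success, now, max_downtime_hours, max_backup_age):
    """Collect warnings; an empty list means neither problem was detected."""
    warnings = []
    hours = estimated_restore_hours(data_size_gb, restore_rate_gb_per_hour)
    if hours > max_downtime_hours:
        warnings.append(
            f"restore would take about {hours:.0f} h, limit is {max_downtime_hours} h"
        )
    if backup_is_stale(last_success, now, max_backup_age):
        age_days = (now - last_success).days
        warnings.append(f"last successful backup is {age_days} days old")
    return warnings


if __name__ == "__main__":
    # Hypothetical figures: 7,200 GB restored at 100 GB/hour, one-day downtime limit,
    # and a daily backup that last succeeded 60 days ago.
    now = datetime(2024, 3, 1)
    for warning in check_recovery_readiness(
        7200, 100, now - timedelta(days=60), now, 24, timedelta(days=1)
    ):
        print("WARNING:", warning)
```

The restore rate should come from an actual test restore rather than a vendor figure, since that is the number the one-day limit depends on.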
The Network Switch
This next example will demonstrate how the same incident may have vastly different effects depending on your level of preparation.
Let's say you have a sprawling office suite with about thirty workstations. Users can tolerate a few hours' downtime. However, having your network inaccessible all day is unacceptable to management, as it would cost your business too much in lost revenue and reputation.
One day, your high-capacity network switch fails. It just shuts off and won't turn on. Your users cannot connect to your shared storage device, network printers, or the Internet, and the phones are down, until this is fixed. Is this a disaster?
Well, in such a scenario, here are conditions that would make it not a disaster:
- Your IT department has a pre-configured spare on the shelf, and a technician is on-hand or nearby to install it.
- If you don't have a spare, you have a warranty or service contract with the manufacturer to replace it within four hours, and your IT department has a backup of the configuration that can be quickly applied to the new unit.
In this case, the outage can be handled by IT, and it does not require activation of any Disaster Recovery plan. However, let's say that any one of the following conditions unfolds:
- You have a spare switch only, and no warranty. IT gets the spare off the shelf, and they find that it won't turn on. You need to order a new switch, and the model you currently use (the one that just failed) is not available, so the configuration will have to be rebuilt manually in the new model when it arrives—two days from now.
- You have a spare switch, which IT installs and powers on, but for some reason network communications are not working. Your technicians can't figure out why, and they can't give you an estimate for when it will be fixed.
- You don't have a spare, only a four-hour replacement warranty. But, it's late Monday afternoon, and your warranty provider surprisingly says they'll deliver the new one on Wednesday. You angrily escalate to the highest levels, but cannot get past the powerless service representatives at their foreign call center, and the company's headquarters office on the East Coast is closed already.
- You don't have a spare, only a warranty. The new device arrives two hours later, but IT can't find the configuration backup, or even documentation on how it should be configured. Your IT department has to manually reconfigure the VLANs, SVIs, ACLs, etc., to ensure proper transmission of your data and voice traffic. Service may be unavailable for more than a day while they figure this out.
So, you had a high availability configuration and plan, but it was inadequately tested for this kind of outage. What initially, on paper, should not have been a disaster, is now threatening to cause your system to be unavailable beyond the acceptable length of time. You have a disaster!
Your Disaster Recovery plan will include any condition that will cause the network switch to be offline for more than a day as a Disaster Recovery Alert Indicator. So, IT will inform management of these technical problems. Operations managers and IT will together review the Disaster Recovery plan, and see that there are no further technical measures to take—the switch simply needs to be replaced and reprogrammed, which is already in progress. Since it will take longer than is acceptable, management will promptly implement the Business Continuity plan, which may involve relocating key users to another facility to gain network access, moving your application software to that network, transferring up-to-date data to the new facility, and contacting your voice service provider to redirect inbound phone calls. As you can imagine, all of this should be tested and rehearsed regularly, just like your Disaster Recovery plan.
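The alert indicator described above comes down to comparing the current repair estimate with the downtime management will tolerate. Here is a minimal sketch of that comparison; the function name and the 24-hour limit are assumptions (one reading of "more than a day"), and treating a missing estimate as a disaster is also an assumption.

```python
from typing import Optional


def classify_outage(estimated_hours: Optional[float], max_tolerable_hours: float) -> str:
    """Compare the repair estimate with the tolerated downtime.

    estimated_hours is None when technicians cannot give an estimate.
    Returns "routine" (handled by IT) or "disaster" (alert management).
    """
    if estimated_hours is None:
        # Assumption: with no estimate, we cannot count on a fix arriving in time.
        return "disaster"
    if estimated_hours <= max_tolerable_hours:
        return "routine"
    return "disaster"


if __name__ == "__main__":
    limit = 24  # hypothetical limit, in hours
    print(classify_outage(1, limit))     # pre-configured spare on the shelf
    print(classify_outage(48, limit))    # replacement arrives in two days
    print(classify_outage(None, limit))  # technicians cannot give an estimate
```

The check is only useful if it is repeated as the situation changes: a four-hour warranty replacement starts out as "routine" and becomes a "disaster" the moment the provider says Wednesday.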
Here is an example of an incident that instantly exceeds the scope of your Disaster Recovery plan, by design.
Your company has local applications and a very big database that your users rely on. You've done Risk Management, and have thoroughly documented the financial and intangible costs for outages of various lengths of time. Based on this, company management approves investments in redundant onsite equipment and continuous onsite data backups, limited offsite replication of your virtual machines, and nightly offsite backups. The offsite replication and backups are not designed to be made immediately available when needed. They only offer the capability for IT, in the event your entire facility is somehow destroyed, to rebuild the network at a new facility and resume business activities, with some financial loss due to the length of the outage and even some loss of recent data. More robust offsite replication and backup was simply not in the budget, because the chances they would ever be needed were very small, and management decided to simply accept that risk.
Your Disaster Recovery plan involves manually activating or installing redundant equipment when hardware fails, failing over from one server to another in case of application or database corruption, and restoring data using your fast local backup system when users accidentally wipe out files or database records. And the plan serves you well for many years.
Then, one night, the unthinkable happens. Your office building catches fire, and when you arrive for work the next morning, you find nothing but blackened, twisted ruin. Your business has suffered a loss it never budgeted to prevent, and for which there was no Disaster Recovery plan. That was a calculated decision, and, because of it, a Business Continuity plan had been developed and was in place.
To restore business functions, the Business Continuity plan will call for acquiring a new facility, server equipment, workstations, and telephones, and then importing the virtual machines and data from the offsite backup. If everything goes according to the plan, the business will be out of operation for several weeks, and at least the last day's work prior to the fire will be lost. However, with appropriate insurance, and, again, a conscious decision to accept this risk, the costs in direct expenditures for the purchases, indirect expenses for recreating the lost data, lost income from being out of operation, and lost reputation were all deemed acceptable by management. If the risk tolerance designated by management would not allow for this length of downtime or this level of financial, data, or reputation loss, then the offsite replication and backup system would have to be much more robust, such as including spare equipment already in place at an alternate location, for faster recovery. All of this will be determined when proper and thorough Risk Management is performed during development of Disaster Recovery and Business Continuity plans.
Most of the examples on this page describe on-premises IT systems. If you've moved some or all of your IT system into the cloud, this doesn't change anything as far as the need for Disaster Recovery and Business Continuity planning.
Even if your office has a very light IT footprint and relies on hosted services, Disaster Recovery can still apply to you. Let's say your company e-mail is hosted with Microsoft Office 365 or Google G Suite. These service providers occasionally go offline for several hours. If you have a small business and the IT system at your office consists of your users' laptops, your Internet access router, and Wi-Fi access points, this will probably be tolerable.
But have you thought about data loss? Hosted service providers back up their own systems so that if they have massive data center failures they can restore service. They don't generally offer backups that customers can access, however. So if you accidentally delete your data and it's purged from their system, you're not getting it back.
We've seen that accidental data deletion with hosted service providers is more common than with on-premises equipment, because of the per-user, decentralized nature of storage in Microsoft OneDrive or Google Drive. Also, systems managers might overlook how data can be deleted, in ways that won't happen with on-premises systems. For example, Andy is about to leave the company and he e-mails a bunch of large files to another employee, Bertha, from his laptop. The hosting provider's system doesn't actually attach the files to the e-mail; it automatically saves them to Andy's online storage (OneDrive or Google Drive), grants permission for Bertha to access them, and inserts a link to the files in the e-mail Bertha receives. Then, Andy leaves the company, his laptop is wiped, and his online services account is deleted. Later, Bertha tries to access those files directly from her e-mail inbox, and finds the files are gone, and cannot be retrieved.
A more devastating scenario might involve an employee deleting files on purpose. If no one notices prior to end of the recovery period and you don't have your own backup, those files are gone forever.
The simple answer is to implement a backup; there are several hosted services that will pull data stored in your Microsoft or Google business accounts and archive them, out of the reach of users, for a monthly fee. You might have intuitively understood this should be done, but an astounding number of small businesses who use these hosted services never think of it. This, and other similar considerations you might not have thought of, will be addressed if you perform at least some basic Risk Management and Disaster Recovery planning. And, as you've already learned, when a backup is performed as part of an ongoing Risk Management program, it is much more likely to be there when you need it, than when backup is configured in response to a directive from management and then paid no more mind.
Going back to service outages, let's say you have hundreds of employees around the state, and they rely entirely on e-mail, instant messaging, and sharing files through the Microsoft or Google cloud platform to do their work in a very fast-paced business. Maybe your corporate culture is one where your top management are the kind that always seem to call with an emergency, needing this or that file right away. Should you have some sort of Business Continuity plan for service outages? Such services have been down for an entire day, and even several days, in the past, and it's remotely possible that a longer outage could occur in the future.
Think e-mail being down would be a problem? What if your company's entire phone system is hosted remotely? The same thinking applies to applications you run that exist entirely in your web browser, such as Salesforce.com, NetSuite, or QuickBase.
Or, to make it bigger, imagine a company that provides online services, and has everything in a cloud hosting provider, such as Amazon Web Services (AWS). By everything, we mean their web servers, version control system, development platforms, and staging servers are all in EC2 virtual machines, and a tremendous amount of video content is in Amazon's S3 data storage system. AWS has had long outages in the past, so extensive they were in the news.
The answer for all of this, as with on-premises systems, is to conduct Risk Management, determining how much your business will lose if any of these go down, and the likelihood this will happen.
For our example web services company with their own applications running in AWS virtual machines, their analysis of potential financial and intangible loss due to outages could be used to set a budget for implementing Disaster Recovery capability in the cloud. For example, they could have web server virtual machines outside of AWS (such as in Microsoft Azure or IBM Cloud), and a replication system to enable bringing them up within a designated amount of time, while only losing so much data or so many customer transactions up to a threshold approved by management.
Of course, you can't really do Disaster Recovery if fully-hosted e-mail, web-based productivity applications, or voice services go down, because they're not your IT systems to manage. So, the response to service outages would be to implement Business Continuity. For e-mail, this might involve subscribing to a third-party e-mail continuity service, which can provide an alternate website outside your regular provider through which users can access and reply to inbound mail and, depending on the capabilities of the provider, present past e-mails as well (from before the current outage). For communications, there is a wide range of redundant systems you can implement, which will offer various utility depending on price, technical capabilities, and level of user training required to ensure that, if you need to move everyone to the new system, the procedure would be successful.
No matter how small or big your company, you have to consider your Internet connection itself. If that goes down, then it's like all your hosted services went down at once. Redundant Internet connections at your facility are an intuitive solution, but proper Risk Management and functional testing should be employed to determine whether the increased cost is justified. Cost can be a huge factor, because you may be in a location where you have only one viable option for Internet, and the only backup system available uses wireless (microwave) or satellite, which can be very expensive.
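Whether a mitigation such as a second Internet connection is justified can be roughed out by weighing its yearly cost against the loss it is expected to prevent. The sketch below shows one simple way to do that arithmetic; the function names and every figure are hypothetical, and a real analysis would also account for intangible losses such as reputation.

```python
def expected_annual_loss(loss_per_hour: float, hours_per_outage: float,
                         outages_per_year: float) -> float:
    """Expected yearly loss = loss per hour x hours per outage x outages per year."""
    return loss_per_hour * hours_per_outage * outages_per_year


def mitigation_is_justified(loss_without: float, loss_with: float,
                            annual_mitigation_cost: float) -> bool:
    """A mitigation pays for itself when it prevents more loss than it costs."""
    return (loss_without - loss_with) > annual_mitigation_cost


if __name__ == "__main__":
    # Hypothetical figures: an outage costs 500 per hour, happens twice a year,
    # and lasts 6 hours on a single link versus half an hour with a backup link.
    single_link = expected_annual_loss(500, 6, 2)
    with_backup = expected_annual_loss(500, 0.5, 2)
    print(mitigation_is_justified(single_link, with_backup, annual_mitigation_cost=3600))
    print(mitigation_is_justified(single_link, with_backup, annual_mitigation_cost=9000))
```

With these made-up numbers a backup link costing 3,600 a year is worth it and a satellite link costing 9,000 a year is not, which is the kind of conclusion the analysis is meant to surface before money is spent.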
The length of an outage is the most intuitive metric for measuring its impact, and for establishing thresholds to support Disaster Recovery and Business Continuity planning. For most businesses, it's pretty much all that matters. But, are there cases where considerations other than downtime will impact critical functions?
One common metric is the scope of the outage in terms of number of users or customers affected, along with downtime. For example, an online service provider may accept 10% of their customers being down for up to eight hours, but if an incident happens where 50% or more customers are affected, it must be resolved within an hour. Or, so long as at least two of your mobile sales people can get into the system and process orders, a problem with access to the system can be given more time to resolve than if all of them are locked out.
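A scope-based threshold like the one above can be written down as a small table of tiers. This is a sketch under stated assumptions: the tier table and function name are hypothetical, and the text does not say what applies between 10% and 50% of customers, so this version assumes the eight-hour limit holds for anything below 50%.

```python
# (minimum fraction of customers affected, hours allowed to resolve), strictest tier first.
# Assumption: any incident affecting fewer than 50% of customers gets the 8-hour limit.
TIERS = [(0.50, 1.0), (0.0, 8.0)]


def allowed_resolution_hours(fraction_affected: float, tiers=TIERS) -> float:
    """Return the time allowed to resolve an incident of the given scope."""
    if not 0.0 <= fraction_affected <= 1.0:
        raise ValueError("fraction_affected must be between 0 and 1")
    for min_fraction, hours in tiers:
        if fraction_affected >= min_fraction:
            return hours
    raise ValueError("no tier matches; the tier table should include a 0.0 entry")


if __name__ == "__main__":
    print(allowed_resolution_hours(0.10))  # 8.0
    print(allowed_resolution_hours(0.50))  # 1.0
```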
Financial loss can serve as a metric by itself, with no consideration for downtime. Companies that are regulated, meaning the government follows closely what they do and will impose monetary penalties for missteps, are more likely to use this metric. For example, your business might have a storage system with archived data that is of no use to you any more, but is required by law to be kept for a certain time. Being caught without the data could generate heavy fines, possibly enough to put you out of business. Without proper risk management and planning, you may not allocate the proper budget and procedures to ensure redundant storage and backups are protecting you adequately.
Regardless of downtime considerations or government fines, data loss itself is a significant metric. Your planning and analysis might reveal that you have certain tolerance for data to be lost in the event of database corruption. If you ever see the term Recovery Point Objective (RPO), then that's what this is referring to. As an example, you might employ dozens of people who enter information into your custom database from information they find on the Internet, or maybe from surveys returned by potential customers. If there is database corruption, then you can tolerate restoring from the previous night's backup, and having your users re-enter today's information. So your RPO is one day. But, if you have a database that tracks orders placed by customers on your high-volume website, you would certainly require the ability to restore that data right up to the point of failure, or implement other technologies besides backup (such as continuous availability configurations), which require a much greater level of investment.
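To make the RPO idea concrete, the data at risk is whatever was entered between the last good backup and the moment of failure. The following sketch compares that window with an RPO; the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Time span of work that a restore from last_backup cannot bring back."""
    if failure < last_backup:
        raise ValueError("failure cannot precede the last backup")
    return failure - last_backup


def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True when the lost window is no larger than the Recovery Point Objective."""
    return data_loss_window(last_backup, failure) <= rpo


if __name__ == "__main__":
    # Hypothetical timestamps: nightly backup at 23:00, corruption at 15:00 the next day.
    last_backup = datetime(2024, 1, 1, 23, 0)
    failure = datetime(2024, 1, 2, 15, 0)
    print(data_loss_window(last_backup, failure))                    # 16:00:00
    print(meets_rpo(last_backup, failure, timedelta(days=1)))        # data-entry database
    print(meets_rpo(last_backup, failure, timedelta(minutes=5)))     # order database
```

The same nightly backup satisfies a one-day RPO for the data-entry team and fails badly for the order database, which is why the second case calls for a different technology rather than a tighter backup schedule.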
Finally, you can have critical functions for which the length of downtime doesn't apply at all. Think about the IT systems that support, say, the live broadcast of a concert that hundreds of millions of people are going to watch. A technical problem can't disrupt the broadcast for more than a few minutes at most. If it goes down, revenue would be lost immediately as viewers tune away and miss the ads that bring in revenue based on number of views, and lost again later as the damage to your reputation costs you future contracts. This means you have to make it so that it pretty much can't possibly go down.
Most of these are niche cases. But, it's important to understand these exist. The only way these can be missed in your business is if you do not perform proper Risk Management and planning for Disaster Recovery and Business Continuity.
If you've read about Disaster Recovery and Business Continuity elsewhere, no doubt you're familiar with the types of alternate sites you may maintain for your IT system, to enable resumption of business activities, or access to services you offer through the Internet, in the event of calamity at your current facility. There are no hard-and-fast definitions, but, in general:
- A cold site is a facility that you maintain, which has the capability of being converted to a data center and/or office for your employees within a few days or weeks. The electricity, phone lines, and Internet access should be available and activated, or able to be activated quickly, so you can move in your equipment, restore your data from backups, and resume business. For cost savings, you may split ownership or rent with other businesses.
- A warm site is one where you have some spare equipment in place, with communications services already running, and you have some or all of your virtual machines, data files, and databases replicated. With this additional preparation, your operations can be moved within a day or two.
- At a hot site, the critical components of your current IT system are essentially duplicated (servers and workstations), with capability for failover of applications and databases, to enable instant or nearly instant transition to the alternate site.
Of course, since any of these will incur significant cost to maintain, which one is appropriate, if any, will be determined through rigorous Risk Management and Disaster Recovery and Business Continuity planning.
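One way to picture the trade-off among the three site types is to pick the least expensive type whose recovery time still fits what the business requires. The sketch below does that; since there are no hard-and-fast definitions, the hour figures are assumed upper bounds (three weeks for cold, two days for warm, one hour for hot) and should be replaced with your own tested numbers.

```python
# Ordered from least to most expensive to maintain.
# Hours are assumed upper bounds on time to resume operations, not standard values.
SITE_RECOVERY_HOURS = [("cold", 21 * 24), ("warm", 48), ("hot", 1)]


def cheapest_adequate_site(required_recovery_hours: float):
    """Return the least expensive site type that recovers within the required time.

    Returns None when even a hot site (as assumed here) is not fast enough.
    """
    for name, recovery_hours in SITE_RECOVERY_HOURS:
        if recovery_hours <= required_recovery_hours:
            return name
    return None


if __name__ == "__main__":
    print(cheapest_adequate_site(30 * 24))  # cold
    print(cheapest_adequate_site(72))       # warm
    print(cheapest_adequate_site(4))        # hot
```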
Let's go back a bit to the differences and overlap between Disaster Recovery and Business Continuity. Sometimes you'll see hot/warm/cold sites in discussions about Disaster Recovery, and sometimes it will be described as part of Business Continuity. So, here's a little explanation of the distinction.
Recall that Disaster Recovery is about restoring the IT system to the way it was, while Business Continuity focuses on alternate arrangements (permanent or temporary) and continuing operations even in the face of unrecoverable losses, and has much broader requirements for participation and preparation on the part of management, employees, partners, and customers. If a business maintains a data center for Internet-based services and has an alternate site set up to take over in the event of a complete outage at the primary site, the only employees at either site are IT systems administrators, and the company's administrative offices are somewhere else completely, then management of the alternate site can fall under Disaster Recovery. But, if a business has an alternate site in place because management has determined the expense is justified in order to be able to promptly move all salespeople, developers, and administrative personnel from one office to a pre-staged, known location, and resume business activities within designated and known time frames following destruction of the building, then this falls under Business Continuity.
If the power goes out, and your facility has no backup generator, your IT and communications systems will most likely be instantly affected. Even with battery backups, unless you've invested significantly in power resilience, your entire system cannot continue running for more than a few hours. You have a disaster.
Failure of external services is particularly worrisome because their restoration is generally outside your control, and you have no real way of knowing when they will be restored.
When power goes out, you will most likely be facing both Business Continuity and Disaster Recovery plan activation. Assuming you don't have backup power, your entire IT system may be completely unavailable. This situation is out of the scope of Disaster Recovery. The ability of your business to resume activities promptly will depend on the virtue of your Business Continuity plan. Of course, whether you move to an alternate site, acquire a generator, or just wait it out, will have been determined during Risk Management and Business Continuity planning.
Disaster Recovery often kicks in after power has been restored. Of all the technology equipment you have in your server room and network closet, almost all of it has been powered on continuously for months or years. Unfortunately, every time equipment is rebooted, there's a small chance it won't come up normally, and this grows the longer it's been since the last reboot. Maybe someone plugged in a USB hard drive that will alter the boot device order, a software update was installed or network configuration change made but not tested, or some sort of hardware failure occurred at some point that doesn't cause the equipment to fail right away but will make it stop the boot process to display an error message. Startup errors are even more likely if battery backups were not in place or they did not provide sufficient runtime for your IT systems administrators to properly shut down server applications and dismount databases to avoid corruption. If server startup problems cannot be fixed easily, you may be looking at a disaster even if the power outage lasted only a few minutes. Or, if a database was corrupted because the server shut off in the middle of an index update, you are certainly looking at a bona fide disaster, with recovery operations starting with restoring the database from backup.
This all applies just the same to planned power outages, which many office buildings routinely employ these days for emergency systems testing, although, with the advance notice, problems related to ungraceful shutdown should certainly be avoidable.
This vignette will illustrate the handling of non-critical functions after a Business Continuity scenario.
Your business manufactures cranes, and provides maintenance service for your brand. Your headquarters office is in Los Angeles, and you have a manufacturing plant and warehouse outside Nashville with a product design team that works there as well. The product design team supports quality control during fabrication, but spends most of their time conceptualizing and testing innovative new designs for cranes to adjust to changing market demands, lower costs, improve safety, or add new features or capabilities to improve the value of your cranes.
A tornado obliterates the Nashville facility; execution of the Business Continuity plan enables salvaging most of the materials and establishing warehouse operations at a temporary facility in Georgia to complete production in progress for existing contracts. The product design team's functions were not included in the Business Continuity plan. The business then decides, instead of building a new plant, to divest itself of manufacturing, and use its well-known brand name to expand the crane maintenance and repair side of the business to service all brands. So, the product design team is never reconstituted. Or, if management did decide to rebuild the plant and restore manufacturing to full pre-tornado capability, this would be such a major project that the recreation of the product design team would be only a small portion of it. In either case, you can see why management would have been right not to include product design functions in the Business Continuity plan.
Some types of business-interrupting incidents can have virtually no effect on your IT system. For example, you could have a branch office in a foreign country with linguists who sit and translate documents all day, which is the main service your business provides. The supervisor gets them logged in to your servers over the Internet to get their documents and upload their work, and logs them out before they leave. You have them work in this one building to improve information security, so they can be closely supervised since you pay them by the hour, and so it's easier for them to review each other's work.
One day, the facility is destroyed in a fire. All your employees escaped with their laptops. Your IT department would really have nothing to do with recovery. How quickly your translation operations resume from this interruption would depend on how well you can communicate with the translators to tell them where to go and when, and get a new facility set up with Internet access. You may also have planned to reduce your security requirements, in case a new facility would be hard to find, to allow translators to connect using their Internet connections from home. All of this would be determined, again, during proper Risk Management and Business Continuity planning.
What about a pandemic disease event? What if your employees are quarantined and can't come to work? Your IT systems are immune to biological viruses, thankfully. But, if physical work can't be done, this will impact your operations just the same as technological work, and you'll want to plan for this in advance.
Let's say your company is an industrial supplier of metal brackets for construction. You have one main manufacturer that you buy about 90% of your products from, and they suddenly and surprisingly go out of business. This is certainly completely unrelated to IT. But, your Risk Management plan will have covered this, and your Business Continuity plan would have steps you can take before and after such an event, such as having insurance to cover such an interruption, alternate albeit inferior manufacturers to fill orders (with higher prices or shipping fees), or transitioning your business to sell other kinds of products, like maybe paper and office supplies.
Finally, one of the more grim aspects of Business Continuity planning is dealing with the potential death or long-term disability of executives and employees. This is conceptually different from everything else on this page, and intuitively it might seem like concern for humanity and healing would take precedence over business and profits, especially in the case of something truly hideous like a mass-murder. But, the best way to ensure compassionate handling is to plan in advance what you can plan for, so that losses from preventable business failures don't add to the misery. Especially in the case of a partnership or similar small business, where each individual's continuous industry and skill are indispensable, it would show disrespect to the lost to let everything he worked for go to waste, and not have had a plan to give his share of what you built together, whether the business continues or not, to his family.
Let's not describe a graphic example here. Suffice it to say that key individuals and key groups of individuals should be identified during Risk Management, and the business impact of their loss measured, to identify those who are critical to the operation of the business, so you can make a plan. Of course, cooperation and understanding from partners and customers may certainly be considered when devising your response to the untimely demise of key personnel.