A data network is the foundation of any enterprise infrastructure. Thus, it’s crucial to build a disaster recovery plan that will bring mission-critical services back online within a specified period.
Disaster recovery plans have been around for decades. Yet, they must be thoughtfully designed to change and adapt to new architectures and business use cases. Modern evolutions include cloud computing, remote workforces and software-defined networking.
In this article, we’re going to look at six practical steps you can follow to help build a successful disaster recovery plan for your network. These steps can be used as a framework to build the type of network disaster recovery plan that will fit your specific recovery goals.
1. Designate mission-critical vs. nonmission-critical network segments
Building automated redundancy and resiliency into a network is an expensive, time-consuming and complex process. That’s why your first step is to designate segments of the network that are considered mission-critical. It’s in these parts of the network that you can justify the added cost and complexity of automated failover. Common areas where true automated resiliency is found include data centers, WAN connectivity and network access to cloud resources.
Areas where full resiliency isn’t commonly implemented include the wireless LAN and the access layer where most end users connect. For these less critical network segments, it’s usually enough to have a recovery plan in place to manually bring the network back online using replacement hardware and software service-level agreements (SLAs) that meet specified turnaround times.
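One way to record this designation is a small inventory that tags each segment with a recovery approach. The sketch below is illustrative only: the segment names come from the examples above, while the max_downtime_minutes field, the downtime figures and the 60-minute cutoff are assumptions that would need to be agreed with the business.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    max_downtime_minutes: int  # hypothetical tolerance agreed with the business


# Assumed cutoff: segments that cannot tolerate an hour of downtime are
# treated as mission-critical and justify automated failover.
CRITICAL_THRESHOLD_MINUTES = 60


def recovery_approach(segment: Segment) -> str:
    """Return the recovery approach for a segment based on its tolerance."""
    if segment.max_downtime_minutes < CRITICAL_THRESHOLD_MINUTES:
        return "automated failover"
    return "manual recovery"


# Example figures only; real tolerances differ per organization.
segments = [
    Segment("data center", 5),
    Segment("WAN connectivity", 15),
    Segment("cloud access", 15),
    Segment("wireless LAN", 240),
    Segment("access layer", 480),
]

plan = {s.name: recovery_approach(s) for s in segments}
for name, approach in plan.items():
    print(f"{name}: {approach}")
```

Keeping the cutoff in one named constant makes it easy to revisit the designation when business requirements change.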
2. Build resiliency into mission-critical network segments
Once you’ve designated which network segments are considered mission-critical enough to justify the cost of full resiliency, the next step is to design and build it. The network segment you’re working on will determine which resiliency techniques will work best.
Most data centers use dynamic routing protocols and virtual overlay network technologies to provide real-time failover across hardware and links. WAN resiliency, by contrast, is typically handled through software-defined WAN technologies.
Lastly, cloud resiliency can be implemented in any number of ways, some of which rely on rapid-failover technologies designed and maintained by the public cloud service provider with which you’ve partnered. No matter the case, it’s important to look at all options before choosing the one that’s right for you.
3. Configuration backups
As networks continue their move from hardware-centric infrastructure to one that’s far more software-centric, proper network configuration management is becoming increasingly necessary. For legacy networks, maintaining a catalog of configurations often involves an automated backup application that remotely logs in to each network component and copies its configuration into a text file. These configuration files can then be organized, stored and retrieved in the event a hardware or virtualized component loses its configuration.
Alternatively, newer network architectures configure the entire network from a centralized location. This central location can either be on premises or in a public cloud. Either way, creating and storing backups from a central location becomes far easier, yet just as important.
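The storage half of such a backup job can be sketched in a few lines. This is a minimal illustration, not a description of any particular backup product: the save_config function, the directory layout and the device name are all hypothetical, and how the configuration text is retrieved from the device (SSH, a controller API or otherwise) is left out. The sketch keeps a timestamped copy per device and skips the write when the configuration has not changed since the last backup.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def save_config(device: str, config_text: str, backup_root: Path) -> Optional[Path]:
    """Store config_text for device unless it matches the latest backup.

    Returns the path of the new backup file, or None when nothing changed.
    """
    device_dir = backup_root / device
    device_dir.mkdir(parents=True, exist_ok=True)

    data = config_text.encode("utf-8")
    new_digest = hashlib.sha256(data).hexdigest()

    # File names are UTC timestamps, so a lexical sort is chronological.
    existing = sorted(device_dir.glob("*.cfg"))
    if existing:
        latest_digest = hashlib.sha256(existing[-1].read_bytes()).hexdigest()
        if latest_digest == new_digest:
            return None

    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = device_dir / f"{stamp}.cfg"
    target.write_bytes(data)
    return target


# Usage with a throwaway directory and a made-up device name.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = save_config("core-sw1", "hostname core-sw1\n", root)
    repeat = save_config("core-sw1", "hostname core-sw1\n", root)
    changed = save_config("core-sw1", "hostname core-sw1\nvlan 20\n", root)
    print(first is not None, repeat is None, changed is not None)
```

Storing only changed configurations keeps the catalog small and turns each file into a record of when a device’s configuration actually changed.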
4. Cold spares, rapid hardware replacement — or both
In terms of replacement hardware in the event of a disaster, two choices are available. The first is to purchase identical hardware components and keep them on a shelf until a hardware failure occurs. These components are known as cold spares and are ideal when you require the least amount of network downtime. This is also the most expensive option, as you must purchase duplicate equipment.
The other option is to select an appropriate hardware replacement SLA from the manufacturer based on your needs. In some cases, hardware comes with a limited lifetime warranty. That said, replacement hardware under such a warranty can take up to several weeks to arrive. In comparison, paid hardware replacement plans typically range from next business day all the way down to two- and four-hour replacements on a 24/7 basis.
Most organizations opt to purchase cold spare hardware for certain parts of the network and rely on hardware replacement contracts for others. Again, choosing the right option for your network boils down to how critical the component is and how much business functionality is lost when an outage occurs.
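The trade-off can be made concrete with some rough arithmetic: add the yearly cost of each option to the downtime cost you would expect while waiting for replacement hardware. The sketch below is a simplified model under stated assumptions. Every figure in it (option prices, replacement times, failure rate, downtime cost per hour) is invented for illustration and should be replaced with your own numbers.

```python
def expected_annual_cost(option_cost_per_year: float,
                         replacement_hours: float,
                         failures_per_year: float,
                         downtime_cost_per_hour: float) -> float:
    """Yearly option cost plus the expected cost of downtime while waiting."""
    expected_downtime_hours = failures_per_year * replacement_hours
    return option_cost_per_year + expected_downtime_hours * downtime_cost_per_hour


# Hypothetical options: (yearly cost, hours until replacement is in place).
options = {
    "cold spare": (4000.0, 1.0),
    "four-hour SLA": (1500.0, 4.0),
    "warranty only": (0.0, 336.0),  # assumes roughly two weeks of shipping
}


def cheapest_option(failures_per_year: float, downtime_cost_per_hour: float) -> str:
    """Return the option with the lowest expected annual cost."""
    costs = {
        name: expected_annual_cost(cost, hours, failures_per_year,
                                   downtime_cost_per_hour)
        for name, (cost, hours) in options.items()
    }
    return min(costs, key=costs.get)


# Assumed: one failure every ten years, with very different outage costs.
core_choice = cheapest_option(failures_per_year=0.1, downtime_cost_per_hour=10000.0)
access_choice = cheapest_option(failures_per_year=0.1, downtime_cost_per_hour=50.0)
print("core switch:", core_choice)
print("access switch:", access_choice)
```

With these made-up inputs, the cold spare wins for the component with the expensive outage and the paid SLA wins for the less critical one, which matches the mixed approach described above.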
5. Set recovery expectations, responsibilities and communication channels
Once the plans and processes are in place to handle a network disaster, the next step is to document recovery expectations, responsibilities and communication channels. Expectations typically revolve around service recovery times for each part of the network. Responsibilities dictate the role of each person, department or third-party entity, along with the detailed duties each is responsible for performing.
Lastly, a communication channels document should be created to detail two things: how the network technicians who are fixing the problem communicate with one another, and how progress is best communicated to the rest of the organization.
6. Post-outage root cause analysis
No network disaster recovery plan is complete without a step to assess what happened, why it happened and how to better recover when the next network disaster hits. This is where the team’s performance is critiqued at a detailed level to ensure processes were properly followed. Additionally, this root cause analysis step should be used to better refine or change processes if the assessment determines the current process is lacking or suboptimal.
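Part of this assessment can be reduced to a simple comparison between the documented recovery expectations and what actually happened. The following sketch assumes that recovery time targets are recorded per network segment and that outage start and restore times were logged. The segment names, targets and timestamps are made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical recovery time targets per network segment.
recovery_targets = {
    "data center": timedelta(minutes=5),
    "access layer": timedelta(hours=8),
}

# Hypothetical outage log: (segment, outage start, service restored).
outages = [
    ("data center", datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 12)),
    ("access layer", datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 13, 30)),
]


def review(outage_log, targets):
    """Compare actual recovery time against the target for each outage."""
    findings = []
    for segment, started, restored in outage_log:
        actual = restored - started
        target = targets[segment]
        findings.append({
            "segment": segment,
            "actual": actual,
            "target": target,
            "met": actual <= target,
        })
    return findings


findings = review(outages, recovery_targets)
for item in findings:
    status = "met" if item["met"] else "MISSED"
    print(f"{item['segment']}: {item['actual']} against target {item['target']} ({status})")
```

A missed target does not explain why recovery was slow, but it shows where the detailed review of process and team performance should start.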
Build a flexible network disaster recovery plan
My final advice when you’re building a network disaster recovery plan is to assume major technological and architectural changes will occur sooner rather than later. Thus, it’s necessary to develop processes and procedures in such a way that they can be easily modified and communicated to the necessary stakeholders without having to start from scratch. Creating disaster recovery plans that are flexible will save you a great deal of time in the long run.