While most businesses do a fairly good job of identifying the global threats to business operations, they often don't realize that the smallest details can undermine seemingly robust business continuity or disaster recovery plans.
How can an organization make an accurate assessment of risk without knowing all of the obstacles that may stand in the way of a complete return to normal business operations? The best way to uncover the gremlins that may sabotage an organization's speedy and seamless post-incident recovery is to periodically conduct real-world tests of each element of the plan and update it accordingly.
Take the case of two companies that had what appeared to be solid disaster recovery plans but failed to see the gremlins lurking.
Company 1: A Failure To Test!
This company was a large professional services firm operating from a home office hub in a metropolitan high-rise that connected to its multiple satellite offices via a Web-based network. Because of this company's total dependence on its IT systems for all of its core business operations, it determined that protecting its IT network would be a mission-critical element of any disaster recovery plan.
Extensive measures were taken to build redundancy throughout its vast IT system. For several years, in fact, the organization had been backing up its entire system each week, and each backup was tested for data integrity to confirm that it had completed successfully and the data was intact.
The company's disaster recovery plan anticipated the need for an operational hot spot, a secondary location equipped with necessary components (working utilities, desks, phones, networked computers and broadband Web access), and a standby contract was in place for such a facility. From a senior management perspective, the company's vital IT system seemed well-protected and disaster-ready.
When the largest hail storm on record rendered the high-rise home office building uninhabitable, the company promptly moved into its hot spot location and attempted to resume operations, only to find out there was a gremlin in its recovery plan.
When the organization selected its hot spot, there had been a detailed analysis of exactly how much capacity it would require at the alternative site. The contract specified the number of desks, computers and phones that were to be operational.
Unfortunately, during the 18-month period after the hot spot was contracted and prior to the devastating hail storm, no one from the IT department had ever taken the data backups to the hot spot and attempted to install them and boot the company's systems there.
It was presumed, based on detailed specifications in the standby contract, that the company's backup data would be compatible with the IT system at the hot spot. That turned out to be the gremlin.
The company's IT system had a sophisticated firewall system that prevented access from unauthorized users and locations. All secure access to the network and system administration had been done exclusively through the home office hub, which was damaged or destroyed in the massive hail storm.
When the backup data was installed at the hot spot, the firewall system interpreted the entire hot spot as an unauthorized system and locked up, preventing any access to the company's files. Had someone tried to install the backup data at the hot spot before the catastrophe, this problem could have been identified and easily fixed while the network hub was operational.
Company 2: Too Much Testing!
The second organization was a diversified financial services company with many of its larger operations in coastal states located in hurricane- or flood-prone areas. The company had almost doubled in size via organic growth and acquisitions over the previous 10-year period.
There was a full-time risk manager and risk management staff operating from the main corporate office, and despite the rapid expansion, the company had done a good job of bringing all of its new operations into the existing disaster recovery plan.
Communication between each regional hub and the corporate office was maintained with land lines and backed up with independent direct satellite uplinks in the event of any primary interruption.
Based on its experience from previous flooding incidents and coastal storms, the company decided to install diesel-powered generators with power outputs averaging 500 kW at all of its regional processing hubs to ensure basic power for operations.
These generators were tested monthly and run for approximately 30 minutes as part of the company's ongoing testing procedures. From a top-down perspective, this company apparently had a very robust and integrated plan for business recovery.
When Hurricane Katrina hit, however, the company realized there had been a gremlin undermining its plan. During the selection and installation of the backup generators, the company determined that 500 gallons of diesel would run each generator for approximately four eight-hour business days.
All of the 500-gallon storage tanks had been filled during the initial installation, and a standby contract was in place to refuel each location in the event a storm interrupted its main power.
The gremlin turned out to be the monthly testing of the backup generators: each monthly test consumed about eight to 10 gallons of diesel fuel. Over the two years of testing before Katrina, the coastal locations had used up almost half of the available fuel in the storage tanks. Because no one had accounted for this type of fuel depletion, the refueling contract did not call for periodic refueling in the absence of a specific weather event.
When Katrina hit, the generators worked just as anticipated, but they ran out of fuel after a day and a half and left the regional hubs dark for almost two days, until the refueling contractor made its scheduled stop.
Had anyone taken the time to physically check the fuel gauges on the generators before the storm, it would have been obvious that the tanks had been substantially depleted. The plan, however, did not call for anyone to look at the fuel gauges.
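As a rough illustration of how quickly routine testing can drain a "full" tank, the short sketch below runs the back-of-envelope numbers using the approximate figures cited above (a 500-gallon tank, roughly eight to 10 gallons burned per monthly test, about 24 months of testing). The exact rates are assumptions for illustration, not figures from the company's records.

# Back-of-envelope fuel-depletion check, using the approximate
# figures cited in the article (all values are illustrative).

TANK_GALLONS = 500            # capacity of each storage tank
TEST_BURN_GALLONS = (8, 10)   # fuel used by one 30-minute monthly test (low, high)
MONTHS_OF_TESTING = 24        # roughly two years of monthly tests before Katrina

for burn in TEST_BURN_GALLONS:
    used = burn * MONTHS_OF_TESTING
    remaining = TANK_GALLONS - used
    print(f"At {burn} gal/test: {used} gal consumed, "
          f"{remaining} gal ({remaining / TANK_GALLONS:.0%}) left in the tank")

# Output:
#   At 8 gal/test: 192 gal consumed, 308 gal (62%) left in the tank
#   At 10 gal/test: 240 gal consumed, 260 gal (52%) left in the tank
# Roughly half of the fuel was gone before the storm ever arrived,
# which is exactly the gremlin the testing schedule never accounted for.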
How can an organization defend against these sabotaging gremlins? When it comes to testing a recovery plan, presume nothing, test everything and reevaluate the plan based on the tests. Every element of the plan that is critical to recovery needs a trial implementation, as though there were a real emergency or disaster. The plan also needs to include the necessary funding for field testing and updating.
Lastly, have someone outside of the planning process review and evaluate the plan. Fresh eyes bring fresh perspective and may be what's needed to spot the gremlins that can hijack contingency plans.
Don H. Donaldson, RPA, CIC, CRM, CHS is president of LA Group in Montgomery, Texas. He may be reached at [email protected].