
Disaster recovery in a Cold War nuclear bunker

This situation occurred quite a few years ago, which is why we can talk about it now, but the lessons learnt are still relevant today.

The Situation

Kroll Ontrack received a call from the NATO Communications and Information Agency, situated just outside The Hague in the Netherlands. They had suffered a Windows 2000 server failure and could not bring the main data volume back online. The data volume held the only copy of some important logistics documents that were needed the following week for a major NATO exercise.

The server hardware was a standalone HP tower with an external drive enclosure. Four SCSI hard drives had been configured as a RAID 5 array using a Mylex single-channel RAID controller. I think the total data volume size was no more than 98 GB.

The server had been rebooted as part of a maintenance upgrade and two of the drives had not come back online. They could have been forced back online, but no one could be sure whether both drives had dropped out at the same time. There was a slight possibility the RAID had been running in degraded mode for some time. They also had the option of forcing just one of the drives online, but if this was the one that had failed first, its old data would cause corruption, and if FDisk then ran it could cause more damage (FDisk was set to run automatically on server start-up).

Another unusual aspect of this situation was that there was no backup. Because a backup tape is so easily transported, it had been decided that it would present too high a security risk, so no backups were made. It had also been decided that the redundancy in the RAID 5 would give enough data protection.

Due to the security level of the data, removing the drive enclosure from the site was prohibited, so it could not be sent to one of our data recovery labs. Any data recovery attempts would have to be made onsite in the secure facility, under guard. This turned out to be underground in a nuclear-bomb-proof bunker, which presented some unexpected complications.

To prevent the security of our own recovery software being compromised, a security code is required to run it. This is normally obtained from head office with a quick two-minute phone call. But as this was a secure facility, all mobile phones were confiscated before entry and all landline calls had to be vetted and authorised before they could be made. Each authorised call required a 12-digit code to be entered, followed by the authorised phone number. The next complication was that we were either locked into a secure area behind a huge blast door or had to be escorted at all times, even to the loo.

The Recovery

The raw hexadecimal data on each drive was examined to work out the drive order and parity type before creating a virtual RAID and validating the data integrity. It turned out that one of the drives had dropped out at an earlier date, so there would have been widespread corruption if it had been forced online. The RAID was recreated using the other three drives and the recovered files were copied to another data volume.
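To illustrate why the drive that dropped out first had to be left out, here is a minimal Python sketch of RAID 5 XOR parity on a single stripe. It is a simplification made for illustration and not the tooling we used: the function names are hypothetical, the blocks are four bytes long, and the parity block is always placed last, whereas a real controller rotates parity across the drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by their parity block.

    Assumption for this sketch: parity always sits on the last drive.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]


def stripe_is_consistent(stripe):
    """A healthy stripe XORs to all zero bytes across every member."""
    return not any(xor_blocks(stripe))


def rebuild_missing(stripe, missing):
    """Rebuild the block at index `missing` from all the other members."""
    return xor_blocks([blk for i, blk in enumerate(stripe) if i != missing])


# Four drives: three data blocks plus parity, written while all were online.
original = make_stripe([b"AAAA", b"BBBB", b"CCCC"])

# Drive 1 silently drops out. The array keeps running in degraded mode and
# the logical contents of drive 1's block change to b"XXXX". Parity on the
# surviving drives is updated, but the dropped drive still holds b"BBBB".
current = make_stripe([b"AAAA", b"XXXX", b"CCCC"])
on_disk = [current[0], original[1], current[2], current[3]]

if __name__ == "__main__":
    print("stripe with stale drive consistent:", stripe_is_consistent(on_disk))
    print("block rebuilt without stale drive:", rebuild_missing(on_disk, 1))
```

Forcing the stale drive online would have presented its old block as valid data. Leaving it out and rebuilding its block from the three current members gives back what was last written.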

Lessons Learnt

The IT managers had made a conscious decision not to make a backup, as they assessed the security risk to be far greater than the risk of data loss. As long as the level of risk was understood and accepted, this would be a valid decision. It is also possible they had included a data recovery option in their calculations.

However, when a data loss actually occurred, it turned out that the risk was unacceptable. The time to recreate the data was much longer than was available to meet their schedules. I think it would have been possible for them to recreate the data in the available time frame, but only at significant additional cost; by comparison, our recovery cost was much lower. I also feel somebody’s reputation was on the line.

From our experience at Kroll Ontrack, two-drive failures are very rare but do occur more frequently after a reboot. RAID controllers complete a Power On Self Test (POST), and part of the test is to check that the disk drives are reading correctly. If a drive's data throughput is below specification it may fail the POST, and it is then up to the operator to decide whether it should be forced online.

