Backing Up and Recovering An Entire System

This document describes how a backup of the operating system is restored to a box. It covers only the operating system, not the data. There are three scenarios to consider when backing up a system.

1. Backing up system A and restoring to system A

2. Backing up system A and restoring to system B (where system B is identical in hardware)

3. Backing up system A and restoring to system C (where system C is different hardware)

Scenarios 1 and 2 can be done with relative ease. The backup process can be automated and requires little technical expertise. The restore process requires a person who knows a little about the system, but it can be documented with step-by-step instructions. I will quickly describe how to complete these tasks and go over what definitely does not work.

What does NOT work

Copying files from one system to another will not work, whether the copying method is cpio, tar, or backup. This includes the following sequence: copy the files from drive A to backup media (a tape drive or another disk drive), remove drive A and replace it with a new drive, format the new drive, slice the new drive, copy the boot program to the new drive, and finally restore the files from the backup media to the new drive. This will not work. You cannot restore an AT&T operating system this way. I have been on the phone with many support people at AT&T, and they all confirm that it cannot be done. I emphasize this because companies like IBM and HP have a very simple backup and recovery procedure, whereas AT&T does not. I have spoken with Raj, Vaughan, and Dave Tenneson at AT&T about these procedures. They are documented in chapter 8 of the AT&T Admin. guide but are not supported for creating a boot device.

I am hoping that Coke's direct AT&T contact, Jean Spencer, will be able to enlighten us on any procedures (most likely undocumented) that involve copying files to restore an operating system. I will be meeting with her on Thursday to discuss these issues.

What DOES work

You can do a complete device dump of the boot drive and restore that image onto another drive. A device dump is performed with the Unix command dd, which makes a bit-for-bit copy of any device, so you get an exact mirror image of the boot drive that can be copied to another drive. The only problem arises when the original drive is bigger than the destination drive: the boot image will not fit on the new drive, and data will be lost. Otherwise the process is simple and is a pretty reliable method for backups under scenarios 1 and 2.
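
To make the size problem concrete, here is a minimal sketch in Python of what a device dump does, with the size check done up front. This is my own illustration, not an AT&T procedure, and it is not how dd itself is implemented. The function names and the 64 KB block size are assumptions, and measuring a device by seeking to its end is assumed to work on the target system (it does for regular files, which is how I would try it out first).

```python
import os

BLOCK_SIZE = 64 * 1024  # assumed block size; plays the role of dd's "bs" operand


def device_size(path):
    """Return the size in bytes of a regular file or seekable device."""
    with open(path, "rb") as f:
        return f.seek(0, os.SEEK_END)


def dump_device(source, destination, block_size=BLOCK_SIZE):
    """Copy source onto destination block by block, as a device dump does.

    Refuses to start if the source is larger than the destination, because
    the image would not fit and data would be lost. Returns bytes copied.
    """
    src_size = device_size(source)
    dst_size = device_size(destination)
    if src_size > dst_size:
        raise ValueError(
            "source is %d bytes but destination holds only %d"
            % (src_size, dst_size)
        )

    copied = 0
    # "r+b" overwrites the destination in place without truncating it first.
    with open(source, "rb") as src, open(destination, "r+b") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
            copied += len(block)
    return copied
```

Checking the sizes before the first block is written means a too-small destination is rejected untouched, rather than left holding a truncated and unbootable image.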

Why perform backups?

We have backups for one main reason: to prevent data loss in the event of a system failure. How you define a system failure is the key to a successful Backup & Recovery program. A system failure could be one of the following:

1. Software failure - files are removed, data corruption

2. Disk Drive failure - data or operating system drive goes belly up.

3. Hardware Component failure - the motherboard dies, a controller fails, memory goes bad, etc.

4. Facility failure - power is lost to building, fire, place is blown up

Backup tapes solve the first two kinds of system failure, Software and Disk Drive. Some systems also use disk mirroring, and in either failure case you can quickly restore data from the mirrored disk or the backup media. The main idea is that your system remains up and no data has to be moved to another system for recovery. In fact, the backup and recovery procedure using a device dump works extremely well in these two cases. The problems with the above procedures start only when you try to recover from the last two kinds of failure, Hardware Component and Facility.

In those cases you are attempting to back up a system, operating system and all applications, and restore that data onto a completely different box. I would not recommend restoring this way, for the following reasons. The most obvious is different hardware. Different pieces of hardware require different device drivers and configuration files, so a restored image causes many problems such as kernel panics and memory dumps (I know this because I have tried). Restoring to a different piece of hardware with relative ease is out of the question unless that hardware is an exact duplicate of the original. In the real world, under a Hardware Component failure, you probably would not try to restore to another system anyway. By the time you get your tapes out of the vault, find the backup system, restore the operating system, configure new disk arrays, restore data, and fine-tune the network parameters, I bet you'll have the new part in house and installed, and your original system will be up and running within 24 hours, or 48 at the most.

Based on real-life experience, I propose that the only time an entire system will need to be restored is in the event of a complete Facility failure. When that occurs, there needs to be complete documentation on how to reload the operating system and all the applications. Why? Because we cannot know that far ahead what kind of hardware will be needed. In fact, you can't even guarantee that an exact duplicate piece of hardware will exist. How often is a Facility failure going to occur? I am not talking about a simple power outage. I am describing a complete destruction of the facility, where it would take months to rebuild. Can you guarantee that you'll be using the same system designated as the recovery box in five years? What if you go to a third-party vendor and say, "We don't know what we changed in the kernel parameters, system tunables, or configuration files ... but we have the O/S on tape"? I don't think this would be a very efficient way of reaching your goal, which is restoring your system. They are going to have completely different hardware and will require documented procedures of what was changed to fine-tune your system.
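
One way to keep that documentation current is to save a copy of each configuration file as shipped and record its differences from the tuned version. The sketch below, in Python, is only an illustration of the idea using the standard difflib module. The function name is hypothetical, and the file contents and tunable values in the usage example are made up for the example rather than taken from our systems.

```python
import difflib


def config_changes(stock_text, tuned_text, name="config"):
    """Return unified-diff lines showing how tuned_text differs from stock_text.

    An empty list means nothing was changed from the stock file.
    """
    return list(
        difflib.unified_diff(
            stock_text.splitlines(),
            tuned_text.splitlines(),
            fromfile=name + ".stock",
            tofile=name + ".tuned",
            lineterm="",
        )
    )


# Usage with made-up contents (illustrative values only).
stock = "MAXUP 25\nNPROC 200\n"
tuned = "MAXUP 100\nNPROC 200\n"
for line in config_changes(stock, tuned, name="tunables"):
    print(line)
```

A printed record like this, kept off site with the tapes, tells whoever rebuilds the system which settings to reapply on whatever hardware is available at the time.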

If you believe that I have missed any concept put to me about the Backup & Recovery procedures, please talk to me. I am very open and sympathetic to everyone's needs. We should compromise and find a solution that is good for everyone. Again, I am meeting with AT&T on Thursday. We will discuss some of the options available and what other companies (supported by AT&T) are doing. I will pass the facts and figures on to James and Rick.

Any questions?

Mark