A Welcome Back to School Story
We hope you had a great August and a wonderful holiday weekend. Summer has gone by in a flash, but we are back on track with our newsletter series this month. And in keeping with the “back to school” spirit, we hope that the story in this newsletter imparts some good lessons.
As they say in Hollywood, the following is based on true events. Some facts and names have been changed to protect the traumatized. So here goes…
The Anatomy of a Hardware Disaster
There we were, on a July morning in the office. The coffee pot had just finished brewing and the team was settling down to work on the various projects and tasks that needed attention. The peaceful atmosphere was suddenly interrupted by the ringing of our emergency phone line.
Jim answered the phone, and the usually chipper Kathy from Fidelity Distribution was on the other end of the line. It was immediately clear from her voice that something was not right.
“Our server crashed last night and it’s not booting up…”, she said.
Fidelity Distribution has been our customer for over 20 years. They started using Profits Plus on the Alpha-Micro platform, were converted to Unix in the late ’90s, and migrated to Windows in 2008. The main motivation for the Windows conversion was typical: a lower initial upgrade cost, additional functionality with new Profits Plus integrations, and in-house administrative control of the main server, so that day-to-day tasks like networking, adding printers, etc. could be handled by the same company handling the PC workstations and other network components.
They had contracted with Dynamic Computers, their local computer support company, to set up the hardware. Dan from Dynamic had configured a server with a RAID array of external hard drives to provide some redundancy and protection from drive failure. In addition, he had set up the server with a back-up scheme that used a different hard drive. Things looked pretty good.
“What exactly happened, and when?” asked Jim.
“We had a power outage last night and now the computer won’t start up. It says, ‘Operating System not found’,” Kathy replied.
“That sounds like a hard-disk error. I see from your record that Dynamic Computers has set up the hardware. Have they been contacted?”
“Yes, I just paged Dan.”
“Great, let’s get him to look into it. He may have to do a restore from back-up. The good thing is that you haven’t done any work this morning, so last night’s back-up should get you to a good starting point. Hopefully he’ll have it up and running soon. We’ll go in and take a look before you are ready to begin working on it,” Jim concluded.
A few minutes later, Dan from Dynamic Computers called back with some grave news. The RAID array had suffered a multiple disk failure. This meant that the usual redundancy that it offered was no longer available. Furthermore, the back-up procedure that was set up to execute nightly had been failing, so it appeared that all the company’s data had been lost!
While Dynamic had set up a pretty good system to protect from hardware failure, there was no procedure in place for a human to verify that it was actually working. Users of Profits Plus on UNIX will point out that there is a menu item in Software Maintenance that enables quick and easy backup verification. Our training specifically covers this point.
On Windows, however, backup verification is not something that we control, and it is often left to the person or company that sets up the server. This doesn’t mean that Windows is not a reliable platform. This was a hardware problem that could happen with any operating system. However, as with any technology, it is imperative that there is sufficient human monitoring of all critical systems. Not only do you have to have the tools and automation in place, but they need to be tested and reviewed periodically to ensure effectiveness.
Our story concludes with a little bit of good news. The previous server was still plugged into the network and running, so we were able to reload Profits Plus onto the newer server and get them up & running quickly, albeit with one-year-old data. Thanks to Fidelity’s solid information tracking & month-end procedures, along with several days of intense data entry effort, they were caught up and back in business. Eventually Dan was able to recover a back-up that was about two months old, which was used to selectively restore additional data. It was a lot of work for everyone involved, but luckily the final outcome was much better than the initial prospect of losing all the data.
Lessons to Take Away
You can be sure that today, Fidelity Distribution is much more knowledgeable about their back-up system. Some other points to keep in mind:
1. It’s not enough to have a system in place. Monitor the system daily and ensure that you know how to respond in an emergency. You can’t get much simpler than a daily check of the backup that takes only a moment to do … but it’s of little use if nobody bothers to do it.
2. Fire drills are done for a reason. Testing emergency procedures regularly goes a long way in ensuring that they work when they are required. Even better than knowing how to respond is actually doing it to verify that you are right.
3. Work with a competent partner. While Dynamic Computers did put in a good back-up system in theory, they should have ensured that Fidelity understood how to verify backups and detect problems early.
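For readers who like to see what a daily check might look like in practice, here is a minimal sketch in Python. It is not part of Profits Plus and is not what Dynamic Computers set up; the function name, the folder you pass in, and the 26-hour and 1-byte thresholds are all assumptions made for illustration. It simply looks at the newest file in a backup folder and reports whether it is recent and non-empty.

```python
import time
from pathlib import Path

# Assumptions for illustration only: a nightly backup with a little slack,
# and an empty file treated as a failed backup.
MAX_AGE_HOURS = 26
MIN_SIZE_BYTES = 1


def check_latest_backup(backup_dir, now=None,
                        max_age_hours=MAX_AGE_HOURS,
                        min_size_bytes=MIN_SIZE_BYTES):
    """Return (ok, message) describing the newest file in backup_dir."""
    now = time.time() if now is None else now
    directory = Path(backup_dir)

    if not directory.is_dir():
        return False, f"Backup folder not found: {directory}"

    files = [p for p in directory.iterdir() if p.is_file()]
    if not files:
        return False, f"No backup files in {directory}"

    # The newest file by modification time is taken to be last night's backup.
    newest = max(files, key=lambda p: p.stat().st_mtime)
    info = newest.stat()
    age_hours = (now - info.st_mtime) / 3600

    if info.st_size < min_size_bytes:
        return False, f"{newest.name} is empty ({info.st_size} bytes)"
    if age_hours > max_age_hours:
        return False, f"{newest.name} is {age_hours:.0f} hours old"
    return True, f"{newest.name} looks current ({age_hours:.1f} hours old)"
```

Someone could call check_latest_backup with their own backup folder each morning and read the message it returns. A check like this only confirms that a recent file exists; it does not prove the backup can be restored, which is why the fire drill in point 2 still matters.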
We hope that this never happens to you (or us), but as this story illustrates, only proactive planning and ongoing inspection can transform a potential disaster into a minor irritation.
CRA Tradeshow Update
Jim will be at the CRA Trade Show and Convention, September 16 – 18, 2009, in Albuquerque, New Mexico. Please contact us if you wish to schedule a meeting outside tradeshow hours, or at least stop by the booth and say “Hi!”.
Catch you next month!
Barbara, Cheryl, Jim, Joe, Neil, Vivek, and Pavan
phone: (248) 583-4110