Anatomy: A System Failure

…and why we fired The Planet. Early in the morning on February 7, one of the Crucial web servers suffered a hard drive failure. This was a legacy server from our older plans and did not have the RAID drive configuration that our new servers all have. There was definitely going to be some down time because the server had to be brought down, the drive physically replaced and the OS restored to the new drive before our administrators could even get their hands on it.

Crucial runs backups daily and stores them on a separate drive in the same machine. Once a week, a full backup is moved off to a Network Attached Storage (NAS) device. The same is true for one monthly backup. We were well prepared to restore our clients’ data and simply had to wait for our data center, The Planet, to replace the drive and restore the OS.

As a professional web hosting company, we prepare for just such disasters and do everything possible to mitigate these unavoidable hardware failures. However, nothing we did could have prepared us for what The Planet put their client (Crucial) and our clients (you) through.

I would like to offer this timeline of events, along with the actual tickets submitted by both Crucial and The Planet. Before we start, we should point out that we have been a client of The Planet for the past 4 years. We have 17 dedicated servers with them as well. Our monthly bill is significant, but apparently not that significant.

Upon completing diagnostics, we determined that the primary hard drive had serious errors and needed to be replaced. At approximately 10 am on February 7, we tried multiple times to access the machine and retrieve the current SQL databases. However, the drive damage was too severe, and Crucial Administration determined that further attempts would be too time consuming, as we needed to get the machine back online for our clients.

At approximately 11:30 am, Crucial submitted a request for drive removal and OS restore. Below is a copy of this request:

This is a request to reload the Operating System on our server. The request details are as follows:

Hardware Name: xxxxx.crucialwebhost.com (Server)
Location Info: BW30 (Tile)
Operating System: Red Hat Enterprise Linux, Version 4

Planet:
Let me know if you need anything else.

Crucial:
This will not effect backup information on good drive—that is my understanding. All backup files are on the second drive and MUST be retained.

We received a relatively prompt response.

Planet (02/07/2007 12:24:46):
Replacing the failed drive now.

Shortly thereafter we received this update.

Planet (02/07/2007 12:43:36):
Processing this reload now.

At this point we’re feeling pretty good. They’ve started reloading the OS at 12:43 pm on February 7. Crucial should have this machine up in no time at all to restore the backups from the previous day.

Several hours pass and we request an update.

Crucial (02/07/2007 16:08:22):
Can I get an update on this reload process, several hours have passed.

Thanks.

Shortly thereafter we received the following update to the ticket.

Planet (02/07/2007 16:31:54):
Our automated system is currently processing your reload request.

Thank you for your patience.

About half an hour later we update the trouble ticket once again with the following:

Crucial (02/07/2007 16:58:36):
Can you please tell me how long this AUTOMATED process generally takes?

I’ve installed Red Hat a few thousands times manually and haven’t had it take this long ever.

Would it be possible to have this done manually as that may be better than this automated system. We’re going on 4+ hours.

After another 40 minutes pass with no response, we update the ticket with the following post.

Crucial (02/07/2007 17:39:01):
Could someone please give me an eta on when this os reload will be complete.

Please?

Not long after that I decided it was time to get on the telephone. We were able to reach a representative of The Planet relatively quickly, with little hold time, which was a bonus at this point. The technician assured me that he would personally get involved and escalate this for us. That was appreciated and welcome news, as the server had been down all day.

The Planet representative then updated the ticket with this:

Planet (02/07/2007 18:04:52):
Per our phone conversation, I am looking into that status of this reload for you now. Thank you for your patience.

Of course, by this point patience is running very thin, not only for us but also for the large majority of Crucial clients.

We followed up on the ticket about 35 minutes later looking for some sort of ETA.

Crucial (02/07/2007 18:41:43):
Thank you, I appreciate that.

If you could provide any sort of ETA I’m sure my clients would appreciate it as would I.

Thanks.

To our amazement and dismay, we did not receive another reply until nearly 3 am on February 8.

Planet (02/08/2007 02:51:52):
At this time we do not have an ETA on the system. We will update you with a more accurate status shortly.

We thank you for your patience.

By this time, we had given up on being able to restore the server that day and decided that it would be best to get sleep and be ready to start fresh in the morning.

We began pinging the IP address of this machine so that we could tell precisely when the server came back up, if it came back up at all. The server began responding to pings at approximately 8:00 am on February 8.
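
The ping watch described above is easy to automate. The following is a minimal Python sketch, not the script Crucial used: `ping_once` shells out to the system `ping` command (the `-c 1` flag assumes Linux or macOS), and `wait_until_up` is a hypothetical helper that retries a probe until it succeeds.

```python
import subprocess
import time


def ping_once(host: str) -> bool:
    """Send a single ICMP echo request via the system `ping` command.

    Assumes a Linux/macOS style `ping` that accepts `-c 1`.
    """
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_up(probe, interval_seconds=30.0, max_attempts=None, sleep=time.sleep):
    """Call probe() repeatedly until it returns True.

    Returns the 1-based number of the attempt that succeeded, or None
    if max_attempts is reached without a successful probe.
    """
    attempt = 0
    while True:
        attempt += 1
        if probe():
            return attempt
        if max_attempts is not None and attempt >= max_attempts:
            return None
        sleep(interval_seconds)
```

For example, `wait_until_up(lambda: ping_once("192.0.2.10"))` blocks until the host answers. The address 192.0.2.10 is a placeholder from the documentation range, not the real server. Keep in mind that a reply to ping only shows the machine is on the network, not that its data is intact.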

We’re back in business and we can start with restoring our clients’ data! Or so we thought…

Upon accessing the machine we quickly learned that the backups that were on the second backup drive were gone. Not just the backups, but the drive was also gone.

Now, keep in mind that the original ticket has not been updated again by The Planet at this time. We updated the ticket with the following:

Crucial (02/08/2007 08:14:19):
I see the machine is backup, however I am unable to find the backup drive or backups that were on this machine. Please let me know where I can find my backups.

An hour later, still no reply from The Planet, we updated the ticket with the following post:

Crucial (02/08/2007 09:09:36):
I am currently standing by waiting for an update on my backup drive that was mentioned in the opening post.

This will not effect backup information on good drive - that is my understanding. All backup files are on the second drive and MUST be retained.

We need to begin work on this machine to get it ready for production again. our phone is ringing off the hook because this machine has been offline for 2 days now.

Please don’t make me tell all these people that their backups are now lost.

FYI - this is the second time this has happened to me, failed drive, replaced drive - lost backup drive. I think your procedures need to be re-evaluated.

Ticket response time has been dismal as well. Just use this ticket as an example.

Why don’t you guys have a manager give me a call when it’s convenient for them/you. We’ve got 20 servers with you all and I need to get some attention, and real quick like.

Not at all a happy client.

More than an hour later, The Planet finally responded to the ticket with the following:

Planet (02/08/2007 10:14:06):
We are taking the server down to reconnect the drive at this time. I will escalate this ticket to management when that is complete.

Approximately 30 minutes later the drive was found (fortunately) and reconnected to the dedicated server. The Planet updated the ticket with the following:

Planet (02/08/2007 10:42:53):
The secondary drive has been reconnected to the system and is ready to be mounted. Please let us know if you need any assistance with this.

We are forwarding this ticket to a supervisor for review.

We left the ticket at that time to begin the restore process, nearly 24 hours from the time the reload was requested. While we were restoring backups, the following was posted to the trouble ticket by a manager:

Planet (02/08/2007 12:18:50):
We certainly apologize for the lack of update to this ticket and the amount of time that was taken to complete this request. We disconnected the secondary drive on your system due to the seriousness of the situation in that keeping your data was absolutely necessary, though sometimes this operation does add to completion times. Currently we guarantee OS Reloads within 24 - 48 hours and this operation was completed in closer to 18 hours. We apologize that this was not completed in a satisfactory manner for you, but we are working everyday to try and shorten these processes so as to not affect our customers as much. Do you need further assistance as to this issue?
Data center Technician Supervisor

This reply sealed the deal for us. We had no idea that an OS reload could take 48 hours to complete! For sure a lesson to read the fine print. I received no telephone call, only the reply.

The restore process took approximately 6 hours to complete. Several of the larger websites experienced problems during the restore, which is normal, and needed to be restored by hand. Again, we had backups from the day the server went down, so we had no data loss, just down time.

By the evening of February 8, we had recovered 95% of the websites and data. The server was fully hardened and most lingering DNS problems had been resolved. Needless to say, there was still a lot of work ahead of us, because there are invariably problems with some sites after a restore. After an additional day of trouble ticket support, we were able to resolve 99% of all problems.

There are some valuable lessons to take away from this for anyone who happens to care to read it.

I’d like to outline some of the things we as a company learned from this experience.

  1. Do not rely on a third party to care about you or your clients.

    This is sad but true. This is also a lesson that Crucial is fully aware of. We maintain a great deal of redundancy in our server farm; however, we had never considered that an OS reload could take us offline for 2 full days. The solution is simple: we now have duplicate machines that we can restore backups directly to in the event of a catastrophic failure. No longer will we rely on anyone to restore an OS for us. We have one ready now.

  2. Never take for granted a company’s reputation.

    The Planet was the best. What can I say—I was a cheerleader for The Planet for a long time. I built several businesses on the solid foundation that The Planet gave me over the years. However, time changes everything, and it has certainly changed The Planet. They’ve got that "huge company mentality" that we all loathe. They just don’t seem to care much anymore. This can be further observed by the poor usability of not only The Planet’s new website, but also within their systems control panel, Orbit, where features have been "coming soon" for years.

  3. Store daily backups on NAS.

    Do not store daily backups on a secondary drive in the same server. Instead, store them on a NAS device on a private internal network. This gives you access to your clients’ most recent backups even if the machine is inaccessible. This was a real point of failure for us and we’ve taken steps to ensure it does not happen again.

  4. Public Relations becomes the number one priority when disaster happens.

    The clients must be kept up-to-date on exactly what is going on at all times. This was a failure of The Planet. They dropped the ball on us several times and we know how it feels to be left in the dark. You will always be able to find the current status of a disaster situation on our support page.

    We believe that keeping our clients informed of and educated about exactly what was transpiring at any given hour was the primary reason we lost only a single client during this two-and-a-half-day outage. The reason the client gave: he was unable to get an API to work correctly. He was within his 101-day money-back guarantee and he got his full money back. His issue wasn’t even related to the outage! We’re quite thankful that only one client left us over this very trying week.

  5. In the hosting business, the data center is everything to you.

    It is something that you must constantly monitor and gauge the effectiveness of. You cannot take for granted that your data center is good, no matter who they are. It is up to the web hosting company to ensure that their data center is not only good, but the best for you and for your clients’ sake.

  6. Test your data center!

    We’ll be performing periodic disaster recovery drills with each and every data center Crucial employs. We will fire those data centers that cannot perform to Crucial standards. We’ll also tell you who we fired and why, so that perhaps something positive can be gained from an equal negative.
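
Lesson 3 above can be sketched in a few lines of Python. This is an illustration only, assuming the NAS is already mounted as a directory on the server. The function names and any paths you pass in are hypothetical, and this is not Crucial’s actual backup tooling. It copies one archive to the NAS and verifies the copy by checksum, so a damaged copy is caught on the day it is made rather than on the day it is needed.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_backup_to_nas(archive: Path, nas_dir: Path) -> Path:
    """Copy one backup archive into nas_dir and verify the copy.

    nas_dir is assumed to be a directory on an already-mounted NAS share.
    Raises OSError if the copied file does not match the original.
    """
    nas_dir.mkdir(parents=True, exist_ok=True)
    destination = nas_dir / archive.name
    shutil.copy2(archive, destination)
    if sha256_of(archive) != sha256_of(destination):
        destination.unlink()  # do not leave a corrupt copy behind
        raise OSError(f"checksum mismatch copying {archive} to {destination}")
    return destination
```

Run from the nightly backup job, a call like this keeps the newest backup reachable even when the server itself is powered off or its drives have been pulled.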

Crucial Web Hosting has fired The Planet. We have made our home in a new and much more robust and secure environment, SoftLayer, located in Dallas, Texas, in the popular Infomart Business Center.

We’re pleased to have SoftLayer providing our services and we look forward to testing them on a regular basis to ensure that they maintain the standards that Crucial requires.

There are also a few lessons here for the clients of web hosting companies.

  1. Make your own backups.

    If Crucial had been unable to recover the daily backups from a "lost" hard drive, we would have had to go back three weeks to the monthly backup, thus losing the data for those three weeks.

  2. Visit the support page first.

    Any hosting company worth its salt will keep you informed of what is going on. The most likely place they will do this is on the support page. If everything is out, I would expect it to be on the front page.

  3. Think about what you are hosting and where.

    This is a dig at many web hosting customers. If your business is absolutely reliant on your website and email, and you stand to lose tens or hundreds of dollars each day your site is down, let me ask you this question: why do you pay $5 a month for that? A six pack of beer costs more, yet you can justify paying less for something that earns you money.

    You will notice that during the Crucial outage, the Crucial website and support remained available the whole time. That is because it is our business website. We lose hundreds of dollars each day our site is offline. For us, this is unacceptable. So we don’t put Crucial on a shared host; it’s on a dedicated, high availability server. And you guessed it, it costs a lot more than $5 or $20 a month. More in the neighborhood of $500 a month. That is called a cost of business, like an office or stamps or anything else you use to make money.

    When it comes down to it, you do get what you pay for. You can’t realistically think that the service you get for $20/month is equivalent to the service you would get from a $50/month plan, or a $500/month plan. There are reasons why the service is cheap and there are reasons why the service is expensive. Many in the industry may argue this, but they are just trying to avoid the reality of the situation. Logic should tell you that you can’t get the same quality of service between these packages.

    So, if your website is important to you in one way or another, treat it like it’s important! You wouldn’t host a $3000/month business on a $20/month account would you?! I realize many people do and they have my best wishes for success, however to the rest of you who base your success on more than wishes, I suggest you examine your web hosting and determine for yourself if what you are paying for and getting is acceptable protection for your investment. It is your investment after all.

  4. Be nice!

    You have to understand that during these outages things are very hectic and much work is being done. Many times the same people who are doing the work are also answering your live chats and trouble support tickets. It’s very difficult to stay motivated when hit with needless questions or criticisms that only slow the recovery process. A good host is working to fix the problem, not wasting time trying to call every client to let them know their host is down. We assume you know it’s down because it’s down. Our job is to get it back up as quickly as possible.

  5. Vote with your feet.

    If you are unhappy with Crucial or any host because of the way things are handled at any level, the best way to show your displeasure is to cancel your account. The grass is not always greener on the other side, but you cannot find that out unless you look for yourself. We were very lucky not to lose clients due to the outage—but I would understand if they had, and I personally had expected many to do just that.
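
To put a number on lesson 1 above: the data you stand to lose is the gap between the failure and the newest backup you can still reach. Here is a small Python sketch of that arithmetic. The helper name is hypothetical, and the dates are illustrative, chosen only to match the three-week figure mentioned in that lesson.

```python
from datetime import date, timedelta
from typing import Iterable, Optional


def data_loss_window(failure_day: date, backup_days: Iterable[date]) -> Optional[timedelta]:
    """Return the time between the newest usable backup and the failure.

    Backups dated after the failure are ignored. Returns None when no
    usable backup exists at all.
    """
    usable = [day for day in backup_days if day <= failure_day]
    if not usable:
        return None
    return failure_day - max(usable)


failure = date(2007, 2, 7)                         # day the drive failed
monthly_only = [date(2007, 1, 17)]                 # illustrative: a three-week-old monthly backup
with_own_copy = monthly_only + [date(2007, 2, 6)]  # illustrative: your own copy from the day before

print(data_loss_window(failure, monthly_only).days)   # 21
print(data_loss_window(failure, with_own_copy).days)  # 1
```

A single recent copy of your own turns three weeks of lost work into one day.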

In closing, computers are made of moving parts. Things break from time to time and a good host does what they can to rectify the problem as quickly and painlessly as possible. Understand your host’s procedures and know where to look for information if there is a problem.

Backup, backup, backup. Don’t let anyone take any of your business responsibility from you. This means you are responsible for your own high availability needs, not your host. The host’s job is to provide a quality service and enough information for you to make a wise decision for your hosting needs based on the information provided.

I’ve personally learned a great deal from this entire experience, and this is far from my first hard drive failure. In many respects, we were fully ready for this. In other respects, we let our data center take over our responsibility of having a machine ready for just such a circumstance, among other things.

I hope that you too have gleaned a small bit of information and can move about the Internet a more informed consumer. Thanks for taking the time, and thanks for choosing Crucial.

Comments (6)

We’d love to hear others’ experiences, good or bad, with The Planet.

Rick

Jennifer on 10 February 2007 said:

There is an excellent website for online backup information, news and articles. Check it out here:

http://www.BackupReview.info

This site lists more than 400 online backup companies and ranks the top 25 on a monthly basis.

Cheers,

Ray on 15 February 2007 said:

I feel your pain, brother, we will be happy to take on your 17 machines and existing business provided SL is not the right choice for you.

It’s nice to see (by reading all of your final paragraphs) that someone understands how things work. It’s unfortunate you had to learn the hard way about keeping daily off-sites, but at least things turned out OK. I had a similar experience with Liquid Web that dragged on for about 48 hours, with no sleep for me in between.

Morgan, Ray - Thanks for your comments!

I think we can honestly look back on this and say that some very valuable lessons were learned from it. Crucial has become twice as strong as it was prior to this disaster.

Thanks again for sharing!

Rick
