Recent Downtime + Hardware/Drive Failure


Hey everyone,

Since March 6th, our web server has been experiencing occasional OS crashes. I looked at the server logs and couldn't find a cause (e.g. journalctl showed no output before the crash). I opened a ticket with our hosting provider asking them to check the node our web server was hosted on and verify the hardware via disk/RAM tests. They came back stating that resource usage on the node looked fine and that the crashes were most likely related to our server specifically (e.g. running out of available RAM). I personally didn't believe this was the case, since we always had a lot of available RAM and our old web server had over a year of uptime while running the same services. Either way, I knew there was a chance it could have been the services or the OS, so I enabled kdump and planned to inspect any crash dumps generated by the next crash (we never got to that point, since the next crash wasn't just a crash, as explained below). Unfortunately, the hosting provider never ran the disk/RAM tests I had asked for to be safe.

We experienced a total of 3 crashes from March 6th until last night, when the server went down again around 11:30 PM EST, but this time it wouldn't come back up. The KVM in our hosting provider's portal wasn't working, so we couldn't see the server's console, and the server's status in the panel was stuck on rebooting. This indicated the node was completely down. I opened another ticket and tried calling the hosting provider, but I didn't receive a response for a while. They did eventually post an event about this on their status page around an hour and a half later.

Around 10 AM EST this morning, they notified us that their node had experienced a hardware failure and that they were restoring our services from a backup on another node. While I had suspected the node's hardware was bad because of the earlier crashes, I was hoping it would be something like bad RAM. However, given that they needed to restore from backups, I believe it was a drive failure (they never confirmed this with me, though).

The entire server was restored from a backup taken on February 29th. This wasn't too bad, but I did have to redo some changes I hadn't backed up. Additionally, TMC's website itself was restored from a more recent backup I had from March 3rd. I plan on implementing daily backups for this website soon.
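
As a rough idea of what daily backups for the website files could look like, here is a minimal Python sketch. It is only an illustration, not what is actually running on the server: the function names, the `site-<date>.tar.gz` naming scheme, and the paths in the usage comment are all assumptions, and it only covers files on disk (not SQL databases).

```python
import tarfile
from pathlib import Path


def create_backup(source_dir: Path, backup_dir: Path, stamp: str) -> Path:
    """Archive source_dir into backup_dir/site-<stamp>.tar.gz and return the archive path.

    backup_dir should live outside source_dir so archives don't include themselves.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    archive = backup_dir / f"site-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the directory name instead of the absolute path.
        tar.add(source_dir, arcname=source_dir.name)
    return archive


def prune_backups(backup_dir: Path, keep: int) -> list[Path]:
    """Delete all but the newest `keep` archives and return the deleted paths."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # YYYY-MM-DD stamps sort chronologically when sorted as plain strings.
    archives = sorted(backup_dir.glob("site-*.tar.gz"))
    old = archives[:-keep]
    for path in old:
        path.unlink()
    return old


# Example usage (hypothetical paths), meant to be run once per day by a scheduler:
#   import time
#   create_backup(Path("/var/www/site"), Path("/var/backups/site"), time.strftime("%Y-%m-%d"))
#   prune_backups(Path("/var/backups/site"), keep=7)
```

Something like this would be run once a day from a cron job or systemd timer, and the archives should ideally be copied off the node as well, since a local-only backup wouldn't survive a drive failure like this one.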

Other services such as Best Mods had automatic daily backups of their SQL database, so they've been restored to a more recent state.

I am going to see if I can get more information from our hosting provider on this incident. I'll also continue monitoring the server to ensure it doesn't crash again.

I'm sorry for this inconvenience and thank you for understanding!
