



The Great TGTAP Downtime 2013


  • This topic is locked
17 replies to this topic

#1 Chris

Chris

    Limbal Rings

  • Administrators
  • 10,967 posts

Posted 30 March 2013 - 01:30 PM

We're back!

First off, we're sorry about the lengthy downtime we suffered during the past week. For the most part it was completely beyond our control. We did unfortunately suffer some data loss, some of which was simply due to corruption. Now if you continue reading I'll explain what happened, or just skip to the end if you don't care and just want to know wtf is up with the site right now.

Back in February we suffered some downtime, over 24 hours in fact. Scanning the system logs we were able to pinpoint this to an issue with one of the hard drives, but it was difficult to determine which one (we have 4 in a RAID 10 array) when the tools we have available were reporting them all as healthy! Fast-forward to last week, when one of the hard drives decided to completely fail on us. Those of you who know about RAID will know that a single hard drive in the array failing is not a problem, so we simply asked the server techs to replace it for us. No big deal. As I was submitting the support ticket, I was running tests on the other 3 hard drives to check they were ok. Turns out they weren't. A drive in the second pair was also failing, and as the test finished running, it did fail. This brought the server into a read-only state, so we rebooted to allow the techs to replace the first bad disk. This was done, but we had to wait practically a whole day for it to finish filesystem checks and rebuild itself into the array.

The bad news got worse after that and long story short, the OS had become corrupted and we were unable to get the server to boot up. Combine this problem with incredibly slow support staff and you see where this is heading...
In the days that followed, they eventually replaced the second failed disk for us, and within another couple of days they finally got it into rescue mode and were able to get the server back online (today).

Unfortunately this was when I discovered quite a lot of file integrity loss, corrupted files everywhere. After working all night to get the server back into a workable state, we realised that unfortunately our database system was completely fucked, for lack of a better phrase. Worse still, our on-site nightly backups were mostly lost. The most recent off-site backups we had were over 2 weeks old, but these had to do.

What I've done is restored a database backup from 13th March. Anything that happened since then has been lost. As for files and uploads, we believe most of this is ok, but chances are there's some missing files we aren't aware of. Please let me know (in this topic) if you are experiencing errors or other weirdness on the forums or anywhere else on the website.

While no one is to blame (except myself for not having more recent backups available) for what happened, we feel the support staff made us wait unreasonably long times to both replace the failed hardware and recover the system for us. Downtime was almost inevitable with these kinds of failures, but it certainly should not have been this long. For this reason, we will be transferring TGTAP (and our other sites) to a new server within the next couple of weeks. We don't expect there to be any downtime while this happens, though the forums will be turned offline for approx 15 mins to ensure we have successfully migrated the data.

That's all.
  • Huckleberry Pie and gbityzyli like this

#2 TUN3R

TUN3R

    Executioner

  • Members
  • 1,569 posts

Posted 30 March 2013 - 02:51 PM

Short version: Chris ruined everything.


Edited by TUN3R, 30 March 2013 - 02:51 PM.

  • Jezz Torrent likes this

#3 Chris

Chris

    Limbal Rings

  • Administrators
  • 10,967 posts

Posted 30 March 2013 - 03:04 PM

Yeah I forgot to add a TL;DR but that pretty much sums it up, might as well blame me :)



#4 BlackListedB

BlackListedB

    Executioner

  • Members
  • 1,641 posts

Posted 30 March 2013 - 05:32 PM

Hard drive failures are the most critical to any computer anywhere, so yeah, I sympathize. Backing up whenever and wherever possible is the way to go about it. Cloud storage in this modern age is certainly worthy of consideration, but for business it will wind up costing far more than the free basic amounts, I'm sure, if even possible.

 

I've dealt with RAM failures as well, and most recently with modern lithium-ion batteries. They just stop working all of a sudden, unlike the zinc batteries of old.

 

RAM failures in particular are just too odd to nail down quickly; they manifest in different ways in terms of erratic operations.


  • Huckleberry Pie likes this

#5 TUN3R

TUN3R

    Executioner

  • Members
  • 1,569 posts

Posted 30 March 2013 - 05:56 PM

Don't think I've ever had a hard drive break. Five graphics cards, three CPUs and god knows how many memory cards, but never a hard drive. So yeah...

 

I think there are people who specialize in data recovery in the UK. Data can be recovered from both memory cards and hard drives, unless you raged and stepped on 'em. There aren't any here cause East Europe.


Edited by TUN3R, 30 March 2013 - 05:57 PM.


#6 BlackListedB

BlackListedB

    Executioner

  • Members
  • 1,641 posts

Posted 30 March 2013 - 06:05 PM

For a forum, a HDD recovery service for recent posts is really going to extremes. Our own forum was sabotaged and I restored it with an older archive than I would have liked, so I regret that, but it nearly mirrors TGTAP in this case: we reset from a prior backup point... kinda like Windows Restore in Win Millennium! YEAH!


  • Huckleberry Pie likes this

#7 Huckleberry Pie

Huckleberry Pie

    "There's always something,"

  • Members
  • 3,000 posts
  • Philippines 

Posted 31 March 2013 - 01:38 AM

Hard drive failures are a bitch indeed. My Strawberryland Forum sites are also running fine now, although the other subdomains are still fucked and pending repair.



Ubisoft should take a look at this...
"We love Him because He first loved us." - 1 John 4:9-10


#8 BlackListedB

BlackListedB

    Executioner

  • Members
  • 1,641 posts

Posted 13 April 2013 - 07:21 AM

^ Use the CLOUD, it's the next big thing, very trendy. I'm sure you can leverage the FREE factor by using more than one cloud-based service to store critical data in smaller sizes.



#9 TUN3R

TUN3R

    Executioner

  • Members
  • 1,569 posts

Posted 13 April 2013 - 12:52 PM

^ Use the CLOUD, it's the next big thing, very trendy. I'm sure you can leverage the FREE factor by using more than one cloud-based service to store critical data in smaller sizes.

 

No.



#10 BlackListedB

BlackListedB

    Executioner

  • Members
  • 1,641 posts

Posted 14 April 2013 - 02:29 PM

You're the self-proclaimed expert? Clouds are used by businesses, as most online services are from large companies, and they have to communicate via networks as well. Some are closed to the public, but networks only differentiate themselves by their ecosystems.


  • Huckleberry Pie likes this

#11 Issac

Issac

    Leave Out all the rest

  • Members
  • 240 posts
  •    Wales 

Posted 09 May 2013 - 05:05 PM

I suggest you switch to a VPS and install Cloudflare.

#12 Chris

Chris

    Limbal Rings

  • Administrators
  • 10,967 posts

Posted 09 May 2013 - 05:31 PM

You think a site as big as this can fit on a VPS? You're severely underestimating our size. A dedicated server is a minimum requirement, and we've been on several high-end ones since 2007.

 

Also, Cloudflare is good but it wouldn't have helped in this case. We were not under attack so their DDoS mitigation wouldn't have been of any use. The downtime was simply due to two consecutive hard drive failures. As for their caching methods, Cloudflare only shows cached pages in the event of an unreachable server, and for a limited time. Since the site was down for a whole week this would only have been effective in the first day.

 

Anyway, we actually silently migrated to a brand new server a couple of weeks ago and everything has been running smoothly since then. I've also taken extra measures to ensure we have nightly backups stored off-site in two separate locations. In the event of any future downtime due to hard drive failure, the most data we'd be at risk of losing is an absolute maximum of 24 hours. Unless this coincided with a big news/content update, this won't even be much of an issue. Point is, while we can't guarantee against hardware failure, we can now be extremely optimistic about recovery should anything untoward ever happen again, and we certainly don't expect downtime as long as that ever again.
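To put the "24 hours at most" figure in concrete terms, here is a minimal Python sketch of a backup freshness check. It is an illustration only, not the script actually used on the server; the location names and function names are hypothetical.

```python
from datetime import datetime, timedelta

# Target stated in the post above: at most 24 hours of data at risk.
MAX_LOSS = timedelta(hours=24)


def worst_case_loss(backup_times, now):
    """Data at risk if the server died at `now`: time since the newest backup."""
    return now - max(backup_times.values())


def stale_locations(backup_times, now, max_age=MAX_LOSS):
    """Names of backup locations whose newest copy is older than max_age."""
    return sorted(name for name, ts in backup_times.items() if now - ts > max_age)


# Hypothetical example: two off-site locations, one of which has fallen behind.
backups = {
    "offsite_a": datetime(2013, 5, 9, 3, 0),
    "offsite_b": datetime(2013, 5, 7, 3, 0),
}
now = datetime(2013, 5, 9, 17, 31)
print(worst_case_loss(backups, now))
print(stale_locations(backups, now))
```

Fed the situation from the downtime itself (a single usable backup dated 13 March and a restore on 30 March), `worst_case_loss` reports 17 days, which is the gap nightly off-site copies are meant to close.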



#13 Issac

Issac

    Leave Out all the rest

  • Members
  • 240 posts
  •    Wales 

Posted 09 May 2013 - 05:40 PM

What host are you using? And not really, but I have seen 600K-post forums running on a VPS. Also, Hostgator offers good dedicated hosting; a friend and I have been using them for years. If you've got a good budget, you can switch whenever you want.

#14 Chris

Chris

    Limbal Rings

  • Administrators
  • 10,967 posts

Posted 09 May 2013 - 07:01 PM

We're at RapidSwitch now. And yeah I'm familiar with Hostgator. I got into the hosting game back in 2005 which I ran alongside TGTAP, but I didn't have any budget outside of advertising revenue brought in by this site, which was only enough to cover the cost of a cheap server back then. Struggled with that and couldn't dedicate enough time to it to take it seriously so it was never something I pursued, but I still gained useful knowledge through doing it. Over the years I've managed a total of 11 servers (8 dedicated, 3 VPS) at a total of 8 different hosts around the world. I don't claim to be an expert on server management, I'm far from it, but I know everything that I need to know to get it optimised, secured, kept stable, and fix any problems that might crop up.

 

And yeah a 600K post forum shouldn't have too much of a problem running off a decent VPS. But this forum is one of the least intensive parts of this site as it's not particularly active right now, it's our downloads database that sees a massive amount of traffic compared to the rest of the site. When you're transferring that much data so quickly you need decent hardware and bandwidth. Also, bear in mind I also host GrandTheftWiki on the server, as well as a couple of other projects I have unrelated to GTA. So the server gets decent usage even if the whole thing isn't being used by TGTAP - it's not wasted if that's what you're thinking, and we've plenty of room to grow and expand :)



#15 BlackListedB

BlackListedB

    Executioner

  • Members
  • 1,641 posts

Posted 10 May 2013 - 07:13 AM

We've been talking about backup files of our own site, since the Webmaster himself can't seem to log in, and has to create a new account to come in and try and remedy things. That's a PITA as well! ARGH!!!

 

I've been too lazy to set up any home RAID systems, but the idea should apply online as well: a HDD is mirrored with the same data, so if one dies for any reason, the other keeps running and retains the most recent data. I'm no expert either, though. I opted to learn a bit more once I had a major trojan attack and lost all access to data on my HDD. NEVER AGAIN!!


Edited by BlackListedB, 10 May 2013 - 07:19 AM.


#16 Chris

Chris

    Limbal Rings

  • Administrators
  • 10,967 posts

Posted 10 May 2013 - 02:14 PM

We use a RAID 10 setup in our server, so it's both striped and mirrored. We can afford one of the hard drives to completely fail and still be ok, possibly two, provided the second failure is in the other mirrored pair rather than the same one.
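As a rough illustration of the point above, here is a minimal Python sketch of which disk failures a four-disk RAID 10 array can survive. The disk names and the pairing layout are assumptions made for the example, not the server's actual configuration.

```python
from itertools import combinations

# Hypothetical layout: two mirrored pairs, striped together (RAID 10).
MIRROR_PAIRS = [("disk0", "disk1"), ("disk2", "disk3")]


def array_survives(failed, pairs=MIRROR_PAIRS):
    """Return True if every mirrored pair still has at least one working disk."""
    failed = set(failed)
    return all(any(disk not in failed for disk in pair) for pair in pairs)


def surviving_fraction(n_failures, pairs=MIRROR_PAIRS):
    """Fraction of all n-disk failure combinations the array survives."""
    disks = [disk for pair in pairs for disk in pair]
    combos = list(combinations(disks, n_failures))
    return sum(array_survives(c, pairs) for c in combos) / len(combos)


print(array_survives(["disk0"]))           # one disk lost
print(array_survives(["disk0", "disk2"]))  # one disk lost from each pair
print(array_survives(["disk0", "disk1"]))  # both disks of the same pair lost
```

With this layout any single failure is survivable, and 4 of the 6 possible two-disk failures are survivable: the array is only lost when both failed disks belong to the same mirrored pair.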



#17 smallpancake

smallpancake

    Hitman

  • Elite Members
  • 602 posts
  • United States 

Posted 13 May 2013 - 05:33 AM

I use RAID on the bumblebees, I know it's only for wasps, though.


  • Gerard likes this

I stopped eating pancakes in 2009.  :sick:


#18 BlackListedB

BlackListedB

    Executioner

  • Members
  • 1,641 posts

Posted 14 July 2013 - 10:36 PM

Mike, our GTAC systems manager, aka Webmaster, informs me an infrastructure fortification is also under way for the GTAChronicles website. Hope you don't mind a mention here, since it's possibly the same type of thing. We're expecting at least more lurking traffic as GTA V is finally set to come out!





