Sunday Maintenance Post-Mortem

Oct 24, 2017
338
0
#1
*Whew*

That was an exhausting 14 hours, but we've finally gotten the site back up, upgraded to XenForo 2.0, with our brand new theme.
For those who are interested, I wanted to provide a summary and post-mortem of what happened over the past 14 hours.

I originally estimated the maintenance period to take at least 6 hours, consisting of the XF 1.5 -> 2 upgrade, database maintenance / changes, and site configuration. Apart from a few issues that set us back about 1-2 hours, this part was relatively straightforward; most of the time went to upgrading our sizable database.

At 7:45 PM PST, we decided to reopen the site to the public and were immediately flooded with traffic, on par with what we see during E3 press conferences. At this point, our auto-scaling kicked in to add more resources to our server cluster to handle the influx. Normally, the worst that happens is intermittent availability for a minute or two as more servers come online.

With the upgrade to XF 2.0, however, we quickly noticed that under production traffic XF 2.0 requires significantly more compute resources than 1.5. Our database query distribution algorithms also needed to be tuned to distribute queries evenly across our DB cluster. We had performed similar configuration on the beta site, but once the production site was handling significantly more traffic, we realized we needed to aggressively redistribute certain queries. Additionally, we discovered certain bugs within our version of XF 2.0 that were causing very long query times.
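
For anyone unfamiliar with the idea of distributing queries across a DB cluster, here is a minimal generic sketch. It is not our actual setup, which this post does not go into. It assumes a single primary that takes all writes and a set of read replicas, and it sends each read to the replica with the lowest load relative to its capacity. The names `Replica`, `QueryRouter`, `weight`, and `in_flight` are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    name: str
    weight: float = 1.0   # hypothetical tuning knob: relative capacity of the node
    in_flight: int = 0    # queries currently running on this node


@dataclass
class QueryRouter:
    primary: Replica
    replicas: List[Replica] = field(default_factory=list)

    def acquire(self, sql: str) -> Replica:
        """Pick a node for this query and count it as in flight there."""
        # Assumption: anything that is not a SELECT is treated as a write.
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self.replicas:
            # Lowest load relative to capacity wins; ties go to the first listed.
            node = min(self.replicas, key=lambda r: r.in_flight / r.weight)
        else:
            # Writes (and reads when there are no replicas) go to the primary.
            node = self.primary
        node.in_flight += 1
        return node

    def release(self, node: Replica) -> None:
        """Mark one query on this node as finished."""
        node.in_flight = max(0, node.in_flight - 1)
```

In a scheme like this, "tuning" would mean adjusting the weights, or which kinds of queries are allowed to go to which nodes, based on measured load.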

It took a few more hours of debugging, testing fixes, and debugging again before we were able to get back to normalcy. I want to thank everyone for their patience - it was a long day for us all.

Over the next few days, you may see some slowness or general site wonkiness. During that time, we'll be collecting statistics that we will use to further tune our query distribution algorithms and spread load more evenly. You may also see some UI bugs here and there - please report them in the Welcome Thread!

Thanks everyone, enjoy the new site!

Love,
delta
 
Oct 27, 2017
345
0
#5
Thank you, thank you, thank you so much for the Latest Threads view. That was the view I wanted for this site since the start. Now I don't have to keep two bookmarks for Video Games and EtcetEra with Latest View.

Great work. ^_^
 
Oct 28, 2017
1,496
0
#9
what's a "database query distribution algorithm"? what kind of DB (relational, nosql, punch cards) backs the forum? also why did the significantly higher compute resources only manifest under production traffic and not in test? or maybe it just wasn't noticed in testing? also interested to know what kind of levers you guys pulled, i.e., what did "we realized we needed to aggressively redistribute certain queries" mean in practice?