Notes from the Tech Team: Recent Server Upgrades and Service Interruptions

Status
Not open for further replies.

ratrosaw

Administrator
Oct 23, 2017
128
0
#1
Hi! #tech team here.

Many members have inquired about the two recent service interruptions we experienced after last Thursday's server upgrade. With ResetEra's first E3 coming in less than two weeks, we wanted to detail our upgrade plan a little bit more and explain the reasons behind the technical difficulties. Hopefully this will address the common concerns.

About the Server Upgrade

We are always looking for ways to improve ResetEra's performance and reliability, with a focus on handling traffic fluctuations during major events. So far we have completed two major architectural upgrades to ResetEra, migrating the site from a single server node to a multi-machine cluster. The architecture, very loosely speaking, works as follows:



As you can see from the diagram, ResetEra consists of two parts, an app server part, for serving all the requests from browsers, and a database part, for storing the data. Our February upgrade was designed to make the database backend automatically scalable, while the update last week grants the app server part the ability to automatically scale itself.

More specifically, the app server cluster is now capable of monitoring its own usage, and will allocate/deallocate resources automatically in response to fluctuations in traffic. The allocation/deallocation process happens behind the scenes and will not interrupt your browsing experience at all, which we believe is a perfect solution for our E3 challenge.
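To make the idea concrete, here is a minimal sketch of the kind of decision an autoscaler makes. It is an illustration only, not our actual configuration: the function name, the 60% CPU target, and the replica limits are all assumptions chosen for the example.

```python
import math


def desired_replicas(current: int, cpu_utilization: float, target: float = 0.6,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Return how many app servers to run for the observed average CPU load.

    All defaults are hypothetical values picked for illustration.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if target <= 0:
        raise ValueError("target must be positive")
    # Proportional rule: scale the server count by observed/target utilization,
    # rounding up so the cluster is never left short.
    wanted = math.ceil(current * cpu_utilization / target)
    # Clamp to the configured bounds so a traffic spike cannot allocate
    # without limit and a quiet period cannot scale to zero.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    print(desired_replicas(current=4, cpu_utilization=0.85))
    print(desired_replicas(current=10, cpu_utilization=0.05))
```

A controller would run a check like this periodically and add or remove servers until the running count matches the result.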

That all sounds great, so why did the service interruptions happen?

Scalable though it is, the database server cluster, due to some technical restrictions, cannot resize itself in the same way that the app server cluster does. It is true that we can add/remove resources to/from the database backend at any time; however, every time we add resources to the database backend, it takes some time to replicate the data. As a result, we still need to carefully plan how many resources we would like to allocate, and prepare accordingly beforehand.
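As a rough illustration of why this needs planning ahead, the sketch below estimates how long a new database node would take to copy the data, and the latest moment provisioning could start before an event. The data size, throughput, and safety factor are hypothetical inputs, not measurements from our cluster.

```python
from datetime import datetime, timedelta


def replication_time(data_gb: float, throughput_mb_per_s: float) -> timedelta:
    """Estimate how long copying data_gb takes at a sustained throughput."""
    if data_gb < 0 or throughput_mb_per_s <= 0:
        raise ValueError("data size must be >= 0 and throughput must be > 0")
    seconds = data_gb * 1024 / throughput_mb_per_s
    return timedelta(seconds=seconds)


def latest_start(event_start: datetime, data_gb: float,
                 throughput_mb_per_s: float,
                 safety_factor: float = 2.0) -> datetime:
    """Latest time to begin adding a node so it is ready by event_start.

    safety_factor pads the estimate, since real replication rarely runs
    at full throughput for the whole copy.
    """
    padded = replication_time(data_gb, throughput_mb_per_s) * safety_factor
    return event_start - padded


if __name__ == "__main__":
    # Hypothetical numbers: 100 GB of data at 50 MB/s.
    event = datetime(2030, 1, 1, 12, 0)
    print(replication_time(100, 50))
    print(latest_start(event, 100, 50))
```

The point is only that the lead time grows with the size of the data, so database capacity has to be decided before the traffic arrives rather than in response to it.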

Unfortunately (or fortunately, depending on how you look at it), after the migration last week, we underestimated ResetEra's growth over the past 7 months and the scale of pre-E3 traffic. During peak hours, the app server cluster scaled as expected, but the database backend became overloaded. Unable to get a response from the database backend, the app server cluster became unresponsive as well. Our auto-healing system then kicked in and tried to solve the problem by rebooting itself, hence the on-and-off behavior many members experienced. We apologize for the oversight.
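The restart loop described above can be sketched as follows. This is a simplified model, not our auto-healing code: the `Probe` fields, the `Action` names, and both policy functions are hypothetical. It contrasts a policy that restarts on any failed check with one that restarts only when the app process itself is broken.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESTART = "restart"
    STOP_ROUTING = "stop_routing"


@dataclass
class Probe:
    process_alive: bool        # the app server process responds at all
    database_reachable: bool   # the app server can get answers from the DB


def naive_action(probe: Probe) -> Action:
    """Restart whenever anything looks unhealthy.

    With an overloaded database this restarts healthy app servers,
    which produces the on-and-off behavior without fixing the cause.
    """
    healthy = probe.process_alive and probe.database_reachable
    return Action.NONE if healthy else Action.RESTART


def dependency_aware_action(probe: Probe) -> Action:
    """Restart only for process failures; otherwise just pause traffic."""
    if not probe.process_alive:
        return Action.RESTART
    if not probe.database_reachable:
        return Action.STOP_ROUTING
    return Action.NONE


if __name__ == "__main__":
    overloaded_db = Probe(process_alive=True, database_reachable=False)
    print(naive_action(overloaded_db))
    print(dependency_aware_action(overloaded_db))
```

Whether the second policy is the right one depends on the setup; it is shown here only to illustrate why rebooting the app servers could not resolve a database overload.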

It took some time for us to adjust to the new architecture and we have learned a lot from the two incidents. More resources have been allocated to the database backend and we have updated our provisioning plan for E3. Although it is impossible to promise that ResetEra will never go down at any point during E3, we do want to assure you that we are prepared.

Sincerely,
#tech

P.S. We are aware that some members are experiencing random logout issues. A fix for this will be applied today. Before that fix comes online, you can avoid the problem by making sure that you are visiting www.resetera.com instead of resetera.com.
 

mrtl

Member
Oct 27, 2017
543
0
#4
Being prepared is half the battle. But who will win? ERA vs. The Megatons in just a week.
 

His Majesty

Banned
Member
Oct 25, 2017
7,527
0
Belgium
#5
I'm sorry but this is unacceptable and it's severely impacting my ability to shitpost, I hope all of this will be resolved by the time E3 and the World Cup hit otherwise you may expect a strongly worded letter in your inbox.
 
Oct 25, 2017
4,208
0
#9
does the database part get polled every time a page is displayed, or is some of that cached on the app server? Otherwise it would seem that any autoscaling on the app side will always be bottlenecked by your database side?
 
Oct 25, 2017
5,080
0
26
#10
Great to hear you guys are so actively working on making the site better. I hope the scaling on the database end doesn't end up costing too much, or at least that it is covered by your current earnings through ads and Clear.
 

plagiarize

Mighty Jagrafess
Moderator
Oct 25, 2017
4,800
0
Cape Cod, MA
#13
does the database part get polled every time a page is displayed, or is some of that cached on the app server? Otherwise it would seem that any autoscaling on the app side will always be bottlenecked by your database side?
The thing is that the load on the two isn't going to have a linear relationship. The reason things like searches are limited in how often you can perform them is that they put a lot more load on the database than, say, opening a thread, but about the same load on the app server.
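A minimal sketch of that kind of limit, assuming a simple per-user minimum interval between searches. The class name, the 30-second interval, and the in-memory dictionary are illustrative assumptions, not how the forum software actually implements it.

```python
import time
from typing import Callable, Dict


class MinIntervalLimiter:
    """Allow each user one expensive operation per min_interval_s seconds."""

    def __init__(self, min_interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval_s = min_interval_s
        self.clock = clock  # injectable so the limiter can be tested
        self._last: Dict[str, float] = {}

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        last = self._last.get(user_id)
        if last is not None and now - last < self.min_interval_s:
            return False  # too soon: reject without touching the database
        self._last[user_id] = now
        return True


if __name__ == "__main__":
    limiter = MinIntervalLimiter(min_interval_s=30)
    print(limiter.allow("alice"))  # first search goes through
    print(limiter.allow("alice"))  # an immediate repeat is rejected
```

Cheap operations like opening a thread would skip the limiter entirely, which is how the database can be protected without slowing down ordinary browsing.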
 
Oct 25, 2017
1,310
0
#15
Yeah, that all makes sense to me.

I imagine it's pretty simple to drop a Docker image on a box and run a script to spin it up and add it to the app cluster on demand. But you simply can't do that with the DB: the backups are huge and replication takes forever (on an internet timescale).

I guess you all will just need to go through a scaling/descaling cycle once a year to get the DB resources properly provisioned for E3.
 
Oct 25, 2017
1,780
0
Toronto
www.killerrin.com
#16
Love reading these technical analyses. ResetEra is about to charge into its own uncharted territory for the E3 Monolith. Keep up the great work and we'll see you on the other side!

Would love to read a post-mortem after the fact on the challenges of dealing with E3.
 
Oct 25, 2017
5,720
0
Singapore
#17
Good write-up. Didn't even expect a detailed response to the disruption issue, so that's already a plus. Hopefully the database server is ready for E3! :)

:thumbsup: :thumbsup: :thumbsup:
#GiveUsEmojis
 

TSM

Member
Oct 27, 2017
1,127
0
#25
P.S. We are aware that some members are experiencing random logout issues. A fix for this will be applied today. Before that fix comes online, you can avoid the problem by making sure that you are visiting www.resetera.com instead of resetera.com.

Haha, I just added www. to the URL for this thread, and when I hit enter it brought me to a logged-out version of this web page.
 
Oct 25, 2017
2,034
0
#26
I've had this problem recently where I log in to ERA, but the second I click to a different page on the forum I am logged out and have to log back in again. It only happens after the initial log in. After the second time it works fine. Anyone else having this issue?
 

TSM

Member
Oct 27, 2017
1,127
0
#27
I've had this problem recently where I log in to ERA, but the second I click to a different page on the forum I am logged out and have to log back in again. It only happens after the initial log in. After the second time it works fine. Anyone else having this issue?
It's been happening to people for days now. The portion I quoted above says it should be fixed sometime today.
 

L Thammy

Spacenoid
Member
Oct 25, 2017
11,269
0
#29
I started reading this but I got bored partway through. Can you give the same explanation but using Dragonball Z analogies?
 

deltaplus

Administrator
Oct 24, 2017
338
0
#32
I've had this problem recently where I log in to ERA, but the second I click to a different page on the forum I am logged out and have to log back in again. It only happens after the initial log in. After the second time it works fine. Anyone else having this issue?
We pushed out a fix for this issue moments ago. You may encounter an additional login request, but the settings should persist following a successful login.
 