June 1st, 2018 outage – post mortem

Overview

Some of the SurveyToGo servers were down on Friday, June 1st 2018, from ~1:20am to ~6:50am GMT+2. During the outage, offline data collection was not affected; however, access to the SurveyToGo Studio and APIs was down, along with the ability to upload collected data. Once the service was restored, all collected data was uploaded and access to the SurveyToGo Studio and APIs was restored. There was no data loss involved.

 

Affected Services

The affected services were:

  1. SurveyToGo Studio Access
  2. Upload of collected data (offline data collection was not affected)
  3. Online login from tablets
  4. Survey syncing
  5. REST API access

Root Cause

The SurveyToGo platform is hosted in the US-EAST region of the Amazon Web Services (AWS) cloud. Amazon is a leading cloud service provider and has a very solid cloud offering.

  • At ~1:20am, the Dooblo Datacenter team identified increased error messages from the system and checked with the AWS team, who reported that they were investigating connectivity issues in the data center. Shortly afterwards, AWS confirmed that a power event had impacted some of the servers and network devices in the US-EAST region.
  • The SurveyToGo Datacenter was impacted by this event, specifically the server and disks that host the main SurveyToGo database.
  • After about 1 hour (~2:30am), AWS managed to restore most servers. However, while our main database server was up and running again, the set of disks that host the database itself was still recovering.
  • SurveyToGo runs in a fully mirrored database environment, and our secondary (mirror) was not impacted. Even so, after careful consideration of the issue and consultation with AWS support, it was decided to allow the AWS team to recover the affected disks instead of going forward with a forced failover to our mirror system.
  • ~4 hours later (~5 hours after the start of the issue), at ~6:30am GMT+2, the Amazon AWS team completed the promised recovery of the disks and SurveyToGo was up and running normally again.
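
For readers who want to check the durations quoted in the timeline, the following is a minimal sketch that computes the elapsed time between the approximate times given in this article. The event names (such as errors_detected) are labels made up for this illustration, and all times are the rounded GMT+2 values stated in the text, not exact log timestamps.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Approximate times quoted in this post mortem (GMT+2, June 1st 2018).
# The keys are illustrative labels, not official event names.
TIMELINE = {
    "errors_detected": datetime.strptime("2018-06-01 01:20", FMT),
    "most_servers_restored": datetime.strptime("2018-06-01 02:30", FMT),
    "disks_recovered": datetime.strptime("2018-06-01 06:30", FMT),
    "outage_window_end": datetime.strptime("2018-06-01 06:50", FMT),
}


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two named events in TIMELINE."""
    return TIMELINE[end] - TIMELINE[start]


def as_hours_minutes(delta: timedelta) -> str:
    """Format a timedelta as 'Hh MMm'."""
    total_minutes = int(delta.total_seconds() // 60)
    return f"{total_minutes // 60}h {total_minutes % 60:02d}m"


if __name__ == "__main__":
    pairs = [
        ("most_servers_restored", "disks_recovered"),
        ("errors_detected", "disks_recovered"),
        ("errors_detected", "outage_window_end"),
    ]
    for start, end in pairs:
        print(f"{start} -> {end}: {as_hours_minutes(elapsed(start, end))}")
```

With these rounded inputs, the disk recovery takes 4 hours from the point most servers were restored, about 5 hours 10 minutes from the first errors, and the full outage window stated in the Overview spans 5 hours 30 minutes.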

For your convenience, here are the AWS Health console updates regarding the issue:

 

aws-outage.png

 

Lessons learned

Every service disruption is a chance to learn and make ourselves better. In this case, we are confident that the platform architecture was planned correctly and that our mirrored standby system was ready to take over. Had the AWS team notified us that the recovery would take a few hours longer, we would have switched to it. Regardless, for Dooblo, the lessons learned center on the following:

  • Customer notification of service issues and customer updates.
  • Better monitoring facilities, and more publicly exposed ones, so that customers will have a go-to web page for status updates and availability queries.
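
To illustrate the kind of status summary such a page could show, here is a minimal sketch, not a description of Dooblo's actual monitoring. It assumes each service from the "Affected Services" list has a probe function that returns True when the service responds. The probe functions, the Status levels, and the names check_services and overall_status are all hypothetical and exist only for this example.

```python
from enum import Enum
from typing import Callable, Dict


class Status(Enum):
    """Hypothetical status levels for a public status page."""
    OPERATIONAL = "operational"
    PARTIAL_OUTAGE = "partial outage"
    MAJOR_OUTAGE = "major outage"


def check_services(probes: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each probe; a probe that raises is treated as 'service down'."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return results


def overall_status(results: Dict[str, bool]) -> Status:
    """Collapse per-service results into one headline status."""
    up = sum(results.values())
    if up == len(results):
        return Status.OPERATIONAL
    if up == 0:
        return Status.MAJOR_OUTAGE
    return Status.PARTIAL_OUTAGE


if __name__ == "__main__":
    # Stand-in probes; real ones would call the actual services.
    demo_probes = {
        "SurveyToGo Studio access": lambda: False,
        "Upload of collected data": lambda: False,
        "REST API access": lambda: True,
    }
    results = check_services(demo_probes)
    for name, is_up in results.items():
        print(f"{name}: {'up' if is_up else 'down'}")
    print(f"Overall: {overall_status(results).value}")
```

Keeping the probes as plain callables separates how a service is checked from how the result is reported, so the same summary logic could sit behind a status web page or an alerting job.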

Looking forward

We are constantly trying to improve our service offering, and availability is a major aspect of the service. We will be using this outage to learn and improve our availability offering in order to provide you with the best service we can. Finally, we wish to extend a very warm thank you to all of our customers, who were very understanding of the issues and waited patiently while we investigated and worked on bringing the service back up safely.


Comments

1 comment
  • That's a good action plan.
    Thank you very much for all information.
