Root Cause Analysis - Server Unavailability on March 6th, 2021

Overview

The SurveyToGo server running in the global datacenter was unavailable for several hours on Saturday, March 6th, 2021, from 6:46 AM GMT until 6:25 PM GMT the same day.

Impact

During the first few hours of the event, data retrieval and exports were still possible, as was data collection on surveyor devices. The impact during those first few hours was mainly on online surveys, which were not available, and on data uploads from surveyor devices, which were not successful. In addition, editing of data through the Studio (scripts, collected data, permissions etc.) was not available.

As a result of the efforts to remediate the issue, we reached a state where we had to take the servers completely down for a process that took several hours, and during that period the system was completely unavailable. That said, due to the offline capabilities of SurveyToGo, field teams could keep working and collecting data on mobile devices, and that data was uploaded once the server became available again.

Order of events and Root Cause

As part of our IT enhancement processes to support the constantly increasing usage, we had identified several improvements that we wanted to make to our existing infrastructure. Among those is a significant enhancement to our database server, which was planned for several weeks from now.

When planning the enhancement activity, we knew that our databases' free space was decreasing, but we analyzed the recent trends and estimated that the existing setup could carry the load until the enhancement was done. Based on past experience, we also assumed that we could extend that space on demand in case the usage trend exceeded our expectations in the coming weeks.
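The kind of trend estimate described above can be illustrated with a minimal sketch. It is not the tooling we actually used; the function name `days_until_full`, the sampling interval, and the sample values are all hypothetical. It fits a straight line to recent free-space readings and projects how long the remaining space would last at that rate.

```python
def days_until_full(free_gb_samples, interval_days=1.0):
    """Estimate the days left until free space reaches zero.

    Fits a least-squares line to equally spaced free-space readings
    (oldest first) and divides the latest reading by the consumption rate.
    Returns None when free space is not shrinking.
    """
    n = len(free_gb_samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")

    xs = [i * interval_days for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(free_gb_samples) / n

    # Least-squares slope: change in free GB per day (negative when shrinking).
    numerator = sum((x - mean_x) * (y - mean_y)
                    for x, y in zip(xs, free_gb_samples))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator

    if slope >= 0:
        return None  # free space is stable or growing
    return free_gb_samples[-1] / -slope


# Hypothetical daily readings, in GB of free space.
print(days_until_full([100, 90, 80, 70]))
```

A linear fit of this kind only holds while usage keeps growing at the same pace; a sudden acceleration in usage makes the projection too optimistic.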

On the day of the event, we realized that such an extension of space would be needed. Usage had increased at a faster pace than expected, and we started seeing that the database backup process was not able to complete successfully (which also caused the database log to expand significantly).

During the process we realized that the disk space expansion would not be possible in this case, as we had reached the available limits for the specific server, and that the issue affected the secondary database instance as well.

There was no alternative at that point in time but to transfer the existing database to a larger server.

That process took several hours to complete, at the end of which the servers were restored to normal operation.

Lessons learned and actions we are taking

  1. We have enhanced the monitoring schemes to alert earlier on similar issues and allow a larger margin of error.
  2. We will be setting up, as part of our IT Infrastructure enhancements, an always-on availability group that will allow a more robust failover process.
  3. Status Page – As part of our investment in transparency, we will be adding a status page that presents the real-time status of our services. The page will be shared with our customers and available at all times.
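
To illustrate the first item, alerting earlier with a larger margin of error, here is a minimal sketch of one possible rule. It is an assumption for illustration, not our actual monitoring configuration; the function name `should_alert`, its parameters, and the default safety factor are hypothetical. The idea is to raise an alert while the projected time to a full disk is still a multiple of the time needed to react.

```python
def should_alert(free_gb, daily_growth_gb, lead_time_days, safety_factor=2.0):
    """Return True when the projected time until the disk is full drops
    below the reaction lead time multiplied by a safety factor.

    free_gb:          free space currently available
    daily_growth_gb:  observed consumption rate (GB per day)
    lead_time_days:   time needed to add capacity or migrate
    safety_factor:    margin of error applied to the lead time
    """
    if daily_growth_gb <= 0:
        return False  # no consumption, nothing to project

    days_left = free_gb / daily_growth_gb
    return days_left < lead_time_days * safety_factor


# Hypothetical values: 70 GB free, 10 GB/day growth, 5 days to react.
print(should_alert(70, 10, 5))
```

A larger safety factor trades more frequent early warnings for extra time to act before space runs out.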

Looking forward

We are constantly trying to improve our service offering, and availability is a major aspect of the service. We will be using this incident to learn and improve in order to provide you with the best service we can.

 

Visit us at:  http://www.dooblo.net
