Root Cause Analysis - Server Unavailability on April 6th, 2021

Overview

The SurveyToGo infrastructure running in the global datacenter was unavailable from Tuesday, April 6th, 2021, 15:00 GMT until Friday, April 9th, 2021, 14:15 GMT.

Impact

During the event, the service was entirely offline: Studio operations, API exports and the CAWI servers were unavailable. Thanks to the offline capabilities of SurveyToGo, field teams could keep working and collecting data on mobile devices; that data was then uploaded once the service became available again.

Due to the event, data and attachments collected between March 3rd and April 6th were decoupled from their underlying surveys. In addition, changes made after March 3rd to the following items were lost:

  • Surveys that were edited or created.
  • Subject stores that were added or edited.
  • Participant lists that were added or edited.
  • CAWI links that were created.

Most of the collected data was later connected back to the surveys.

Order of Events and Root Cause

On Tuesday, April 6th, 2021, at 14:56 GMT, the hard drive that contains our main database stopped responding. The exact cause is still under investigation by the AWS support teams. Two possible causes are being investigated:
1. A physical disk error in one of the AWS volumes
2. An issue with the storage driver that caused the file to be truncated at 2 TB
The database server was restarted at this point. After the restart of both the database server and the Windows instance, the issue was still present, and a decision was made to attach an additional large hard drive that would be used to extract the database backups, and to create a snapshot of the backup and main database volumes.

Attaching the new large hard drive caused the server’s Windows OS to crash. Our investigation, conducted together with the AWS team, showed that this was the result of a bug in a Windows storage driver (Rhelsvc) related to attaching large volumes (over 2 TB) to an existing Windows server running the specific OS version our servers were configured with.

After the server was restarted, the bug corrupted the other existing large drives as well. As a consequence, both the main DB and backup drives became inaccessible.

Over the next two days (April 7th and 8th) we attempted to restore the information from either the backup or the DB drives. These efforts were carried out together with several data recovery experts but were ultimately unsuccessful.

On April 8th a decision was made to use the most recent backup that had not been corrupted in the event, which was taken on March 3rd. In parallel, we began developing methods and tools for restoring the data and reconnecting it to the underlying surveys.

  • Surveys were restored from various caches and file stores.
  • CAPI results were sorted by organization and prepared for manual coupling with the Analyzer tool.
  • CAPI device logs were collected automatically from device synchronizations and support logs, and were used to match results to devices and back to their surveys.
  • CAWI results were restored from existing survey logs and system logs.

On Friday, April 9th, at 14:15 GMT, the servers were restarted and brought back online.

After the servers were brought back online, the team kept working around the clock, and additional resources were invested in reconnecting the various parts of the survey data.

On April 25th the server was migrated to a new machine running up-to-date versions of Windows Server and SQL Server, and a mirror database was set up for redundancy.

Immediate System Upgrades That Already Took Place

  1. Database servers were upgraded to the newest versions of Windows Server and SQL Server, with up-to-date drivers and patches.
  2. A SQL Server Always On configuration was set up, with a mirror database constantly synchronized with the main database.
  3. Backup frequency was increased to an hourly scheduled backup plan (see the illustrative sketch after this list).
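
As an illustration of how these new safeguards can be verified, here is a minimal monitoring sketch. It assumes a SQL Server instance reachable through pyodbc and a database named SurveyToGo; the connection string and database name are placeholders, not the actual production configuration. The sketch queries the standard msdb backup history and the Always On replica-state view to check that a backup finished within the last hour and to report the synchronization health of the mirror replica.

    # Illustrative monitoring sketch only; connection details are placeholders.
    from datetime import datetime, timedelta
    import pyodbc

    CONN_STR = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=db-server.example.internal;DATABASE=master;"
        "Trusted_Connection=yes;"
    )
    DATABASE = "SurveyToGo"  # hypothetical database name

    conn = pyodbc.connect(CONN_STR)
    cur = conn.cursor()

    # 1. Most recent backup of the database, from the msdb backup history.
    cur.execute(
        "SELECT MAX(backup_finish_date) FROM msdb.dbo.backupset "
        "WHERE database_name = ?", DATABASE)
    last_backup = cur.fetchone()[0]
    print("Last backup finished at:", last_backup)
    if last_backup is None or datetime.now() - last_backup > timedelta(hours=1):
        print("WARNING: no backup completed within the last hour")

    # 2. Always On synchronization state of the database replicas.
    cur.execute(
        "SELECT drs.synchronization_state_desc, drs.synchronization_health_desc "
        "FROM sys.dm_hadr_database_replica_states AS drs "
        "JOIN sys.databases AS d ON d.database_id = drs.database_id "
        "WHERE d.name = ?", DATABASE)
    for state, health in cur.fetchall():
        print("Replica state:", state, "| health:", health)

    conn.close()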

Future Actions

The immediate actions that have already been taken make SurveyToGo significantly safer than it was before. However, we are taking additional precautions to ensure that, should a similar event occur, we are in a much better position to deal with it faster and with minimal damage, while also further increasing the level of transparency with our customers:

  1. Within 30 days 
    • Adding a system status page that will provide fully transparent, around-the-clock monitoring of the status of all services
  2. Within 60 days
    • Strengthening the coupling of attachments with their underlying survey and interview result
    • Increasing the level of detail in the logs to better connect interview results with their underlying surveys
    • Improving database performance and utilization to add further layers of safety
  3. Within 90 days
    • Adding remote (off-site) backups on top of the local backups; a minimal sketch of this idea appears below
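
As a rough illustration of the remote backup idea, the sketch below copies local SQL Server backup files to an S3 bucket with boto3. The local folder, bucket name and key prefix are made-up placeholders, and credential handling is left to the default AWS configuration; this is not a description of the final implementation.

    # Illustrative sketch only: copy local backup files to an off-site S3 bucket.
    # The folder, bucket name and prefix below are placeholders.
    from pathlib import Path
    import boto3

    LOCAL_BACKUP_DIR = Path(r"D:\Backups")     # hypothetical local backup folder
    BUCKET = "surveytogo-offsite-backups"      # hypothetical bucket name
    PREFIX = "sql-backups"

    s3 = boto3.client("s3")

    for backup_file in sorted(LOCAL_BACKUP_DIR.glob("*.bak")):
        key = f"{PREFIX}/{backup_file.name}"
        print(f"Uploading {backup_file} -> s3://{BUCKET}/{key}")
        s3.upload_file(str(backup_file), BUCKET, key)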

 
