At 15:02 we were notified by an alarm that errors were occurring on some queries. The same error had occurred several times earlier in the day, but each time it had cleared itself shortly afterwards. This time we were also notified of problems retrieving data within the Oplift and Capsule applications. Investigation showed that some requests were succeeding, some were taking longer than normal, and some were timing out after 60 seconds.
Initially it looked like an error with a particular batch of services, so we restarted these, but with no effect. Further investigation, and some of the error messages, suggested network issues within AWS. After a period of working with AWS and analysing more service errors, AWS was given a clean bill of health and we found some TCP connection errors. These turned out to be on connections to MongoDB Atlas, one of our other data sources. This finally led us to a database instance that was permanently stuck at 100% disk utilisation. Attempts to reboot this instance initially failed, but we were eventually able to do so, and this resolved the issue. While this problem had never previously occurred within our MongoDB cluster, we have now added additional monitoring and alarms so that we can detect and resolve any similar issue promptly should it occur again.
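To illustrate the kind of check the new monitoring performs (the function name, threshold, and sample count below are illustrative assumptions, not our exact configuration), an alarm of this type fires when disk utilisation stays pegged at or near 100% across several consecutive samples, rather than on a single brief spike:

```python
# Sketch of a sustained-high-disk-utilisation alarm.
# Threshold and sample count are illustrative assumptions,
# not our production configuration.

def should_alarm(samples, threshold=99.0, sustained=3):
    """Return True if the most recent `sustained` utilisation
    samples (percentages) are all at or above `threshold`."""
    if len(samples) < sustained:
        return False
    return all(s >= threshold for s in samples[-sustained:])

# A brief spike does not alarm...
print(should_alarm([40.0, 100.0, 55.0, 60.0]))    # False
# ...but an instance stuck at 100% does.
print(should_alarm([97.0, 100.0, 100.0, 100.0]))  # True
```

Requiring sustained breach avoids paging on transient spikes while still catching the stuck-at-100% condition that caused this incident.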