At 15:02 we were notified by an alarm that errors were occurring on some queries. The same error had occurred several times earlier in the day, but each time it had cleared itself shortly afterwards. This time we were also notified of problems retrieving data within the Oplift and Capsule applications. Investigation showed that some requests were succeeding, some were taking longer than normal, and some were timing out after 60 seconds.
Initially it looked like an error with a particular batch of services, so we restarted these, but with no effect. Further investigation, and some of the error messages, suggested network issues within AWS. After a period of working with AWS and analysing more service errors, AWS was given a clean bill of health and we found some TCP connection errors. These turned out to be on connections to MongoDB Atlas, one of our other data sources. This finally led us to a database instance that was permanently stuck at 100% disk utilisation. Attempts to reboot this instance initially failed, but we were eventually able to do so, and this resolved the issue. While this problem had never previously occurred within our MongoDB cluster, we have now added additional monitoring and alarms so that we can detect and resolve any similar issue promptly should it occur again.
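To illustrate the kind of check the new monitoring performs (the function name, threshold, and sample count below are illustrative assumptions, not our exact configuration), an alarm of this type fires when disk utilisation stays pegged at or near 100% across several consecutive samples, rather than on a single brief spike:

```python
# Sketch of a sustained-high-disk-utilisation alarm.
# Threshold and sample count are illustrative assumptions,
# not our production configuration.

def should_alarm(samples, threshold=99.0, sustained=3):
    """Return True if the most recent `sustained` utilisation
    samples (percentages) are all at or above `threshold`."""
    if len(samples) < sustained:
        return False
    return all(s >= threshold for s in samples[-sustained:])

# A brief spike does not alarm...
print(should_alarm([40.0, 100.0, 55.0, 60.0]))    # False
# ...but an instance stuck at 100% does.
print(should_alarm([97.0, 100.0, 100.0, 100.0]))  # True
```

Requiring sustained breach avoids paging on transient spikes while still catching the stuck-at-100% condition that caused this incident.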