Database outage - 16 Aug 2022 - Resolved
Incident Report for EdSmart Platform Status
Postmortem

We have published a postmortem response to yesterday’s incident on our blog, available here: https://blog.edsmart.com/critical-incident-response

Posted Aug 17, 2022 - 19:06 AEST

Resolved
This incident has been resolved.
Posted Aug 16, 2022 - 16:35 AEST
Update
(Editing to change the Incident Status to Resolved.)

Service is now restored on the EdSmart Platform. We are awaiting a Root Cause Analysis (RCA) from the Microsoft Azure Support Team to explain how their infrastructure caused this significant outage whilst preventing us from activating failover processes.

We will continue to monitor and work to implement measures to eliminate or reduce the impact of this problem should it happen again.

Please continue to subscribe to https://status.edsmart.com for updates.

If you are still experiencing issues, or would like a detailed RCA, please contact help@edsmart.com.

We apologise for the inconvenience this may have caused and appreciate your patience and understanding today.
Posted Aug 16, 2022 - 16:34 AEST
Update
Service is now restored on the EdSmart Platform. We are awaiting a Root Cause Analysis (RCA) from the Microsoft Azure Support Team to explain how their infrastructure caused this significant outage whilst preventing us from activating failover processes.

We will continue to monitor and work to implement measures to eliminate or reduce the impact of this problem should it happen again.

Please continue to subscribe to https://status.edsmart.com for updates.

If you are still experiencing issues, or would like a detailed RCA, please contact help@edsmart.com.

We apologise for the inconvenience this may have caused and appreciate your patience and understanding today.
Posted Aug 16, 2022 - 16:24 AEST
Update
Updating to reflect that API and Application services are now running.
Posted Aug 16, 2022 - 14:30 AEST
Monitoring
Our database is now running and we are restoring services on the EdSmart Platform and running diagnostic tests. You may experience some performance issues while systems start back up.
Posted Aug 16, 2022 - 14:28 AEST
Update
Updating outage information to show that the API is unavailable in addition to the application.
Posted Aug 16, 2022 - 14:15 AEST
Update
Microsoft Azure Support Engineers are working to cancel the stalled auto scaling operation. We are awaiting further updates from the Azure Support team and will continue to provide updates via https://status.edsmart.com.
Posted Aug 16, 2022 - 14:10 AEST
Identified
The issue lies with the scale up/down functions in the Azure database platform. This is functionality we have been using without incident for about eight years. The most recent automated scale-up operation stalled, and our engineering team were unable to resolve it manually.

We have found some references online indicating that this is a very rare but known issue with Azure SQL.

The nature of the issue also appears to have sidestepped our monitoring. For example, our database connection monitoring did not trigger. It seems Azure 'thinks' the database is available when it actually isn't.
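
One way to catch a database that reports itself as available but cannot do any work is to run a real query under a deadline instead of checking connection status alone. The following is a minimal illustrative sketch, not EdSmart's actual monitoring code. The names `probe_database`, `run_query` and `timeout_s` are hypothetical, and `run_query` stands in for whatever call would execute a trivial statement (such as `SELECT 1`) through the real database driver.

```python
import concurrent.futures
import time
from typing import Callable


def probe_database(run_query: Callable[[], object], timeout_s: float = 5.0) -> bool:
    """Return True only if a real query completes within timeout_s seconds.

    run_query is a hypothetical callable that executes a trivial statement
    against the database. A stalled or failing query counts as unhealthy.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_query)
    try:
        future.result(timeout=timeout_s)
        return True
    except concurrent.futures.TimeoutError:
        # The query did not return in time: treat the database as unavailable.
        return False
    except Exception:
        # The query raised an error: also unhealthy.
        return False
    finally:
        # Do not block the monitor on a query that may never finish.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    # Simulated stalled database: the "query" outlasts the probe's deadline.
    print(probe_database(lambda: time.sleep(0.5), timeout_s=0.05))
```

The design choice here is that a timeout is reported as a failure, so a database that accepts connections but never answers would still raise an alert.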

The engineering team is dealing directly with senior account and engineering management at Microsoft Azure. In the meantime, our engineers are also working to make the database available via other means, as the accepted Azure restore methods for failover are also unavailable due to the current issue.
Posted Aug 16, 2022 - 12:55 AEST
Update
Microsoft Azure Support are still investigating the issue. It is ranked at the highest priority in their support system, and we are in constant contact with their account and technical management teams.
Posted Aug 16, 2022 - 11:44 AEST
Update
Microsoft Azure Support have escalated this critical issue to their management and are working with EdSmart to identify and resolve the problem urgently. Please continue to subscribe to https://status.edsmart.com/ for updates. Thank you for your understanding.
Posted Aug 16, 2022 - 10:06 AEST
Investigating
Our Support and Engineering teams are aware of an issue with our Microsoft Azure SQL Database as of 6:30 am local time. We are investigating the issue with the highest priority and have contacted Microsoft Support.

Please continue to subscribe to status.edsmart.com for updates. We apologise for the inconvenience this may have caused and will endeavour to restore service as soon as possible. We appreciate your patience and understanding during this time.
Posted Aug 16, 2022 - 08:51 AEST
This incident affected: API and Application.