2022-08-17 08:53:51

EdSmart's critical incident response

By Sonja Bijou

We provide transparency around an outage incident, and list our set of activities and outcomes.

On Tuesday 16th August 2022, the EdSmart platform experienced a wide-ranging outage between 6:25am and 2:11pm Australian Eastern Standard Time (AEST) due to a technical issue with our primary production database hosted in Microsoft Azure.

In this article, we wish to provide transparency around the cause of the incident, and to list the activities and outcomes that we're in the process of implementing to prevent a recurrence of this edge-case scenario.

The situation

EdSmart’s primary database infrastructure is provided by Azure SQL – a fully managed, cloud-based Platform-as-a-Service (PaaS) implementation of Microsoft’s flagship relational database, SQL Server. We currently have two independent production instances: one servicing our South Australian customers, and the other, which is our ‘default’ instance, servicing the remainder of our customers.

EdSmart’s monthly Azure hosting costs are dominated by our production Azure SQL instances. To minimise unnecessary expenditure, we scale the database hosting tier up and down on weekdays to reflect the time of day for our various customers around the world.

Given that the majority of our customers using the ‘default’ instance are located on the east coast of Australia, we scale our default production database to its maximum level at 6:25am Australian Eastern Standard Time on weekdays, to meet the high demand as most of our schools begin their daily activities. At 10:25pm Australian Eastern Standard Time on weekdays, we scale the default database down to a lower level. Based on our analysis of usage statistics, that lower level is still more than enough to meet customers’ needs, and it has a significant positive impact on our monthly Azure hosting bill.
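To make the schedule concrete, here is a minimal sketch of the decision it implies. It is an illustration only, not our actual scaling automation: the tier labels "max" and "low" and the function name are hypothetical, and it assumes the database simply stays at the lower level over the weekend.

```python
from datetime import datetime, time, timedelta, timezone

# AEST as a fixed UTC+10 offset (no daylight saving adjustment).
AEST = timezone(timedelta(hours=10), "AEST")
SCALE_UP = time(6, 25)     # 6:25am on weekdays
SCALE_DOWN = time(22, 25)  # 10:25pm on weekdays


def target_tier(now: datetime) -> str:
    """Return the (hypothetical) tier label the schedule implies for `now`."""
    local = now.astimezone(AEST)
    is_weekday = local.weekday() < 5  # Monday=0 ... Friday=4
    if is_weekday and SCALE_UP <= local.time() < SCALE_DOWN:
        return "max"
    # Assumption: nights and weekends stay on the lower tier.
    return "low"


if __name__ == "__main__":
    print(target_tier(datetime(2022, 8, 16, 6, 25, tzinfo=AEST)))
```

A timezone-aware datetime in any zone can be passed in; it is converted to AEST before the comparison.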

Microsoft provides a ‘four 9s’ guarantee of availability for their Azure SQL infrastructure. This means they guarantee the service will be available at least 99.99% of the time. For those geeks (like me!) who have memorised the number of seconds in a year (31,536,000), that’s 3,153.6 seconds, or about 52 and a half minutes of acceptable downtime per year, typically spread across the whole year with only a few seconds of outage being experienced at any one given time.

The outage we experienced on 16th August 2022 was entirely attributable to Microsoft, and obviously fell well outside of the bounds of their service level agreement.
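For anyone who wants to check the arithmetic, the sketch below reproduces the downtime budget and compares it with the outage window reported in this article. It follows the annualised framing used above and does not attempt to model how Microsoft measures its service level agreement; the function name is purely illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def downtime_budget_seconds(availability_percent: float) -> float:
    """Seconds of downtime per year permitted by an availability guarantee."""
    return SECONDS_PER_YEAR * (100 - availability_percent) / 100


# Outage window from this article: 6:25am to 2:11pm on the same day.
outage_seconds = (14 * 60 + 11 - (6 * 60 + 25)) * 60

budget = downtime_budget_seconds(99.99)
print(f"Annual budget: {budget:.1f} s ({budget / 60:.1f} min)")
print(f"Outage: {outage_seconds} s, {outage_seconds / budget:.1f}x the annual budget")
```

The outage window works out to 27,960 seconds, which is roughly 8.9 times the whole year's allowance under a 'four 9s' guarantee.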

EdSmart encounters Azure SQL service outages of up to a minute on rare occasions, but an outage of this magnitude was unprecedented for us. It was caused by a rare but known fault when scaling between database hosting tiers, in which the scaling process gets ‘stuck’ and the database is not able to read or write data. This issue is rare enough that no one in the EdSmart engineering team had encountered it before, despite the several combined decades of experience we have in working with Azure SQL. However, while responding to the situation, we found others around the world who had experienced a similar issue when scaling.

Our monitoring and alerting system did not pick up the problem when it first occurred yesterday and, even now, in the Azure resource health page for the database, no outage is listed. As far as the built-in monitoring capability was concerned, the database was ‘up’ and performing its normal activity (i.e. switching the scale of the infrastructure from one tier to another). It was only when schools started reporting a complete lack of response from the platform that we diagnosed the issue and raised a critical support ticket with Microsoft. Unfortunately, this meant that a couple of hours of support time were lost.

At 9:46am, a little over an hour after we reported the issue and with no diagnostic progress reported by Microsoft, we contacted our account representative, who was able to escalate the issue internally within Microsoft given our status as a high-value, long-term customer. Immediately after escalating, we began contingency plans to attempt to restore service using our own methods in parallel with the Microsoft support team’s efforts.

Our response

Our first strategy was to attempt to cancel the scaling activity using Azure CLI – a scripting technology used for lower-level infrastructural configuration and management. The cancellation command was accepted; however, the cancellation itself then remained in progress alongside the scaling process it was meant to stop.

Next, we began to look at restoring backups. In addition to manual, long-term retention backups, our database is configured to provide continuous short-term backups for any point in time, at the granularity of one minute, for the previous four days. This is a feature of the ‘hyperscale’ tiers we use for production data (more information on Azure SQL hyperscale technology and our zero-minute recovery-point-objective backups can be found here).
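As a small illustration of what one-minute granularity and a four-day window mean in practice, the sketch below picks the latest restore point that falls safely before a scaling operation began and checks that it is still within retention. The function and constant names are hypothetical, and this is not how Azure itself selects restore points; it only encodes the two figures quoted above.

```python
from datetime import datetime, timedelta
from typing import Optional

RETENTION = timedelta(days=4)       # short-term backup window described above
GRANULARITY = timedelta(minutes=1)  # restore points at one-minute granularity


def pre_scaling_restore_point(scaling_started: datetime,
                              now: datetime) -> Optional[datetime]:
    """Latest whole-minute point at least one minute before scaling began.

    Returns None when that point has already aged out of the retention window.
    """
    candidate = (scaling_started - GRANULARITY).replace(second=0, microsecond=0)
    if candidate < now - RETENTION:
        return None
    return candidate


if __name__ == "__main__":
    started = datetime(2022, 8, 16, 6, 25)
    print(pre_scaling_restore_point(started, now=datetime(2022, 8, 16, 9, 46)))
```

Both arguments are expected to be in the same timezone (or both naive), since they are compared directly.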

While Microsoft continued to resolve the ‘stuck’ database scaling issue, and while our Azure CLI cancellation of the scaling process also seemed to be stuck, our subsequent parallel strategy was to restore the database to another instance using the latest point-in-time backup – which would initially be created at our regular low, nightly tier – and then attempt to scale that instance up to the maximum tier, trusting that the problem would be extremely unlikely to reoccur. We discovered that the nature of the scaling process, however, meant that we were locked out of restoring our point-in-time backups until the scaling was complete, and so we couldn’t continue with this approach.

Note that this block on database restoration is not the case when a database is down for any other reason, but the internal mechanics of scaling have somehow led to Microsoft disabling restores while the process is running.

We then attempted to provision a completely new database. The provisioning procedure does have access to our other hyperscale point-in-time backups, and allows a new database to be created by seeding it with the data from a backup. In this case, we were able to initiate database provisioning with the backup as a seed, and we weren’t blocked at any stage via the Azure portal interface.

But the new database was taking a huge amount of time to create. At first, we assumed this was due to the size of the database (approaching a terabyte) but, as the two-hour mark was drawing near, we suspected that the backups undertaken while scaling were somehow affected by the scaling process, and we weren’t able to select an older pre-scaling point-in-time restore using this method. Although we have tested restoring Azure hyperscale point-in-time backups many times, it seems to be problematic when this particular scaling issue occurs.

Apparently when Azure SQL gets stuck while scaling, it really wants to stay stuck! As we began to look at alternate strategies – while communicating with customers via our incident reporting system (thankfully not based on SQL) – we were notified by Microsoft Support that they had fixed the issue. In total, the database was down for seven hours and 46 minutes, and we are now in the process of seeking reimbursement for breach of the Azure SQL service level agreement.

Microsoft are working on a root cause analysis of the problem from their side, and have said it may take a day or two to fully diagnose the issue and prepare their report.

The post-mortem

Following the incident, we are implementing the following strategies to improve our response if, and when, this situation reoccurs:

  1. Monitoring and alerting: We are extending the built-in Azure health diagnostics to include our own sample data reads and writes at regular intervals, essentially making sure that not only is Microsoft saying that the database is up, but we are also able to use it to do what it’s there for – namely reading and writing data. This monitoring will feed into our critical alert system when there is an issue, which means we will receive out-of-hours SMS notifications.

  2. Backups and restores: We will take more regular long-term backups, and not solely rely on Microsoft’s point-in-time hyperscale backups for addressing service outages. We can then have access to these backups independently of the running instance of the backed-up database, which turns out to be necessary when this particular scaling issue occurs.

  3. Configuration consolidation: We identified that, when restoring a backup to a completely new database instance, there are a number of places where services need to be reconfigured so that they point to the correct instance. We will be implementing a strategy to consolidate these configuration points into a single place. This gives us a single point of truth specifying which database to use for all of our services.

  4. Hot standby cutover: We are investigating the possibility of short-term provisioning of a secondary hot standby replica that we cut over to each time we scale the primary instance.
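To illustrate the first item above, here is a minimal sketch of a read/write health probe. It is not our production monitoring: it uses Python's built-in sqlite3 module purely as a stand-in for the real database, and the table name, function names and timeout value are all hypothetical. The point it demonstrates is that the probe writes a row, reads it back, and treats a timeout as a failure, so a database that is nominally 'up' but unable to serve data is still reported as unhealthy.

```python
import sqlite3
import time
import uuid
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def write_then_read(connect: Callable[[], sqlite3.Connection]) -> bool:
    """Write a unique sentinel row, read it back, then clean it up."""
    token = uuid.uuid4().hex
    conn = connect()
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS health_probe (token TEXT)")
        conn.execute("INSERT INTO health_probe (token) VALUES (?)", (token,))
        conn.commit()
        row = conn.execute(
            "SELECT token FROM health_probe WHERE token = ?", (token,)
        ).fetchone()
        conn.execute("DELETE FROM health_probe WHERE token = ?", (token,))
        conn.commit()
        return row is not None and row[0] == token
    finally:
        conn.close()


def probe(connect: Callable[[], sqlite3.Connection],
          timeout_seconds: float = 5.0) -> str:
    """Return "healthy" only if a write and a read complete within the timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(write_then_read, connect)
    try:
        return "healthy" if future.result(timeout=timeout_seconds) else "unhealthy"
    except FutureTimeout:
        return "unhealthy"  # no answer in time: treat as an outage
    except Exception:
        return "unhealthy"  # connection or query error
    finally:
        pool.shutdown(wait=False)  # do not block on a hung worker


def stuck_connect() -> sqlite3.Connection:
    """Simulated unresponsive database, for demonstration only."""
    time.sleep(0.5)
    return sqlite3.connect(":memory:")


if __name__ == "__main__":
    print(probe(lambda: sqlite3.connect(":memory:")))
    print(probe(stuck_connect, timeout_seconds=0.05))
```

In a real deployment the result would feed an alerting system rather than being printed, and the probe would run at regular intervals against the production connection.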

Once we have received and reviewed Microsoft’s root cause analysis report, we will likely adopt other preventative and mitigative strategies based on their analysis, and we will be seeking their input on recommended response activities. 

I would like to express our wholehearted apology for any interruption this unusual outage may have caused to your school's activities, and we look forward to implementing further measures to ensure it is not a repeatable event.