Microsoft techies have pinpointed the SQL query that smashed the software giant's Visual Studio Team Services offline for several hours.
It appears the query triggered a crippling SQL Server 2014 bug that Redmond is now trying to fix.
Last week, Microsoft's cloud-based Team Services suffered a five-hour outage, leaving programmers unable to access or manage their accounts. Team Services provides tools for sharing code, tracking changes, testing builds, and so on ... when it's working, of course.
In the immediate aftermath, we were told the downtime's culprit was "an SQL stored procedure that ... was allocating too much memory," and thus overwhelmed the service's database machines. That troublesome SQL query was working fine on Microsoft's SQL Server 2008, but jumped the tracks when engineers reconfigured their systems to run the procedure on SQL Server 2014.
The problem lies in the rewritten cardinality estimator in SQL Server 2014. Cardinality estimation is supposed to improve a database's performance by predicting the number of rows a query is going to return, and allocating resources early. This means the server can grab the memory it needs before getting stuck into executing the SQL statement, thus avoiding the need to allocate chunks in memory on-the-fly (which can introduce deadlocks and other slowdowns) or swapping to disk (which destroys performance).
The crucial SQL query – used by Team Services to handle customer accounts, identities, authorizations – is supposed to pull up one row, and it looks like this:
SELECT TOP 1 @collidingAccountName = ids.AccountName, @preExistingDomain = i.Domain, @preExistingTfidGuid = i.Id FROM #identities ids INNER LOOP JOIN tbl_Identity i WITH (INDEX=IX_tbl_Identity_AccountName_TypeId, FORCESEEK, UPDLOCK, HOLDLOCK) ON i.PartitionId = @partitionId AND i.AccountName = ids.AccountName AND i.Domain = ids.Domain AND i.Sid <> ids.Sid WHERE i.PartitionId = @partitionId AND ids.TypeId in (5, 6) AND i.TypeId in (5, 6) OPTION (OPTIMIZE FOR (@partitionId UNKNOWN))
This caused SQL Server 2014 to predict that the query would return many, many more rows than the single one needed. In fact, the software thought it would require 3.5GB of RAM, rather than a very small amount, to cope with the request, which is bad news on a machine with 48GB of RAM. This query caused servers to run out of memory quickly, and leave customers unable to access their accounts.
As a quick fix, Redmond engineers changed the last line of the above procedure to:
OPTION (OPTIMIZE FOR (@partitionId UNKNOWN), MAX_GRANT_PERCENT = 1)
This forced the cardinality estimator to, at most, reserve just one per cent of the memory it thinks it's going to need, which is more than enough.
"In the SQL Server 2014 query optimizer, [Microsoft engineers] made significant changes to the cardinality estimation. I’m sure they were improvements but not for this query," blogged Brian Harry, Microsoft's vice president for cloud developer services.
"The ultimate resolution, for now, is that we added a hint to the query that tells the query optimizer the maximum memory grant to use for the query. Longer term, littering our code with memory grant limits is a very unsustainable practice. We use query plan hints a lot but memory grant caps are too fragile over time.
"So we are working with the SQL team to see about changes in the cardinality estimator that will do better for this query and others like it. There are also changes we can make to the way this query is written that will reduce the likelihood of mis-estimations. It turns out that in most cases, this query only operates on one row, but from a very big table, the cardinality estimator guessed a lot more than one row."
A more detailed breakdown of the query at fault can be found here. In the meantime, if you're running SQL Server 2014 with big tables, and your code features a query similar to the above snippet, be mindful of a sudden devouring of available memory.
We asked Microsoft to confirm that the SQL Server 2014 it runs in its Azure cloud, which powers Team Services, is the same SQL Server 2014 it ships to customers. We're still waiting to hear back from Redmond. ®