Salesforce users on the NA14 instance were recently affected by downtime caused by a power failure in the Washington D.C. data center.
A circuit breaker controlling power into the Washington D.C. data center failed. Unfortunately, the redundant power systems did not engage, which led to a power failure at the compute system level.
To restore service as quickly as possible, the NA14 instance was switched from its primary data center in Washington D.C. to its secondary data center in Chicago. Service was restored, although ten hours later a degradation in service was detected.
The degradation escalated to service disruption as a result of a database cluster failure, and customers on NA14 were unable to access the Salesforce service.
How were Salesforce users affected?
Salesforce users on the NA14 instance were unable to use the platform for 12 hours while Salesforce worked through the database cluster failure. Working with its database vendor, Salesforce found file discrepancies in the database, and attempting to restore the affected files resulted in further errors and failures. Three possible solutions were identified to get NA14 up and running again.
As peak activity time approached, and to avoid a second day of downtime, the team decided to use a local backup that was free of file discrepancies. This meant that data written during a roughly three-hour window could not be restored to NA14 – not ideal for any SaaS customer!
What is Salesforce going to do about it?
The root cause of the initial failure remains unknown. The breaker in question had passed load testing in March 2016 as part of a regular data center certification process. The power circuits have since been replaced, and a full audit of power and failover systems is being carried out in all data centers.
The increase in volume on the instance exposed a firmware bug on the storage array, increasing the time for the database to write to the array and causing timeout conditions when writing to the storage tier. Salesforce is working to ensure that if anything like this happens again, the disruption will be resolved more quickly; it is deploying a new data replication technology for standby copies to streamline the performance of site switchovers.
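The pattern described here – writes that exceed their deadline on a degraded primary, with service continuing from a standby copy – can be illustrated with a minimal sketch. This is not Salesforce's actual replication technology; the `Replica` class, latencies, and failover logic below are hypothetical, purely to show the timeout-and-failover idea.

```python
class Replica:
    """Toy storage replica; `latency` simulates write time in seconds."""
    def __init__(self, name, latency):
        self.name = name
        self.latency = latency
        self.data = {}

    def write(self, key, value, timeout):
        # Simulate a write blowing past its deadline (e.g. a firmware bug
        # inflating array write times, as happened on NA14).
        if self.latency > timeout:
            raise TimeoutError(f"{self.name}: write exceeded {timeout}s deadline")
        self.data[key] = value
        return self.name

def write_with_failover(replicas, key, value, timeout=0.5):
    """Try each replica in priority order, failing over on timeout."""
    last_err = None
    for replica in replicas:
        try:
            return replica.write(key, value, timeout)
        except TimeoutError as err:
            last_err = err  # record the failure and try the next standby copy
    raise RuntimeError("all replicas timed out") from last_err

primary = Replica("primary", latency=2.0)   # degraded by the firmware bug
standby = Replica("standby", latency=0.01)  # healthy standby copy
served_by = write_with_failover([primary, standby], "account:42", "enterprise")
print(served_by)  # → standby
```

The design choice the sketch highlights is that the timeout decision lives in the client path, so a slow primary is treated the same as a dead one and traffic moves to the standby without waiting for the write to finish.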
An investigation is being carried out with the database vendor to determine the root cause for the file discrepancies. Details of this investigation and full details of the outage are available on the Salesforce website.
What does this mean for cloud technology?
Unfortunately, any IT system – whether on-premises or in the cloud – is susceptible to service outages, and organizations need to be prepared for that. Having service level agreements (SLAs) in place is critical to ensure your organization is reimbursed for any inconvenience caused.
The loss of customer data is the news no one in the SaaS world ever wants to hear. The lost data could have been restored – but the NA14 instance would have remained down for the time the restore took. Salesforce made the difficult decision to restore from an older backup and resume service instead.
Fortunately, Salesforce is taking steps to ensure it can avoid making a difficult decision like this again. With Microsoft and Amazon also weathering outages last year, it's clear cloud technology is not going away. On-premises outages are not as widely reported because they affect only the organization that holds the data.
Unfortunately for Salesforce, Microsoft, and Amazon, the world of social media amplifies the message of any outage. In the modern world of business, transparency of information and regular service updates are critical, particularly during an outage, so customers know what the outage means for them, how to plan around it, and when normal service should resume.
When the announcement was made, we assessed all of our ClaimVantage customers and determined that none were affected.