[Resolved] [CALA] [Major Issue] All services down
Started on December 7, 2020 at 7:21:00 AM GMT+0. Resolved after about 15 hours.
- Investigating: December 7, 2020 at 7:21:00 AM GMT+0 –
We have observed anomalies on some production systems, and the Rainbow Operations team is currently investigating this incident.
- Identified: December 7, 2020 at 7:50:31 AM GMT+0 –
A possible root cause has been identified, and we're working on a fix for this incident.
- Identified: December 7, 2020 at 8:56:20 AM GMT+0 –
Our IaaS Cloud provider has confirmed a cooling incident at their local facility. All associated services in CALA are currently powered down.
- Identified: December 7, 2020 at 9:39:20 AM GMT+0 –
The IaaS Cloud provider has managed to lower the overall temperature slightly, but not enough to reach a nominal state. The data center is still non-operational.
- Identified: December 7, 2020 at 10:34:30 AM GMT+0 –
The IaaS Cloud provider is continuing its investigation, but the temperature is still above the threshold and the whole data center is down.
- Identified: December 7, 2020 at 10:44:42 AM GMT+0 –
We are continuing to work on a fix for this incident.
- Identified: December 7, 2020 at 11:17:15 AM GMT+0 –
The IaaS Cloud provider has confirmed a leak in the cooling subsystem of the CALA data center. The local teams have fixed the leak and now need to refill the cooling tank (announced ETA: at least 2-3 hours).
- Identified: December 7, 2020 at 1:30:34 PM GMT+0 –
The IaaS Cloud provider says it should take another 1-2 hours to stabilize temperatures. CALA data center availability is expected in 2 hours. This is a current estimate and may change as the situation evolves.
- Identified: December 7, 2020 at 2:07:54 PM GMT+0 –
While we were implementing a contingency plan to prevent outages and consolidate the stability of the Rainbow infrastructure, a critical problem arose in our Latin America infrastructure because a third-party service provider suffered a major outage due to a cooling system failure. This problem had a snowball effect that impacted users around the world. Rainbow connections, bubbles and conferences were affected. Most services were restored at 13:00 CET for users in EMEA, and the NAR and APAC regions are being updated. Our Operations team is working with the local IaaS provider to restore services for CALA users. Updates are being provided here on the status.openrainbow.com page. Our Operations and R&D teams are fully focused on fixing this issue as soon as possible. A detailed RCA will follow soon in our Help Center.
- Identified: December 7, 2020 at 5:35:59 PM GMT+0 –
Our IaaS Cloud provider has fixed the cooling issue, and the temperature has reached a nearly acceptable operating level. The Cloud provider is progressively restarting hardware and network components. A further update will be provided once our servers are powered up.
- Identified: December 7, 2020 at 8:04:38 PM GMT+0 –
Our IaaS Cloud provider has progressively restored power and network in the facility. We have managed to get access to a few servers. We're currently assessing the problems and working to recover from the situation.
- Monitoring: December 7, 2020 at 9:27:00 PM GMT+0 –
We've checked the various hardware configurations and everything appears to be up and running. Services are restored. However, the incident is not yet closed on our IaaS Cloud provider's side, so we are continuing to monitor the situation.
- Resolved: December 7, 2020 at 10:00:00 PM GMT+0 –
Our monitoring and log systems confirm that service is now fully restored to its nominal level.
The Root Cause Analysis is available in the Help Center.