
[WW] Rainbow / Hub Softphony — Intermittent Connectivity Issues

Resolved
Started 5 days ago · Lasted 2 days

Affected

Global (WW)
  • [WW] Rainbow Hybrid PBX Telephony
  • [WW] Rainbow Administration & Subscriptions

Europe & Middle East (EMEA)
  • [EMEA] Rainbow Core Services
  • [EMEA] Rainbow Hub Voice Services

All affected regions and components followed the same status timeline: Major outage from 6:28 AM to 8:36 AM, Operational from 8:36 AM to 8:37 AM, Major outage from 8:37 AM to 12:54 PM, Partial outage from 12:54 PM to 1:27 PM, Degraded performance from 1:27 PM to 2:55 PM, Operational from 2:55 PM to 11:00 AM.

Updates
  • Resolved
    This incident has been resolved.
  • Postmortem

    Post‑mortem Update – Apr 02, 13:00 CEST
    Following the application of OVH’s fix yesterday at approximately 22:00 CEST, all data centers are now fully operational. All services have returned to normal and continue to be closely monitored.

    ===
    Apr 01, 2026 10:07 AM
    Incident Report: Root Cause Analysis – Rainbow Data Center 

    1. Executive Summary  

    Our data center provider experienced a network routing anomaly that caused significant service degradation, impacting Rainbow services across our different regions.

    The incident was caused by unstable IP routing between the data centers' virtual routers, which resulted in an active/passive switch flapping scenario that ultimately isolated database nodes and interrupted client connections. All services have now been restored by routing traffic to an alternate data center, and we are currently in strict monitoring mode.
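
    For illustration only, the sketch below shows one generic way such flapping can be detected: count active/passive role transitions over a sliding time window and alert once they exceed a threshold. The probe callback, window size, and threshold are assumptions for this example, not OVH's or our actual tooling.

      # Illustrative flap detector (assumed tooling, not OVH's or ours):
      # alert when the active/passive role changes too often in a window.
      import time
      from collections import deque
      from typing import Callable

      FLAP_WINDOW_S = 300    # sliding window: last 5 minutes (assumed value)
      FLAP_THRESHOLD = 3     # more than 3 role changes in the window => flapping

      def watch_for_flapping(probe_active_router: Callable[[], str]) -> None:
          # probe_active_router is a hypothetical probe returning which
          # router currently holds the virtual IP (e.g. a VRRP state query).
          transitions: deque = deque()
          last_active = probe_active_router()
          while True:
              time.sleep(5)
              active = probe_active_router()
              if active != last_active:
                  transitions.append(time.monotonic())
                  last_active = active
              # Discard transitions that have aged out of the window.
              now = time.monotonic()
              while transitions and now - transitions[0] > FLAP_WINDOW_S:
                  transitions.popleft()
              if len(transitions) > FLAP_THRESHOLD:
                  print("ALERT: active/passive flapping detected")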

    2. Timeline of Events 

    • 31/03, 08:00: The flapping issue escalated at the OVH data center, leading to a split-brain condition on our database infrastructure. Service disruption became widespread.

    • 31/03, 08:00 – 10:00: The team assessed all available options to recover service without resorting to a full platform restart, which was considered a last resort given the risk of a broader outage. Despite these efforts, the underlying network instability could not be resolved through targeted intervention alone. 

    • 31/03, 11:00 - 11:30: The engineering team attempted a first restoration. This was unsuccessful as the network continued to flap between DCs. 

    • 31/03, 12:00: A second restart was initiated, attempting to distribute the load across different data centers. 

    • 31/03, 13:00: Our teams executed a major restoration by isolating faulty data center network elements.

    • 31/03, 13:30-16:30: Rainbow services were recovered, and users progressively regained access to their services across the different regions.

    • Current Status: All IP addresses from Roubaix were successfully switched to the Gravelines data center, and services have been restored (the general failover pattern is sketched below). We are now in a dedicated monitoring phase and are working with the OVH team to ensure the stability of the Roubaix data center.
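
    The sketch below illustrates the general pattern behind such a switch: move service IPs to an alternate data center after repeated failed health checks of the primary. It is a simplified illustration; the actual switch was performed through the hosting provider's IP failover mechanism, and the thresholds and callbacks here are assumptions.

      # Generic IP failover sketch (illustrative only; the real switch
      # used the hosting provider's failover mechanism, not this code).
      import time
      from typing import Callable

      FAILURES_BEFORE_SWITCH = 5   # assumed threshold
      CHECK_INTERVAL_S = 10        # assumed probe interval

      def run_failover(check_primary: Callable[[], bool],
                       switch_to_alternate: Callable[[], None]) -> None:
          # check_primary and switch_to_alternate are hypothetical hooks:
          # a health probe of the primary DC and an IP re-route action.
          consecutive_failures = 0
          while True:
              if check_primary():
                  consecutive_failures = 0      # healthy again, reset
              else:
                  consecutive_failures += 1
                  if consecutive_failures >= FAILURES_BEFORE_SWITCH:
                      switch_to_alternate()     # re-route the service IPs
                      return
              time.sleep(CHECK_INTERVAL_S)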

    3. Root Cause Description  

    The service disruption was triggered by a network malfunction at our data center, which is operated by our infrastructure provider, OVH. We observed repeated VRRP flapping, which prevented incoming traffic from reliably reaching the active destination server.

    This persistent routing instability ultimately disrupted database synchronization, resulting in a split-brain scenario in which database nodes operated independently without maintaining consensus. 

    As the VRRP protocol is managed externally by the hosting provider, direct intervention at the switch level was not possible on our end; an architectural bypass was therefore required to restore service.
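
    To make the split-brain mechanism concrete, here is a minimal, generic quorum sketch (not our database's actual election code): a node that cannot reach a majority of its peers must stop accepting writes rather than continue independently, which is precisely the consensus the flapping network prevented.

      # Generic quorum sketch (illustrative, not our database's logic):
      # a node that cannot see a majority of the cluster demotes itself
      # to read-only instead of diverging, avoiding split-brain writes.
      def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
          # Count ourselves, then require a strict majority of all nodes.
          return (reachable_peers + 1) > cluster_size // 2

      def on_health_check(reachable_peers: int, cluster_size: int) -> str:
          if has_quorum(reachable_peers, cluster_size):
              return "primary"      # safe to keep accepting writes
          return "read-only"        # isolated minority must not write

      # Example: in a 3-node cluster, a node cut off from both peers
      # loses quorum and stops accepting writes.
      assert on_health_check(2, 3) == "primary"
      assert on_health_check(0, 3) == "read-only"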

    4. Corrective Actions and Next Steps  

    We are accelerating the expansion of the Rainbow service to a new IaaS provider to avoid full dependency on the OVH Cloud network solution. A detailed timeline will be shared before the end of next week.

    Invitation: Partner Webinar on Rainbow Solution Stability 

    To ensure full transparency and ongoing collaboration, we will invite our partner community to a dedicated technical webinar next week. The invitation will follow before the end of this week.

  • Resolved

    The Rainbow service remains operational. OVH is still experiencing network issues in the Roubaix datacenter, which temporarily reduces our high‑availability capacity.
    Despite this, the platform is functioning normally, and we are maintaining reinforced monitoring to ensure service continuity.

    Our teams are closely following the situation with our provider and will take any additional measures needed.
    Thank you for your understanding while we worked through this incident.

  • Update

    Monitoring activities are still ongoing, and current indicators remain positive. Our IaaS provider has also made progress on resolving the underlying network issues. We will continue to observe the platform closely and provide further updates if necessary.

  • Update

    Services have been restored, and we continue to monitor the platform closely to ensure everything remains stable.
    If you experience any issue with Rainbow, simply logging out and back in should help.

  • Monitoring

    Services have been restored, and the vast majority of data is now available to users.
    Our teams are monitoring the platform closely to ensure full stability.

  • Update

    We continue to observe some instability on our infrastructure, but our teams have identified the contributing network issues and are actively working with our provider to address them.

    As part of our stabilization efforts, several servers are being redirected to an alternate site to ensure a more reliable environment.
    These actions are already helping to improve the situation, and we remain fully mobilized to restore full service as quickly as possible. Further updates will follow as progress continues.

  • Update

    Telephony services are starting to recover, and PBX systems are reconnecting. Some users may still experience slow response times, but overall availability continues to improve. Our teams remain fully engaged and are closely monitoring the situation to support the ongoing recovery.

  • Update

    Most of our services are coming back online, and availability continues to improve, even though some users may still experience limited access. Our teams remain fully committed and are making steady progress toward full recovery.

  • Update

    Services are starting to be partially restored, although not all users may be able to access them yet.
    Our teams continue to work on completing the recovery.

  • Update

    To stabilize the service, a portion of user traffic has been temporarily reduced while our teams work on redistributing the load across our systems. This may prevent access to the platform for some users.
    We are actively working to restore normal access as quickly as possible.

  • Update

    Our teams continue to work actively on restoring the service. While the situation remains complex and no definitive recovery path has been confirmed yet, all necessary actions are being taken to move forward.

  • Update

    Despite no significant change for users at this time, our teams continue to work steadily on the resolution. Based on current progress, we expect full recovery before the end of the day.
    We will continue to keep you informed and share more details as soon as we make further progress.

  • Update

    We are performing additional actions on our infrastructure, but the service has not yet returned to normal. Our teams are continuing to work through the situation, and we will share further details as soon as new progress is made.

  • Update

    We are still completing the restart process, and some services may take a bit more time to fully stabilize. Our teams remain actively engaged, and we will provide a further update as soon as the next milestone is reached.

  • Update

    The system restart is still in progress. We will provide another update as soon as the operation is completed.

  • Update

    The system restart is still ongoing and is expected to take approximately one hour. We will provide another update as soon as the process is completed.

  • Update

    As part of the ongoing incident resolution, a full system restart will be performed. This will cause active sessions to be disconnected. This action is necessary to restore the service and stabilize the platform.
    Our teams will be closely monitoring the situation throughout this operation, and we expect this step to accelerate the return to normal service.

  • Update

    We are still experiencing degraded performance, but we are seeing ongoing progress as our teams actively manage elevated traffic levels as part of the mitigation efforts.
    We currently estimate full recovery within the next several hours and are working to shorten this timeframe as improvements continue.

  • Update

    We are continuing to make progress in restoring the service, although elevated load is currently slowing the process. Our teams remain fully mobilized, and we will provide further updates as the situation evolves.

  • Update

    Our teams are actively working on restoring the service. Services are partially recovered, although we continue to observe latency. We will share further updates as progress is made.

  • Identified

    We’re facing a major incident caused by unexpected database latency. Our engineers are investigating and working to restore normal performance. We sincerely apologize for the inconvenience and will keep you updated.

  • Investigating
    We are currently investigating this incident.