Rainbow - [EMEA] Rainbow core services outage – Incident details


[EMEA] Rainbow core services outage

Resolved
Major outage
Started 4 months ago. Lasted about 6 hours.

Affected

Europe & Middle East (EMEA)

Operational from 8:47 AM to 9:49 AM, Major outage from 9:49 AM to 1:50 PM, Degraded performance from 1:50 PM to 2:33 PM

[EMEA] Rainbow Core Services

Operational from 8:47 AM to 9:49 AM, Major outage from 9:49 AM to 1:50 PM, Degraded performance from 1:50 PM to 2:33 PM

Germany (DE)

Operational from 8:47 AM to 9:49 AM, Major outage from 9:49 AM to 1:50 PM, Degraded performance from 1:50 PM to 2:33 PM

[DE] Rainbow Core Services

Operational from 8:47 AM to 9:49 AM, Major outage from 9:49 AM to 1:50 PM, Degraded performance from 1:50 PM to 2:33 PM

North America (NA)

Operational from 8:47 AM to 9:49 AM, Major outage from 9:49 AM to 1:50 PM, Degraded performance from 1:50 PM to 2:33 PM

[NA] Rainbow Core Services

Operational from 8:47 AM to 9:49 AM, Major outage from 9:49 AM to 1:50 PM, Degraded performance from 1:50 PM to 2:33 PM

Updates
  • Postmortem

    Post-Incident Report

    Incident Window: February 4th, 08:30 AM - 05:00 PM CET

    1. Incident Timeline & Recovery

    · 08:30 AM CET: Initial instabilities detected on European core infrastructure.

    · 09:45 AM CET: Global service disruption confirmed.

    · 10:30 AM CET: Traffic temporarily suspended by our ALE operations team to allow for a safe, synchronized server reinitialization.

    · 12:00 PM – 02:00 PM CET: Services progressively restored for most impacted users, with the exception of the Call-Log service.

    · 02:00 PM – 05:00 PM CET: Close monitoring of the situation and of service stability.

    · 05:00 PM CET: All services including Call-Log fully restored.
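
    As a reading aid, the phase durations implied by the timeline above can be worked out with a short script. This is only an illustrative sketch: the milestone times are copied from the timeline, while the labels and the function names (`phase_durations`, `total_minutes`) are hypothetical and not part of any Rainbow tooling.

```python
from datetime import datetime

# Milestones copied from the timeline above (all times CET, February 4th).
# The short labels are paraphrases added for this sketch.
MILESTONES = [
    ("08:30 AM", "Initial instabilities"),
    ("09:45 AM", "Global disruption confirmed"),
    ("10:30 AM", "Traffic suspended for reinitialization"),
    ("12:00 PM", "Progressive restoration"),
    ("02:00 PM", "Close monitoring"),
    ("05:00 PM", "All services restored"),
]


def parse(clock: str) -> datetime:
    """Parse a 12-hour clock string such as '02:00 PM'."""
    return datetime.strptime(clock, "%I:%M %p")


def phase_durations(milestones):
    """Return (label, minutes) for each phase between consecutive milestones."""
    phases = []
    for (start, label), (end, _) in zip(milestones, milestones[1:]):
        minutes = int((parse(end) - parse(start)).total_seconds() // 60)
        phases.append((label, minutes))
    return phases


def total_minutes(milestones) -> int:
    """Minutes between the first and the last milestone."""
    delta = parse(milestones[-1][0]) - parse(milestones[0][0])
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    for label, minutes in phase_durations(MILESTONES):
        print(f"{label}: {minutes} min")
    print(f"Incident window: {total_minutes(MILESTONES) / 60:.1f} h")
```

    Under these figures, the full incident window from first instability to full restoration spans 8.5 hours, of which the last 3 hours were close monitoring.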

    2. Scope of Impact

    · Rainbow Collaboration: Chat, Conference, and Webinar services in Europe, Middle East and Africa.

    · Rainbow Hub: Softphony services in Europe, Middle East and Africa.

    · Rainbow Hybrid Telephony: Worldwide services (Click-To-Call, Softphone).

    3. Technical Summary

    The incident originated from a network maintenance operation performed overnight between February 3rd and February 4th on an XMPP server cluster node. Although the standard restoration was completed at 04:00 AM CET on February 4th, subsequent network instabilities at 08:30 AM CET caused a desynchronization within the cluster, which ultimately led to the incident.

    To resolve the resulting service outage, our teams performed a controlled restart of all affected systems. We temporarily blocked traffic to prevent the recovering cluster from being overwhelmed. We then allowed the servers to fully resynchronize before restoring normal service to users.
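
    The recovery sequence described above (block traffic, let every server resynchronize, then restore service) can be sketched as a simple gate. This is a minimal illustration under assumptions, not the actual Rainbow implementation: the `RecoveryGate` class, its methods, and the node names are all hypothetical.

```python
class RecoveryGate:
    """Hypothetical gate that keeps user traffic blocked until every
    cluster node has reported that it is resynchronized."""

    def __init__(self, node_names):
        # No node is considered synchronized right after the restart.
        self.synced = {name: False for name in node_names}
        # Traffic starts blocked so the recovering cluster is not overwhelmed.
        self.traffic_blocked = True

    def mark_synced(self, name):
        """Record that one node has finished resynchronizing."""
        if name not in self.synced:
            raise KeyError(f"unknown node: {name}")
        self.synced[name] = True

    def all_synced(self):
        return all(self.synced.values())

    def try_open(self):
        """Unblock traffic only if every node is synchronized.
        Returns True when traffic is allowed after the call."""
        if self.all_synced():
            self.traffic_blocked = False
        return not self.traffic_blocked

    def admit(self):
        """Whether an incoming request would be accepted right now."""
        return not self.traffic_blocked


if __name__ == "__main__":
    gate = RecoveryGate(["node-a", "node-b", "node-c"])  # hypothetical names
    gate.mark_synced("node-a")
    gate.mark_synced("node-b")
    print(gate.try_open())  # still blocked: node-c has not resynchronized
    gate.mark_synced("node-c")
    print(gate.try_open())  # all nodes synchronized, traffic allowed
```

    The point of the sketch is the ordering: requests are refused until the last node reports in, so a partially synchronized cluster never receives user load.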

    4. Current Status

    As stated in the timeline above, all services have been fully operational since 05:00 PM CET.

    5. Corrective Actions

    Our in-depth analysis of this incident has led us to take additional actions to further strengthen the long-term stability of the platform in the coming weeks.

    · Adding communication nodes to create a more robust cluster architecture. This will further reduce the risk of synchronization failures during periods of unforeseen network instabilities.

    · Upgrading database nodes’ hardware. This will accelerate system recovery during incidents.
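
    One way to see why extra cluster nodes can help is the arithmetic of majority quorums. The report does not state how the XMPP cluster reaches agreement, so the majority-quorum model below is purely an assumption for illustration, and the function names are hypothetical.

```python
def majority_quorum(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be unreachable while a majority remains."""
    return nodes - majority_quorum(nodes)


if __name__ == "__main__":
    # Assumed cluster sizes, chosen only to illustrate the trend.
    for size in (3, 4, 5, 7):
        print(size, majority_quorum(size), tolerated_failures(size))
```

    Under this assumed model, growing from 3 to 5 nodes raises the number of tolerated unreachable nodes from 1 to 2, while an even-sized cluster of 4 tolerates no more than a cluster of 3.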

    We value your partnership and your patience. If you have immediate concerns or require further support, please contact our support team directly.

  • Resolved

    Services are restored. A root cause analysis (RCA) will be shared soon.

  • Update

    Service is partially restored. All Rainbow Core features are available except call logs.

  • Update

    Service restoration continues across remaining affected accounts. Our teams are actively finalizing recovery for the remaining accounts and continue monitoring platform stability closely.

  • Monitoring

    Core services have been restored and are now under enhanced monitoring. Service restoration is being progressively completed across all user accounts. Our teams remain fully engaged to ensure complete recovery and platform stability.

  • Update

    The root cause has been identified, and recovery actions are currently in progress. Service restoration is underway, and systems are being progressively brought back online.

  • Update

    We have identified a peak traffic problem and are currently rebalancing the traffic.

  • Update
    We are continuing to work on a fix for this incident.
  • Identified
    We are continuing to work on a fix for this incident.
  • Investigating
    We are currently investigating this incident.