Rainbow - Intermittent telephony issues – Incident details

Experiencing partially degraded performance

Intermittent telephony issues

Resolved
Degraded performance
Started 4 months agoLasted 25 minutes
Updates
  • Postmortem
    Postmortem

    Root Cause Analysis – Service Degradation on February 9th

    Incident Window : 07:20–08:55 CET (end users impact between 08:30-08:55 CET)

    Services Impacted : Outbound and inbound call establishment for a subset of customers in France and Germany

    Current Status : Fully resolved

    Summary of the Incident

    On February 9th, a service degradation affected voice call processing on our platform. Some customers experienced failed call attempts for approximately 30 minutes. The issue has since been fully resolved, and the platform is operating normally.

    Customer Impact

    • Some customers experienced call setup failures for approximately 30 minutes.

    • No impact on other services or call legs already established.

    • No further issues have been detected since restoration.

    What Happened (Timeline)

    07:20 CET – Automatic Failover

    • One of our proxy servers became unavailable.

    • The platform automatically switched over to the redundant proxy as expected.

    • At this stage, only a few calls (4–5) in ringing phase were lost.

    • Services were nominal after the switchover.

    08:30 CET – Crash on the Redundant Proxy

    • The newly active proxy experienced a crash.

    • Service degradation became visible to some customers (call failures).

    • During restart, our team detected data inconsistencies in the proxy environment. This inconsistency prevented the automated switchover to the secondary instance.

    • Checks and clean-up of database were required before safely restarting the service to prevent further data corruption and repeat failure.

    08:55 CET – Full Service Restoration

    • The proxy was restarted once data integrity was secured.

    • All services stabilized and no further issues were observed.

    Root Cause

    Preliminary analysis indicates that data corruption occurred during the initial switchover between proxies.

    This corruption prevented a clean restart of the redundant proxy and ultimately triggered the crash.

    The underlying cause appears linked to anomalies in the proxy cluster’s data synchronization process during the failover.

    Corrective Actions Taken

    • Ensured the integrity of all proxy data before restart

    • Restored the proxy and validated call flow end‑to‑end

    • Performed full health checks across the platform

    Preventive Actions (In Progress)

    We are implementing the following measures:

    1. Hardening of failover procedures to avoid data inconsistencies during switchover

    2. Improve continuous monitoring of proxy cluster data consistency to detect abnormal state earlier and isolate faulty instances

  • Resolved
    Resolved
    This incident has been resolved.
  • Identified
    Identified
    We are continuing to work on a fix for this incident.
  • Investigating
    Investigating
    We are currently investigating this incident.