March 20, 2017 - Summary of the Flowroute Voice Service Impairment in the Nevada Region

This communication is intended to give additional insight into the service impairment that occurred in our Nevada region on Monday, March 20, 2017, from 7:05 AM through 9:01 AM PDT. Flowroute’s expanded logging and continuous SIP tracing did not meet certain fault tolerance requirements; specifically, the remote system logging functionality (rsyslog) was not tolerant of network connectivity faults, so an extended fault could lead to a local service impairment.

A network interruption caused by a bad Ethernet port prevented the log forwarding service (rsyslog), which was not redundantly connected, from sending logs to the log capture server. The rsyslog local log buffering was also misconfigured on our Nevada customer SIP edge gateway, so SIP message processing locked when the local log buffer was exhausted. At 7:26 AM PDT, the network port recovered and the logging service drained the buffer. The resulting flood of stale SIP messages loaded multiple SIP routing elements above capacity and caused a legacy authentication and authorization subservice to overload and crash. This additional service crash created a cascading failure for the entire Nevada ingress and egress routing path, as well as for inbound call processing to SIP-registered accounts registered only through the Nevada location.
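
As an illustration of the failure mode described above, the following is a minimal Python sketch of a bounded log buffer that discards its oldest entries instead of blocking the caller, and that drains in limited batches rather than all at once. It is not Flowroute's implementation or an rsyslog configuration; the class name `NonBlockingLogBuffer`, its methods, and the sample messages are hypothetical and exist only to show the idea.

```python
from collections import deque


class NonBlockingLogBuffer:
    """Bounded log buffer that never blocks the caller (hypothetical helper)."""

    def __init__(self, capacity):
        # A deque with maxlen silently evicts the oldest entry when full,
        # so appending never waits on the log forwarder.
        self._entries = deque(maxlen=capacity)
        self.dropped = 0

    def append(self, message):
        # Count the eviction that the deque is about to perform.
        if len(self._entries) == self._entries.maxlen:
            self.dropped += 1
        self._entries.append(message)

    def drain(self, max_batch):
        """Remove and return at most max_batch of the oldest entries."""
        batch = []
        while self._entries and len(batch) < max_batch:
            batch.append(self._entries.popleft())
        return batch


# Simulate a forwarding outage: five messages arrive, only three fit.
buf = NonBlockingLogBuffer(capacity=3)
for i in range(5):
    buf.append(f"sip-log-{i}")

# After recovery, forward the backlog in small batches instead of a flood.
first_batch = buf.drain(max_batch=2)
```

In this sketch the message-processing path stays responsive during the outage at the cost of losing the oldest log entries, and the capped batch size keeps the backlog from reaching downstream consumers all at once.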

Customers configured exclusively for the Nevada site experienced ingress and egress voice routing impairment during the 7:20-8:46 AM PDT window. Customers configured for redundant connectivity to Flowroute experienced degraded service but not an outage, because our redundant routing platform handled these calls.

We recognize that a portion of our customers are not set up in a fault-tolerant configuration and are therefore not able to leverage the redundancy built into our service. We will send subsequent communications to assist with configuring cross-region redundancy.

We’re making a number of changes as a result of this service impairment and our investigation into the root causes. We’ve identified two areas where we can improve reliability: an Authentication, Authorization, and Accounting (AAA) subservice, and log forwarding. We have already made updates to our log forwarding configurations to tolerate network faults. A platform revision that improves reliability in the AAA subsystem for the Ingress Routing SIP Proxy is in integration testing and the initial production deployment is planned by March 31, 2017.

Finally, we want to apologize for the impact this impairment may have had on our affected customers. While we’re proud of our track record of reliability, we know how important these services are to our customers’ applications, end users, and businesses. Analysis of this event will continue for some time, as we are committed to learning and improving in every way possible.

 
