DevBlog

About

The DevBlog documents all developments and promotions that relate to the TM4B Bulk SMS Gateway and its associated services 70589.INFO & T4ME.


27/06/2007 About Today's Downtime

Summary:

An explanation of today's downtime.

Category:

Down Time

Link:

www.tm4bhelp.com/dev/p-100.php

Details:

Over the past few days, a handful of customers have experienced a lack of responsiveness from our servers and, although we tried frantically to identify the root cause, we were unable to. Then today, our services were unavailable for almost 2 hours.

Following last month's downtime, it is very embarrassing to experience further downtime within such a short period, and an explanation is necessary.

To begin with, following last month's unexpected downtime, we took many steps to increase the reliability of our services. One of these steps was relocating our servers to what is considered to be one of the best data-centres in the world. However, as it happens, even the best have unexpected problems, as described in the following 3 messages from the Chief Network Administrator:

The fiber provider for one of our inter-datacenter Cedar Falls to Chicago links has just informed us that they have scheduled a maintenance for the morning of Wednesday, June 27th. This will result in routing reconvergence as traffic is routed around the downed fiber. Customers may see sub-optimal path selection and increased ping times during a single 15 minute window in the maintenance period.

As many of our customers have noticed, our upstream's scheduled maintenance did not go as planned. Due to issues that are still under investigation at this hour, the outage period was longer than was expected, and the window was open longer than scheduled. We are still inquiring with the provider as to why they did not enter a predictable fail state that would have allowed traffic to fail over to another connection at this time, instead of accepting our traffic and then failing to deliver it. Additional updates regarding this issue should be forthcoming later today. We thank everyone for their continuing patience on this matter.

After extensive analysis this morning, it appears that multiple issues, instead of a single one related to this maintenance, resulted in prolonged degraded service for customers in the Cedar Falls datacenter. As mentioned previously, a monitoring / configuration error on the part of the provider performing the maintenance today resulted in a longer than expected maintenance window this morning. This resulted in degraded speeds and latency for some customers in the Cedar Falls datacenter, as traffic had to take the scenic route via backup links to its destination. In addition to this, further investigation this morning revealed an entirely separate issue with another of our upstream providers accepting traffic for a subset of remote destinations, and then dropping that traffic, resulting in complete unreachability for some customers on a small selection of remote networks. Since this was only a small subset of the traffic the upstream was handling for our Chicago and Cedar Falls datacenters, it went unnoticed until customers brought it to our attention. Once the issue was pinpointed, the offending upstream was removed from our routing mix until they reported the issue was solved (which has occurred at this time). As of this morning, the network core and distribution portions are operating normally - if you are still experiencing an issue, please contact our support department immediately so our NOC staff can look further into the issue.

We can only apologise for this loss in reliability and unexpected downtime. We chose to locate our servers in the Cedar Falls data-centre because it is one of the best, if not the best, in the world, and we will continue to maintain our servers there, treating this as a one-off occurrence.

We also ask customers to let us know if they experience any problems.