Autodetecting Network Failures and Self-Healing To Ensure Optimal Availability

It’s midnight on a Saturday, and you’re finally getting to that nice REM cycle. You’re on call, but it’s been a relatively quiet week. Besides, you feel good: you have done everything necessary to ensure that the web properties you are responsible for are operational.

You have deployed redundant data centers on opposite ends of the continent. You have a top-tier service provider you are paying thousands of dollars a year to manage your zones.

All is good in your world. If anything goes wrong, no sweat, flip the switch, and the backup kicks in.

Bam! Your main data center is under load. Your partner slaps you in the head to wake you up: “Your phone is going nuts, please shut that crap off.”

Shoot, what is going on? Crap, we’re under attack. Attackers are flooding your DC with bogus traffic, and it is having a material impact on your availability. Now it’s on: services depending on those properties are starting to go down… crap, not the payment gateway…

What do I modify? Grr, where is the IP for that network? What page do I go to? Man, how long does this networking take? You call the service provider.

“Sorry, we can’t do it for you, need to update your zone to your failover location.”

Gah!!!

Twelve hours later, it’s now noon… but we’re back up… oh my goodness… Now let’s prepare the after-action report (AAR)…

This is a fictionalized story, based on real events experienced by an administrator on call who was responsible for the web assets for a large financial institution.

Automating The Detection and Remediation of Availability Incidents

The weakest part of monitoring your web assets is often the manual work. However much you try to automate, some components typically still rely on manual intervention.

This is not always because automation is technically impossible. Often the platforms we rely on have not exposed features that would simplify real-world problems, or those features have been overlooked as non-essential.

Automating the detection of availability incidents, and self-healing from them, is a great example of this.

NOC.org works to modernize the approach by integrating technologies. Using Authoritative DNS and the NOC.org smart routing features, a user can leverage enhanced records to identify and self-heal from outages like the one described above.

How NOC.org Would Respond to an Availability Incident

In the following illustrations I’ll show you what would have happened in the scenario above:

1 – Normal traffic flow to your web server:

Simplified illustration of Web Traffic hitting a Web Server

2 – NOC.org detects issue with Primary, redirects traffic to Failover within minutes:

NOC.org Detects Issues, Reroutes all traffic

3 – NOC.org detects that the Primary has recovered and routes traffic back to it:

NOC.org Automatically Recovers When Outage Mitigated

To do this, NOC.org merges different technologies to a) detect issues and b) automatically respond and recover on behalf of the organization, all through the use of Authoritative DNS and smart routing features.

Leveraging Smart Routing Features

To deploy a solution like the one presented above, a user would leverage the NOC smart routing options (guide on how to configure).

Once the two records (one for the primary location and one for the failover) are present, the enhanced options are automatically enabled and the user can choose which one is the primary and which is the backup.

Simply choose which type of monitor to leverage, and the system will do the heavy lifting. It will check the availability of the asset, depending on the type of monitoring selected, every few minutes. If at any time it detects an outage, it will automatically remove the affected resource from the rotation and route the traffic to the backup.

This gives an administrator the time to address the problem in a low-stress environment. Once the issue is resolved, simply bring it back online and the system will automatically self-heal, identifying the recovered asset and routing traffic accordingly.
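The failover and self-healing behavior described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the NOC.org implementation or API: the `Endpoint` class, the `select_active` function, the simulated `health` dictionary, and the documentation-range IP addresses are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Endpoint:
    name: str      # hypothetical label: "primary" or "backup"
    address: str   # IP address the DNS record would point to


def select_active(primary: Endpoint, backup: Endpoint,
                  is_healthy: Callable[[Endpoint], bool]) -> Optional[Endpoint]:
    """Pick the endpoint that should receive traffic for one check cycle."""
    if is_healthy(primary):
        return primary   # normal flow, and also the self-healing path
    if is_healthy(backup):
        return backup    # primary is out: fail over to the backup
    return None          # both are down: nothing healthy to route to


# Simulated health state standing in for a real monitor (assumption).
health = {"primary": True, "backup": True}


def check(endpoint: Endpoint) -> bool:
    return health[endpoint.name]


primary = Endpoint("primary", "192.0.2.10")     # documentation-range IPs
backup = Endpoint("backup", "198.51.100.20")

timeline = []
for primary_up in (True, False, False, True):   # one entry per check cycle
    health["primary"] = primary_up
    timeline.append(select_active(primary, backup, check).name)

print(timeline)  # ['primary', 'backup', 'backup', 'primary']
```

Because the primary is always preferred whenever it passes its check, recovery needs no separate code path: the same rule that fails traffic over also moves it back once the primary is healthy again.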

Binding Monitors with Authoritative DNS Services

Authoritative DNS services are a critical part of how the web works. They hold all the information associated with a domain in the form of records. These records are stored in a container known as a zone.
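As a rough mental model, a zone can be pictured as a list of records that is searched by name and type. The sketch below is a simplification for illustration only: the `Record` and `Zone` classes, the `example.com.` names, and the addresses are hypothetical, and real authoritative servers handle far more (delegation, wildcards, many more record types).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Record:
    name: str    # fully qualified name, e.g. "www.example.com."
    rtype: str   # record type, e.g. "A" or "MX"
    ttl: int     # seconds a resolver may cache the answer
    value: str   # record data, e.g. an IP address


@dataclass
class Zone:
    origin: str
    records: List[Record] = field(default_factory=list)

    def lookup(self, name: str, rtype: str) -> List[Record]:
        """Return every record matching the queried name and type."""
        return [r for r in self.records
                if r.name == name and r.rtype == rtype]


zone = Zone("example.com.")
zone.records.append(Record("www.example.com.", "A", 300, "192.0.2.10"))
zone.records.append(Record("www.example.com.", "A", 300, "198.51.100.20"))
zone.records.append(Record("example.com.", "MX", 3600, "10 mail.example.com."))

answers = zone.lookup("www.example.com.", "A")
print([r.value for r in answers])  # ['192.0.2.10', '198.51.100.20']
```

Two `A` records under the same name are what make failover possible at the DNS layer: a monitor only has to decide which of them is returned at any given moment.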

Ensuring Business Continuity

Things go down. That is a hard lesson we learned running massive infrastructures for years. You can do everything in your power to ensure the service is never disrupted, but Murphy often has other plans. Whether it’s a partner disruption or something as innocuous as an oversight during a PR, it happens.

Leveraging an independent Authoritative DNS service can add considerable peace of mind to an organization that depends heavily on its online presence. These services are a critical part of how the web works. They contain all the information associated with a domain, so let’s make better use of them.

NOC.org is here to help provide that.