Let’s continue our discussion of machine learning and AI, this time focusing on chronic issues. The key is the automated detection and mitigation of chronic issues. As we have discussed in many of the blogs in this series, anomalies are unusual behaviors, while chronic situations occur all the time. One customer put it best: “I deal with more of the same than different every day; help me there.” Chronics are not noise. The example I give is a scenario where a managed router at a WAN site goes down every night. The identified root cause is that the router is plugged into a power receptacle controlled by a light switch. After the janitorial staff finishes, they flip the switch and DOWN IT GOES. The customer does not care, because they turn the light back on in the morning and never notice the outage. Operations has no way to control this problem, but they still need to track it. The RCA worked and the outage is service impacting, but the customer does not care. Do you leverage a business-hours suppression engine? No, because if someone is working late and the router goes down, you have lost the customer. As you can see, chronics are common and frustrating for operations. Too many times they waste effort and breed complacency. Giving humans within operations the power to ignore an outage is always a bad idea.
What is an Outage?
The correct solution is to look for the typical behavior of the outage. If the outage follows that pattern, suppress it; otherwise, treat it as a normal outage. Machine learning can detect the scenario in an automated fashion. Now, who compares the current pattern to the learned behavior model? Artificial intelligence. The chronic detector fires off a message that suppresses the outage during the learned window. The anomaly detector can override this. That covers the chronic condition, while any exit from the model reverts the chronic suppression. Together, these let humans in operations focus on what they do best: ACT. That replaces what can be difficult: remembering and tracking.
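To make the idea concrete, here is a minimal Python sketch of a chronic detector for the nightly router example. It is an illustration under stated assumptions, not how any particular product works: the function names, the hour-of-day window, and the `min_ratio` and `max_duration` thresholds are all hypothetical, and a simple frequency count stands in for a real learned behavior model.

```python
from collections import Counter
from datetime import datetime, timedelta


def learn_chronic_hours(outage_starts, days_observed, min_ratio=0.8):
    """Learn the hours of day in which an outage starts on most observed days.

    Hypothetical stand-in for a learned behavior model: an hour counts as
    chronic when an outage began in it on at least min_ratio of the days.
    """
    days_per_hour = Counter()
    seen = set()
    for ts in outage_starts:
        key = (ts.date(), ts.hour)
        if key not in seen:  # count each (day, hour) pair only once
            seen.add(key)
            days_per_hour[ts.hour] += 1
    return {h for h, n in days_per_hour.items() if n / days_observed >= min_ratio}


def classify_outage(start, duration, chronic_hours, max_duration):
    """Suppress only when the outage fits the learned pattern.

    An outage outside the learned window, or one that lasts longer than the
    learned maximum, exits the model and is treated as a normal outage.
    """
    if start.hour in chronic_hours and duration <= max_duration:
        return "suppress"
    return "alert"


# Ten nights of the janitor flipping the switch shortly after 22:00,
# plus one unrelated afternoon outage.
first_night = datetime(2024, 1, 1, 22, 5)
history = [first_night + timedelta(days=i, minutes=i) for i in range(10)]
history.append(datetime(2024, 1, 3, 14, 0))
chronic_hours = learn_chronic_hours(history, days_observed=10)
```

With this history, only the 22:00 hour is learned as chronic. A new outage that starts in that hour and clears by morning is suppressed, while an afternoon outage, or a nightly one that is still down well past the usual recovery time, falls through to normal handling.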
Example – Firmware Bugs
We have discussed a customer-driven behavior model, but what about a technology-driven one? One of my customers is doing heavy amounts of work in SDN/NFV. They had a VNF with vendor “firmware” that had a nasty reboot bug. The trouble? Well, the reboot completed in <1 second. It recurred every 3 days for every VNF, depending upon its boot cycle. Although their system caught each reboot, the problem was chronic: their network services dropped traffic and sessions entirely every three days. It took weeks to understand the bug, but with chronic detection it becomes a snap. The machine learning model would include the firmware version as an input. Hundreds of VNFs on the same version showing the same reboots would identify the problem. Machine learning with chronic detection would prevent a new ticket from opening every time a reboot occurred. Instead, it would correlate each occurrence to a root cause: bad vendor firmware. Once the root cause is identified, operations can escalate to the vendor while keeping their screens clear of all the random reboots.
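The correlation step can be sketched in a few lines of Python. This is a simplified illustration, not the customer's system: the event shape, the `correlate_reboots` name, and the `min_vnfs` threshold are assumptions, and a plain group-by on firmware version stands in for the machine learning model.

```python
from collections import defaultdict


def correlate_reboots(events, min_vnfs=3):
    """Group reboot events by firmware version (hypothetical event shape).

    events is a list of (vnf_id, firmware_version) tuples. A version becomes
    a suspected root cause once at least min_vnfs distinct VNFs running it
    have rebooted. Events on other versions stay as individual problems.
    """
    vnfs_by_firmware = defaultdict(set)
    for vnf_id, firmware in events:
        vnfs_by_firmware[firmware].add(vnf_id)

    suspects = {fw for fw, vnfs in vnfs_by_firmware.items() if len(vnfs) >= min_vnfs}

    # One root-cause record per suspect firmware instead of one ticket per reboot.
    root_causes = {fw: sorted(vnfs_by_firmware[fw]) for fw in suspects}
    unexplained = [(vnf, fw) for vnf, fw in events if fw not in suspects]
    return root_causes, unexplained


reboots = [
    ("vnf-1", "1.2.0"),
    ("vnf-2", "1.2.0"),
    ("vnf-3", "1.2.0"),
    ("vnf-1", "1.2.0"),  # the same VNF rebooting again three days later
    ("vnf-9", "2.0.1"),
]
root_causes, unexplained = correlate_reboots(reboots)
```

Here the four reboots on version 1.2.0 collapse into a single root-cause record that operations can escalate to the vendor, while the lone reboot on 2.0.1 is left for normal handling.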
With proper chronic detection and mitigation, operations are free to operate as they do best. No longer are their screens cluttered with non-actionable events. No longer do operational learning curves start at 6 months and stretch beyond 18 months. Operations need the freedom to assimilate new technology. Handling change with ease is the direction the business is asking for. How do you do that? By simplifying operations so that they do what they do best: ACT.
About the Author
Serial entrepreneur and operations subject matter expert who likes to help customers and partners achieve solutions that solve critical problems. Experience in traditional telecom, ITIL enterprise, global managed service providers, and datacenter hosting providers. Expertise in optical DWDM, MPLS networks, MEF Ethernet, COTS applications, custom applications, SDDC virtualized, and SDN/NFV virtualized infrastructure. Based in the Dallas, Texas, US area and currently working for one of the companies he founded, Monolith Software.