Monthly Archives: July 2018

Automated Detection and Mitigation of Chronic Issues

Let’s continue our discussion around machine learning and AI, focusing on chronic issues. The key is the automated detection and mitigation of chronic issues. As we have discussed in many blogs of this series, anomalies are unusual behaviors, while chronic situations occur all the time. One customer put it best: “I deal with more of the same than different every day — help me there.” Chronics are not noise. The example I give is the scenario where, every night, a managed router goes down in a WAN site. The identified root cause is a router plugged into a light-switch-controlled power receptacle. After the janitorial staff finishes, they flip the switch and DOWN IT GOES. The customer does not care because they turn on the light in the morning and never notice the outage. Operations have no way to control this problem, but they need to track it. The RCA worked and the outage is service impacting, but the customer does not care. Do you leverage a business-hours suppression engine? No, because if someone is working late and the router goes down, you have lost the customer. As you can see, chronics are common and frustrating for operations. Too often they waste effort and cause complacency. Giving humans within operations the power to ignore an outage is always a bad idea.

What is an Outage?

The correct solution is to learn the typical behavior of the outage. If the outage follows that pattern, suppress it; otherwise treat it as normal. Machine learning can detect the scenario in an automated fashion. Who then compares the current pattern to the learned behavior model? Artificial intelligence. The chronic detector fires off a message that suppresses the outage during the learned window. The anomaly detector can override this suppression, providing exits from the model to revert the chronic suppression when behavior changes. Together, humans in operations can focus on what they do best — ACT — instead of what can be difficult: remembering and tracking.
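As a rough sketch of this suppression logic, consider the following Python example. The window, timestamps, and function names here are hypothetical; a real system would learn the window from historical outage timestamps rather than hard-coding it.

```python
from datetime import datetime, time

# Hypothetical learned model: this router's chronic outage window,
# discovered from historical outage timestamps (e.g. nightly 21:00-22:00
# when the janitorial staff flips the switch).
CHRONIC_WINDOW = (time(21, 0), time(22, 0))

def in_window(ts, window=CHRONIC_WINDOW):
    """True if the timestamp falls inside the learned chronic window."""
    start, end = window
    return start <= ts.time() <= end

def handle_outage(ts, anomaly_override=False):
    """Suppress outages that match the learned chronic pattern.

    The anomaly detector can override suppression, e.g. when the
    outage deviates from the learned behavior model.
    """
    if in_window(ts) and not anomaly_override:
        return "suppress"   # matches the learned chronic behavior
    return "alert"          # treat as a normal, actionable outage

# The nightly outage is suppressed...
print(handle_outage(datetime(2018, 7, 9, 21, 15)))   # suppress
# ...but a daytime outage, or an anomaly override, still alerts.
print(handle_outage(datetime(2018, 7, 9, 14, 30)))   # alert
```

Note that unlike a blanket business-hours filter, the suppression only applies inside the learned window, so a late-night outage outside the pattern still alerts.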
 

Example – Firmware Bugs

We have discussed a customer-driven behavior model, but what about a technology-driven one? One of my customers is doing heavy amounts of work in SDN/NFV. They have a VNF with vendor “firmware” that had a nasty reboot bug. The trouble? The reboot completed in under one second, and it recurred every three days for every VNF, depending on its boot cycle. While their system caught it, the problem was chronic: their network services dropped traffic and sessions entirely every three days. It took weeks to understand the bug, but with chronic detection it becomes a snap. Machine learning would include the firmware version in the model. Hundreds of VNFs on the same version would identify the problem. Machine learning with chronic detection would prevent a new ticket opening every time it occurred. Instead it would correlate to a root cause — bad vendor firmware. Once identified, operations can escalate to the vendor but keep their screens clear of all the random reboots.
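A minimal sketch of that correlation step might look like the following. The event shapes, VNF identifiers, firmware versions, and threshold are all hypothetical; the point is grouping chronic reboots by a shared attribute so hundreds of events collapse into one root cause.

```python
from collections import defaultdict

# Hypothetical enriched reboot events: (vnf_id, firmware_version).
events = [("vnf-%03d" % i, "1.2.3-buggy") for i in range(200)]
events += [("vnf-900", "1.3.0"), ("vnf-901", "1.3.0")]

def correlate_by_firmware(events, min_vnfs=50):
    """Group reboots by firmware version and flag versions where an
    unusually large number of distinct VNFs exhibit the same chronic
    reboot -- one root-cause situation instead of a ticket per reboot."""
    by_version = defaultdict(set)
    for vnf, version in events:
        by_version[version].add(vnf)
    return {v: len(vnfs) for v, vnfs in by_version.items()
            if len(vnfs) >= min_vnfs}

print(correlate_by_firmware(events))   # {'1.2.3-buggy': 200}
```

Two VNFs on version 1.3.0 stay below the threshold and are left alone, while 200 VNFs on the buggy version become a single escalatable situation.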

Takeaways

With proper chronic detection and mitigation, operations are free to operate as they do best. No longer are their screens cluttered with non-actionable events. No longer do operational learning curves start at 6 months and stretch beyond 18 months. Operations need the freedom to assimilate new technology. Handling change with ease is the direction the business is demanding. How do you do that? By simplifying operations so they do what they do best — ACT.

About the Author

Serial entrepreneur and operations subject matter expert who likes to help customers and partners achieve solutions that solve critical problems. Experience in traditional telecom, ITIL enterprise, global managed service providers, and datacenter hosting providers. Expertise in optical DWDM, MPLS networks, MEF Ethernet, COTS applications, custom applications, SDDC virtualized, and SDN/NFV virtualized infrastructure. Based out of the Dallas, Texas US area and currently working for one of his founded companies – Monolith Software.

Learning Your Operational Performance

In business intelligence reporting, a common area is learning your operational performance. This means tracking operations’ workload and results. While this can be a sticky subject for operations, it’s also a great opportunity to improve. It’s a fact: when overloaded, operations suffer in the quality of their response. So it’s only common sense to track the NOC like you track the network. If operations are overloaded, causing quality issues, they need to be aware so that remediation actions occur. This could include staff augmentation or improved training regimes to drive better results. The trouble becomes how. Many focus on ticketing solutions. ITIL compliance allows management to set specifications for operational performance, but those levels are not real-time. How does it help to know you needed help last Wednesday?

Where Machine Learning Comes Into Play

Again, ML/AI technology helps. Fault managers — most call these “event managers” — can track user and automation interactions with those faults. Machine learning can be applied to this audit trail to create the standard operational model. The result is a discovered model. Say a common fault usually takes 10 actions and 15 minutes to fix during business hours. When the NOC deviates from that baseline — good or bad — the AI can alert the group: either GREAT JOB, here is the new bar, or let’s RALLY, we are getting behind.
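One way to sketch that deviation check, assuming the audit trail yields a history of fix times for a given fault type (the numbers, threshold, and function name below are illustrative, not a real product API):

```python
from statistics import mean, stdev

# Hypothetical discovered model: minutes-to-fix for a common fault type
# during business hours, mined from the event manager's audit trail.
history = [15, 14, 16, 15, 13, 17, 15, 16, 14, 15]

def score_handling(history, current_minutes, sigmas=3.0):
    """Compare the current fix time against the discovered model.

    Returns 'great job' when the NOC beats the bar by a wide margin,
    'rally' when it falls well behind, and 'on model' otherwise.
    """
    mu, sd = mean(history), stdev(history)
    if current_minutes < mu - sigmas * sd:
        return "great job"   # new bar set
    if current_minutes > mu + sigmas * sd:
        return "rally"       # getting behind
    return "on model"
```

A 15-minute fix stays on model, a 25-minute fix triggers the rally message, and a 10-minute fix sets a new bar, given this history.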

Proactive Workload Management

Let’s get into the details. Let’s say that machine learning exposes that during a certain time of day and day of week, 4 level-1 tickets, 5 level-2 tickets, and 15 level-3 tickets are typical. Then the system shows a systemic increase of 2x, 5x, and 10x across those levels. AI agents can see this risk and alert. That alert can show that we have an abnormal number of tickets opened. Operations managers can call in resources. The system can send an advisory email to the ticketing administration asking for a health check. Without ML/AI technology, running reports and interpreting them requires so much time that most organizations will not even try. For those that do, the latency between needing a change and recognizing that need could be weeks.
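The comparison described above can be sketched in a few lines; the baseline counts and the 2x threshold are hypothetical values for illustration:

```python
# Hypothetical learned baseline: typical ticket counts for this
# time of day / day of week, per severity level.
baseline = {"level1": 4, "level2": 5, "level3": 15}

def workload_alerts(baseline, current, threshold=2.0):
    """Flag severity levels whose current ticket count is a systemic
    multiple of the learned baseline (2x or worse)."""
    alerts = {}
    for level, normal in baseline.items():
        ratio = current.get(level, 0) / normal
        if ratio >= threshold:
            alerts[level] = round(ratio, 1)
    return alerts

# A 2x/5x/10x surge across the levels trips the alert.
current = {"level1": 8, "level2": 25, "level3": 150}
print(workload_alerts(baseline, current))
# {'level1': 2.0, 'level2': 5.0, 'level3': 10.0}
```

In practice the alert payload would feed the advisory email and the call for extra resources; the point is that the check runs continuously instead of in a weekly report.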

Positive Impact to Operations

The result of operational performance monitoring should be a smoother-working operations team. Fewer errors and happy customers are what every NOC should try to provide to the organization. Accomplishing this with zero human touch and a latency of less than 15 minutes has been unimaginable functionality up to this point. The difference has been the emergence of ML/AI technologies.
 
Let me know what you think in the comments below. This can be a cringeworthy conversation to have with operations, but I do believe that near real-time operations performance management has value to NOCs today.

Minimizing Faults with Machine Learning and AI

A great topic of conversation is fault reduction. First, let’s not confuse terms. Data reduction is discarding data deemed not actionable. Fault reduction is prioritizing the fault stream, enabling the most impactful and actionable faults to bubble up to the top. You should never ignore faults; they are pointing to a problem that may cause an outage in the future. Fault reduction is a challenging field of service assurance. There are so many ways to cheat the system — as in simply deleting non-actionable faults. But let’s get serious. Where we should focus is on the best practice: identification and prioritization of faults that enables filtering. This blog describes minimizing faults with Machine Learning and AI. You can be the judge of the methodology.

Understanding Fault Noise

Imagine, if you will, a universal “noise” level for operations. Currently there are tons of outages, so operations only work on outages — they want no noise. Outages are usually straightforward and actionable. You may want to use maintenance window filters, then verify that the services affected are in production. Many filters are straightforward. The trouble is moving beyond the outages, or dealing with outages that are not actionable. Let’s talk about the first — problems and issues. Problems impair a service but do not take it down; say, a loss of redundancy. Usually you need two problems before the situation becomes an outage. Issues are things like misconfigurations that complicate things and can cause problems. The trouble is that a mature, legacy network has tens to hundreds of outages, with an exponentially larger number of problems, and issues exponential to problems. You are talking information overload. How do you rank them? That is where ML/AI is being leveraged. The secret ingredient is statistical rarity. If the problem or issue is new and unusual, there is a greater chance of a quick fix. The less rare it is, the more likely it is not actionable. But let’s test my hypothesis…
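The rarity hypothesis can be sketched as a simple ranking over fault history. The device names, fault types, and counts below are made up; a real system would mine them from the fault stream.

```python
from collections import Counter

# Hypothetical 30-day fault history keyed by (device, fault_type).
history = Counter({
    ("sw-01", "LinkDown"): 240,     # flaps constantly -- likely chronic
    ("sw-02", "LinkDown"): 1,       # almost never happens
    ("rtr-07", "BGPPeerLost"): 3,
})

def rank_by_rarity(history, current_faults):
    """Rank active faults so the statistically rarest bubble to the top.

    Hypothesis: the rarer a fault is for that device, the more likely
    it is new, unusual, and quickly actionable.
    """
    return sorted(current_faults, key=lambda f: history.get(f, 0))

active = [("sw-01", "LinkDown"), ("sw-02", "LinkDown"),
          ("rtr-07", "BGPPeerLost")]
print(rank_by_rarity(history, active))
# sw-02 first (rarest); the chronic sw-01 flapper sinks to the bottom
```

Nothing is deleted here; the chronic flapper is simply de-prioritized rather than discarded, which is the distinction between fault reduction and data reduction drawn above.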

The Rogue Device Example

For example, a rogue device. Let’s say someone adds a new device to the network without following best practices. You are receiving traps from the device but nothing else — it is a rogue device. When the new device first alarms, it creates an anomaly. This kicks off automation that validates the configuration and opens a ticket for manual intervention upon failure. The net result is zero human intervention, and best practice is followed: no quasi-monitored production devices exist in the network.
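That first-alarm workflow is easy to sketch. Everything here is hypothetical: the inventory set stands in for a real CMDB, and the validation function is a placeholder for real checks (SNMP credentials, syslog, polling configuration, and so on).

```python
# Hypothetical inventory of fully on-boarded, best-practice devices.
inventory = {"rtr-01", "rtr-02", "sw-01"}

def validate_config(device):
    """Stand-in for the automated configuration check; here a device is
    valid only if it exists in inventory. Real checks would probe SNMP,
    syslog, polling, and monitoring configuration."""
    return device in inventory

def on_first_alarm(device):
    """Anomaly path for a device alarming for the first time:
    run validation automation, ticket only on failure."""
    if validate_config(device):
        return "monitored"            # nothing for a human to do
    return "ticket: rogue device"     # manual intervention required

print(on_first_alarm("rtr-01"))    # monitored
print(on_first_alarm("rogue-99"))  # ticket: rogue device
```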

Dueling Interfaces

Another example is interface monitoring. Let’s say two interfaces on a switch are down. One goes down all the time; the other rarely does. Which do you think is more actionable? With ML/AI technology, you can create a model based upon device/interface occurrence. If the current situation breaks that model, you can enrich the rarer alarm. This way operations can focus their time, when time is a constraint, on the more actionable fault. The result is addressing what is easy first, then working on what is harder later. With prioritization, operations can increase their efficiency. This also maximizes the value to the organization as a whole.

Take Aways

Reduction of the fault stream is something everyone wants to do. We must remember there are good and bad ways to achieve it. A good way is to rank your fault stream using rarity. ML/AI technology can help leverage rarity, which increases operational efficiency. This is yet another advantage of leveraging event analytics for real-time operations.

An Umbrella for Fault Storm Management

Let’s continue our conversations around ML/AI in service assurance. I want to explore an illustrated real-life use case. The first example focuses on fault storm management. When bad things happen, they may create an explosion of faults, and each fault may be a separate situation. This operational overload is best described by a customer of mine — “a sea of red”.

Impact of Fault Storms on Operations

When fault storms occur, they cause many operational problems. First, they cause blindness: pre-existing problems and follow-on problems get mixed in, and suddenly you have a mess. It may take hours to sort out the responsible alarms with manual correlation. Next, they cause automation gridlock. Most service-impacting alarms are set to immediately generate tickets; if 1,000 alarms try to open tickets at the same time, you may break your ticketing solution. Last, they cause mistakes. Because humans must sort out the problem, errors are common. Operations can ignore a separate problem by assuming it’s part of another root cause. Fault storms, while rare, are dangerous for operations in assuring customer services.

Addressing Fault Storms with Machine Learning and AI

Fault storms are a great use case for ML/AI technology. Machine learning sets the bar for a “storm”. Artificial intelligence can create the situation by encapsulating all the service-impacting faults. This isolation/segmentation would mitigate the “sea of red”. When storms occur, the solution mitigates the blindness: the storm situation is isolated from pre-existing faults and all follow-on problems. Automation would only run on the situation created by ML/AI, which avoids the overload scenario. Fault storms are rare, but they can devastate NOC performance. ML/AI technologies are a great choice to mitigate them.

Mitigating the Effects of Fault Storms

The best way to illustrate how this technology works is by showing a solution to a problem. For example, a site outage. When you have a power outage at a remote site, it’s devastating. All services depending upon that infrastructure are no longer available. There are hundreds of service-impacting alarms. The final result is a complete mess for operations to clean up. ML/AI can address the fault storm caused by the site isolation. All the alarms would share the same location field key, and thus have a commonality. The count of alarms from that location is tracked, and machine learning can build a model based upon those records. The rush of faults breaks that model, and the result is an anomaly centered upon that specific location. The anomaly encapsulates the situation — all the service-impacting alarms. With a processed alarm list, the “sea of red” becomes “clear as a bell”. Operations can assign the single site isolation to an individual. Then, after validation, the user can perform the action — dispatch. Instead of confusion and panic, operations continues to move forward like any other day. Business as usual, no stress, should be the goal.
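The site-storm detection just described can be sketched as follows. The site names, alarm shapes, per-site model, and the simple multiplier threshold are all hypothetical; a learned model would replace the hard-coded counts.

```python
from collections import Counter

def detect_storm(alarms, model, factor=5):
    """Group alarms by location key; a location whose alarm count breaks
    the learned per-site model becomes one anomaly 'situation' that
    encapsulates all of its service-impacting alarms."""
    counts = Counter(a["location"] for a in alarms)
    situations = {}
    for site, n in counts.items():
        normal = model.get(site, 1)    # learned typical alarm count
        if n > normal * factor:        # storm: far beyond the model
            situations[site] = [a for a in alarms
                                if a["location"] == site]
    return situations

# Hypothetical learned model: each site normally emits ~2 alarms
# per interval; then a power outage isolates one site.
model = {"dallas-dc1": 2, "austin-dc2": 2}
alarms = [{"location": "dallas-dc1", "alarm": "LinkDown"}
          for _ in range(120)]
alarms += [{"location": "austin-dc2", "alarm": "LinkDown"}]

storms = detect_storm(alarms, model)
print(list(storms), len(storms["dallas-dc1"]))   # ['dallas-dc1'] 120
```

Ticket automation would then run once against the single `dallas-dc1` situation instead of 120 times against individual alarms, which is how the gridlock scenario is avoided.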

Take Aways on Fault Storms

Fault storms can break people’s day. They invite failure by operations. A grand stack of hundreds of outages in the spotlight will be overwhelming. Operations has the opportunity — will they die or will they shine? Leveraging ML/AI technology can keep them on the rails. Then success will be the standard operating procedure.

Challenges Addressed by Machine Learning with AI

Now let us discuss the challenges addressed by machine learning with AI. As we learned in the previous blog, machine learning is excellent at model building. These operational models enable leveraging historical data. Artificial intelligence is great at doing comparisons to produce results. Identifying anomalies and chronics with ease lowers effort requirements. So how does this help us? Lost on many is that operations have discrete problems, and technology is only valuable if applied to problem areas. Here are some of the commonly discussed areas I have seen in the marketplace — names withheld.

Normalizing Faults

One common problem is the normalization of fault data. For example, SNMP traps are a very common fault format and protocol. The binary format consists of an enterprise OID and an integer indicating a trap number. This requires human beings to create database lookups (using MIBs) to provide descriptive detail for the fault. That discounts operational configurations like up/down correlation or aging settings. Learning these configurations is a possible area for machine learning technology. AI can compare against commonly worded, completely configured trap types and guess what new ones should be. Human beings can right-click and update where applicable. The result would be a build-as-you-go rules engine curated on the fly. Many Managed Service Providers (MSPs) find this interesting, and any organization with a diverse and changing data set would find it valuable.
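A toy version of the "guess from commonly worded trap types" idea can be built with simple string similarity. The trap names and normalizations below are hypothetical, and a real system would use richer features than name similarity; this just illustrates the suggest-then-curate loop.

```python
import difflib

# Hypothetical curated rules: trap names already normalized by humans.
configured = {
    "linkDown": "Interface down",
    "linkUp": "Interface up",
    "bgpBackwardTransition": "BGP peer state regressed",
}

def guess_normalization(trap_name, configured, cutoff=0.6):
    """Suggest a normalization for an unconfigured trap by comparing
    its wording to already-configured trap types; a human can then
    right-click and accept, building the rules engine as they go."""
    match = difflib.get_close_matches(trap_name, configured,
                                      n=1, cutoff=cutoff)
    return configured[match[0]] if match else None

# A vendor-specific variant inherits the closest configured meaning...
print(guess_normalization("linkDownNotify", configured))  # Interface down
# ...while an unrelated trap yields no guess and stays for a human.
print(guess_normalization("powerSupplyFailure", configured))  # None
```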

Correlating Faults

Correlation is another concern. Operations need efficient ways to identify, isolate, and act on situations; they do not have time to discern associations. As discussed, machine learning can identify a model of what is normal. The same model that enables the anomaly provides a set of context, which allows for reverse correlation. This allows anomalies to drive encapsulation of the situation.
 
Now looking forward, machine learning with AI can identify if this has happened before — chronic detection. This opens the door to closing out chronic situations. Imagine a burst outage: operations sees a failed service that clears — down, then back up. Most people end up ignoring these errors because a final resolution is impossible to achieve, and unfortunately it’s uncommon to track them. If they re-occur, tools need to identify that they are not “blips” — they are chronic. Long-term, the goal is to forecast them so that preparations can occur to capture key data. The goal will be to engage to get final resolution of the chronic and prevent its repetition. With machine learning with AI, chronic detection and mitigation become possible — which we will discuss in more detail later.
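A minimal sketch of detecting and forecasting such recurring blips, assuming the only input is a list of timestamps (in hours here) when the service bounced; the tolerance and data are illustrative:

```python
# Hypothetical blip history: epoch hours when a service bounced
# (down, then immediately back up).
blips = [0, 72, 144, 216]   # roughly every 3 days (72 h)

def chronic_period(blips, tolerance=2):
    """If the gaps between blips are consistent within a tolerance,
    flag the stream as chronic and return the learned period, which
    also forecasts the next occurrence for data capture."""
    gaps = [b - a for a, b in zip(blips, blips[1:])]
    if len(gaps) >= 2 and max(gaps) - min(gaps) <= tolerance:
        period = sum(gaps) // len(gaps)
        return {"chronic": True, "period_h": period,
                "next": blips[-1] + period}
    return {"chronic": False}

print(chronic_period(blips))
# {'chronic': True, 'period_h': 72, 'next': 288}
```

The forecast (`next`) is what lets operations prepare to capture key data before the blip recurs, instead of always arriving after the fact.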
 

Prioritizing Faults

Another problem for operations is data overload. While a problem may be a root cause and an individualized “situation”, operations may not CARE. If a customer takes down their own service, operations must make the logical choice to IGNORE the situation. Leaving this up to humans introduces human error. With machine learning, you can understand that this problem is common and should be identified as a chronic situation. This enrichment allows operations to re-prioritize new situations over chronic situations, giving a more accurate report of what is going on, with an operational priority assigned.
 
Operations also have a problem with reporting. Post-mortem analysis can encumber operations that are trying to learn from failures. In a matter of minutes, machine learning with AI technology can scan years of raw data to find a particular pattern. That pattern can segment what effect a situation had on the network. The bottom line is operations can report on the what, why, where, and how using machine learning with AI technologies.
