An Umbrella for Fault Storm Management

Let’s continue our conversation around ML/AI in Service Assurance. I want to explore an illustrated real-life use case. The first example focuses on fault storm management. When bad things happen, they may create an explosion of faults. Each fault may look like a separate situation. This operational overload is best described by a customer of mine: “a sea of red”.

Impact of Fault Storms on Operations

When fault storms occur, they cause many operational problems. First, they cause blindness. Pre-existing problems and follow-on problems get mixed in with the storm. Suddenly you have a mess. It may take hours to sort out the responsible alarms with manual correlation. Next, they cause automation gridlock. Most service impacting alarms are set to immediately generate tickets. If 1,000 alarms try to open tickets at the same time, you may break your ticketing solution. Last, they cause mistakes. Because humans are sorting out the problem, errors are common. Operations can ignore a separate problem by assuming it’s part of another root cause. Fault storms, while rare, are dangerous for operations in assuring customer services.

Addressing Fault Storms with Machine Learning and AI

Fault storms are a great use case for ML/AI technology. Machine learning sets the bar for what counts as a “storm”. Artificial intelligence can create the situation by encapsulating all the service impacting faults. This isolation, or segmentation, would mitigate the “sea of red”. When storms occur, the solution mitigates the blindness. The storm situation is isolated from pre-existing faults and all follow-on problems. Automation would only run on the situation created by ML/AI. This avoids the overload scenario. Fault storms are rare, but can devastate NOC performance. ML/AI technologies are a great choice to mitigate them.
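
To make the "automation only runs on the situation" idea concrete, here is a minimal sketch in Python. It assumes alarms are simple dictionaries and that a storm situation has already been created upstream. The names plan_tickets, service_impacting, and the ticket fields are hypothetical and not taken from any particular product.

```python
def plan_tickets(alarms, situations):
    """Return ticket requests: one per storm situation, plus one per
    service impacting alarm that sits outside every situation."""
    # IDs of alarms already encapsulated by a storm situation
    in_storm = {a["id"] for s in situations for a in s["alarms"]}

    # One ticket for each situation, carrying all of its alarm IDs
    tickets = [
        {
            "type": "situation",
            "location": s["location"],
            "alarm_ids": [a["id"] for a in s["alarms"]],
        }
        for s in situations
    ]

    # Alarms outside the storm keep their normal one-ticket-per-alarm path
    tickets += [
        {"type": "alarm", "alarm_ids": [a["id"]]}
        for a in alarms
        if a["id"] not in in_storm and a.get("service_impacting")
    ]
    return tickets


# Illustrative data: 1,000 storm alarms at one site, 2 unrelated alarms elsewhere
storm_alarms = [
    {"id": i, "location": "site-a", "service_impacting": True} for i in range(1000)
]
other_alarms = [
    {"id": 5000 + i, "location": "site-b", "service_impacting": True} for i in range(2)
]
situations = [{"location": "site-a", "alarms": storm_alarms}]

tickets = plan_tickets(storm_alarms + other_alarms, situations)
print(len(tickets))  # 3 tickets instead of 1,002
```

The ticketing solution sees three requests rather than 1,002, and the two pre-existing alarms are still ticketed on their own instead of being swallowed by the storm.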

Mitigating the Effects of Fault Storms

The best way to illustrate how this technology works is by showing a solution to a problem. Take a site outage, for example. When you have a power outage at a remote site, it’s devastating. All services depending upon that infrastructure are no longer available. There are hundreds of service impacting alarms. The final result is a complete mess for operations to clean up. Now ML/AI can address the fault storm caused by the site isolation. All the alarms could share the same location field key, which gives them a commonality. The count of alarms from that location is tracked. Machine learning can build a model based upon those records. The rush of faults breaks that model. The result is an anomaly centered upon that specific location. The anomaly encapsulates the situation: all the service impacting alarms. With a processed alarm list, the “sea of red” becomes “clear as a bell”. Operations can assign the single site isolation to an individual. Then, after validation, the user can perform the action: dispatch. Instead of confusion and panic, operations continues to move forward like any other day. Business as usual with no stress should be the goal.
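
The location-based approach can be sketched roughly as follows. This is a simplified illustration, not a description of any specific product: it stands in a mean-plus-standard-deviation threshold for the machine learning model, and the names build_baseline, detect_storms, k, and min_count, along with the sample data, are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history):
    """history maps location -> list of alarm counts per past time window.
    Returns location -> (mean, standard deviation) of those counts."""
    return {loc: (mean(counts), pstdev(counts)) for loc, counts in history.items()}


def detect_storms(alarms, baseline, k=3.0, min_count=10):
    """Group the current window's alarms by location and return one
    situation per location whose alarm count breaks the baseline."""
    by_location = defaultdict(list)
    for alarm in alarms:
        by_location[alarm["location"]].append(alarm)

    situations = []
    for loc, group in by_location.items():
        mu, sigma = baseline.get(loc, (0.0, 0.0))
        # min_count keeps very quiet sites from flagging on a handful of alarms
        threshold = max(mu + k * sigma, min_count)
        if len(group) > threshold:
            situations.append({"location": loc, "count": len(group), "alarms": group})
    return situations


# Illustrative history: alarms per window for two sites on normal days
history = {"site-a": [2, 3, 1, 2, 4, 2], "site-b": [1, 0, 2, 1, 1, 0]}
baseline = build_baseline(history)

# Current window: site-a loses power and floods the alarm list
current = [{"id": i, "location": "site-a"} for i in range(200)]
current += [{"id": 1000 + i, "location": "site-b"} for i in range(2)]

storms = detect_storms(current, baseline)
print(storms[0]["location"], storms[0]["count"])  # site-a 200
```

The 200 alarms from the isolated site collapse into a single situation that can be assigned to one person, while the two alarms at the other site stay visible as ordinary faults.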

Takeaways on Fault Storms

Fault storms can break people’s day. They invite failure by operations. A stack of hundreds of outages, each demanding the spotlight, will be overwhelming. Operations has an opportunity: will they die or will they shine? Leveraging ML/AI technology can keep them on the rails. Then success will be the standard operating procedure.

About the Author

Serial entrepreneur and operations subject matter expert who likes to help customers and partners achieve solutions that solve critical problems. Experience in traditional telecom, ITIL enterprise, global managed service providers, and datacenter hosting providers. Expertise in optical DWDM, MPLS networks, MEF Ethernet, COTS applications, custom applications, SDDC virtualized, and SDN/NFV virtualized infrastructure. Based in the Dallas, Texas, US area and currently working for one of his founded companies, Monolith Software.

My Journey into Machine Learning with AI

I’m alive! Sorry for the lengthy lapse in updates, but things have been so busy. With the release of some of my recent work, I now have some time to share what I have been working on. As the title suggests, it’s all around machine learning and artificial intelligence (AI). Despite being hot buzzwords, there are tons of success stories using the technology today. To explain what I have learned, I have written a blog series. My hope is to cover my journey in machine learning with artificial intelligence.

Blog Focus: Machine Learning

The first focus is around the technology. As with all new technology, it has new terms and new concepts. Several of these are heretical to the status quo, so it’s important to level set properly. Then we want to discuss where it applies. What problems does it solve? How well does it solve them? How do the new solutions compare to legacy ones?

Solving Problems with Machine Learning

Now that we have a firm introduction, let’s solve some problems. First up is the use case of fault storm management. How are storms detected and mitigated? What are the rewards of using ML/AI when applied?
 
My next favorite use case is around fault stream reduction. Fewer faults mean less effort for operations. Can ML/AI help? How well does it work? How hard is it to use?
 
Operational performance management is a touchy subject, but a worthwhile exercise. Why should you monitor your NOC? How can you help operations without being Orwellian about it?
 
Chronic Detection & Mitigation is a common use case for operations. How does operations iron out the wrinkles of their network? Can operations know when to jump on a chronic problem to fix it for good? Getting to 99.999% is hard without addressing chronic problems.
 

What is the Future of Machine Learning?

As part of this series, we should address the future of this technology. With ML/AI being so popular, where should this technology be applied? How can it help service assurance make an impact?
 
The plan is to wrap up the series with a review of what is currently available in the marketplace. That way we are all aware of what is current versus what is possible.
 
Stay tuned, the plan is to release the blogs weekly. Don’t be afraid to drop comments or questions, as I would love to do an AMA or a blog on the questions.
