Minimizing Faults with Machine Learning and AI

A great topic of conversation is fault reduction. First, let's not confuse terms. Data reduction is discarding data deemed not actionable. Fault reduction is prioritizing the fault stream so that the most impactful and actionable faults bubble up to the top. You should never ignore faults; they point to a problem that may cause an outage in the future. Fault reduction is a challenging field of service assurance. There are so many ways to cheat the system — as in simply deleting non-actionable faults. But let's get serious. Where we should focus is on the best practice: identification and prioritization of faults to enable filtering. This blog describes minimizing faults with Machine Learning and AI. You can be the judge of the methodology.

Understanding Fault Noise

Imagine, if you will, a universal “noise” level for operations. Currently there are tons of outages, so operations only work on outages – they want no noise. Outages are usually straightforward and actionable. You may want to use maintenance window filters, then verify that the services affected are in production. Many filters are straightforward. The trouble is moving beyond the outages, or dealing with outages that are not actionable. Let's talk about the first – problems and issues. Problems impair a service but do not take it down; take a loss of redundancy as an example. Usually you need two problems before the situation becomes an outage. Issues are things like misconfigurations that complicate matters and can cause problems. The trouble is that a mature, legacy network has tens to hundreds of outages, an order of magnitude more problems, and an order of magnitude more issues than problems. You are talking information overload. How do you rank them? Well, that is where ML/AI is being leveraged. The secret ingredient is statistical rarity. If the problem or issue is new and unusual, there is a greater chance of a quick fix. The less rare it is, the more likely it is not actionable. But let's test my hypothesis…

The Rogue Device Example

For example, take a rogue device. Let's say someone adds a new device to the network without following best practices. We receive traps from the device but nothing else — it is a rogue device. When the new device first alarms, it creates an anomaly. This kicks off automation that validates its configuration, and opens a ticket for manual intervention upon failure. The net result is zero human intervention in the common case. This follows best practice; no quasi-monitored production devices exist in the network.
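Here is a minimal sketch of how that automation chain might hang together, in Python. The validate_config and open_ticket helpers are hypothetical stand-ins for whatever config-management and ticketing hooks your stack exposes:

```python
# Sketch: treat the first trap from an unknown source as a rogue-device
# anomaly, then validate configuration and only ticket on failure.
known_devices = {"core-rtr-01", "core-rtr-02"}  # monitored inventory

def validate_config(device: str) -> bool:
    """Hypothetical hook: apply/verify the monitoring best-practice config."""
    return False  # pretend validation failed so the demo opens a ticket

def open_ticket(device: str, reason: str) -> None:
    """Hypothetical hook into the ticketing system."""
    print(f"TICKET: {device}: {reason}")

def on_trap(source: str) -> None:
    if source in known_devices:
        return  # a fully monitored device; business as usual
    print(f"ANOMALY: rogue device {source} detected")  # first alarm = anomaly
    if validate_config(source):
        known_devices.add(source)  # remediated with zero human intervention
    else:
        open_ticket(source, "rogue device failed automated config validation")

on_trap("edge-sw-99")  # unknown source: anomaly, validation, then a ticket
```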

Dueling Interfaces

Another example is interface monitoring. Let's say two interfaces on a switch are down. One happens all the time, the other rarely occurs. Which do you think is more actionable? With ML/AI technology, you can create a model based upon device/interface occurrence. If the current situation breaks that model, you can enrich the rarer alarm. This way operations can focus their time, when that is a constraint, on the more actionable fault. The result would be addressing what is easy first, then working on what is harder later. With prioritization, operations can increase their efficiency. This also maximizes the value to the organization as a whole.
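A minimal sketch of that rarity model, assuming alarm history reduces to a list of (device, interface) down events; the "model" here is nothing more than learned occurrence counts:

```python
from collections import Counter

# Sketch: learn per-(device, interface) occurrence counts from alarm history,
# then enrich live alarms with a rarity score so the rarer one sorts first.
history = [("sw-01", "Gi0/1")] * 500 + [("sw-01", "Gi0/7")] * 2  # flappy vs rare

model = Counter(history)            # the "model": learned frequencies
total = sum(model.values())

def rarity(device: str, interface: str) -> float:
    # Chronic flappers score near 0.0, rarely seen interfaces near 1.0.
    return 1.0 - model[(device, interface)] / total

alarms = [("sw-01", "Gi0/1"), ("sw-01", "Gi0/7")]  # both down right now
for dev, ifc in sorted(alarms, key=lambda a: rarity(*a), reverse=True):
    print(f"{dev}/{ifc} rarity={rarity(dev, ifc):.3f}")
# Gi0/7 (rarely down) bubbles to the top; Gi0/1 (down all the time) sinks.
```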

Takeaways

 
Reduction of the fault stream is something everyone wants to do. We must remember there are good and bad ways to achieve it. A good way is to rank your fault stream using rarity. ML/AI technology can help leverage rarity. This increases operational efficiency. This is yet another advantage of leveraging event analytics for real-time operations.


An Umbrella for Fault Storm Management

Let’s continue our conversations around ML/AI in Service Assurance. I want to explore an illustrated real-life use case. The first example focuses on fault storm management. When bad things happen, they may create an explosion of faults, and each fault may be a separate situation. This operational overload is best described by a customer of mine — “a sea of red”.

Impact of Fault Storms on Operations


When fault storms occur they cause many operational problems. First, they cause blindness. Pre-existing problems and follow-on problems get mixed in, and suddenly you have a mess. It may take hours to sort out the responsible alarms with manual correlation. Next, they cause automation gridlock. Most service-impacting alarms are set to immediately generate tickets. If 1,000 alarms try to open tickets at the same time, you may break your ticketing solution. Last, they cause mistakes. Due to the human nature of sorting out the problem, errors are common. Operations can ignore a separate problem by assuming it's part of another root cause. Fault storms, while rare, are dangerous for operations in assuring customer services.

Addressing Fault Storms with Machine Learning and AI


Fault storms are a great use case for ML/AI technology. Machine learning sets the bar for a “storm”. Artificial intelligence can create the situation by encapsulating all the service-impacting faults. This isolation/segmentation mitigates the “sea of red”. When storms occur, the solution mitigates the blindness: the storm situation is isolated from pre-existing faults and all follow-on problems. Automation runs only on the situation created by ML/AI, which avoids the overload scenario. Fault storms are rare, but can devastate NOC performance. ML/AI technologies are a great choice to mitigate them.

Mitigating the Effects of Fault Storms


The best way to illustrate how this technology works is by showing a solution to a problem. For example, a site outage. When you have a power outage at a remote site, it's devastating. All services depending upon that infrastructure are no longer available. There are hundreds of service-impacting alarms. The final result is a complete mess for operations to clean up. Now, ML/AI can address the fault storm caused by the site isolation. All the alarms share the same location field key, so they have a commonality. The count of alarms from that location is tracked, and machine learning can build a model based upon those records. The rush of faults breaks that model, and the result is an anomaly centered upon that specific location. The anomaly encapsulates the situation – all the service-impacting alarms. With a processed alarm list, the “sea of red” becomes “clear as a bell”. Operations can assign the single site isolation to an individual. Then, after validation, the user can perform action — dispatch. Instead of confusion and panic, operations continues to move forward like any other day. Business as usual, no stress, should be the goal.
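A rough sketch of that location-keyed approach, assuming each alarm carries a location field and history gives us per-interval alarm counts to model; the threshold constant is illustrative:

```python
from collections import defaultdict
import statistics

# Sketch: track alarm counts per location per interval, learn a baseline,
# and roll a flood from one location up into a single "situation".
history = {"DAL-SITE-7": [2, 1, 3, 2, 2, 1, 3, 2]}  # alarms per 5-min interval

def is_storm(location: str, current_count: int, k: float = 4.0) -> bool:
    counts = history[location]
    mean, stdev = statistics.mean(counts), statistics.pstdev(counts)
    return current_count > mean + k * max(stdev, 1.0)  # model broken = anomaly

incoming = [{"location": "DAL-SITE-7", "msg": f"alarm {i}"} for i in range(300)]
by_location = defaultdict(list)
for alarm in incoming:
    by_location[alarm["location"]].append(alarm)  # location is the common key

for loc, alarms in by_location.items():
    if is_storm(loc, len(alarms)):
        # One situation instead of a sea of red; automation runs once, not 300x.
        print(f"SITUATION: site isolation at {loc}, {len(alarms)} alarms rolled up")
```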

Takeaways on Fault Storms

Fault storms can break people's day. They invite failure by operations. A stack of hundreds of outages all demanding the spotlight will be overwhelming. Operations has the opportunity – will they die or will they shine? Leveraging ML/AI technology can keep them on the rails. Then success will be the standard operating procedure.



Challenges Addressed by Machine Learning with AI

Now let us discuss the challenges addressed by machine learning with AI. As we learned in the previous blog, machine learning is excellent at model building. These operational models enable leveraging historical data. Artificial intelligence is great at doing comparisons to produce results. Identifying anomalies and chronics with ease lowers effort requirements. So how does this help us? A point lost on many: operations have discrete problems, and technology is only valuable if applied to problem areas. Here are some of the commonly discussed areas I have seen in the marketplace — names withheld.

Normalizing Faults

One common problem is the normalization of fault data. For example, SNMP traps are a very common fault format and protocol. It's a binary format of an enterprise OID and an integer indicating the trap number. This requires human beings to create database lookups (from MIBs) to provide descriptive detail for the fault. And that discounts operational configurations like up/down correlation or aging settings. Learning these configurations is a possible area for using machine learning technology. AI can compare similarly worded, fully configured trap types and guess what the new ones should be. Human beings can right-click and update where applicable. The result would be a build-as-you-go rules engine curated on the fly. Many Managed Service Providers (MSPs) find this interesting, and any organization with a diverse and changing data set would find it valuable.
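A small sketch of the guessing step, using Python's standard difflib for the wording comparison; the trap names and configuration fields are illustrative, not any product's actual schema:

```python
import difflib

# Sketch: guess sensible settings for an unconfigured trap by comparing its
# name against already-configured trap types; a human confirms via right-click.
configured = {
    "linkDown": {"severity": "major", "clears_with": "linkUp"},
    "linkUp": {"severity": "clear", "clears_with": None},
    "bgpBackwardTransition": {"severity": "minor", "clears_with": None},
}

def suggest_config(new_trap: str):
    """Return (closest configured trap, its settings) as a proposal, or None."""
    match = difflib.get_close_matches(new_trap, configured, n=1, cutoff=0.6)
    return (match[0], configured[match[0]]) if match else None

print(suggest_config("ciscoLinkDown"))  # proposes the 'linkDown' template
```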

Correlating Faults

Correlation is another concern. Operations need efficient ways to identify, isolate, and act on situations. They do not have time to discern associations. As discussed, machine learning can identify a model of what is normal. The same model that enables the anomaly also provides a set of context, which allows for reverse correlation. This lets anomalies drive encapsulation of the situation.
 
Now looking forward, machine learning with AI can identify if this has happened before – or chronic detection. This opens the door to closing out chronic situations. Imagine a burst outage. Operations sees a failed service that clears – down, then back up. Most people end up ignoring these errors, the reason being that a final resolution is impossible to achieve. Unfortunately, it's uncommon to track them. If they re-occur, tools need to identify that they are not “blips”, they are chronics. Long-term, the goal is to forecast them, so that preparations can occur to capture key data. The goal will be to engage to get final resolution of the chronic and prevent its repetition. With machine learning with AI, chronic detection and mitigation become possible — which we will discuss in more detail later.
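A minimal sketch of chronic detection, assuming each fault reduces to a (device, service) fingerprint and that three hits inside a rolling week-long window promote a blip to a chronic; both constants are illustrative:

```python
from collections import defaultdict, deque

# Sketch: count recurrences of a fault "fingerprint" inside a rolling window;
# enough repeats promote a "blip" to a tracked chronic instead of being ignored.
WINDOW_SECS = 7 * 24 * 3600   # look back one week (illustrative)
CHRONIC_AFTER = 3             # third occurrence = chronic, not a blip

occurrences = defaultdict(deque)  # fingerprint -> recent timestamps

def on_burst_outage(device, service, now):
    key = (device, service)           # the fingerprint
    hits = occurrences[key]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECS:
        hits.popleft()                # age out old occurrences
    return "chronic" if len(hits) >= CHRONIC_AFTER else "blip"

for day in range(3):  # the same service bounces three days running
    print(on_burst_outage("pe-rtr-11", "vpn-cust-42", now=day * 86400.0))
# blip, blip, chronic: the third bounce gets flagged for real follow-up
```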
 

Prioritizing Faults

Another problem for operations is data overload. While a problem may be a root cause and an individualized “situation”, operations may not CARE. If a customer takes down their own service, operations must make the logical choice to IGNORE the situation. Leaving this up to humans introduces human error. With machine learning you can understand that this problem is common and should be identified as a chronic situation. This enrichment allows operations to re-prioritize new situations over chronic situations. The result is a more accurate report of what is going on, with an operational priority assigned.
 
Operations also have a problem with reporting. Post-mortem analysis can encumber the operations staff available to learn from failures. In a matter of minutes, machine learning with AI technology can scan years of raw data to find a particular pattern. That pattern can segment what effect a situation had on the network. The bottom line is operations can report on the what, why, where, and how using machine learning with AI technologies.


The Technology of Machine Learning with AI

The technology of machine learning with AI should be our first focus. As with all new technology, it has new terms and new concepts. Several of these are heretical to the status quo. It's important to set a proper context so we can have serious discussions.
 
True to my word, here is the first blog in the series on machine learning with AI in service assurance. To explain some of the solutions in the marketplace, first we need to talk about the technology. The terms and concepts are new, but the goals are the same for operations: increasing automation and increasing quality.

Defining Machine Learning

Let’s start with machine learning. First, check out the Wikipedia article. Below is the definition:
 
“Machine learning is a field of computer science that often uses statistical techniques to give computers the ability to “learn” (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.”
 
I summarize it like an ant farm. The worker ants dig the tunnels; once done, the model is complete. In this case, the ants are the machine learning. As days go on, and if necessary, the ant farm will change. Say the farm falls over (oops); the ants have to rebuild, adding new pathways to the model and updating it. The fact is the ant farm is always changing and learning from the environment. Like the ants, machine learning builds and maintains the patterns, or as I call them, models. These models are compared to the current situation in real time. Artificial intelligence can see if they align (chronics) or diverge (anomalies).
 

Defining AI

Now, AI is defined by Wikipedia as “the study of intelligent agents: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.”
 
The complexity of AI is a spectrum. On one side, AI is cognitive, with virtual reasoning. On the other, AI is nothing more than rules processing. In the context of machine learning, AI usually applies or compares models: what machine learning has learned from a separate set of data. In the context of service assurance, doing the comparison is the value, comparing the past with the present or projecting the future using the past. This provides analytics with insights.

Understanding Anomalies

When performing a comparison, the two either align or diverge within some set degree. If the current situation is a repeat of the past, you have detected a chronic situation. When the current situation is new and unusual, it's called an anomaly. Datascience.com defines three types of anomalies:
 
  • Point anomalies: A single instance of data is anomalous if it’s too far off from the rest. Business use case: Detecting credit card fraud based on “amount spent.”
  • Contextual anomalies: The abnormality is context specific. This type of anomaly is common in time-series data. Imagine that spending $100 on food every day during the holiday season is normal, but may be odd otherwise.
  • Collective anomalies: A set of data instances collectively helps in detecting anomalies. Imagine someone is trying to copy data from a remote machine to a local host. If this is odd, an anomaly would flag it as a potential cyber attack.
Consider chronics as “negative” anomalies. Many customers I have talked to see chronic detection as the most common AI tool to have. Both anomalies and chronics are important AI tools operations can use to better monitor their estate.
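The first type is the easiest to sketch. Here is a tiny z-score test for a point anomaly, using the “amount spent” fraud example from the list above; the data and threshold are illustrative:

```python
import statistics

# Sketch of the simplest case, a point anomaly: flag a value that sits too far
# from the rest of the data, as in the "amount spent" fraud example.
spend = [42.0, 35.5, 51.0, 47.2, 38.9, 44.1, 40.3]  # normal daily spend

def is_point_anomaly(value, data, threshold=3.0):
    mean, stdev = statistics.mean(data), statistics.pstdev(data)
    return abs(value - mean) / stdev > threshold  # classic z-score test

print(is_point_anomaly(45.0, spend))   # False: in line with history
print(is_point_anomaly(900.0, spend))  # True: point anomaly, possible fraud
```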
 

Focusing on Service Assurance


This blog focuses on technology as it applies to service assurance (my major focus). Service assurance, as defined by Wikipedia, is:
 
“The application of policies and processes by a Communications Service Provider to ensure that services offered over networks meet a pre-defined service quality level for an optimal subscriber experience.”
 
Otherwise defined as assuring the quality of the service. There are two ways to increase the quality of your services. One, automate resolutions, thus reducing how much issues impact customers. Two, proactively address issues before they become problems, and problems before they become outages.

Increasing Automation with Machine Learning

Increasing automation, and thus reducing downtime, is always a goal for operations. Machine learning with AI tools focus on correlation: the ability to segment faults into new terms like “situations”. Addressing each situation is key, but they must be actionable. Any correlation would leverage a model built by machine learning, then compare that model against the current fault inventory to produce the segmentation. This is the current focus of the industry today, with mixed results, as we will discuss later.

Being Proactive with Machine Learning

Being proactive is another buzzword from the late 90s — that was a decade ago, right? The reality is it's hard: you need to make educated guesses. You can't make quality guesses if you don't have the data. Data reduction, prevalent in the industry, has caused operations to discard 99% (or more) of their data. Without this data your guesses will be poor. A machine learning model with AI can leverage low-level information that may predict a future outage. That prediction will give operations lead time to fix the issue before an outage occurs. This is the hype focus of the industry today.
 
It is important to provide context when discussing machine learning with AI. This helps us all understand the technology and enables the deeper discussion. The concepts and terms are new, but the industry hasn't changed. Next up will be applying this technology to problems experienced by the industry.


My Journey into Machine Learning with AI

I’m alive! Sorry for the lengthy lapse in updates, but things have been so busy. With the release of some of my work as of late, I now have some time to share what I have been working on. As the title suggests, it's all around machine learning and artificial intelligence (AI). Despite being hot buzzwords, there are tons of success stories using the technology today. To explain what I have learned, I have written a blog series. My hope is to cover my journey in machine learning with artificial intelligence.

Blog Focus: Machine Learning

The first focus is around the technology. As with all new technology, it has new terms and new concepts. Several of these are heretical to the status quo, so it's important to set a proper level set. Then we want to discuss where it applies. What problems does it solve? How well does it solve them? How do the new solutions compare to legacy ones?

Solving Problems with Machine Learning

Now that we have a firm introduction, let's solve some problems. First up is the use case of fault storm management. How are storms detected and mitigated? What are the rewards of applying ML/AI?
 
My next favorite use case is around fault stream reduction. Fewer faults mean less effort for operations. Can ML/AI help? How well does it work? How hard is it to use?
 
Operational performance management is a touchy subject, but a worthwhile exercise. Why should you monitor your NOC? How can it help operations without being Orwellian about it?
 
Chronic detection & mitigation is a common use case for operations. How does operations iron out the wrinkles of their network? Can operations know when to jump on a chronic to fix it for good? Getting to 99.999% is hard without addressing chronic problems.
 

What is the Future of Machine Learning?

As part of this series, we should address the future of this technology. With ML/AI being so popular, where should this technology be applied? How can it help with service assurance to make an impact?
 
The plan is to wrap up the series with a review of what is currently available in the marketplace, so we are all aware of what is current versus what is possible.
 
Stay tuned; the plan is to release the blogs weekly. Don't be afraid to drop comments or questions, as I would love to do an AMA or a blog on the questions.



A Tale of Two Sales

It's been a long quarter, but to celebrate some of the successes I have had, I wanted to share. Service assurance sales strategies are something everybody is looking for and most do not know how to achieve. I get asked all the time: why does this sell and that does not? To sum up my response, I decided to write a blog post — A Tale of Two Sales.

Sale A: Relationship Sale


When customers or partners need help, they go to people they trust. If they have no one with answers they trust, they do RFI/RFPs. They do proof of concepts until they obtain trust. So much of a sales engagement is about trust – obtaining and fostering it. Once trust is there, the deal only needs to be palatable. Until trust is there, you must be patient and listen. This is also called a “consultative sale”. Customers must share their pain, and their vendor must share their solutions. If there is a fit and it is proven, you have a match. This can take months or days; it all depends upon how open the engagement is. Once the trust is there, it's a matter of business case creation. Project approval and budget follow — sale done and dusted…

 

Sale B: Package Sale


Same situation – a customer or partner in need. They have no trusted advisor relationship. Hello RFI/RFP! They get back a complete, documented, sourced, and independently verified business case. This “known” quantity provides all the tangibles: ROI projections, features/benefits, comparison studies, and case studies. The entire business case arrives with only minor adjustments required. I call this the package sale. Sales makes the pitch and leaves the package with the customer. References provide the trust. The customer can go to their management assured of their support. The quality of the deliverable is everything. Management agrees, project approval and budget follow. Sale is complete.

 

For you buyers out there…

Sales is hard, so be nice to your sales rep. Tell them what you want and why you want it. They can get it for you if you ask. If they ever tell you false information, stop talking to them. It's not worth putting your career at risk, even for a slightly superior product.

 

For you sellers out there…

It's all about the benefits. Completely documented and third-party certified benefits. Features are for demos; benefits are for closers. Customers and partners buy solutions to solve headaches or get promotions. It's your job to help on both accounts.

I personally believe it does not matter which strategy is used. Service assurance sales strategies are not complex; you just have to listen to your customer. The customer will guide you, and if you do it right, you will have both deliverables: the trust and the business case.



Projecting how 5G will impact operations

It’s coming, they say. Not winter (that's here already), but mobile 5G is coming. Sure, the latency will allow a new generation of mobility applications. The new RF control functions will allow better elasticity. The network slicing will finally allow MPLS-like functionality in a 3GPP network. But I think it's time someone asked the basic questions on how 5G will impact operations.


Basic Tutorial and Terminology

5G means different things to different people. The money involved means the industry will be providing competing visions. The essence of 5G is focusing on improving latency, bandwidth, network slicing, and elasticity. Adding more endpoints (antennas) and distributed control functions will reduce the latency. The control functions permit more distributed switching, and the extra endpoints mean less network mileage is required. These latency benefits are the biggest game changer so far.

The simplified control functions also enable better elasticity options. Scaling up/down/in/out will allow a more natural self-optimizing network. Adding class-of-service capabilities (like QoS in MPLS) will allow tiered network options. While net-neutrality questions still loom, this adds diversity to the single-use mobile network. The amount of network required is still to be defined, but it looks to be at least 10x. Mobile operators are also taking this opportune time to diversify their vendors. Most US providers are adopting radios from at least two vendors.

The bottom line for operations is potentially terrifying. Expect exponential scale in the network and backhaul, and an exponential complexity increase with an “always in flux” network. New network offerings and customer bases are bound to cause trouble. Top those off with at least doubling the vendors. Houston, we have a problem!

 


Next-generation Mobile Network Services

As you can see, the investment will be significant. The upside should be worth it, though. Traditional mobile services are commoditized, as I blogged here last year. Data, voice, and SMS do not provide enough value to the customer. The new services provided will change that. With the latency benefits, IoT services will become more viable. I detailed IoT more in the previous blog around IoT Service Assurance. In my opinion, the most intriguing new offering is “fixed wireless access” (FWA). Ericsson did a really nice write-up available here. Verizon is augmenting their FIOS offer with an FWA offer in 2018. This means that mobile providers are entering into the cable access market (HFC). This sets Verizon against Comcast, or T-Mobile against Charter. Gone will be the days that we are locked into high-speed internet options solely by who developed the neighborhood.

These new network services will drive new revenue potential. Most of these services will have direct competition, so quality will matter. With all these changes, operations should expect challenges. We should all expect quality problems with these new services.

 


Exponential Scale and Complexity

The first great challenge will be scale and complexity. Tripling the number of devices in your network will stress your tools. Realistically, can your OSS handle a 10x-1,000x increase in network size? But this is not the only issue. The self-optimizing vRAN means that the network will constantly be in flux. How can you troubleshoot a network that is always changing? Due to the size of the investment, it only makes sense that multiple vendors will be used. Most mobile operators heavily depend upon their NEPs to provide OSS solutions.

The solution is simple: in fact, it is simplification. A vendor-, technology-, and product-agnostic OSS solution is a must. As you increase your tools, the complexity limits functionality. Low-level optimization and orchestration can be done at the element manager layer. This increases the scale of both layers of the solution.

 


Becoming Geospatial Again

Remember the HP OpenView days of maps? Well, get prepared for those concepts to return. Like wifi antennas, 5G deploys radios with geospatial design in mind. Geospatial information (lat/long) will then drive behavior. GIS correlation and visualization then become a need. Correlation and analytics are vital to reducing the complexity of 5G vRAN networks. External network conditions become more indicative. Things like hurricanes, floods, and power outages need to be taken into account. This is very similar to the cable industry (HFC) access monitoring requirements. Operations will need help, because most legacy tools are inadequate in these areas.


Bending but not Breaking with Elasticity

Elastically scaling the network is not a new 5G concept. The trouble is we have seen that many, if not most, of the network functions are beasts. Components like MMEs and SGWs take hours to spin up and configure. This reduces the value of elasticity in 3GPP networks. 5G replaces many complex functions, like eNodeBs, with smaller control functions. This will enable all the promises of network virtualization. The question becomes whether operations has the orchestration tools to enable automation. Some NEPs are building those functions into the element managers or VNFMs. Most service assurance tools do not have the capability to handle the network flux caused by real-time elasticity. Operations will need to review their tools to make sure they are agile enough.


Operations Face Uphill Battle

The industry is beating the drums: 5G is coming. But I do not hear from the industry how operations will consume it. From my experience, nobody knows. Legacy tools make it too difficult to share information. They are too tied to a vendor or technology domain. Most tools have difficulty scaling to 100k-5m devices. This forces most customers to silo their monitoring and management, which creates a visibility gap that drives quality issues. Most operational processes are ticket- or fault-centric. Correlation is lagging behind, and there will be too many faults to process. Visualization of the network will be a critical need, but may not be possible. Like winter, 5G is coming, so where is my 700 ft wall?

 



 

Merry Christmas 2017

Merry Christmas to all, and to all a good night! I wanted to take an opportunity to review 2017 and be thankful for all the blessings we received this year. Let me provide some description of my good cheer by listing out the blessings I received in 2017.

Ezekiel Ennis

My youngest son was gifted to me this year in February.   Very thankful to have a healthy and growing baby boy.   Nothing was going to top that.   Anyone would be thankful for this guy!  Now just sleep through the night, huh? 😉

SDN/NFV

It really works (and no, that's not Zeke). I was fortunate to be involved with a revolutionary project in 2017. It was around service assurance proving the value of SDN/NFV in telecom managed services. We are all so proud of the pioneering work, and of working with the people that made it possible. I hope to share more lessons learned in 2018.

Machine Learning

The Elastic stack and machine learning have been a large focus of my Q4 this year. The value proposition is stunning. Merging service assurance with business intelligence and machine learning has been an eye-opener. I hope to provide more perspective in the blog, but I am very thankful to have spent so much time here recently.

5G Mobility

Learning the 5G mobility vision. The mobility world is changing rapidly, and the services offered are about to begin an exponential leap. Fixed wireless is only HFC access without the coax, what a realization. A 10-100x increase in antennas and backhaul. Breaking eNodeBs into virtualized network functions. Really exciting stuff. I hope to share more as new projects become a reality.

IoT Digital Services

IoT is all about perspective.   Here is a link around IoT Service Assurance.   Very thankful to get a deeper understanding of the services and pain points.  The customers and partners I work with are outstanding.   I look forward to working in the new year realizing our co-developed vision.

 

Thanks again to all the customers, partners, and colleagues who made 2017 a great year. Now on to 2018, when all the good boys and girls are getting 5G under their trees and gift cards for RPA.



Gone are the days…

In this holiday season (2017), I happened to chat with a customer about change. Why is innovation so hard? The answer is change. Nowhere is change harder than in traditional telecommunications CSPs. The impact is starting to show: OTT services are growing and unstoppable, and traditional offers are withering. Hardware providers (NEPs) are also on the ropes. A Business Insider article states that “it's time to say goodbye to the text message”. Papworth sent the first SMS text while at a telecom hardware provider (Sema). He sent it to Jarvis at a telecom service provider (Vodafone). As an aside, the first text was “Merry Christmas”. Gone are the days that innovation comes from NEPs and CSPs. Providers need to accept change.

[Chart: Business Insider data on the decline of the text message]

Personal Testimonial

If that chart does not help show it, how about a personal testimonial? Below is the usage from my mobile provider. My account (#1) leverages OTT offerings all day long for voice, data, and communications. When I am traveling between offices and meetings, I use mobile voice. I use data to check and send email. If I am in the office, I am using wifi. I am like many other mobile consumers.

[Chart: usage by line from my mobile provider]

Then there is my wife (#2, don't tell her). She is always in motion and must be completely mobile. There is no wifi available at the park or in the playground. My wife texts because she does not have the time to dedicate to a long conversation. She texts because she wants to communicate via groups. Data is used because of Facebook and Pinterest. My wife is like many other mobile consumers.

Maybe 5G will change things? Maybe the next iPhone XYZ will break us both out of our patterns? We can agree that how we use our phones is going to change. Technology is a constant ‘game changer’. I, for one, look for change, but does my service provider? There is a lot of uncertainty in mobility. One thing is for sure: gone are the days that voice and text drive our communications.

Change in Telecommunications

What does this mean for telecoms? The value of “change” needs to be embraced. Change means opportunity. As the Business Insider chart shows, fewer people are using a 25-year-old technology. Innovations by NEPs and CSPs are fewer than those by ASPs and OTT providers. The reason is change. Embracing change drives innovation. Innovation provides market opportunities. It also makes consumers happier.

The road for telecoms is clear, and many are already following it. First, they need to embrace OTT providers, not fight them (remember net neutrality?). T-Mobile with Netflix is showing the way, and Sprint just announced a partnership with Hulu. Together, providers can offer a single throat to choke and cement customers in.

Next, telecoms need to leapfrog OTT providers. Telecoms need to get back to their R&D, pioneering days. Verizon has invested in purchasing Yahoo and AOL. AT&T is acquiring Time Warner. This is a good start, but innovation needs to come to telecoms – before it's too late. Before we say gone are the days that CSPs have control of their customers.

The text message is dead (SMS); long live the text message (Facebook)… Gone are the days…



Understanding Service Assurance Correlation & Analytics

Data is good, right? The more data the better. In fact, there is a whole segment of IT related to data called big data analytics. Operations have data, tons of it. Every technology device spits out gigabytes of data a day. The question is figuring out how to filter that data. It's all about reducing the real-time stream of data into actionable information. Understanding service assurance correlation & analytics is all about focusing operations; that focus can produce better business results. This blog details common concepts and what's available in the marketplace. I want to show the value of turning data analytics into actionable information. Value operations can execute on successfully.

Maturity Curve for Service Assurance Correlation & Analytics

Let’s talk about terminology first. Correlation versus analytics is an interesting subject. Most people I talk to consider correlation to be only within fault management, while analytics includes time-series data like performance and logs. While I know some would disagree with that simplification, we can use it here to avoid confusion. Whichever term you use, what we look for is reduction and simplification. The more actionable your information is, the quicker you can resolve problems.


Visual Correlation & Analytics

The first step on the road to service assurance correlation and analytics is enabling a visual way to correlate data. Correlation is not possible if the data is missing, so having unified collection is your first step. Once you have the data co-located, you can drive operations activities to resolution. Technicians can leverage the tool to find the cause of the fault. Drill-down tools can help uncover enough information that NOC techs can perform manual parent/child correlation.

Once executed, users of the assurance tool can also suppress, or hide, faults. Faults that are not impacting, or known false errors, become sorted out as “noise”. Assurance systems then leverage third-party data to enrich faults. Enrichment allows faults to include more actionable data, which makes them easier to troubleshoot. All these concepts should be second nature. Operations should have all these visual features as part of their assurance solution; otherwise they are hamstrung.


Basic Correlation & Analytics

Once you have a tool that has all your data, you will be swimming in the quantity of that data. You must reduce that data stream. If not, you will overload the NOC looking at the stack instead of the needle. There are many basic-level strategies that allow that reduction.

First, look at de-duplication. This feature allows you to match up repeat faults and data points, which eliminates 80% of duplicate data. Matching “up” to “down” messages allows elimination of 50% of your data stream. Reaping jobs can close out data not deemed “current” and limit log data. Another common feature is suppressing faults: suppression by time windows during scheduled maintenance, or excluding business hours. Threshold policies can listen to “noise” data and, after X times in Y minutes, create an alert. These features should be available on any assurance platform. If yours lacks them, look to augment.
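The X-in-Y threshold policy is simple enough to sketch directly. A minimal sliding-window version in Python, with illustrative constants:

```python
from collections import deque

# Sketch of a basic "X times in Y minutes" threshold policy: noise events are
# swallowed until the rate is exceeded, then a single alert escalates.
class ThresholdPolicy:
    def __init__(self, x_times, y_minutes):
        self.x = x_times
        self.window = y_minutes * 60.0
        self.hits = deque()

    def on_event(self, ts):
        """Return True when this event should escalate to an alert."""
        self.hits.append(ts)
        while self.hits and ts - self.hits[0] > self.window:
            self.hits.popleft()       # slide the window forward
        return len(self.hits) >= self.x

policy = ThresholdPolicy(x_times=5, y_minutes=10)
for second in range(0, 300, 60):                # one noisy event per minute
    if policy.on_event(float(second)):
        print(f"ALERT raised at t={second}s")   # fires on the 5th hit only
```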


Root Cause Analysis Correlation & Analytics

If you have a NOC with thousands of devices or tens of domains, you need cross-domain correlation. Root cause analysis is key to reducing the complexity of large access networks. Performing RCA across many technology domains is a core strategy, and operations can use it for consolidated network operations. Instead of playing the blame game, you know which layer is at fault. Leveraging topology to sift through faults is a common concept; unfortunately, it's not typical in practice. Topology data can sometimes be difficult to collect or of poor quality. Operations needs a strong discovery strategy to prevent this.
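A toy sketch of topology-driven RCA, assuming discovery gives us a child-to-upstream-parent map: alarms whose alarming ancestors explain them are rolled up under the topmost root cause.

```python
# Sketch: suppress downstream alarms whose upstream parent is also alarming,
# using a simple child-to-parent topology map produced by discovery.
topology = {
    "dslam-17": "agg-sw-03",   # child -> upstream parent
    "dslam-18": "agg-sw-03",
    "agg-sw-03": "pe-rtr-01",
}

def root_causes(alarming):
    roots = set()
    for node in alarming:
        parent = topology.get(node)
        while parent in alarming:          # walk up while an ancestor alarms
            node, parent = parent, topology.get(parent)
        roots.add(node)                    # the topmost alarming node wins
    return roots

down = {"dslam-17", "dslam-18", "agg-sw-03"}
print(root_causes(down))  # {'agg-sw-03'}: the DSLAM alarms are mere symptoms
```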

Cluster-based Correlation

Cluster-based correlation is another RCA strategy, one that does not rely upon topology data. The concept here is using trained or machine-learned data. A written profile will align data when a certain pattern matches. Some tools create patterns during the troubleshooting process; others have algorithms that align faults by time and alerts. Once the pattern matches, the alert fires, causing a roll-up of the symptoms to reduce the event stream. This correlation method is popular, but hasn't produced many results yet. Algorithms are the key here, and many challenge an ROI model that requires machine training.
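A bare-bones sketch of the time-alignment flavor of clustering, with no topology data at all; real tools also weigh attribute similarity, but the time gap alone shows the roll-up idea:

```python
# Sketch of the time-alignment idea: alarms that arrive close together are
# rolled up into one candidate situation, with no topology data required.
def cluster_by_time(alarms, gap_secs=30.0):
    """alarms: iterable of (timestamp, message) pairs."""
    clusters, current = [], []
    for ts, msg in sorted(alarms):
        if current and ts - current[-1][0] > gap_secs:
            clusters.append(current)       # gap too large: close the cluster
            current = []
        current.append((ts, msg))
    if current:
        clusters.append(current)
    return clusters

feed = [(0, "linkDown A"), (4, "bgpDown A"), (6, "ospfDown A"), (500, "fanFail B")]
for situation in cluster_by_time(feed):
    print(f"situation with {len(situation)} alarms:", [m for _, m in situation])
# the first three roll up together; the fan failure stands alone
```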

Customer Experience Assurance

Next, RCA enables operations to become more customer-centric. Service-oriented correlation allows operations to see the quality of their network through their customers' eyes. Some call this functionality “service impact analysis”; I like the term “customer experience assurance”. Understanding which faults are impacting customers and their services enables more efficient operations. The holy grail of operations is focusing only on root causes, then prioritizing action by customer value.

Service Quality Management

Lastly, you can track customer temperature by moving beyond outages and into quality. It's important to understand the KPIs of the service. This allows clarity on how well the service is performing, and if you group these together, you simplify. While operations ignore bumps and blips, you still need to track them. It's important to understand that those blips are cumulative in the customer's eyes. If the quality threshold is violated, customers' patience will be limited. Operations needs to know the temperature of the customer. Having service- and customer-level insights is important to providing high-quality service. Having a feature like this drives better customer outcomes.


Cognitive Correlation & Analytics

The nirvana of correlation and analytics includes a cognitive approach. It's a simple concept: the platform listens, learns, and applies filtering and alerting. The practice is very hard. The available algorithms are diverse: either domain specific (website log tracking) or generic in nature (Holt-Winters). Solutions need to be engineered to apply the algorithms only where they make sense.

Holt-Winters Use Case

One key use case is IPSLA WAN link monitoring. Latency across links must be consistent; if you see a jump, that anomaly may matter. The Holt-Winters algorithm tracks abnormal behavior through seasonal smoothing. Applied to this use case, an alert is raised when the latency breaks its normal operation. This allows operations to avoid setting arbitrary threshold levels. Applying smart threshold alerting can reduce operational workload. Holt-Winters shows how cognitive analytics can drive better business results.
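A small sketch of that idea using the Holt-Winters implementation in statsmodels, on synthetic hourly latency with a daily season; the three-sigma band and the data are illustrative:

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Sketch: fit Holt-Winters to a seasonal latency series, then flag a new
# sample that lands outside the forecast plus a residual-based band.
rng = np.random.default_rng(7)
hours = np.arange(24 * 14)                       # two weeks of hourly samples
latency = 20 + 5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.5, hours.size)

fit = ExponentialSmoothing(
    latency, trend="add", seasonal="add", seasonal_periods=24
).fit()
resid_std = np.std(latency - fit.fittedvalues)   # how noisy "normal" is

forecast = fit.forecast(1)[0]                    # expected latency next hour
observed = 38.0                                  # the jump we actually measured
if abs(observed - forecast) > 3 * resid_std:     # outside the learned band
    print(f"anomaly: latency {observed}ms vs expected {forecast:.1f}ms")
```

No arbitrary static threshold is needed; the band moves with the daily seasonality the model has learned.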

Adaptive Filtering Use Case

Under the basic correlation area I listed dynamic filtering: a fault can happen “X times in Y minutes”, and if so, create alert Z. This generic policy is helpful, but the more you use it, the more you realize that you need something smarter. Adaptive filtering using cognitive algorithms allows for a more comprehensive solution. While the X-Y-Z example depends upon two variables, an adaptive algorithm can leverage hundreds. How about understanding whether the device is in a lab or is a core router? Does the fault occur every day at the same time? Does it precede a hard failure?

You can leverage all these variables to create an adaptive score. This score would be an operational temperature gauge, or noise level. NOC techs can cut noise during outages, increase it during quiet times, or sort by it to understand “what's hot”. Adaptive filtering gives operations the ability to slice and dice their real-time fault feeds. This feature is a true force multiplier.
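A toy sketch of such an adaptive score, a weighted blend of enrichment variables; the feature names and weights are invented for illustration, not any product's actual model:

```python
# Sketch of an adaptive score: blend many enrichment variables into a single
# "operational temperature" that NOC techs can sort and filter by.
WEIGHTS = {
    "rarity": 0.4,                  # statistical rarity of the fault
    "production": 0.3,              # lab gear scores 0, core routers 1
    "precedes_hard_failure": 0.2,   # historically followed by an outage
    "daily_pattern": -0.1,          # same-time-every-day faults score lower
}

def adaptive_score(fault):
    return sum(w * fault.get(name, 0.0) for name, w in WEIGHTS.items())

faults = [
    {"id": "lab switch flap", "rarity": 0.1, "production": 0.0,
     "daily_pattern": 1.0},
    {"id": "core router errors", "rarity": 0.9, "production": 1.0,
     "precedes_hard_failure": 1.0},
]
for f in sorted(faults, key=adaptive_score, reverse=True):
    print(f"{adaptive_score(f):.2f}  {f['id']}")   # "what's hot" sorts first
```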

The Value of Correlation & Analytics


The important part of correlation & analytics within service assurance is its value. You must understand what is freely available and its value to operations. This varies greatly from customer to customer and environment to environment. You have to decide how far down the rabbit hole you want to go. Always ask the question “How does that help us?”. If you are not moving the needle, put it on the back burner.

If a policy is not saving 4-8 hours of effort a week, it's just not worth the development effort. Find your quick wins first. Keep a list in a system like Jira and track your backlog. You may want to leverage an agile methodology like DevOps if you want to get serious. Correlation and analytics are force multipliers. They allow operations to be smarter and act more efficiently. These are worthwhile pursuits, but make sure to practice restraint. Focus on the achievable; you don't need to re-invent the wheel. Tools are out there that provide all these features. The question to focus on is “Is it worth my time?”.

Shawn Ennis

About the Author

Serial entrepreneur and operations subject matter expert who likes to help customers and partners achieve solutions that solve critical problems. Experience in traditional telecom, ITIL enterprise, global managed service providers, and datacenter hosting providers. Expertise in optical DWDM, MPLS networks, MEF Ethernet, COTS applications, custom applications, SDDC virtualized, and SDN/NFV virtualized infrastructure. Based out of the Dallas, Texas US area and currently working for one of his founded companies – Monolith Software.