Applying Web3 Concepts

Web3, the buzzy term for the next internet era based on blockchain technology, can foster a better customer experience in many use cases. For enterprises, a turnkey hosted decentralized application framework with the trust and privacy of blockchain built-in can enable better B2C applications, since the very design of blockchain allows companies to be everywhere all the time. When done right, Web3 concepts can be used to create personalized and frictionless experiences for consumers, while still guaranteeing privacy and security.

Below, we’ll break down a few examples of practical uses for Web3 concepts.

Web3 concepts in action

Let’s say you want to create a certificate of authenticity for a physical product you market, manufacture, and sell (like Nike does). Or it could be a “badge” indicating attendance at an event (looking at you, AMC). To tie customers to your brand, you need a record of ownership that customers find value in now and over time. In a Web3 world, that’s called an NFT, a non-fungible token. The art and card trading worlds are currently reinventing themselves with this technology, but it can also be used by enterprises.

This digital artifact can be given to a customer to keep in their digital wallet (that is, their blockchain user account). When they want to resell that product or prove ownership, it’s available 24/7/365 at no cost to you (the manufacturer) or them (the owner).

What do you need?

Walking through the math of a real-life example is useful. Let’s look at what you need for execution:

  • Omnichannel registration portal
  • Marketing data reporting solution, including revenue
  • Track and trace engine, to zoom in/zoom out of the data

This would cost you around $250k a year in a hosted software-as-a-service (SaaS)-type framework. But what would that buy you? Say you want five limited-edition runs monthly of around 25k NFT “badges” each. The quantity barely affects the cost, but the result would be 1.5M NFTs a year, handed out to your customers digitally. You can charge the end customer for these items as you would for certificates of authenticity, or embed the cost in the product’s price. You can also add a “tail” to the NFT, so that when a transfer happens, you get a portion of the third-party sale as compensation for its creation. This feature could enable the solution to effectively pay for itself.
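The back-of-the-napkin math is simple (the run counts and the $250k figure come from the example above; the per-badge cost is derived):

```python
# Illustrative math for the NFT "badge" program described above.
runs_per_month = 5          # limited-edition runs each month
badges_per_run = 25_000     # NFTs minted per run
annual_cost = 250_000       # hosted SaaS framework, USD/year

badges_per_year = runs_per_month * badges_per_run * 12
cost_per_badge = annual_cost / badges_per_year

print(f"{badges_per_year:,} badges/year")   # 1,500,000
print(f"${cost_per_badge:.3f} per badge")   # $0.167
```

At under 17 cents per badge, even a modest tail on resales can cover the program cost.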

Another use case: 100% digital products

Let’s look at another use case, where you provide a valuable, 100% digital product, like a media file: audio, visual, or audio-visual. We’ve built a use case around an immersive mobile application like Nike’s, but this one is for watches. Consumers get exclusive, limited-edition designs from their favorite designers. The use case for the brand is testing new designs: discovering which are popular before investing millions of dollars and months of work to produce them. You would also need an omnichannel marketplace to go with the same portal and reporting solutions from before. The total cost of ownership is $750k plus $175k/year. That is a steep price for a campaign, but a cheap price for a factory that can create unlimited digital products for pennies in minutes. The revenue potential here is telling:

  • MSRP: $5 with a 10% tail on resales
  • Each NFT is re-sold 1.5 times a year on average
  • Selling the same 1.5M “watch” NFTs a year

This leaves you with an amazing $8.4M revenue stream, with a recurring component of $900k. Yes, this means recurring revenue on products you already manufactured and sold. Our calculations show a 22-day ROI with a 91% margin. Eliminating physical manufacturing and distribution shows its upside here.
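Here is the arithmetic behind those figures. The average resale price isn’t stated above, so the $4 below is inferred from the $900k recurring figure; treat it as an assumption:

```python
# Revenue model for the digital watch NFTs described above.
nfts_per_year = 1_500_000
msrp = 5.00               # USD per primary sale
tail_rate = 0.10          # creator's cut of each resale
resales_per_nft = 1.5     # average resales per NFT per year
avg_resale_price = 4.00   # assumed; this is what makes the tail come to $900k

primary_revenue = nfts_per_year * msrp
tail_revenue = nfts_per_year * resales_per_nft * avg_resale_price * tail_rate

print(f"primary: ${primary_revenue:,.0f}")                 # $7,500,000
print(f"tail:    ${tail_revenue:,.0f}")                    # $900,000
print(f"total:   ${primary_revenue + tail_revenue:,.0f}")  # $8,400,000
```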

Another use case: media and experiences

What if you create media content instead of apparel? Imagine seats for an exclusive event or a concert. Web3 can enable secure digital content distribution without a middleman. The content would be wrapped in an immersive customer experience, but the delivery method would be the same as before. The key point of this use case is that you are not tied down by categorization or classification. If you can devise, market, and sell the experience, you can engage the customer using a Web3 lever.


The promise and the catch

Advocates of Web3 see the ability to reduce latency and intermediaries between the businesses making products and the customers consuming them. They see a fully hosted network connecting us all, free to use for viewing and tracking, and one that costs fractions of pennies for each product produced or sold. Using this technology allows enterprises to connect, serve, and learn from customers directly without the need for physical presence, third-party transaction go-betweens, or heavily regulated privacy limitations.

So, what’s the catch? The technology is new and the skills to do this work are limited. You need smart, experienced, and capable people to help you turn Web3 from a pipe dream into real-world products. We offer that expertise. With our turnkey technology, we facilitate test runs of publicly tradeable NFTs at a fixed price. Find out how we can turn your Web3 project into reality.

Why Web3 can improve enterprise CX


By now, you’ve probably heard the tech buzzword “Web3” in the context of Bitcoin or NFTs of viral memes. Web3, considered to be the next iteration of the internet, is based on blockchain technology, allowing users to read, write, and own their data. However, what you might not know is that Web3 can be leveraged by enterprise organizations to guarantee consumer privacy and security while still allowing for better CX as an outcome.

Here are a few reasons why Web3 can be the future of CX.

Web3 vs. blockchain: A primer

Blockchain is a distributed, decentralized ledger that records transactions securely, permanently, and efficiently. It is not controlled by any one entity, so it has no single point of failure or control. No one person or entity “owns” the information. The model is highly secure and can be applied to anything of value: currency, personal information, and data of any kind. Think of blockchain as a database arrayed in multiple redundant nodes in many locations. Web3 is the overarching trend that blockchain is a part of: the practice of using blockchain to accomplish business solutions beyond simply storing data. Web3 gets its name from its agreed-upon place in the evolution of the internet. The original internet of static HTML pages is known as “Web 1.0,” and the transition to dynamic social media as Web 2.0. Thus, we’ve arrived at Web 3.0, or Web3.
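The “chain” part is easy to demonstrate: each block stores a hash of the previous block, so tampering with any historical record invalidates everything after it. This toy sketch shows only that property; a real blockchain adds distribution and consensus across those redundant nodes:

```python
import hashlib
import json

def block_hash(block):
    # Hash the block's canonical JSON form.
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, data):
    # Each new block commits to the hash of the block before it.
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"data": data, "prev_hash": prev})

def chain_valid(chain):
    # Every block must reference the hash of its predecessor.
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

chain = []
append_block(chain, {"owner": "alice", "item": "certificate-001"})
append_block(chain, {"owner": "bob", "item": "certificate-001"})  # transfer
print(chain_valid(chain))   # True

chain[0]["data"]["owner"] = "mallory"  # tamper with history
print(chain_valid(chain))   # False
```

Because rewriting history breaks the hash links, a record of ownership stored this way is tamper-evident by construction.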

Blockchain can be leveraged by smart developers to create what are called “smart contracts.” The name is a misnomer, as famous Web3 developers have said that they are neither “smart” nor “contracts.” If you are lost in the minutiae of this terminology, you are not alone. Elon Musk has famously said he is “too dumb to understand smart contracts.”

People get confused thinking that Web3 is the same as cryptocurrency. Though Web3 does include cryptocurrency, the key takeaway of this technology is that it can serve as a building block for enterprise solutions. It’s available to you turnkey and hosted in a public, private, and hybrid way—just like virtualized computers and databases—and adding blockchain to many enterprise solutions generates concrete business value.

But just because a technology is “cool” and hyped does not mean it’s relevant to your business. There is a lot of froth in the world of blockchain, so not all solutions branded as Web3 will yield much, if any, use value. The tenets of Web3, however, do allow for a more decentralized approach to applications and business services.

Why Web3 fits into CX strategy

Companies engage with customers where they are; nobody lives in the cloud. From an organizational perspective, however, the traditional challenge of being “everywhere all the time” for customers is expensive in hosting, management, and complexity. One example of that challenge is enabling point-of-sale and currency transactions, which is the origin story running from blockchain to cryptocurrency to decentralized applications (dApps).

Why does Web3 matter to the enterprise? A turnkey hosted decentralized application framework with the trust and privacy of blockchain built in means fertile ground for B2C applications, by allowing for companies to be everywhere all the time. Web3 makes blockchain technology actionable.

Let’s say we want to create a certificate of authenticity for a physical product we market, manufacture, and sell (like Nike does). Remember the goal: we want to tie our customers to our brand. We need a record of ownership that our customers find value in now and over time. In a Web3 world, that is called an NFT—non-fungible token (a fancy way to say record of ownership). The art and card trading world is in the process of reinventing itself with this technology, but the same technology can be leveraged by enterprises.

This digital artifact can be given to a customer. That customer keeps it in their digital wallet (fancy way to say blockchain user account). When the customer wants to resell that product or prove ownership it is available 24/7/365 at no cost to you (the manufacturer) or them (the owner).

Get up and running with Web3

Advocates of Web3 see the ability to reduce latency and intermediaries between businesses making products and customers consuming them. They see a fully hosted network connecting us all, free to use for viewing and tracking, and one that costs fractions of pennies for each product produced or sold. Leveraging this technology allows enterprises to connect, serve, and learn from customers directly without the need for physical presence, third-party transaction go-betweens, or heavily regulated privacy limitations, resulting in a faster, more convenient, and more enjoyable experience for customers.

Implementing IoT for Your Sustainability Journey

You can’t manage what you can’t measure, and this is especially true for implementing IoT sustainability initiatives to reduce operational costs and increase efficiency.

Currently, investors and stakeholders are demanding measurable outcomes and economic benefits from ESG initiatives. Environmental benefits, in the form of energy savings, can deliver both carbon reduction and energy cost reduction. But achieving large savings, and doing so quickly, is hard. In addition, few people are aware of IoT’s role in accelerating sustainability initiatives, or that implementation is actually economically achievable.

In my latest webinar, I moderated a panel including representatives from Optio3 and Carbon Lighthouse. We discussed how businesses can transition toward measurable sustainability quickly and easily using IoT Edge technology.

If you missed the webinar, or would like to experience it again, watch the video below.

Implementing IoT sustainability measures for better business results

End of the Beginning for Machine Learning

Well, that’s a wrap for my initial journey into machine learning and artificial intelligence. But it’s only the end of the beginning for machine learning. Before I finish up this series, some people have asked that I provide some context. This technology is changing the industry, and some players, my company and competitors included, have already adopted it. So, to placate the masses, let’s list out what I have heard is currently available in the marketplace. I am not omniscient, so if you see anything missing or wrong, please let me know.

Artificial Intelligence is Old Hat

 
Let’s first talk legacy. Artificial intelligence has been part of software since my start in the industry. The most common form is “rules”: a human defines the model, which could be a list of “if” statements or a tree stored in a database. The difference is not on the AI side but on the machine learning side: automated model building is what’s new. Other legacy concepts are algorithm-based. Two examples: linear regression trending and Holt-Winters smoothing. Both are available in open source (like MRTG) as well as in many commercial applications today. The commonality is that the algorithm provides the model. Let’s be clear: the algorithm doesn’t build the model, it IS the model. These are robust and well-regarded solutions in the marketplace today.
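Holt-Winters is a good illustration of “the algorithm IS the model”: the forecast is fully determined by the smoothing equations, with no separately trained artifact. A minimal double-exponential (level plus trend) sketch, omitting the seasonal term for brevity:

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=1):
    """Double exponential smoothing (Holt's linear trend method).

    Returns the forecast `horizon` steps past the end of `series`.
    """
    level, trend = series[0], series[1] - series[0]
    for value in series[1:]:
        prev_level = level
        # Smooth the level toward the new observation...
        level = alpha * value + (1 - alpha) * (level + trend)
        # ...and smooth the trend toward the latest level change.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + horizon * trend

# A steadily climbing utilization metric: the forecast follows the trend.
utilization = [10, 12, 14, 16, 18, 20]
print(holt_forecast(utilization))  # ~22.0
```

Note there is no training phase: tune alpha and beta, feed in the series, and the algorithm itself produces the prediction.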

Anomaly Detection vs Chronic Detection

 
 
Now let’s move to machine learning. Anomaly detection, with varying degrees of accuracy, is getting to be common in the marketplace. Many offerings are black boxes that strain credibility; others are an open-ended abyss of customization time. The mature solutions try to balance out-of-the-box value with flexibility. There are plenty of options for anomaly detection. Chronic detection and mitigation is much rarer; I have not seen many who offer that functionality, especially accomplished with machine learning. Again, on dealing with chronics, your mileage may vary, but it’s out there.
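A bare-bones version of out-of-the-box anomaly detection is a rolling z-score: learn the recent mean and spread of a metric and flag samples that fall far outside it. Commercial products layer on seasonality and robustness, but the core idea is small:

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window=30, threshold=3.0):
    """Return a checker that flags values more than `threshold`
    standard deviations from the rolling mean of recent samples."""
    history = deque(maxlen=window)

    def check(value):
        anomalous = (
            len(history) >= 5          # need some history before judging
            and stdev(history) > 0
            and abs(value - mean(history)) > threshold * stdev(history)
        )
        history.append(value)
        return anomalous

    return check

check = make_detector()
normal_latency = [20, 21, 19, 22, 20, 21, 20, 19, 22, 21]
flags = [check(v) for v in normal_latency]
print(any(flags))   # False: steady traffic, nothing flagged
print(check(95))    # True: latency spike stands out
```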

Takeaways

Many of the products that use this technology do not specifically reference it. Usually, when you hear “analytics” nowadays, you can expect machine learning to be part of it. Most performance alerting (threshold crossing) leverages it in the realm of big data analytics. Most historical performance tools leverage machine learning to reduce the footprint of reporting. These three areas commonly have machine learning technology baked in.
 
What this means is that machine learning is NOT revolutionary technology that solves all our problems. At least not yet. It’s revolutionary technology that lowers the bar: problems can be solved more easily, with far fewer resources than ever before. The price you pay is simple: machine learning will not catch everything. You have to be fine with 80% quality at 0% effort.
 
Thanks again for all the great input, keep commenting and I will keep posting.


About the Author

Serial entrepreneur and operations subject matter expert who likes to help customers and partners achieve solutions that solve critical problems. Experience in traditional telecom, ITIL enterprise, global managed service providers, and datacenter hosting providers. Expertise in optical DWDM, MPLS networks, MEF Ethernet, COTS applications, custom applications, SDDC virtualized and SDN/NFV virtualized infrastructure. Based in the Dallas, Texas area and currently working for one of his founded companies, Monolith Software.

Service Assurance’s Future with Machine Learning

Thanks again for all the great feedback on this blog series. I want to continue the ongoing discussion by speculating about service assurance’s future with machine learning. There are infinite operational problems out there for providers and IT, and machine learning offers an inexpensive yet expansively flexible way to solve them. Here are some of the more extreme ideas I have had about common problems in the industry. If you have heard of anyone tackling these with machine learning, I would love to hear more about it.

“We Hate Rules” — Says Everyone

One common complaint I have heard from customers and partners for as long as I have been in business is about rules. “We hate rules!” I don’t like rules either, but the problem this technology solves is a big one: how do I decode vital fault details, arriving in a variety of formats, into operationally actionable events? Right now, people have compilers that take SNMP MIBs and export them into rules of some sort. From HPOV to IBM Netcool to open-source MRTG, it’s the same solution. What if we could apply machine learning instead? What if automation enriched faults and decided which KPIs are important? Google is a great source of truth: deconstruct the MIB into OIDs and google each one. Based on parsing the search results, you could decide whether an OID is worth collecting. Then apply some of the solutions we have already discussed, fault storms and reduction, to bubble up anomalies and chronics with zero human touch. How accurate could it be? The results might surprise you, and you could always add an organic gamification engine to curate them. Think about the possible outcome: no rules, no human touch, no integration costs, only ramp time. An interesting idea.

Are We Really Impacting the Customer?

I know we have all heard this one before: service impact. How do you know whether a fault is service impacting or not? If you notify a customer that they are down and they are not, they lower their opinion of you. Flip it around and they hate you. Understanding impact is a common problem. Common industry practice is to leverage a coarse event type category (think trap OID name). The problem is that this oversimplifies things, and there is a lot of guesswork in those rules (see above). What if the fault is in a lab environment? What if there is no traffic on that interface? What if redundancy is active, or has failed? Too much complexity. This is machine learning’s sweet spot. Imagine a backfill from ticketing showing that the customer confirmed there was an impact. Then link that data pool to a model of faults, and compare that model to the current situation to score the likelihood of impact. That way you are using a solid source of truth, the customer, to define the model. UPDATE: It’s true you could use network probes and scan the data to ensure the service is being used. Pretty expensive solution IMHO, buying two probes for every network service. It would be cheaper to use Cisco IP SLA, Juniper RPM, or the Rping MIB.
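A sketch of the ticket-backfill idea: count, per fault signature, how often customer-confirmed tickets said “impact,” then score new faults with that ratio. The signature strings here are hypothetical:

```python
from collections import defaultdict

class ImpactModel:
    """Score service-impact likelihood from customer-confirmed tickets."""

    def __init__(self):
        # signature -> [confirmed-impact count, total ticket count]
        self.counts = defaultdict(lambda: [0, 0])

    def backfill(self, signature, customer_confirmed_impact):
        impact, total = self.counts[signature]
        self.counts[signature] = [impact + bool(customer_confirmed_impact), total + 1]

    def likelihood(self, signature):
        impact, total = self.counts[signature]
        # Laplace smoothing: unseen signatures score 0.5, not a divide-by-zero.
        return (impact + 1) / (total + 2)

model = ImpactModel()
for _ in range(18):
    model.backfill("linkDown:core-router", True)
for _ in range(2):
    model.backfill("linkDown:core-router", False)
for _ in range(20):
    model.backfill("linkDown:lab-switch", False)

print(round(model.likelihood("linkDown:core-router"), 2))  # 0.86
print(round(model.likelihood("linkDown:lab-switch"), 2))   # 0.05
```

The customer, via the ticket backfill, defines the model; operations just reads the score.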

Which KPI is Important?

The last idea I have seen is around service quality management. In service management, customers complain that templates and models need to be pre-defined. Typical SLAs do not have the detail required to support a technology model to track them, and the research required to determine the performance metrics that drive an SLA takes too much time and effort. With machine learning and public algorithms like Granger causality, a new possibility emerges: the service manager can identify and maintain the model, whatever the product offered. How could it work? My thought is simple: use root-level metrics (availability, latency, bandwidth) to provide a baseline. All other vendor OIDs or custom KPIs can be collected and stored. With machine learning, you can develop models for each root metric and each custom metric. Using artificial intelligence, you can identify which custom metrics predict the degradation of a root one. Those are the metrics you want to poll more frequently, give higher priority, and use to power service quality metrics. The result would be less high-frequency polling and more meaningful prediction for service quality management.
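statsmodels ships a `grangercausalitytests` function, but the core test is small enough to sketch with numpy: do lagged values of a candidate KPI improve prediction of a root metric beyond the root metric’s own history? The KPI names below are made up for illustration:

```python
import numpy as np

def granger_score(x, y, lags=2):
    """F-style score: do lagged values of x improve prediction of y
    beyond y's own lags? Higher score -> x helps predict y."""
    n = len(y)
    rows = n - lags
    ones = np.ones((rows, 1))
    # Column k holds the series shifted by lag k+1.
    y_lags = np.column_stack([y[lags - k - 1 : n - k - 1] for k in range(lags)])
    x_lags = np.column_stack([x[lags - k - 1 : n - k - 1] for k in range(lags)])
    target = y[lags:]

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return resid @ resid

    rss_restricted = rss(np.hstack([ones, y_lags]))          # y's own history
    rss_full = rss(np.hstack([ones, y_lags, x_lags]))        # plus x's history
    df_denom = rows - (1 + 2 * lags)
    return ((rss_restricted - rss_full) / lags) / (rss_full / df_denom)

# Synthetic KPIs: queue depth drives latency one step later, not vice versa.
rng = np.random.default_rng(0)
queue_depth = rng.normal(size=400)
latency = np.zeros(400)
for t in range(1, 400):
    latency[t] = 0.8 * queue_depth[t - 1] + 0.2 * rng.normal()

print(granger_score(queue_depth, latency) > granger_score(latency, queue_depth))  # True
```

In this setup, queue depth would be promoted to high-frequency polling because it predicts latency degradation, while the reverse direction scores near the noise floor.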
 
 
Let me know your thoughts. These are some of the crazier ideas I have seen/heard, but I am sure you have heard of others.


Automated Detection and Mitigation of Chronic Issues

Let’s continue our discussion around machine learning and AI, focusing on chronic issues. The key is the automated detection and mitigation of chronic issues. As we have discussed in many blogs in this series, anomalies are unusual behaviors, while chronic situations occur all the time. One customer put it best: “I deal with more of the same than different every day. Help me there.” Chronics are not noise. The example I give is the scenario where, every night, a managed router goes down at a WAN site. The identified root cause: the router is plugged into a power receptacle controlled by a light switch. After the janitorial staff finishes, they flip the switch and DOWN IT GOES. The customer doesn’t care, because they turn on the light in the morning and never notice. Operations has no way to control this problem, but they need to track it. The RCA worked and it’s service impacting, but the customer does not care. Do you leverage a business-hours suppression engine? No, because if someone is working late and it goes down, you have lost the customer. As you can see, chronics are common and frustrating for operations. Too often they waste effort and cause complacency, and giving humans within operations the power to ignore an outage is always a bad idea.

What is an Outage?

The correct solution is to look for the typical behavior of the outage. If the outage follows that pattern, suppress it; otherwise, treat it as normal. Machine learning can detect the scenario in an automated fashion. Who compares the current pattern to the learned behavior model? Artificial intelligence. The chronic detector fires off a message that suppresses the outage during the learned window. This can be overridden by the anomaly detector: chronic conditions are covered, while exits from the model revert the chronic suppression. Together, humans in operations can focus on what they do best (ACT) instead of what can be difficult (remembering and tracking).
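A minimal sketch of the learned-window idea, using the janitor scenario from above: count historical outages per device and hour of day, and suppress only outages that land in a strongly recurring window:

```python
from collections import Counter

class ChronicSuppressor:
    """Suppress outages that recur in the same hour-of-day window."""

    def __init__(self, min_occurrences=5):
        self.history = Counter()    # (device, hour) -> outage count
        self.min_occurrences = min_occurrences

    def observe(self, device, hour):
        # Feed in each historical outage to learn the pattern.
        self.history[(device, hour)] += 1

    def should_suppress(self, device, hour):
        # Chronic: this device has gone down at this hour many times before.
        return self.history[(device, hour)] >= self.min_occurrences

sup = ChronicSuppressor()
for _night in range(30):                 # janitor flips the switch at 22:00
    sup.observe("wan-router-7", hour=22)

print(sup.should_suppress("wan-router-7", hour=22))  # True: learned chronic
print(sup.should_suppress("wan-router-7", hour=14))  # False: a mid-day outage pages
```

The mid-day outage, someone working late, falls outside the learned window and alerts normally, which is exactly what the business-hours suppression engine gets wrong.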
 

Example – Firmware Bugs

We have discussed a customer-driven behavior model, but what about a technology-driven one? One of my customers is doing heavy amounts of work in SDN/NFV. They have a VNF with vendor “firmware” that had a nasty reboot bug. The trouble? The reboot completed in under a second and recurred every 3 days for every VNF, depending on its boot cycle. Their system caught it, but the pattern was chronic: their network services dropped traffic and sessions entirely every three days. It took weeks to understand the bug, but with chronic detection it becomes a snap. Machine learning would include the firmware version in the model, and hundreds of VNFs on the same version would identify the problem. Machine learning with chronic detection would prevent a new ticket from opening every time it occurred. Instead, it would correlate to a root cause: bad vendor firmware. Once identified, operations can escalate to the vendor while keeping their screens clear of all the random reboots.

Takeaways

With proper chronic detection and mitigation, operations is free to do what it does best. No longer are their screens cluttered with non-actionable events. No longer do operational learning curves start at 6 months and stretch beyond 18. Operations needs the freedom to assimilate new technology; handling change with ease is the direction the business is demanding. How do you do that? By simplifying operations so they can do what they do best: ACT.


Learning Your Operational Performance

In business intelligence reporting, a common area is learning your operational performance. This means tracking operations’ workload and results. While this can be a sticky subject for operations, it’s also a great opportunity to improve. It’s a fact: when overloaded, operations suffers in the quality of its response. So it’s only common sense to track the NOC like you track the network. If operations is overloaded and causing quality issues, operations needs to be aware so that remediation actions occur. This could include staff augmentation or improved training regimes to drive better results. The trouble becomes how. Many focus on ticketing solutions; ITIL compliance lets management set specifications for operational performance. But those levels are not real-time. How does it help to know you needed help last Wednesday?

Where Machine Learning Comes Into Play

Again, ML/AI technology helps. Fault managers (most call these “event managers”) can track user and automation interactions with faults. Machine learning can be applied to this audit trail to create the standard operational model. The result is a discovered model: say a common fault usually takes 10 actions and 15 minutes to fix during business hours. When the NOC deviates from its previous score, good or bad, the AI can alert the group: either “GREAT JOB, here is the new bar” or “let’s RALLY, we are getting behind.”
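A sketch of the discovered model: learn the mean and spread of resolution time per fault type from the audit trail, then score new handling times as deviations, in either direction, from that baseline:

```python
from statistics import mean, stdev

def handling_deviation(history_minutes, latest_minutes):
    """Z-score of the latest resolution time against the learned
    baseline for this fault type (positive = slower than usual)."""
    return (latest_minutes - mean(history_minutes)) / stdev(history_minutes)

# Learned from the audit trail: this fault type usually takes ~15 minutes.
baseline = [14, 15, 16, 15, 14, 16, 15, 15]

print(handling_deviation(baseline, 45) > 3)    # True: far above the bar, time to rally
print(handling_deviation(baseline, 15) == 0)   # True: right on the baseline
```

A large negative score is worth celebrating (the new bar); a large positive one is the rally alert.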

Proactive Workload Management

Let’s get into the details. Say machine learning exposes that during a certain time of day and day of week, operations normally sees 4 level-1 tickets, 5 level-2 tickets, and 15 level-3 tickets. Then the system shows a systemic increase: 2x, then 5x, then 10x. AI agents can see this risk and alert, showing that an abnormal number of tickets has been opened. Operations managers can call in resources, and the system can send an advisory email to the ticketing administrators asking for a health check. Without ML/AI technology, running and interpreting reports takes so much time that most organizations will not even try. For those that do, the latency between needing a change and recognizing that need can be weeks.

Positive Impact to Operations

The result of operational performance monitoring should be smoother-working operations teams. Fewer errors and happy customers are what every NOC should try to provide to the organization. Accomplishing this with zero human touch, at a latency of less than 15 minutes, has been unimaginable up to this point. The difference is the emergence of ML/AI technologies.
 
Let me know what you think in the comments below. This can be a cringeworthy conversation to have with operations, but I believe near real-time operations performance management has value to NOCs today.


Minimizing Faults with Machine Learning and AI

A great topic of conversation is fault reduction. First, let’s not confuse terms. Data reduction is discarding data deemed not actionable. Fault reduction is prioritizing the fault stream so that the most impactful and actionable faults bubble up to the top. You should never ignore faults; they point to a problem that may cause an outage in the future. Fault reduction is a challenging field of service assurance. There are many ways to cheat the system, such as simply deleting non-actionable faults. But let’s get serious: we should focus on the best practice, which is identification and prioritization of faults to enable filtering. This blog describes minimizing faults with machine learning and AI. You can be the judge of the methodology.

Understanding Fault Noise

Imagine, if you will, a universal “noise” level for operations. When there are tons of outages, operations works only on outages; they want no noise. Outages are usually straightforward and actionable. You may want to use maintenance window filters, then verify that the services affected are in production. Many filters are straightforward. The trouble is moving beyond outages, or dealing with outages that are not actionable. Let’s talk about the first: problems and issues. Problems impair a service without taking it down, say a loss of redundancy. Usually you need two problems before the situation becomes an outage. Issues are things like misconfigurations that complicate matters and can cause problems. The trouble is that a mature, legacy network has tens to hundreds of outages, an exponentially larger number of problems, and exponentially more issues than problems. You are talking information overload. How do you rank them? That is where ML/AI is being leveraged. The secret ingredient is statistical rarity: if a problem or issue is new and unusual, there is a greater chance of a quick fix. The less rare it is, the more likely it is not actionable. But let’s test my hypothesis...
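Statistical rarity is cheap to compute. A sketch: rank the current fault queue by how often each signature has appeared historically, rarest first (the signatures are hypothetical):

```python
from collections import Counter

def rank_by_rarity(history, current_faults):
    """Order current faults rarest-first based on historical frequency."""
    seen = Counter(history)
    return sorted(current_faults, key=lambda fault: seen[fault])

history = (
    ["linkFlap:edge-switch-3"] * 500    # chronic, probably not actionable
    + ["bgpPeerDown:border-1"] * 40
    + ["fanFailure:core-router-1"] * 2  # rare: likely a real, quick fix
)
queue = ["linkFlap:edge-switch-3", "bgpPeerDown:border-1", "fanFailure:core-router-1"]

print(rank_by_rarity(history, queue))
# rarest first: fanFailure, then bgpPeerDown, then linkFlap
```

Nothing is deleted; the chronic flapper simply sinks to the bottom of the queue while the rare fan failure surfaces.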

The Rogue Device Example

For example, consider a rogue device. Someone adds a new device to the network without following best practices. You receive traps from the device but nothing else: a rogue device. When the new device first alarms, it creates an anomaly. This kicks off automation that validates the configuration and opens a ticket for manual intervention upon failure. The net result is zero human intervention, and best practice is upheld: no quasi-monitored production devices exist in the network.

Dueling Interfaces

Another example is interface monitoring. Say two interfaces on a switch are down. One goes down all the time; the other rarely does. Which do you think is more actionable? With ML/AI technology, you can create a model based on device/interface occurrence. If the current situation breaks that model, you can enrich the rarer alarm. This way operations can focus its time, when time is a constraint, on the more actionable fault: address what is easy first, then work on what is harder later. With prioritization, operations can increase its efficiency and maximize its value to the organization as a whole.

Takeaways

 
Reduction of the fault stream is something everyone wants. We must remember there are good and bad ways to achieve it. A good way is to rank your fault stream by rarity, and ML/AI technology can leverage rarity to increase operational efficiency. This is yet another advantage of leveraging event analytics for real-time operations.


An Umbrella for Fault Storm Management

Let’s continue our conversations around ML/AI in service assurance. I want to explore an illustrated real-life use case. The first example focuses on fault storm management. When bad things happen, they may create an explosion of faults, and each fault may be a separate situation. This operational overload is best described by a customer of mine as “a sea of red.”

Impact of Fault Storms on Operations


When fault storms occur, they cause many operational problems. First, they cause blindness: pre-existing problems and follow-on problems get mixed in, and suddenly you have a mess that may take hours of manual correlation to sort into responsibility alarms. Next, they cause automation gridlock. Most service-impacting alarms are set to immediately generate tickets; if 1,000 alarms try to open tickets at the same time, you may break your ticketing solution. Last, they cause mistakes. Because humans are sorting out the problem, errors are common: operations can ignore a separate problem by assuming it’s part of another root cause. Fault storms, while rare, are dangerous for operations in assuring customer services.

Addressing Fault Storms with Machine Learning and AI


Fault storms are a great use case for ML/AI technology. Machine learning sets the bar for a “storm”. Artificial intelligence can create a situation that encapsulates all the service-impacting faults. This isolation mitigates the “sea of red”: the storm situation is separated from pre-existing faults and all follow-on problems, which cures the blindness. Automation runs only on the situation created by ML/AI, which avoids the overload scenario. Fault storms are rare, but they can devastate NOC performance, and ML/AI technologies are a great choice for mitigating them.

Mitigating the Effects of Fault Storms


The best way to illustrate how this technology works is by showing a solution to a problem: a site outage. When you have a power outage at a remote site, it’s devastating. All services depending on that infrastructure are no longer available, there are hundreds of service-impacting alarms, and the final result is a complete mess for operations to clean up. Now, ML/AI can address the fault storm caused by the site isolation. All the alarms share the same location field key, so they have a commonality. The count of alarms from that location is tracked, and machine learning can build a model from those records. The rush of faults breaks that model, and the result is an anomaly centered on that specific location. The anomaly encapsulates the situation: all the service-impacting alarms. With a processed alarm list, the “sea of red” becomes clear as a bell. Operations can assign the single site isolation to an individual; after validation, that user can take action and dispatch. Instead of confusion and panic, operations continues to move forward like any other day. Business as usual, no stress, should be the goal.
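A sketch of the location-keyed model: learn a normal per-interval alarm count for each site, and when the live count blows far past it, fold those alarms into a single storm situation (thresholds and site names are illustrative):

```python
from collections import defaultdict

def detect_site_storms(baseline, live_alarms, factor=10):
    """Group live alarms by location; a site whose alarm count exceeds
    `factor` times its learned baseline becomes one storm situation."""
    by_site = defaultdict(list)
    for alarm in live_alarms:
        by_site[alarm["location"]].append(alarm)
    return {
        site: alarms
        for site, alarms in by_site.items()
        if len(alarms) > factor * baseline.get(site, 1)
    }

baseline = {"dallas-dc1": 2, "remote-site-42": 1}   # learned alarms per interval
live = [{"location": "remote-site-42", "alarm": f"svcDown-{i}"} for i in range(200)]
live += [{"location": "dallas-dc1", "alarm": "linkFlap-0"}]

storms = detect_site_storms(baseline, live)
print(list(storms))                    # ['remote-site-42']
print(len(storms["remote-site-42"]))   # 200: one situation, not 200 tickets
```

Two hundred alarms collapse into one assignable situation, while the single Dallas alarm flows through the normal pipeline untouched.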

Takeaways on Fault Storms

Fault storms can break people’s day; they invite failure by operations. Standing at the base of hundreds of outages to sort through is overwhelming. Operations has a choice: will they fail or will they shine? Leveraging ML/AI technology can keep them on the rails, so that success becomes standard operating procedure.
