Let’s continue our discussion of machine learning and AI by focusing on chronic issues. The key is the automated detection and mitigation of chronic issues. As we have discussed in many of the blogs in this series, anomalies are unusual behaviors, while chronic situations occur all the time. One customer put it best: “I deal with more of the same than different every day. Help me there.” Chronics are not noise. The example I give is the scenario where a managed router at a WAN site goes down every night. The identified root cause is a router plugged into a power receptacle controlled by a light switch. After the janitorial staff finishes, they flip the switch and DOWN IT GOES. The customer does not care, because they turn on the light in the morning and never notice the outage. Operations has no way to control this problem, but they need to track it. The RCA worked and the outage is service impacting, but the customer does not care. Do you leverage a business-hours suppression engine? No, because if someone is working late and the router goes down, you have lost the customer. As you can see, chronics are common and frustrating for operations. Too often they waste effort and breed complacency. Giving humans within operations the power to ignore an outage is always a bad idea.

Automated Detection and Mitigation of Chronic Issues

The correct solution is to look for the typical behavior of the outage. If an outage follows that pattern, suppress it; otherwise treat it as a normal outage. Machine learning can detect the scenario in an automated fashion. Who then compares the current pattern to the learned behavior model? Artificial intelligence. The chronic detector fires off a message that suppresses the outage during the learned window. The anomaly detector can override that suppression. This covers chronic conditions, while any departure from the learned model reverts the chronic suppression. Together, these let the humans in operations focus on what they do best: ACT, instead of on what can be difficult: remembering and tracking.
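
To make the idea concrete, here is a minimal sketch of a chronic detector. It is only an illustration under my own assumptions, not a product implementation: the `Outage` record, the `learn_model` and `classify` functions, the `min_days` threshold, and the `tolerance` factor are all hypothetical names and values. The "learning" here is simple counting of recurring start hours, standing in for a real machine learning model.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Outage:
    start: datetime        # when the device went down
    duration_min: float    # minutes down (so far, for an open outage)


def learn_model(history, min_days=5):
    """Learn which start hours recur on at least `min_days` distinct days,
    and the longest outage seen in those hours (assumed thresholds)."""
    days_by_hour = defaultdict(set)
    for o in history:
        days_by_hour[o.start.hour].add(o.start.date())
    hours = {h for h, days in days_by_hour.items() if len(days) >= min_days}
    longest = max(
        (o.duration_min for o in history if o.start.hour in hours),
        default=0.0,
    )
    return hours, longest


def classify(outage, model, tolerance=1.2):
    """Return 'suppress' when the outage fits the learned pattern, else 'alert'.
    An outage that starts at an unusual hour or runs too long exits the model."""
    hours, longest = model
    fits = outage.start.hour in hours and outage.duration_min <= longest * tolerance
    return "suppress" if fits else "alert"


# Seven nights of the janitor flipping the switch around 22:00.
history = [Outage(datetime(2024, 1, d, 22, 10), 540) for d in range(1, 8)]
model = learn_model(history)
print(classify(Outage(datetime(2024, 1, 8, 22, 5), 500), model))
print(classify(Outage(datetime(2024, 1, 8, 14, 0), 30), model))
```

The second call is the case that matters: an outage outside the learned window is never suppressed, which is what a fixed business-hours rule cannot guarantee.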

Example – Firmware Bugs

We have discussed a customer-driven behavior model, but what about a technology-driven one? One of my customers is doing heavy amounts of work in SDN/NFV. They have a VNF with vendor “firmware” that had a nasty reboot bug. The trouble? The reboot completed in under one second. It recurred every three days on every VNF, at a time that depended on each VNF’s boot cycle. Their system caught each reboot, but the problem was chronic. Their network services dropped traffic and sessions entirely every three days. It took weeks to understand the bug, but with chronic detection it becomes a snap. Machine learning would include the firmware version. Hundreds of VNFs on the same version would identify the problem. Machine learning with chronic detection would prevent a new ticket from opening every time the reboot occurred. Instead it would correlate the reboots to a root cause: bad vendor firmware. Once the cause is identified, operations can escalate to the vendor while keeping their screens clear of all the random reboots.
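
A rough sketch of that correlation follows. It is an illustration under assumed names and thresholds, not the customer's system: `chronic_firmware`, `route_event`, `min_vnfs`, and `min_reboots_each` are all hypothetical. Reboot events are grouped by firmware version, and once enough distinct VNFs on one version reboot repeatedly, later reboots attach to a single root-cause ticket instead of opening new ones.

```python
from collections import defaultdict


def chronic_firmware(reboots, min_vnfs=3, min_reboots_each=2):
    """reboots: iterable of (vnf_id, firmware_version), one tuple per reboot.
    Returns the versions on which at least `min_vnfs` distinct VNFs each
    rebooted at least `min_reboots_each` times (assumed thresholds)."""
    counts = defaultdict(lambda: defaultdict(int))
    for vnf_id, version in reboots:
        counts[version][vnf_id] += 1
    flagged = set()
    for version, per_vnf in counts.items():
        repeaters = [v for v, n in per_vnf.items() if n >= min_reboots_each]
        if len(repeaters) >= min_vnfs:
            flagged.add(version)
    return flagged


def route_event(vnf_id, version, flagged, open_tickets):
    """Attach a reboot to the one root-cause ticket for a flagged version,
    otherwise open (or reuse) a per-device ticket."""
    key = ("firmware", version) if version in flagged else ("device", vnf_id)
    open_tickets.setdefault(key, []).append(vnf_id)
    return key


# Four VNFs on version 1.2.3 each reboot twice; one VNF on 2.0.0 reboots once.
events = [(f"vnf{i}", "1.2.3") for i in range(4)] * 2 + [("vnf9", "2.0.0")]
flagged = chronic_firmware(events)
tickets = {}
for vnf_id, version in events:
    route_event(vnf_id, version, flagged, tickets)
print(sorted(tickets))
```

Eight reboots on the bad version collapse into one ticket keyed by firmware, which is the ticket operations escalates to the vendor.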

Takeaways

With proper chronic detection and mitigation, operations are free to operate as they do best. No longer are their screens cluttered with non-actionable events. No longer do operational learning curves start at 6 months and stretch beyond 18 months. Operations need the freedom to assimilate new technology. Handling change with ease is the direction the business is asking for. How do you do that? By simplifying operations so that they do what they do best: ACT.

About the Author

Serial entrepreneur and operations subject matter expert who likes to help customers and partners achieve solutions that solve critical problems. Experience in traditional telecom, ITIL enterprise, global managed service providers, and datacenter hosting providers. Expertise in optical DWDM, MPLS networks, MEF Ethernet, COTS applications, custom applications, SDDC virtualized, and SDN/NFV virtualized infrastructure. Based in the Dallas, Texas, US area and currently working for one of his founded companies, Monolith Software.

Days are changing. The physical is becoming virtual. It started in the datacenter and now it’s in the network. The surge of NFV is creating a new market freeze. Telecoms are waiting. They don’t want to buy physical, but they haven’t seen enough success in the virtual network world. A common reason for the lack of success has been the vendors and devices employed. Immature is a common descriptor for the marketplace. This blog intends to educate readers on what devices are out there. Here I catalog the VNF types I have seen. More people need to prepare themselves for the realities of this new virtual world. The issue of NFV adoption is one the industry must address.

Challenges in NFV Adoption

The facts on the ground are these: the maturity of vendors defines what is possible. If your vendor deems tracking next hop and neighbor topology irrelevant, then you may not be able to perform accurate root cause analysis. If your vendor deems that MAC addresses can change every time you reboot a VNF, then your network ARP table cache will look crazy. Virtualization breaks the rules, and some vendors are not ready to deal with that fallout. The greatest threat and obstacle to VNF adoption is the quality and quantity of VNFs available. This is a serious challenge. Out of all the stories I have heard and told, VNF vendor maturity is usually at the core of the issue. NFV adoption will continue to be challenged until a process can be developed to mitigate the issue.

Legacy VNFs

Let’s talk about categorizing the VNFs. We are seeing most VNFs fall into these silos. First there are legacy VNFs. They are your standard fare, just ported into a virtualized infrastructure. Virtual switches are usually embedded into the hypervisor. Routers are common, led by Cisco and Juniper. The mobile (3GPP) infrastructure is adopting VNFs, so Ericsson, Nokia, and Cisco have offerings. Virtualized security has gotten plenty of traction. Most firewall PNFs have been ported, like Cisco, Fortinet, Palo Alto, and Checkpoint. These VNF types are PNFs emulated on a Linux OS. I compare the concept to game emulators. The “VNF” is nothing more than a Linux operating system emulating the BIOS of the pre-existing PNF. I call them “legacy” because all the integrations and interfaces are identical to their PNF brethren. What is interesting is that the VNF image ships with the Linux embedded. For any practical purpose you cannot manage the Linux operating system underneath. The VNF overrides the SSH, SNMP, and telnet servers, so you cannot tell there is Linux underneath. Given that, though, legacy VNFs are the easiest to manage. They are >90% like their PNF counterparts, which provides predictability.

MANO-enabled VNFs

Legacy is exactly that: old! Most product offerings focus on new services using new technologies. This is where you will experience the second type of VNF: MANO-enabled. These new VNFs support and act in concert with an orchestrated, elastic infrastructure. SD-WAN is a perfect example of new network technology, with VeloCloud/VMware, Viptela/Cisco, and Versa as vendors. The trouble with these is that many of the vendors don’t know the “rules”. Documentation is errant or missing. Common and core functionality support is missing, like the concept of interface monitoring. With giant gaps in functionality, especially from an assurance perspective, how can we maintain SLAs? When APIs have no documentation, you can bet that the vendor will not be able to provide best practices for KPIs. The new VNFs with new vendors have the hottest technology, but also the greatest challenges.

Custom VNFs

Rules? We don’t need rules! That is what the third VNF type is all about: custom VNFs. One of the advantages of virtualizing networks is that telecoms can create their own VNFs. Think this is not likely? Well, one of the first customers I talked to used a custom VNF. They used Linux OS firewalls with custom code to configure and manage them. Sprint has C3PO, which will be using open source mobile core components. These VNFs give customers increased flexibility, but they can come with a price: lack of consistency. Managing custom VNFs ends up being more like managing custom applications than network elements. The good news is that you will be able to influence the changes needed. The bad news is you may be having those conversations AFTER those devices roll into production. A DevOps process can help support managing them effectively. The challenge is that your assurance and delivery tools will need to support agile processes. Custom VNFs are the most challenging for operations. They flip the traditional sourcing and on-boarding models predefined in the industry.

On-boarding Strategy

With such diversity in VNF types, how can operations ensure proper engineering of new services and offerings based upon them? I call it an “on-boarding strategy”. When new VNFs are being planned or explored, operations must be involved to perform a proper assessment. This assessment must be holistic, including a multitude of different items. While many of these items are common in the PNF world, they have different levels of importance in the VNF world.

Example Checklist

Having a mature on-boarding strategy addresses NFV adoption. Below is my simplistic list. My recommendation is that you build your own:

Fault – This includes service-affecting and non-service-affecting issues and log collection. It’s important to understand the protocol, format, and overhead required.
Performance – This includes counters, KPIs, and KQIs associated with the technology. Passive vs active collection, as well as protocol/format, create challenges.
Inventory – This includes interfaces and “slot/card” information. Protocol/format/overhead is important as well.
Configuration – This includes how to recreate the VNF. A VNF, like a PNF, is a blank slate until it has a configuration. Recreating the VNF requires configuration collection, a repository, and a policy manager to restore it.
Topology – The next hops by network layer (1/2/3/4, etc) allows for accurate root cause analysis. Without accuracy, your automation loses value.
Automation – Can you snapshot or change configuration without service impact? Understanding how well the VNF handles change allows better understanding of the limits of your tooling.
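
One way to keep such a checklist usable by procurement, engineering, and vendors is to hold it as data and score each candidate VNF against it. The sketch below is only an illustration: the `CHECKLIST` tuple mirrors the six areas above, while the `assess` function and the yes/no answer format are my own assumptions. A real assessment would capture far more detail per area.

```python
CHECKLIST = (
    "fault",
    "performance",
    "inventory",
    "configuration",
    "topology",
    "automation",
)


def assess(vnf_name, answers):
    """answers maps a checklist area to True (requirement met) or False.
    Areas missing from `answers` are treated as gaps."""
    gaps = [area for area in CHECKLIST if not answers.get(area, False)]
    return {"vnf": vnf_name, "ready": not gaps, "gaps": gaps}


# Hypothetical candidate: no neighbor topology, automation never evaluated.
report = assess(
    "router-a",
    {
        "fault": True,
        "performance": True,
        "inventory": True,
        "configuration": True,
        "topology": False,
    },
)
print(report)
```

Treating an unanswered area as a gap is deliberate: it forces the conversation with the vendor before the device rolls into production rather than after.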

Being Proactive

Be proactive in using your on-boarding strategy so you can identify problems before they slow down your delivery. Many vendors offer similar VNFs. Say that, from a contract perspective, switching between router vendors is insignificant, but the vendors vary in their on-boarding requirements. The more information you have, the better your chance of avoiding challenges and delay. The more delay you avoid, the more your NFV adoption will accelerate.

Impact on Tools

VNF maturity directly impacts your tooling. Automation becomes limited by the lowest common denominator of VNF capabilities. If you cannot perform accurate root cause analysis, then you cannot perform automation. If your VNF crashes when you perform a configuration pull, then automated restoration is limited. If the VNF stops logging under performance load, your operational processes are threatened. Virtualization of your network challenges legacy tools and processes like nothing I have seen before. Earlier in my career, adopting a new tool to cover a new technology domain was enough, even though it increased complexity and cost. NFV is different because it’s not a new domain; it unifies all domains.
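
The lowest-common-denominator point can be shown as a set intersection. This is a toy sketch with made-up VNF names and capability labels, and the `automatable` function is hypothetical: whatever automation you apply across the whole network is limited to the capabilities every VNF supports.

```python
def automatable(capabilities_by_vnf):
    """Return the capabilities shared by every VNF in the mapping."""
    sets = [set(caps) for caps in capabilities_by_vnf.values()]
    return set.intersection(*sets) if sets else set()


# Hypothetical inventory of what each VNF can safely do.
fleet = {
    "legacy-router": {"config_pull", "topology", "logging_under_load"},
    "sdwan-edge": {"config_pull", "logging_under_load"},
    "custom-firewall": {"logging_under_load"},
}
print(automatable(fleet))
```

One weak VNF shrinks the shared set for everyone, which is why on-boarding gaps surface later as tooling limits.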

When looking at tools, you should focus on three key areas. First is a no-brainer: scale. When you are changing the infrastructure on which all new network elements will run, scale should be the first discussion point. Next is flexibility. When you are unifying all network domains with a common infrastructure, you need tools that cross domains. Lastly, you need a tool with an automation focus. Automation is how you will grow the network, but only if your tools support it. Your service delivery and assurance solutions need to embrace your new network infrastructure. You do not want them to fight you tooth and nail. With the proper tools and process, your NFV adoption will accelerate and enable operations.

Lessons Learned to Enable NFV Adoption

Virtualization is causing market confusion and hesitancy

Vendor maturity is one of the largest issues

There are three types of VNFs

Legacy VNFs are like PNFs and you can expect similar feature/functionality, but with weird twists

MANO-enabled VNFs have stable devices but are inconsistent from a feature/functionality perspective
Custom VNFs break all the rules and run more like applications than network elements

Create onboarding processes based upon your requirements. Make sure they are documented and easy to use by procurement, engineering, and vendors

Make sure you have tools with a focus on flexibility, scale, and automation

My Advice on NFV Adoption

Leverage process -> Develop and enforce a VNF onboarding strategy

Vendor management -> Only use VNFs from vendors you trust and with whom you have a strong relationship

Get Experience -> Model a network with the devices you want to use, then pilot that network with real traffic