Category Archives: SDN/NFV

Automated Detection and Mitigation of Chronic Issues

Let’s continue our discussion of machine learning and AI by focusing on chronic issues. The key is the automated detection and mitigation of chronic issues. As we have discussed in many of the blogs in this series, anomalies are unusual behaviors, while chronic situations occur all the time. One customer put it best: “I deal with more of the same than different every day, so help me there”. Chronics are not noise. The example I give is the scenario where every night a managed router goes down at a WAN site. The identified root cause is a router plugged into a power receptacle controlled by a light switch. After the janitorial staff finishes, they flip the switch and DOWN IT GOES. The customer does not care, because they turn on the light in the morning and never notice the outage. Operations has no way to control this problem, but they need to track it. The RCA worked and the outage is service impacting, but the customer does not care. Do you leverage a business-hours suppression engine? No, because if someone is working late and the router goes down, you have lost the customer. As you can see, chronics are common and frustrating for operations. Too often they waste effort and cause complacency. Giving humans within operations the power to ignore an outage is always a bad idea.

What is an Outage?

The correct solution is to look for the typical behavior of the outage. If the outage follows that pattern, suppress it; otherwise treat it as a normal outage. Machine learning can detect the scenario in an automated fashion. What compares the current pattern to the learned behavior model? Artificial intelligence. The chronic detector fires off a message that suppresses the outage during the learned window. The anomaly detector can override this. Together they cover the chronic condition, while any exit from the model reverts the chronic suppression. Humans in operations can then focus on what they do best, which is to ACT, instead of what can be difficult: remembering and tracking.
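To make the suppression logic concrete, here is a minimal sketch in Python. It is not how any particular product works: a simple min/max time window stands in for a real learned behavior model, and the names `ChronicModel`, `learn_model`, and `should_suppress` are hypothetical. It also assumes that outage start times do not straddle midnight.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class ChronicModel:
    """Learned outage window for one device (hypothetical structure)."""
    start_minute: int        # earliest expected outage start, minutes after midnight
    end_minute: int          # latest expected outage start, minutes after midnight
    max_duration: timedelta  # longest outage the model still treats as chronic


def minute_of_day(ts: datetime) -> int:
    return ts.hour * 60 + ts.minute


def learn_model(history: List[Tuple[datetime, datetime]],
                min_occurrences: int = 5,
                slack: timedelta = timedelta(minutes=30)) -> Optional[ChronicModel]:
    """Build a window from past (down, up) pairs, or None if no tight pattern exists.

    Assumption: outage start times do not straddle midnight.
    """
    if len(history) < min_occurrences:
        return None
    slack_min = int(slack.total_seconds() // 60)
    starts = [minute_of_day(down) for down, _up in history]
    if max(starts) - min(starts) > 2 * slack_min:
        return None  # start times are too scattered to call this chronic
    longest = max(up - down for down, up in history)
    return ChronicModel(
        start_minute=max(0, min(starts) - slack_min),
        end_minute=min(24 * 60 - 1, max(starts) + slack_min),
        max_duration=longest + slack,
    )


def should_suppress(model: Optional[ChronicModel],
                    down: datetime,
                    elapsed: timedelta) -> bool:
    """True only while the outage still matches the learned behavior."""
    if model is None:
        return False
    in_window = model.start_minute <= minute_of_day(down) <= model.end_minute
    return in_window and elapsed <= model.max_duration
```

An outage that starts outside the window, or that runs longer than anything seen before, falls out of the model and is treated as a normal outage. That is the "working late" case from the light-switch example.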
 

Example – Firmware Bugs

We have discussed a customer-driven behavior model, but what about a technology-driven one? One of my customers is doing a heavy amount of work in SDN/NFV. They have a VNF with vendor “firmware” that had a nasty reboot bug. The trouble? The reboot completed in under 1 second. It recurred every 3 days for every VNF, depending upon its boot cycle. While their system caught it, it was chronic: their network services dropped traffic and sessions entirely every three days. It took weeks to understand the bug, but with chronic detection it becomes a snap. Machine learning would include the firmware version. Hundreds of VNFs on the same version would identify the problem. Machine learning with chronic detection would prevent a new ticket from opening every time the reboot occurred. Instead it would correlate the reboots to a root cause: bad vendor firmware. Once the cause is identified, operations can escalate to the vendor while keeping their screens clear of all the random reboots.
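As a rough illustration of that correlation, the sketch below groups reboot events by firmware version and flags a version when enough VNFs on it reboot at a steady interval. A plain interval check stands in for machine learning here, and the function name, the event tuple layout, and the thresholds are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (vnf_id, firmware_version, reboot_time)
RebootEvent = Tuple[str, str, datetime]


def find_chronic_firmware(events: Iterable[RebootEvent],
                          min_vnfs: int = 3,
                          tolerance: timedelta = timedelta(hours=6)) -> Dict[str, float]:
    """Return {firmware_version: typical reboot period in hours} for suspect versions."""
    per_vnf = defaultdict(list)
    for vnf_id, firmware, ts in events:
        per_vnf[(firmware, vnf_id)].append(ts)

    periods = defaultdict(list)  # firmware -> one median period per periodic VNF
    for (firmware, _vnf_id), stamps in per_vnf.items():
        stamps.sort()
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        # A VNF counts as periodic when its reboot gaps are nearly constant.
        if len(gaps) >= 2 and max(gaps) - min(gaps) <= tolerance:
            hours = [g.total_seconds() / 3600 for g in gaps]
            periods[firmware].append(median(hours))

    # Only flag a version when several VNFs independently show the pattern.
    return {fw: median(p) for fw, p in periods.items() if len(p) >= min_vnfs}
```

Each flagged version can then become a single root-cause record for the vendor escalation, instead of one ticket per reboot.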

Takeaways

With proper chronic detection and mitigation, operations is free to operate as it does best. No longer are their screens cluttered with non-actionable events. No longer do operational learning curves start at 6 months and stretch beyond 18 months. Operations needs the freedom to assimilate new technology. Handling change with ease is the direction the business is asking for. How do you do that? By simplifying operations so that they do what they do best: ACT.

About the Author

Serial entrepreneur and operations subject matter expert who likes to help customers and partners achieve solutions that solve critical problems. Experience in traditional telecom, ITIL enterprise, global managed service providers, and datacenter hosting providers. Expertise in optical DWDM, MPLS networks, MEF Ethernet, COTS applications, custom applications, SDDC virtualized, and SDN/NFV virtualized infrastructure. Based in the Dallas, Texas, US area and currently working for one of the companies he founded, Monolith Software.

Projecting how 5G will impact operations

It’s coming, they say. Not winter (that’s here already), but mobile 5G is coming. Sure, the latency will allow a new generation of mobility applications. The new RF control functions will allow better elasticity. The network slicing will finally allow MPLS-like functionality in a 3GPP network. But I think it’s time someone asked the basic questions on how 5G will impact operations.


Basic Tutorial and Terminology

5G means different things to different people. The money involved means the industry will offer competing visions. The essence of 5G is improving latency, bandwidth, network slicing, and elasticity. Adding more endpoints (antennas) and distributed control functions will reduce the latency. The control functions permit more distributed switching. The extra endpoints mean fewer network miles are required. These latency benefits are the biggest game changer so far.

The simplified control functions also enable better elasticity options. Scaling up/down/in/out will allow a more natural self-optimizing network. Adding class-of-service capabilities (like QoS in MPLS) will allow tiered network options. While net-neutrality questions still loom, this adds diversity to the single-use mobile network. The amount of network required is still to be defined, but it looks to be at least 10x. Mobile operators are also taking this opportunity to diversify their vendors. Most US providers are adopting radios from at least two vendors.

The bottom line for operations is potentially terrifying. Exponential scale in the network and backhaul is to be expected. Complexity increases exponentially with an “always in flux” network. New network offerings and customer bases are bound to cause trouble. Top those off with at least double the vendors. Houston, we have a problem!

 


Next-generation Mobile Network Services

As you can see, the investment will be significant. The upside should be worth it though. Traditional mobile services are commoditized, as I blogged here last year. Data, voice, and SMS do not provide enough value to the customer. The new services provided will change that. With the latency benefits, IoT services will become more viable. I detailed IoT more in the previous blog around IoT Service Assurance. In my opinion, the most intriguing new offering is “fixed wireless access” (FWA). Ericsson did a really nice write-up, available here. Verizon is augmenting their FIOS offer with an FWA offer in 2018. This means that mobile providers are entering the cable access market (HFC). This sets Verizon against Comcast or T-Mobile against Charter. Gone will be the days when we are locked into high-speed internet options solely by who developed the neighborhood.

These new network services will drive new revenue potential. Most of these services will have direct competition, so quality will matter. With all these changes, operations should expect challenges. We should all expect quality problems with these new services.

 


Exponential Scale and Complexity

The first great challenge will be scale and complexity. Tripling the number of devices in your network will stress your tools. Realistically, can your OSS handle a 10x-1,000x increase in network size? But this is not the only issue. The self-optimizing vRAN means that the network will constantly be in flux. How can you troubleshoot a network that is always changing? Given the size of the investment, it only makes sense that multiple vendors will be used. Yet most mobile operators depend heavily upon their NEPs to provide OSS solutions.

The solution is simple; in fact, it is simplification. A vendor-, technology-, and product-agnostic OSS solution is a must. As you add tools, the complexity limits functionality. Low-level optimization and orchestration can be done at the element manager layer. This increases the scale of both layers of the solution.

 


Becoming Geospatial Again

Remember the HP OpenView days of maps? Well, get prepared for those concepts to return. Like Wi-Fi antennas, 5G deploys radios with geospatial design in mind. Geospatial information (Lat/Long) will then drive behavior. GIS correlation and visualization then become a need. Correlation and analytics are vital to reducing the complexity of 5G vRAN networks. External network conditions become more indicative. Things like hurricanes, floods, and power outages need to be taken into account. This is very similar to the cable industry’s (HFC) access monitoring requirements. Operations will need help because most legacy tools are inadequate in these areas.
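A minimal sketch of what GIS correlation could look like: given the Lat/Long of each alarming radio and the location of an external event such as a power outage, split the alarms into those inside the event footprint and those outside it. The great-circle (haversine) distance and a circular footprint are simplifying assumptions, and the function names and alarm tuple layout are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Iterable, List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Hypothetical alarm shape: (radio_id, latitude, longitude) in decimal degrees
Alarm = Tuple[str, float, float]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two Lat/Long points, in kilometers."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def correlate_to_event(alarms: Iterable[Alarm],
                       event_lat: float,
                       event_lon: float,
                       radius_km: float) -> Tuple[List[str], List[str]]:
    """Split alarming radios into (inside, outside) a circular event footprint."""
    inside, outside = [], []
    for radio_id, lat, lon in alarms:
        near = haversine_km(lat, lon, event_lat, event_lon) <= radius_km
        (inside if near else outside).append(radio_id)
    return inside, outside
```

Alarms that land inside the footprint can be rolled up under the external event, which leaves only the outside ones for individual troubleshooting.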


Bending but not Breaking with Elasticity

Elastically scaling the network is not a 5G concept. The trouble is we have seen that many, if not most, of the network functions are beasts. Components like MMEs and SGWs take hours to spin up and configure. This reduces the value of elasticity in 3GPP networks. 5G replaces many complex functions, like eNodeBs, with smaller control functions. This will enable all the promises of network virtualization. The question becomes whether operations has the orchestration tools to enable automation. Some NEPs are building those functions into the element managers or VNFMs. Most service assurance tools do not have the capability to handle the network flux caused by real-time elasticity. Operations will need to review their tools to make sure they are agile enough.


Operations Face Uphill Battle

The industry is beating the drums: 5G is coming. But I do not hear from the industry how operations will consume it. From my experience, nobody knows. Legacy tools make it too difficult to share information. They are too tied to a vendor or technology domain. Most tools have difficulty scaling to 100k-5M devices. This forces most customers to silo their monitoring and management. This creates a lack of visibility, which drives quality issues. Most operational processes are ticket- or fault-centric. Correlation is lagging behind. There will be too many faults to process. Visualization of the network will be a critical need, but may not be possible. Like winter, 5G is coming, so where is my 700 ft wall?

 

Shawn Ennis


 

How Vendor Maturity Challenges NFV Adoption

Days are changing. The physical is becoming virtual. It started in the datacenter and now it’s in the network. The surge of NFV is creating a new market freeze. Telecoms are waiting. They don’t want to buy physical, but they haven’t seen enough success in the virtual network world. A common reason for the lack of success has been the vendors and devices employed. Immature is a common descriptor for the marketplace. This blog intends to educate readers on what devices are out there. Here I catalog the VNF types I have seen. More people need to prepare themselves for the realities of this new virtual world. The issue of NFV adoption is one the industry must address.
 

Challenges in NFV Adoption

The facts on the ground are these: the maturity of vendors defines what is possible. If your vendor deems tracking next hop & neighbor topology as irrelevant, then you may not be able to perform accurate root cause analysis. If your vendor deems that MAC addresses can change every time you reboot a VNF, then your network ARP table cache will look crazy. Virtualization breaks the rules and some vendors are not ready to deal with that fallout. The greatest threat and obstacle to VNF adoption is the quality and quantity of VNFs available. This is a serious challenge. Out of all the stories I have heard and told, VNF vendor maturity is usually at the core of the issue. NFV adoption will continue to be challenged until processes can be developed to mitigate the issue.
 
Legacy VNFs

Let’s talk about categorizing the VNFs. We are seeing most VNFs fall into these silos. First there are legacy VNFs. They are your standard fare, just ported into a virtualized infrastructure. Virtual switches are usually embedded into the hypervisor. Routers are common, led by Cisco and Juniper. The mobile (3GPP) infrastructure is adopting VNFs, so Ericsson, Nokia, and Cisco have offerings. Virtualized security has gotten plenty of traction. Most firewall PNFs have been ported, like Cisco, Fortinet, Palo Alto, and Check Point. These VNF types are PNFs emulated on a Linux OS. I compare the concept to game emulators. The “VNF” is nothing more than a Linux operating system emulating the BIOS of the pre-existing PNF. I call them “legacy” because all the integrations and interfaces are identical to their PNF brethren. What is interesting is that the VNF image ships with the Linux embedded. For any practical purpose you cannot manage the Linux operating system underneath. The VNF overrides the SSH, SNMP, and telnet servers, so you cannot tell there is Linux underneath. That said, legacy VNFs are the easiest to manage. They are >90% like their PNF counterparts, which provides predictability.
 
MANO VNFs

Legacy is exactly that: old! Most product offerings focus on new services using new technologies. This is where you will experience the second type of VNF, the MANO-enabled VNF. These new VNFs support and act in concert with an orchestrated, elastic infrastructure. SD-WAN is a perfect example of new network technology, with VeloCloud/VMware, Viptela/Cisco, and Versa as vendors. The trouble with these is that many of the vendors don’t know the “rules”. Documentation is erroneous or missing. Support for common and core functionality, like the concept of interface monitoring, is missing. With giant gaps in functionality, especially from an assurance perspective, how can we maintain SLAs? When APIs have no documentation, you can bet that the vendor will not be able to provide best practices for KPIs. The new VNFs from new vendors have the hottest technology, but also the greatest challenges.
 
Custom VNFs

Rules? We don’t need rules! That is what the third VNF type is all about: custom VNFs. One of the advantages of virtualizing networks is that telecoms can create their own VNFs. Think this is not likely? Well, one of the first customers I talked to used a custom VNF. They used Linux OS firewalls with custom code to configure and manage them. Sprint has C3PO, which will be using open source mobile core components. These VNFs give customers increased flexibility, but they can come with a price: lack of consistency. Managing custom VNFs ends up being more like managing custom applications than network elements. The good news is that you will be able to influence the changes needed. The bad news is you may be having those conversations AFTER those devices roll into production. A DevOps process can help support managing them effectively. The challenge is that your assurance and delivery tools will need to support agile processes. Custom VNFs are the most challenging for operations. They flip the traditional sourcing and on-boarding models predefined in the industry.
 
Onboarding VNFs

With such diversity in VNF types, how can operations ensure proper engineering of new services and offerings based upon them? I call it an “on-boarding strategy”. When new VNFs are being planned or explored, operations must be involved to perform a proper assessment. This assessment must be holistic, including a multitude of different items. While many of these items are common in the PNF world, they have different levels of importance in the VNF world.

Example Checklist

Having a mature on-boarding strategy addresses NFV adoption. Below is my simplistic list; my recommendation is that you build your own:
  • Fault – This includes service and non-service affecting issues and log collection. It’s important to understand the protocol, format, and overhead required.
  • Performance – This includes counters, KPIs, and KQIs associated with the technology. Passive vs active collection as well as protocol/format creates challenges.
  • Inventory – This includes interfaces and “slot/card” information. Protocol/format/overhead is important as well.
  • Configuration – This includes how to recreate the VNF. A VNF, like a PNF, is a blank slate until it has a configuration. Recreating the VNF requires configuration collection, a repository, and a policy manager to restore it.
  • Topology – The next hops by network layer (1/2/3/4, etc.) allow for accurate root cause analysis. Without accuracy, your automation loses value.
  • Automation – Can you snapshot or change configuration without service impact? Understanding how well the VNF handles change allows better understanding of the limits of your tooling.
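
One way to keep such a checklist usable by procurement, engineering, and vendors is to record each assessment as data. The sketch below is only an illustration of that idea: the class name, field names, and pass/fail scoring are assumptions, and your own list will likely carry more detail per item.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The six checklist areas from the list above.
CHECKLIST = ("fault", "performance", "inventory",
             "configuration", "topology", "automation")


@dataclass
class VnfAssessment:
    """On-boarding assessment for one candidate VNF (hypothetical structure)."""
    vendor: str
    product: str
    results: Dict[str, bool] = field(default_factory=dict)  # area -> passed?

    def record(self, area: str, passed: bool) -> None:
        if area not in CHECKLIST:
            raise ValueError("unknown checklist area: " + area)
        self.results[area] = passed

    def gaps(self) -> List[str]:
        """Areas that failed or were never assessed, in checklist order."""
        return [area for area in CHECKLIST if not self.results.get(area, False)]

    def ready(self) -> bool:
        return not self.gaps()
```

Treating an unassessed area as a gap is a deliberate choice here: it keeps a VNF from rolling into production just because nobody asked the question.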

Being Proactive

Be proactive in using your on-boarding strategy so you can identify problems before they slow down your delivery. Many vendors offer similar VNFs. Say that, from a contract perspective, switching between router vendors is insignificant, but the vendors vary in their on-boarding requirements. The more information you have, the better your chance of avoiding challenges and delay. The more you avoid delay, the more your NFV adoption will accelerate.
 
 Impact on Tools

 
VNF maturity directly impacts your tooling. Automation becomes limited by the least common denominator of VNF capabilities. If you cannot perform accurate root cause analysis, then you cannot perform automation. If your VNF crashes when you perform a configuration pull, then automated restoration is limited. If the VNF stops logging under performance load, your operational processes are threatened. Virtualization of your network challenges legacy tools and processes like nothing I have ever seen. In my career, adopting a new tool to cover a new technology domain was enough, albeit at increased complexity and cost. NFV is different because it’s not a new domain; it unifies all domains.
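The “least common denominator” idea can be sketched as a set intersection: an automation is only safe to run across the network if every VNF supports everything it needs. The capability and automation names below are made up for the example and do not come from any vendor or standard.

```python
from typing import Dict, List, Set


def common_capabilities(vnf_caps: Dict[str, Set[str]]) -> Set[str]:
    """Capabilities supported by every VNF (the least common denominator)."""
    if not vnf_caps:
        return set()
    return set.intersection(*vnf_caps.values())


def allowed_automations(vnf_caps: Dict[str, Set[str]],
                        requirements: Dict[str, Set[str]]) -> List[str]:
    """Automations whose required capabilities all VNFs provide."""
    common = common_capabilities(vnf_caps)
    return sorted(name for name, needed in requirements.items() if needed <= common)


# Hypothetical inventory: one VNF cannot survive a configuration pull.
EXAMPLE_VNFS = {
    "vrouter-a": {"fault", "topology", "config_pull", "config_push"},
    "vfirewall-b": {"fault", "topology", "config_pull", "config_push"},
    "sdwan-c": {"fault", "topology"},
}

# Hypothetical automations and what each one needs from the VNFs.
EXAMPLE_REQUIREMENTS = {
    "root_cause_analysis": {"fault", "topology"},
    "automated_restoration": {"config_pull", "config_push"},
}
```

With this example data, `allowed_automations(EXAMPLE_VNFS, EXAMPLE_REQUIREMENTS)` keeps root cause analysis but drops automated restoration, because a single VNF without a safe configuration pull limits the whole network.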
When looking at tools, you should focus on three key areas. First is a no-brainer: scale. When you are changing the infrastructure on which all new network elements will run, scale should be the first discussion point. Next is flexibility. When you are unifying all network domains with a common infrastructure, you need tools that cross domains. Lastly, you need a tool with an automation focus. Automation is how you will grow the network, but only if your tools support it. Your service delivery and assurance solutions need to embrace your new network infrastructure. You do not want them to fight you tooth and nail. With the proper tools and process, your NFV adoption will accelerate and enable operations.
 

Lessons Learned to Enable NFV Adoption

  • Virtualization is causing market confusion and hesitancy
  • Vendor maturity is one of the largest issues
  • There are three types of VNFs
  1. Legacy VNFs are like PNFs and you can expect similar feature/functionality, but with weird twists
  2. MANO-enabled VNFs have stable devices but inconsistency from a feature/functionality perspective
  3. Custom VNFs break all the rules and run more like applications than network elements
  • Create on-boarding processes based upon your requirements. Make sure they are documented and easy to use by procurement, engineering, and vendors
  • Make sure you have tools with a focus on flexibility, scale, and automation
 

My Advice for NFV Adoption

  • Leverage process -> Develop and enforce a VNF on-boarding strategy
  • Vendor management -> Only use VNFs from vendors you trust and with whom you have a strong relationship
  • Get Experience -> Model a network with the devices you want to use, then pilot that network with real traffic
Shawn Ennis
