My Journey into Machine Learning with AI

I’m alive! Sorry for the lengthy lapse in updates, but things have been so busy. With the recent release of some of my work, I now have some time to share what I have been working on. As the title suggests, it’s all around machine learning and artificial intelligence (AI). Although both are hot buzzwords, there are plenty of success stories using the technology today. To explain what I have learned, I have written a blog series. My hope is to cover my journey in machine learning with artificial intelligence.

Blog Focus: Machine Learning

The first focus is the technology itself. As with all new technology, it brings new terms and new concepts. Several of these are heretical to the status quo, so it’s important to level set properly. Then we want to discuss where it applies. What problems does it solve? How well does it solve them? How do the new solutions compare to legacy ones?

Solving Problems with Machine Learning

Now that we have a firm introduction, let’s solve some problems. First up is the use case of fault storm management. How are storms detected and mitigated? What are the rewards of applying ML/AI?
My next favorite use case is fault stream reduction. Fewer faults mean less effort for operations. Can ML/AI help? How well does it work? How hard is it to use?
Operational performance management is a touchy subject, but a worthwhile exercise. Why should you monitor your NOC? How can you help operations without being Orwellian about it?
Chronic detection and mitigation is a common use case for operations. How does operations iron out the wrinkles in their network? Can operations know when to jump on a chronic problem and fix it for good? Getting to 99.999% is hard without addressing chronic problems.

What is the Future of Machine Learning

As part of this series, we should address the future of this technology. With ML/AI being so popular, where should this technology be applied? How can it help service assurance make an impact?
The plan is to wrap up the series with a review of what is currently available in the marketplace, so we are all aware of what is current versus what is possible.
Stay tuned, the plan is to release the blogs weekly. Don’t be afraid to drop comments or questions, as I would love to do an AMA or a blog on the questions.

Article Map

Journey into Machine Learning AI

About the Author

Serial entrepreneur and operations subject matter expert who likes to help customers and partners achieve solutions that solve critical problems. Experience in traditional telecom, ITIL enterprise, global managed service providers, and datacenter hosting providers. Expertise in optical DWDM, MPLS networks, MEF Ethernet, COTS applications, custom applications, SDDC virtualized, and SDN/NFV virtualized infrastructure. Based in the Dallas, Texas area and currently working for one of the companies he founded, Monolith Software.

A Tale of Two Sales

It’s been a long quarter, but to celebrate some of the successes I have had, I wanted to share. Service assurance sales strategies are something everybody is looking for, and most do not know how to achieve them. I get asked all the time why this sells and that does not. To sum up my response, I decided to write a blog post: A Tale of Two Sales.

Sale A: Relationship Sale


When customers or partners need help, they go to people they trust. If they have no one with answers they trust, they do RFI/RFPs. They do proofs of concept until they obtain trust. So many sales engagements are about trust: obtaining and fostering it. Once trust is there, the deal only needs to be palatable. Until trust is there, you must be patient and listen. This sale is also called a “consultative sale”. Customers must share their pain, and their vendor must share their solutions. If there is a fit and it is proven, you have a match. This can take months or days; it all depends upon how open the engagement is. Once the trust is there, it’s a matter of business case creation. Project approval and budget follow, and the sale is done and dusted.


Sale B: Package Sale


Same situation: a customer or partner in need. They have no trusted advisor relationship. Hello RFI/RFP! They get back a complete, documented, sourced, and independently verified business case. This “known” quantity provides all the tangibles. This includes ROI projections, feature/benefit analysis, comparison studies, and case studies. It is the entire business case, with only minor adjustments required. I call this the package sale. Sales makes the pitch and leaves the bank with the customer. References provide the trust. The customer can go to their management assured in their support. The quality of the deliverable is everything. Management agrees, and project approval and budget follow. Sale is complete.


For you buyers out there…

Sales is hard, so be nice to your sales rep. Tell them what you want and why you want it. They can get it for you if you ask. If they ever tell you false information, stop talking to them. It’s not worth putting your career at risk, even for a slightly superior product.


For you sellers out there…

It’s all about the benefits. Completely documented and third-party certified benefits. Features are for demos, benefits are for closers. Customers and partners buy solutions to solve headaches or get promotions. It’s your job to help on both counts.

I personally believe it does not matter which strategy is used.   Service assurance sales strategies are not complex, you just have to listen to your customer.    The customer will guide you and if you do it right, you will have both deliverables: the trust and business case.



Projecting how 5G will impact operations

It’s coming, they say. Not winter (that’s here already), but mobile 5G is coming. Sure, the latency will allow a new generation of mobility applications. The new RF control functions will allow better elasticity. The network slicing will finally allow MPLS-like functionality in a 3GPP network. But I think it’s time someone asked the basic questions on how 5G will impact operations.


Basic Tutorial and Terminology

5G means different things to different people. The money involved means the industry will be providing competing visions. The essence of 5G is improving latency, bandwidth, network slicing, and elasticity. Adding more endpoints (antennas) and distributed control functions will reduce the latency. The control functions permit more distributed switching. The extra endpoints mean less network mileage is required. These latency benefits are the biggest game changer so far.

The simplified control functions also enable better elasticity options. Scaling up/down/in/out will allow a more natural self-optimizing network. Adding class-of-service capabilities (like QoS in MPLS) will allow tiered network options. While net-neutrality questions still loom, this adds diversity to the single-use mobile network. The amount of network required is still to be defined, but it looks to be at least 10x. Mobile operators are also taking this opportune time to diversify their vendors. Most US providers are adopting radios from at least two vendors.

The bottom line for operations is potentially terrifying. Exponential scale in the network and backhaul is to be expected. Complexity increases exponentially with an “always in flux” network. New network offerings and customer bases are bound to cause trouble. Top those off with at least doubling the vendors. Houston, we have a problem!



Next-generation Mobile Network Services

As you can see, the investment will be significant. The upside should be worth it, though. Traditional mobile services are commoditized, as I blogged here last year. Data, voice, and SMS do not provide enough value to the customer. The new services will change that. With the latency benefits, IoT services will become more viable. I detailed IoT more in the previous blog around IoT Service Assurance. In my opinion, the most intriguing new offering is “fixed wireless access” (FWA). Ericsson did a really nice write-up, available here. Verizon is augmenting its FIOS offer with an FWA offer in 2018. This means that mobile providers are entering the cable access market (HFC). This sets Verizon against Comcast, or T-Mobile against Charter. Gone will be the days when we are locked into high-speed internet options solely by who developed the neighborhood.

These new network services will drive new revenue potential. Most of these services will have direct competition, so quality will matter. With all these changes, operations should expect challenges. We should all expect quality problems with these new services.



Exponential Scale and Complexity

The first great challenge will be scale and complexity. Tripling the number of devices in your network will stress your tools. Realistically, can your OSS handle a 10x-1,000x increase in network size? But this is not the only issue. The self-optimizing vRAN means that the network will constantly be in flux. How can you troubleshoot a network that is always changing? Due to the size of the investment, it only makes sense that multiple vendors will be used. Yet most mobile operators heavily depend upon their NEPs to provide OSS solutions.

The solution is simple: in fact, it is simplification. A vendor, technology, and product agnostic OSS solution is a must. As you increase your tools, the complexity limits functionality. Low-level optimization and orchestration can be done at the element manager layer. This increases the scale of both layers of the solution.



Becoming Geospatial Again

Remember the HP OpenView days of maps? Well, get prepared for those concepts to return. Like wifi antennas, 5G radios are deployed with geospatial design in mind. Geospatial information (lat/long) will then drive behavior. GIS correlation and visualization then become a need. Correlation and analytics are vital to reducing the complexity of 5G vRAN networks. External network conditions become more indicative. Things like hurricanes, floods, and power outages need to be taken into account. This is very similar to the cable industry’s (HFC) access monitoring requirements. Operations will need help because most legacy tools are inadequate in these areas.


Bending but not Breaking with Elasticity

Elastic scaling of the network is not a 5G concept. The trouble is we have seen that many, if not most, of the network functions are beasts. Components like MMEs and SGWs take hours to spin up and configure. This reduces the value of elasticity in 3GPP networks. 5G replaces many complex functions, like eNodeBs, with smaller control functions. This will enable all the promises of network virtualization. The question becomes whether operations has the orchestration tools to enable automation. Some NEPs are building those functions into the element managers or VNFMs. Most service assurance tools do not have the capability to handle the network flux caused by real-time elasticity. Operations will need to review their tools to make sure they are agile enough.


Operations Face Uphill Battle

The industry is beating the drums: 5G is coming. But I do not hear from the industry how operations will consume it. From my experience, nobody knows. Legacy tools make it too difficult to share information. They are too tied to a vendor or technology domain. Most tools have difficulty scaling to 100k-5m devices. This forces most customers to silo their monitoring and management. This creates a lack of visibility, which drives quality issues. Most operational processes are ticket or fault-centric. Correlation is lagging behind. There will be too many faults to process. Visualization of the network will be a critical need, but may not be possible. Like winter, 5G is coming, so where is my 700 ft wall?


Shawn Ennis



Merry Christmas 2017

Merry Christmas to all and to all a good night! I wanted to take an opportunity to review 2017 and be thankful for all the blessings we received this year. Let me provide some description of my good cheer by listing out the blessings I have gotten in 2017.

Ezekiel Ennis

My youngest son was gifted to me this year in February.   Very thankful to have a healthy and growing baby boy.   Nothing was going to top that.   Anyone would be thankful for this guy!  Now just sleep through the night, huh? 😉


It really works (and no, that’s not Zeke). I was fortunate to be involved with a revolutionary project in 2017. It was around service assurance proving the value of SDN/NFV in telecom managed services. We are all so proud of the pioneering work and of working with the people that made it possible. I hope to share more lessons learned in 2018.

Machine Learning

Elastic stack and machine learning have been a large focus of my Q4 this year. The value proposition is stunning. Merging service assurance with business intelligence and machine learning has been an eye-opener. I hope to provide more perspective in the blog, but I am very thankful to have spent so much time here recently.

5G Mobility

Learning 5G mobility and vision. The mobility world is changing rapidly. The services offered are about to take an exponential leap. Fixed wireless is only HFC access without the coax, what a realization. A 10-100x increase in antennas and backhaul. Breaking eNodeBs into virtualized network functions. Really exciting stuff. I hope to share more as new projects become a reality.

IoT Digital Services

IoT is all about perspective.   Here is a link around IoT Service Assurance.   Very thankful to get a deeper understanding of the services and pain points.  The customers and partners I work with are outstanding.   I look forward to working in the new year realizing our co-developed vision.


Thanks again to all the customers, partners, and colleagues who made 2017 a great year. Now on to 2018, when we good boys and girls are getting 5G under our trees and gift cards for RPA.

Shawn Ennis


Gone are the days…

In this holiday season (2017), I happened to chat with a customer about change. Why is innovation so hard? The answer is change. Nowhere is change harder than at traditional telecommunications CSPs. The impact is starting to show: OTT services are growing and unstoppable. Traditional offers are withering. Hardware providers (NEPs) are also on the ropes. A Business Insider article states that “it’s time to say goodbye to the text message”. Papworth sent the first SMS text while at a telecom hardware provider (Sema). He sent it to Jarvis, at a telecom service provider (Vodafone). As an aside, the first text was “Merry Christmas”. Gone are the days that innovation comes from NEPs and CSPs. Providers need to accept change.

Gone are the days of the text message

Personal Testimonial

If that chart does not help show it, how about a personal testimonial? Below is the usage from my mobile provider. My account (#1) leverages OTT offerings all day long for voice, data, and communications. When I am traveling between offices and meetings, I use mobile voice. I use data to check and send email. If I am in the office, I am using wifi. I am like many other mobile consumers.

gone are the days of simple mobile services

Then there is my wife (#2, don’t tell her). She is always in motion and must be completely mobile. There is no wifi available at the park or in the playground. My wife texts because she does not have the time to dedicate to a long conversation. She texts because she wants to communicate via groups. Data is used because of Facebook and Pinterest. My wife is like many other mobile consumers.

Maybe 5G will change things?   Maybe the next iPhone XYZ will break us both out of our patterns?   We can agree that how we use our phone is going to change.   Technology is a constant ‘game changer’.  I for one look for change, but does my service provider?   There is a lot of uncertainty in mobility. One thing is for sure, gone are the days that voice and text drive our communications.

Change in Telecommunications

What does this mean for telecoms? The value of “change” needs to be embraced. Change means opportunity. As the Business Insider chart shows, fewer people are using a 25-year-old technology. Innovations by NEPs and CSPs are fewer than those by ASPs and OTT providers. The reason is change. Embracing change drives innovation. Innovation provides market opportunities. It also makes consumers happier.

The road for telecoms is clear and many are already following it. First, they need to embrace OTT providers, not fight them (remember net-neutrality?). T-Mobile with Netflix is showing the way. Sprint just announced a partnership with Hulu. Together, providers can offer a single throat to choke and cement customers in.

Next, telecoms need to leapfrog OTT providers. Telecoms need to get back to their R&D, pioneering days. Verizon has made investments in purchasing Yahoo and AOL. AT&T is acquiring Time-Warner. This is a good start, but innovation needs to come to telecoms before it’s too late. Before we say gone are the days CSPs have control of their customers.

The text message is dead (SMS), long live the text message (Facebook)…    Gone are the days…

Shawn Ennis


Understanding Service Assurance Correlation & Analytics

Data is good, right? The more data the better. In fact, there is a whole segment of IT devoted to data, called big data analytics. Operations has data, tons of it. Every technology device spits out gigabytes of data a day. The question is figuring out how to filter that data. It’s all about reducing that real-time stream of data into actionable information. Understanding service assurance correlation & analytics is all about focusing operations. That attention can produce better business results. This blog details common concepts and what’s available in the marketplace. I want to show the value of turning data analytics into actionable information, value that operations can execute on successfully.

Maturity Curve for Service Assurance Correlation & Analytics

Let’s talk about terminology first. Correlation versus analytics is an interesting subject. Most people I talk to consider correlation to be only within fault management. Analytics includes time series data like performance and logs. Now I know some would disagree with that simplification, but we can use it here to avoid confusion. Whichever term you use, what we look for is reduction and simplification. The more actionable your information is, the quicker you can resolve problems.


Visual Correlation & Analytics

The first step on the road to service assurance correlation and analytics is enabling a visual way to correlate data. Correlation is not possible if the data is missing, so having unified collection is your first step. Once you have the data co-located, you can drive operations activities to resolution. Technicians can leverage the tool to find the cause of the fault. Drill-down tools can help uncover enough information. Then the NOC techs can perform manual parent/child correlation.

Once that is done, users of the assurance tool can also suppress, or hide, faults. Faults that are not impacting, or that are known false errors, get sorted out as “noise”. Assurance systems then leverage third party data to enrich faults. Enrichment allows faults to include more actionable data, which makes them easier to troubleshoot. All these concepts should be second nature. Operations should have all these visual features as part of their assurance tool. Otherwise they are hamstrung.


Basic Correlation & Analytics

Once you have a tool that has all your data, you will be swimming in the quantity of that data. You must reduce that data stream. If not, you will overload the NOC, which ends up looking at the haystack instead of the needle. There are many basic strategies that allow that reduction.

First, look at de-duplication. This feature allows you to match up repeat faults and data points, which can eliminate 80% of duplicate data. Matching “up” to “down” messages allows elimination of 50% of your data stream. Reaping jobs can close out data that is no longer deemed “current”, or limit log data. Another common feature is suppressing faults. You can suppress by time window, for example during scheduled maintenance, or by excluding business hours. Threshold policies can listen to “noise” data and create an alert after X times in Y minutes. These features should be available on any assurance platform. If yours lacks them, look to augment.
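To make the de-duplication and "X times in Y minutes" ideas concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the event fields (`device`, `fault`, `time`) and the names `deduplicate` and `ThresholdPolicy` are hypothetical and do not come from any particular assurance product.

```python
from collections import deque


def deduplicate(events):
    """Collapse repeat faults sharing a (device, fault) key into one record with a count."""
    merged = {}
    for ev in events:
        key = (ev["device"], ev["fault"])
        if key in merged:
            # Repeat fault: bump the counter instead of creating a new row.
            merged[key]["count"] += 1
            merged[key]["last_seen"] = ev["time"]
        else:
            merged[key] = {
                "device": ev["device"],
                "fault": ev["fault"],
                "count": 1,
                "first_seen": ev["time"],
                "last_seen": ev["time"],
            }
    return list(merged.values())


class ThresholdPolicy:
    """Signal an alert once a noisy fault occurs `times` times within `window_seconds`."""

    def __init__(self, times, window_seconds):
        self.times = times
        self.window_seconds = window_seconds
        self.seen = {}  # key -> deque of recent timestamps

    def observe(self, key, timestamp):
        stamps = self.seen.setdefault(key, deque())
        stamps.append(timestamp)
        # Forget occurrences that have fallen out of the time window.
        while timestamp - stamps[0] > self.window_seconds:
            stamps.popleft()
        return len(stamps) >= self.times


events = [
    {"device": "r1", "fault": "link_down", "time": 0},
    {"device": "r1", "fault": "link_down", "time": 10},
    {"device": "r2", "fault": "fan_fail", "time": 12},
]
unique_faults = deduplicate(events)

policy = ThresholdPolicy(times=3, window_seconds=300)
flap_results = [policy.observe("r1:flap", t) for t in (0, 100, 200)]
```

In this sketch the three raw events reduce to two fault records, and the policy only fires on the third flap inside the five-minute window. A real platform would also need to expire closed faults, which is left out here.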


Root Cause Analysis Correlation & Analytics

If you have a NOC with thousands of devices or tens of domains, you need cross-domain correlation. Root cause analysis is key to reducing the complexity of large access networks. Performing RCA across many technology domains is a core strategy. Operations can use it for consolidated network operations. Instead of playing the blame game, you know which layer is at fault. Leveraging topology to sift through faults is a common approach. Unfortunately, it’s not typical in operations. Topology data can sometimes be difficult to collect or of poor quality. Operations needs a strong discovery strategy to prevent this.
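As a rough illustration of topology-based root cause analysis, the Python sketch below walks each faulted device up a simple dependency tree and treats the highest faulted ancestor as the probable root cause. The topology format (each device mapped to the single upstream device it depends on) and the function name `find_root_causes` are assumptions of mine for the example; real topologies are graphs with redundancy and need more care.

```python
def find_root_causes(upstream, faulted):
    """Group faulted devices under their highest faulted upstream ancestor.

    upstream: dict mapping device -> the device it depends on (None at the top).
    faulted:  iterable of device names that currently have a fault.
    Returns a dict mapping each probable root cause -> sorted list of its symptoms.
    """
    faulted = set(faulted)
    roots = {}
    for device in sorted(faulted):
        root = device
        visited = {device}
        node = upstream.get(device)
        # Walk toward the core; remember the highest ancestor that is also faulted.
        while node is not None and node not in visited:
            if node in faulted:
                root = node
            visited.add(node)
            node = upstream.get(node)
        symptoms = roots.setdefault(root, [])
        if device != root:
            symptoms.append(device)
    return roots


topology = {
    "core1": None,
    "agg1": "core1",
    "agg2": "core1",
    "access1": "agg1",
    "access2": "agg1",
    "access3": "agg2",
}
root_causes = find_root_causes(topology, ["agg1", "access1", "access2", "access3"])
```

Here the two access devices behind `agg1` roll up as symptoms of the `agg1` fault, while `access3` stands alone because nothing upstream of it is faulted. Four faults become two actionable ones.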

Cluster-based Correlation

Cluster-based correlation is another RCA strategy. This one does not rely upon topology data. The concept here is using trained or machine learning data. A written profile will align data when a certain pattern matches. Some tools create patterns during the troubleshooting process. Others have algorithms that align faults with time and alerts. Once the pattern matches, the alert fires, causing a roll-up by the symptoms to reduce the event stream. This correlation method is popular, but hasn’t provided many results yet. Algorithms are the key here. Many challenge its ROI model, which requires machine training.
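The simplest form of aligning faults by time can be sketched in a few lines of Python. This is only a toy under my own assumptions: the `cluster_by_time` name, the `time` field, and the fixed gap are hypothetical, and commercial tools use far richer patterns or trained models than a single time gap.

```python
def cluster_by_time(faults, max_gap_seconds):
    """Group faults into clusters where consecutive faults are at most max_gap_seconds apart."""
    clusters = []
    for fault in sorted(faults, key=lambda f: f["time"]):
        if clusters and fault["time"] - clusters[-1][-1]["time"] <= max_gap_seconds:
            # Close enough to the previous fault: treat it as part of the same incident.
            clusters[-1].append(fault)
        else:
            clusters.append([fault])
    return clusters


faults = [
    {"id": "a", "time": 0},
    {"id": "b", "time": 5},
    {"id": "c", "time": 8},
    {"id": "d", "time": 100},
    {"id": "e", "time": 103},
]
clusters = cluster_by_time(faults, max_gap_seconds=10)
cluster_ids = [[f["id"] for f in c] for c in clusters]
```

Each cluster can then be rolled up into a single alert, with the member faults attached as symptoms. No topology data is needed, which is the appeal, but unrelated faults that happen to coincide will also be grouped.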

Customer Experience Assurance

Next, RCA enables operations to become more customer-centric. Service-oriented correlation allows operations to see the quality of their network, all through their customers’ eyes. Some call this functionality “service impact analysis”. I like the term “customer experience assurance”. Understanding which faults are impacting customers and their services enables more efficient operations. The holy grail of operations is focusing only on root causes, then prioritizing action only by customer value.

Service Quality Management

Lastly, you can track customer temperature by moving beyond outages and into quality. It’s important to understand the KPIs of the service. This allows clarity on how well the service is performing. If you group these together, you simplify. While operations may ignore bumps and blips, you still need to track them. It’s important to understand those blips are cumulative in the customers’ eyes. If the quality threshold is violated, customers’ patience will be limited. Operations needs to know the temperature of the customer. Having service and customer level insights is important to providing high quality service. Having a feature like this drives better customer outcomes.


Cognitive Correlation & Analytics

The nirvana of correlation and analytics includes a cognitive approach. It’s a simple concept. The platform listens, learns, and applies filtering and alerting. The practice is very hard. The algorithms available are diverse. They are either domain specific (website log tracking) or generic in nature (Holt-Winters). Solutions need to be engineered to apply the algorithms only where they make sense.

Holt-Winters Use Case

One key use case is IPSLA WAN link monitoring. Latency across links must be consistent. If you see a jump, that anomaly may matter. The Holt-Winters algorithm tracks abnormal behavior through seasonal smoothing. Applied to this use case, an alert is raised when the latency breaks from its normal operation. This allows operations to avoid setting arbitrary threshold levels. Applying smart threshold alerting can reduce operational workload. Holt-Winters shows how cognitive analytics can drive better business results.
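For readers who want to see the mechanics, here is a small Python sketch of additive Holt-Winters (triple exponential smoothing) used to flag latency readings that stray from the seasonal forecast. Treat it as a sketch under stated assumptions rather than how any product does it: the function name, the smoothing constants, and the fixed `tolerance` in milliseconds are all mine, and I chose to keep flagged readings from updating the model so one spike does not cause follow-on alerts. Production implementations usually derive a confidence band from recent forecast error instead of a fixed tolerance.

```python
def holt_winters_anomalies(series, season_length, alpha=0.3, beta=0.1,
                           gamma=0.2, tolerance=10.0):
    """Return indexes where a reading differs from its one-step forecast by > tolerance."""
    m = season_length
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Initialise level, trend and seasonal offsets from the first two seasons.
    first_mean = sum(series[:m]) / m
    second_mean = sum(series[m:2 * m]) / m
    level = first_mean
    trend = (second_mean - first_mean) / m
    seasonal = [series[i] - first_mean for i in range(m)]

    anomalies = []
    for t in range(m, len(series)):
        forecast = level + trend + seasonal[t % m]
        observed = series[t]
        if abs(observed - forecast) > tolerance:
            anomalies.append(t)
            # Assumption of this sketch: an outlier is replaced by the forecast
            # so it does not drag the model away from normal behavior.
            observed = forecast
        previous_level = level
        level = alpha * (observed - seasonal[t % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (observed - level) + (1 - gamma) * seasonal[t % m]
    return anomalies


# Hypothetical link latency in ms, with a repeating four-sample pattern.
latency = [20, 22, 30, 24] * 6
latency[18] = 80  # inject one spike
spikes = holt_winters_anomalies(latency, season_length=4)
```

Note that the reading of 30 ms at the daily peak is never flagged, even though it is well above the 20 ms trough, because the model expects it. A static threshold could not make that distinction.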

Adaptive Filtering Use Case

Under the basic correlation area I listed dynamic filtering. A fault can happen “X times in Y minutes”. If so, create alert Z. This generic policy is helpful. The more you use it, the more you will realize that you need something smarter. Adaptive filtering using cognitive algorithms allows for a more comprehensive solution. While the X-Y-Z example depends upon two variables, the adaptive algorithm leverages hundreds. How about understanding whether the device is in a lab or is a core router? Does the fault occur every day at the same time? Does it precede a hard failure?

You can leverage all these variables to create an adaptive score. This score would be an operational temperature gauge or noise level. NOC techs can cut noise during outages. They can increase it during quiet times or sort by it to understand “what’s hot”. Adaptive filtering gives operations the ability to slice and dice their real-time fault feeds. This feature is a true force multiplier.
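A toy version of such a score might look like the Python below. To be clear about what is assumed: the three inputs come from the questions above, but the weights, field names, and the `adaptive_score` function are hypothetical and hand-picked for illustration. A real adaptive filter would learn its weights from history and use many more variables.

```python
# Hypothetical base scores by device role; a lab box matters far less than a core router.
ROLE_BASE_SCORE = {"core": 100, "edge": 60, "lab": 10}


def adaptive_score(fault):
    """Combine a few fault attributes into a 0-100 'temperature' score."""
    score = float(ROLE_BASE_SCORE.get(fault["role"], 50))
    if fault["recurs_daily"]:
        # A fault that shows up at the same time every day is more likely noise.
        score *= 0.5
    if fault["precedes_hard_failure"]:
        # A fault that has historically come before an outage deserves attention.
        score = min(100.0, score * 1.5)
    return score


def filter_by_noise_floor(faults, noise_floor):
    """Keep faults at or above the chosen noise floor, hottest first."""
    kept = [f for f in faults if adaptive_score(f) >= noise_floor]
    return sorted(kept, key=adaptive_score, reverse=True)


faults = [
    {"id": "lab-sw1", "role": "lab", "recurs_daily": True, "precedes_hard_failure": False},
    {"id": "edge-r7", "role": "edge", "recurs_daily": False, "precedes_hard_failure": True},
    {"id": "core-r1", "role": "core", "recurs_daily": False, "precedes_hard_failure": True},
]
hot_faults = [f["id"] for f in filter_by_noise_floor(faults, noise_floor=50)]
```

Raising `noise_floor` during an outage hides everything but the hottest faults, and lowering it during quiet times brings the background noise back into view.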

The Value of Correlation & Analytics

Understand the Value in Service Assurance Correlation & Analytics

The important part of correlation & analytics with service assurance is its value. You must understand what is freely available and its value to operations. This subject varies greatly from customer to customer and environment to environment. You have to decide how far down the rabbit hole you want to go. Always ask the question “How does that help us?”. If you are not moving the needle, put it on the back burner.

If a policy is not saving 4-8 hours of effort a week, it’s just not worth the development effort. Find your quick wins first. Keep a list in a system like Jira and track your backlog. You may want to leverage an agile methodology like DevOps if you want to get serious. Correlation and analytics are force multipliers. They allow operations to be smarter and act more efficiently. These are worthwhile pursuits, but make sure to practice restraint. Focus on the achievable; you don’t need to re-invent the wheel. Tools are out there that provide all these features. The question to focus on is “Is it worth my time?”.

Shawn Ennis


IoT Service Assurance Key Concepts

The IoT/IoE generation has been born. Now countless things are about to be inter-connected. We all see that the hype is non-stop, but many things are becoming a reality. AT&T/Maersk closed a deal back in 2015. This recently became a reality for asset tracking of cold shipping containers. Now Uber is providing driverless trucks to deliver beer, while GPS trackers are being used to track the elderly. These services are becoming ubiquitous and common. We are seeing that the use cases have variety and are growing in depth. But we also see that IoT is a very pioneering field. If IoT managed services are to exist, operations will need to manage them. The goal here is to start asking key questions. The hope is that through analysis we can provide some answers. Let’s discuss the key concepts driving the new field of IoT Service Assurance.

Key Perspectives for IoT Service Assurance

For any IoT service, you must understand who uses it and who provides it. As I explain it, there are three key perspectives for IoT services. First, you have the network provider. They provide the network access for the “thing”. The “network” could mean LTE or wifi or any other technology. Network providers see network quality as the focus. This is similar to typical mobile providers. Compare that to IoT services monitored with an application focus. It’s about monitoring the availability and performance of the “things”. You want to make sure they are working. Lastly, you may not care about the “things”. Perhaps you only care about the data from them. Performing correlation and understanding the “sum of all parts” would be the key focus. These perspectives drive your requirements and the value prop. Through them, you can define quality and success criteria for your IoT services.

Key Requirements of IoT Service Assurance

Before we get too far along, let’s first talk about terminology. In the world of IoT, what is a device? We have to ask, is this “thing” a device? In the world of mobility, the handset is not a device, it’s an endpoint. So is the pallet being monitored in the cold shipping container a device or an endpoint? Like the perspectives that drive your requirements, we should agree on terminology. Let’s talk through some use cases to better understand typical requirements.

Cold Storage Tracking IoT Service Assurance

Smart Cold Storage

In the Maersk use case, let’s say the initial roll-out is 250k sensors on pallets. These sensors, at regular intervals, report data via wireless burst communications. The data includes KPIs that drive visibility and business intelligence. Some common examples I have found are temperature, battery life, and vibration rate. Other environmental KPIs may be required: light levels, humidity, and weight. As we have discussed, location information with signal strength could be useful. We can track in real time to provide trending and prediction. One would think it would be best to know about a failure before putting the container on the boat.

Bottom line, we would have around 25 KPIs per poll interval. Let’s do some math for performance data. Estimate 250k sensors * 25 KPIs * 4 (15 min polls, 4/hour) * 24 (hours/day) = 600 million data points per day. If you were to use standard database storage (say MySQL), you would require about 200GB per day. Is keeping the sensor data worth roughly $300 per month for each month of data retained on AWS EC2? Storage is so inexpensive that real-time monitoring of sensor data becomes realistic.
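
The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `daily_data_points` is mine, and the bytes-per-point figure is simply what the 200GB/day estimate implies, not a measured database row size.

```python
def daily_data_points(sensors: int, kpis_per_poll: int, polls_per_hour: int = 4) -> int:
    """Data points per day for a fleet polled at a fixed interval."""
    return sensors * kpis_per_poll * polls_per_hour * 24


# Cold storage example from the text: 250k sensors, 25 KPIs, 15-minute polls.
points = daily_data_points(sensors=250_000, kpis_per_poll=25)
print(f"{points:,} data points per day")  # 600,000,000 data points per day

# Bytes per data point implied by the 200GB/day storage estimate
# (an assumption taken from the text, not a benchmark).
implied_bytes = 200e9 / points
print(f"~{implied_bytes:.0f} bytes per data point")  # ~333 bytes per data point
```

The same function can be reused for any fleet by changing the sensor and KPI counts, which makes it easy to see how quickly the poll interval drives volume.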

Now faults are different.  Some could include failed reconnects and emergency button pushed scenarios.    These faults could provide opportunities. Shipping personnel can fix the container before the temperature gets too warm.    Faults could provide an opportunity to save valuable merchandise from spoilage.   Together this information combines to provide detailed real-time IoT Service Assurance views.

Driver-less Trucks for IoT Service Assurance

Driverless Trucks Use Case

Let’s look at another use case: Uber with driverless trucks. The Wired article does not say how many vehicles are involved, so let’s look at UPS. UPS has more than 100k delivery trucks. Imagine if these logistics were 100% automated. This would create tons of “things” on the network. The network, controller, and data would work together to provide a quality IoT service.

First, let’s look at performance data. The KPIs should be like the Maersk example. Speed, direction, location, and range would be valuable real-time data. Service KQIs like ETA and number of stops remaining would drive efficiencies. Let’s do the same math as the Maersk example. Say 100k trucks * 50 KPIs * 4 (15 min polls, 4/hour) * 24 (hours/day) = 480 million data points per day. So about $240/day on AWS. This shows that the storage requirements are practical for driverless logistics.

Now some faults would include vital real-time activity. Perhaps an ‘out-of-gas’ event or network errors. Getting real-time alerts on a crash would definitely be useful. So fault management would be a necessity in this use case. Again, there are plenty of reasons to create and leverage real-time alerts.

Smart Home for IoT Service Assurance

Another use case would be smart home monitoring, like Google Nest or Ecobee. These OTT IoT providers track and monitor things like temperature and humidity. There is no fault data and no analytics. The number of homes monitored by Nest or Ecobee is not readily available on the internet. According to Dallas News, there are 8 million thermostats sold yearly. According to Fast Company, Ecobee has 24% market share, so about 2 million homes per year. Ecobee has been in business for more than 5 years, so assume they have 10 million active thermostats. Doing some math, we have 10M homes * 10 KPIs * 4 (15 min polls, 4/hour) * 24 (hours/day) = 9.6 billion data points per day, or roughly 10 billion. So that would be around $4800/day on AWS.

IoT Service Assurance is Practical

What is interesting about these use cases is their practicality. Scalability is not a problem with modern solutions. All three cases show that, from any perspective, real-time IoT service assurance is achievable. I am amazed how achievable monitoring can be for complex IoT services. Now you must ask the questions “why” and “how”. To answer these questions, you must understand how flexible your tools are and what value you can get from them.

Understanding Flexibility of IoT Service Assurance

Let’s discuss flexibility. First, how difficult is collecting this data? Let’s focus on the world of open APIs. The expectation is these messages would come through a load-balanced REST application server. I can imagine 600 million hits per day, which averages out to roughly 6.9k hits/sec (600 million / 86,400 seconds). This is well within Apache and load balancer tolerances. As long as the messaging follows open API concepts, collection should be practical. So from a flexibility perspective, assuming you embrace open APIs, this is practical as well.
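
As a rough check of the request rate, daily volume can be converted into a per-second average. The sketch below compares two cases: one REST call per data point, and (as an assumption of mine, not something stated above) one call per sensor poll that carries all of that sensor's KPIs. These are averages; sensors that report on the same schedule would produce bursts above them.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def hits_per_second(hits_per_day: float) -> float:
    """Average request rate for a given daily request count."""
    return hits_per_day / SECONDS_PER_DAY


# Worst case: every data point is its own REST call.
per_point = hits_per_second(600_000_000)

# Batched case (assumption): each sensor posts one message per poll,
# i.e. 250k sensors * 4 polls/hour * 24 hours = 24 million calls/day.
per_message = hits_per_second(250_000 * 4 * 24)

print(f"one call per data point:  {per_point:,.0f} hits/sec")   # 6,944
print(f"one call per sensor poll: {per_message:,.0f} hits/sec")  # 278
```

Batching KPIs into a single message per poll cuts the request rate by the number of KPIs per sensor, which is usually the cheaper place to find headroom.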

Understanding the Value of IoT Service Assurance

It’s a fact: real-time is a key need in IoT Service Assurance. If whatever you want to track can wait 24/48 hours before you need to know it, you can achieve it with a reporting tool. If all you need is to store the data and slap a dashboard/reporting engine on top, then this becomes easy. Start with an open source database like MariaDB, which is low cost and widely available. Next, add COTS dashboard and reporting tools like Tableau for a cost-effective solution.

In contrast, real-time means you need to know immediately that a cold storage container has failed. It means being able to automate dispatch: find the closest human and text that operator to fix the problem. Real-time means that you have a delivery truck on the side of the road and need to dispatch a tow truck. Real-time IoT Service Assurance means massive collection, intelligent correlation, and automated remediation. Now let’s look at the OTT smart home as a use case. The Nest thermostat is not going to call the firehouse when it reaches 150F. Everything is use case dependent, so you must let your requirements dictate the tool used.
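
To make the “find the closest human” idea concrete, here is a minimal sketch. Everything in it is hypothetical: the `Operator` record, the 4.0C threshold, and the message format are illustrative choices, and actually sending the text is left out. It uses the haversine formula for great-circle distance.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Operator:
    name: str
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))


def closest_operator(lat, lon, operators):
    """Operator with the shortest great-circle distance to the given point."""
    return min(operators, key=lambda op: haversine_km(lat, lon, op.lat, op.lon))


def handle_reading(container_id, temp_c, lat, lon, operators, max_temp_c=4.0):
    """Return a dispatch message if the reading breaches the threshold, else None."""
    if temp_c <= max_temp_c:
        return None
    op = closest_operator(lat, lon, operators)
    return f"To {op.name}: container {container_id} at {temp_c:.1f}C, please investigate"
```

In a real deployment the returned message would be handed to whatever notification channel operations already uses, and the threshold would come from the cargo's requirements rather than a default.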

Lessons Learned for IoT Service Assurance

  • IoT-based managed services are currently available and growing
  • Assuring them properly will require new concepts around scalability and flexibility
  • With IoT, you must always ask how far down it is worth monitoring
  • Almost all requirements include some sort of geospatial tracking or correlation

My advice on IoT Service Assurance

  • As always, follow your researched requirements. Get what you need first, then worry about your wants.
  • Make sure you have tools with a focus on flexibility, scale, and automation. This vertical has many fringe use cases and they are growing.
  • IoT unifies network, application, and data management more than any other technology. Having a holistic approach can provide a multiplying and accelerating effect.

About the Author

Shawn Ennis

Serial entrepreneur and operations subject matter expert who likes to help customers and partners achieve solutions that solve critical problems. Experience in traditional telecom, ITIL enterprise, global managed service providers, and datacenter hosting providers. Expertise in optical DWDM, MPLS networks, MEF Ethernet, COTS applications, custom applications, SDDC virtualized, and SDN/NFV virtualized infrastructure. Based out of the Dallas, Texas, US area and currently working for one of his founded companies – Monolith Software.

How Vendor Maturity Challenges NFV Adoption

Days are changing. The physical is becoming virtual. It started in the datacenter and now it’s in the network. The surge of NFV is creating a new market freeze. Telecoms are waiting. They don’t want to buy physical, but they haven’t seen enough success in the virtual network world. A common reason for lack of success has been the vendors and devices employed. Immature is a common descriptor for the marketplace. This blog intends to educate readers on what devices are out there. Here I catalog the VNF types I have seen. More people need to prepare themselves for the realities of this new virtual world. The issue of NFV adoption is one the industry must address.

Challenges in NFV Adoption

The facts on the ground are these: the maturity of vendors defines what is possible. If your vendor deems tracking next hop & neighbor topology as irrelevant, then you may not be able to perform accurate root cause analysis. If your vendor deems that MAC addresses can change every time you reboot a VNF, then your network ARP table cache will look crazy. Virtualization breaks the rules and some vendors are not ready to deal with that fallout. The greatest threat and obstacle to VNF adoption is the quality and quantity of VNFs available. This is a serious challenge. Out of all the stories I have heard and told, VNF vendor maturity is usually at the core of the issue. NFV adoption will continue to be challenged until processes can be developed to mitigate the issue.

Legacy VNFs for NFV Adoption

Let’s talk about categorizing the VNFs. We are seeing most VNFs fall into these silos. First there are legacy VNFs. They are your standard fare, just ported into a virtualized infrastructure. Virtual switches are usually embedded into the hypervisor. Routers are common, led by Cisco and Juniper. The mobile (3GPP) infrastructure is adopting VNFs, so Ericsson, Nokia, and Cisco have offerings. Virtualized security has gotten plenty of traction. Most firewall PNFs have been ported, like Cisco, Fortinet, Palo Alto, and Check Point. These VNF types are PNFs emulated on a Linux OS. I compare the concept to game emulators. The “VNF” is nothing more than a Linux operating system emulating the BIOS of the pre-existing PNF. I call them “legacy” because all the integrations and interfaces are identical to their PNF brethren. What is interesting is that the VNF image ships with Linux embedded. For any practical purpose, you cannot manage the Linux operating system underneath. The VNF overrides the SSH, SNMP, and telnet servers, so you cannot tell there is Linux underneath. Even so, legacy VNFs are the easiest to manage. They are >90% like their PNF counterparts, which provides predictability.

MANO-enabled VNFs for NFV Adoption

Legacy is exactly that – old! Most product offerings focus on new services using new technologies. This is where you will experience the second type of VNF – MANO-enabled. These new VNFs support and act in concert with an orchestrated, elastic infrastructure. SD-WAN is a perfect example of new network technology, with VeloCloud/VMware, Viptela/Cisco, and Versa as vendors. The trouble with these is that many of the vendors don’t know the “rules”. Documentation is errant or missing. Support for common, core functionality is missing, like the concept of interface monitoring. With giant gaps in functionality, especially from an assurance perspective, how can we maintain SLAs? When APIs have no documentation, you can bet that the vendor will not be able to provide best practices for KPIs. The new VNFs from new vendors have the hottest technology, but come with the greatest challenges.

Custom VNFs for NFV Adoption

Rules? We don’t need rules! That is what the third VNF type is all about – custom VNFs. One of the advantages of virtualizing networks is that telecoms can create their own VNFs. Think this is not likely? Well, one of the first customers I talked to used a custom VNF. They used Linux OS firewalls with custom code to configure and manage them. Sprint has C3PO, which will be using open source mobile core components. These VNFs give customers increased flexibility, but they can come with a price: lack of consistency. Managing custom VNFs ends up being more like managing custom applications than network elements. The good news is that you will be able to influence the changes needed. The bad news is you may be having those conversations AFTER those devices roll into production. A DevOps process can help support managing them effectively. The challenge is your assurance and delivery tools will need to support agile processes. Custom VNFs are the most challenging for operations. They flip the traditional sourcing and on-boarding models predefined in the industry.

On-boarding Strategy for NFV Adoption

With such diversity in VNF types, how can operations ensure proper engineering of new services and offerings based upon them? I call it an “on-boarding strategy”. When new VNFs are being planned or explored, operations must be involved to perform a proper assessment. This assessment must be holistic, including a multitude of different items. While many of these items are common in the PNF world, they have different levels of importance in the VNF world.

Example Checklist

Having a mature on-boarding strategy addresses NFV adoption. Below is my simplistic list; my recommendation is that you build your own:
  • Fault – This includes service-affecting and non-service-affecting issues and log collection. It’s important to understand the protocol, format, and overhead required.
  • Performance – This includes counters, KPIs, and KQIs associated to the technology. Passive vs active collection as well as protocol/format creates challenges.
  • Inventory – This includes interfaces and “slot/card” information. Protocol/format/overhead is important as well. 
  • Configuration – This includes how to recreate the VNF. A VNF, like a PNF, is a blank slate until it has a configuration. Recreating the VNF requires configuration collection, a repository, and a policy manager to restore it.
  • Topology – The next hops by network layer (1/2/3/4, etc.) allow for accurate root cause analysis. Without accuracy, your automation loses value.
  • Automation – Can you snapshot or change configuration without service impact? Understanding how well the VNF handles change allows better understanding of the limits of your tooling.
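
One way to keep such a checklist honest is to record it as data rather than in a slide. The sketch below is a hypothetical structure of my own, not a standard: the area names mirror the list above, and `VnfAssessment` simply tracks which areas passed, failed, or were never assessed.

```python
from dataclasses import dataclass, field

# Checklist areas, in the order listed above.
CHECKLIST = ("fault", "performance", "inventory", "configuration", "topology", "automation")


@dataclass
class VnfAssessment:
    vnf_name: str
    # area -> (passed, notes); areas not yet assessed are simply absent
    results: dict = field(default_factory=dict)

    def record(self, area: str, passed: bool, notes: str = "") -> None:
        """Store the outcome for one checklist area."""
        if area not in CHECKLIST:
            raise ValueError(f"unknown checklist area: {area}")
        self.results[area] = (passed, notes)

    def gaps(self) -> list:
        """Areas that failed or were never assessed, in checklist order."""
        return [a for a in CHECKLIST if not self.results.get(a, (False, ""))[0]]

    def ready(self) -> bool:
        """True only when every checklist area has been assessed and passed."""
        return not self.gaps()
```

Treating an unassessed area as a gap is a deliberate choice here: it keeps a VNF from being called ready just because nobody looked.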

Being Proactive

Be proactive in using your checklist so you can identify problems before they slow down your delivery. Many vendors offer similar VNFs. Say that from a contract perspective, switching between router vendors is insignificant. But let’s say they vary from an on-boarding requirements perspective. The more information you have, the better your chance of avoiding challenges and delay. The more delay you avoid, the more your NFV adoption will accelerate.

Impact on Tools for NFV Adoption

VNF maturity directly impacts your tooling. Automation becomes limited by the least common denominator of VNF capabilities. If you cannot perform accurate root cause analysis, then you cannot perform automation. If your VNF crashes when you perform a configuration pull, then automated restoration is limited. If the VNF stops logging under performance load, your operational processes are threatened. Virtualization of your network challenges legacy tools and processes like nothing I have ever seen. In my career, adopting a new tool to cover a new technology domain was enough, though it increased complexity and cost. NFV is different because it’s not a new domain; it unifies all domains.
When looking at tools, you should focus on three key areas. First is a no-brainer: scale. When you are changing the infrastructure on which all new network elements will run, scale should be the first discussion point. Next is flexibility. When you are unifying all network domains with a common infrastructure, you need tools that cross domains. Lastly, you need a tool with an automation focus. Automation is how you will grow the network, but only if your tools support it. Your service delivery and assurance solutions need to embrace your new network infrastructure. You do not want them to fight you tooth and nail. With the proper tools and process, your NFV adoption will accelerate and enable operations.
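
The “least common denominator” point can be illustrated as a set intersection: an action can only be automated end to end if every VNF in the chain supports it. The VNF names and capability labels below are made up for illustration.

```python
def automatable_actions(vnf_capabilities: dict) -> set:
    """Actions every VNF in the service chain supports (set intersection)."""
    capability_sets = [set(caps) for caps in vnf_capabilities.values()]
    if not capability_sets:
        return set()
    return set.intersection(*capability_sets)


def blockers(vnf_capabilities: dict, action: str) -> list:
    """VNFs that prevent an action from being automated end to end."""
    return sorted(name for name, caps in vnf_capabilities.items() if action not in caps)


# Hypothetical service chain and capabilities.
chain = {
    "vRouter": {"snapshot", "config_pull", "restart"},
    "vFirewall": {"config_pull", "restart"},
    "sdwan-edge": {"restart"},
}
print(automatable_actions(chain))       # {'restart'}
print(blockers(chain, "config_pull"))   # ['sdwan-edge']
```

The `blockers` view is often the more useful one in practice, since it names the VNF to raise with the vendor.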

Lessons Learned to Enable NFV Adoption

  • Virtualization is causing market confusion and hesitancy
  • Vendor maturity is one of the largest issues
  • There are three types of VNFs
  1. Legacy VNFs are like PNFs and you can expect similar feature/functionality, but with weird twists
  2. MANO-enabled VNFs have stable devices but inconsistency from a feature/functionality perspective
  3. Custom VNFs break all the rules and run more like applications than network elements
  • Create on-boarding processes based upon your requirements. Make sure they are documented and easy to use by procurement, engineering, and vendors
  • Make sure you have tools with a focus on flexibility, scale, and automation

My advice on NFV Adoption

  • Leverage process -> Develop and enforce a VNF on-boarding strategy
  • Vendor management -> Only use VNFs from vendors you trust and with whom you have a strong relationship
  • Get experience -> Model a network with the devices you want to use, then pilot that network with real traffic

Operations Digital Transformation Playbook

Digital transformation is the buzzword of the day. Whether TMF is saying it or Light Reading is reporting it, CIOs are doing it. Here is an example roadmap for transforming operations as part of the business’s digital transformation. This 4-step process leverages much of what the industry has already said. I have interwoven some color and advice. I hope you find it useful and comment below.

Acceptance for Digital Transformation

Step 1: Acceptance

First, your organization needs to buy into the fact that something has to change. Buy-in for digital transformation is the key to success. While 100% agreement is not possible, getting an overwhelming majority will reduce timelines. Forming a committee with regular cadence calls can assist with the collection of use cases. As a sounding board, they can be the voice of the organization. They will also provide you cover during the transformation process.
Here is some advice to help people get in the boat. Some will doubt the need for change. To those doubters, I would pose the following questions:
  • What % of the time does operations spend on firefighting?
  • What do your customers say about the quality of the services you provide?
  • How many compromises does your team make to “get the job done”?
  • Does 25% YOY staff turnover frighten you?

These questions are the canaries in the coal mine for digital transformation. If you cannot focus on improvement, growth, and resiliency, the organization is in danger. When the business is changing to a more agile footprint, operations gets left behind or, even worse, becomes the roadblock.


Selection for Digital Transformation

Step 2: Selection

Change scares people and organizations, and digital transformation can get scary. When it comes to selection, it must be a sober, deliberate decision. RFIs are a common method for initiating change. The trouble is the net you cast. If you only send the RFI to your pre-existing vendors, you will get more of the same.
I recommend that you start with a Google search on “Operations transformation”, then “IT transformation”. The results should net you some NEPs/DEPs (EMC, Huawei) and global SI players (Accenture, Deloitte). If you are a Fortune 500 company, they will be very kind to you and expect big money. If you have a relationship with these guys, I recommend calling upon them and seeing the “big pitch”. It’s great for context and helps you understand the commitment involved in transformation.
The next step would be to call some analysts. I have had great experiences with Analysys Mason, Appledore Research, and Gartner. They can tell you what other customers have done. Attending some webinars and trade events can help get you connected to the trend setters. This will help you round out the group you want to invite into the RFI process.
With the RFI executed, you will want to review the material and cut down to five or fewer parties. Make sure you have a global SI, a NEP/DEP, and some trend setters in the bunch. Ask for presentations and documentation of best practices. Get as much information as possible, because creating quality requirements is key.
Within the transformation workgroup, have each member create a top 10/25 list of key issues. Apply your use cases and develop a list of requirements (<100 items). Add a ratings system to keep it fair and to the point. If you value verification of technical compliance (i.e. support for Cisco IOS Y/N, etc.), add another sheet. Another tip: you can always ask entrants to combine their offerings. This firms up and consolidates your options. Use this list with your procurement team to create the RFP. Give at least 2 weeks to respond, and no more than 4 weeks. Stick to your schedule and grade the entrants’ responses.
Work within the workgroup to kill and combine entries until you get down to at least 2 (the fewer the better). Based upon grading, provide the down-selected parties with a list of how they can improve their response. Giving parties the opportunity to focus and improve will allow for better options. Schedule meetings with no more than 7 days’ notice for their presentation and response. After all meetings are complete, revise the grading and make a selection with procurement. Notify all parties and negotiate a contract. I recommend all contracts that are part of a transformation be longer term; you will want a partner for at least 3 years. I also recommend agreeing on SLAs and penalties for failed/delayed delivery.
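
A ratings system like the one described can be as simple as a weighted average per respondent. The sketch below is one possible approach under my own assumptions: ratings on a 0-5 scale, workgroup-assigned weights per requirement, and unanswered requirements scored as zero. The requirement and vendor names are placeholders.

```python
def weighted_score(ratings: dict, weights: dict) -> float:
    """Weighted average of 0-5 ratings; requirements missing a rating score 0."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not all be zero")
    return sum(weights[req] * ratings.get(req, 0) for req in weights) / total_weight


def rank_vendors(responses: dict, weights: dict) -> list:
    """(vendor, score) pairs sorted from highest to lowest weighted score."""
    scored = [(vendor, weighted_score(r, weights)) for vendor, r in responses.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder requirements and responses.
weights = {"scale": 3, "flexibility": 2, "automation": 1}
responses = {
    "Vendor A": {"scale": 5, "flexibility": 3, "automation": 4},
    "Vendor B": {"scale": 4, "flexibility": 5},  # no answer on automation
}
for vendor, score in rank_vendors(responses, weights):
    print(f"{vendor}: {score:.2f}")
```

Agreeing on the weights before responses arrive is what keeps the grading fair; changing them afterwards tends to reward whoever the group already preferred.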


Execution for Digital Transformation

Step 3: Execution

After making the selection, the hard work starts: executing the digital transformation. Implementation should be a core concern during the selection process. Some transformation projects are short-term (less than 6 months); longer-term projects leverage milestones. Phasing allows transformation projects to achieve quick wins and set up long-term success. When building the business case, phasing allows prioritization of key objectives. I always recommend showing significant value within a quarter, and every quarter after. Regular improvement needs to be visible, or you will need significant executive sponsorship. Phasing will help drive the value and keep the project on task.
Selection of a project mantra defines how that project will run. Agile is very popular in IT projects. A DevOps approach allows your transformation project to become evergreen. For long-term projects where you need extreme flexibility, there is no better technique. For short-term, fixed-scope projects, waterfall is more than satisfactory.
When executing the vision, setting phased milestones provides the direction. Quarterly scheduled demonstrations keep the faith. Consistent, planned deliveries will confirm healthy project management. When it comes to execution focus, communication and delivery success should be the first priority. It’s always best to remember: if you have an unhealthy project, you will have poor deliverables.


ROI & Renewal for Digital Transformation

Step 4: ROI & Renewal

Once the project has achieved its main objectives, the question becomes “Now what?”. In every sense of the word, there is an “end-state” with regards to transformation. Once you get there, you will find that the goal posts have moved on you. This is another reason Agile methodologies are popular.

Meet with your workgroup and steering committee and ask: does it make sense to continue? One key issue my customers have seen is that transformation can lead to change for change’s sake. There must be clear needs to continue. You can always reduce team cadence and let the needs of the business set the tempo.
In summation, executing a digital transformation is a heavy commitment.   The facts are that the change required is necessary to address industry climate.    Nobody wants to buy a T1 anymore and that is a good thing!    The good news is that meeting the needs of business is possible and profitable. Good luck transforming!

Key lessons learned for Digital Transformation:

  • Collaboration = Commitment = Success – if you communicate effectively
  • Select the best process and tools for your team. Do not fall into conformity for its own sake.
  • Set achievable regularly delivered goals.   Show consistently increasing value to the business.
  • Focus on the present, but regularly plan for the future – and always communicate



Best Practices Guide for Application Monitoring

Don’t Fear the App

Digital service providers are being driven by customers into the world of applications. Gone are the days when simple internet access was all you had to provide. The more complex the service, the more valuable it is to the customer. As SMB customers embrace managed services, service providers are managing applications. While traditional network services are well defined, most applications are disparate and obtuse. Many of the customers I talk to see a real challenge in application monitoring.
Applications require the same care and feeding as any other tech, if not more. Defining services is easier, but components are vast and complex. Application discovery is still a new concept and is not yet 100%. Knowing the availability, performance, and capacity of an application is vital information. Having the heuristics, audit, and log information to troubleshoot allows for quicker resolutions. Performing end-to-end distributed active testing allows for basic verification. Passive activity scanning can ensure you know about problems as soon as end-customers do. Mission-critical apps need comprehensive monitoring and management, at a cost and value in line with that of the deployed application itself.
Applications can be very difficult to manage due to their inherent uniqueness. These custom digital services come in all forms and fashions, from printing queue services to real-time stock trading platforms. This series of blog articles aims to provide insight on how to plan for monitoring custom applications. Interested providers will be able to leverage these concepts for their own environment.

Discover the Application

The first part of any new application monitoring is to determine what the application consists of. Application discovery has two common flaws. The first is over-discovery: creating so much detail that association becomes complex and useless. The second is under-discovery, in which you are missing key associations, making the result equally useless. Discovery is like all other technology: it requires human guidance and oversight, so do not blindly depend upon it.

Website Monitoring

For our working example, I will use a custom application using a traditional 3-tier architecture stack. We first start with the presentation layer. It’s best to start by listing out what can go wrong. Network access might be down. Server failure is a possibility. The web server process (httpd) might no longer be running. Are the network storage directories mounted? Once you have your list, create your dashboard. Once you have your dashboard, link the necessary data to it (syslogs, traps, ping alarms). With a finished dashboard, you can automate it with policy. Create an alert that indicates an application error exists and points to the cause. If your assurance tool cannot perform these features, find one that does the job.
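
The “list what can go wrong, then alert with the cause” policy can be sketched as an ordered list of checks. This is an illustration only: the check functions are stand-ins for whatever your assurance tool actually provides (ping alarms, process checks, mount checks), and the names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Each check is (cause shown to the operator, function returning True when healthy).
Check = Tuple[str, Callable[[], bool]]


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run checks in dependency order and return the first failing cause."""
    for cause, is_healthy in checks:
        try:
            healthy = is_healthy()
        except Exception:
            healthy = False  # a check that cannot run is treated as a failure
        if not healthy:
            return cause
    return None


def application_alert(app_name: str, checks: List[Check]) -> Optional[str]:
    """Build a single alert that names the application and the probable cause."""
    cause = first_failure(checks)
    if cause is None:
        return None
    return f"{app_name}: application error, probable cause: {cause}"


# Stand-in checks for the presentation layer, ordered from network upwards.
presentation_checks = [
    ("network access down", lambda: True),
    ("server failure", lambda: True),
    ("httpd not running", lambda: False),
    ("network storage not mounted", lambda: True),
]
print(application_alert("webshop", presentation_checks))
```

Ordering the checks from the lowest layer upwards means the alert points at the most fundamental failure rather than at its symptoms.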

Database Monitoring

Now repeat the same for the data layer. Which database do you have? MySQL provides rich monitoring plugins. What are the standard database KPIs? Google provides plenty of opportunity to leverage 3rd party lessons learned. What else is important with a database? Backup and redundancy are key. Are those working? Repeat the dashboard-driven monitoring techniques from above. The result is two-thirds of your custom application monitored.

The hard part…

The most difficult layer to deal with is the application layer. Here there are no rules. The best case is talking to the developers. Get them to explain and define the known KPIs and failure points. Worst case, you can break down the logs, processes, and ports in use to check for basic things. Do not discount basic monitoring such as this; the more you know, the easier it is to troubleshoot. Run the dashboards you have as reports and get them into the inbox of the application team daily. This will assure the feedback you need to refine your monitoring policies.
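
As an example of the worst-case approach, breaking down logs can start with nothing more than counting errors per component and sending the result as the daily report. The log format assumed below (`<timestamp> <LEVEL> <component>: <message>`) is hypothetical; adjust the pattern to whatever the application actually writes.

```python
import re
from collections import Counter

# Assumed line format: "<timestamp> <LEVEL> <component>: <message>"
LINE = re.compile(r"^\S+\s+(?P<level>[A-Z]+)\s+(?P<component>[\w.-]+):\s+(?P<msg>.*)$")


def error_counts(lines) -> Counter:
    """Count ERROR/FATAL lines per component; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match and match.group("level") in {"ERROR", "FATAL"}:
            counts[match.group("component")] += 1
    return counts


def daily_report(lines, top: int = 5) -> str:
    """Plain-text summary of the noisiest components, suitable for an email body."""
    counts = error_counts(lines)
    if not counts:
        return "No errors logged."
    return "\n".join(f"{component}: {n}" for component, n in counts.most_common(top))


sample = [
    "2024-01-01T00:00:01Z INFO billing: started",
    "2024-01-01T00:00:02Z ERROR billing: timeout contacting gateway",
    "2024-01-01T00:00:03Z ERROR billing: timeout contacting gateway",
    "2024-01-01T00:00:04Z FATAL print-queue: spool directory missing",
    "not a log line",
]
print(daily_report(sample))
```

Even a report this basic gives the application team something to react to, which is usually what starts the conversation about real KPIs.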

Last advice…

– Be bold – Don’t be afraid of monitoring
– Communicate – Let the team see the results, if the data is wrong fix it
– If nobody cares about the data, you don’t have to keep it and don’t alert on it
– Alerts and notifications are only useful if they are rare and desired
My last point would be that if you are an SMB, your managed service provider should be able to perform custom application monitoring. If they can’t, have them call me…