Category Archives: Service Assurance

Understanding Service Assurance Correlation & Analytics

Data is good, right? The more data the better. In fact, there is a whole segment of IT dedicated to data, called big data analytics. Operations has data, tons of it. Every technology device spits out gigabytes of data a day. The question is figuring out how to filter that data: it's all about reducing a real-time stream of data into actionable information. Understanding service assurance correlation & analytics is all about focusing operations, and that focus can produce better business results. This blog details common concepts and what's available in the marketplace. I want to show the value of turning data analytics into actionable information that operations can execute on successfully.

Maturity Curve for Service Assurance Correlation & Analytics

Let’s talk about terminology first. Correlation versus analytics is an interesting subject. Most people I talk to consider correlation to apply only within fault management, while analytics covers time-series data like performance and logs. Some would disagree with that simplification, but we can use it here to avoid confusion. Whichever term you use, what we look for is reduction and simplification. The more actionable your information is, the quicker you can resolve problems.

Visual Service Assurance Correlation & Analytics

Visual Correlation & Analytics

The first step on the road to service assurance correlation and analytics is enabling a visual way to correlate data. Correlation is not possible if the data is missing, so unified collection is your starting point. Once you have the data co-located, you can drive operations activities to resolution. Technicians can leverage the tool to find the cause of a fault. Drill-down tools help uncover enough information that NOC techs can perform manual parent/child correlation.

Once that is in place, users of the assurance tool can also suppress, or hide, faults. Faults that are not impacting, or are known false errors, get sorted out as “noise”. Assurance systems then leverage third-party data to enrich faults. Enrichment allows faults to include more actionable data, which makes them easier to troubleshoot. All these concepts should be second nature. Operations should have all these visual features as part of the assurance platform; otherwise they are hamstrung.

Basic Service Assurance Correlation & Analytics

Basic Correlation & Analytics

Once you have a tool that has all your data, you will be swimming in the quantity of it. You must reduce that data stream. If not, you will overload the NOC, leaving them staring at the haystack instead of finding the needle. There are many basic-level strategies that enable that reduction.

First, look at de-duplication. This feature matches up repeat faults and data points, eliminating as much as 80% of duplicate data. Matching “up” to “down” messages can eliminate another 50% of your data stream. Reaping jobs can close out faults that are no longer “current” and age out stale log data. Another common feature is suppressing faults by time window, during scheduled maintenance or outside business hours. Threshold policies can listen to “noisy” data and create an alert only after X occurrences in Y minutes. These features should be available on any assurance platform. If yours lacks them, look to augment.
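The “X times in Y minutes” threshold policy is easy to sketch as a sliding window. This is a minimal illustration of the concept, not any particular vendor's implementation; the class and method names are my own:

```python
from collections import deque

class ThresholdPolicy:
    """Fire an alert when X matching faults arrive within Y minutes."""

    def __init__(self, x, y_minutes):
        self.x = x
        self.window = y_minutes * 60  # window size in seconds
        self.times = deque()          # timestamps of recent occurrences

    def record(self, event_time):
        """Record one fault occurrence (epoch seconds); return True when the alert fires."""
        self.times.append(event_time)
        # Drop occurrences that have aged out of the sliding window.
        while self.times and event_time - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) >= self.x
```

For example, `ThresholdPolicy(5, 10)` stays quiet on the first four occurrences and fires on the fifth one seen inside a 10-minute window, which is exactly the noise-reduction behavior described above.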

RCA Service Assurance Correlation & Analytics

Root Cause Analysis Correlation & Analytics

If you have a NOC with thousands of devices or tens of domains, you need cross-domain correlation. Root cause analysis is key to reducing the complexity of large access networks. Performing RCA across many technology domains is a core strategy for consolidated network operations: instead of playing the blame game, you know which layer is at fault. Leveraging topology to sift through faults is a common approach, though unfortunately not typical in operations, because topology data can be difficult to collect or of poor quality. Operations needs a strong discovery strategy to prevent this.

Cluster-based Correlation

Cluster-based correlation is another RCA strategy, one that does not rely upon topology data. The concept is to use trained or machine-learned data: a written profile aligns data when a certain pattern matches. Some tools create patterns during the troubleshooting process; others have algorithms that align faults by time and alerts. Once the pattern matches, the alert fires, rolling up the symptoms to reduce the event stream. This correlation method is popular, but hasn’t delivered strong results yet. Algorithms are the key here, and many question the ROI of a model that requires machine training.
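One simple flavor of this idea is time-based clustering. The rule below (faults within a fixed gap of each other form one cluster, with the earliest fault elected as the roll-up parent) is an illustrative assumption, not any vendor's algorithm:

```python
def cluster_by_time(faults, max_gap_seconds):
    """Group (timestamp, message) faults into clusters.

    Faults separated by more than max_gap_seconds start a new cluster.
    The earliest fault in each cluster acts as the roll-up parent that
    suppresses the symptom faults grouped beneath it.
    """
    clusters = []
    for ts, msg in sorted(faults):
        if clusters and ts - clusters[-1][-1][0] <= max_gap_seconds:
            clusters[-1].append((ts, msg))   # same burst: roll up as a symptom
        else:
            clusters.append([(ts, msg)])     # new burst: new roll-up parent
    return clusters
```

A link failure followed seconds later by BGP and OSPF alarms would collapse into one cluster, reducing three events in the NOC's view to a single parent.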

Customer Experience Assurance

Next, RCA enables operations to become more customer-centric. Service-oriented correlation allows operations to see the quality of their network through their customers’ eyes. Some call this functionality “service impact analysis”; I like the term “customer experience assurance”. Understanding which faults are impacting customers and their services enables more efficient operations. The holy grail of operations is focusing on root causes only, then prioritizing action by customer value.

Service Quality Management

Lastly, you can track customer temperature by moving beyond outages and into quality. It’s important to understand the KPIs of the service; this gives clarity on how well the service is performing. If you group these together, you simplify. While operations may ignore bumps and blips, you still need to track them. It’s important to understand that those blips are cumulative in the customer’s eyes. Once quality thresholds are violated, customers’ patience will be limited. Operations needs to know the temperature of the customer. Having service- and customer-level insight is important to providing high-quality service, and a feature like this drives better customer outcomes.

Cognitive Service Assurance Correlation & Analytics

Cognitive Correlation & Analytics

The nirvana of correlation and analytics includes a cognitive approach. It’s a simple concept: the platform listens, learns, and applies filtering and alerting. The practice is very hard. The available algorithms are diverse: either domain-specific (website log tracking) or generic in nature (Holt-Winters). Solutions need to be engineered to apply the algorithms only where they make sense.

Holt-Winters Use Case

One key use case is IP SLA WAN link monitoring. Latency across links should be consistent, so if you see a jump, that anomaly may matter. The Holt-Winters algorithm tracks abnormal behavior through seasonal smoothing. Applied to this use case, an alert is raised when the latency breaks from its normal behavior. This allows operations to avoid setting arbitrary threshold levels. Applying smart threshold alerting can reduce operational workload. Holt-Winters shows how cognitive analytics can drive better business results.
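To make the idea concrete, here is a pure-Python sketch of additive Holt-Winters (triple exponential) smoothing used for anomaly flagging, as in the latency use case above. The smoothing constants and the rule "flag when the residual exceeds k times the smoothed deviation" are illustrative assumptions, not a production-tuned implementation:

```python
def holt_winters_anomalies(series, season_len, alpha=0.3, beta=0.05,
                           gamma=0.2, k=4.0):
    """Return one boolean per point, flagging samples that break the
    series' learned level/trend/seasonal pattern (additive Holt-Winters)."""
    m = season_len
    level = sum(series[:m]) / m               # initial level: first-season mean
    trend = 0.0
    seasonal = [y - level for y in series[:m]]  # deviations from that mean
    dev = 0.0                                  # smoothed absolute residual
    flags = [False] * m                        # first season only initializes
    for t in range(m, len(series)):
        forecast = level + trend + seasonal[t % m]
        resid = series[t] - forecast
        # Flag only after a warm-up season; the floor avoids a zero threshold.
        flags.append(t >= 2 * m and abs(resid) > k * max(dev, 1e-9))
        dev = 0.1 * abs(resid) + 0.9 * dev
        last_level = level
        level = alpha * (series[t] - seasonal[t % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (series[t] - level) + (1 - gamma) * seasonal[t % m]
    return flags
```

Fed a WAN link's periodic latency samples, this flags the one poll that jumps far outside the learned daily rhythm, without anyone ever choosing a fixed millisecond threshold.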

Adaptive Filtering Use Case

Under the basic correlation area I listed dynamic filtering: if a fault happens X times in Y minutes, create alert Z. This generic policy is helpful, but the more you use it, the more you will realize you need something smarter. Adaptive filtering using cognitive algorithms allows for a more comprehensive solution. While the X-Y-Z example depends upon two variables, an adaptive algorithm can leverage hundreds. How about understanding whether the device is in a lab or is a core router? Does the fault occur every day at the same time? Does it precede a hard failure?

You can leverage all these variables to create an adaptive score. This score acts as an operational temperature gauge or noise level. NOC techs can cut noise during outages, raise it during quiet times, or sort by it to understand “what’s hot”. Adaptive filtering gives operations the ability to slice and dice their real-time fault feeds. This feature is a true force multiplier.
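The adaptive-score idea can be illustrated with a toy weighted-scoring function. The attributes and weights below are hypothetical examples I made up for illustration; a real implementation would learn weights across hundreds of variables rather than hand-pick four:

```python
# Hypothetical weights: higher score = hotter, more operationally urgent.
WEIGHTS = {
    "core_device": 40,            # core router, not lab gear
    "recurs_daily": -20,          # same fault daily at the same time: likely noise
    "precedes_hard_failure": 30,  # historically followed by an outage
    "customer_impacting": 25,
}

def adaptive_score(fault):
    """Compute a 0-100 'operational temperature' for a fault dict whose
    keys are booleans matching the WEIGHTS table."""
    base = 25  # every fault starts warm; context moves it up or down
    score = base + sum(w for attr, w in WEIGHTS.items() if fault.get(attr))
    return max(0, min(100, score))

def hottest(faults, floor=50):
    """NOC view: keep faults above the noise floor, hottest first."""
    return sorted((f for f in faults if adaptive_score(f) >= floor),
                  key=adaptive_score, reverse=True)
```

Raising or lowering `floor` is the "temperature gauge" knob: crank it up during an outage storm to see only what's hot, drop it during quiet hours to chase the long tail.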

The Value of Correlation & Analytics

Understand the Value in Service Assurance Correlation & Analytics

The important part of correlation & analytics within service assurance is its value. You must understand what is readily available and its value to operations. This varies greatly from customer to customer and environment to environment. You have to decide how far down the rabbit hole you want to go. Always ask the question “How does that help us?”. If a policy is not moving the needle, put it on the back burner.

If a policy is not saving 4-8 hours of effort a week, it’s just not worth the development effort. Find your quick wins first. Keep a list in a system like Jira and track your backlog. You may want to leverage an agile methodology like DevOps if you want to get serious. Correlation and analytics are force multipliers. They allow operations to be smarter and act more efficiently. These are worthwhile pursuits, but make sure to practice restraint. Focus on the achievable; you don’t need to re-invent the wheel. Tools are out there that provide all these features. The question to focus on is “Is it worth my time?”.

Shawn Ennis

About the Author

Serial entrepreneur and operations subject matter expert who likes to help customers and partners achieve solutions that solve critical problems. Experience in traditional telecom, ITIL enterprise, global managed service providers, and datacenter hosting providers. Expertise in optical DWDM, MPLS networks, MEF Ethernet, COTS applications, custom applications, SDDC virtualization, and SDN/NFV virtualized infrastructure. Based in the Dallas, Texas area and currently working for one of his founded companies – Monolith Software.

IoT Service Assurance Key Concepts

The IoT/IoE generation has been born, and countless things are about to be inter-connected. The hype is non-stop, but many things are becoming a reality. AT&T and Maersk closed a deal back in 2015 that recently became a reality for asset tracking of cold shipping containers. Uber is providing driverless trucks to deliver beer, while GPS trackers are being used to track the elderly. These services are becoming ubiquitous and common. The use cases have variety and are growing in depth, but IoT is still a very pioneering field. If IoT managed services are to exist, operations will need to manage them. The goal here is to start asking key questions, and the hope is that through analysis we can provide some answers. Let’s discuss the key concepts driving the new field of IoT Service Assurance.

Key Perspectives for IoT Service Assurance

For any IoT service, you must understand who uses it and who provides it. As I explain it, there are three key perspectives on IoT services. First, you have the network provider. They provide the network access for the “thing”; the “network” could mean LTE, Wifi, or any other technology. Network providers see network quality as the focus, similar to typical mobile providers. Compare that to IoT services monitored with an application focus: monitoring the availability and performance of the “things” to make sure they are working. Lastly, you may not care about the “things” at all. Perhaps you only care about the data from them; performing correlation and understanding the “sum of all parts” would be the key focus. These perspectives drive your requirements and the value proposition. Through them, you can define quality and success criteria for your IoT services.

Key Requirements of IoT Service Assurance

Before we get too far along, let’s first talk about terminology. In the world of IoT, what is a device? We have to ask: is this “thing” a device? In the world of mobility, the handset is not a device, it’s an endpoint. So is the pallet being monitored in the cold shipping container a device or an endpoint? Like the perspectives that drive your requirements, we should agree on terminology. Let’s talk through some use cases to better understand typical requirements.

Cold Storage Tracking IoT Service Assurance

Smart Cold Storage

In the Maersk use case, let’s say the initial roll-out is 250k sensors on pallets. These sensors, at regular intervals, report data in via wireless burst communications. The data includes KPIs that drive visibility and business intelligence. Some common examples I have found are temperature, battery life, and vibration rate. Other environmental KPIs may be required: light levels, humidity, and weight. As we have discussed, location information with signal strength could be useful. We can track in real-time to provide trending and prediction. One would think it would be best to know about a failure before putting the container on the boat.

The bottom line is roughly 25 KPIs per poll interval. Let’s do some math for performance data: 250k sensors * 25 KPIs * 4 polls/hour (15-minute polls) * 24 hours/day = 600 million data points per day. If you were to use standard database storage (say MySQL) you would need roughly 200GB per day. Is keeping the sensor data worth $300 per month of data on AWS EC2? Storage is so inexpensive that real-time monitoring of sensor data becomes realistic.
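The back-of-the-envelope math above fits in a few lines of code. The bytes-per-datapoint figure is an assumption on my part (roughly what a row in a conventional SQL store costs, indexes included), chosen to line up with the ~200GB/day estimate:

```python
def daily_volume(sensors, kpis, polls_per_hour=4, bytes_per_point=333):
    """Estimate data points/day, storage GB/day, and ingest rate per second.

    bytes_per_point is an assumed average row cost in a conventional
    SQL store, including index overhead.
    """
    points = sensors * kpis * polls_per_hour * 24
    gb_per_day = points * bytes_per_point / 1e9
    per_second = points / 86_400
    return points, gb_per_day, per_second

# Maersk-style cold-chain example: 250k sensors, 25 KPIs, 15-minute polls.
points, gb, rate = daily_volume(250_000, 25)
```

The same function reproduces the truck and thermostat estimates later in this post by swapping in their sensor and KPI counts, and the per-second figure feeds directly into the API sizing discussion below.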

Faults are different. Some could include failed reconnects and emergency-button-pushed scenarios. These faults provide opportunities: shipping personnel can fix a container before the temperature gets too warm, saving valuable merchandise from spoilage. Together this information combines to provide detailed real-time IoT Service Assurance views.

Driver-less Trucks for IoT Service Assurance

Driverless Trucks Use Case

Let’s look at another use case: Uber with driverless trucks. The Wired article does not include how many vehicles, so let’s look at UPS. UPS has over 100k delivery trucks. Imagine if these logistics were 100% automated; this would create tons of “things” on the network. The network, controller, and data would work together to provide a quality IoT service.

First, let’s look at performance data. The KPIs would be similar to the Maersk example. Speed, direction, location, and range would be valuable real-time data. Service KQIs like ETA and number of stops remaining would drive efficiencies. Doing the same math as the Maersk example: 100k trucks * 50 KPIs * 4 polls/hour (15-minute polls) * 24 hours/day = 480 million data points per day, so about $240 per day on AWS. This shows that the storage requirements are practical for driverless logistics.

Some faults would include vital real-time activity, perhaps an ‘out-of-gas’ event or network errors. Getting real-time alerts on a crash would definitely be useful, so fault management would be a necessity in this use case. Again, there are plenty of reasons to create and leverage real-time alerts.

Smart Home for IoT Service Assurance

Another use case would be smart home monitoring, like Google Nest or Ecobee. These OTT IoT providers track and monitor things like temperature and humidity. There is no fault data and no analytics. The number of homes monitored by Nest or Ecobee is not readily available on the internet. According to Dallas News, 8 million thermostats are sold yearly. According to Fast Company, Ecobee has 24% marketshare, so roughly 2 million homes per year. Ecobee has been in business for more than 5 years, so assume they have 10 million active thermostats. Doing some math: 10M homes * 10 KPIs * 4 polls/hour (15-minute polls) * 24 hours/day = 9.6 billion data points per day, so around $4,800 per day on AWS.

IoT Service Assurance is Practical

What is interesting about these use cases is their practicality. Scalability is not a problem with modern solutions. All three cases show that, from any perspective, real-time IoT service assurance is achievable. I am amazed how achievable monitoring can be for complex IoT services. Now you must ask the questions “why” and “how”. To answer them, you must understand how flexible your tools are and what value you can get from them.

Understanding Flexibility of IoT Service Assurance

Let’s discuss flexibility. First, how difficult is collecting this data? Let’s focus on the world of open APIs, where the expectation is that these messages come through a load-balanced REST application server. 600 million hits per day works out to roughly 7k hits/sec, which is well within Apache and load balancer tolerances. As long as the messaging follows open API concepts, collection should be practical. So from a flexibility standpoint, assuming you embrace open APIs, this is practical as well.

Understanding the Value of IoT Service Assurance

It’s a fact: real-time is a key need in IoT Service Assurance. If whatever you want to track can wait 24-48 hours before you need to know it, you can achieve it with a reporting tool. If all you need is to store the data and put a dashboard/reporting engine on top, then this becomes easy. Open-source databases like MariaDB are low-cost and widely available, and COTS dashboard and reporting tools like Tableau provide a cost-effective solution.

In contrast, real-time means you need to know immediately that a cold storage container has failed, automating dispatch to find the closest human and texting that operator to fix the problem. Real-time means you have a delivery truck on the side of the road and need to dispatch a tow truck. Real-time IoT Service Assurance means massive collection, intelligent correlation, and automated remediation. Now consider the OTT smart home use case: the Nest thermostat is not going to call the firehouse when it reaches 150F. Everything is use-case dependent, so you must let your requirements dictate the tool used.

Lessons Learned for IoT Service Assurance

  • IoT-based managed services are currently available and growing
  • Assuring them properly will require new concepts around scalability and flexibility
  • With IoT, you must always ask how far down is it worth monitoring
  • Almost all requirements include some sort of geospatial tracking or correlation
 My advice on IoT Service Assurance
  • As always, follow your researched requirements.   Get what you need first, then worry about your wants.
  • Make sure you have tools with a focus on flexibility, scale, and automation.   This vertical has many fringe use cases and they are growing.
  • IoT unifies network, application, and data management more than any other technology. Having a holistic approach can provide a multiplying and accelerating effect.


Best Practices Guide for Application Monitoring

Don’t Fear the App

Digital service providers are being driven by customers into the world of applications. Gone are the days when simple internet access is all you have to provide. The more complex the service, the more value it has to the customer. As SMB customers embrace managed services, service providers are managing applications. While traditional network services are well defined, most applications are disparate and obtuse. Many of the customers I talk to see a real challenge in application monitoring.
 
Applications require the same, if not more, care and feeding as any other technology. Defining services is easier, but the components are vast and complex. Application discovery is still a new concept and is not yet 100%. Knowing the availability, performance, and capacity of an application is vital information, and having the heuristics, audit, and log information to troubleshoot allows for quicker resolutions. Performing end-to-end distributed active testing allows for basic verification, while passive activity scanning can ensure you know about problems as soon as end-customers do. Mission-critical apps need comprehensive monitoring and management, commensurate with the cost and value of the application deployed.
 
Applications can be very difficult to manage due to their inherent uniqueness. These custom digital services come in all forms and fashions, from print queue services to real-time stock trading platforms. This series of blog articles provides insight on how to plan for monitoring custom applications. Interested providers will be able to leverage these concepts in their own environments.
 

Discover the Application

The first part of any new application monitoring effort is to determine what the application consists of. Application discovery has two common flaws. The first is over-discovery: creating so much detail that the associations become complex and useless. The second is under-discovery: missing key associations, which is equally useless. Discovery is like all other technology; it requires human guidance and oversight, so do not blindly depend upon it.
 

Website Monitoring

For our working example, I will use a custom application on a traditional 3-tier architecture stack. We first start with the presentation layer. It’s best to begin by listing out what can go wrong. Network access might be down. Server failure is a possibility. The web server process (httpd) might no longer be running. Are the network storage directories mounted? Once you have your list, create your dashboard. Once you have your dashboard, link the necessary data to it (syslogs, traps, ping alarms). With a finished dashboard, you can automate it with policy. Create an alert that indicates an application error exists and points to the cause. If your assurance tool cannot perform these features, find one that does the job.
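The dashboard-to-policy step above can be sketched as a simple parent/child evaluation over the check results. The check names and their ordering are illustrative assumptions for this working example:

```python
# Ordered from most to least fundamental: the first failing layer is the
# probable root cause, and everything beneath it is treated as a symptom.
CHECK_ORDER = ["network", "server", "httpd_process", "nfs_mounts"]

def evaluate_web_tier(checks):
    """checks: dict of check name -> bool (True = healthy).

    Returns (alert_text, suppressed_symptoms), or (None, []) when healthy.
    """
    failed = [c for c in CHECK_ORDER if not checks.get(c, True)]
    if not failed:
        return None, []
    root, symptoms = failed[0], failed[1:]
    return f"application error: probable cause is {root}", symptoms
```

So a down server that also takes httpd with it produces one alert pointing at the server, with the httpd failure suppressed as a symptom, which is exactly the "points to the cause" behavior the policy should automate.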
 

Database Monitoring

Now repeat the same for the data layer. Which database do you have? MySQL provides rich monitoring plugins. What are the standard database KPIs? Google provides plenty of opportunity to leverage third-party lessons learned. What else is important with a database? Backup and redundancy are key; are those working? Repeat the dashboard-driven monitoring techniques from above. The result is two-thirds of your custom application monitored.
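The same technique applies to database KPIs. The counter names below match MySQL's `SHOW GLOBAL STATUS` / `SHOW VARIABLES` output, but the thresholds are illustrative assumptions you would tune for your own environment:

```python
def mysql_kpi_alerts(status, variables):
    """Evaluate a few standard MySQL KPIs from SHOW GLOBAL STATUS /
    SHOW VARIABLES snapshots (passed in as plain dicts).

    Returns a list of alert strings; empty list means healthy.
    """
    alerts = []
    # Connection saturation: sustained use above 80% of max_connections.
    used = status["Threads_connected"] / variables["max_connections"]
    if used > 0.80:
        alerts.append(f"connection pool {used:.0%} used")
    # Buffer pool efficiency: disk reads (misses) should be rare.
    reqs = status["Innodb_buffer_pool_read_requests"]
    if reqs and status["Innodb_buffer_pool_reads"] / reqs > 0.05:
        alerts.append("InnoDB buffer pool hit rate below 95%")
    return alerts
```

In practice a collector would populate the two dicts from the database every poll cycle and feed the resulting alerts into the same dashboard built for the presentation layer.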
 

The hard part…

The most difficult layer to deal with is the application layer. Here there are no rules. The best case is talking to the developers: get them to explain and define the known KPIs and failure points. Worst case, you can break down the logs, processes, and ports in use to check for basic things. Do not discount basic monitoring such as this; the more you know, the easier it is to troubleshoot. Run the dashboards you have as reports and get them into the inbox of the application team daily. This will ensure the feedback you need to refine your monitoring policies.
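The worst-case log breakdown can start as simply as tallying error patterns per category. The patterns below are hypothetical, stand-ins for whatever the application actually emits:

```python
import re
from collections import Counter

# Hypothetical patterns for an unknown application's log stream; refine
# these as the application team's daily report feedback comes in.
ERROR_PATTERNS = {
    "db_timeout": re.compile(r"timeout.*database", re.IGNORECASE),
    "oom": re.compile(r"out of memory", re.IGNORECASE),
    "generic_error": re.compile(r"\bERROR\b"),
}

def scan_log(lines):
    """Tally error categories; each line counts once, for its first match."""
    counts = Counter()
    for line in lines:
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break
    return counts
```

Even a crude tally like this, emailed daily, gives the application team something concrete to react to, and their corrections become your refined patterns.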
 

Last advice…

– Be bold – Don’t be afraid of monitoring
– Communicate – Let the team see the results, if the data is wrong fix it
– If nobody cares about the data, you don’t have to keep it and don’t alert on it
– Alerts and notifications are only useful if they are rare and desired
 
My last point: if you are an SMB, your managed service provider should be able to perform custom application monitoring. If they can’t, have them call me…

Intelligent Approach to Smart Cities

Smarter Smart Cities

At the Smart City Dublin forum, the subject was how municipalities can save money and better enable citizens. These opportunities are not driven by cities, but by service providers offering new services. Cities have assets, like right-of-ways. They have advancing needs, like free wifi to empower tourism. Governments have challenges, like shrinking budgets and stodgy policies. While other providers may shy away, many see these challenges as possible revenue.

Simple Concept

There are plenty of opportunities for engagement. Right-of-ways (lamp posts) are available, and cities can engage vendors to install a wifi network that generates advertising revenue. Smart cities can share in the ad-based profits. The service also provides new tourist-engaging services to grow the local community, and enables a portal to show off the local digital economy. Multi-tenant access enables other services: shared utilities like garbage and power can alert citizens in real-time, and digital services can enrich the people’s knowledge and grow the city’s automation potential. The bottom line is reduction of cost and growth of engagement.

Where does service assurance come in?

The digital world is a unifying force. Providing a single pane of glass is common sense, but unfortunately not commonplace. Once deployed, the quality of a city’s services defines its brand. The analog and digital services will need assurance. Proactive engagement is no longer a nice-to-have; it’s expected. A proactive portal, empowered by service assurance, enables the smart city revolution.
 
Service providers, governments, and equipment manufacturers are aligning around new services (i.e., Dublin). The question is: how will service providers assure the quality and engage the populace in real-time?

Follow Shawn on Twitter

Predicting the IoT World

What are we going to do in the IoT world?

My typical response to service providers is, “well, that was last week…”    All kidding aside, we live in the connected generation.   Network access is the new oxygen.   The price to be paid is complexity and scale.   A good reference for what IoT use cases exist is this bemyapp article about Ten B2B use cases for IoT.

Common Threads

It’s best to categorize them into three buckets. First is environmental monitoring, such as smart meters that reduce human interaction requirements. Tracking logistics through RFID is another common trend in IoT communities. The most common is client monitoring: with mobility, handset tracking and trending is common in CEM; in an access network, it’s monitoring the cable modems of millions of customers. Whichever category your use case falls into, the challenges will be similar. How do you deal with the fact that your network becomes tens of millions of small devices instead of thousands of regular-sized devices? How do you handle the fact that billions of pieces of data need to be processed, but only a fraction is immediately useful? How can you break the network down into human-understandable segmentations?

The solution is simple

With a single source of truth, you can see the forest through the trees. While the “things” in IoT are important, how they relay information and perform their work is equally important. Monitoring holistically allows better understanding of the IoT environment; single point solutions will not address IoT. Normalizing data enables higher scale while maintaining high reliability.

How to accelerate

Now that the network has been unified into a single source of truth, operations can start simplifying their workload. First step: become service-oriented. Performance, fault, and topology are too much data; it’s the services you must rely upon. How are they doing, what are the problems, how do you fix them, and where do you need to augment your network? Next up, correlate everything; you need to look at the 1% of the 1% of the 1% to be successful. KQIs are necessary because the trees in the forest are anecdotal information, the EFFECT. Seeing the forest (as the KQI) allows you to become proactive and move quicker, and to be more decisive because you understand the trends and what is normal. It’s time to stop letting the network manage you and start managing your network.

End Goal is Automation

After unifying your view and simplifying your approach, it’s time to automate. The whole point of IoT is massive scale and automation, but if your SA solution cannot integrate openly with the orchestration solution, how will you ever automate resolution & maintenance? We all must realize that human-based lifecycle management is not possible at IoT scale. It’s time to match the value of your network with the value of managing it.

Assuring quality real-time services

Traveling from trade shows

Coming back from a trade show, I took an Uber to the airport and, oddly enough, experienced the value of real-time services firsthand. Like most, I leverage the Uber ride-share service. My reasons are the same as others’: convenience, price, quality, etc. In the past, I have typically taken taxis, which are twice the cost.

As we were driving, the driver’s phone beeped. It told him that there was an accident up ahead and we needed to divert. Interestingly, my phone beeped as he said this, and I got the same message showing a red line up ahead. The driver, a long-time taxi cab driver, stated this was one of the reasons he switched to Uber. Because other Uber drivers are constantly, autonomously reporting traffic (far more than cab drivers do), he spends more time driving and less time in traffic. He drives more customers and makes considerably more money. The customers are happier, and online bill pay means less hassle: he drives, and that is all he worries about. The cost of Uber? For him, nothing; the passengers pay that. He drives and gets paid. And he is nice: he offered me a paper (quaint) and a free bottle of water before boarding.

The moral of the story…

Uber is based in California, 6,000 miles and a 9-hour time difference away. Using AWS hosting, it enables real-time, automatic cross-matching of traffic to make lives a little easier a world away. The micro and the macro are at work here. This 60+ year-old driver, driving all his life, reaps the benefit. I pay an extra 2e for a 40% reduction in rates, a smoother ride in a newer car, and a nicer driver; that is value for the customer. What makes this miracle possible? Real-time digital services. Uber and others like them are winning the battle by pushing real-time digital services using LTE, competing against taxi cabs with CB radios. As the newspaper industry has already realized, the taxi cab industry will soon become… quaint…

My question to you: what is your real-time service? What does it mean to your business? How do you assure that it continues to be real-time?

Drop me a message @Shawn_Ennis, I would love to talk about your real-time services.

PS. Thanks, T-Mobile, for including international roaming. Uber would not have been possible without you…