
Understanding Service Assurance Correlation & Analytics

Data is good, right? The more data the better. In fact, there is a whole segment of IT related to data called big data analytics. Operations has data, tons of it. Every technology device spits out gigabytes of data a day. The question is how to filter that data. It's all about reducing that real-time stream of data into actionable information. Understanding service assurance correlation & analytics is all about focusing operations, and that focus can produce better business results. This blog details common concepts and what's available in the marketplace. I want to show the value of turning data analytics into actionable information that operations can execute on successfully.

Maturity Curve for Service Assurance Correlation & Analytics

Let's talk about terminology first. Correlation versus analytics is an interesting subject. Most people I talk to consider correlation to live only within fault management, while analytics covers time-series data like performance and logs. Some would disagree with that simplification, but we can use it here to avoid confusion. Whichever term you use, what we look for is reduction and simplification. The more actionable your information is, the quicker you can resolve problems.


Visual Correlation & Analytics

The first step on the road to service assurance correlation and analytics is enabling a visual way to correlate data. Correlation is not possible if the data is missing, so unified collection comes first. Once you have the data co-located, you can drive operations activities to resolution. Technicians can leverage the tool to find the cause of the fault, and drill-down tools can help uncover enough information for NOC techs to perform manual parent/child correlation.

From there, users of the assurance tool can also suppress, or hide, faults. Faults that are not impacting, or are known false errors, get sorted out as "noise". Assurance systems then leverage third-party data to enrich faults. Enrichment allows faults to include more actionable data, which makes them easier to troubleshoot. All these concepts should be second nature. Operations should have all these visual features as part of the assurance platform; otherwise they are hamstrung.
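
To make enrichment concrete, here is a minimal sketch that attaches CMDB-style context to a raw fault. The fault fields, the `cmdb_lookup` dictionary, and the device names are all hypothetical; a real assurance platform would pull this from inventory or a third-party source.

```python
# Minimal fault-enrichment sketch (illustrative field names, hypothetical CMDB data).
def enrich_fault(fault, cmdb_lookup):
    """Attach third-party (CMDB) context to a raw fault so it is more actionable."""
    record = cmdb_lookup.get(fault["node"], {})
    fault["site"] = record.get("site", "unknown")
    fault["owner_group"] = record.get("owner_group", "unassigned")
    fault["customer_count"] = record.get("customer_count", 0)
    return fault

raw_fault = {"node": "edge-rtr-01", "severity": "major", "summary": "Link down Gi0/1"}
cmdb = {"edge-rtr-01": {"site": "DAL-03", "owner_group": "IP Core", "customer_count": 42}}
print(enrich_fault(raw_fault, cmdb))
```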


Basic Correlation & Analytics

Once you have a tool that has all your data, you will be swimming in the quantity of that data. You must reduce that data stream. If not, you will overload the NOC looking at the haystack instead of the needle. There are many basic-level strategies that allow that reduction.

First, look at de-duplication. This feature matches up repeat faults and data points, which can eliminate as much as 80% of duplicate data. Matching "up" to "down" messages can cut another 50% of your data stream. Reaping jobs can close out faults that are no longer "current" and age out stale log data. Another common feature is suppressing faults by time window, during scheduled maintenance or outside business hours. Threshold policies can listen to "noise" data and create an alert after X occurrences in Y minutes. These features should be available on any assurance platform. If yours lacks them, look to augment.
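
To show the "X times in Y minutes" idea in code, here is a minimal sketch of a combined de-duplication counter and threshold policy. The fault key format, window, and counts are illustrative assumptions, not taken from any particular product.

```python
# Illustrative X-in-Y threshold policy with de-duplication of repeat faults.
import time
from collections import defaultdict, deque

class ThresholdPolicy:
    """Raise an alert when the same fault key occurs X times within Y seconds."""
    def __init__(self, x_times=5, y_seconds=300):
        self.x_times = x_times
        self.y_seconds = y_seconds
        self.history = defaultdict(deque)  # fault key -> recent timestamps

    def ingest(self, key, ts=None):
        ts = ts if ts is not None else time.time()
        window = self.history[key]
        window.append(ts)
        # Drop occurrences that fell outside the Y-second window.
        while window and ts - window[0] > self.y_seconds:
            window.popleft()
        if len(window) >= self.x_times:
            return f"ALERT: {key} occurred {len(window)} times in {self.y_seconds}s"
        return None  # de-duplicated / below threshold, no alert raised

policy = ThresholdPolicy(x_times=3, y_seconds=60)
for _ in range(3):
    result = policy.ingest("linkDown:edge-rtr-01")
print(result)
```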


Root Cause Analysis Correlation & Analytics

If you have a NOC with thousands of devices or tens of domains, you need cross-domain correlation. Root cause analysis is key to reducing the complexity of large access networks. Performing RCA across many technology domains is a core strategy for consolidated network operations. Instead of playing the blame game, you know which layer is at fault. Leveraging topology to sift through faults is a well-known approach, yet it is not typical in operations, because topology data can be difficult to collect or of poor quality. Operations needs a strong discovery strategy to prevent this.
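
As a rough sketch of topology-based RCA, assume a simple upstream map (child device to parent device) built from discovery data: if a fault's parent is also down, the child fault is suppressed as a symptom. The device names and topology are illustrative.

```python
# Minimal topology-based RCA sketch: suppress child faults whose upstream parent is also down.
topology = {                      # child -> upstream parent (from discovery, illustrative)
    "access-sw-101": "dist-rtr-10",
    "access-sw-102": "dist-rtr-10",
    "dist-rtr-10": "core-rtr-01",
}

down_nodes = {"dist-rtr-10", "access-sw-101", "access-sw-102"}

def classify(node):
    parent = topology.get(node)
    if parent in down_nodes:
        return "symptom (suppressed, parent {} is down)".format(parent)
    return "root cause candidate"

for node in sorted(down_nodes):
    print(node, "->", classify(node))
```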

Cluster-based Correlation

Cluster-based correlation is another RCA strategy, one that does not rely upon topology data. The concept is to use trained or machine-learned data: a profile aligns data when a certain pattern is matched. Some tools create patterns during the troubleshooting process; others have algorithms that align faults by time and alert content. Once the pattern matches, the alert fires, rolling up the symptoms to reduce the event stream. This correlation method is popular but has not produced strong results yet. Algorithms are the key here, and many challenge the ROI of a model that requires machine training.
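
A minimal sketch of the time-based clustering idea: group alerts that arrive within a short window of each other and roll them up under the earliest one. The window size and alert format are illustrative assumptions; real products use much richer pattern matching.

```python
# Group alerts that occur within `window` seconds of each other (illustrative clustering).
def cluster_by_time(alerts, window=30):
    """alerts: list of (timestamp_seconds, summary); returns list of clusters."""
    clusters = []
    for ts, summary in sorted(alerts):
        if clusters and ts - clusters[-1][-1][0] <= window:
            clusters[-1].append((ts, summary))
        else:
            clusters.append([(ts, summary)])
    return clusters

alerts = [(100, "linkDown core-rtr-01"), (105, "bgpPeerLost edge-rtr-02"),
          (112, "ospfNeighborDown dist-rtr-10"), (400, "fanFailure access-sw-101")]
for c in cluster_by_time(alerts):
    print("roll-up:", c[0][1], "| symptoms:", len(c) - 1)
```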

Customer Experience Assurance

Next, RCA enables operations to become more customer-centric. Service-oriented correlation allows operations to see the quality of their network through their customers' eyes. Some call this functionality "service impact analysis"; I like the term "customer experience assurance". Understanding which faults are impacting customers and their services enables more efficient operations. The holy grail of operations is focusing on only root causes, then prioritizing action by customer value.
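
Here is a rough sketch of that prioritization, assuming a hypothetical device-to-customer mapping and made-up customer-value weights; a real system would drive this from service inventory.

```python
# Illustrative service impact sketch: rank root-cause faults by the customer value they affect.
service_map = {                 # device -> customers it carries (hypothetical inventory data)
    "core-rtr-01": ["AcmeBank", "RetailCo"],
    "access-sw-101": ["SmallShop"],
}
customer_value = {"AcmeBank": 100, "RetailCo": 40, "SmallShop": 5}

def impact_score(device):
    return sum(customer_value.get(c, 0) for c in service_map.get(device, []))

root_causes = ["access-sw-101", "core-rtr-01"]
for device in sorted(root_causes, key=impact_score, reverse=True):
    print(device, "impact score:", impact_score(device))
```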

Service Quality Management

Lastly, you can track customer temperature by moving beyond outages and into quality. It's important to understand the KPIs of the service; this gives clarity on how well the service is performing, and grouping them together simplifies the picture. While operations may ignore bumps and blips, you still need to track them. It's important to understand that those blips are cumulative in the customers' eyes. If the quality threshold is violated, customer patience will be limited, and operations needs to know the temperature of the customer. Having service- and customer-level insight is important to providing high-quality service, and a feature like this drives better customer outcomes.
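
A minimal sketch of the cumulative-blip idea: accumulate minor KPI violations over a period and flag the service once the total crosses a quality budget. The blip names, weights, and budget are assumptions for illustration only.

```python
# Illustrative service quality tracker: minor blips accumulate toward a quality budget.
blip_weight = {"latency_spike": 1, "packet_loss_burst": 2, "brief_outage": 5}  # assumed weights
quality_budget = 10  # assumed per-period budget before the service is flagged

def service_temperature(events):
    """events: list of blip names recorded for one service over the period."""
    score = sum(blip_weight.get(e, 1) for e in events)
    status = "AT RISK" if score >= quality_budget else "OK"
    return score, status

events = ["latency_spike", "latency_spike", "packet_loss_burst", "brief_outage", "brief_outage"]
print(service_temperature(events))   # -> (14, 'AT RISK')
```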


Cognitive Correlation & Analytics

The nirvana of correlation and analytics is a cognitive approach. It's a simple concept: the platform listens, learns, and applies filtering and alerting. In practice it is very hard. The algorithms available are diverse; they are either domain-specific (website log tracking) or generic in nature (Holt-Winters). Solutions need to be engineered to apply the algorithms only where they make sense.

Holt-Winters Use Case

One key use case is IPSLA WAN link monitoring. Latency across links must be consistent; if you see a jump, that anomaly may matter. The Holt-Winters algorithm tracks abnormal behavior through seasonal smoothing. Applied to this use case, an alert is raised when latency breaks from its normal pattern. This allows operations to avoid setting arbitrary threshold levels, and applying smart threshold alerting can reduce operational workload. Holt-Winters shows how cognitive analytics can drive better business results.
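
A minimal sketch of that idea using the Holt-Winters implementation in statsmodels (`ExponentialSmoothing`), assuming hourly latency samples with a daily 24-sample season; the synthetic data and the 3-sigma band are illustrative, not a recommended production setup.

```python
# Holt-Winters anomaly sketch: flag latency samples that break from the seasonal forecast.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
hours = np.arange(24 * 14)                                   # two weeks of hourly IPSLA samples
latency = 20 + 5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.5, hours.size)

train, actual_next = latency[:-1], latency[-1]
model = ExponentialSmoothing(train, trend="add", seasonal="add", seasonal_periods=24).fit()
forecast = model.forecast(1)[0]
residual_std = np.std(train - model.fittedvalues)

if abs(actual_next - forecast) > 3 * residual_std:           # smart threshold, not an arbitrary one
    print(f"ALERT: latency {actual_next:.1f} ms vs expected {forecast:.1f} ms")
else:
    print(f"OK: latency {actual_next:.1f} ms within expected range")
```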

Adaptive Filtering Use Case

Under the basic correlation area I listed dynamic filtering: if a fault happens "X times in Y minutes", create alert Z. This generic policy is helpful, but the more you use it, the more you realize you need something smarter. Adaptive filtering using cognitive algorithms allows for a more comprehensive solution. While the X-Y-Z example depends upon two variables, an adaptive algorithm can leverage hundreds. How about understanding whether the device is a lab box or a core router? Does the fault occur every day at the same time? Does it precede a hard failure?

You can leverage all these variables to create an adaptive score. This score becomes an operational temperature gauge or noise level. NOC techs can cut noise during outages, increase it during quiet times, or sort by it to understand "what's hot". Adaptive filtering gives operations the ability to slice and dice their real-time fault feeds. This feature is a true force multiplier.
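
A rough sketch of such an adaptive score, using just three of the signals mentioned above (device role, daily recurrence, precedes-hard-failure history) with made-up weights; a real implementation would learn weights over far more variables.

```python
# Illustrative adaptive noise score: combine several signals into one sortable "temperature".
def adaptive_score(fault):
    score = 0.0
    score += {"core": 3.0, "distribution": 2.0, "access": 1.0, "lab": 0.1}.get(fault["role"], 1.0)
    score += 2.0 if fault["precedes_hard_failure"] else 0.0     # learned from history (assumed)
    score -= 1.5 if fault["recurs_daily_same_time"] else 0.0    # likely a benign pattern
    return score

faults = [
    {"id": 1, "role": "lab",  "precedes_hard_failure": False, "recurs_daily_same_time": True},
    {"id": 2, "role": "core", "precedes_hard_failure": True,  "recurs_daily_same_time": False},
]
for f in sorted(faults, key=adaptive_score, reverse=True):       # "what's hot" view
    print(f["id"], round(adaptive_score(f), 1))
```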


Understand the Value in Service Assurance Correlation & Analytics

The important part of correlation & analytics within service assurance is its value. You must understand what is freely available and its value to operations. This subject varies greatly from customer to customer and environment to environment. You have to decide how far down the rabbit hole you want to go. Always ask the question "How does that help us?". If you are not moving the needle, put it on the back burner.

If a policy is not saving 4-8 hours of effort a week, it's just not worth the development effort. Find your quick wins first. Keep a list in a system like Jira and track your backlog. You may want to leverage an agile methodology like DevOps if you want to get serious. Correlation and analytics are force multipliers. They allow operations to be smarter and act more efficiently. These are worthwhile pursuits, but make sure to practice restraint. Focus on the achievable; you don't need to re-invent the wheel. Tools are out there that provide all these features. The question to focus on is "Is it worth my time?".

Shawn Ennis

About the Author

Serial entrepreneur and operations subject matter expert who likes to help customers and partners achieve solutions that solve critical problems. Experience in traditional telecom, ITIL enterprise, global managed service providers, and datacenter hosting providers. Expertise in optical DWDM, MPLS networks, MEF Ethernet, COTS applications, custom applications, SDDC virtualized, and SDN/NFV virtualized infrastructure. Based in the Dallas, Texas area and currently working for one of his founded companies, Monolith Software.

Operations Digital Transformation Playbook

Digital transformation is the buzzword of the day. Whether TMF is saying it or Light Reading is reporting it, CIOs are doing it. Here is an example roadmap for transforming operations for the business's digital transformation. This 4-step process leverages much of what the industry has already said; I have interwoven some color and advice. I hope you find it useful and comment below.


Step 1: Acceptance

First, your organization needs to buy into the fact that something has to change. Buy-in for digital transformation is the key to success. While 100% agreement is not possible, getting an overwhelming majority will reduce timelines. Forming a committee with regular cadence calls can assist in collecting use cases. As a sounding board, they can be the voice of the organization, and they will also provide you cover during the transformation process.
 
Here is some advice to help people get in the boat. Some will doubt the need for change. To those doubters, I would pose the following questions:
  • What % of the time does operations spend on firefighting?
  • What do your customers say about the quality of the services you provide?
  • How many compromises does your team make to “get the job done”?
  • Does 25% YOY staff turnover frighten you?

These questions are the canaries in the coal mine for digital transformation. If you cannot focus on improvement, growth, and resiliency, the organization is in danger. When the business is changing to a more agile footprint, operations gets left behind, or even worse, becomes the roadblock.

 


Step 2: Selection

Change scares people and organizations, and digital transformation can get scary. When it comes to selection, it must be a sober, deliberate decision. RFIs are a common method for initiating change. The trouble is the net you cast: if you only send the RFI to your pre-existing vendors, you will get more of the same.
 
I recommend that you start with a Google search on "operations transformation", then "IT transformation". The results should net you some NEPs/DEPs (EMC, Huawei) and global SI players (Accenture, Deloitte). If you are a Fortune 500, they will be very kind to you and expect big money. If you have a relationship with these players, I recommend calling upon them and seeing the "big pitch". It's great for context and helps you understand the commitment involved in transformation.
 
The next step would be to call some analysts. I have had great experiences with Analysys Mason, Appledore Research, and Gartner. They can tell you what other customers have done. Attending some webinars and trade events can help get you connected to the trend setters. This will help you round out the group you want to invite into the RFI process.
 
With the RFI executed, you will want to review the material and cut down to five or fewer parties. Make sure you have a global SI, a NEP/DEP, and some trend setters in the bunch. Ask for presentations and documentation of best practices. Get as much information as possible; creating quality requirements is key.
 
Within the transformation workgroup, have each member create a top 10/25 list of key issues. Apply your use cases and develop a list of requirements (<100 items), and add a ratings system to keep it fair and to the point. If you value verification of technical compliance (e.g. support for Cisco IOS Y/N, etc.), add another sheet. Another tip: you can always ask entrants to combine their offerings, which firms up and consolidates your options. Use this list with your procurement team to create the RFP. Give at least two weeks to respond, and no more than four. Stick to your schedule and grade the responses.
 
Work within the workgroup to kill and combine entries until you get down to at least two; the fewer the better. Based upon the grading, provide the down-selected parties a list of how they can improve their responses. Giving parties the opportunity to focus and improve will allow for better options. Schedule meetings with no more than seven days' notice for their presentation and response. After all meetings are complete, revise the grading and make a selection with procurement. Notify all parties and negotiate a contract. I recommend that any contract that is part of a transformation be longer term; you will want a partner for at least three years. I also recommend agreeing on SLAs and penalties for failed or delayed delivery.

 


Step 3: Execution

After making the selection, the hard work starts: addressing digital transformation. Implementation should be a core concern during the selection process. Some transformation projects are short-term (less than six months); longer-term projects leverage milestones. Phasing allows transformation projects to achieve quick wins and set up long-term success. When building the business case, phasing allows prioritization of key objectives. I always recommend showing significant value within a quarter, and every quarter after. Regular improvement needs to be visible, or you will need significant executive sponsorship. Phasing will help drive the value and keep the project on task.
 
Selection of a project methodology defines how that project will run. Agile is very popular in IT projects, and a DevOps approach allows your transformation project to become evergreen. For long-term projects where you need extreme flexibility, there is no better technique. For short-term, fixed-scope projects, waterfall is more than satisfactory.
 
When executing the vision, setting phased milestones provides the direction, and quarterly scheduled demonstrations keep the faith. Consistent, planned deliveries confirm healthy project management. When it comes to execution focus, communication and delivery success should be the first priority. It's always best to remember: an unhealthy project yields poor deliverables.

 


Step 4: ROI & Renewal

Once the project has achieved its main objectives, the question becomes "Now what?". In every sense of the word, there is an "end state" with regard to transformation. Once you get there, though, you will find that the goal posts have moved on you. This is another reason Agile methodologies are popular.

Meet with your workgroup and steering committee: does it make sense to continue? One key issue my customers have seen is that transformation can lead to change for change's sake alone. There must be clear needs to continue. You can always reduce team cadence and let the needs of the business set the tempo.
 
In summation, executing a digital transformation is a heavy commitment. The fact is that the change is necessary to address the industry climate. Nobody wants to buy a T1 anymore, and that is a good thing! The good news is that meeting the needs of the business is possible and profitable. Good luck transforming!

Key lessons learned for Digital Transformation:

  • Collaboration = Commitment = Success – if you communicate effectively
  • Select the best process and tools for your team. Do not fall into conformity for its own sake.
  • Set achievable, regularly delivered goals. Show consistently increasing value to the business.
  • Focus on the present, but regularly plan for the future – and always communicate

 

Shawn Ennis

About the Author

Serial entrepreneur and operations subject matter expert who likes to help customers and partners achieve solutions that solve critical problems. Experience in traditional telecom, ITIL enterprise, global managed service providers, and datacenter hosting providers. Expertise in optical DWDM, MPLS networks, MEF Ethernet, COTS applications, custom applications, SDDC virtualized, and SDN/NFV virtualized infrastructure. Based in the Dallas, Texas area and currently working for one of his founded companies, Monolith Software.

Best Practices Guide for Application Monitoring

Don’t Fear the App

Digital service providers are being driven by customers into the world of applications. Gone are the days when simple internet access is all you have to provide. The more complex the service, the more value it has to the customer. As SMB customers embrace managed services, service providers end up managing applications. While traditional network services are well defined, most applications are disparate and obtuse. Many of the customers I talk to see a real challenge in application monitoring.
 
Applications require the same care and feeding as any other technology, if not more. Defining services is easier, but the components are vast and complex, and application discovery is still a new concept that is not yet 100% reliable. Knowing the availability, performance, and capacity of an application is vital information, and having the heuristics, audit, and log information to troubleshoot allows for quicker resolutions. Performing end-to-end distributed active testing allows for basic verification, while passive activity scanning can ensure you know about problems as soon as end customers do. Mission-critical apps need comprehensive monitoring and management, in proportion to the cost and value of the application deployed.
 
Applications can be very difficult to manage due to their inherent uniqueness. These custom digital services come in all forms and fashions, from print queue services to real-time stock trading platforms. This series of blog articles provides insight on how to plan for monitoring custom applications. Interested providers will be able to leverage these concepts for their own environments.
 

Discover the Application

The first part of any new application monitoring effort is determining what the application consists of. Application discovery has two common flaws. The first is over-discovery, creating so much detail that the associations become complex and useless. The second is under-discovery, in which you miss key associations and the result is equally useless. Discovery is like all other technology: it requires human guidance and oversight, so do not blindly depend upon it.
 

Website Monitoring

For our working example, I will use a custom application built on a traditional 3-tier architecture stack. We first start with the presentation layer. It's best to start by listing out what can go wrong. Network access might be down. Server failure is a possibility. The web server process (httpd) might no longer be running. Are the network storage directories mounted? Once you have your list, create your dashboard. Once you have your dashboard, link the necessary data to it (syslogs, traps, ping alarms). With a finished dashboard, you can automate it with policy. Create an alert that indicates an application error exists and points to the cause. If your assurance tool cannot perform these functions, find one that does the job.
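
To illustrate, here is a minimal sketch of the presentation-layer checks described above, assuming a hypothetical host and URL. A real deployment would feed these results into the assurance dashboard rather than print them.

```python
# Minimal presentation-layer health checks (hypothetical host/URL, illustrative only).
import socket
import subprocess
import urllib.request

def port_open(host, port, timeout=3):
    """Is the web server's TCP port reachable (network + server + listening process)?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_ok(url, timeout=5):
    """Does the presentation layer return a 2xx response?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def process_running(name="httpd"):
    """Is the web server process still running on this box (local check)?"""
    return subprocess.run(["pgrep", "-x", name], capture_output=True).returncode == 0

print("port 443:", port_open("www.example.com", 443))
print("http:", http_ok("https://www.example.com/"))
print("httpd process:", process_running("httpd"))
```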
 

Database Monitoring

Now repeat the same for the data layer. Which database do you have? MySQL provides rich monitoring plugins. What are the standard database KPIs? Google provides plenty of opportunity to leverage third-party lessons learned. What else is important with a database? Backup and redundancy are key. Are those working? Repeat the dashboard-driven monitoring techniques from above. The result is two-thirds of your custom application monitored.
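
As a rough sketch of pulling standard MySQL KPIs, here is an example using the PyMySQL client; the host, credentials, KPI list, and thresholds are illustrative assumptions, not recommended values.

```python
# Minimal MySQL KPI sketch using PyMySQL (assumed credentials, illustrative thresholds).
import pymysql

KPIS = {                      # status variable -> warning threshold (illustrative)
    "Threads_connected": 500,
    "Slow_queries": 100,
    "Aborted_connects": 50,
}

conn = pymysql.connect(host="db01.example.com", user="monitor", password="secret")
try:
    with conn.cursor() as cur:
        for kpi, threshold in KPIS.items():
            cur.execute("SHOW GLOBAL STATUS LIKE %s", (kpi,))
            row = cur.fetchone()
            if row:
                value = int(row[1])
                state = "WARN" if value > threshold else "OK"
                print(f"{state}: {kpi} = {value} (threshold {threshold})")
finally:
    conn.close()
```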
 

The hard part…

The most difficult layer to deal with is the application layer. Here there are no rules. The best case is talking to the developers and getting them to explain and define the known KPIs and failure points. Worst case, you can break down the logs, processes, and ports in use to check for basic things. Do not discount basic monitoring such as this; the more you know, the easier it is to troubleshoot. Run the dashboards you have as reports and get them into the inbox of the application team daily. This will ensure the feedback you need to refine your monitoring policies.
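
As a rough illustration of that basic application-layer monitoring, here is a sketch that scans a log file for error patterns and confirms the application's port is listening; the log path, patterns, and port are hypothetical.

```python
# Basic application-layer checks: scan logs for error patterns, confirm the app port listens.
import re
import socket

ERROR_PATTERNS = re.compile(r"ERROR|FATAL|Traceback|OutOfMemory")   # assumed patterns
LOG_PATH = "/var/log/myapp/app.log"                                 # hypothetical path
APP_PORT = 8443                                                     # hypothetical port

def count_log_errors(path):
    try:
        with open(path, errors="ignore") as fh:
            return sum(1 for line in fh if ERROR_PATTERNS.search(line))
    except FileNotFoundError:
        return -1   # a missing log is itself worth alerting on

def app_port_listening(port, host="127.0.0.1"):
    try:
        with socket.create_connection((host, port), timeout=3):
            return True
    except OSError:
        return False

print("log errors:", count_log_errors(LOG_PATH))
print("port listening:", app_port_listening(APP_PORT))
```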
 

Last advice…

– Be bold – Don’t be afraid of monitoring
– Communicate – Let the team see the results; if the data is wrong, fix it
– If nobody cares about the data, don't keep it and don't alert on it
– Alerts and notifications are only useful if they are rare and desired
 
My last point: if you are an SMB, your managed service provider should be able to perform custom application monitoring. If they can't, have them call me…