The Journey to AIOps

“I&O leaders should use AIOps platforms for refining performance analysis across the application life cycle, as well as for augmenting IT service management and automation.” Gartner 2019

I’ve been writing about automating event management for the last 20 years, and although the name AIOps (artificial intelligence for IT operations) is a fairly recent addition, the underlying issue has remained the same: AIOps is an evolution of IT operational analytics (ITOA).
Enterprises have spent the last two decades trying to bring together information from their siloed tools (often operated by siloed teams) whilst trying to understand which team should resolve any particular issue. The problem is that the tools create too many alerts and too much data for a single person to analyse in order to find the root cause of any issue.

Event correlation has long been a technique used by vendors to help with this issue: it makes sense of a large number of events by identifying and analysing relationships between them. It has been used as far back as the Tivoli Enterprise Console (TEC), through IBM Netcool, and in many competing tools such as BMC Patrol and CA Unicenter NSM. The difference now is that AIOps tries to replace the old manually written correlation rules with artificial intelligence. In addition, because of advances in technology and storage capacity, AIOps looks at data from a number of sources rather than concentrating on events alone. This data is growing exponentially: Gartner estimates that the average enterprise IT infrastructure generates two to three times more IT operations data every year, an amount that has moved beyond the ability of human operators to prioritise and understand.
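As a toy illustration of what a correlation rule does (not any vendor’s implementation), the sketch below groups alerts that arrive within a short time window into candidate incidents:

```python
# Hypothetical alert records: (timestamp_seconds, resource, message)
alerts = [
    (100, "db01", "high CPU"),
    (102, "db01", "slow queries"),
    (105, "app01", "request timeouts"),
    (400, "db01", "disk full"),
]

def correlate(alerts, window=60):
    """Group alerts that occur within `window` seconds of the first
    alert in the group -- a crude stand-in for the rule-based
    correlation engines described above."""
    groups = []
    current = []
    for alert in sorted(alerts):
        if current and alert[0] - current[0][0] > window:
            groups.append(current)
            current = []
        current.append(alert)
    if current:
        groups.append(current)
    return groups

groups = correlate(alerts)
# The first three alerts land in one group (a single probable
# incident); the later disk-full alert forms a group of its own.
```

Real correlation engines also use topology and resource relationships, not just time, but the principle of collapsing many alerts into one actionable incident is the same.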

The other big change influencing the rise of AIOps is the transition from a traditional infrastructure of separate, physical systems to a mix of cloud environments (managed, private and public) running on resources that scale and change constantly, coupled with standard on-premises data centres.

The aim of AIOps is to deal with these issues and replace manual IT operations decisions with a single, intelligent, and automated IT operations platform.

The Benefits of AIOps

  1. Enables IT operations to identify, address, and resolve slow-downs and outages faster.
  2. Achieves faster mean time to resolution (MTTR) by removing noise and correlating operations data from multiple IT environments.
  3. Allows IT operations to go from reactive to predictive management.
  4. Prioritises alerts based on business needs.
  5. Frees your IT operations team to focus on strategic tasks as AIOps learns and automates more.
  6. Automates problem resolution by offering remedial actions or automating recurring tasks.
  7. Improves user self-help by using chatbots and virtual support assistants (VSAs) to publish knowledge and automate recurring tasks.
  8. Automatically manages the complexity, volume, and variety of data.

What actually is an AIOps platform?

An AIOps solution can be summarised as performing the following three functions:

  1. Data: Collect and aggregate large volumes of operations data generated by multiple IT infrastructure components, applications, and performance-monitoring tools.
  2. Machine Learning: Intelligently filter real alerts out of the noise to identify significant events and patterns related to system performance and availability issues.
  3. Automation: Rapidly diagnose root causes and, in some cases, automatically resolve issues without human intervention.

These three things together enable IT operations teams to respond more quickly, sometimes even before detected issues have impacted the business. This is why most experts now consider AIOps to be the future of IT operations management.

Let’s look at those components in more detail:

1/ Data

Firstly, an AIOps platform needs to be able to aggregate large amounts of data in a single format and place. This data typically comes from the following areas:

  1. Live and Historical Performance Data
  2. Real-time Alerts
  3. Network Data
  4. Incidents and Tickets
  5. Logs
  6. Documents (possibly)

These six elements can be considered structured data; however, the better AIOps solutions should also be able to ingest unstructured data such as chat or social data, so that they can understand conversations between support engineers and intervene when they have a solution.

Clearly this amount of data across a large number of data sources requires the AIOps platform to have a large datastore where the data can be normalised. Normalisation doesn’t necessarily require transforming all of the data into a single format, but it does need at least some dataset transformations.
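A minimal sketch of that normalisation step, assuming invented field names for two hypothetical sources rather than any real product’s schema:

```python
# Two hypothetical sources with different field names are mapped
# onto one common event schema. All names here are illustrative
# assumptions, not any vendor's API.
FIELD_MAPS = {
    "monitoring": {"host": "resource", "sev": "severity", "msg": "summary"},
    "ticketing":  {"ci": "resource", "priority": "severity", "title": "summary"},
}

def normalise(record, source):
    """Rename source-specific fields to the common schema and keep
    the original record under 'raw' for later analysis."""
    mapping = FIELD_MAPS[source]
    common = {target: record[field]
              for field, target in mapping.items() if field in record}
    common["source"] = source
    common["raw"] = record
    return common

event = normalise({"host": "db01", "sev": 3, "msg": "high CPU"}, "monitoring")
ticket = normalise({"ci": "db01", "priority": "P2", "title": "DB slow"}, "ticketing")
# Both records now share the keys: resource, severity, summary, source, raw.
```

Because both records now share a `resource` key, downstream analytics can join monitoring alerts to tickets about the same component without caring which tool produced them.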

For this, vendors use a range of Big Data technologies: Kafka and Spark Streaming for data ingestion, Apache Hive and Spark for analytics, HBase and Cassandra for key-value stores, Logstash and Elasticsearch for log and text search, and Big Data processing frameworks such as Hadoop. For cloud, there are systems such as Amazon Redshift. These systems shouldn’t be seen as competing but rather as complementary technologies, and they will often be used together.

For AIOps products, open solutions are preferable to proprietary tools that can restrict a business’s ability to modify its AIOps toolset and processes in the future. However, be aware that many proprietary data collection and normalisation tools are built on top of these open source solutions.

2/ Machine Learning

I’ll start this part by defining what Machine Learning actually is and how it differs from Artificial Intelligence. Machine Learning is a subset of AI, and Deep Learning is a subset of Machine Learning. The following definitions make the difference clear:

  • Artificial Intelligence is the broad discipline of creating intelligent machines.
  • Machine Learning (ML) refers to systems that can modify themselves when given more data; it is dynamic and does not require human intervention to make changes.
  • Deep Learning (DL) refers to systems that learn from experience on large data sets using programmable neural networks to make more accurate predictions without help from humans.

Managing these environments with traditional methods that rely on human intervention is becoming impossible. That approach simply doesn’t work in dynamic, elastic environments, so one of the key aims of AIOps is to separate the signals (significant abnormal event alerts) from the noise (everything else) by analysing data in real time.

Effective detection requires AIOps tools that are intelligent enough to set dynamic baselines, allowing the tools to determine what constitutes normal activity under given circumstances (such as the time of day) and then detect data or events that do not align with that baseline. AIOps detection should also support multivariate anomaly detection, which looks for outliers across a series of different metrics to decide whether overall behaviour is out of the ordinary. In more complex situations, multivariate methods rely on neural networks to model interactions between the various metrics and make decisions based on them.
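A toy version of such a dynamic baseline, assuming a simple per-hour mean/standard-deviation model and a 3-sigma threshold (real AIOps products use far richer statistics):

```python
import statistics

def build_baseline(history):
    """history: list of (hour_of_day, value) observations.
    Returns {hour: (mean, population std dev)} -- the 'normal'
    profile for each hour of the day."""
    buckets = {}
    for hour, value in history:
        buckets.setdefault(hour, []).append(value)
    return {hour: (statistics.mean(vals), statistics.pstdev(vals))
            for hour, vals in buckets.items()}

def is_anomaly(baseline, hour, value, threshold=3.0):
    """Flag values more than `threshold` standard deviations from
    that hour's baseline."""
    mean, stdev = baseline[hour]
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold

# Nightly load is low, daytime load is high -- so the same reading
# can be normal at 14:00 yet anomalous at 03:00.
history = ([(3, v) for v in [10, 12, 11, 9, 10]] +
           [(14, v) for v in [80, 85, 90, 88, 82]])
baseline = build_baseline(history)
```

The point of the example is the time-of-day bucketing: a static threshold would either miss the night-time spike or drown the day shift in false alarms.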

By cutting through the noise and correlating operations data from multiple IT environments, AIOps is able to identify root causes and propose solutions faster and more accurately than humanly possible and therefore achieve faster mean time to resolution (MTTR).

Existing manual analytics should now be automated, and new analytics on new data should be possible at a scale and speed unachievable without AIOps.

Lastly, it’s important that Machine Learning is traceable so that the users and stakeholders trust the new AI-powered recommendations and automations, especially where it is automating actions on mission-critical workloads.

3/ Automation

Lastly, AIOps tools should allow users to act on prescriptive advice from the platform. This advice should not be static: the tool should learn continually so that algorithms can be tuned, or new algorithms created, to identify problems more quickly the next time an issue occurs.

Ultimately the tool should be able to act automatically or offer a solution that will fix any issue. Gartner has suggested four stages to this process:

  1. If an issue is resolved, the solution is recorded in a knowledge base of historical solutions (tribal knowledge) so that the next time the issue occurs it is available for reference.
  2. Secondly, an incoming problem can be matched to a set of recurring problems based on a category, and the resolution can be improved via crowdsourcing, which refers to users training the different algorithms to reach the correct solution.
  3. Thirdly, AI suggests a resolution based on the probability that it will fix the issue.
  4. The last step is to trigger an automated response. Currently, Gartner believes this closed-loop process, referred to as “self-driving ITOM”, although highly desirable, is still aspirational, with very few vendors offering more than “reboots” or “raising a ticket”. They say that the likely candidates for automated actions from prescriptive tools tend to be low risk (e.g. deploying a patch).
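To make the knowledge-base matching in the early stages concrete, here is a hedged sketch that compares an incoming problem description to recorded solutions using simple token overlap; the knowledge-base entries and the similarity threshold are invented for illustration, and real products use trained models rather than word counting:

```python
def jaccard(a, b):
    """Crude text similarity: word-set overlap between two strings."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

# Hypothetical tribal-knowledge store: problem -> recorded fix.
knowledge_base = {
    "database connection pool exhausted": "restart connection pool",
    "disk full on log partition": "rotate and compress old logs",
}

def suggest(problem, kb, min_score=0.3):
    """Return the best-matching recorded solution, or None if nothing
    is similar enough to trust."""
    best = max(kb, key=lambda known: jaccard(problem, known))
    if jaccard(problem, best) < min_score:
        return None
    return kb[best]

fix = suggest("log partition disk full", knowledge_base)
# Matches the disk-full entry despite the different word order.
```

The `min_score` cut-off mirrors the trust question raised earlier: below some confidence, the tool should hand the problem to a human rather than guess.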

Automated resolution may not be practical in all situations but AIOps should still provide insights that help lead to the resolution. As the tools get more sophisticated and confidence grows, the problems that AIOps tools resolve automatically will increase.

Take an Incremental Approach to AIOps

Like me, you have probably been looking at these issues for many years and have bought many disparate tools to try and address the very issues that I have detailed here for AIOps. For that reason, the best way is to look at an overall picture of what you are trying to achieve with AIOps and then map your current products to the model.

Helpfully, in 2018 Gartner created the following diagram, which details all the components.

If we take the example of a company that already has IBM Application Performance Monitoring (APM), ServiceNow, LogDNA, Dynatrace and IBM Netcool Operations Insight (NOI), most (if not all) of these tools can be used as part of the solution. Similarly, if you use Slack for support collaboration and are happy with it, you should not be expected to change but should instead look for a tool that integrates into your existing processes, which should simply adapt or evolve to the extra capabilities AIOps provides. Lastly, in this scenario the customer already uses Ansible for launching runbooks. We don’t want to change that either, so we should look for a product that can launch our current in-house automations.

In this specific scenario, we need a product that ingests data no matter where it has come from and provides real-time alerts and causality recommendations, but we don’t need products that collect the data, as this is already covered. An inability to deal with these data requirements will prove costly to the success of the project. We also need the product to intelligently and adaptively automate recurring tasks and, in this case, integrate with Ansible for runbooks and use Slack for collaboration.

There are products that meet this requirement, but this blog will not look at individual vendors or make recommendations, just suggest the process for creating the project.

Once you have selected the product or products how do you start?

I’ve been writing blogs for a while about only receiving alerts that require some outcome, and this is no different with AIOps. When you are looking at your initial test cases, make sure that you can drive a business outcome such as a runbook or a manual script. Initially, these should be low risk, such as opening a ticket or scaling Kubernetes pods. In this way, success is easily measured by tracking the reduction in false alarms and non-actionable tickets at the service desk, by avoiding the impact of detected anomalies, and by improving performance.
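As a sketch of such a low-risk starting point, the snippet below maps known alert types to runbook commands and falls back to opening a ticket; the alert names and commands are illustrative assumptions, not a real integration:

```python
# Hypothetical mapping from correlated alert types to safe,
# pre-approved runbook commands. Everything here is illustrative.
LOW_RISK_ACTIONS = {
    "pod_cpu_saturation": "kubectl scale deployment web --replicas=5",
    "cert_expiring": "ansible-playbook renew-cert.yml",
}

def choose_action(alert_type):
    """Return a runbook command for known low-risk issues; otherwise
    fall back to raising a ticket for a human to investigate."""
    return LOW_RISK_ACTIONS.get(alert_type, "open-ticket")

action = choose_action("pod_cpu_saturation")
# An unrecognised alert type simply becomes "open-ticket".
```

Keeping the automated actions in an explicit allow-list like this is one way to grow confidence gradually: anything not on the list still goes through a person.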

The suggestion is not to create a big bang at your organisation but to choose a tool that offers the ability to gradually increase the depth and breadth of analysis. Gartner suggests starting with events and then moving to data, breaking down the different silos by progressively ingesting events and then data from all the use cases across all data sources.

Only after this should you start ingesting traces and non-structured data. However, the tool should be capable of ingesting and providing access to a broad range of historical and streaming data types in support of domain-agnostic use cases.

Do I need a Data Scientist?

“Machine learning promises much. However, without data science skills, it will be difficult for I&O [Infrastructure & Operations] leaders to realize that promise beyond the basic use cases of event correlation and anomaly detection”. Gartner 2019

My view is that this is overly pessimistic. If the product needs data scientists to keep it running, I think the project will fail. Clearly, the quality of the data is key to the tool’s accuracy, as is the ability to get that data into the platform, but the vendor (or better still, an independent consultancy such as Orb Data) will set this up initially, and subsequent changes to the inputs should be fairly limited. Experience with past products shows that if they are overly complex to maintain they eventually become out of date and fall by the wayside.

You will, however, need individuals with the time and capability to write scripts for executing custom actions, as this is a critical element of any AIOps project’s success; fortunately, it is a standard DevSecOps skill.

What questions should I ask the software vendors?

When you are analysing the products, you will want to ask the following questions:

  1. What kinds of data can the product ingest?
  2. Can it accept data, events, etc from the products that I have?
  3. What is the time to value? i.e. how long do the models take to learn?
  4. Does the tool overlap with things that we already have and if so can I get rid of them?
  5. How is the product costed? For example, is it based on data and if so how can I calculate what it will cost at various stages of the project?
  6. How does the tool use machine learning?
  7. What skills are needed to support the analytics and machine learning parts? Do I need a data scientist to integrate my current products?
  8. Is the Machine Learning tool traceable so that I can analyse why an action was taken or a suggestion given? Trust in the tools is critical to the success of a project.
  9. What visualisations does the product provide?
  10. Can it integrate with current tools such as Slack for collaboration or Ansible for Runbooks?
  11. What technologies does the AIOps platform use? Are they open or proprietary?
  12. And lastly and most importantly, when will it pay for itself? i.e. what is the business case?

Gartner 5 Step Model

Lastly, I wanted to include the Gartner 5-step model, which is designed to help you evolve slowly through the AIOps journey/minefield/path/roadmap. For me it’s too prescriptive, and some parts you may consider you are already doing, but I have included it here for completeness.

These are the following steps:

  1. Deploy a product to detect patterns in large volumes of data. This is most helpful in separating events likely to end up as false alarms from those needing immediate attention.
  2. The second stage is to test the degree to which these patterns allow users to take a manual action to resolve the issue.
  3. The third stage is to deploy functionality that will create and utilise dynamic baselines and look for anomalies. The predictive nature of these tools will allow for proactive alerts to be created.
  4. Another key use case for AIOps is causal analysis. This refers to the task of tracing a problem to its source or sources in order to help resolve it. The tool that is deployed should be expected to collect data that is not directly related to the problem whose causes you are analysing. This includes information such as how often similar problems have occurred in the past and what their causes were, or whether other systems are experiencing similar issues. Contextual information such as this can help you interpret the scope and significance of a problem and prioritize it accordingly. In addition, the AIOps tools should provide the ability to drill down into a problem in order to investigate it at a deep level.
  5. Using AIOps with ITSM, starting with virtual support assistants/chatbots, ticket analysis and eventually change risk analysis. Chatbots and virtual assistants can use Natural Language Processing (NLP) for running recurring tasks and for low-cost sharing of knowledge with employees and users, with virtual customer assistants handling automated engagements with users.
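A toy keyword-matching assistant illustrating that last step; the intents and replies are invented, and a production VSA would use proper NLP rather than keyword sets:

```python
# Hypothetical intent table: required keywords -> canned reply.
INTENTS = {
    ("password", "reset"): "A reset link has been sent to your registered email.",
    ("vpn", "connect"): "See KB0042: configuring the corporate VPN client.",
}

def respond(message):
    """Answer from the knowledge table when every keyword of an
    intent appears in the message; otherwise escalate to a human."""
    words = set(message.lower().split())
    for keywords, reply in INTENTS.items():
        if set(keywords) <= words:
            return reply
    return "I couldn't help with that; raising a ticket for the service desk."

answer = respond("how do I reset my password")
```

Even this crude version shows the two behaviours the step describes: low-cost knowledge sharing for recognised questions, and a graceful hand-off to the service desk for everything else.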

 

Gartner suggests that all of these stages of AIOps are important and that companies should select tools that support as many of these stages as possible and ones that enable portability across tools.

Conclusion

To write this blog I’ve looked at several vendor products and noticed that almost all vendors now list an AIOps offering; however, many companies appear to have simply rebranded their current products to make them look new. I’m planning to look at these products in more detail in the coming weeks and to run a webinar in July on AIOps in general, followed by others looking at the various specific offerings. If you are interested, let me know and once the date is confirmed I will send you the subscription link. Alternatively, check back here periodically and I will add the link here.

If you have questions then please email me at simon.barnes@orb-data.com