Summary: Enterprise operation and maintenance faces new requirements and challenges. Let's see how Huawei AIOps addresses them!
This article is shared from the HUAWEI CLOUD Community post "[Cloud Resident Co-Creation] AIOps? The new power of enterprise operation and maintenance!", original author: Qiming.
As usual, let us first introduce the concept of AIOps. AIOps, short for Artificial Intelligence for IT Operations, is intelligent operation and maintenance: it applies artificial intelligence to the field of operation and maintenance and, based on existing operation and maintenance data (logs, monitoring information, application information, etc.), uses machine learning to solve problems that automated operation and maintenance cannot.
Gartner predicts that current IT applications will change drastically, and the way the entire IT ecosystem is managed will also change. The key to these changes is what Gartner calls the AIOps platform.
What we are going to discuss today is the needs and challenges that AIOps must address, and how we deal with them.
AIOps needs and challenges
(1) New technologies and new challenges call for highly intelligent telecommunications networks
In recent years, new technologies represented by 5G have been rapidly applied in telecommunication networks. They bring many benefits, such as massive connectivity, low latency, and high speed. With the development of 5G, these metrics have improved by at least an order of magnitude.
However, this order-of-magnitude improvement is accompanied by an increase in the difficulty of operation and maintenance, which brings the following challenges:
1. Network complexity:
The order-of-magnitude growth has made the network more complex: new technologies are applied rapidly, but old technologies are not retired at the same pace. As a result, each new technology adds to the existing complexity, and in some scenarios even multiplies it.
For example, in the wireless field, 2G/3G/4G/5G coexist as "four generations under one roof"; in the core network, about ten domains such as PS, CS, MS, and the Internet of Things coexist. Such high network complexity is bound to bring considerable challenges to operation and maintenance.
2. New 2B demands
The second challenge of operation and maintenance is the new scenario of To B, that is, enterprise applications. The application of 5G has promoted intelligent manufacturing, and the network has gradually integrated into the production and manufacturing process of enterprises. In this case, the requirements for network reliability will inevitably increase. After all, once the network has a problem, the production process may be affected or even interrupted, and the losses caused by this will be very large.
3. Cost pressure
Cost pressure stems mainly from the first two challenges: we face either a more complex network or higher requirements. If we handle them with traditional operation and maintenance methods, costs will inevitably rise sharply. Another factor in the cost increase is energy consumption, since 5G consumes much more energy than 4G.
In response to the above-mentioned challenges, how do we deal with them? AI technology is the key.
(2) AI is a key technology to enhance the automation and intelligence of telecommunications networks
In terms of operation and maintenance costs, statistics show that 90% of operations and maintenance require manual participation, and 70% of the costs are labor costs. In this case, a natural idea is whether AI technology can be used to reduce human costs and improve operation and maintenance efficiency.
For example, take the energy consumption of 5G just mentioned: can we reduce it through artificial intelligence technology? Judging from past practical experience, the answer is yes.
Next, we use three examples to illustrate.
1. Base station energy saving
The first example is energy saving in base stations. The energy consumption of a base station is very high. In the initial stage of network deployment, a base station has few users, yet it is usually kept fully powered on. The operator's solution is to predict the traffic volume. If the traffic volume can be predicted accurately, some carriers can be turned off when traffic is low, which saves energy. According to statistics, using an LSTM neural network for traffic prediction can achieve energy savings of more than 10%.
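The predict-then-switch-off idea can be sketched in a few lines. This is only an illustration under stated assumptions: a simple same-hour average stands in for the LSTM forecaster mentioned above, and the function names, the per-carrier capacity, and the 20% headroom are all hypothetical values chosen for the example.

```python
import math
from statistics import mean

def forecast_next_day(daily_history, days=3):
    """Forecast each hour of the next day as the average of the same hour
    over the last `days` days (a stand-in for an LSTM model)."""
    recent = daily_history[-days:]
    return [mean([day[hour] for day in recent]) for hour in range(24)]

def carriers_to_keep(forecast, capacity_per_carrier, total_carriers, headroom=1.2):
    """For each hour, keep enough carriers to serve the forecast load plus
    headroom, never fewer than one and never more than are installed."""
    plan = []
    for load in forecast:
        needed = math.ceil(load * headroom / capacity_per_carrier)
        plan.append(min(total_carriers, max(1, needed)))
    return plan

# Hypothetical traffic: quiet from 00:00 to 05:59, busy for the rest of the day.
one_day = [10] * 6 + [200] * 18
plan = carriers_to_keep(forecast_next_day([one_day] * 3), 100, 4)
```

In this toy data the quiet night hours need only one active carrier while the busy hours need three, so the remaining carriers can be switched off hour by hour. A wrong forecast would cost either energy or service quality, which is why forecast accuracy matters.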
2. Core network KPI anomaly detection
The second example is anomaly detection, where a KPI anomaly detection service is deployed in the operator's core network. The original anomaly detection service raised alarms based on fixed thresholds. AI technology, on the other hand, can identify anomalies more intelligently, promptly, and accurately.
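To make the contrast with fixed thresholds concrete, here is a minimal sketch. It does not reproduce the detection service described above: the adaptive detector is a plain rolling median with a MAD (median absolute deviation) band, chosen only because it is short, and the window size, multiplier `k`, and `min_spread` floor are assumed values.

```python
from statistics import median

def fixed_threshold_alarms(series, limit):
    """Classic approach: alarm whenever the KPI exceeds a static limit."""
    return [i for i, value in enumerate(series) if value > limit]

def dynamic_alarms(series, window=6, k=3.0, min_spread=1.0):
    """Alarm when a point deviates from the median of the previous `window`
    points by more than k times their median absolute deviation (MAD)."""
    alarms = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        baseline = median(past)
        mad = median([abs(v - baseline) for v in past])
        spread = max(mad, min_spread)  # floor avoids alarms on tiny jitter
        if abs(series[i] - baseline) > k * spread:
            alarms.append(i)
    return alarms

# Hypothetical KPI samples with one spike that stays below the static limit.
kpi = [50, 52, 51, 49, 50, 51, 50, 80, 51, 50]
static_hits = fixed_threshold_alarms(kpi, limit=90)
adaptive_hits = dynamic_alarms(kpi)
```

With a static limit of 90, the spike to 80 goes unnoticed, while the adaptive baseline flags it because it is far outside the recent range of the KPI. Lowering the static limit would catch it, but at the price of false alarms on KPIs whose normal level is higher.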
3. Fault identification and root cause location
Usually, once a fault occurs on the network, a large number of alarms are triggered, and the system dispatches operation and maintenance work orders based on dimensions such as location (latitude and longitude). If multiple network elements report alarms for the same fault, duplicate work orders are dispatched. In other words, when one failure causes several network elements to raise alarms, work orders may end up being dispatched in multiple domains (the wireless domain, the transmission domain, etc.).
(3) The development of AI applications still faces challenges: high development threshold and long cycle
From the above three examples, we can see that AI is relatively reliable. But if AI is so reliable, why hasn't it been applied fully and quickly? Because the development of AI applications still faces many challenges, which can be summed up in two phrases: high threshold, long cycle.
The picture above is from a research report by Gartner. It analyzes the main obstacles to AI applications from four dimensions. The main points include:
Understanding the gains and uses of AI
Data scope and quality
This brings us back to the two phrases above: high threshold and long cycle.
1. High barriers to entry
The first aspect of the "high barriers" mentioned here is the lack of AI algorithm developers. A typical operation and maintenance team does not have dedicated AI algorithm developers, which inevitably leads to a lack of AI skills.
But this is not the most critical point, because the shortage of AI personnel can be solved through training and recruitment.
The most critical point, which is the second aspect, is that it is difficult to combine algorithms with the business. To build a good application, it is best to start from the business and choose the appropriate algorithm according to its actual situation. In practice, this requires a business expert with a deep understanding of operation and maintenance and an algorithm expert proficient in AI, and both need enough time and willingness to sit down for an in-depth exchange. Time and willingness then become the obstacles.
The third aspect is data, which involves two problems: engineering and labeling. Developing an AI application is a considerable engineering effort: it must first access massive multi-modal data, then complete model training and inference, and finally present the results, which includes connecting to existing systems. Therefore, in addition to the operation and maintenance experts and algorithm experts mentioned earlier, many engineering developers are also needed.
2. Long cycle
The high development threshold leads to a long development cycle: if the threshold cannot be overcome well, the cycle will inevitably be very long. A long development cycle leads to two problems:
1. Doubts about the gains and uses. If results do not arrive for a long time, corporate decision-makers may doubt the effect that AI can produce;
2. The longer the project takes, the higher the expectations for it. Suppose the same work achieves the same effect, for example a 5% reduction in fault repair time: a project that took two years will be judged very differently from one that took one month.
In response to the challenges encountered in implementing AIOps, Huawei launched the AIOps service! Now let's take a look at what the AIOps service is and how it solves the challenges described above.
Huawei AIOps Service
The picture above is the overall framework of the AIOps service. AIOps is divided into four layers from bottom to top:
The first layer: data collection and management. This sounds easy but is difficult to do, because there are many data types to handle and the interfaces are not uniform. Adapting to the data alone can be exhausting. Huawei's AIOps service supports common interfaces, has presets for some common equipment, and can ultimately achieve automatic interfacing and automatic data management.
The second layer: AI atomic capabilities. Huawei AIOps has more than 20 atomic capabilities, covering four scenarios: detection, prediction, identification, and diagnosis. Atomic capabilities are not just implementations of AI algorithms. Each atomic capability has been verified with actual site data and optimized for specific operating scenarios. Each one also incorporates Huawei's accumulated operation and maintenance experience, and some atomic capabilities can even be used directly without training.
The third layer: orchestration capabilities, including process orchestration, large-screen orchestration, and RPA orchestration. Atomic capabilities are the basic components of AIOps intelligent operation and maintenance. Process orchestration is simple and flexible: drag data and AI operation and maintenance capabilities from the component library and combine them to complete the end-to-end graphical orchestration of a scenario. This lowers the development threshold for partners and lets them build AI applications efficiently.
The fourth layer: industry AI apps. These work out of the box for the most typical scenarios. The layer offers rich 2D and 3D visualization components: more than 30 chart controls covering styles such as line chart, topology, list, and column chart, plus multiple map controls, interactive controls, and media controls. To build a large-screen operation and maintenance view, you only need to drag and drop controls from the component library, lay them out freely, and configure the application's reports as needed to assist monitoring and analysis. For example, you can build your own microservice health monitoring hall that shows the average interface success rate, average interface delay, interface failure rate, number of interface calls, and so on, together with a list of KPI alarms that gives operators a reference for failure warnings. The style, data, and interaction of each control can be customized to meet display requirements, and the back-end data can come from the intermediate data defined during app assembly. After configuration, you can preview and publish the view to the large screen with one click.
(1) RPA helps AIOps connect with existing operation and maintenance systems
Beyond being displayed, the inference results must also help recover from the failure. At this stage, this generally means interfacing with existing systems, such as the work order system (sending a work order to the mailbox of the person who must handle it), automatic replies, and problem tickets. Doing this interfacing manually is time-consuming, laborious, and error-prone, so robotic process automation (RPA) is a natural fit. RPA services can handle data interfacing, data transfer, work order dispatch, and so on, reducing manpower and the cost of errors.
(2) 10+ out-of-the-box apps that support rapid deployment
For some of the most typical scenarios, HUAWEI CLOUD AIOps has done the orchestration in advance: there are more than ten out-of-the-box apps, covering campus networks, DC networks, IT applications, operator networks, and more. Deployment is flexible, with support for public cloud, HCS, on-premises deployment, cloud-ground collaboration, and so on. The ecosystem is open: partners can develop industry apps and publish AI applications to the AI market, cooperating for mutual benefit to build a network AI ecosystem.
Let's use the "KPI Anomaly Detection" App to demonstrate how to use an out-of-the-box App.
Step 1: Import the list of network elements;
Step 2: Configure performance and alarm data sources;
Step 3: Associate the data source to the App;
Step 4: Start the App;
Step 5: Check the big screen and analyze the fault.
AIOps enables intelligent operation and maintenance of campus networks
So how does AIOps solve real operation and maintenance problems on a campus?
(1) Campus network construction and maintenance mode
The above picture shows the two construction and maintenance modes of the campus network:
2B and 2C share the OMC of the large network: the current mainstream model. The enterprise rents the operator's wireless equipment and some other equipment. One problem with this model is that the terminals are maintained by the enterprise and the network by the operator, so responsibilities are hard to separate when a problem occurs. Another is that the operator's operation and maintenance capabilities and O-domain organization, built for the large 2C network, struggle to support the high SLAs of enterprise intranets and the heightened demands of enterprise customers.
Separate 2B and 2C OMC (EMS): Enterprises purchase and maintain 5G CPE, wireless, core network, and other equipment themselves, with an end-to-end view. Judging from documents issued by the Ministry of Industry and Information Technology, VDF, Audi Park, and enterprise SLA guarantees, enterprises building 5G networks on rented operator spectrum or dedicated spectrum will gradually become mainstream.
(2) Business scenarios and pain point analysis: campus customers need easy-to-use, multi-domain integrated network operation and maintenance
1. Typical network status
The picture above shows a common video detection service on a campus. Even for this most common service, about a dozen network elements are involved, from 5G wireless to transmission to edge computing, and even the core network.
2. Campus applications
The above figure lists some common campus applications, including edge AI detection, smart logistics, indoor positioning, etc. All of these services resemble the previous picture: even a simple service involves multiple domains.
So how does campus operation and maintenance differ from the operator's? There are three main points:
User : lack of professional communication knowledge and weak network operation and maintenance capabilities;
Network : The networking is relatively simple, but involves multi-domain, wireless, transmission, data communication, IT, etc.;
SLA: The end-to-end SLA requirements in production network contracts are high: 7x24 hours, 99.99%.
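To get a feel for how strict the 99.99% figure is, the short calculation below converts it into a downtime budget. It assumes that the percentage is an availability target measured over continuous 7x24 operation; the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_pct, period_hours):
    """Downtime budget in minutes for an availability target over a period."""
    return period_hours * 60 * (1 - availability_pct / 100)

# 7x24 operation: a year has 365 * 24 hours, a 30-day month has 30 * 24 hours.
per_year = allowed_downtime_minutes(99.99, 365 * 24)   # about 52.6 minutes
per_month = allowed_downtime_minutes(99.99, 30 * 24)   # about 4.3 minutes
```

A budget of a few minutes per month leaves almost no room for manual fault location, which is the motivation for the pain points listed next.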
Therefore, the pain points for a customer operating a campus network are as follows:
Skills: The introduction of 5G 2B makes the network more complex, and enterprise engineers lack the relevant skills, making operation and maintenance difficult;
Tools: Effective operation and maintenance tools are lacking, and locating complex network problems requires on-site consultation with cross-domain experts, which is costly and time-consuming.
In summary, cross-domain equipment in the campus network needs data integration and support for end-to-end analysis and presentation, so that the enterprise ICT infrastructure can ultimately be operated and maintained in a unified way. The campus network involves a lot of network equipment with blurred boundaries, so it needs a unified cross-domain fault demarcation and location capability to speed up the location of production network problems.
(3) Traditional manual and tool-based operation and maintenance cannot meet the new needs of the campus network, and there is an urgent need for intelligent transformation
According to the data in the above figure, we can see:
Passive operation and maintenance: 75% of problems are discovered by users rather than detected proactively, and users who discover problems are likely to complain;
Low degree of automation: 70% of the company's operating costs are labor costs, and costs have risen sharply;
Difficulty in troubleshooting: 90% of the failure recovery time is spent locating the problem, while the actual repair accounts for a very small proportion.
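The 90% figure implies that faster fault location dominates any improvement in recovery time. The arithmetic below illustrates this with hypothetical numbers: a 100-minute recovery and a twofold speedup in location are assumptions for the example, not measurements.

```python
def recovery_time_after_speedup(recovery_minutes, locate_share, locate_speedup):
    """Recovery time when only the fault-location part gets faster."""
    locate = recovery_minutes * locate_share
    repair = recovery_minutes - locate
    return locate / locate_speedup + repair

# 100-minute recovery, 90% of it spent locating, location made twice as fast.
new_time = recovery_time_after_speedup(100, 0.9, 2)  # 45 + 10 = 55 minutes
```

Halving the location time alone cuts the total recovery time by 45%, whereas halving the repair time would save only 5%.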
From this point of view, whether in terms of efficiency or effectiveness, there is a clear demand to introduce artificial intelligence and enable an automated closed loop of prediction, analysis, and decision-making in network operation and maintenance.
(4) Flow of cross-domain fault location algorithm
The figure above is the algorithm flow of cross-domain fault location. The whole process is as follows:
Inputs:
Alarm: the alarms reported by the devices;
Topo: the network topology;
Fault propagation diagram: the influence relationships between alarms.
Processing steps:
Noise reduction: filter out the large number of invalid alarms among the original alarms, such as flapping and oscillating alarms;
Aggregation: divide the alarms, separating those unrelated in the topology and aggregating those that may be related (belonging to the same fault), to obtain multiple alarm groups;
Identify and locate: analyze each alarm group in combination with the topology and the fault propagation diagram, and identify how many faults each alarm group contains, along with the root cause network element and root cause alarm of each fault;
Diagnosis: diagnose the fault type of each fault, for example: power interruption.
Outputs:
Root cause of the failure
Alarms involved in the failure
Failure recovery suggestions
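The first three steps of this flow can be sketched as follows. This is a deliberately simplified illustration, not the algorithm used by the service: alarms are plain dictionaries, the topology is an adjacency map listing each link in both directions, the fault propagation diagram maps a cause alarm type to the alarm types it can trigger, and every name and alarm type below is hypothetical. The diagnosis step is left out.

```python
def reduce_noise(alarms, flap_window=60):
    """Noise reduction: drop repeats of the same (element, type) alarm that
    arrive within `flap_window` seconds of the previous occurrence."""
    last_seen, kept = {}, []
    for alarm in sorted(alarms, key=lambda a: a["time"]):
        key = (alarm["ne"], alarm["type"])
        if key not in last_seen or alarm["time"] - last_seen[key] > flap_window:
            kept.append(alarm)
        last_seen[key] = alarm["time"]
    return kept

def aggregate(alarms, topo):
    """Aggregation: group alarms whose network elements are connected to each
    other through alarmed neighbours in the topology."""
    alarmed = {a["ne"] for a in alarms}
    visited, groups = set(), []
    for start in sorted(alarmed):
        if start in visited:
            continue
        stack, members = [start], set()
        while stack:  # depth-first walk over alarmed neighbours
            ne = stack.pop()
            if ne in members:
                continue
            members.add(ne)
            stack.extend(n for n in topo.get(ne, ()) if n in alarmed)
        visited |= members
        groups.append([a for a in alarms if a["ne"] in members])
    return groups

def locate_root(group, propagation):
    """Identify and locate: keep the alarms that no other alarm type present
    in the group can explain according to the propagation diagram."""
    present = {a["type"] for a in group}
    explained = {e for cause in present for e in propagation.get(cause, ())}
    return [a for a in group if a["type"] not in explained]

topo = {"power1": ["bbu1"], "bbu1": ["power1", "rru1"], "rru1": ["bbu1"], "router9": []}
propagation = {"POWER_OFF": ["LINK_DOWN"], "LINK_DOWN": ["CELL_OUT"]}
alarms = [
    {"ne": "power1", "type": "POWER_OFF", "time": 0},
    {"ne": "bbu1", "type": "LINK_DOWN", "time": 2},
    {"ne": "rru1", "type": "CELL_OUT", "time": 5},
    {"ne": "router9", "type": "HIGH_TEMP", "time": 7},
    {"ne": "bbu1", "type": "LINK_DOWN", "time": 10},  # flapping repeat
]
groups = aggregate(reduce_noise(alarms), topo)
roots = [locate_root(group, propagation) for group in groups]
```

In this toy data the three alarms on the connected power, BBU, and RRU elements fall into one group whose root cause is the power alarm, while the unrelated router alarm forms its own group. A single work order per group would then replace one order per alarm.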
(5) AIOps framework implementation algorithm flow
The above explained the entire algorithm flow. Next, let's take a look at how to use the Huawei AIOps framework to implement the algorithm flow.
1. Quickly configure data sources and orchestrate processes
Configure data source: access alarms from multiple domains such as wireless, transmission, and core network, and access network topology data;
Process orchestration: use existing atomic capabilities to quickly orchestrate the process.
After the above process, the "event notification" function is complete, and the results can be saved to the record set (i.e., the database) for large-screen display. The result looks as follows:
Open one of the alarms to see the following information:
AIOps deployment recommendations
Based on the aforementioned practice, we can summarize the following:
1. Select mature scenarios and deploy AIOps step by step
After long-term practice, we have summarized the main reasons for the failure of AIOps deployment as follows:
Data is not available : The data is scattered on various independent systems, and there is a lack of comprehensive collection and management methods. Missing data and low data quality are the main reasons for the poor performance of AIOps;
Orders cannot be executed: automated operation and maintenance tools are lacking, so problems cannot be proactively detected and operations cannot be restored automatically;
The model is not intelligent: the annotation information produced in daily operation and maintenance is not accumulated effectively, so the model cannot learn by itself.
Therefore, based on these causes of failure, we can conclude that a successful AIOps deployment needs to:
Start from scenarios where conditions are mature, and advance the AIOps deployment step by step;
Bring the data in: collect all kinds of operation and maintenance data comprehensively, and improve data quality;
Make orders executable: connect the AIOps back end to existing automated operation and maintenance tools, and enhance diagnostic methods and automatic recovery capabilities;
Accumulate annotation data effectively, so that the AIOps model continuously receives feedback and can learn by itself.
2. Choose mature AIOps services
For different types of enterprises, the selection of AIOps services is also different, as shown in the following table:
Huawei's AIOps service lowers the threshold for network AI application development and accelerates the implementation of network AI applications. It has accumulated 10+ out-of-the-box smart apps, covering application areas such as operator networks, campus networks, data center networks, and IT applications. It comes pre-integrated with rich AI atomic capabilities, covering fault prediction, detection, diagnosis, identification, and other stages, and lets users develop AI applications with zero coding to improve operation and maintenance efficiency.
If you are interested, come and experience it together~ www.hwtelcloud.com/products/ai...