Facing large-scale K8s clusters, how do you find problems before your users do?


Author | Peng Nanguang (Guangnan) Source | Alibaba Cloud Native Official Account

A dike of a thousand miles can collapse because of a single ant hole.


Have you ever experienced a scenario like this: a user suddenly notifies you that something is wrong with the system, and you scramble, bewildered, to troubleshoot and repair it; or by the time you discover the fault yourself, it has already had a serious adverse effect on users.

As the saying goes, a dike of a thousand miles can collapse because of a single ant hole. User trust takes a long time to build and is very easy to destroy. Once problems like the above occur, they not only greatly hurt the user experience, but also leave users with the impression that the product/team is unreliable, squandering the credibility the product/team has accumulated over a long time and making that trust very hard to rebuild.

This is why the ability to find problems quickly matters so much: only after a problem has been found quickly can we even begin to talk about troubleshooting and solving it.

So how can we find problems before users do in complex, large-scale scenarios? Below I will share some of our experience and practice in quickly discovering problems while managing large-scale ASI clusters, in the hope that it inspires you.

Note: ASI is short for Alibaba Serverless infrastructure, a unified infrastructure designed by Alibaba for cloud-native applications. If you are interested, see: "Uncovering the Veil of Alibaba's Complex Task Resource Hybrid Scheduling Technology".


1. Complex scenarios and dilemmas

The large-scale ASI cluster scenarios we manage are very complex, which brings great challenges to our work. Mishandling any one of these scenarios may cause unexpected damage.

  • From the component dimension: we currently maintain hundreds of components, with tens of thousands of component changes every year. With changes this frequent, how do we strike a balance between stability and efficiency, make changes more stable, make gray (canary) releases more trustworthy, and thereby shrink the blast radius?

  • From the cluster dimension: there are currently thousands of clusters and a huge number of nodes. We encounter many cluster/node problems, and monitoring-link coverage is complicated. How do we make cluster operations more reliable?

  • From the dimension of second-party users and business scenarios: we support a large number of second-party users within the group, and their business scenarios are very complicated. How do we ensure that each of these distinct business scenarios receives consistent, careful attention?

2. Problem prediction and solution ideas

Based on long-term cluster management experience, we hold the following premises:

  1. Forward-link data monitoring cannot cover all scenarios without blind spots. Even if the monitoring data of every node on a link is normal, that cannot 100% guarantee the link is available.

    • The state of the cluster changes constantly, every component keeps iterating and upgrading, and every system on the link is changing all the time. Monitoring coverage is forever playing catch-up: it can only approach 100%, never fully reach it.
    • Even if the monitoring data of every component/node on the cluster link is normal, there is no guarantee the link is 100% available. It is just like a business system that looks fine and exposes no problems: only after the entire link has actually been exercised, as in a full-link stress test, can you conclude that it is genuinely usable.
    • To prove that something is available you need countless positive examples; to prove it is unavailable, a single counterexample is enough. A data monitoring link can only approach full coverage; it can never guarantee true full coverage.
  2. In large-scale scenarios, data cannot be kept 100% consistent.

    • When the cluster is large enough, data consistency problems become more apparent. For example: does the global risk control component cover the full cluster link? Is the related flow control configuration rolled out uniformly across the full cluster link? Is the time zone of a pod's main container consistent with the upper layer? Are the cluster's client node certificates about to expire? Any negligence on points like these may lead to serious failures.

Only by covering both of these risk categories can we be confident of truly discovering problems before users do. Our ideas for addressing them are:

  1. Black box detection
    • Black box detection means simulating, in a broad sense, user behavior, to detect whether a link works.
  2. Directional inspection
    • Inspection means checking a cluster for abnormal indicators, to find existing or probable risk points.

Based on these ideas, we designed and implemented the KubeProbe detection/inspection center to make up for the deficiencies of forward monitoring in complex systems and to help us find system risks and online problems better and faster.


Black box detection and directional inspection

1) Black box detection

You may have experienced this: the monitoring data of every system on a link looks normal, yet the actual end-to-end process fails anyway. Or, because systems change quickly, monitoring coverage below 100% always leaves omissions, so problems that affect users raise no alarm while alarms with no real user impact fire constantly, leaving everyone exhausted.

If a system's developers do not use their own system, how could they possibly discover its problems before users do? So to find problems before users, we must first become users ourselves, and indeed the heaviest, most knowledgeable users, ones who perceive the system's condition at all times.

Black box detection means becoming your own user and simulating the behavior of a "user", in the broad sense, to probe the cluster/component/link under test. Note that the "user" here is not only a person who uses the system in the narrow sense: the "user" of etcd is the APIServer, and the "user" of ASI may be an engineer operating the cluster through the APIServer, or a publish/scale-out/scale-in operation initiated by Normandy.

We want KubeProbe to be triggered on changes (watching cluster state changes / component changes / component releases / system upgrades, etc.), periodically at runtime (short cycle, high frequency), and manually during failure recovery, running various kinds of black box detection so that the availability of components/clusters/links is sensed at the earliest possible moment.

Take the availability of an etcd cluster as an example. We implement a detection use case whose logic is to perform create/get/delete/txn operations against etcd and record the success rate and elapsed time of each operation; when the success rate drops below 100% or the elapsed time exceeds the tolerance threshold, an alarm is triggered. We run this etcd use case periodically, and any change to the etcd cluster also emits an event that triggers the probe to run immediately, so an etcd availability failure is found as early as possible. When the etcd cluster is unavailable for whatever reason, we can also trigger the probe manually, and so learn at the first moment whether it has recovered.

2) Directional inspection

In large-scale cluster/system scenarios, data consistency is a problem you will inevitably face. Inconsistent data creates hidden dangers and may cause deterministic failures in the future.

Compared with the unknown failure scenarios faced by black box detection, the goal of directional inspection is to scan the known risk points of the cluster.

We want KubeProbe to conduct regular directional inspections of the entire cluster/link, find these points of data inconsistency, and judge whether they may cause risk, so as to head problems off before they occur.

For example, incomplete coverage of etcd hot/cold standby may mean a cluster cannot be recovered quickly, so we regularly run a directional inspection of etcd hot/cold backup coverage, find the clusters that are not covered or not rolled out uniformly, and raise an alarm. Likewise, if the cluster risk control system does not cover the full cluster link, or its rate-limiting configuration is not leveled across it, certain failure scenarios could escalate into a complete cluster collapse; so we regularly scan the risk control configuration across the whole network, judge whether it could cause a failure, surface these hidden but known risk points, and alarm on them.


1. Architecture

1) Basic architecture

The basic implementation architecture of KubeProbe is roughly as shown in the figure below. The KubeProbe central end configures the associations between clusters/cluster groups and inspection/detection use cases/use case sets, and is responsible for issuing concrete probe instances to clusters. When a specific inspection/detection use case is issued to a specific cluster, a pod is created from the use case's container image; the pod executes a number of inspection/detection steps and, when finished, calls back to the center to write back the inspection/detection result. The results are displayed and alarmed on uniformly at the center, and are also exposed for other consumers (for example, the release-blocking feature of the ASIOps platform).

2) High-frequency architecture

In addition to the basic architecture above, for high-frequency detection use cases (short detection period, very frequent triggering, even continuous uninterrupted detection) we designed a distributed, resident in-cluster detection architecture. A ProbeOperator component in the cluster watches changes to a custom object, probeConfig, and creates a resident detection pod accordingly; this pod runs the detection logic continuously without interruption, achieving near-seamless continuous detection, and reports the results, after processing such as denoising and token-bucket rate limiting, to the central end for other consumers to consume.
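The reporting-side processing can be sketched as follows, under the assumption that "denoising" means suppressing results that have not changed since the last report and that the rate limit is a token bucket; the `Reporter` type and its fields are hypothetical names for illustration, not the actual ProbeOperator implementation.

```go
package main

import (
	"fmt"
	"time"
)

// Report is one probe result destined for the central end.
type Report struct {
	Case    string
	Success bool
	Message string
}

// Reporter suppresses consecutive identical results (denoising) and
// applies a token-bucket rate limit before forwarding reports upstream.
type Reporter struct {
	tokens   float64
	capacity float64
	rate     float64 // tokens refilled per second
	last     time.Time
	lastSent map[string]Report
	send     func(Report) // upstream delivery, e.g. HTTP to the center
}

func NewReporter(capacity, rate float64, send func(Report)) *Reporter {
	return &Reporter{
		tokens: capacity, capacity: capacity, rate: rate,
		last: time.Now(), lastSent: map[string]Report{}, send: send,
	}
}

// Offer decides whether a report is forwarded: unchanged results are
// dropped outright, and state changes each consume one token.
func (r *Reporter) Offer(rep Report) bool {
	if prev, ok := r.lastSent[rep.Case]; ok &&
		prev.Success == rep.Success && prev.Message == rep.Message {
		return false // no state change: suppress as noise
	}
	now := time.Now()
	r.tokens += now.Sub(r.last).Seconds() * r.rate
	if r.tokens > r.capacity {
		r.tokens = r.capacity
	}
	r.last = now
	if r.tokens < 1 {
		return false // bucket empty: rate limited
	}
	r.tokens--
	r.lastSent[rep.Case] = rep
	r.send(rep)
	return true
}

func main() {
	sent := 0
	rp := NewReporter(5, 1, func(Report) { sent++ })
	// A steady stream of identical successes: only the first goes out.
	for i := 0; i < 10; i++ {
		rp.Offer(Report{Case: "etcd", Success: true, Message: "ok"})
	}
	fmt.Println("forwarded:", sent) // prints "forwarded: 1"
}
```

The suppression of unchanged results is what keeps a resident, seconds-level probe from flooding the central end while still forwarding every state transition immediately.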

2. KubeProbe detection/inspection use case management

All detection/inspection use cases are managed in a unified git repository. We provide a unified client library, whose core offers two main methods.

```go
import KPclient "{sigma-inf}/{kubeProbe}/client"

// Report success
// This method reports to KubeProbe that the inspection result is a success
KPclient.ReportSuccess()
os.Exit(0)

// Report failure
// This method reports to KubeProbe that the inspection result is a failure,
// with the failure message `I failed!`
KPclient.ReportFailure([]string{"I failed!"})
os.Exit(1)
```

With the provided Makefile, a use case can be packaged into a container image and registered on the KubeProbe central end for configuration and distribution to clusters. Decoupling the concrete inspection/detection logic from the KubeProbe central control end lets more second-party users plug in their own specialized inspection/detection logic flexibly and simply.

Detection/inspection use cases already in use include:

  • General detection: Simulate the life cycle of pod/deployment/statefulset to detect the entire management and control link of the cluster.
  • etcd black box detection: Simulate the basic operations of etcd and detect the status of each etcd in the meta-cluster.
  • Canary detection (thanks to the strong support of the quality technology students): simulates the deployment scenario of a user using ASI, realizing full-link simulated release/scale-out/scale-in of a canary application.
  • Virtual cluster detection: Detect the control link status of the vc virtual cluster.
  • Federation link detection: to detect the status of the related links of the federation controller.
  • Node general detection: Simulate and schedule a detection pod on each node of the cluster to detect the link status on the node side.
  • ASI client/server certificate inspection: Check the validity of the client/server certificate and whether the expiration time has exceeded the alarm threshold.
  • Global risk control rate-limit inspection: checks whether each ASI cluster has the KubeDefender global rate-limiting risk control configuration rolled out and enabled.

3. KubeProbe central end control

After writing a detection/inspection use case and packaging and uploading its image, you register the use case template on the KubeProbe central end, that is, register the image in the central end's database.

Through the "rendering configuration" parameters, we can pass specified env environment variables into the inspection/detection pod so that it executes different business logic, generating multiple use cases from the same use case template.
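A minimal sketch of this template-plus-env rendering, with hypothetical names (`UseCaseTemplate`, `TARGET_CLUSTER`): one registered image yields several use cases simply by varying the env it is started with.

```go
package main

import (
	"fmt"
	"sort"
)

// UseCaseTemplate is a registered probe image; concrete use cases are
// produced by rendering it with different env values.
type UseCaseTemplate struct {
	Name  string
	Image string
}

// Render produces the deterministic env list a probe pod would be
// started with for one concrete use case.
func Render(t UseCaseTemplate, env map[string]string) []string {
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable order for reproducible pod specs
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		out = append(out, fmt.Sprintf("%s=%s", k, env[k]))
	}
	return out
}

func main() {
	tmpl := UseCaseTemplate{Name: "etcd-blackbox", Image: "kubeprobe/etcd-probe:v1"}
	// Same template, two use cases with different targets (names hypothetical).
	fmt.Println(Render(tmpl, map[string]string{"TARGET_CLUSTER": "cluster-a"}))
	fmt.Println(Render(tmpl, map[string]string{"TARGET_CLUSTER": "cluster-b"}))
}
```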

Finally, through unified configuration management, use cases are bound to clusters, the corresponding parameters are configured, and the various distribution logics are executed.

We have also done a lot of work on the KubeProbe central end around permission and security control, dirty-data and resource cleanup, and efficiency (for example, automatic cleanup of inspection/detection use case resources based on ownerReferences), which will not be covered here.

4. Integrating with release/change blocking

We have connected KubeProbe detection with release changes. Whenever there is a change in a cluster (such as a component release), the corresponding event automatically triggers all the inspection/detection use cases bound to that cluster to check whether the cluster state is normal. If detection fails, the change is blocked, reducing the blast radius and improving cluster stability.

5. Why not use Kuberhealthy

The community has an Operator called Kuberhealthy that can do similar things. We considered adopting it, used it in depth, and contributed to the kuberhealthy community, but finally concluded it was not suitable. The main reasons: its support for large-scale clusters is weak; under high-frequency invocation its main process blocks badly; it does not support event-driven or manual one-off triggering; and it does not support unified reporting to a data center. We therefore chose to build our own system, which in hindsight was the right choice.

Some results

Since KubeProbe launched, dozens of detection/inspection use cases have been implemented and run tens of millions of times across hundreds of ASI clusters in the group. It has proactively discovered cluster failures and problems more than a hundred times, averting major failures and effectively reducing system risk. It has also been integrated with the change/release system to improve change stability. In several unusual failures, moreover, it discovered the problem before the business side did, so the problem was fixed earlier and the failure loss was objectively reduced.

Here is a specific example:

  • We receive the release events of every component in every cluster. Triggered by a release event, we run the related inspections/probes in the corresponding cluster, for example scheduling a probe pod onto the node where a node component was just released. We found that kube-proxy releases made the node temporarily unavailable: the scheduled probe pod could not be created successfully. The simple return values/logs/cluster events showed no concrete cause, and the issue kept reproducing. After in-depth investigation, it turned out to be a kube-proxy problem: kube-proxy leaks netns, and the leak accumulates over a long run; when kube-proxy restarts, the kernel has to clean up the leaked netns and stalls for a while doing so, leaving the node's network links broken during that period, so pods could be scheduled onto the node but could not run. The kube-proxy problem was subsequently fixed.

After a problem is found

1. The difference between KubeProbe and data monitoring alarms

KubeProbe addresses different scenarios than data monitoring does, leaning more toward link detection.

For example, typical monitoring alarms might look like this:

  • xx container memory usage is at 99%
  • Both replicas of the webhook are down
  • All three replicas of the apiserver are down

Such alarms usually name the specific point of failure. KubeProbe's link-detection alarms are quite different, for example:

  • Statefulset link detection failed, Failed to create pod sandbox: rpc error: code = Unknown
  • Black box detection of the entire process of etcd failed, context deadline exceeded
  • CloneSet expansion failed, connect: connection refused

From such a KubeProbe alarm it is often hard to tell, literally, why the inspection/detection failed. We usually need to combine the use case's return values, the inspection/detection pod's logs, and KubeProbe-related cluster events to locate the cause of the failure.

2. Root cause location

Using these relatively noisy KubeProbe detection-failure alarms as clues, we built a self-contained KubeProbe root cause location system, sinking expert troubleshooting experience into the system to achieve fast, automatic problem location. A simple example of the location rules is as follows:

For failed inspection/detection events and logs, we use an ordinary root-cause analysis tree together with machine-learning classification algorithms (under continuous development and investment) to locate the root cause of each failed KubeProbe detection case. On top of this, KubeProbe implements a unified problem-severity evaluation system (whose rules are still relatively simple) that rates the severity of an alarm to decide the appropriate follow-up, such as whether to self-heal or to page someone by phone.
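The rule-based part of such location logic can be sketched as a small table of substring rules; the rules, causes, and severity levels below are purely illustrative, not the production rule set.

```go
package main

import (
	"fmt"
	"strings"
)

// Rule maps a substring found in probe failure output to a diagnosed
// root cause and a severity used to pick the follow-up action.
type Rule struct {
	Match    string
	Cause    string
	Severity int // e.g. 1 = self-heal, 2 = chat alert, 3 = phone call
}

// defaultRules is an illustrative rule table keyed on common error text.
var defaultRules = []Rule{
	{"Failed to create pod sandbox", "node runtime/CNI issue", 2},
	{"context deadline exceeded", "apiserver/etcd latency or outage", 3},
	{"connection refused", "target component not serving", 2},
}

// Diagnose applies the first matching rule; unmatched failures get the
// highest severity so a human always looks at unknown problems.
func Diagnose(msg string, rules []Rule) (cause string, severity int) {
	for _, r := range rules {
		if strings.Contains(msg, r.Match) {
			return r.Cause, r.Severity
		}
	}
	return "unknown", 3
}

func main() {
	cause, sev := Diagnose("etcd probe failed: context deadline exceeded", defaultRules)
	fmt.Println(cause, sev) // prints "apiserver/etcd latency or outage 3"
}
```

In the real system a classifier would sit alongside such rules; the severity it outputs is what drives the choice between self-healing and a phone alarm.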

3. Oncall and ChatOps

With the root cause location and alarm severity evaluation systems above, we used an NLP alarm robot to implement an automated on-call system and ChatOps. Some usage examples are shown below. ChatOps and the on-call robot greatly reduce the complexity of problem handling, using technical means to eliminate repetitive work wherever possible.

We are still on the way

The above is a little of our experience managing large-scale Kubernetes clusters, along with some common problems we have solved on the way. I hope it is helpful to you. These capabilities still need continuous polishing at Alibaba's massive scale; we are still on the road, and will keep going.