I first heard the name Kubernetes in 2016. At that time, container orchestration was still in a "Three Kingdoms" period, with Docker Swarm, Mesos, and Kubernetes competing. Thanks to a series of advantages (scalability, declarative interfaces, cloud friendliness), Kubernetes stood out in this competition and eventually came to dominate. As the core CNCF project (bar none), Kubernetes is the cornerstone for putting cloud native into practice. Alibaba is currently carrying out a site-wide cloud native transformation based on Kubernetes, and within one to two years 100% of Alibaba's business will run on the public cloud.
CNCF defines cloud native as building and running elastically scalable applications in environments such as public, private, and hybrid clouds, using technologies that include containers, service meshes, microservices, immutable infrastructure, and declarative APIs. The resulting applications are easy to manage, observable, and loosely coupled. Observability is therefore an essential part of an application system. Cloud native also has a design principle of diagnosability, which covers cluster-level logs, metrics, and traces.
Why do we need a log system?
The usual process for locating an online problem is: discover the problem through metrics, use traces to locate the faulty module, then use that module's logs to pin down the cause. Logs contain errors, key variables, code execution paths, and similar information that is at the core of troubleshooting, so logs are always a necessary step in resolving online issues.
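The last step of that workflow can be sketched in a few lines of Python. This is a minimal sketch, not the tooling described in this article: the log line layout (`timestamp level module trace_id message`), the module names, and the trace IDs are all assumptions made for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: str
    level: str
    module: str
    trace_id: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one line in the assumed 'timestamp level module trace_id message' layout."""
    parts = line.strip().split(" ", 4)
    if len(parts) < 5:
        return None  # skip lines that do not match the assumed layout
    return LogRecord(*parts)


def logs_for_trace(lines: Iterable[str], module: str, trace_id: str) -> List[LogRecord]:
    """Return the records of one module that belong to one trace, in input order."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r and r.module == module and r.trace_id == trace_id]


# Hypothetical sample data: the trace pointed at the "order" module.
SAMPLE = [
    "2019-09-01T10:00:00Z INFO order t-42 create order start",
    "2019-09-01T10:00:00Z INFO pay t-42 charge start",
    "2019-09-01T10:00:01Z ERROR order t-42 inventory lookup timed out",
    "2019-09-01T10:00:02Z INFO order t-43 create order start",
]

if __name__ == "__main__":
    for record in logs_for_trace(SAMPLE, module="order", trace_id="t-42"):
        print(record.level, record.message)
```

Once a trace has narrowed the problem to one module and one request, filtering that module's logs by trace ID leaves only the lines that describe the failing call.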
Over more than ten years at Alibaba, the log system has evolved alongside the form of computing, in roughly three main stages:
In the standalone era, almost all applications were deployed on a single machine. When service load increased, the only option was to switch to a higher-spec IBM minicomputer. Logs, as part of the application system, were mainly used for debugging programs, and were usually analyzed with common Linux text tools such as grep.
As the single-machine model became a bottleneck restricting the growth of Alibaba's business, the Apsara (Feitian) project was started in order to truly scale out, and in 2013 the Apsara 5K project officially went live. At this stage, each business began its distributed transformation, and calls between services changed from local to distributed. To better manage, debug, and analyze distributed applications, we developed a trace (distributed link tracing) system and a wide variety of monitoring systems. What these systems have in common is that they store all logs (including metrics and so on) centrally.
To support faster development and iteration, in recent years we have begun containerization, embracing the Kubernetes ecosystem, moving the business fully onto the cloud, serverless, and related work. At this stage, logs have exploded in both scale and variety, and the demand for digital, intelligent log analysis keeps growing, so a unified log platform came into being.
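One reason centralized storage helps in the distributed stage is that logs from many machines can be read as a single timeline. The following is a minimal sketch of that idea only. It assumes each host's log is already sorted by time and that every line begins with an ISO 8601 UTC timestamp in the same format; the host names and log lines are made up.

```python
import heapq
from typing import Dict, Iterator, List, Tuple


def tag_with_host(host: str, lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (timestamp, host, message) tuples for one host's log."""
    for line in lines:
        timestamp, _, message = line.partition(" ")
        yield timestamp, host, message


def merged_timeline(logs_by_host: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Merge per-host logs, each already time-ordered, into one timeline."""
    streams = [tag_with_host(host, lines) for host, lines in logs_by_host.items()]
    # heapq.merge performs a lazy k-way merge of sorted inputs. Tuples compare by
    # timestamp first, and same-format ISO 8601 UTC strings sort chronologically.
    return list(heapq.merge(*streams))


# Hypothetical logs from two machines serving one distributed request.
LOGS = {
    "host-a": [
        "2019-09-01T10:00:00Z request received",
        "2019-09-01T10:00:03Z response sent",
    ],
    "host-b": [
        "2019-09-01T10:00:01Z rpc call start",
        "2019-09-01T10:00:02Z rpc call failed",
    ],
}

if __name__ == "__main__":
    for timestamp, host, message in merged_timeline(LOGS):
        print(timestamp, host, message)
```

With grep on a single machine, the two halves of this request would have to be inspected separately and lined up by hand.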
The ultimate interpretation of observability
In CNCF's view, the main role of observability is problem diagnosis. At the level of the whole company, however, observability covers not only DevOps but also business, operations, BI, auditing, security, and other fields. The ultimate goal of observability is to make every aspect of the company digital and intelligent.
At Alibaba, almost every business role works with some kind of log data. To support all of these application scenarios, we have developed many tools and features: real-time log analysis, link tracing, monitoring, data processing, stream computing, offline computing, BI systems, auditing systems, and so on. The log system itself focuses on real-time collection, cleaning, intelligent analysis, and monitoring of data, and on connecting to the various stream computing and offline systems.
Difficulties in the Kubernetes log system
Solutions for a plain log system are numerous and relatively mature, so we will not go over them here. This article focuses only on building a log system on Kubernetes. Logging on Kubernetes differs considerably from logging on physical and virtual machines, for example:
The form of logs is more complex. Beyond logs on physical and virtual machines, container standard output, files inside containers, container events, Kubernetes events, and more all need to be collected.
The environment is highly dynamic. In Kubernetes, machine failures, nodes going offline and online, Pod destruction, and scale-out and scale-in are the norm. In these situations logs exist only transiently (for example, once a Pod is destroyed, its logs are no longer visible), so log data must be collected to the server side in real time. Log collection must also be able to adapt to such highly dynamic scenarios.
There are more types of logs. The figure shows a typical Kubernetes architecture: a request from the client passes through multiple components such as CDN, Ingress, Service Mesh, and Pod, and involves a variety of infrastructure. The number of log types grows accordingly, for example logs from the various Kubernetes system components, audit logs, Service Mesh logs, and Ingress logs.
The business architecture is changing. More and more companies are adopting microservice architectures on Kubernetes. In a microservice system, service development is more complex, and services depend more and more on each other and on underlying products, so troubleshooting becomes more complex as well. Correlating logs across all of these dimensions becomes a difficult problem.
Logging solutions are hard to integrate. We usually build a CI/CD system on Kubernetes that needs to complete the integration and deployment of the business as automatically as possible, so log collection, storage, and cleaning also need to be integrated into this system, in a way that stays as consistent as possible with the declarative deployment style of Kubernetes. Existing log systems, however, are usually fairly independent systems and are difficult to integrate into CI/CD.
Log scale. In the early stage of a system we usually choose to build our own log system from open source components. This works without problems during test and verification or in a company's early days, but once the business grows and log volume reaches a certain scale, a self-built open source system often runs into problems such as tenant isolation, query latency, data reliability, and system availability. Although the log system is not on the core path of IT, these problems can be very costly when they strike at a critical moment. For example, during an emergency several engineers query logs at the same time and overwhelm the log system, which lengthens the recovery time and widens the impact of the failure.
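To make the points above about container standard output and real-time collection more concrete, here is a minimal sketch of incremental collection. It is not the collector used in the system described in this article. It assumes container stdout is written in the line-delimited JSON format of Docker's `json-file` logging driver (keys `log`, `stream`, `time`), and the function name and checkpoint handling are hypothetical. A real collector would also have to handle log rotation, persist its checkpoint, and discover new containers.

```python
import json
import os
import tempfile
from typing import List, Tuple


def collect_new_records(path: str, offset: int) -> Tuple[List[dict], int]:
    """Read complete lines appended after `offset`; return parsed records and the new offset."""
    records = []
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file, or a partially written line: retry on the next poll
            offset += len(line)
            try:
                entry = json.loads(line)
            except ValueError:
                continue  # skip a malformed line but still advance past it
            if not isinstance(entry, dict):
                continue  # valid JSON, but not a log record in the assumed format
            records.append({
                "time": entry.get("time"),
                "stream": entry.get("stream"),
                "message": entry.get("log", "").rstrip("\n"),
            })
    return records, offset


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "container-json.log")  # hypothetical file name
        with open(log_path, "w") as f:
            f.write(json.dumps({"log": "started\n", "stream": "stdout",
                                "time": "2019-09-01T10:00:00Z"}) + "\n")
        records, checkpoint = collect_new_records(log_path, 0)
        print(records, checkpoint)
```

Calling `collect_new_records` repeatedly with the returned offset ships each line once, shortly after it is written, so the data has already left the node by the time the Pod is destroyed.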
Author: Yuan B
Original link: https://yq.aliyun.com/articles/717779?utm_content=g_1000076969
This article is original content of the Yunqi Community and may not be reprinted without permission.