To commemorate the two-year journey from 2021 to 2023.
Summary
Tencent's Infrastructure Services (TIS) owns a critical part of the Chinese internet (such as WeChat). Over time, operating the business under the legacy embedded Ops model became increasingly painful. To transform operations, we built an internal DevOps platform, EasyCloud, to address the challenge with software.
The TIS architecture is one of the most technically complex in the company. TIS built the core COS (similar to Amazon S3), CBS (similar to Amazon EBS), and CDN services for the whole Tencent business. Its business spans both public and private cloud. The public cloud side runs in partnership with Tencent Cloud, with a focus on the data plane.
This technical and business complexity put pressure on both engineering and DevOps and hurt efficiency. The leaders of the business wanted a change.
Over the past two years, TIS built its internal EasyCloud to unify and automate the operations. At the end of the 2-year journey, the VP said, “The things I did not see change in the last 10 years, I see a change in the last 2 years.”
Legacy architecture and challenges
TIS started as a storage service, and the complexity of architecture, coupled with rapid growth, required Ops engineers to collaborate deeply. To support this growth, the ops teams were embedded into the storage business, a strategy that proved to work well.
Figure 1: The legacy embedded Ops model empowered growth but also led to tooling silos and fragmentation.
During more than 15 years of growth, as TIS scaled its business, the embedded model scaled accordingly. As a result, the overall business faced the following challenges:
A 6:1 dev-to-ops ratio: The public cloud business brought a higher release frequency, a larger customer base, and a 10x increase in zone-based geolocations. Because tooling was fragmented and the operational model was human-driven, more business operations required more Ops engineers.
30% deployment failure rate: Because of the tooling fragmentation, deployment best practices were tribal knowledge held by experienced subject matter experts. Pairing a new Ops engineer with a complex deployment could easily lead to a deployment failure.
Low deployment standard parity: The continuous deployment platform had been rebuilt four times, and three deployment standards had been released. During a customer conversation, one customer asked: “When did the deployment standard actually land consistently?”
Current Architecture
We achieved a 10:1 engineering-to-ops ratio by building the EasyCloud platform. This platform allowed us to build a suite of services and an ecosystem to automate deployment, chaos management, policy enforcement, and more, which in turn enabled the transformation of the ops model.
Figure 2: EasyCloud enables an automated operating standard across businesses, transforming the operating model.
The Product Catalog and EasyCloud Portal lay the foundation for the EasyCloud ecosystem, facilitating transformation. The EasyCloud Portal serves as a unified entry point, offering insights and tools for daily use.
The Continuous Deployment Platform is a new CD platform with an embedded deployment standard and a pluggable architecture to execute deployment workflows at scale.
The Ecosystem and other platform services, such as Chaos Engineering, are built on the product catalog from day 1. As the platform proves successful, we continue to build more platforms that enhance business efficiency, including an observability platform and a build platform.
Product Catalog & EasyCloud Portal
The Product Catalog (PC) constructs a tag-based Configuration Management Database (CMDB) for approximately 5 million instances worldwide. It establishes a unified and modernized view for all business operations. Engineers with more than 5 years of experience in the business have said that it “has achieved what they dreamed about before.”
Figure 3: Product catalog synchronizes data from existing systems and builds a foundation to enable an ecosystem.
Data synchronization for millions of instances. The TIS business had integrated the cluster and application launching process deeply into its own systems. The synchronization incrementally syncs the data (applications, instances, tags, and so on) at scale, builds appropriate indexes, and performs anti-entropy to keep the data accurate.
The CMDB is a tag-based service. It supports batch and dynamic tag-based queries with pagination, which cover the legacy wildcard query use cases. The CMDB separates the primary from the backup: the primary serves writes and the backup serves reads.
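As an illustration of the anti-entropy step, here is a minimal sketch in Python. It assumes both the upstream system and the catalog can be viewed as a mapping from instance ID to a record, and it compares content hashes to find drift. The function names and record shape are hypothetical, not the actual Product Catalog implementation.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Stable hash of one instance record (keys sorted for determinism)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def anti_entropy(source: dict, catalog: dict) -> dict:
    """Repair `catalog` in place so it matches `source`; report what changed.

    Both arguments map instance_id -> record dict (an assumed shape).
    """
    repaired = {"added": [], "updated": [], "removed": []}
    for instance_id, record in source.items():
        if instance_id not in catalog:
            repaired["added"].append(instance_id)
            catalog[instance_id] = dict(record)
        elif record_digest(catalog[instance_id]) != record_digest(record):
            repaired["updated"].append(instance_id)
            catalog[instance_id] = dict(record)
    # Anything left in the catalog that the source no longer knows is stale.
    for instance_id in list(catalog):
        if instance_id not in source:
            repaired["removed"].append(instance_id)
            del catalog[instance_id]
    return repaired
```

At real scale a full pairwise comparison like this would likely be replaced by comparing digests per shard or per time window first, so that only mismatching ranges are inspected record by record.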
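A tag-based query with pagination and wildcard support could look roughly like the sketch below. It is an in-memory illustration only: the function name, the cursor scheme, and the use of glob patterns for the legacy wildcard case are assumptions, and a real CMDB would answer such queries from indexes rather than a linear scan.

```python
from fnmatch import fnmatchcase


def query_by_tags(instances, selector, page_size=2, cursor=0):
    """Return (page, next_cursor) of instance IDs whose tags match `selector`.

    `instances` is a list of (instance_id, tags) pairs in a stable order.
    `selector` maps tag key -> glob pattern, e.g. {"zone": "ap-*"}.
    `next_cursor` is None when there are no more results.
    """
    matches = [
        instance_id
        for instance_id, tags in instances
        if all(
            key in tags and fnmatchcase(tags[key], pattern)
            for key, pattern in selector.items()
        )
    ]
    page = matches[cursor:cursor + page_size]
    has_more = cursor + page_size < len(matches)
    return page, (cursor + page_size if has_more else None)
```

A caller passes the returned cursor back in to fetch the next page and stops when the cursor is None.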
Continuous Deployment
Continuous Deployment (CD) deploys to millions of instances globally for both the public cloud and private cloud. Deployment improvements built on modern tooling have shortened the time to raise the deployment success rate to one quarter per business, three times faster than AWS.
Figure 4: Continuous deployment enables standardization and flexibility at scale.
The new CD enabled 100% deployment standard parity. The Workflow Definition Document (WDD) defines a standardized schema, and 34 default workflow step execution plugins support blue/green deployment, approval and notification, staggered deployment, and more. This minimizes the cost of deployment standard parity and improves ease of use and scalability.
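To make the idea of a standardized workflow document executed by pluggable steps concrete, here is a minimal sketch. The actual WDD schema and plugin interface are not described here, so the field names (`name`, `steps`, `type`, `params`), the step types, and the registry are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical plugin registry: step type -> callable(params, context).
PLUGINS: Dict[str, Callable[[dict, dict], None]] = {}


def plugin(step_type: str):
    """Decorator that registers a step execution plugin under `step_type`."""
    def register(func):
        PLUGINS[step_type] = func
        return func
    return register


@plugin("approval")
def approval(params, context):
    context["log"].append(f"approved by {params['approver']}")


@plugin("staggered_deploy")
def staggered_deploy(params, context):
    targets, batch = params["targets"], params["batch_size"]
    for start in range(0, len(targets), batch):
        context["log"].append(f"deploy {targets[start:start + batch]}")


@plugin("notify")
def notify(params, context):
    context["log"].append(f"notify {params['channel']}")


def validate(workflow: dict) -> None:
    """Reject documents that do not follow the (assumed) schema."""
    if not workflow.get("name"):
        raise ValueError("workflow needs a name")
    for step in workflow.get("steps", []):
        if step.get("type") not in PLUGINS:
            raise ValueError(f"unknown step type: {step.get('type')!r}")


def run(workflow: dict) -> List[str]:
    """Validate the document, then dispatch each step to its plugin."""
    validate(workflow)
    context = {"log": []}
    for step in workflow["steps"]:
        PLUGINS[step["type"]](step.get("params", {}), context)
    return context["log"]
```

Because every workflow passes through the same validation and the same plugin set, a standard expressed this way is enforced by construction, and a new deployment behavior only needs one more registered plugin.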
Deployment success rate improved from 70% to 100%.
6% reduction in system failures. The core of the CD is a workflow execution service that provides workflow allocation, idempotency, isolation, and delay tolerance to dramatically reduce service-related failures.
10% reduction in failures. The standardization of the instance deployment template and improved failure handling both reduce and tolerate partial instance failures. The flexible workflow orchestration supports tag-based arrangements to enable use cases like hardware-type-based blue-green deployment.
8% reduction in human errors. The new CD lets subject matter experts (SMEs) embed their experience into the system. They are empowered to define their own standards, enabling any operator to operate safely without breaking the deployment.
4% improvement from enforced quality checks. The system uses the PC to unify the data from CI and makes testing and version information mandatory. It also enforces a double-check process before deployment (by both the Ops leader and the Dev leader). The streamlined process makes deployment outcomes predictable.
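The mandatory checks described above can be pictured as a gate that runs before any deployment starts. The sketch below is an assumption about how such a gate might be expressed. The field names (`version`, `tests_passed`, `approvals`) and role identifiers are hypothetical.

```python
def deployment_gate(release: dict) -> list:
    """Return a list of blocking reasons; an empty list means the gate passes."""
    problems = []
    if not release.get("version"):
        problems.append("missing version")
    if release.get("tests_passed") is not True:
        problems.append("tests not passed")
    # Double check: both leaders must approve before deployment.
    approvals = set(release.get("approvals", []))
    for role in ("ops_leader", "dev_leader"):
        if role not in approvals:
            problems.append(f"missing approval: {role}")
    return problems
```

Returning every blocking reason at once, rather than failing on the first, lets an operator fix all gaps in one pass.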
4.8 out of 5 satisfaction. On the ease-of-use side, the WDD enables flexible UI-driven orchestration with a p99 latency of 200 ms. On the system reliability side, the execution engine provides reliable pause and resume control. The overall service now has 99.95% availability, compared to 95% before. The simplified experience and a dedicated on-call schedule contribute to a better support experience. Requirements and feature requests are managed through our sprint and bi-weekly release process, which keeps delivery predictable.
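Reliable pause and resume, together with the idempotency mentioned earlier, can be sketched as an engine that checkpoints each finished step and skips it on any later run. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, and the in-memory list stands in for a durable checkpoint store.

```python
class WorkflowRun:
    """Executes steps in order, checkpointing so a paused run can resume."""

    def __init__(self, steps):
        self.steps = list(steps)   # list of (step_id, callable) pairs
        self.completed = []        # stands in for a durable checkpoint store
        self.paused = False

    def pause(self):
        """Request a pause; it takes effect before the next unfinished step."""
        self.paused = True

    def resume(self):
        """Clear the pause flag and continue from the last checkpoint."""
        self.paused = False
        return self.run()

    def run(self):
        for step_id, action in self.steps:
            if step_id in self.completed:
                continue  # idempotency: never repeat a finished step
            if self.paused:
                return "paused"
            action()
            self.completed.append(step_id)
        return "done"
```

Calling `run()` again after completion, or after a retry by the caller, does nothing further, because every step is already checkpointed.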
Ecosystem & Other platforms
As the CD platform proved successful, more systems were rewritten, whenever the opportunity arose, to build on top of the platform from day 1. Chaos engineering leverages the PC to perform scoped operations based on instance tags. The policy engine service leverages the PC to control the pace of scanning the operational environment, ensuring production stays safe.
The observability platform builds a unified metrics and event store to provide an application-centric, unified view of application health, handling petabytes of monitoring data per day. The build platform uses Bazel to offer an incremental build experience, with customized support for security and integration with the ecosystem.
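One simple way to control the pace of a scan is to walk the targets in fixed-size batches with a pause between batches. The sketch below assumes that approach. The function name and parameters are hypothetical, and the `sleep` argument is injectable only so the pacing can be tested without waiting.

```python
import time


def paced_batches(targets, batch_size, pause_seconds, sleep=time.sleep):
    """Yield `targets` in batches, pausing between batches to limit scan pace."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    yielded_any = False
    for target in targets:
        batch.append(target)
        if len(batch) == batch_size:
            if yielded_any:
                sleep(pause_seconds)  # no pause before the first batch
            yield batch
            yielded_any = True
            batch = []
    if batch:  # final, partially filled batch
        if yielded_any:
            sleep(pause_seconds)
        yield batch
```

The target list itself would come from a tag-scoped catalog query, so the same tags that scope a chaos experiment can also bound what a policy scan touches.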