Enhancing Flexibility and Efficiency: "Ali Lab Next" Environment
Public clouds offer virtually unlimited resources, allowing users to scale their labs up or down based on project needs. This enables teams to handle varying workloads without investing in physical infrastructure.
However, in a lab environment designed for engineers to explore features and potential application uses, such scalability and flexibility can lead to security vulnerabilities or unexpectedly high costs. Striking the right balance between flexibility, security, and cost control is a significant challenge.
Understanding the Lab Environment
A lab environment serves as a sandbox for testing and development. It's a controlled setting where users can experiment with new technologies, validate integrations, and conduct simulations without impacting the production environment. The lab aims to replicate elements of our real-world infrastructure to enable accurate testing while offering flexibility for diverse use cases.
Typical User Journey
Before diving into the challenges, let's outline a typical user journey:
- User Onboarding: Access requests are made to the lab environment.
- Resource Allocation: Users receive access to necessary resources (e.g., VPCs, containers).
- Testing and Validation: Users perform tests, validate configurations, and simulate real-world scenarios.
- Reporting: Results are documented and shared for further analysis.
- Resource Cleanup: Resources are either retained for extended use or automatically cleaned up based on predefined policies.
Pain Points
From Users' Perspective:
- Discrepancy from Real Environment: The current lab environment on AliCloud significantly differs from the real Michelin vDC environment. For example, it lacks the networking layers present in the Michelin landing zone, making it impossible to identify necessary network changes and configuration adjustments required during real deployments.
- Missing Fundamental Services: Essential foundational services are absent and need to be re-established before each initiative, greatly slowing down testing and validation processes.
- Limited Privileges and Access Delays: Basic privileges are missing, forcing users to submit requests and obtain approvals. This process often involves several rounds of justification and long waiting periods before access is granted.
- Inability to Access Michelin Resources from Lab: Due to security concerns, users are unable to access Michelin internal resources from lab environments, limiting the ability to test integrations with internal systems.
From Lab Administrators' Perspective:
- Repetitive Setup Tasks: Administrators must repeatedly set up environments and grant access for the same purposes to different users, leading to inefficiencies.
- Time-Consuming Review Processes: Significant time is spent reviewing each request meticulously to avoid security risks or excessive costs.
- Resource Optimization: Difficulty in optimizing resources effectively, which can lead to unnecessary expenses and resource wastage.
- Maintenance Challenges: Maintaining and updating the lab environment is difficult due to manual processes and lack of automation.
Use Cases
To identify solutions for these issues, we selected key use cases and analyzed their expectations step by step.
Scenario 1: Testing a New Service (HashiCorp Vault) on Alibaba Cloud Container Service for Kubernetes (ACK)
- Obtain Full Access to a Dedicated Namespace within an existing ACK cluster (in lab) to install and run the Vault service.
- Perform read and write operations of secrets from an existing namespace on a non-production ACK cluster (in vDC) to the Vault service to validate functionality.
- Test the management of the Vault service from Michelin's internal network under various user roles to ensure proper access control and role-based functionality.
- Evaluate the Vault service against Michelin's security policies and compliance requirements to ensure it meets all necessary standards.
Scenario 2: For Cloud Ops, practice pre-defined ACK cluster management activities
- Execute predefined high-risk activities that could potentially cause cluster-level downtime to understand their impact and develop mitigation strategies.
- Simulate various failure scenarios within the ACK cluster and practice restoring services to improve disaster recovery procedures.
- Utilize monitoring tools and analyze log data during simulations to identify issues and enhance cluster observability.
Scenario 3: For application functional verification, migrate data from Azure to Alibaba Cloud
We have data stored in Azure Storage accounts and databases. As part of the application migration, some of this data needs to be transferred from Azure to Alibaba Cloud.
Alibaba Cloud offers Data Transmission Service (DTS), and there are also third-party options. However, we need to ensure that the data is transferred over trusted network connections.
Duplicating the data sources in the lab would be too heavy and would raise confidentiality concerns, so we expect the selected tool to read the data directly from the source in a secure way.
Proposed Solution: "Lab Next" / "Hybrid Lab"
To address these challenges, we propose enhancing the current lab environment to better integrate with Michelin’s internal systems while focusing on security, operational efficiency, and cost management.
The "Lab Next" environment lets users run functional tests that require Intranet access and makes it easier to reproduce issues that only occur on the private network.
Decision Tree
Scenario A
- A user develops new Ansible roles and needs to test them from AWX. In this case, they can use an ECS instance in the managed zone for real testing.
- The DevOps team wants to compare MySQL 8 and PostgreSQL 16 access from ACK. The managed zone is ready for them.
Scenario B
- A project chooses MongoDB as its database. Alibaba Cloud provides a private endpoint for MongoDB, but it has never been used in the Ali vDC. Once the cloud architecture team confirms it as a candidate and the security SME approves the service (data encryption, Intranet integration, and so on), MongoDB can be placed in the restricted zone.
- A user needs to test the new Ubuntu 24 OS on Michelin's Intranet. The restricted zone is the best choice for this.
Scenario C
- A user wants to run a PoC of a brand-new SaaS service (AI or otherwise) provided by Alibaba Cloud. The un-managed zone is open to them.
What is 'Internal'
We currently define and support the following scenarios as 'Internal' use cases:
- Requires Michelin LDAP
- Runs jobs from Michelin Ansible
- Accesses Michelin Artifactory (ALI)
- Is managed from the Michelin network
What is 'Managed'
"Managed" means:
- Resources are available out of the box
- Provisioning is fully automated
- Configuration is the same as in UAT/PRD
- Verified Intranet flows are approved in advance
What is 'Restricted'
- Resources can connect to Michelin's network, but network traffic is blocked by default. Users need to obtain permission to allow the traffic.
- Approved resource types are open for users to create, provided they follow the security requirements.
What is 'Un-managed'
- Works like a traditional lab environment: users can create any type of resource without restriction.
- The network is completely isolated from the Ali vDC, the managed zone, and the restricted zone.
Catalog of “Managed”
Several GA (Generally Available) versions of each service are pre-defined. All of these resources are provisioned through the API or Terraform.
The list will grow with business needs.
- ECS (Ubuntu 22.04 / Windows 2022)
- ACK (Pro v1.30)
- RDS MySQL (8.0), RDS PostgreSQL (15.0 / 16.0)
- OSS bucket
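The zone definitions and the catalog above can be read as a small routing rule. The following is a minimal sketch of that rule, not the actual decision tree: the order of the checks, the `LabRequest` type, the service identifiers, and the `pending-approval` outcome are all assumptions introduced for illustration.

```python
from dataclasses import dataclass

# Catalog of 'managed' services, copied from the list above. The keys are
# illustrative identifiers, not Alibaba Cloud API values.
MANAGED_CATALOG = {
    "ecs": {"Ubuntu 22.04", "Windows 2022"},
    "ack": {"Pro v1.30"},
    "rds-mysql": {"8.0"},
    "rds-pgsql": {"15.0", "16.0"},
    "oss": {"bucket"},
}


@dataclass
class LabRequest:
    """Hypothetical request shape; field names are assumptions."""
    service: str                     # e.g. "ecs", "mongodb"
    version: str                     # e.g. "Ubuntu 22.04"
    needs_intranet: bool             # any 'Internal' use case (LDAP, AWX, ...)
    security_approved: bool = False  # architecture and security SME sign-off


def choose_zone(req: LabRequest) -> str:
    """Return the zone a request would land in under the assumed rules."""
    if req.version in MANAGED_CATALOG.get(req.service, set()):
        # Pre-defined GA version: provisioned automatically in the managed zone.
        return "managed"
    if req.needs_intranet:
        # Not in the catalog but needs Michelin's network: restricted zone,
        # and only once the service has been approved.
        return "restricted" if req.security_approved else "pending-approval"
    # No Intranet dependency: isolated, self-service lab.
    return "un-managed"
```

Under these assumptions, an Ubuntu 22.04 ECS request resolves to the managed zone, an approved MongoDB private endpoint to the restricted zone, and a SaaS PoC without Intranet needs to the un-managed zone.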
High Level Chart
The chart below shows how the design aligns with the decision tree to fulfill the requirements.
Let's see how this design resolves or mitigates the issues identified in the use cases above.
Terms:
- VPC: Virtual Private Cloud, a private network in Alibaba Cloud that is logically isolated from other virtual networks.
- vSwitch: A basic network device in a VPC, used to connect cloud resources.
- Route Table: Manages and controls the network traffic of a VPC.
- ECS: Elastic Compute Service, a high-performance, stable, reliable, and scalable IaaS-level service provided by Alibaba Cloud.
- RDS: Relational Database Service, a stable and reliable online database service that scales elastically.
- ACK: Container Service for Kubernetes, a Kubernetes-based service for running containerized applications on the cloud.
- CEN: Cloud Enterprise Network, a highly available network built on the global private network of Alibaba Cloud. CEN uses transit routers to establish inter-region connections between VPCs.
- Cloud Firewall: A cloud security solution that provides firewalls as a service.
- OSS: Object Storage Service, a secure, cost-effective, and high-durability cloud storage service.
- AK: An AccessKey pair, a permanent access credential that Alibaba Cloud provides to a user. It consists of an AccessKey ID and an AccessKey secret.
Comprehensive Security Overview
Security is the most important part of "Lab Next". We use the following measures to control and monitor the environment.
Key Security Policies
- A stringent 'Deny' policy will be implemented to prevent unauthorized modifications to essential resources. This policy is crucial for maintaining the integrity of the "Lab Next" environment.
- In "Lab Next", users are not granted permission to create public IPs or change VPCs. These key resources are centrally managed in one resource group to prevent any changes by users.
- Role-Based Access Control (RBAC): Users receive permissions appropriate to their roles, which enhances security while still allowing the access they need. RBAC allows precise authorization so that deployed resources stay under control.
- AK
  - STS: Security Token Service temporary credentials are used by routine automation jobs to create managed resources. The permission is granted through a RAM role attached to the ECS instance.
  - User AK: Each user has their own AK, which is rotated every 6 months. User keys are managed by the automation credentials (STS).
- KMS
  - Key Management Service (KMS) is an end-to-end service platform for key management, data encryption, and secret management.
  - We will use one custom KMS key to encrypt disks, RDS data, and OSS bucket data.
- Resource Audit:
  - Cloud Config: A resource auditing service that tracks configuration changes of resources and evaluates configuration compliance.
    - Deny the use of public IPs in the 'managed' zone.
    - Detect resources in the 'managed' zone that have no KMS key attached.
  - ActionTrail log: Records all activity events on Alibaba Cloud and is monitored by administrators.
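The two audit rules for the 'managed' zone (no public IP, KMS key required) can be illustrated with a small check over an in-memory inventory. This is only a sketch of the logic and does not call the Cloud Config API. The dictionary keys (`zone`, `type`, `public_ip`, `kms_key_id`) and the set of resource types that require encryption are assumptions derived from the text above.

```python
from typing import Dict, List

# Resource types expected to carry the custom KMS key (disk, RDS, OSS),
# following the KMS bullet above. The type names are illustrative.
ENCRYPTED_TYPES = {"disk", "rds", "oss"}


def audit_managed_zone(resources: List[Dict]) -> List[str]:
    """Return human-readable findings for resources in the managed zone."""
    findings = []
    for res in resources:
        if res.get("zone") != "managed":
            continue  # the two rules only apply to the managed zone
        if res.get("public_ip"):
            findings.append(f"{res['name']}: public IP is not allowed")
        if res.get("type") in ENCRYPTED_TYPES and not res.get("kms_key_id"):
            findings.append(f"{res['name']}: no KMS key attached")
    return findings
```

A real deployment would express these checks as Cloud Config rules. The function only shows what each rule is expected to flag.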
Firewall Rules
Cloud Firewall implements centralized security isolation and traffic control for cloud assets at the Internet, virtual private cloud (VPC), and host boundaries. It is the first line of defense for our workloads. In this design, we only add VPC firewall rules, which serve Intranet traffic.
VPC firewall rule: monitors and manages traffic between VPCs and traffic between a VPC and a data center.
NAT firewall rule: monitors all outbound traffic from internal-facing resources in VPCs to the NAT gateway.
- Pre-added firewall rules
  - Allow 'Lab Next' to access the on-premises LDAP for integration
  - Allow on-premises PCs to access 'Lab Next' ECS instances for operations
  - Allow the on-premises AWX to access the managed zone for automation testing
  - Allow 'Lab Next' to access Artifactory for CI/CD
- New rules are not permanent. They are created on demand and are time limited, with two options for end users:
  - One-off (default): the rule is removed at the next daily cleanup at 10 PM (CST)
  - Time based: the user chooses a duration, up to a maximum of 2 weeks
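The two expiry options can be sketched as a single function that computes when a temporary rule should be removed. This is an illustration only: it assumes that "CST" means China Standard Time (UTC+8), and the function name and its `duration` parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: "CST" in the text means China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))
CLEANUP_HOUR = 22                  # daily cleanup at 10 PM
MAX_DURATION = timedelta(weeks=2)  # upper bound for time-based rules


def rule_expiry(created_at: datetime, duration: Optional[timedelta] = None) -> datetime:
    """Return when a temporary firewall rule should be removed.

    duration=None models the default one-off rule, which lasts until the
    next 10 PM CST cleanup. Otherwise the rule is time based.
    """
    if created_at.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    created = created_at.astimezone(CST)
    if duration is None:
        cutoff = created.replace(hour=CLEANUP_HOUR, minute=0, second=0, microsecond=0)
        if created >= cutoff:
            # Created at or after today's cleanup: wait for tomorrow's run.
            cutoff += timedelta(days=1)
        return cutoff
    if duration <= timedelta(0) or duration > MAX_DURATION:
        raise ValueError("duration must be positive and at most 2 weeks")
    return created + duration
```

With these assumptions, a one-off rule created at 09:30 CST expires at 22:00 CST the same day, and a request for three weeks is rejected.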
Resource Management Strategies
Provisioning of Resources
- Day 1 Resources: Essential infrastructure resources (VPC, route table, load balancer) will be pre-created and will not be deleted. These foundational components are critical for the lab's operation.
- Day 2 Resources (Zones): Resources for verification and testing purposes.
  - Managed Zone: Resources here are automatically provisioned and managed by Terraform running on a GitLab runner. Users can access them without extensive setup.
  - Restricted Zone: Access to this zone is limited. Users can adjust some parameters, but overall access is controlled to ensure security. Resources are created through the Alibaba Cloud console.
  - Unmanaged Zone: Users have more freedom to manage resources as needed, without stringent oversight.
Configuration and Cleanup Procedures
- User Restrictions on managed resources: Users will not be able to modify 'Managed' resources directly. This restriction helps prevent accidental changes that could lead to security vulnerabilities.
- Daily Cleanup Protocol: Resources that are not essential will undergo daily cleanups at 10 PM CST. This strategy helps in managing costs effectively and ensures that only necessary resources are maintained.
- Exclusion from Cleanup: As with firewall rules, users can request through the Power App that a resource be kept for a short period.
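The cleanup policy above amounts to a filter over the lab inventory: Day 1 resources are never deleted, and resources with an unexpired exclusion are skipped. A minimal sketch follows, assuming a hypothetical `LabResource` record with a `keep_until` field set through the Power App and assuming "CST" means UTC+8. It does not reflect the actual automation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

CST = timezone(timedelta(hours=8))  # assumption: China Standard Time


@dataclass
class LabResource:
    """Hypothetical inventory record; field names are assumptions."""
    name: str
    day1: bool = False                     # Day 1 resources are never deleted
    keep_until: Optional[datetime] = None  # exclusion granted via the Power App


def resources_to_clean(resources: List[LabResource], now: datetime) -> List[str]:
    """Return the names of resources the daily cleanup should remove."""
    return [
        r.name
        for r in resources
        if not r.day1 and (r.keep_until is None or r.keep_until <= now)
    ]


# Example run at the 10 PM CST cleanup.
now = datetime(2025, 1, 6, 22, 0, tzinfo=CST)
inventory = [
    LabResource("vpc-core", day1=True),
    LabResource("ecs-test"),
    LabResource("rds-keep", keep_until=now + timedelta(days=2)),
    LabResource("oss-expired", keep_until=now - timedelta(hours=1)),
]
to_delete = resources_to_clean(inventory, now)
```

In this example only `ecs-test` and `oss-expired` are selected. The Day 1 VPC and the resource with a valid exclusion are kept.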
Consuming resources
In addition to the Alibaba Cloud console, we created a Power App dedicated to consuming managed resources.
The Power App is the "Ticket Agency" for 'managed' resources: users log in to the Power App and request resources, and once the automation flow (including manager approval, security approval, etc.) is complete, all the necessary information is sent to them by email.
'Restricted' zone resources are free to use once approved.
Resources in the 'Un-managed' zone are self-service; users are free to experiment.
Cost Management Strategies
- Resource Approach:
  - Managed resources will be shared among users, promoting efficiency and reducing costs
  - A cost dashboard will also track overall cost data
- Daily Resource Management: Non-essential resources will be reset and cleaned daily, helping to maintain a lean operational model.
Monitoring and Backup Considerations
- No Backup Services: Currently, no backup services are provided for lab resources, which means users must be cautious about data management.
- Monitoring Rules: There will be no specific monitoring rules configured for lab resources, necessitating manual oversight by users and administrators.
Conclusion
The above design, setup, and procedures have resolved—or at least
mitigated—the pain points identified for both users and
administrators. We will apply the PDCA (Plan, Do, Check, Act)
approach to continuously enhance the usability of our lab
environments and maintain a balance between flexibility, security, and
cost. We hope our experience can serve as a useful reference for
other cloud management teams.