Modernizing Access Management on Alibaba Cloud

Evolving from Fragmented Access Keys to a Centralized RAM Role Architecture

The launch of our Alibaba Cloud Virtual Data Center (vDC) in 2024—built on Alibaba Landing Zone best practices and aligned with our corporate security standards—marked a major milestone in our cloud transformation. While Microsoft Azure remains a core cloud platform, Alibaba Cloud was introduced to address regional, regulatory, and performance requirements and to reduce dependency on a single provider. This multi-cloud approach allows us to extend consistent governance and security principles across platforms.

However, as the migration from Azure accelerates, we have begun to face increasing complexity in access management. Decentralized Access Key (AK) usage and fragmented RAM permissions distributed across multiple resource accounts have introduced operational overhead and created security risks. This design was also shaped by earlier limitations in Alibaba Cloud’s resource management capabilities, which required us to rely on direct Access Keys to grant and manage access.

Despite the strong foundation of our vDC architecture, managing access through Alibaba Cloud’s Resource Access Management (RAM) service remains challenging. Access controls are still scattered across accounts, complicating both individual user access and programmatic application authentication—issues that have become more pronounced as the Azure migration progresses.

Fortunately, the ongoing maturation of Alibaba Cloud RAM Roles, now supporting a wide range of native cloud services, has provided an ideal opportunity to fundamentally redesign our access model. This blog post outlines our end-to-end strategy to transition from a fragmented, AK-centric approach to a secure, scalable, and role-based centralized governance framework.

The Challenge: Diagnosing the Core Problems

Our initial access management approach was shaped by early technical constraints. As we scaled, this model revealed four critical pain points that threatened our security and agility.

  1. Early RAM Role Limitations and Resulting AK Sprawl
    When our vDC was first implemented, RAM Role support for programmatic access was limited. This forced a heavy reliance on creating permanent Access Keys (AKs) individually within each resource account. The result was AK sprawl—dozens of hard-to-track credentials scattered across our cloud environment, each with its own permission set.
  2. Scaling Bottlenecks with Manual Configuration Processes
    Initially, a central team of cloud security experts could manually handle AK creation and permission assignments. However, with the accelerating volume of applications migrating from Azure, this manual process became a severe bottleneck. What was once manageable for a handful of applications became unsustainable for dozens, slowing cloud onboarding and extending delivery cycles.
  3. The Imperative for a Secure Operational Handover
    Driven by a local team reorganization, responsibility for day-to-day security operations had to be formally transferred to the Cloud Management Services (CMS) team. The existing model gave the CMS team neither the necessary permissions nor a streamlined process to take on this responsibility effectively.
  4. Granular Policy Challenges for Platform Engineering
    The Platform Engineering (PE) team requires deep, service-level access to deploy and maintain middleware services. Drafting, testing, and validating the necessary fine-grained policies—often comprising hundreds of individual actions—was immensely time-consuming and directly impacted service rollout velocity.

Visualizing the Problem: Our Current Fragmented Architecture

The diagram below illustrates why our current architecture creates so much operational overhead and security risk:

Key Problems Illustrated:

  • Fragmented Keys: AKs created separately in each resource account
  • Redundant Groups: Each AK assigned to a dedicated RAM User Group (often with just one member)
  • Static Permissions: Policies attached directly to groups, creating permanent access rights
  • Management Overhead: Proliferation of AKs and groups across accounts makes governance, auditing, and rotation difficult

Implementing the Design with Precision

Structured RAM Role Design
We standardized on three primary role types to cater to different access patterns:

  • Application Access Role (role-$app-prd): Designed for general applications needing access to their own resources. In the initial phase, this starts with a standardized OSS access policy, with further policies added as needed.
  • Platform Engineering Access Role (role-da9-prd): Designed for PE team tools and middleware services. Access permissions are highly customized and mutually agreed upon by the PE and cloud teams through collaborative workshops.
  • Cloud Operations Access Role (role-8ca-sys): Primarily used by the CMS team, granting broad read-only access across most cloud services to support general querying and monitoring tasks.

Granular Access Policy Classification
Policies, which define the actual permissions, are attached to the above roles:

  • Application Standard Policy: A unified, baseline access policy. Any deviation requires a custom policy template, which must be formally reviewed and approved by both the Security Team and the Cloud Architect.
  • Middle-layer Service Policy: Complex, fine-grained policies for services like Kafka and Elasticsearch. Their design is handled collaboratively in workshop-style meetings to ensure they are both functional and secure.
  • CMS Dedicated Operational Policy: A set of universal read-only policies to empower the CMS team for operational tasks without granting write permissions.
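
A read-only guarantee like the one in the CMS policy can be checked mechanically before a template goes to review. The following is a minimal sketch, not our actual tooling: the function name, the sample policy, and the list of "read-style" action prefixes are assumptions for illustration, and the prefix list would need to be checked against each service's real action names.

```python
# Hypothetical lint: list every allowed action in a RAM policy document
# that does not look read-only. The prefix list below is an assumption.
READ_ONLY_PREFIXES = ("Describe", "Get", "List", "Query")


def non_read_only_actions(policy):
    """Return the allowed actions whose operation name is not read-style."""
    offenders = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue  # Deny statements cannot grant write access
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        for action in actions:
            # Actions look like "ecs:DescribeInstances" or "oss:Get*".
            _, _, operation = action.partition(":")
            if not operation.startswith(READ_ONLY_PREFIXES):
                offenders.append(action)
    return offenders


# Illustrative policy only; not the policy we deploy.
cms_read_only_policy = {
    "Version": "1",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ecs:Describe*", "oss:List*", "oss:Get*"],
            "Resource": "*",
        }
    ],
}

print(non_read_only_actions(cms_read_only_policy))
```

A wildcard such as "ecs:*" is reported as an offender because its operation part does not start with a read-style prefix, which is the conservative outcome for a policy that must not grant write permissions.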

Key Roles & Responsibilities in the New Model
A successful governance model requires clear ownership. The table below outlines the key responsibilities in our new framework:

Role | Responsibility in New Model
CMS Team | Creates and manages all AKs in the centralized cloudops account; assumes the Cloud Operations Role for operational support.
Application Team | Uses assigned AKs to assume their designated Application Role; submits requests for additional permissions via formal tickets.
PE Team | Develops tools to support CMS; designs and maintains policy templates for middleware; manages their Platform Engineering Roles.
Security Team | Reviews and approves all policy templates to ensure alignment with the principle of least privilege and corporate security standards.
Cloud Architect | Collaborates with relevant teams to develop and standardize access right templates during product industrialization.

Integrating Foundational Security Principles
Our design is intrinsically built upon our internal security policy (MGSR) and Alibaba Cloud best practices:

  • Principle of Least Privilege: Every policy is scoped to grant only the minimum permissions required for a legitimate task.
  • Identity Separation: A strict distinction is maintained between human users (using SSO) and system identities (using RAM Roles).
  • Temporary Credentials: We mandate STS tokens over permanent AKs for all programmatic access, ensuring credentials are short-lived.
  • Regular Rotation & Lifecycle Management: Automated processes for credential rotation and full identity lifecycle management are enforced.
  • Secure Credential Storage with KMS: AKs are securely stored in Alibaba Cloud Key Management Service (KMS). Applications can retrieve credentials for use but cannot view the plaintext values, and automated, transparent key rotation is implemented.

The Blueprint: Architecting a Secure and Scalable Future

To overcome these challenges, we designed a target architecture founded on two core principles: centralized identity management and role-based access control.

Core Design Philosophy: Centralized Identity, Distributed Permissions
Our model introduces a clean separation of concerns:

  • Centralized Identity & Credential Management: All system-level Access Keys (AKs) are created, stored, and managed within a single, dedicated cloudops resource account. These AKs have no embedded permissions.
  • Distributed Permission Governance: RAM Roles and their associated policies are defined within the individual resource accounts where the actual cloud resources reside.

Key Improvement: From Static AKs to Dynamic Role Assumption
We eliminated the risky paradigm of using permanent AKs with embedded permissions. The new, secure access flow uses temporary Security Token Service (STS) credentials obtained through role assumption, significantly reducing the risk of credential leakage.

Our Target Architecture

The target architecture introduces a centralized, role-based model that is secure, efficient, and aligned with best practices:

Authentication and Authorization Flow (AK → AssumeRole → Resource Access)

1. Application Authentication Using Access Keys (Control Plane Only)

Each application (Application X, Y, Z) runs in the CloudOps Master Account and is provisioned with a dedicated Access Key (AK):

  • AK_X → Application X
  • AK_Y → Application Y
  • AK_Z → Application Z

These Access Keys are:

  • Stored securely in Alibaba Cloud KMS
  • Centrally managed and rotated
  • Not granted direct permissions to access cloud resources

At this stage, the Access Key is used only to authenticate the application’s identity to Alibaba Cloud RAM.

2. AssumeRole via RAM STS (Access Control Layer)

When an application needs to access cloud resources, it does not use its Access Key directly.

Instead, the application:

  1. Calls the STS AssumeRole API using its Access Key
  2. Specifies the target RAM role in a resource account
  3. Is evaluated against the trust policy of that RAM role

This step forms the Access Control Layer, which:

  • Separates identity authentication (AK) from authorization (role policies)
  • Enforces least-privilege and cross-account boundaries

If the trust policy allows the request, RAM STS issues:

  • Temporary Access Key ID
  • Temporary Access Key Secret
  • Security Token
  • Limited validity period (short-lived credentials)
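
The request and response in this step can be sketched without calling the service. The example below is a minimal illustration, assuming the STS AssumeRole parameters RoleArn, RoleSessionName, and DurationSeconds and the Credentials fields of the response. The account ID, session name, helper names, and sample response values are hypothetical placeholders, and a real call would go through an Alibaba Cloud SDK rather than this code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


def role_arn(account_id, role_name):
    """Build the ARN of a RAM role that lives in a resource account."""
    return f"acs:ram::{account_id}:role/{role_name}"


def assume_role_params(account_id, role_name, session_name, duration_seconds=3600):
    """Assemble the parameters of an AssumeRole request."""
    if duration_seconds < 900:
        # Assumption: 900 seconds is the shortest session STS accepts.
        raise ValueError("duration_seconds must be at least 900")
    return {
        "Action": "AssumeRole",
        "RoleArn": role_arn(account_id, role_name),
        "RoleSessionName": session_name,  # appears in audit logs
        "DurationSeconds": duration_seconds,
    }


@dataclass(frozen=True)
class TemporaryCredentials:
    access_key_id: str
    access_key_secret: str
    security_token: str
    expiration: datetime


def parse_credentials(response):
    """Extract the temporary credentials from an AssumeRole response body."""
    creds = response["Credentials"]
    expiration = datetime.strptime(
        creds["Expiration"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return TemporaryCredentials(
        access_key_id=creds["AccessKeyId"],
        access_key_secret=creds["AccessKeySecret"],
        security_token=creds["SecurityToken"],
        expiration=expiration,
    )


# Placeholder values for illustration only.
params = assume_role_params("1234567890123456", "role-appX-prd", "appX-session")
sample_response = {
    "Credentials": {
        "AccessKeyId": "STS.EXAMPLE",
        "AccessKeySecret": "example-secret",
        "SecurityToken": "example-token",
        "Expiration": "2025-01-01T12:00:00Z",
    }
}
creds = parse_credentials(sample_response)
print(params["RoleArn"], creds.expiration.isoformat())
```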

3. Cross-Account Role Assumption

Each resource account exposes application-specific RAM roles:

Resource Account | RAM Role | Purpose
Account A | role-appX-prd | App X production access
Account B | role-appY-prd | App Y production access
Account C | role-appZ-prd | App Z production access

These roles:

  • Trust only the CloudOps Master Account
  • Are tightly scoped to application identity
  • Contain resource-level permission policies
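
A trust policy that restricts role assumption to the CloudOps Master Account could look like the sketch below. It assumes the standard RAM trust policy layout (Version "1", an sts:AssumeRole statement, and a RAM principal); the account ID, the RAM user name, and the helper name are hypothetical. Trusting the account root lets any identity in that account that has been granted sts:AssumeRole assume the role, so naming a single RAM user is one way to tie the role to a specific application identity.

```python
import json


def trust_policy(trusted_account_id, ram_user=None):
    """Build a role trust policy for one account, or one RAM user in it."""
    if ram_user is None:
        principal = f"acs:ram::{trusted_account_id}:root"
    else:
        principal = f"acs:ram::{trusted_account_id}:user/{ram_user}"
    return {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Principal": {"RAM": [principal]},
            }
        ],
    }


# Hypothetical CloudOps account ID and per-application RAM user.
policy = trust_policy("1234567890123456", ram_user="svc-appX")
print(json.dumps(policy, indent=2))
```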

4. Authorization via Role Policies (Data Plane)

Once the role is assumed, the application uses the temporary credentials to access resources within the target resource account, strictly limited by the role’s policy.

Examples from the diagram:

  • Account A – role-appX-prd
    • RDS: DB_Production
    • ECS: Cluster_App-Servers
    • OSS: app-data-bucket
  • Account B – role-appY-prd
    • OSS: logs-backup
    • SLS: analytics-logs
  • Account C – role-appZ-prd
    • VPC: production-vpc
    • NAT Gateway: outbound-gw

Key Benefits Illustrated:

  • Centralized Key Management: All AKs created in single cloudops account
  • Role-Based Access: Permissions granted to RAM Roles in resource accounts, not directly to AKs
  • Temporary Credentials: Applications use AK to assume RAM Role, obtaining temporary STS tokens
  • Separation of Duties: cloudops account manages identities, resource accounts manage permissions
  • Secure KMS Integration: AKs stored securely in KMS with applications having retrieval permissions only

Addressing Performance and Capacity Concerns

While our new RAM role–based architecture brings significant improvements in security and centralized governance, it also introduced practical concerns from our development team—specifically around peak QPS capacity and the longer authentication workflow required to obtain and maintain temporary STS tokens. These concerns were valid, and resolving them was critical to ensuring system stability and performance at scale.

To handle capacity challenges, especially in high-concurrency scenarios where multiple applications simultaneously assume RAM roles, we adopted several strategies:

  • Intelligent STS Token Caching and Reuse – Minimizing repetitive AssumeRole calls and reducing pressure on RAM.
  • Strategic AssumeRole API Optimization
    • Default Limits: Each RAM role supports a default limit of 100 QPS.
    • Quota Management: For high-traffic systems, we proactively request QPS limit increases through Alibaba Cloud support.
    • Bulk Token Acquisition: Services with multiple parallel processes obtain tokens in batches instead of issuing individual AssumeRole calls, significantly reducing total invocation count.
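
A rough back-of-the-envelope estimate shows why caching and shared tokens matter against a 100 QPS limit. The sketch below computes the steady-state average AssumeRole rate when each token holder refreshes once per effective lifetime. The process count and refresh margin are hypothetical figures, and the average ignores bursts such as many processes starting at the same moment.

```python
def assume_role_calls_per_second(processes, token_lifetime_s=3600,
                                 refresh_margin_s=300, shared=False):
    """Average AssumeRole rate when tokens are cached until near expiry.

    shared=True models one token obtained once and handed to all processes.
    """
    holders = 1 if shared else processes
    effective_lifetime = token_lifetime_s - refresh_margin_s
    return holders / effective_lifetime


# Hypothetical workload: 200 parallel processes, one-hour tokens,
# refreshed five minutes before expiry.
per_process = assume_role_calls_per_second(200)
shared = assume_role_calls_per_second(200, shared=True)
print(f"per-process caching: {per_process:.4f} calls/s")
print(f"shared token:        {shared:.6f} calls/s")
```

Without any caching, the AssumeRole rate would equal the application's own request rate, which is the case the quota is most likely to constrain.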

On the performance side, we further analyzed the token lifecycle to ensure that maintaining temporary credentials would not degrade throughput during peak load. Alibaba Cloud’s AssumeRole mechanism, built on STS, issues short-lived temporary credentials (AccessKeyId, AccessKeySecret, SecurityToken) that must be refreshed periodically—typically every hour.

To support large-scale workloads, Alibaba Cloud recommends a token lifecycle management pattern that we have adopted:

  • Applications reuse the same temporary token until it nears expiration.
  • Token refresh occurs proactively in the background, without blocking requests.
  • Alibaba Cloud SDKs natively manage expiration, automatically refreshing tokens on demand and seamlessly switching to the new credentials.

This combination of short-lived tokens, proactive refresh, and client-side caching ensures that AssumeRole authentication remains both secure and capable of supporting enterprise-level TPS requirements.
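
The reuse-until-near-expiry pattern can be sketched in a few lines. This is a simplified illustration, not the SDK's implementation: the class name, the refresh margin, and the injected fetch function are assumptions, and the refresh here runs on the calling thread once the margin is reached rather than in a separate background thread.

```python
import threading
import time


class CachedCredentials:
    """Reuse temporary credentials and refresh them shortly before expiry."""

    def __init__(self, fetch, refresh_margin_s=300.0, clock=time.time):
        # fetch() must return (credentials, expires_at_epoch_seconds);
        # in production it would wrap an AssumeRole call.
        self._fetch = fetch
        self._margin = refresh_margin_s
        self._clock = clock
        self._lock = threading.Lock()
        self._current = None  # (credentials, expires_at) stored as one tuple

    def _is_fresh(self, current):
        return current is not None and self._clock() < current[1] - self._margin

    def get(self):
        current = self._current
        if self._is_fresh(current):
            return current[0]  # fast path: no lock, no remote call
        with self._lock:
            # Re-check under the lock so only one thread refreshes.
            if not self._is_fresh(self._current):
                self._current = self._fetch()
            return self._current[0]


# Demo with a fake fetcher and a controllable clock (no network access).
fetch_count = 0
fake_now = [1000.0]


def fake_fetch():
    global fetch_count
    fetch_count += 1
    return {"SecurityToken": f"token-{fetch_count}"}, fake_now[0] + 3600


cache = CachedCredentials(fake_fetch, clock=lambda: fake_now[0])
first = cache.get()
second = cache.get()   # served from cache
fake_now[0] += 3301    # now inside the 300-second refresh margin
third = cache.get()    # triggers one refresh
print(first, second, third, fetch_count)
```

Injecting the fetch function and the clock keeps the cache independent of any particular SDK and makes the expiry logic easy to test.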

We collaborated closely with our development team to validate these mechanisms under realistic peak workloads. The results met our expectations: token caching and refresh logic effectively eliminated performance degradation during high-traffic periods, ensuring stable end-to-end throughput.

This architecture has now been deployed in one of our major internet applications, serving approximately 6 million users. The system has successfully gone live and is operating smoothly under sustained high concurrency, demonstrating that our redesigned authentication model can reliably meet both our capacity and performance demands in production.

The Evolution: Quantifying the Benefits

Before vs. After Implementation

Aspect | Current State (Problem) | Target State (Solution)
AK Management | Decentralized across all resource accounts | Centralized in cloudops account
Credentials | Permanent AKs with embedded permissions | Temporary STS tokens via role assumption
Permission Model | Static policies attached to groups | Dynamic roles with customizable policies
Security Risk | High (permanent credentials) | Low (temporary tokens)
Operational Overhead | High (manual per-account management) | Low (automated centralized management)
Audit Complexity | Complex (scattered across accounts) | Simple (single control point)

Summary: Evolving Our Access Management Framework

Our redesigned framework builds upon a strong security foundation, now enhanced with Alibaba Cloud's RAM roles, STS tokens, and KMS integration. This evolution improves scalability, operational efficiency, and compliance while maintaining rigorous governance.

Key Enhancements:

  • Security: Resource access relies exclusively on temporary credentials instead of permanent AKs with embedded permissions, while KMS provides secure storage and automated rotation for the remaining identity-only AKs.
  • Efficiency: Centralized management reduces manual configuration by ~60%, while standardized templates ensure consistency.
  • Scalability: Validated in large-scale applications (~6M users), supporting high concurrency and future cloud expansion.
  • Governance: Maintains separation of duties, provides centralized audit trails, and aligns with cloud best practices.

This evolution demonstrates how mature security design, combined with cloud-native capabilities, can meet enterprise-scale demands without compromising operational excellence.

Acronym | Meaning | Details
vDC | Virtual Data Center | Michelin's term for the landing zone concept in the cloud
CMS | Cloud Management Services | The team responsible for cloud resource and service management within Michelin
PE | Platform Engineering | The team that manages the middle-layer services and also provides automation and partial DevOps support