Enhancing Secret Management for Applications on Ali CaaS

Context

In applications, sensitive data like database passwords, API tokens, and other secrets need to be securely managed. Historically, secrets were embedded in configuration files as plain text, exposing them to anyone with repository access, which poses a significant security risk.

Last year, we introduced SOPS, a tool for encrypting configuration files with AES256 keys, significantly reducing this exposure. However, secrets are still occasionally transferred in insecure ways, posing risks when shared between stakeholders.

Since July, when we deployed our landing zone on AliCloud, we have been designing a Container-as-a-Service (CaaS) solution there. Part of this effort involves improving secret management to enhance application security in this environment.

Problem Analysis

We consulted DevOps teams and developers to understand how they use SOPS in practice. Based on these discussions, we abstracted roles, identified key activities, and mapped the workflows, focusing on how secrets are handled from delivery to the Secret Custodian through deployment on CaaS.

We identified three key risks:

  1. Secret sharing: Potential leaks during the transfer of secrets or SOPS keys between the Secret Custodian and developers. [High]
  2. Insecure storage: Storing private SOPS keys in GitLab projects poses a risk. [Medium]
  3. Plain text injection: Secrets injected into config files in plain text at runtime. [Low]

While these risks vary in severity, our goal is to address the high and medium risks at a minimum.

Options and Comparison

We evaluated multiple approaches based on HashiCorp Vault, an external secret store already used across the group in various environments. The goal was to determine which approach would best mitigate the identified risks while balancing operational complexity and scalability.

The three approaches evaluated were:
• External Secrets Operator
• Vault Agent Injector
• Spring Cloud Vault

Each of these approaches offers distinct benefits and drawbacks. Below is a breakdown of how each works.

External Secrets Operator

This method uses the External Secrets Operator to fetch secrets from Vault, which are then stored in Kubernetes Secrets. These secrets are injected as environment variables into the application pods.
For example, in the test application, secrets are mapped as environment variables and accessed at runtime.
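From the application's point of view, this approach is just an environment lookup. The following is a minimal sketch, assuming a hypothetical variable name `DB_PASSWORD` that the deployment maps from a Kubernetes Secret; the class name `EnvSecrets` is also illustrative and not taken from the test application.

```java
import java.util.Map;
import java.util.Optional;

/** Reads secrets that Kubernetes injected into the container as environment variables. */
final class EnvSecrets {
    private final Map<String, String> env;

    EnvSecrets(Map<String, String> env) {
        this.env = env;
    }

    /** Returns the secret, or fails fast with a message that never contains a secret value. */
    String require(String name) {
        return Optional.ofNullable(env.get(name))
                .filter(v -> !v.isBlank())
                .orElseThrow(() -> new IllegalStateException("Missing secret env var: " + name));
    }

    public static void main(String[] args) {
        // DB_PASSWORD is a hypothetical name; use whatever the deployment actually maps.
        EnvSecrets secrets = new EnvSecrets(System.getenv());
        try {
            String password = secrets.require("DB_PASSWORD");
            System.out.println("DB_PASSWORD loaded (" + password.length() + " chars)");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Passing the environment in as a map keeps the lookup testable and makes it easy to fail at startup when a mapping is missing, rather than at first use.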

While this addresses the high and medium risks (secret sharing and private key storage), the plain-text injection risk remains. Kubernetes Secrets are still widely accessible, for example through the ACK (Alibaba Cloud Container Service for Kubernetes) portal, and they are exposed at the namespace level rather than restricted to a single pod.

Vault Agent Injector

In this approach, the Vault Agent Injector adds a Vault Agent sidecar container to the application pod. The agent reads secrets from Vault and writes them directly into files inside the pod, which the application then reads.
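On the application side, consuming such a file can be as simple as the sketch below. It assumes the agent template renders `key=value` lines and that the file lands at `/vault/secrets/db.properties`; `/vault/secrets` is the injector's default mount path, while the file name, the format, and the class name `AgentFileSecrets` are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Loads secrets from a file that the Vault Agent sidecar rendered into the pod. */
final class AgentFileSecrets {
    /** Parses key=value pairs from the rendered file. */
    static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Assumed location; pass a different path as the first argument if needed.
        Path file = Path.of(args.length > 0 ? args[0] : "/vault/secrets/db.properties");
        if (!Files.isReadable(file)) {
            System.out.println("No rendered secret file at " + file);
            return;
        }
        // Print key names only, never the values.
        System.out.println("Loaded keys: " + load(file).stringPropertyNames());
    }
}
```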

Code snippet from our sample application:

The Vault Agent Injector workflow is shown below:

Although this resolves the high and medium risks, the decrypted secrets in local configuration files inside the pod still pose a low risk.

Spring Cloud Vault

For applications built on the Spring Cloud framework, Spring Cloud Vault fetches secrets directly from Vault. It provides the most seamless integration: secrets are held only by the application process and are never stored in the Kubernetes environment. This approach therefore mitigates the risk of plain-text secret exposure at runtime.
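Spring Cloud Vault wires this into Spring's configuration loading, so application code normally never calls Vault itself. To make the underlying idea concrete, the plain-Java sketch below builds the kind of HTTP read a client issues against Vault's KV version 2 API. It is not Spring Cloud Vault's API. The mount `secret`, the path `demo-app/db`, and the `VAULT_ADDR` and `VAULT_TOKEN` environment variables are assumptions for illustration; a real pod would typically obtain its token through a Vault auth method instead.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Illustrates a direct runtime read from Vault, with nothing stored in Kubernetes. */
final class VaultKvRead {
    /** Builds a KV version 2 read request: GET {vaultAddr}/v1/{mount}/data/{path}. */
    static HttpRequest readRequest(String vaultAddr, String mount, String path, String token) {
        URI uri = URI.create(vaultAddr + "/v1/" + mount + "/data/" + path);
        return HttpRequest.newBuilder(uri)
                .header("X-Vault-Token", token)
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        String addr = System.getenv("VAULT_ADDR");
        String token = System.getenv("VAULT_TOKEN");
        if (addr == null || token == null) {
            System.out.println("Set VAULT_ADDR and VAULT_TOKEN to run this sketch");
            return;
        }
        // "secret" and "demo-app/db" are hypothetical mount and path names.
        HttpRequest request = readRequest(addr, "secret", "demo-app/db", token);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Print only the status; the JSON body carries the secret under data.data.
        System.out.println("Vault responded with HTTP " + response.statusCode());
    }
}
```

The secret travels from Vault straight into process memory, which is why this family of approaches leaves nothing to read in Kubernetes Secrets or on the pod's file system.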

To use Spring Cloud Vault, first make sure the necessary Spring dependencies for HashiCorp Vault are declared in your pom.xml:
[Image: 6.png]

Then configure your application.yml to tell Spring how to connect to Vault and retrieve the secret:
[Image: 7.png]

The related workflow is shown below:

Security Level Comparison

The comparison of security levels across the three approaches is summarized below:

| Aspect | External Secrets Operator | Vault Agent Injector | Spring Cloud Vault |
|---|---|---|---|
| Secret Exposure Risk | Higher risk if Kubernetes Secrets are compromised; relies on K8s security. | Lower risk; secrets are not stored persistently in the cluster. | Secrets are fetched at runtime and not stored in K8s Secrets. |
| Compliance and Auditing | Secrets stored in etcd; requires etcd encryption for compliance. | Enhanced compliance; secrets are transient and better controlled. | High compliance; direct Vault integration allows for detailed auditing. |
| Access Control | Managed via Kubernetes RBAC and Vault policies. | Managed via Vault policies and Kubernetes service accounts. | Managed via Vault policies and application-level authentication. |

Operational Analysis

While the above analysis highlights the security mechanisms and risk levels of each approach, additional operational factors need to be considered. We extended our evaluation by analyzing the following aspects.

Operational Complexity

| Aspect | External Secrets Operator | Vault Agent Injector | Spring Cloud Vault |
|---|---|---|---|
| Ease of Implementation | Easier to set up; uses standard K8s resources and patterns. | More complex; requires configuring sidecars and annotations. | Moderate complexity; requires application code changes and dependency management. |
| Application Changes Required | Minimal; applications use existing methods to access secrets. | May require application changes to read from new paths or environment variables. | Requires changes to application code to integrate the Spring Cloud Vault libraries. |

Performance and Scalability

| Aspect | External Secrets Operator | Vault Agent Injector | Spring Cloud Vault |
|---|---|---|---|
| Startup Time | Faster; secrets are available at pod start from K8s Secrets. | Slight delay; sidecar fetches secrets at startup. | May introduce delay; application fetches secrets during startup. |
| Resource Consumption | Lower; no additional containers per pod. | Higher; adds a sidecar to each pod, consuming more resources. | Minimal additional resource usage within the application process. |
| Scalability | Scales well; leverages K8s control plane. | Scales but increases cluster resource usage. | Scales with application instances; Vault may need scaling for high load. |

Secret Updates and Rotation

| Aspect | External Secrets Operator | Vault Agent Injector | Spring Cloud Vault |
|---|---|---|---|
| Secret Update Mechanism | ESO syncs updates; may require pod restarts or application reloads. | Can auto-refresh secrets without restarting the application. | Can auto-refresh secrets; supports dynamic reconfiguration. |
| Dynamic Secrets Handling | Less suited for dynamic secrets; better for static secrets. | Ideal for dynamic secrets; supports automatic rotation and renewal. | Excellent for dynamic secrets; direct integration with Vault's APIs. |
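When a secret is delivered as a file, an "application reload" can be as small as re-reading that file when it changes. The sketch below is one possible way to do this, not a description of how any of the three tools refresh secrets internally; the class name `ReloadingSecret` is hypothetical. Secrets delivered as environment variables cannot be refreshed this way, because a process's environment is fixed at start, which is why that path tends to need a pod restart.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

/** Caches a file-based secret and re-reads it only when the file has been modified. */
final class ReloadingSecret {
    private final Path file;
    private FileTime loadedAt; // modification time of the version currently cached
    private String value;

    ReloadingSecret(Path file) {
        this.file = file;
    }

    /** Returns the current secret, picking up a rotated value without a restart. */
    synchronized String get() {
        try {
            FileTime modified = Files.getLastModifiedTime(file);
            if (!modified.equals(loadedAt)) {
                value = Files.readString(file).strip();
                loadedAt = modified;
            }
            return value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Callers ask for the value each time they need it, for example when opening a new database connection, instead of copying it into a long-lived field.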

Selection Guide

After evaluating all three approaches, it’s clear there is no one-size-fits-all solution. Each method mitigates the high and medium risks, fulfilling our baseline security requirements. To assist application teams in selecting the right option, we propose the following decision tree.

Rollout

We have mandated the use of HashiCorp Vault for all applications migrating to or newly built on AliCloud. As part of this transition, we will follow standard change management processes, including communications, demos, and training sessions, to ensure all stakeholders are informed and compliant.
In addition to these measures, we intend to implement technical solutions to identify and prevent non-compliance. The options we've discussed are:

  1. Code Repository Scanning
    We will continue to scan code and configuration files to prevent hardcoded secrets. This step is crucial because if we discontinue using SOPS and developers find Vault too complex, they might resort to insecure practices like hardcoding secrets.
  2. Pre-Deployment Enforcement
    We plan to collaborate with application development teams to customize Open Policy Agent (OPA) policies for their namespaces. These policies will enforce that only applications utilizing Vault (e.g., through annotations or sidecars) can be deployed.
  3. Vault Usage Monitoring
    To prevent developers from retrieving secret values from Vault and then bypassing it, we will monitor Vault usage on application-specific paths by enabling audit logs and setting up telemetry metrics. This approach not only helps in identifying violations but also provides usage data to better manage Vault's performance.
  4. Enable Dynamic Secret Capability
    By enabling the relevant secrets engines (for example, for databases and tokens) and enforcing secret rotation, we can eliminate the need for manual secret handling. This makes hardcoding secrets impractical and enhances overall security.
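To give a feel for the first option, code repository scanning, here is a deliberately small sketch. It flags lines where a key that looks like a secret is assigned a literal value and skips `${...}` placeholders. The pattern and the class name `HardcodedSecretScan` are illustrative assumptions; a production scanner would need many more rules and an allow-list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Flags configuration lines that appear to assign a literal value to a secret-like key. */
final class HardcodedSecretScan {
    // Illustrative rule only: secret-like key, then ':' or '=', then at least four
    // characters that are not whitespace, quotes, or part of a ${...} placeholder.
    private static final Pattern SUSPECT = Pattern.compile(
            "(?i)(password|passwd|secret|token|api[_-]?key)\\s*[:=]\\s*[\"']?([^\\s\"'$\\{\\}]{4,})");

    /** Returns the 1-based numbers of the lines that look like hardcoded secrets. */
    static List<Integer> suspectLines(List<String> lines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (SUSPECT.matcher(lines.get(i)).find()) {
                hits.add(i + 1);
            }
        }
        return hits;
    }
}
```

A check like this could run in a merge-request pipeline and report line numbers only, so that the scan output does not itself leak the values it finds.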

Each of these approaches requires investment in development and maintenance. We are currently in discussions with DevOps teams and technical leaders to determine which options are most efficient for our context. For now, as the number of applications is still limited, we will perform manual spot checks.

Open Topics

Although we've made significant progress, there are still open issues that need addressing, including:
• High Availability of Vault: As Vault is now critical for application deployment and operation, ensuring its high availability is vital.
• Storage Considerations: We must secure the underlying storage supporting Vault, especially given AliCloud NAS’s current limitations regarding encryption key rotation.

Moving forward, we will continue to enhance security and streamline secret management for applications on Ali CaaS.