Why do we study multi-cloud now?
“Multi-cloud? We have had it for years!”
That might be your first reaction to the title. But would you change your mind if I told you that multi-cloud here does NOT simply mean using cloud services from multiple providers, but is defined as follows?
"Multi-cloud is a cloud adoption model that deliberately uses public IaaS or PaaS from multiple cloud providers for the SAME class of IT solutions or workloads, for example using VMs and databases from both AWS and Azure."
If your company is like Michelin, which started its cloud journey early, without a heavy legacy from past systems and acquisitions, and which has strict security and operations governance, multi-cloud is NOT a natural choice. You would be wary of the complexity, cost and new skill sets that multiple clouds introduce, and would decide to focus on one cloud, or at least to start with one.
At Michelin China, we deployed our first cloud application on Azure in late 2014. Over the last 8 years, our cloud virtual Data Center (vDC) architecture has evolved several times, and we have moved many resources across regions in response to internal or external changes. However, we have always stayed with Azure China.
But things changed recently
First, the Russia-Ukraine war broke out. People saw most Western companies stop their services in Russia or even withdraw from the country. This raised a reasonable concern: could the same happen to Azure China services, which come from a US company?
Second, after more than 2 years of pandemic control, and especially during the Covid lockdowns in Shanghai and Shenyang, working from home became the new normal. Cloud services, even collaboration tools like Office 365, became mission critical, yet partial malfunctions and slow responses occurred from time to time. We started to ask ourselves: could we survive a large, long Azure outage?
Third, we saw signs of a capacity shortage in Azure East2, our main region. We may have to move our workloads to North3 next year or the year after, which will impact most of our applications on the cloud. Business stakeholders are asking how we can avoid or minimize such efforts, which bring no direct business value.
All this reminds us that the public cloud is not an unlimited, unbreakable ‘Holy land’. It is time to evaluate these risks seriously and define our own plan for the long term. Since July, we have been working on a formal study of multi-cloud for Michelin China.
What did we learn?
We started from two high-impact risks:
- Risk 1: Azure services in China are stopped for political reasons.
- Risk 2: Azure services in China suffer a long regional outage.
We collected facts and comments from different channels and compared them with the Service Level Agreements of our key cloud applications. The sources include:
- Gartner research
- News and reports from public channels
- Surveys of and interviews with other MNCs (Multinational Companies) in China
Our conclusions are:
- Both risks exist and would have a high impact, but their likelihood is low and there should be some transition time if they materialize. There is no need to build a redundant (mutually backed-up) multi-cloud environment right now.
- It is necessary to identify and try another cloud, both to avoid vendor lock-in and to gain access to innovative services.
- We should proactively look for low-cost-of-failure opportunities to build internal knowledge of another cloud, ideally within 1 year.
Several interesting and important findings from the study guide us in creating our multi-cloud roadmap.
Gartner advises: multi-cloud is a decision that requires strong justification
Customers should carefully identify the costs and benefits for use cases along the following hierarchy: resiliency, lock-in, portability, opportunistic use-case placement and access to innovation. Assess whether business goals and requirements can be addressed with a single cloud provider before assuming multi-cloud is required.
Multi-cloud requires careful planning, new skill sets, operational maturity, and feature/service mapping and development.
Because of complex integration and operations challenges, even in organizations that currently use multiple public cloud providers, one provider tends to hold the majority of the workloads, with the rest placed in a secondary provider or split among several complementary providers.
Other MNCs in China: adopt a composite model to cherry-pick the best cloud services
We surveyed 31 MNCs in Shanghai. They come from various industries, as below.
Fifteen of them, about half, already use multi-cloud to some degree. The services covered are as below. A pleasant surprise is that they include security and big data platform services, which we had thought were too cloud-bound to run across multiple clouds.
They described their architectures as following a Composite model rather than a Redundant model. That is broadly in line with their top reasons for adopting multi-cloud:
• vendor lock-in
• diversified offers
About the future, these companies have diverse plans, and the plans do not appear to depend on whether they use multi-cloud today. 13 companies plan to deploy or expand multi-cloud in the next 18 months: 5 of them already have multi-cloud and 8 do not. 12 companies answered that they have no plan to work on multi-cloud in the foreseeable future: 7 of them already have multi-cloud and 5 do not.
We think this proves multi-cloud is not a general trend due to external pressures but driven by specific use cases.
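For readers who want to check the arithmetic behind this reading, here is a minimal Python sketch that tabulates the four counts reported above. The variable and function names are ours, for illustration only. Note that the four counts sum to 25 of the 31 surveyed companies; the figures above do not break down the remaining 6, so the sketch leaves them out.

```python
# Counts taken from the survey summary above.
# Key: (has multi-cloud today, plans to deploy/expand within 18 months)
plans = {
    (True, True): 5,
    (False, True): 8,
    (True, False): 7,
    (False, False): 5,
}


def share_planning(has_multicloud: bool) -> float:
    """Fraction of one group that plans to deploy or expand multi-cloud."""
    planning = plans[(has_multicloud, True)]
    total = planning + plans[(has_multicloud, False)]
    return planning / total


for group in (True, False):
    label = "already multi-cloud" if group else "not multi-cloud today"
    print(f"{label}: {share_planning(group):.0%} plan to deploy/expand")
```

Both groups contain planners and non-planners, and with samples this small the gap between the two shares should be read with caution.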
This is consistent with the Gartner findings. It is also interesting that some companies say they have an open strategy, yet their investment is mainly in one cloud service provider.
So these Gartner guidelines and peer references assured us that it is time to add multi-cloud to our cloud adoption roadmap, but they also reminded us to take the time to do it the right way, i.e. assess it in our own context, clarify the main objectives, identify valuable use cases and implement step by step.
How would we add multi-cloud into our cloud roadmap?
Taking the learnings from the study above, we defined our objectives for implementing multi-cloud (for IaaS & PaaS) as:
- Enable us to avoid or mitigate vendor lock-in.
- Give us quick access to innovative services from other clouds.
What counts as ‘vendor lock-in’ depends on why and how we use cloud services. For cloud resource administration and operation, i.e. CMS (Cloud Management Service), we use Azure native services such as Azure recovery service, insights and automation. We do not count this as ‘vendor lock-in’, because such binding is inevitable and more efficient overall. Of course, once we introduce another cloud, new requirements will emerge to unify the management interfaces and simplify operations.
So we consider ‘vendor lock-in’ an application architecture question. We analyzed the 40+ existing applications on Azure China, listed the Azure services they use and classified them into 2 categories: CSP (Cloud Service Provider) Native and CSP Neutral. We found that the majority are CSP Neutral, which means these applications can be migrated to another cloud with limited effort. But we do have some applications, especially those built on the Azure Web App service, that would need considerable effort to refactor or even rebuild on a new cloud.
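The classification itself can be as simple as a set lookup per application. Here is a minimal Python sketch of the idea. The application names, their service lists and the exact contents of the "native" set are hypothetical; only the Azure Web App example comes from our analysis above.

```python
# Services treated as CSP Native. Which services belong here is an
# assumption for illustration, apart from Azure Web App mentioned above.
CSP_NATIVE_SERVICES = {"Azure Web App", "Azure Functions"}


def classify(app_services: set[str]) -> str:
    """Return 'CSP Native' if the app uses any native-only service."""
    if app_services & CSP_NATIVE_SERVICES:
        return "CSP Native"
    return "CSP Neutral"


# Hypothetical inventory: application name -> cloud services it uses.
inventory = {
    "dealer-portal": {"Azure Web App", "MySQL"},
    "pricing-api": {"Virtual Machines", "MySQL"},
    "batch-reports": {"Kubernetes", "Object Storage"},
}

result = {app: classify(services) for app, services in inventory.items()}
for app, category in result.items():
    print(f"{app}: {category}")
```

A real inventory would likely need finer grades than two categories, for instance to separate applications that need refactoring from those that need a full rebuild.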
To understand which innovative services other clouds could offer, we asked our tech leaders and architects which cloud services they most want but cannot use today. Their answers fall into 2 categories:
- Services that exist on Azure China but are either too expensive or immature, such as Redis (in-memory cache for high-performance architectures) and SLB (Layer 4 load balancing).
- E2E observability and SRE-related services, such as chaos engineering.
To use those services across clouds, we will need good network interconnections and, of course, knowledge of and skills in such cloud services.
In summary, we want to identify another approved cloud service provider that can fulfil the two objectives above, and define the roadmap to deploy it step by step.
Cloud Adoption Principles
Before digging into the details, and to ensure that the roadmap is aligned with stakeholders, we re-clarified our cloud adoption principles as below.
We then re-emphasized the balance to be kept between competing design principles, including:
- Respect Michelin cloud security baselines AND operation models
- Align with global strategy AND follow local laws & regulations
- Maximize portability WITH minimum pre-investment
- Avoid vendor lock-in WHILE not introducing too many new tech stacks
- Keep quick access to innovative services WHILE not compromising security & operability
Shortlist candidates of other clouds
We tried to identify the other clouds that should or can be included in our multi-cloud roadmap (for IaaS & PaaS). Google Cloud Platform is not available in China, and AWS is not Michelin's primary CSP and raises concerns about its support level in China, so few global players are left. On the other hand, although the study above led us to agree that the political exit risk of Azure is low, it is always better if the multi-cloud option can also cover that risk.
Considering digital sovereignty concerns, we focus on Chinese domestic cloud providers. Based on their market share, service offerings, enterprise support, industry experience, etc., we consider Aliyun (from Alibaba) and Huawei as our candidates.
But to compare them further and make a decision, we need real hands-on experience. We are looking for so-called low-cost-of-failure opportunities where we can test and compare Aliyun and Huawei. We use the following criteria to identify potential new applications:
- Local solutions/products optimized for this cloud
- Digital business solutions targeting the China market, especially those using AI/ML, media services or IoT
- Traditional VM-based applications that prioritize cloud portability
- Modern applications that need ready-to-use full observability (including Real User Monitoring) and SRE/chaos engineering
- Very cost-sensitive applications with a short lifecycle (and not covered by the low-code platform)
We hope to identify 1-2 candidates this year.
Prerequisites: Network Connection
Thanks to the ExpressRoute deployment project, which finished early this year, we now have a CSP-neutral network hub from CNC (China Unicom). The current topology is shown below. We can use it to connect to other clouds once they are decided.
Target landscape at 3rd year
We tried to picture the target landscape of the roadmap after 3 years, split by domain.
Azure: 2 Production sites + 1 Disaster Recovery site
Another cloud: 1 Production site
- Computing Tier
The main workload should be CSP neutral, using a platform like Kubernetes
Light or short-lifecycle workloads can be CSP native, for example serverless
It should be easy to integrate SaaS from a 3rd cloud
- Storage Tier
A central storage hub on Azure respecting data classification.
Store & process raw data in the local cloud or at the on-premises edge, but consolidate selected data into the storage hub.
- Cloud Management
Use CSP native management features/tools + thin CSP neutral orchestrator
Keep Single Pane of Glass by integrating CSP native tools using self-built automations.
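To make the "thin CSP neutral orchestrator" idea more concrete, here is a minimal Python sketch under our own assumptions. Each cloud keeps its native tooling behind a small adapter, and the orchestrator only fans out and merges results into one view. All class and method names are hypothetical, and the adapters return canned data instead of calling any real cloud API.

```python
from typing import Protocol


class CloudAdapter(Protocol):
    """Minimal interface each CSP-native integration would implement."""

    name: str

    def list_alerts(self) -> list[str]: ...


class StubAdapter:
    """Stand-in for a real integration; returns canned data only."""

    def __init__(self, name: str, alerts: list[str]) -> None:
        self.name = name
        self._alerts = alerts

    def list_alerts(self) -> list[str]:
        return list(self._alerts)


class Orchestrator:
    """Thin CSP-neutral layer that merges adapter results into one view."""

    def __init__(self, adapters: list[CloudAdapter]) -> None:
        self._adapters = adapters

    def all_alerts(self) -> list[tuple[str, str]]:
        # Tag every alert with the cloud it came from.
        return [
            (adapter.name, alert)
            for adapter in self._adapters
            for alert in adapter.list_alerts()
        ]


pane = Orchestrator([
    StubAdapter("azure", ["vm-01 high CPU"]),
    StubAdapter("other-cloud", []),
])
print(pane.all_alerts())
```

The point of keeping the orchestrator thin is that each adapter can keep using the native features of its cloud, while the single pane of glass depends only on the small shared interface.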
Key milestones & Entrance criteria
As a matter of fact, we are still drawing up the roadmap and have yet to define concrete dates. Instead, we define entrance criteria for the key milestones, which will be checked during each regular budgeting period to see whether we are approaching a milestone.
Here is the current list:
• Milestone 1: Optimize the existing Azure vDC (virtual Data Center) architecture to prepare for integration with other clouds. For example, separate the firewalls for North-South and East-West traffic.
Entrance criteria: vDC components, for example a firewall instance, reach their thresholds for new cross-cloud flows from a capability, security or operations perspective.
• Milestone 2: Build & connect a landing zone in the other cloud.
Entrance criteria: An application to be hosted has been identified and will follow the standard support model.
• Milestone 3: Set up integration tools/platforms between clouds.
Entrance criteria: The landing zone has been built and intensive integration changes are expected.
• Milestone 4: Integrate the new cloud into the existing Cloud Management System.
Entrance criteria: Enough applications run on the new cloud and they expect a high Service Level Agreement, for example service recovery within 4 hours.
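The periodic check of these entrance criteria could be automated along the following lines. This is a minimal Python sketch under stated assumptions: the status field names and the numeric thresholds (80% firewall utilization, 5 applications) are hypothetical placeholders, since the concrete thresholds are not defined yet. Only the 4-hour recovery example comes from the list above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Milestone:
    name: str
    criteria: Callable[[dict], bool]  # True when the entrance criteria are met


# Thresholds and field names below are illustrative assumptions.
MILESTONES = [
    Milestone("1: Optimize Azure vDC",
              lambda s: s["firewall_utilization"] >= 0.8),
    Milestone("2: Build landing zone",
              lambda s: s["hosting_app_identified"]),
    Milestone("3: Set up integration tools",
              lambda s: s["landing_zone_built"] and s["integration_changes_expected"]),
    Milestone("4: Integrate into CMS",
              lambda s: s["apps_on_new_cloud"] >= 5 and s["recovery_target_hours"] <= 4),
]


def milestones_due(status: dict) -> list[str]:
    """Return the milestones whose entrance criteria are currently met."""
    return [m.name for m in MILESTONES if m.criteria(status)]


# Hypothetical snapshot taken at a budgeting review.
status = {
    "firewall_utilization": 0.55,
    "hosting_app_identified": True,
    "landing_zone_built": False,
    "integration_changes_expected": False,
    "apps_on_new_cloud": 0,
    "recovery_target_hours": 8,
}
print(milestones_due(status))
```

Running the same check at every budgeting review keeps the roadmap driven by criteria rather than by calendar dates.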
Whatever the steps, one critical factor in supporting this ambition is the competency to manage new clouds. As we experienced with Azure, we will need both internal knowledge and skills and external consulting and support resources. These should be built up in advance and improved as our multi-cloud environment evolves.
You can see we have just started working on the multi-cloud topic, but I believe we are on the right track. I hope this article can help you plan your own multi-cloud journey. Bon voyage!