A Kubernetes Journey

All companies need software to work, and all companies need this software to be reliable.

As such, anyone who has worked in Information Technology has only ever had one job: making these applications reliable.

Sysadmins and datacenter specialists make sure the hardware and operating systems that run the applications remain available. Developers and software architects work towards making the applications themselves more robust.

Cloud Native Computing is a more recent trend, and at its core it is not really any different. There is, however, something fundamentally new in the way infrastructure and application architecture have to cooperate to achieve the best reliability.

More specifically, in the context of Kubernetes, an application is not made reliable simply by running on a cluster. Kubernetes will be only too happy to stop any part of your application at any time. The job of making an application robust is left to the developer, who does so by writing software that is stateless, quick to start, and easy to scale and load balance… which in turn makes it the perfect candidate to run on a Kubernetes cluster.
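To make that concrete, here is a minimal sketch (not Michelin's actual code) of what such a well-behaved workload looks like from the developer's side: it keeps no local state, starts quickly, exposes a health endpoint the platform can probe, and shuts down cleanly whenever the cluster decides to stop it. The port and paths are arbitrary choices made for the illustration.

```python
import signal
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # /healthz lets the platform probe the service; every other path
        # returns a response that any interchangeable replica could serve.
        body = b"ok" if self.path == "/healthz" else b"hello from a stateless replica"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM before evicting a pod; exiting promptly makes
    # rolling updates and rescheduling painless.
    raise SystemExit(0)


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, handle_sigterm)
    ThreadingHTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

Because every replica is interchangeable, the platform is free to restart, reschedule or scale it at will without the application noticing.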

By making sure that applications are built a certain way to leverage a platform - Kubernetes - that provides a specific set of services, Cloud Native Computing shifted the paradigm and spread the responsibility of making software reliable across the software itself and the platform that runs it.

This makes applications operate at an unprecedented level of reliability, which is why companies like Michelin have been trying to leverage Cloud Native concepts in all areas of their operations.

The beginnings

Michelin started investing in Kubernetes and the software that would run on it in 2018. Back then the roadmap was very simple: we needed to address the company’s critical business cases by building robust, Cloud Native applications, and we needed the Kubernetes clusters to run them. In this article, we will focus on the latter.

The scope of applications was well defined at that time, so the decision-making process did not factor in growth as a major driver. It was, however, extremely important, so early in the journey, to avoid vendor lock-in, which could make adjustments to the strategy difficult further down the road. So it was decided that a small team of people would use open source software (namely Kubespray) to build a fleet of Kubernetes clusters on which to deploy strategic applications. That team would ensure the clusters were production ready by leveraging automation in every aspect of cluster operations (day 1 and day 2), using tools that were readily available and well known within the company, such as Ansible, Python and GitLab CI.
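To give a flavour of that glue automation (this is a hedged sketch, not the team's actual tooling), a GitLab CI job could call a small Python wrapper that runs Kubespray's cluster.yml playbook against a per-cluster Ansible inventory; the directory layout and cluster name below are assumptions made for the example.

```python
import subprocess
from pathlib import Path

KUBESPRAY_DIR = Path("kubespray")      # local checkout of the Kubespray repo (assumed path)
INVENTORY_ROOT = Path("inventories")   # one Ansible inventory per cluster (assumed layout)


def deploy_cluster(cluster_name: str) -> None:
    """Run Kubespray's cluster.yml playbook against one cluster's inventory."""
    inventory = INVENTORY_ROOT / cluster_name / "hosts.yaml"
    if not inventory.exists():
        raise FileNotFoundError(f"no inventory found for cluster {cluster_name!r}")
    subprocess.run(
        [
            "ansible-playbook",
            "-i", str(inventory),
            "--become",                          # cluster setup needs root on the nodes
            str(KUBESPRAY_DIR / "cluster.yml"),
        ],
        check=True,  # fail the CI job if the playbook fails
    )


if __name__ == "__main__":
    deploy_cluster("prod-eu-01")  # hypothetical cluster name
```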

All of these clusters ran in Michelin's virtual datacenter in the cloud, hosted on Azure. However, we decided not to adopt Microsoft's managed Kubernetes platform, AKS, at that time. The logic was that building the clusters ourselves would help Michelin develop skills and knowledge on Kubernetes - a very clever choice, as it turned out - and that the decision to commit to a partner could be taken further down the line, once the company was more mature with the technology.

The early days were a success. Michelin deployed robust, state-of-the-art clusters in the cloud, and used them to run applications that were specifically built for the cloud. Such applications covered business lines such as Ordering or Logistics - arguably pillars of industrial companies like Michelin - and were known to keep operating even when cross-cutting system failures brought down other applications in the company. You can get detailed information on this version of our platform here.

However, the scope remained limited for a while. This was a deliberate decision, made to ensure that only the best-designed applications would find their way onto a Kubernetes cluster.

Soon, the company decided to tap the unrealised potential of Cloud Native Computing by developing more applications. And as a result, Michelin had to rethink its relationship with Kubernetes.

The shift to proprietary solutions

After this successful start, it was time to take a step back and look at the broader strategy. Let’s look, for example, at two aspects that had evolved.

First, the application scope. Michelin is an industrial company, and the software it produces must address a broad range of subjects, from shipping goods to the customer to providing the correct set of parameters to tire building machines in a workshop. By limiting ourselves to the cloud, we had in effect made a number of use cases harder to cover. Michelin now needed the ability to build and deploy Cloud Native applications directly in the factories. If there is one place where your applications must be rock-solid, it has to be the workshop floor.

The second point is that managing Cloud Native workloads anywhere within Michelin would probably involve deploying an arbitrarily large number of clusters, including on premise, where there was no prior Kubernetes footprint. At the same time, the amount of container-based workloads deployed in the cloud was sure to keep growing as development teams embraced a “Kubernetes First” strategy.

As serviceable as the scripts and automation we had built were, there was a strong sentiment that they would no longer be sufficient to manage dozens of Kubernetes clusters (Michelin has about 70 production plants, all of which were expected to require their own cluster at some point), including some that would be deployed on-premise on different underlying hypervisor technologies.

After careful consideration, Michelin decided it needed to enlist an external vendor to help manage our cluster fleet at scale, regardless of where it was deployed - cloud or on-premise. With that in mind, in 2021 Michelin started shopping for the best off-the-shelf product, and soon selected VMware and its Tanzu product line.

Working with Tanzu

The Tanzu products promised to bring everything Michelin needed to accomplish the second part of its strategy around cloud native applications.

Tanzu Kubernetes Grid (TKG) is an opinionated Kubernetes distribution that would bring cluster and add-on lifecycle management capabilities. It natively supported various cloud-based deployment scenarios as well as deployment on VMware vSphere, the hosting technology available on premise in the factories.

On top of that, Tanzu Mission Control is a SaaS control plane that promised scalability by managing the lifecycle of a whole fleet of clusters, along with their policies and access control.

Finally, the rather large Tanzu portfolio (Tanzu Application Platform, Tanzu Observability…) could probably be leveraged further down the line to get more value out of the platform.

So Michelin set out to integrate this new set of products and move the applications over to a new, Tanzu-based version of our Kubernetes platform. Right from the start, some aspects of this integration foreshadowed issues that would become more problematic as time went by.

First, and as mentioned above, Tanzu Kubernetes Grid is an opinionated platform. But Michelin was already far enough along in its Kubernetes journey that some of the choices baked into TKG were at odds with the strategy Michelin had followed until then. It became a matter of choosing between a painful change to the platform and deliberately not using parts shipped with the solution. More often than not, we chose the latter.

The promise of managing the platform at scale was soon thwarted by the difference in philosophy between the Tanzu tools and Michelin's expectations. For instance, Mission Control and TKG were only scriptable through an imperative CLI, which made any effort to implement an Infrastructure as Code approach next to impossible.
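To illustrate the point (with a purely hypothetical fleet-cli command, not the real Tanzu tooling), driving clusters through an imperative interface forces every script to discover the current state and decide how to mutate it, instead of declaring a desired state and letting a controller reconcile it:

```python
import json
import subprocess


def ensure_cluster(name: str) -> None:
    """Imperative style: query, branch, mutate. Running it safely a second
    time is the script's problem, not the platform's."""
    listing = subprocess.run(
        ["fleet-cli", "cluster", "list", "--output", "json"],  # illustrative command only
        check=True, capture_output=True, text=True,
    )
    existing = {cluster["name"] for cluster in json.loads(listing.stdout)}
    if name not in existing:
        subprocess.run(["fleet-cli", "cluster", "create", "--name", name], check=True)
    # Every other setting (node pools, policies, access) needs the same
    # query-compare-mutate dance, multiplied by the number of clusters.
```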

Another issue was that Michelin had to relinquish responsibility for the lifecycle management of various components, add-ons and so on to VMware. Of course, this is precisely what you pay a third party for in such a context. However, when the strategies of customer and vendor are not perfectly aligned, it can lead to major inefficiencies in the management of the platform.

Finally, frustration became a major issue: skilled engineers were parked in the passive role of opening tickets and interacting with support teams when, most of the time, they had already worked out what the issue was and how to fix it. This was increasingly at odds with Michelin's strategy of fostering a climate where IT experts can thrive and feel valued for their skills.

For a couple of years, VMware and Michelin made the best of this less-than-ideal situation. The period was not without its successes: the Kubernetes footprint in factories grew according to our objectives and, frustrating as it was, the engineers in the Kubernetes Platform team picked up valuable skills and experience working with a complicated, multi-faceted product.

As 2023 came to an end, it was time to assess our strategy again in light of our latest experiences.

A swing back to open source

I should first acknowledge that any strong shift in an organisation's strategy begins with personal conviction. The people most knowledgeable about the products were convinced that Michelin, as an organisation, had the resources and know-how to implement the next steps of our strategy without the assistance of a third party. So it became a matter of determining whether this was the best course of action for the company, by making the case and getting management to rule on it.

The first element of this case is purely technical. Through several proofs of concept, we soon determined that it was indeed possible to leverage off-the-shelf open source technologies - namely Cluster-API and ArgoCD - to manage a fleet of Kubernetes clusters, the policies that rule them and the users that use them, with an elegant Infrastructure as Code / GitOps pattern. It was demonstrably better than anything we had been able to build before. An interesting anecdote: using open source software and clever craftsmanship, it proved possible to move running applications to such an open source platform, allowing uninterrupted operation of our applications even in the event of a complete platform rebuild. This is something even VMware could not provide, as their own product was scheduled to introduce breaking changes that would have forced us into a downtime and migration in the near future.
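For the curious, here is a rough sketch of what that Infrastructure as Code / GitOps pattern looks like, under assumptions of my own (the names, namespace and vSphere provider kind are illustrative, and the real tooling is more elaborate): the desired state of each cluster is rendered as a Cluster-API manifest and committed to a Git repository that ArgoCD keeps in sync, leaving the Cluster-API controllers to reconcile the actual infrastructure.

```python
from pathlib import Path

import yaml  # pip install pyyaml


def render_cluster_manifest(name: str, pod_cidr: str = "192.168.0.0/16") -> dict:
    """Build a minimal Cluster-API 'Cluster' object as a Python dict."""
    return {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        "metadata": {"name": name, "namespace": "clusters"},
        "spec": {
            "clusterNetwork": {"pods": {"cidrBlocks": [pod_cidr]}},
            # References to control plane and infrastructure objects that
            # would be defined alongside this manifest.
            "controlPlaneRef": {
                "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1",
                "kind": "KubeadmControlPlane",
                "name": f"{name}-control-plane",
            },
            "infrastructureRef": {
                "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
                "kind": "VSphereCluster",  # on-premise factories; a cloud provider kind would work the same way
                "name": name,
            },
        },
    }


def commit_desired_state(name: str, gitops_repo: Path) -> Path:
    """Write the manifest into the Git working copy watched by ArgoCD.
    Committing and pushing (not shown) is what actually triggers the sync."""
    target = gitops_repo / "clusters" / f"{name}.yaml"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(yaml.safe_dump(render_cluster_manifest(name), sort_keys=False))
    return target
```

The key property is that every change goes through an edit in Git rather than a command run against the platform - exactly what the imperative CLIs described earlier could not offer.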

The second element was cost effectiveness. Just because we can doesn't mean we should, especially regarding total cost of ownership. It was, however, a pretty simple argument to make: making the platform work with a third party still required a fair amount of engineering on Michelin's side. It was estimated (and it still holds true today) that redirecting a fraction of the subscription cost of the Tanzu products toward paying engineers within the company would allow the Kubernetes Platform team to deliver a functionally identical platform. Furthermore, these talents, once freed from the burden of interacting with and waiting on a third party, could incrementally improve the platform by delivering much-needed features.

Finally, Michelin continues to pursue its strategy of building the best information system possible by making sure it employs the best people. To do that, the IT department as a whole is geared toward welcoming people with a “doer” mindset. This goes hand in hand with Michelin's Open Source strategy, which states that the company will support any initiative to use, contribute to, or build open source software in all areas of its operations.

In this context, and given the arguments above, it really was a no-brainer for Michelin to go ahead and rebuild its entire Kubernetes platform using open source software. This conviction was soon shared by the rest of the organisation as well.

Build back better

At the dawn of 2024, we started working on this ambitious yet thrilling new project. Most of the technical uncertainties had been ironed out during the proof-of-concept phase mentioned above. But there still remained the matter of making everything work at scale, and with the robustness that Michelin requires for the operation of its Kubernetes clusters. Plus, while it had been proven possible to move our applications from the previous Tanzu clusters to our new “MKS” (Michelin Kubernetes Services) platform without interruption, orchestrating such an operation across hundreds of applications running on dozens of clusters was a challenge that needed to be tackled. We also owed it to our partners at VMware to do so in a timely manner, as the subscription to their products and the support that comes with it would end at the end of July 2024.

The Kubernetes Platform team used the building blocks created during the PoC - a combination of Cluster-API, ArgoCD and open source add-ons - to rebuild a complete platform from scratch. For some “one-shot” operations, complex Ansible playbooks were authored to reduce the time and attention needed to perform all kinds of migration tasks. At the same time, a large communication campaign was launched to ensure all our internal clients were aware of the change, even though existing applications would see no impact on their regular operations.

Finally, at the beginning of June, we set out to perform the migration itself. Although the procedure was entirely automated and proven to be without impact, we decided to perform these migrations on a per-cluster basis, carefully scheduled with two purposes in mind: first, to catch any issues that could not be caught during the testing phases and make sure their impact was minimal; second, to leave room for the unexpected without compromising the overall migration schedule.

Neither of these precautions proved necessary, however. By July, and with weeks to spare, all of Michelin's Kubernetes clusters had been migrated to our own open source based distribution. This success was a first, very important milestone, but it was not time to celebrate yet. By focusing on building the new solution during the first half of the year, we had fallen quite far behind on Kubernetes releases, which was not an acceptable situation for Michelin to be in. So the second half of the year was dedicated to upgrading our platform's Kubernetes version (and associated add-ons) to the point where every element of the platform was supported by the community at a level satisfactory to the company.

As I'm writing these lines, the platform is continuously being upgraded (“Evergreen” is an important motto at Michelin) and continues to be expanded, both in scope and in features, to serve Michelin's ambitions for its Information System.

Where are we at now?

Michelin's open source based Kubernetes platform is an asset for the company. As mentioned above, Kubernetes' footprint within Michelin has grown continuously since its humble beginnings. As of writing, Kubernetes runs 441 business applications, big and small, on 62 different clusters managing close to 8000 CPU cores. All of this is managed by a team of just eleven engineers, who not only continuously build the solution but also run it and support the end users. While some of the manpower goes to simply "keeping the lights on", ensuring the best platform availability and supportability, the move to open source allows us to start looking to the future and to the features that bring the most value to our end users.

Conclusion

Michelin's journey with Kubernetes has been one of change and introspection, but it ultimately turned out to be an Open Source journey.

When things kicked off and it was equally important to build robust solutions and to accumulate knowledge, the use of Open Source software was the best choice.

Even when we sought the assistance of a third party to continue growing our Kubernetes platform, one could argue that the overwhelming majority of the software we ran was, in fact, Open Source. Any required tools or practices that went beyond what VMware could provide were sourced from the community.

Finally, when it was time to leverage our experience to bring about the next generation of Michelin's Kubernetes platform, relying essentially on Open Source software and the company's commitment to Open Source communities sounded like a no-brainer (you can learn more about Michelin's endorsement of free and open source software here).

In conclusion, I hope that by describing this journey we were able to convey the importance, for a company like Michelin, of making instead of buying and doing instead of delegating. The groundwork for this success story was laid by people in management positions who decided to foster a climate in which people with that mindset can thrive and be heard. It resulted in the best possible outcome in this scenario, and is sure to prove a valuable experience when similar challenges arise in the future.