
Connecting Kernel Panics to Kubernetes Pods: Keeping Track of Lost Nodes at Netflix
With a dedicated effort to enhance user experience on the Titus container platform, Netflix delved into the issue of “orphaned” pods – those left incomplete without a clear final status. Although this may not be a concern for Service job owners, it holds significant importance for Batch users. This blog post provides insights into our…
A “Krispr” Approach to Kubernetes Infrastructure: Keeping Pods Fresh and Rolling Out Updates Smoothly
Introduction In the demanding world of modern service-oriented architectures, maintaining fresh and up-to-date infrastructure is crucial for optimal performance and security. Airbnb, with its hundreds of services relying on Kubernetes, faced challenges in efficiently updating shared infrastructure components within their platform. Their existing approach, heavily dependent on service owner upgrades, led to version fragmentation, complexity,…

Scaling Kubernetes to 2,500 Nodes for Deep Learning at OpenAI
OpenAI, a pioneer in artificial intelligence, pushes the boundaries of Kubernetes by scaling it to manage massive deep learning workloads. While managing bare VMs remains an option for the largest tasks, Kubernetes shines for its rapid iteration cycles, reasonable scalability, and reduced development overhead. This blog dives into OpenAI’s journey building a 2,500-node Kubernetes cluster…