Thursday, November 12 • 9:30am - 10:30am
Scaling Databricks to Run Data and AI Workloads on Millions of VMs

Cloud service developers need to handle massive-scale workloads from thousands of customers with no downtime or regressions. In this talk, I’ll present our experience building a very large-scale cloud service at Databricks, which provides a data and ML platform used by many of the largest enterprises in the world. Databricks manages millions of cloud VMs that process exabytes of data per day for interactive, streaming, and batch production applications. This means that our control plane has to handle a wide range of workload patterns as well as cloud issues such as outages. I’ll describe how we built the Databricks control plane using Scala services and open source infrastructure such as Kubernetes, Envoy, and Prometheus, along with the design patterns and engineering processes that we learned along the way. I’ll also cover how we have adapted data analytics systems themselves to improve reliability and manageability in the cloud, such as creating an ACID storage system (Delta Lake) that is as reliable as the underlying cloud object store, and adding autoscaling and auto-shutdown features for Apache Spark.

Matei Zaharia

Chief Technologist, Databricks
Matei Zaharia is an Assistant Professor of Computer Science at Stanford and Co-founder and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley, and has worked on other widely used open source data analytics and AI software including...

Thursday November 12, 2020 9:30am - 10:30am PST