5 tips for cheaper Databricks workloads

May 21, 2023

by R. Tyler Croy

databricks

aws

Cloud-based data platforms can cost big bucks if you're not careful. Fortunately, with Databricks there are some very easy cost optimizations within reach. Cloud providers and Databricks both charge on a utilization basis. The quickest way to save money is to reduce usage, but for many organizations that is not an option. When reviewing "total cost of ownership" it is tempting to look at the cost of Databricks itself and attempt to trim that usage, but I firmly believe that a Databricks-based platform is cheaper than an AWS, GCP, or Azure-only data platform. In this post, we'll review five tips that will help you squeeze even more performance and value out of your data!

Use Photon

Databricks has a heavily optimized Spark runtime called Photon which can greatly improve the run times for heavy Spark operations. Photon is faster for most SQL, Python, or Scala workloads that run on Apache Spark. Less time to complete the job means fewer compute hours to pay to the cloud provider!

Photon does not support every operation in Apache Spark and automatically falls back to the JVM-based Spark engine for anything it cannot handle, which makes it safe to simply drop in for many workloads. That said, not all Spark jobs will be faster with Photon! In our observations almost all DataFrame-based jobs will see a speedup, but RDD-based workloads or those which monopolize CPU on the driver node typically don't see an increase in performance with Photon.

When launching a cluster, Photon is a checkbox away, so it's worth trying out!
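For clusters defined through the API or a job definition rather than the UI, the checkbox corresponds to a field in the cluster spec. Below is a minimal sketch, assuming the `runtime_engine` field of the Databricks Clusters API; the cluster name, runtime version, and node type are placeholders I made up for illustration, not recommendations.

```python
import json


def photon_cluster_spec(name, spark_version, node_type_id, num_workers):
    """Build a cluster spec (a plain dict) that requests the Photon engine."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # Assumption: "PHOTON" is the value the Clusters API uses to
        # enable Photon; "STANDARD" would be the regular engine.
        "runtime_engine": "PHOTON",
    }


# All values below are placeholders for illustration only.
spec = photon_cluster_spec(
    name="nightly-etl",
    spark_version="12.2.x-scala2.12",
    node_type_id="m5d.xlarge",
    num_workers=4,
)
print(json.dumps(spec, indent=2))
```

Running the same job once with and once without that field is an easy way to check whether a given workload actually benefits.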

Spot-only workers with Fleet Clusters

The typical Databricks cluster we will see in AWS is provisioned with the driver node using an "On Demand" instance, followed by some number of Spot instances using the "Fallback to On Demand" setting. For a lot of workloads we have observed it is cheaper and safe to only use Spot for workers.

This is especially valuable when coupled with Fleet Clusters, which provide some extra magic to find suitable Spot capacity across availability zones, allowing your workloads to run on the cheapest compute available.
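As a rough illustration, a Spot-only worker configuration on a fleet instance type might look like the sketch below. It assumes the `aws_attributes` fields of the Databricks Clusters API (`first_on_demand`, `availability`, `zone_id`) and the `m-fleet.xlarge` fleet instance type; the cluster name, runtime version, and worker counts are hypothetical.

```python
def spot_only_fleet_spec(name, spark_version, min_workers, max_workers):
    """Cluster spec with an On Demand driver and Spot-only workers."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        # Assumption: a fleet instance type lets Databricks choose among
        # several comparable EC2 instance types.
        "node_type_id": "m-fleet.xlarge",
        "driver_node_type_id": "m-fleet.xlarge",
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "aws_attributes": {
            # Only the first node (the driver) is On Demand.
            "first_on_demand": 1,
            # "SPOT" rather than "SPOT_WITH_FALLBACK": workers never fall
            # back to On Demand capacity.
            "availability": "SPOT",
            # Let Databricks pick the availability zone.
            "zone_id": "auto",
        },
    }


# Placeholder values for illustration only.
fleet_spec = spot_only_fleet_spec("spot-etl", "12.2.x-scala2.12", 2, 10)
print(fleet_spec["aws_attributes"])
```

The key difference from the typical setup described above is the `availability` value; everything else is an ordinary autoscaling cluster.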

In the cases where insufficient Spot capacity exists, the job will simply run with fewer resources. In our tests we have seen that it is cheaper to let a job run a little longer with fewer Spot instances than to "fall back" to more expensive On Demand instances.
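To make that trade-off concrete, here is a back-of-the-envelope comparison. Every number is hypothetical: the hourly prices are invented, and the sketch assumes the job slows down linearly as workers are removed, which real jobs rarely do exactly.

```python
def compute_cost(nodes, hourly_price, hours):
    """Cloud compute cost for a group of identical nodes."""
    return nodes * hourly_price * hours


# Hypothetical prices in dollars per node-hour.
ON_DEMAND_PRICE = 0.40
SPOT_PRICE = 0.12

REQUESTED_WORKERS = 10
AVAILABLE_SPOT = 6       # only 6 Spot instances could be found
BASELINE_HOURS = 1.0     # run time with all 10 workers

# Scenario A: the 4 missing workers fall back to On Demand.
fallback_cost = (
    compute_cost(AVAILABLE_SPOT, SPOT_PRICE, BASELINE_HOURS)
    + compute_cost(REQUESTED_WORKERS - AVAILABLE_SPOT, ON_DEMAND_PRICE, BASELINE_HOURS)
)

# Scenario B: Spot only, so the job runs on 6 workers and takes longer.
# Assumption: run time scales linearly with the number of workers.
spot_only_hours = BASELINE_HOURS * REQUESTED_WORKERS / AVAILABLE_SPOT
spot_only_cost = compute_cost(AVAILABLE_SPOT, SPOT_PRICE, spot_only_hours)

print(f"fallback: ${fallback_cost:.2f} for {BASELINE_HOURS:.2f} hours")
print(f"spot only: ${spot_only_cost:.2f} for {spot_only_hours:.2f} hours")
```

With these made-up prices the Spot-only run costs roughly $1.20 against $2.32 with fallback, at the price of a job that takes about 40 minutes longer. Under the linear assumption both scenarios use the same 10 node-hours, so any per-node-hour platform charge would be the same in both. Plug in your own prices and measured run times before drawing conclusions.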

NOTE: Depending on the region of your workloads, you may see increased Spot competition which can drive up prices. We recommend us-east-2 and us-west-2 for our North American customers.

Reduce additional EBS

For many of our customers the cost of EBS is almost as high as the compute resources used by their workloads! By default Databricks will provision two EBS volumes for every worker in a cluster:

30GB volume for the OS and Databricks services
150GB volume for the Spark worker and logs

Many Databricks users will inadvertently increase their costs by adding more EBS volumes in their configurations! Many workloads can be safely run without additional EBS volumes, which are typically used to handle workloads with excessive shuffles or other poor performance characteristics.

There is an argument to be made for adding additional EBS volumes to enable Delta caching, but I am of the opinion that the hit rates for most jobs are so low that Delta caching is not a performance gain and simply represents unnecessary cost.
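One way to find out whether your cluster definitions are requesting extra volumes is to scan them for the relevant settings. The sketch below assumes the `ebs_volume_type`, `ebs_volume_count`, and `ebs_volume_size` fields of `aws_attributes` in the Databricks Clusters API; the helper names and the sample spec are hypothetical.

```python
# Assumption: these aws_attributes keys control the additional EBS volumes.
EBS_KEYS = ("ebs_volume_type", "ebs_volume_count", "ebs_volume_size")


def extra_ebs_gb(cluster_spec):
    """Additional EBS storage (GB per node) requested by a cluster spec."""
    aws = cluster_spec.get("aws_attributes", {})
    return aws.get("ebs_volume_count", 0) * aws.get("ebs_volume_size", 0)


def without_extra_ebs(cluster_spec):
    """Return a copy of the spec with the additional EBS settings removed."""
    aws = {
        key: value
        for key, value in cluster_spec.get("aws_attributes", {}).items()
        if key not in EBS_KEYS
    }
    return {**cluster_spec, "aws_attributes": aws}


# Hypothetical spec that asks for three extra 100GB volumes per node.
sample_spec = {
    "cluster_name": "shuffle-heavy",
    "aws_attributes": {
        "availability": "SPOT",
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 3,
        "ebs_volume_size": 100,
    },
}

print(extra_ebs_gb(sample_spec))
trimmed_spec = without_extra_ebs(sample_spec)
print(trimmed_spec["aws_attributes"])
```

Removing the settings is only safe for workloads that do not spill heavily to disk, so it is worth testing the trimmed spec on a representative job first.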

Use SQL Serverless

In 2021 Databricks announced "serverless compute" for SQL workloads. Rather than compute running in your AWS account, SQL Serverless runs compute in Databricks' account. There are a number of reasons that this compute can be cheaper than running SQL workloads in your own AWS account, but SQL Serverless has a killer feature: extremely fast provision times.

For our customers we encourage setting up SQL Warehouses (formerly known as SQL Endpoints) with zero minimum capacity, which costs them $0 most of the time. When a user starts running queries, clusters are auto-provisioned in seconds on the Databricks backend, and can be configured to rapidly shut down after use.
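A warehouse definition along these lines might look like the following sketch. It assumes the field names of the Databricks SQL Warehouses API (`enable_serverless_compute`, `auto_stop_mins`, `min_num_clusters`, `max_num_clusters`, `cluster_size`, `warehouse_type`). It also assumes that the idle-to-zero behaviour comes from a short auto-stop timeout, since as far as I know the API does not accept a `min_num_clusters` below 1. The warehouse name and sizes are placeholders.

```python
def serverless_warehouse_spec(name, auto_stop_mins=5, max_num_clusters=2):
    """Warehouse spec that stops quickly when idle."""
    return {
        "name": name,
        "cluster_size": "Small",
        # Assumption: serverless compute requires the PRO warehouse type.
        "warehouse_type": "PRO",
        "enable_serverless_compute": True,
        # The warehouse scales between these bounds while it is running.
        "min_num_clusters": 1,
        "max_num_clusters": max_num_clusters,
        # Minutes of inactivity before the warehouse stops entirely;
        # a stopped warehouse is what costs $0.
        "auto_stop_mins": auto_stop_mins,
    }


# Placeholder name for illustration only.
warehouse = serverless_warehouse_spec("adhoc-analytics")
print(warehouse)
```

Because serverless warehouses start in seconds, a short auto-stop timeout costs users very little in waiting time.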

We expect Databricks to support more types of workloads in SQL Serverless in the future, which will allow many customers to take advantage of not only these rapid provisioning times, but also the discounts and optimizations a bigger cloud user like Databricks is able to command from AWS.

Avoid certain instances

Do not use i3 instances on AWS. There was a time when the i3 instance family was a requirement to use Photon, since its direct-attached NVMe storage allowed for high local disk performance. The downside of the i3 family is that its instances are highly sought after for a number of disk-heavy applications and as such are one of the most frequently interrupted Spot instance families, leading clusters that use those instance types to require On Demand capacity.

If your Spark workloads benefit from fast locally attached storage, we recommend m6d or c5d instance families instead. Consult the AWS Spot instance advisor to get a sense of which instance types to avoid in your region.
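If you have many cluster definitions, a small script can point out which ones still reference the i3 family. This is a minimal sketch with hypothetical helper names and sample specs; it only inspects the `node_type_id` and `driver_node_type_id` fields and does not look at actual Spot interruption rates.

```python
# Instance families to flag; extend this after checking the Spot
# instance advisor for your region.
AVOID_FAMILIES = {"i3"}


def instance_family(node_type_id):
    """Return the family part of an instance type, e.g. 'i3' for 'i3.xlarge'."""
    return node_type_id.split(".", 1)[0]


def flag_node_types(cluster_specs):
    """List (cluster name, field, node type) for every flagged node type."""
    flagged = []
    for spec in cluster_specs:
        for field in ("node_type_id", "driver_node_type_id"):
            node_type = spec.get(field)
            if node_type and instance_family(node_type) in AVOID_FAMILIES:
                flagged.append((spec.get("cluster_name", "<unnamed>"), field, node_type))
    return flagged


# Hypothetical cluster definitions for illustration only.
clusters = [
    {"cluster_name": "legacy-etl", "node_type_id": "i3.2xlarge"},
    {"cluster_name": "reports", "node_type_id": "c5d.xlarge",
     "driver_node_type_id": "i3.xlarge"},
    {"cluster_name": "fleet-etl", "node_type_id": "m-fleet.xlarge"},
]

flagged_clusters = flag_node_types(clusters)
for cluster_name, field, node_type in flagged_clusters:
    print(f"{cluster_name}: {field} uses {node_type}")
```

The match is on the exact family name, so related families such as i3en are not flagged unless you add them to the set.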

In our work we have found that relying on Databricks Fleet Clusters rather than specifying instance types directly can provide a simple and future-proof way to select the best priced instance types available!

Following these five tips can have a noticeable positive impact on your monthly data platform bills. As is typically the case, once you have made the easy changes, additional cost optimizations require a bit more measurement, analysis, and experimentation. If you'd like to take your cost optimization project to the next level, we can help! Drop me an email and we'll chat!!

Buoyant Data