The Breakdown: Databricks, Snowflake, and Open Source Positioning in the Data World
This post will explore how Databricks and Snowflake are positioning against one another, with a particular focus on how each uses open source as a strategic tool.
A note before we get started. I spent four years at Databricks, from when it was approximately 80 people to 1,300. I worked on the initial release of Databricks Delta (now known as Delta Lake), a topic of this post. I also helped build the ‘partnership’ between Databricks and Snowflake in 2018.
This post provides commentary on what Databricks and Snowflake are trying to do but does so using only publicly available information. I am a Databricks shareholder but will do my best to provide commentary and not endorse any particular platform.
Read more about open source on Product of Data:
👉 How you want your open source business to work
👉 Picking markets and creating movements
Snowflake and Databricks
Snowflake and Databricks are in the midst of an epic battle for cloud data dominance. However, this hasn’t always been the case. The dynamics between the two companies show the evolution of open source from a hacker ethos into a tool mega-businesses use to position against one another and cover their strategic flanks.
This post will cover a brief history of the two companies’ competition and positioning, and explore the fascinating ways open source intersects with competition in the data world.
Mid & Late 2010s: A tenuous truce
In the mid-to-late 2010s, Snowflake and Databricks competed lightly in the broader market of ‘data analytics’. Each company, however, had a distinct product focus, which minimized conflict. Snowflake focused on data warehousing for the cloud, and Databricks focused on unified analytics.
❄️ Snowflake’s Positioning
At the time, Snowflake positioned in the traditional data warehouse market against companies like Teradata. Their marketing and positioning was all about the data warehouse, built for the cloud, as you can see from the Wayback Machine in 2018.
🧱 Databricks’ Positioning
Databricks’ positioning centered around ‘Unified Analytics’ and bringing various parts of an overall organization together. Databricks focused on unifying data science and data engineering - two orgs that traditionally didn’t work all that well together.
As the companies matured, their spaces began to collide more and more. In 2018, the two companies announced a partnership (which I worked on at Databricks) attempting to draw a clear boundary between the two systems. This was never really set up for success, as both companies knew they’d eventually be eating into one another’s user bases.
Early 2020’s: Open Competition
In early 2020, the heat turned up on the competition. Snowflake was preparing for its IPO later that year and sought to expand its TAM by becoming ‘The Data Cloud’. This is much broader than their original “data warehousing” positioning and speaks to the future of computing.
This contributed to Snowflake being one of the largest IPOs in history.
Enter the Lakehouse 🌊
At this time, the ‘Data Lakehouse’ emerged as a term for the next-generation data lake. Databricks first used the term lakehouse in late 2019. In early 2020, they published a blog post on the term, establishing it as a new category.1
This set the stage for direct competition with Snowflake and the creation of a differentiated approach to cloud computing.
The Lakehouse position seeks to commoditize the data warehouse as just one piece of the overall “data lake”. Everything you do lives in the data lake, and the data warehouse becomes one use case on top of it. This positioning is important because it tries to put Snowflake (and data warehousing broadly) into a box. The introduction of the lakehouse brought truly open competition between the two companies.
Here’s where competition gets interesting. If we take a step back, there are three key conceptual layers of all data workloads.2
There is the storage layer, defining how and where data is stored.
Then there is the compute layer, defining how data is moved around and manipulated to create business value.
Finally, there is the application layer, defining how a company derives value from data and compute.
The two companies are now competing head to head on all of these layers. Let’s take a look at how.
Data ⚔️ Battleground #1 ⚔️: Storage Formats
There’s a key difference in the two companies’ (original) approaches to storing data, with each approach having its own trade-offs.3
Snowflake forced users to store their data within Snowflake, managing both storage and compute. Users could never control the actual files on cloud storage (for example, if they wanted to eke out some additional performance gains) - instead, Snowflake aimed for “good enough” performance to meet user requirements.
Databricks did not. They focused exclusively on compute. The challenge with this is that users often needed to “tune” the layout of their data to get good performance - and manage the complexities of doing so themselves.
Tuning the layout of data to get good performance comes up in basically every data lake and is a huge pain point. Projects like Apache Hudi and Apache Iceberg were built at organizations like Uber and Netflix to solve this problem. Databricks solved it with the (originally proprietary) Databricks Delta, which subsequently became the open source Delta Lake.
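To make the pain point concrete, here’s a minimal sketch of that layout tuning with open source Delta Lake. This assumes a Spark session configured with the delta-spark package; the bucket, table, and column names are purely illustrative.

```python
# Minimal sketch: layout tuning with Delta Lake. Assumes the open source
# delta-spark package is installed; paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-layout-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# With raw Parquet on cloud storage, file sizes and sort order are the
# user's problem. Delta tracks files in a transaction log instead.
events = spark.read.json("s3://my-bucket/raw/events/")
events.write.format("delta").mode("overwrite").save("s3://my-bucket/delta/events")

# Compaction and clustering become a single command (OPTIMIZE with
# Z-ordering shipped in open source Delta Lake 2.0) rather than a
# hand-rolled maintenance job.
spark.sql("OPTIMIZE delta.`s3://my-bucket/delta/events` ZORDER BY (user_id)")
```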
Delta was a significant step for Databricks because they now started to control the data, not just the compute. This increased the stickiness of the system and made Databricks more of a direct threat to Snowflake.
Snowflake knew that if you don’t have an open data format that others can use, you’re going to limit your adoption. This posed a problem for Snowflake: if it isn’t compatible with open storage formats (or doesn’t provide one as an option), then expanding into new use cases and being the “data cloud” (for all data) is a difficult position to back up.
Snowflake Targets Delta Lake with Open Source Apache Iceberg
In response to Delta Lake (I assume), Snowflake selected an open format as their standard. However, instead of endorsing Delta Lake, they chose Apache Iceberg. What’s fascinating is the competitive terminology used in their announcement blog post:
The Iceberg project is inside of a well-known, transparent software foundation and is not dependent on one vendor for its success. Rather, Iceberg has seen organic interest based on its own merits. Likewise, Iceberg avoids complexity by not coupling itself to any specific processing framework, query engine, or file format.
…
While many table formats claim to be open, we believe Iceberg is more than just “open code,” it is an open and inclusive project. Based on its rapid growth and merits, customers have asked for us to bring Iceberg to our platform. Based on how Iceberg aligns to our goals with choosing open wisely, we think it makes sense to incorporate Iceberg into our platform.
If you’re reading between the lines, Snowflake is basically saying that because Databricks is responsible for the majority of Delta Lake development, Delta Lake isn’t truly “open”.
Now, this is a nuanced argument to make, and it’s curious that Snowflake, a completely proprietary system, is taking a holier-than-thou stance on what open source really means. This continues in other blog posts on the topic as well.
Now strategically, this does pose a risk for Databricks. If Databricks doesn’t control the “open standard” positioning, they’re weaker in the marketplace.
Databricks Responds by Open Sourcing all of Delta Lake
In response (I assume), Databricks open sourced all of Delta Lake’s APIs as part of the Delta Lake 2.0 release this year.
We announced that Databricks will contribute all features and enhancements it has made to Delta Lake to the Linux Foundation and open source all Delta Lake APIs as part of the Delta Lake 2.0 release.
Databricks also positions Delta Lake as the de facto standard, citing extremely high usage and adoption.
Delta Lake is the fastest and most advanced multi-engine storage format. We’ve seen incredible success and adoption thanks to the reliability and fastest performance it provides. Today, Delta Lake is the most widely used storage layer in the world, with over 7 million monthly downloads; growing by 10x in monthly downloads in just one year.
This “flex” on Iceberg (and Snowflake) attempts to demonstrate dominance in the marketplace when it comes to openness, in particular for data storage formats. This is critical because as more and more data ends up in data lakes and data warehouses, you want to control where that data lands. By endorsing an external project, Snowflake is both (a) taking a stand on where “open data” should land and (b) weakening Delta Lake’s positioning by calling its open source dominance into question.
I haven’t done sufficient market research to understand the overall adoption of the various projects, but what’s interesting is the tit-for-tat positioning we see in the space. Both companies are leveraging “open source” as a tool to compete.
Data ⚔️ Battleground #2 ⚔️: Compute Engines
The competition continues in the compute engine domain. In the storage format battleground, a non-open-source company (Snowflake) used an open source project (Iceberg) to attack a company built around open source (Databricks).
The compute domain is almost the exact opposite! A company built around open source projects (Databricks) created a proprietary, closed source compute engine (Photon) to attack a company based around a proprietary compute engine (Snowflake).
Databricks Targets Snowflake with Serverless SQL
We are excited to announce the availability of serverless compute for Databricks SQL (DBSQL) in Public Preview on AWS today at the Data + AI Summit! DB SQL Serverless makes it easy to get started with data warehousing on the lakehouse. (source)
Now, Serverless SQL is a great move for Databricks: it lets them onboard more analysts onto the platform. The positioning is also clear - why buy both Snowflake and Databricks when data warehousing is a commodity workload and the lakehouse gives you that capability?
I won’t speak to the merits of the system, as I am not familiar with it, but from a pure marketing standpoint it’s a great demonstration of trying to commoditize a competitor’s foundational use case. Additionally, with the open API positioning you’ll find on their website, Databricks can play the trump card of closed source performance when necessary while leveraging openness as a foundational messaging pillar.
The Next Data ⚔️ Battleground ⚔️: ML Compute & Data Apps
This post has explored the two key battlegrounds of compute and storage, but that brings us to the ‘third layer’: the application layer, where business value is derived. This battleground is the newest and still unfolding, but it’s worth mentioning.
Snowflake has historically eschewed machine learning use cases. Spark and Databricks have historically had stronger capabilities in this domain.
Snowpark tries to flip that and give Snowflake an entry point into Python and data science workloads.
Snowpark is a DataFrame abstraction over remote data in Snowflake. Looking at the core API and documentation, it’s basically the Spark DataFrame API. Now, this alone is competitive with Databricks, but when you look at the broader Snowflake strategy, it becomes that much more powerful.
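To see just how similar the two APIs are, here’s a hedged side-by-side sketch of the same query in PySpark and Snowpark Python. The “orders” table, its columns, and the connection parameters are hypothetical.

```python
# Side-by-side sketch: PySpark vs. Snowpark Python. Table, columns, and
# connection details are hypothetical.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from snowflake.snowpark import Session
import snowflake.snowpark.functions as SF

# PySpark: lazily builds a plan that a Spark cluster executes.
spark = SparkSession.builder.getOrCreate()
spark_result = (
    spark.table("orders")
    .filter(F.col("status") == "OPEN")
    .groupBy("region")
    .agg(F.avg("amount"))
)

# Snowpark: a near-identical DataFrame API, except the plan is compiled
# to SQL and executed inside Snowflake's engine.
connection_parameters = {"account": "<account>", "user": "<user>",
                         "password": "<password>", "warehouse": "<wh>",
                         "database": "<db>", "schema": "<schema>"}
session = Session.builder.configs(connection_parameters).create()
snow_result = (
    session.table("orders")
    .filter(SF.col("status") == "OPEN")
    .group_by("region")
    .agg(SF.avg("amount"))
)
```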
Earlier this year, Snowflake bought a small open source data application company called Streamlit for the not-so-small price of $800M. Streamlit is a project that lets users write Python to build data applications.
The power here, in my opinion, comes from the combination of Snowpark and Streamlit. Using DataFrames and Python, you can build entire applications on top of Snowflake’s data and compute.
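As a rough sketch of what that combination could look like - the “orders” table, its columns, and the connection secrets are hypothetical:

```python
# streamlit_app.py - a rough sketch of a data app combining Streamlit and
# Snowpark. The "orders" table, its columns, and the connection secrets
# are hypothetical.
import streamlit as st
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

st.title("Orders by Region")

# Connection details live in Streamlit's secrets file, not in code.
session = Session.builder.configs(dict(st.secrets["snowflake"])).create()

region = st.selectbox("Region", ["AMER", "EMEA", "APAC"])

# The filter is pushed down and executed on Snowflake's compute;
# only the (small) result set comes back to the app.
orders = (
    session.table("orders")
    .filter(col("region") == region)
    .limit(1000)
    .to_pandas()
)
st.dataframe(orders)
```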
This should demonstrate the power of the platform and the intelligence of Snowflake’s strategy - it’s a totally new modality for Snowflake users, one that allows them to go end-to-end on a single platform. Additionally, Snowflake can claim some “openness” along the way: Snowflake says that they’ll continue to build and support the open source project and community (source).
This might obviate the need for OLTP databases entirely, and it will change the way applications are built. It also gives ML practitioners a simple way of “building data apps on the data cloud”. This is Snowflake stepping into a new domain beyond SQL. The bet seems to be that the open source project’s interface isn’t worth nearly as much as the data and compute, so Snowflake is fine ‘giving it away’ (i.e., keeping it open source).
Databricks has a lot more gravity in this domain, having long offered notebooks and primarily code-level interfaces (in addition to SQL), but Snowpark is certainly a step in that direction. This space is evolving quickly, and Python and ML compute will be a battleground for years to come.
Conclusion
I don’t want to get into who does it better or whatnot; both companies have strengths and weaknesses, and that is not the point of this post. The point is to explore how both companies are competing using open source as a weapon against one another. Sometimes it’s open source, sometimes it’s not.
What I think other companies should take away from this is how open and closed source software can be used to compete in unique ways in the marketplace, be it strengthening your own position or weakening a competitor’s.
This saga will only get more intense in the years to come and, I’m sure, will provide plenty of writing material for me in the future. 📝
I originally (mistakenly) attributed the term “lakehouse” to others in the industry. It was coined by Databricks in 2019, and I can’t find earlier references. If you find one, reach out!
At this level of abstraction, it’s probably all workloads of all types but work with me here.
This is historical. Both companies now offer various solutions.