Snowflake or Databricks? A Quick Primer
Data has never been more important as businesses turn to data-driven decisions and AI to pull ahead of their competitors. In conversations about business intelligence (BI) or data analytics, Snowflake and Databricks are two names that keep coming up.
Which of them is better for data science? And how do they compare in terms of capabilities? We take a closer look at both Snowflake and Databricks.
Snowflake is a fully managed relational database management system and analytics data warehouse designed for structured and semi-structured data. It is powered by a proprietary SQL query engine with an innovative architecture written for the cloud by database veterans.
The architecture places cloud services such as security, infrastructure management, and access controls on top, while query processing and database storage sit below. Snowflake's own documentation describes the design as a hybrid of shared-disk and shared-nothing architectures rather than a purely shared-nothing one.
For its part, Databricks is based on Apache Spark, and the company itself was founded by the original creators of Spark. Under the hood, Databricks uses Apache Spark’s distributed computing framework and eases the management of the underlying infrastructure.
Databricks is probably best known for its data lake innovation, which was developed in response to the limitations of data warehouses. A data lake holds data in its native, raw format while still allowing insights to be generated from it. This allows organizations to consolidate their data quickly and easily in a single location.
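To make the idea of keeping data in its native, raw format more concrete, here is a minimal sketch in plain Python. It is not Databricks code: the in-memory `lake` dictionary, the file paths, and the `read_records` helper are all hypothetical stand-ins. The point it illustrates is that raw files are stored untouched, and a structure is applied only when the data is read.

```python
import csv
import io
import json

# Hypothetical "lake": raw payloads kept byte-for-byte in their native formats.
lake = {
    "orders/2024-01.csv": b"order_id,amount\n1,19.99\n2,5.00\n",
    "events/clicks.json": b'[{"user": "a", "clicks": 3}, {"user": "b", "clicks": 7}]',
}


def read_records(path):
    """Interpret a raw file only at read time, based on its extension."""
    raw = lake[path].decode("utf-8")
    if path.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(raw)))
    if path.endswith(".json"):
        return json.loads(raw)
    raise ValueError(f"no reader registered for {path}")


# Insights are computed on demand, without converting the stored files first.
total_amount = sum(float(row["amount"]) for row in read_records("orders/2024-01.csv"))
total_clicks = sum(event["clicks"] for event in read_records("events/clicks.json"))
```

A warehouse, by contrast, would typically require both sources to be loaded into predefined tables before they could be queried.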
Snowflake was built as a serverless solution with separate storage and compute layers. It relies on MPP (massively parallel processing) compute clusters, with each node in the cluster storing a portion of the entire data set locally, offering both simplicity and performance.
Data in Snowflake is accessed through SQL queries, and the Snowflake engine itself manages everything from file size, compression, and structure to metadata and statistics. In a nutshell, Snowflake’s technology makes it simple and extremely fast for data analysts or data scientists to access data using SQL.
As noted earlier, Databricks is based on Spark, a multi-language engine that runs on single nodes or clusters and can be deployed in the cloud. Databricks operates with a control plane and a data plane.
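The MPP idea can be sketched with a toy example: split the data across nodes, let each node aggregate only its own slice, then combine the partial results. This is an illustration under stated assumptions, not Snowflake's implementation. The `partition`, `local_sum`, and `parallel_sum` functions are hypothetical names, and Python threads stand in for cluster nodes purely to show the structure of the computation.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, node_count):
    """Split rows round-robin so each hypothetical node holds one slice."""
    return [rows[i::node_count] for i in range(node_count)]


def local_sum(node_rows):
    """The work a single node performs, using only its own slice."""
    return sum(node_rows)


def parallel_sum(rows, node_count=4):
    """Fan the slices out to the nodes, then combine the partial results."""
    slices = partition(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(local_sum, slices))
    return sum(partials)


sales = list(range(1, 101))
result = parallel_sum(sales)
```

Because no node needs another node's slice to compute its partial sum, adding nodes shrinks the work per node, which is the property MPP systems rely on.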
The control plane includes all backend services managed by Databricks, including notebook commands and other workspace configurations. The data plane includes any data that is processed and resides in the customer’s cloud account.
Ease of use
According to a report in eWeek, Snowflake is user-friendly, with an intuitive SQL interface that makes it easy to get up and running. It also has plenty of automation features to facilitate ease of use and is well integrated with top data analytics tools such as Cognos, Tableau, and Qlik.
With native support for notebooks, Databricks is arguably a better fit for data science collaboration and productivity. It also offers production-ready data tooling that spans data engineering, BI, AI, and ML. Moreover, Databricks can load external data in formats ranging from CSV to zip files, and it connects to database systems such as Cassandra, MongoDB, and even Snowflake.
Both Databricks and Snowflake are designed as SaaS offerings, which translates to strong scalability. Extensive automation and a cloud-native design make scaling up and down easier with Snowflake.
Because processing and storage layers scale independently in Snowflake, real-time scaling is possible without disrupting existing queries. And since workloads are isolated on dedicated resources, Snowflake claims “near-infinite” scalability.
Databricks also offers auto-scaling that adds workers in response to the workload and removes idle workers once the platform has been completely idle for some time. However, making changes in Databricks requires substantially more effort than in Snowflake.
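A workload-driven auto-scaling policy of this kind can be sketched in a few lines. The example below is a simplified model rather than Databricks' actual algorithm: the `Autoscaler` class, its thresholds, and the notion of a "tick" are all assumptions made for illustration. It sizes the cluster to the pending work and falls back to the minimum size only after several consecutive idle ticks.

```python
from dataclasses import dataclass


@dataclass
class Autoscaler:
    # All parameters are hypothetical values chosen for illustration.
    min_workers: int = 1
    max_workers: int = 8
    tasks_per_worker: int = 4
    idle_ticks_before_removal: int = 3
    workers: int = 1
    idle_ticks: int = 0

    def step(self, pending_tasks: int) -> int:
        """Advance one tick and return the resulting worker count."""
        if pending_tasks == 0:
            # Count consecutive idle ticks; shrink only after a sustained lull.
            self.idle_ticks += 1
            if self.idle_ticks >= self.idle_ticks_before_removal:
                self.workers = self.min_workers
            return self.workers
        self.idle_ticks = 0
        needed = -(-pending_tasks // self.tasks_per_worker)  # ceiling division
        self.workers = max(self.min_workers, min(self.max_workers, needed))
        return self.workers


scaler = Autoscaler()
history = [scaler.step(load) for load in [10, 40, 0, 0, 0, 2]]
```

With these assumed settings, the worker count climbs with the load, is capped at `max_workers`, and holds steady through the first two idle ticks before dropping back to the minimum.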
To be clear, Snowflake and Databricks are each purpose-built to excel at different tasks, so a genuine apples-to-apples comparison is hardly possible. Moreover, the field is evolving rapidly, with new capabilities constantly added either through R&D efforts or through the acquisition of smaller firms.
For instance, Snowflake in March 2022 acquired the popular machine learning and data science app framework Streamlit for USD 800 million. The acquisition is expected to make it easier for data scientists to build and share data applications quickly without having to become experts in front-end development.
Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].
Image credit: iStockphoto/Doucefleur