
Build a Data Lakehouse Reporting Structure with dbt and Starburst Galaxy

Evan Smith
Technical Content Manager
Starburst Data
This article compares the benefits of a data warehouse, a data lake, and a data lakehouse. By the end, you will understand the benefits of each and have a clearer idea of which best fits your organization’s needs. Each architecture has its advantages, but adopting a data lakehouse over a traditional data warehouse brings many business benefits, because its more modern structure offers comparable performance at a lower cost.
To begin with, let’s explore the business case for adopting a data lakehouse and see how it can improve versatility, maintain performance, and reduce costs. We’ll compare the differences between a data warehouse and a data lake and then discuss the hybrid advantage of a data lakehouse.
Data lakehouses apply many of the features and benefits typically associated with data warehouses to data lakes. For businesses adopting this new architecture, it offers a powerful, value-added benefit, allowing for a best-of-both-worlds scenario.
But what do those features achieve in a business sense? What are the traditional benefits of a data warehouse, and how does a data lakehouse bring these benefits to the lake?
To answer these questions, let’s examine what made data warehouses beneficial to the businesses that used them.
Data warehouses are highly efficient and perform very well compared to other technologies. This results from the structured nature of the data inside them. Because all data entering the warehouse must conform to a predefined schema when it is written, the system does not have to account for divergent schemas, unstructured data, or other complexities. This limits the scope of the data warehouse and often implies expensive, time-consuming ETL. However, once setup is complete, the data warehouse performs well within this designated scope.
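To make the idea of conforming to a predefined schema at write time more concrete, here is a minimal Python sketch. It is only an illustration, not how any particular warehouse is implemented: the `ORDERS_SCHEMA` table definition, the `SchemaError` class, and the `validate` and `write_rows` helpers are all hypothetical names invented for this example.

```python
# Hypothetical warehouse schema: column name -> required Python type.
ORDERS_SCHEMA = {"order_id": int, "customer": str, "amount": float}


class SchemaError(ValueError):
    """Raised when a record does not match the predefined schema."""


def validate(record: dict, schema: dict) -> dict:
    """Return the record unchanged if it matches the schema exactly, else raise."""
    missing = schema.keys() - record.keys()
    extra = record.keys() - schema.keys()
    if missing or extra:
        raise SchemaError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for column, expected in schema.items():
        if not isinstance(record[column], expected):
            raise SchemaError(f"{column!r} should be {expected.__name__}")
    return record


def write_rows(table: list, records: list, schema: dict) -> int:
    """Validate every record first, then append; one bad record rejects the batch."""
    validated = [validate(r, schema) for r in records]
    table.extend(validated)
    return len(validated)
```

Because every row is checked before it lands, anything reading `table` later can rely on the column names and types. The cost is the up-front work of shaping each record to fit, which is the ETL burden described above.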
Data warehouses have traditionally been reliable. This also stems from the structured nature of the data inside them. Because all ambiguities and complexities are ironed out before data enters the warehouse, the resulting system is often very stable. Since all data is structured according to the same schema or schemas, both the system and the user know what to expect when new data arrives, which helps warehouses achieve high reliability.
SQL is used widely in data warehouses and offers an accessible, common language for querying. This has traditionally been a massive benefit to businesses as knowledge of SQL is more common in many organizations than alternatives. SQL has a long history in data science and analysis, and the language is versatile, adaptable, and agile compared with other options.
A data lake is a more modern technology than a data warehouse. In fact, data lakes offer an alternative approach to data storage that is less structured, less expensive, and more versatile. When they were first introduced, these changes revolutionized data science and kick-started big data as we know it today. In this sense, the movement towards data lakehouses is just the continuation of a longstanding shift away from traditional data warehouses towards data lakes based on cloud object storage.
Read through the list of benefits below to learn more about why organizations deploy data lakes. Data lakehouses inherit these benefits and build additional functionality and value on top of them.
In the past, compute and storage resources were combined on the same machines. This was due to the prevalence of on-premises warehouse systems, and the practice was continued with early data lakes based on the Hadoop Distributed File System (HDFS).
In contrast, modern data lakes based on cloud object storage allow for the separation of compute and storage, ensuring that each resource can be scaled as needed. This is often one of the main ways that data lakes reduce cost using the cloud.
Unlike data warehouses, data lakes store data in many structures. This includes structured, semi-structured, and unstructured data. Additionally, data entering the lake does not necessarily need to be schematized in advance. Instead, it can be left in a raw format until needed, using a process known as schema on read. This advantage allows data lakes to house a wide variety of data structures more easily and more cost-effectively than the data warehouses of the past.
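Schema on read can be sketched in a few lines of Python. This is an illustrative example only: the sample events in `RAW_LINES` and the `read_with_schema` helper are made up for the sketch, and a real query engine would do this work over files in object storage rather than over an in-memory list.

```python
import json

# Raw events land in the lake as-is; note that their shapes differ.
RAW_LINES = [
    '{"user": "ana", "amount": "12.50", "device": {"os": "ios"}}',
    '{"user": "raj", "amount": 7}',
    '{"user": "li", "note": "no amount recorded"}',
]


def read_with_schema(lines, schema):
    """Project each raw JSON line onto a schema chosen at query time.

    schema maps column name -> cast function. Missing or uncastable
    values become None instead of blocking ingestion.
    """
    for line in lines:
        raw = json.loads(line)
        row = {}
        for column, cast in schema.items():
            try:
                row[column] = cast(raw[column])
            except (KeyError, TypeError, ValueError):
                row[column] = None
        yield row


rows = list(read_with_schema(RAW_LINES, {"user": str, "amount": float}))
```

Nothing was rejected when the data arrived. The schema is a decision made by the reader, and a different query could apply a different schema to the same raw lines.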
Many data lakes make use of cloud object storage, such as Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage. Compared to traditional data warehouses, cloud object storage is very inexpensive, owing to the massive economies of scale involved in cloud operation and to the nature of object storage itself. For this reason, data lakes are often by far the most economical option for businesses, especially when compared to costly data warehouses.
Data lakes are highly scalable, especially when using cloud object storage. This is true of both storage and the compute resources needed to query them.
For example, a business’s data needs change over time. As storage requirements increase, more cloud object storage can be added. At the same time, if querying increases and more compute resources are needed, these can be scaled independently to meet demand.
This agility—being able to tailor storage for storage needs and compute for compute needs—contrasts with previous systems that required significant data architecture and planning to scale effectively. Data lakes’ comparative agility is one of their primary advantages.
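The arithmetic behind independent scaling can be shown with a small Python sketch. The node sizes and workload figures in the usage note are hypothetical placeholders chosen only to illustrate the effect, and `coupled_nodes` and `decoupled_resources` are names invented for this example.

```python
import math


def coupled_nodes(storage_tb, compute_units, tb_per_node, units_per_node):
    """Nodes needed when each node bundles a fixed amount of storage and compute."""
    return max(
        math.ceil(storage_tb / tb_per_node),
        math.ceil(compute_units / units_per_node),
    )


def decoupled_resources(storage_tb, compute_units, units_per_node):
    """Storage is bought per TB; compute nodes are sized only by query load."""
    return storage_tb, math.ceil(compute_units / units_per_node)


# Hypothetical storage-heavy workload: 500 TB of data, 8 units of query load.
nodes = coupled_nodes(500, 8, tb_per_node=10, units_per_node=4)
storage_tb, compute_nodes = decoupled_resources(500, 8, units_per_node=4)
```

With these placeholder numbers, the coupled design needs 50 nodes just to hold the data, even though 2 nodes would cover the query load. The decoupled design buys 500 TB of storage and 2 compute nodes, and either figure can later grow without touching the other.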
Today, data lakes often use cloud object storage, which is the most versatile and least expensive option for most businesses. However, data lakes may also use Hadoop HDFS, which can be an advantage for some legacy systems, especially those on-premises. The ability to build a data lake on either technology is one of its key advantages.
The ability of data lakes to record large amounts of raw data in a semi-structured or unstructured form makes them especially useful for machine learning. The data in the lake can be used to feed data science models or queried using Python, Scala, or R. The recent abundance of unstructured data, coupled with the desire to create insights from it, has made data lakes especially valued for these purposes.
This ability to harness unstructured data also makes data lakes an ideal technology for Artificial Intelligence (AI) modeling. In fact, AI and large language models (LLMs) are growing rapidly as an evolving use case of data lakes.
By combining the best of data lakes and the best of data warehouses, data lakehouses offer many benefits. Their emergence also represents the next stage in the evolution of the data lake, adding additional features and functionality to address a variety of business needs better.
The first thing to understand is that a data lake and a data lakehouse are not entirely different technologies. In fact, the underlying storage technology used in a data lakehouse is very similar to a data lake in many key ways. Both are built on the same cloud object storage, and both allow for inexpensive data storage in multiple structures.
However, the data lakehouse collects and stores more metadata using a modern table format like Iceberg. Because of this, it performs better than the traditional data lake in certain key areas. This key architectural difference allows organizations to gain additional functionality compared to a traditional lake while sacrificing nothing.
Data lakehouses make the data lake more accessible to people in the organization who would not otherwise have benefited, and who might have been forced to use a costly data warehouse instead. Data is no longer gated or suitable only for data engineers. At the same time, this shift comes without sacrificing the original use cases that made data lakes popular in the first place. Overall, organizations that adopt a lakehouse architecture enjoy:
Data lakehouses create an improved governance layer between raw data and consumable data. This allows for a management style that is more in line with a data warehouse or database but achievable using data lake technology, with all of its inherent cost benefits.
Data lakehouses introduce ACID compliance and greater support for transactional data than traditional data lakes. This is particularly important in some industries and some use cases. It means that organizations that may have had to employ separate systems before can now consider using a single system.
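One common way to get atomic, isolated commits on top of plain files is optimistic concurrency: a writer prepares its changes, then swaps a single metadata pointer only if nobody else committed in the meantime. The Python sketch below illustrates that general idea. It is a toy model and not the API of Iceberg or any other table format; `Catalog`, `CommitConflict`, and `append` are hypothetical names.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class Catalog:
    """Holds one pointer: the current table version and its list of data files."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.files = ()  # immutable tuple of file names

    def snapshot(self):
        with self._lock:
            return self.version, self.files

    def commit(self, base_version, new_files):
        """Atomically swap in new_files only if nobody committed since base_version."""
        with self._lock:
            if base_version != self.version:
                raise CommitConflict(
                    f"table is at v{self.version}, not v{base_version}"
                )
            self.files = tuple(new_files)
            self.version += 1
            return self.version


def append(catalog, file_name, retries=3):
    """Re-read the latest snapshot and retry if a concurrent writer wins."""
    for _ in range(retries):
        version, files = catalog.snapshot()
        try:
            return catalog.commit(version, files + (file_name,))
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")
```

Readers always see either the old file list or the new one, never a half-written state, and a writer working from a stale version is rejected rather than silently overwriting someone else's commit.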
Modern table formats like Iceberg allow for schema evolution and enhanced functionality when updating or deleting data from a table. Data lakes based around cloud object storage typically include immutable storage, causing problems when the data needs to be updated or deleted. Schema evolution, partition evolution, and time travel are some of the key features that draw people to data lakehouses, particularly those constructed using the Iceberg table format and other modern table formats like Delta Lake and Apache Hudi.
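The way a metadata layer makes deletes and time travel possible over immutable storage can be sketched with a toy Python model. This is an assumption-laden illustration rather than the actual design of Iceberg, Delta Lake, or Hudi: the `SnapshotTable` class and the `data-N.parquet` file names are hypothetical, and rows are kept in memory for simplicity.

```python
class SnapshotTable:
    """Toy table: data files are never modified; each change adds a snapshot."""

    def __init__(self):
        self._files = {}        # file name -> tuple of rows (written once)
        self._snapshots = [()]  # snapshot id -> tuple of live file names

    def _commit(self, live_files):
        self._snapshots.append(tuple(live_files))
        return len(self._snapshots) - 1  # new snapshot id

    def append(self, rows):
        name = f"data-{len(self._files)}.parquet"  # hypothetical file name
        self._files[name] = tuple(rows)
        return self._commit(self._snapshots[-1] + (name,))

    def delete_where(self, predicate):
        """Copy-on-write delete: rewrite affected files, keep old ones for history."""
        live = []
        for name in self._snapshots[-1]:
            rows = self._files[name]
            kept = tuple(r for r in rows if not predicate(r))
            if len(kept) == len(rows):
                live.append(name)  # untouched file is reused as-is
            elif kept:
                new_name = f"data-{len(self._files)}.parquet"
                self._files[new_name] = kept
                live.append(new_name)
        return self._commit(live)

    def scan(self, snapshot_id=None):
        """Read the latest snapshot, or an older one for time travel."""
        sid = len(self._snapshots) - 1 if snapshot_id is None else snapshot_id
        return [row for name in self._snapshots[sid] for row in self._files[name]]
```

A short usage example: after `t = SnapshotTable()`, two calls to `t.append(...)` and one call to `t.delete_where(...)`, `t.scan()` returns the table without the deleted rows, while `t.scan(2)` still returns the table as it looked after the second append. No file was ever changed in place; only the list of live files moved forward.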
Lakehouses use modern, open table formats, including Iceberg, which involve fewer required operations compared to Hive. In the past, Hive was innovative, but Iceberg, Delta Lake, and Hudi are far superior and represent one of the main architectural reasons for the increased features and performance found in lakehouses.
Lakehouses also reduce reliance on the Hive Metastore (HMS) compared to traditional data lakes. This is particularly true when using Starburst Galaxy, which includes its own metastore optimized for lakehouse table formats. AWS Glue is also supported, offering further customization.
The move to a data lakehouse comes with many organizational advantages. When businesses move to a modern lakehouse format, they are able to achieve the following benefits.
Data lakehouses offer much better support for transactional systems compared to traditional data lakes. This is achieved by the unique way that data lakehouses handle metadata.
This allows organizations that adopt a data lakehouse to take a more active approach to building business insights based on real-time, up-to-date data, which improves data reliability and the value of the insights derived.
In the past, the limitations of a data lake meant that organizations needed to run a costly data warehouse alongside it. Now, with its increased functionality, the data lakehouse either reduces or eliminates the need for a warehouse.
Lakehouses help to reduce costs by transitioning data from costly data warehouses to more efficient cloud object storage. Cloud object storage is by far the least expensive storage medium and helps drive efficiencies by separating compute and storage costs.
Lakehouses based around modern table formats like Iceberg are more performant than traditional data lakes. In fact, their performance is more comparable to that of data warehouses. Because of this, adopting a data lakehouse saves time and effort, reducing costs.
Data lakehouses are built on top of cloud object storage. These services are available from multiple cloud vendors, and the data stored inside them uses a standard, open table format, typically Iceberg. This means that it is easy to copy files from one vendor to another if needed. In this way, organizations can avoid licensing software from a single company, reducing vendor lock-in.
Adopting a data lakehouse is not an all-or-nothing proposition. Often, businesses do not fully replace their current solutions; instead, they augment them with the addition of a lakehouse at first. This allows them to test and optimize the solution to fit their needs.
Read through the table below to learn more.
Data warehouses vs. data lakehouses
High performance | Lakehouses are highly performant, approaching the performance found in data warehouses.
Reliability | Data lakehouses include an improved transactional layer, increasing reliability. This makes their performance similar to that of a database or data warehouse.
Full CRUD | Lakehouses support full CRUD: create, read, update, and delete.

Data lakes vs. data lakehouses
Low cost | Like data lakes, lakehouses make use of low-cost cloud object storage. Organizations can use a variety of solutions, including AWS, Azure, and GCP.
Multiple types of data | In common with data lakes, lakehouses can ingest structured, semi-structured, and unstructured data.
Adaptability | Lakehouses are just as adaptable as data lakes.