This session discusses the data lakehouse, the new kid on the block in the world of data architectures. In a nutshell, the data lakehouse is a combination of a data warehouse and a data lake: an architecture designed to support a typical data warehouse workload plus a data lake workload. It holds structured, semi-structured, and unstructured data. Technically, in a data lakehouse the data is stored in files that can be accessed by any type of tool and database server; the data is not held hostage by a specific database server. SQL engines can access that data efficiently for more traditional business intelligence workloads, and data scientists can create their descriptive and prescriptive models directly on the same data.
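To make the idea of open, engine-independent storage concrete, here is a minimal Python sketch. The file name and data are hypothetical, and pandas and DuckDB are chosen purely for illustration; the point is that one Parquet file can serve both a SQL-based BI query and a DataFrame-based data science workload:

```python
import pandas as pd
import duckdb

# Hypothetical sales data; any tool that writes Parquet could produce this.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "amount": [120.0, 75.5, 200.0, 50.25],
})
sales.to_parquet("sales.parquet")  # open file format, not tied to one server

# BI-style workload: a SQL engine queries the file directly.
totals = duckdb.sql(
    "SELECT region, SUM(amount) AS total FROM 'sales.parquet' GROUP BY region"
).df()
print(totals)

# Data-science workload: the same file loaded as a DataFrame for modeling.
df = pd.read_parquet("sales.parquet")
print(df.describe())
```

Because the data sits in an open format, either tool could be swapped for another engine without copying or converting the data.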
It makes a lot of sense to combine these two worlds, because they share the same data and the same logic. But is this really possible, or is it too good to be true? This session discusses various aspects of data warehouses and data lakes to determine whether the data lakehouse is marketing hype or a valuable and realistic new data architecture.
- The importance of combining the BI use case and the data science use case in one architecture
- The relationship between the data lakehouse architecture and SQL-on-Hadoop engines
- Why comparisons of the data warehouse, data lake, and data lakehouse are biased
- Missing components of the data lakehouse
- The practical advantages of storing data in open file formats
- Whether the data lakehouse is a business pull or a technology push