Around 2015, companies in the Netherlands started migrating their on-premises data warehouses to the public cloud. When doing so, it is important to realise that it is not always sensible to keep the deployed physical data models exactly as they were on-premises. The new technological possibilities not only enable new approaches, but can also turn established physical modelling techniques such as dimensional modelling (Kimball) or Data Vault into anti-patterns, or at least require a slightly different implementation of them. The goal of this session is to give insight into the (im)possibilities in this area and to show how they can be tackled in practice in solutions like Snowflake or Google BigQuery.
Examples of physical data modelling topics we will cover:
- The use of the semi-structured VARIANT data type in Snowflake: the enabler of a true ‘separation of concerns’ between efficient initial data storage and the schema-on-read phase, and its use in Data Vault satellites (see the first sketch after this list)
- The re-introduction of dimensional / denormalised structures in the integration layer if one also implements a historical staging layer
- The (im)possibilities of partitioning / clustering in BigQuery / Snowflake and why getting this right is essential for scalable performance and controlling costs (illustrated after this list)
- Whether to use concatenated business keys, hash keys or (integer) surrogate keys (compared in the final sketch after this list).
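To make the VARIANT point concrete, here is a minimal Snowflake sketch (the raw_orders table and the payload attributes are hypothetical, not taken from the session material): the data is stored once, untouched, as VARIANT, and a schema is only imposed at read time.

```sql
-- Land raw JSON as-is: one VARIANT column, no upfront schema (Snowflake).
CREATE TABLE raw_orders (
    load_ts  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    payload  VARIANT
);

-- Schema-on-read: project attributes out of the VARIANT only when consuming,
-- e.g. when loading a Data Vault satellite or a reporting view.
SELECT
    payload:order_id::NUMBER        AS order_id,
    payload:customer:name::STRING   AS customer_name,
    payload:order_date::DATE        AS order_date
FROM raw_orders;
```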
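For the partitioning / clustering bullet, a hedged sketch of how the same layout decision looks in both engines (dataset, table and column names are assumptions): BigQuery asks you to declare the partition and clustering keys in the DDL, while Snowflake micro-partitions automatically and only optionally takes an explicit clustering key.

```sql
-- BigQuery: partition by day and cluster on a frequently filtered key,
-- so queries that filter on event_ts / customer_id prune storage and cost.
CREATE TABLE analytics.page_events
(
  event_ts     TIMESTAMP,
  customer_id  STRING,
  url          STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id;

-- Snowflake: no explicit partitions; micro-partitions are pruned automatically,
-- optionally helped by a clustering key on very large tables.
ALTER TABLE page_events CLUSTER BY (TO_DATE(event_ts), customer_id);
```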
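And for the key-choice bullet, a small illustrative comparison in Snowflake syntax (stg_customer and its columns are assumptions, used only to show the three options side by side):

```sql
-- Three ways to key a customer dimension or hub, from one staging table.
CREATE TABLE dim_customer AS
SELECT
    -- 1. Concatenated business key: readable, but variable length and type-mixed.
    UPPER(TRIM(customer_nr)) || '|' || UPPER(TRIM(source_system))      AS business_key,
    -- 2. Hash key (Data Vault style): fixed width, computable in parallel per source.
    MD5(UPPER(TRIM(customer_nr)) || '|' || UPPER(TRIM(source_system))) AS hash_key,
    -- 3. Integer surrogate key (ROW_NUMBER here for illustration;
    --    in practice typically a SEQUENCE or IDENTITY column).
    ROW_NUMBER() OVER (ORDER BY customer_nr, source_system)            AS surrogate_key,
    customer_nr,
    source_system
FROM stg_customer;
```

The hash key can be derived independently per source without a central sequence, which is why Data Vault implementations on these platforms tend to prefer it over integer surrogates for parallel loading.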
Session highlights
- What does a ‘modern way of data warehousing’ look like and how does it differ from the classical approach?
- What is the functional role of each of the four layers of this modern data warehouse?
- What are the main advantages and disadvantages of the most widely used cloud analytical databases: Snowflake, Google BigQuery, Amazon Redshift and Azure Synapse?
- What are appropriate physical data modelling techniques to deploy per data warehouse layer (with a focus on Snowflake and BigQuery) and why?
- Lessons learned in Snowflake and BigQuery: what works and what doesn’t when physically implementing these data models?