Course description - Adept Events

The course starts at 09.30 am and ends at 5 pm. Registration commences at 08.30 am.

MODULE 1: STRATEGY & PLANNING

This session introduces the data lake and the need for a data strategy, and looks at why companies need one. It covers what should be in your data strategy, the operating model needed to implement it, the types of data you have to manage and the scope of implementation. It also looks at the policies and processes needed to bring your data under control.

The ever increasing distributed data landscape
The siloed approach to managing and governing data
IT data integration, self-service data wrangling or both? – data governance or data chaos?
Key requirements for data management
- Structured data – master, reference and transaction data
- Semi-structured data – JSON, BSON, XML
- Unstructured data – text, video
- Re-usable services to manage data
Dealing with new data sources – cloud data, sensor data, social media data, smart products (the internet of things)
Understanding scope of your data lake
- OLTP system sources
- Data Warehouses
- Big Data systems e.g. Hadoop
- MDM and RDM systems
- Data virtualisation
- Streaming data
- Enterprise Content Management
Building a business case for data management
Defining an enterprise data strategy
A new inclusive approach to governing and managing data
Introducing the data lake and data refinery
Data lake configurations – what are the options?
Centralised, distributed or logical data lakes
Information Supply Chain use cases – establishing a multi-purpose data lake
The rising importance of an Information catalog
Key technology components in a data lake
Hadoop as a data staging area and why it is not enough
Implementation run-time options – the need to execute in multiple environments
Integrating a data lake into your enterprise analytical architecture

MODULE 2: INFORMATION PRODUCTION METHODOLOGIES

Having understood strategy, this session looks at why information producers need to use multiple methodologies in a data lake information supply chain to produce trusted structured and multi-structured data that information consumers can use to drive business value.

Information production and information consumption
A best practice step-by-step methodology for structured data governance
Why the methodology has to change for semi-structured and unstructured data
Methodologies for structured vs multi-structured data

MODULE 3: DATA STANDARDISATION, THE BUSINESS GLOSSARY AND THE INFORMATION CATALOG

This session looks at the need for data standardisation of structured data and of new insights from processing unstructured data. The key to making this happen is to create common data names and definitions for your data to establish a shared business vocabulary (SBV). The SBV should be defined and stored in a business glossary, where it helps information consumers understand published data in a data lake. The session also looks at the emergence of more powerful information catalog software and how business glossaries have become part of what a catalog offers.

Semantic data standardisation using a shared business vocabulary within an information catalog
The role of a common vocabulary in MDM, RDM, SOA, DW and data virtualisation
Why is a common vocabulary relevant in a data lake and a Logical Data Warehouse?
How does an SBV apply to data in a Hadoop data lake?
Approaches to creating a common vocabulary
Business glossary products storing common business data names, e.g. Alteryx Connect Glossary, ASG, Collibra, Global IDs, Informatica, IBM Information Governance Catalog, Microsoft Azure Data Catalog Business Glossary, SAP Information Steward Metapedia, SAS Business Data Network, TIBCO Information Server
Planning for a business glossary
Organising data definitions in a business glossary
Key roles and responsibilities – getting the operating model right to create and manage an SBV
Formalising governance of business data names, e.g. the dispute resolution process
Business involvement in SBV creation
Beyond structured data – from business glossary to information catalog
What is an Information Catalog?
Why are information catalogs becoming critical to data management?
Information catalog technologies, e.g. Alation, Alteryx Connect, AWS Glue, Apache Atlas, Collibra Catalog, IBM Information Governance Catalog & Watson Knowledge Catalog, Informatica EIC & Live Data Map, Microsoft Azure Data Catalog, Podium Data, Waterline Data, Zaloni Mica
Information catalog capabilities

MODULE 4: ORGANISING AND OPERATING THE DATA LAKE

This session looks at how to organise data so that it remains manageable in a complex data landscape. It looks at zoning, versioning, the need for collaboration between business and IT, and the use of an information catalog in managing the data.

Organising data in a centralised or distributed data lake
Creating zones to manage data
New requirements for managing data in centralised and distributed data lakes
Creating collaborative data lake projects
Hadoop as a staging area for enterprise data cleansing and integration
Core processes in data lake operations
The data ingestion process
Tools and techniques for data ingestion
Implementing systematic disparate data and data relationship discovery using Information catalog software
Using domains and machine learning to automate and speed up data discovery and tagging
Alation, IBM Watson Knowledge Catalog, Informatica CLAIRE, Silwood, Waterline Data Smart Data Catalog
Automated profiling and tagging and cataloguing of data
Automated data mapping
The data classification and policy definition processes
Manual and automated data classification to enable governance
Using tag based policies to govern data

MODULE 5: THE DATA REFINERY PROCESS

This session looks at the process of refining data to produce trusted information.

What is a data refinery?
Key requirements for refining data
The need for multiple execution engines to run in multiple environments
Options for refining data – ETL versus self-service data preparation
Key approaches to scalable ETL data integration using Apache Spark
Self-service data preparation tools for Spark and Hadoop, e.g. Alteryx Designer, Informatica Intelligent Data Lake, IBM Data Refinery, Paxata, Tableau (Project Maestro), Tamr, Talend, Trifacta
Automated data profiling using analytics in data preparation tools
Executing data refinery jobs in a distributed data lake using Apache Beam to run anywhere
Approaches to integrating IT ETL and self-service data preparation
Apache Atlas Open Metadata & Governance
Joined up analytical processing from ETL to analytical workflows
Publishing data and data integration jobs to the information catalog
Mapping produced data of value into your DW and business vocabulary
Data provisioning – provisioning consistent information into data warehouses, MDM systems, NoSQL DBMSs and transaction systems
Provisioning consistent refined data using data virtualisation, a logical data warehouse and on-demand information services
Governing the provisioning process using rules-based metadata
Consistent data management across cloud and on-premise systems

MODULE 6: REFINING BIG DATA & DATA FOR DATA WAREHOUSES

This session looks at how the data refining processes can be applied to managing, governing and provisioning data in a Big Data analytical ecosystem and in traditional data warehouses. How do you deal with very large data volumes and different varieties of data? How do you load and process data in Hadoop? How should low-latency data be handled? Topics that will be covered include:

A walk-through of end-to-end data lake operation to create a Single Customer View
Types of big data & small data needed for single customer view and the challenge of bringing it together
Connecting to Big Data sources, e.g. web logs, clickstream, sensor data, unstructured and semi-structured content
Ingesting and analysing clickstream data
The challenge of capturing external customer data from social networks
Dealing with unstructured data quality in a Big Data environment
Using graph analysis to identify new relationships
The need to combine big data, master data and data in your data warehouse
Matching big data with customer master data at scale
Governing data in a Data Science environment

MODULE 7: INFORMATION AUDIT & PROTECTION – THE FORGOTTEN SIDE OF DATA GOVERNANCE

Over recent years we have seen many major brands suffer embarrassing publicity due to data security breaches that have damaged their brand and reduced customer confidence. With data now highly distributed and so many technologies in place that offer audit and security, many organisations end up with a piecemeal approach to information audit and protection. Policies are everywhere, with no single view of the policies associated with securing data across the enterprise. The number of administrators involved is often difficult to determine, and regulatory compliance now demands that data is protected and that organisations can prove this to their auditors. So how are organisations dealing with this problem? Are the same data privacy policies enforced everywhere? How is data access security co-ordinated across portals, processes, applications and data? Is anyone auditing privileged user activity? This session defines this problem, looks at the requirements for Enterprise Data Audit and Protection and then looks at what technologies are available to help you integrate this into your data strategy.

What is Data Audit and Security and what is involved in managing it?
Status check – Where are we in data audit, access security and protection today?
What are the requirements for enterprise data audit, access security and protection?
What needs to be considered when dealing with the data audit and security challenge?
Automatic data discovery and the information catalog – a huge help in identifying sensitive data
What about privileged users?
Using a data management platform and information catalog to govern data across multiple data stores
Securing and protecting data using tag based policies in an information catalog
What technologies are available to protect data and govern it? – Apache Knox, Cloudera Sentry, Dataguise, Hortonworks Ranger, IBM (Watson Data Platform, Knowledge Catalog, Optim & Guardium), Imperva, Informatica Secure@Source, Micro Focus, Privitar
Can these technologies help with GDPR compliance?
How do they integrate with Data Governance programs?
How to get started in securing, auditing and protecting your data.