CONFERENCE PROGRAM

Data Quality & Reliable Analytics: a use case for dbt Tests

Data quality is essential for reliable analytics and modelling. When transforming data, robust pipelines are needed to guarantee that the data is trustworthy and usable.

In this talk, I will go through different aspects of testing: 

- Avoid faulty data that causes unreliable dashboards and wrong predictions

- Spot unexpected patterns and trends

- Integrate data transformation tests into CI/CD pipelines

I will explain how we use dbt Core tests in our current project to address these challenges, along with their capabilities and main limitations.

 

Target Audience: Data Engineers, Data Scientists, and Analysts
Prerequisites: No particular technical prerequisites are required. The audience should be familiar with basic concepts of software development (testing, CI/CD).
Level: Basic

Extended Abstract:
Data quality is essential for reliable analytics and modelling. When transforming data, robust pipelines are needed to guarantee that the data is trustworthy and usable. The whole data process, from ingestion to analysis and outcomes, involves the entire team: engineers, data analysts, and even business managers.

In this talk, I will go through different aspects of testing data:

  1. Avoid faulty data that causes unreliable dashboards and wrong predictions. How to enforce data quality by catching issues early in the pipeline and preventing bad data from moving downstream.
  2. Spot unexpected patterns and trends. Test your KPIs and metrics to spot anomalies and unexpected patterns in your results.
  3. Integrate data transformation tests into CI/CD pipelines. Like other application code, data transformations can be tested in CI/CD pipelines to ensure robustness and avoid wrong operations in production.

I will focus on the current project I am working on, where we use a data model based on dbt Core (the open-source version of dbt) and BigQuery. dbt (data build tool) is a data transformation tool based on modular data pipelines. It defines data transformations as modules that can be executed to create tables and that are built upon other modules or sources. It is highly accessible to data analysts because it uses standard SQL, while staying close to the development mindset by treating analytics like software.
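As a minimal sketch of what such a module looks like (the model and column names below are hypothetical, not taken from the project described here):

```sql
-- models/orders_daily.sql -- a hypothetical dbt model
-- ref() points at another model; dbt resolves the reference, builds the
-- dependency graph, and materializes this query as a table or view.
select
    order_date,
    count(*)    as n_orders,
    sum(amount) as total_amount
from {{ ref('stg_orders') }}
group by order_date
```

Running `dbt run` then executes all modules in dependency order.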

One of dbt's most important features is the ability to test data in various scenarios. But what capabilities do dbt tests actually offer?

These are some of the ones that we use in our data model:

  1. Singular and (custom) generic data tests to check for potential issues in the tables (e.g., non-unique values, missing values)
  2. Source freshness to test data recency
  3. Elementary tests for metrics and KPIs
  4. Unit tests to test complex, incremental data transformations
  5. Integration tests incorporated in our CI/CD Pipelines
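To illustrate points 1 and 2, generic column tests and a source-freshness check are declared in YAML alongside the models; the names and thresholds below are a hypothetical sketch, not the project's actual configuration:

```yaml
# models/schema.yml -- hypothetical example
version: 2

sources:
  - name: raw
    loaded_at_field: loaded_at        # column used to measure recency
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders

models:
  - name: orders_daily
    columns:
      - name: order_date
        tests:
          - unique                    # built-in generic test
          - not_null                  # built-in generic test
```

`dbt test` runs the column tests, and `dbt source freshness` checks the recency thresholds.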

Among the challenges that we face, partly due to dbt test limitations, are:

  1. The need to find a good compromise between cost and quality: dbt tests can be very costly, as they usually do a full scan of the tables.
  2. While generic tests are usually very simple and have limited possibilities, singular tests require lots of custom SQL queries.
  3. Unit tests on incremental models only test what will be merged (not the full result of the incremental model). Creating fixtures that are close to your real data is hard: such fixtures are very handy, but they add complexity to your configuration.
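For point 3, dbt (since v1.8) lets you define unit tests with fixture rows in YAML; a minimal sketch with hypothetical model and column names, using `overrides` to force the incremental branch:

```yaml
# Hypothetical unit test for an incremental model (dbt >= 1.8 syntax)
unit_tests:
  - name: test_orders_daily_incremental_run
    model: orders_daily
    overrides:
      macros:
        is_incremental: true          # force the incremental code path
    given:
      - input: ref('stg_orders')
        rows:
          - {order_date: 2024-01-01, amount: 10.0}
          - {order_date: 2024-01-01, amount: 5.0}
    expect:
      rows:
        - {order_date: 2024-01-01, n_orders: 2, total_amount: 15.0}
```

Note that `expect` only describes the rows produced by this single run, i.e., what will be merged, which is exactly the limitation mentioned in point 3.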
codecentric AG
Data Scientist

I am Dr. Francesca Diana, and I have been working at codecentric as a Data Scientist/Data Engineer for 8 years. I have gathered experience in a variety of fields: from fraud and anomaly detection for marketplaces to churn prediction for retail and document classification models. I enjoy going through the whole data science pipeline: designing the concept, analysing the data, applying machine learning methodologies, and implementing the model in production.

Francesca Diana
18:00 - 18:45
Talk: Wed 1.6
