The pointblank library is all about assessing the state of data quality in a table. You provide the validation rules and the library will dutifully interrogate the data and provide useful reporting. We can use different types of tables like Polars and Pandas DataFrames, Parquet files, or a selection of DB tables. Let’s walk through what table validation looks like in pointblank!
A Simple Validation Table
This is a validation table that is produced from a validation of a Polars DataFrame:
2024-12-20 15:09:48 UTC< 1 s2024-12-20 15:09:48 UTC
Each row in this reporting table constitutes a single validation step. Roughly, the left-hand side outlines the validation rules and the right-hand side provides the results of each validation step. While simple in principle, there’s a lot of useful information packed into this validation table.
Here’s a diagram that describes a few of the important parts of the validation table:
There are three important sections of this table,
validation steps: each step is a distinct test on the table focused on a certain part of the table (here, the different columns)
validation rules: the validation type is provided here along with key constraints
validation results: post-interrogation results are provided here, with a breakdown of test units (total, passing, failing), threshold states, etc.
The intent is to provide the key information in one place, and have it be interpretable by a broad audience.
Example Code, Step-by-Step
Here’s the code that performs the validation on the Polars table.
2024-12-20 15:09:48 UTC< 1 s2024-12-20 15:09:48 UTC
Note these three key pieces in the code:
the Validate(data=...) argument takes a DataFrame that you want to validate
the methods starting with col_* specify validation steps that run on specific columns
the interrogate() method executes the validation plan on the table
That’s data validation with pointblank in a nutshell! In the next section we’ll go a bit further by introducing a means to gauge data quality with failure thresholds.
Understanding Test Units
Each validation step will execute a validation test. For example, col_vals_lt() tests that each value in a column is less than a specified number. One important piece that’s reported is the number of test units that pass or fail.
Test units are dependent on the test being run. The col_vals_* tests each value in a column, so each value will be a test unit.
This matters because you can set thresholds that signal WARN, STOP, and NOTIFY states based the proportion or number of failing test units.
Here’s a simple example that uses a single col_vals_lt() step along with thresholds.
2024-12-20 15:09:48 UTC< 1 s2024-12-20 15:09:48 UTC
The code uses thresholds=(2, 4) to set a WARN threshold of 2 and a STOP threshold of 4. Notice these pieces in the validation table:
The FAIL column shows that 2 tests units have failed
The W column (short for WARN) shows a filled yellow circle indicating it’s reached threshold
The S column (short for STOP) shows an open red circle indicating it’s below threshold
The one final threshold, N (NOTIFY), wasn’t set so appears on the validation table as a dash.
Thresholds let you take action at different levels of severity. The next section discusses setting and acting on thresholds in detail.
Using Threshold Levels
Thresholds enable you to signal failure at different severity levels. In the near future, thresholds will be able to trigger custom actions. For example, when testing a column for NULLs with col_vals_not_null() you might want to warn on any NULLs and stop where there are 20% NULLs in the column.
2024-12-20 15:09:48 UTC< 1 s2024-12-20 15:09:48 UTC
In this case, the thresholds= argument in the cols_vals_not_null() step was used, but we can also set thresholds globally by using Validate(thresholds=...).