Intro

The pointblank library is all about assessing the state of data quality in a table. You provide the validation rules and the library will dutifully interrogate the data and provide useful reporting. It works with several types of tables, including Polars and Pandas DataFrames, Parquet files, and a selection of database tables. Let’s walk through what table validation looks like in pointblank!

A Simple Validation Table

This is a validation table that is produced from a validation of a Polars DataFrame:

import pointblank as pb

validation_1 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_lt(columns="a", value=10)
    .col_vals_between(columns="d", left=0, right=5000)
    .col_vals_in_set(columns="f", set=["low", "mid", "high"])
    .col_vals_regex(columns="b", pattern=r"^[0-9]-[a-z]{3}-[0-9]{3}$")
    .interrogate()
)

validation_1
Pointblank Validation
2024-12-20|15:09:48
Polars

    STEP                COLUMNS  VALUES                     UNITS  PASS       FAIL      W  S  N
1   col_vals_lt()       a        10                         13     13 (1.00)  0 (0.00)  -  -  -
2   col_vals_between()  d        [0, 5000]                  13     12 (0.92)  1 (0.08)  -  -  -
3   col_vals_in_set()   f        low, mid, high             13     13 (1.00)  0 (0.00)  -  -  -
4   col_vals_regex()    b        ^[0-9]-[a-z]{3}-[0-9]{3}$  13     13 (1.00)  0 (0.00)  -  -  -

Start: 2024-12-20 15:09:48 UTC | Duration: < 1 s | End: 2024-12-20 15:09:48 UTC

Each row in this reporting table constitutes a single validation step. Roughly, the left-hand side outlines the validation rules and the right-hand side provides the results of each validation step. While simple in principle, there’s a lot of useful information packed into this validation table.

Here’s a diagram that describes a few of the important parts of the validation table:

This table has three important sections:

  • validation steps: each step is a distinct test on the table focused on a certain part of the table (here, the different columns)
  • validation rules: the validation type is provided here along with key constraints
  • validation results: post-interrogation results are provided here, with a breakdown of test units (total, passing, failing), threshold states, etc.

The intent is to provide the key information in one place, and have it be interpretable by a broad audience.

Example Code, Step-by-Step

Here’s the code that performs the validation on the Polars table.

import pointblank as pb

validation_2 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_lt(columns="a", value=10)
    .col_vals_between(columns="d", left=0, right=5000)
    .col_vals_in_set(columns="f", set=["low", "mid", "high"])
    .col_vals_regex(columns="b", pattern=r"^[0-9]-[a-z]{3}-[0-9]{3}$")
    .interrogate()
)

validation_2
Pointblank Validation
2024-12-20|15:09:48
Polars

    STEP                COLUMNS  VALUES                     UNITS  PASS       FAIL      W  S  N
1   col_vals_lt()       a        10                         13     13 (1.00)  0 (0.00)  -  -  -
2   col_vals_between()  d        [0, 5000]                  13     12 (0.92)  1 (0.08)  -  -  -
3   col_vals_in_set()   f        low, mid, high             13     13 (1.00)  0 (0.00)  -  -  -
4   col_vals_regex()    b        ^[0-9]-[a-z]{3}-[0-9]{3}$  13     13 (1.00)  0 (0.00)  -  -  -

Start: 2024-12-20 15:09:48 UTC | Duration: < 1 s | End: 2024-12-20 15:09:48 UTC

Note these three key pieces in the code:

  • the data= argument of Validate() takes the DataFrame that you want to validate
  • the methods starting with col_* specify validation steps that run on specific columns
  • the interrogate() method executes the validation plan on the table
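To make the "build a plan, then interrogate" flow concrete, here is a toy sketch of the same chaining pattern in plain Python. MiniValidate and everything inside it are hypothetical and written only for illustration; this is not how pointblank is implemented, and the data is made up.

```python
# Toy illustration only: a minimal "plan, then interrogate" builder.
# MiniValidate is hypothetical and is not pointblank's implementation.

class MiniValidate:
    def __init__(self, data):
        # data: dict mapping column name -> list of values
        self.data = data
        self.steps = []    # the validation plan; nothing runs yet
        self.results = []  # filled in by interrogate()

    def col_vals_lt(self, columns, value):
        # Record a step and return self so that calls can be chained
        self.steps.append(("col_vals_lt", columns, lambda x: x < value))
        return self

    def col_vals_in_set(self, columns, set):
        allowed = frozenset(set)
        self.steps.append(("col_vals_in_set", columns, lambda x: x in allowed))
        return self

    def interrogate(self):
        # Execute every recorded step against the data
        self.results = []
        for name, column, check in self.steps:
            outcomes = [check(v) for v in self.data[column]]
            n_pass = sum(outcomes)
            self.results.append(
                {
                    "step": name,
                    "column": column,
                    "units": len(outcomes),
                    "pass": n_pass,
                    "fail": len(outcomes) - n_pass,
                }
            )
        return self


# Made-up data for demonstration
table = {"a": [2, 3, 12, 4], "f": ["low", "mid", "high", "low"]}

validation = (
    MiniValidate(data=table)
    .col_vals_lt(columns="a", value=10)
    .col_vals_in_set(columns="f", set=["low", "mid", "high"])
    .interrogate()
)

for row in validation.results:
    print(row)
```

The step methods only record what to check; the data is not touched until interrogate() runs, which mirrors the three pieces listed above.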

That’s data validation with pointblank in a nutshell! In the next section we’ll go a bit further by introducing a means to gauge data quality with failure thresholds.

Understanding Test Units

Each validation step will execute a validation test. For example, col_vals_lt() tests that each value in a column is less than a specified number. One important piece that’s reported is the number of test units that pass or fail.

Test units depend on the test being run. The col_vals_* methods test each value in a column, so each value is one test unit.
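As a rough illustration of how per-value test units add up, the sketch below counts passing and failing units for a "less than 7" check in plain Python. The helper count_test_units and the column values are invented for this example; they are not part of pointblank and not the contents of small_table.

```python
def count_test_units(values, predicate):
    """Treat each value as one test unit and tally the outcomes."""
    outcomes = [predicate(v) for v in values]
    n_units = len(outcomes)
    n_pass = sum(outcomes)
    n_fail = n_units - n_pass
    return {
        "units": n_units,
        "pass": n_pass,
        "fail": n_fail,
        # Fractions rounded to two places, guarding against an empty column
        "f_pass": round(n_pass / n_units, 2) if n_units else 0.0,
        "f_fail": round(n_fail / n_units, 2) if n_units else 0.0,
    }


# Made-up values: 13 test units, two of which are not less than 7
column_a = [1, 5, 6, 2, 9, 4, 7, 4, 3, 3, 4, 2, 1]

summary = count_test_units(column_a, lambda v: v < 7)
print(summary)  # 13 units: 11 pass, 2 fail
```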

Test unit counts matter because you can set thresholds that signal WARN, STOP, and NOTIFY states based on the proportion or number of failing test units.

Here’s a simple example that uses a single col_vals_lt() step along with thresholds.

validation_3 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_lt(columns="a", value=7, thresholds=(2, 4))
    .interrogate()
)

validation_3
Pointblank Validation
2024-12-20|15:09:48
Polars

    STEP           COLUMNS  VALUES  UNITS  PASS       FAIL      W  S  N
1   col_vals_lt()  a        7       13     11 (0.85)  2 (0.15)  ●  ○  -

Start: 2024-12-20 15:09:48 UTC | Duration: < 1 s | End: 2024-12-20 15:09:48 UTC

The code uses thresholds=(2, 4) to set a WARN threshold of 2 and a STOP threshold of 4. Notice these pieces in the validation table:

  • The FAIL column shows that 2 test units have failed
  • The W column (short for WARN) shows a filled yellow circle, indicating that the threshold has been reached
  • The S column (short for STOP) shows an open red circle, indicating that the failures are below the threshold

The final threshold, N (NOTIFY), wasn’t set, so it appears on the validation table as a dash.
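The W, S, and N readings above can be reproduced with a few lines of plain Python. This is only a sketch of the idea: threshold_states is a hypothetical helper, and it assumes that a fractional value below 1 is a proportion of failing units while any other value is an absolute count, with a threshold counting as reached when the failures meet or exceed it.

```python
def threshold_states(n_fail, n_units, warn=None, stop=None, notify=None):
    """Return True (reached), False (not reached), or None (not set) per level."""

    def reached(level):
        if level is None:
            return None  # threshold not set; shown as a dash in the table
        if isinstance(level, float) and level < 1:
            # Assumption: fractional values are proportions of failing units
            if n_units == 0:
                return False
            return n_fail / n_units >= level
        # Assumption: all other values are absolute counts of failing units
        return n_fail >= level

    return {"W": reached(warn), "S": reached(stop), "N": reached(notify)}


# Same numbers as the table above: 2 failing units out of 13, thresholds (2, 4)
states = threshold_states(n_fail=2, n_units=13, warn=2, stop=4)
print(states)  # {'W': True, 'S': False, 'N': None}

# The same check with proportions: 2/13 is about 0.15
print(threshold_states(n_fail=2, n_units=13, warn=0.1, stop=0.2))
```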

Thresholds let you take action at different levels of severity. The next section discusses setting and acting on thresholds in detail.

Using Threshold Levels

Thresholds enable you to signal failure at different severity levels, and in the near future they will also be able to trigger custom actions. For example, when testing a column for NULLs with col_vals_not_null(), you might want to warn on any NULLs and stop once 20% of the values in the column are NULL.

validation_4 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_not_null(columns="c", thresholds=(1, 0.2))
    .interrogate()
)

validation_4
Pointblank Validation
2024-12-20|15:09:48
Polars

    STEP                 COLUMNS  VALUES  UNITS  PASS       FAIL      W  S  N
1   col_vals_not_null()  c                13     11 (0.85)  2 (0.15)  ●  ○  -

Start: 2024-12-20 15:09:48 UTC | Duration: < 1 s | End: 2024-12-20 15:09:48 UTC

In this case, the thresholds= argument of the col_vals_not_null() step was used, but thresholds can also be set globally by using Validate(thresholds=...).
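One way to picture how a global default and a step-level setting could fit together is the fallback below. It is a sketch under a stated assumption: that a step-level value takes precedence over the global one, which the text above does not spell out. resolve_thresholds is a hypothetical helper, not a pointblank function.

```python
def resolve_thresholds(step_thresholds=None, global_thresholds=None):
    """Pick the thresholds that apply to a single validation step."""
    # Assumption: a step-level setting wins; otherwise use the global default
    if step_thresholds is not None:
        return step_thresholds
    return global_thresholds


global_default = (1, 0.2)

# A step without its own thresholds falls back to the global default
print(resolve_thresholds(None, global_default))    # (1, 0.2)

# A step with its own thresholds keeps them
print(resolve_thresholds((2, 4), global_default))  # (2, 4)
```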

For more on thresholds, see the Thresholds article.