Intro

The Pointblank library is all about assessing the state of data quality in a table. You provide the validation rules and the library will dutifully interrogate the data and provide useful reporting. It works with many types of tables, including Polars and Pandas DataFrames, Parquet files, and various database tables (thanks to Ibis support). Let’s walk through what table validation looks like in Pointblank.
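For instance, a Pandas DataFrame built in memory can be handed to Validate() just as a Polars table can. Here’s a minimal sketch; the DataFrame and its contents are made up purely for illustration:

import pandas as pd
import pointblank as pb

# A small, made-up Pandas DataFrame (purely illustrative)
df = pd.DataFrame(
    {
        "a": [2, 4, 6],
        "f": ["low", "mid", "high"],
    }
)

# The same validation workflow works regardless of the table's backend
validation_pd = (
    pb.Validate(data=df)
    .col_vals_lt(columns="a", value=10)
    .col_vals_in_set(columns="f", set=["low", "mid", "high"])
    .interrogate()
)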

A Simple Validation Table

This is a validation report table that is produced from a validation of a Polars DataFrame:

import pointblank as pb

validation_1 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_lt(columns="a", value=10)
    .col_vals_between(columns="d", left=0, right=5000)
    .col_vals_in_set(columns="f", set=["low", "mid", "high"])
    .col_vals_regex(columns="b", pattern=r"^[0-9]-[a-z]{3}-[0-9]{3}$")
    .interrogate()
)

validation_1
Pointblank Validation
2025-01-20 18:20:45 UTC | Polars

STEP  VALIDATION           COLUMNS  VALUES                     UNITS  PASS       FAIL
1     col_vals_lt()        a        10                         13     13 (1.00)  0 (0.00)
2     col_vals_between()   d        [0, 5000]                  13     12 (0.92)  1 (0.08)
3     col_vals_in_set()    f        low, mid, high             13     13 (1.00)  0 (0.00)
4     col_vals_regex()     b        ^[0-9]-[a-z]{3}-[0-9]{3}$  13     13 (1.00)  0 (0.00)

2025-01-20 18:20:45 UTC | < 1 s | 2025-01-20 18:20:45 UTC

Each row in this reporting table constitutes a single validation step. Roughly, the left-hand side outlines the validation rules and the right-hand side provides the results of each validation step. While simple in principle, there’s a lot of useful information packed into this validation table.

There are three important parts of the validation table to note:

  • validation steps: each step is a separate test on the table, focused on a certain aspect of the table
  • validation rules: the validation type is provided here along with key constraints
  • validation results: interrogation results are provided here, with a breakdown of test units (total, passing, and failing), threshold states, and more

The intent is to provide the key information in one place, and have it be interpretable by data stakeholders.
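The report table isn’t the only way to consume these results. As a rough sketch, the interrogated Validate object can also be queried programmatically; the all_passed() and n_failed() methods used here are assumptions about the API rather than something shown above:

# Rough sketch of programmatic access to interrogation results.
# all_passed() and n_failed() are assumed method names on the
# interrogated Validate object, not demonstrated in the report above.
if not validation_1.all_passed():
    print(validation_1.n_failed())  # failing test units, per validation step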

Example Code, Step-by-Step

Here’s the code that performs the validation on the Polars table.

import pointblank as pb

validation_2 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_lt(columns="a", value=10)
    .col_vals_between(columns="d", left=0, right=5000)
    .col_vals_in_set(columns="f", set=["low", "mid", "high"])
    .col_vals_regex(columns="b", pattern=r"^[0-9]-[a-z]{3}-[0-9]{3}$")
    .interrogate()
)

validation_2
(The validation report produced here is identical to the one shown above.)

Note these three key pieces in the code:

  • the Validate(data=...) argument takes a DataFrame or database table that you want to validate
  • the methods starting with col_* specify validation steps that run on specific columns
  • the interrogate() method executes the validation plan on the table
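These pieces don’t have to appear in a single chained expression. Since each validation method returns the Validate object, the same plan can be built up incrementally before the final interrogate() call. Here’s a brief sketch of that style, using only the methods shown above:

import pointblank as pb

# Start a validation plan against the same demo table
validation = pb.Validate(data=pb.load_dataset(dataset="small_table"))

# Add steps one at a time (each call returns the Validate object)
validation = validation.col_vals_lt(columns="a", value=10)
validation = validation.col_vals_between(columns="d", left=0, right=5000)

# Execute the plan
validation = validation.interrogate()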

This common pattern underlies every validation workflow: Validate() and interrogate() bookend a validation plan built up by calling validation methods. And that’s data validation with Pointblank in a nutshell! In the next section we’ll go a bit further by introducing a way to gauge data quality with failure thresholds.

Understanding Test Units

Each validation step will execute a separate validation test on the target table. For example, the col_vals_lt() validation tests that each value in a column is less than a specified number. A key thing that’s reported is the number of test units that pass or fail a validation step.

Test units depend on the type of validation being run. Each col_vals_* validation tests every value in a column, so each value is a test unit (and the number of test units equals the number of rows in the target table).
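For example, small_table has 13 rows, which is why every col_vals_* step in the reports above shows 13 test units. A quick check, assuming load_dataset() returns a Polars DataFrame by default (as the report header suggests):

import pointblank as pb

tbl = pb.load_dataset(dataset="small_table")

# One test unit per row for a col_vals_* step
print(len(tbl))  # 13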

This matters because you can set thresholds that signal warn, stop, and notify states based on the proportion or number of failing test units.

Here’s a simple example that uses a single col_vals_lt() step along with thresholds set in the thresholds= argument of the validation method.

validation_3 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_lt(columns="a", value=7, thresholds=(2, 4))
    .interrogate()
)

validation_3
Pointblank Validation
2025-01-20 18:20:45 UTC | Polars

STEP  VALIDATION     COLUMNS  VALUES  UNITS  PASS       FAIL      W  S  N
1     col_vals_lt()  a        7       13     11 (0.85)  2 (0.15)  ●  ○  –

2025-01-20 18:20:45 UTC | < 1 s | 2025-01-20 18:20:45 UTC

The code uses thresholds=(2, 4) to set a warn threshold of 2 and a stop threshold of 4. Looking at the validation report table, we can see:

  • the FAIL column shows that 2 test units have failed
  • the W column (short for warn) shows a filled yellow circle indicating that the number of failing test units reached that threshold value
  • the S column (short for stop) shows an open red circle indicating that the number of failing test units is below that threshold

The final threshold, N (for notify), wasn’t set, so it appears on the validation table as a long dash.
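A third value in the tuple would set the notify threshold as well. Here’s a sketch of that; the three-value form following the W/S/N ordering shown in the report is an assumption rather than something demonstrated above:

import pointblank as pb

# Warn at 2, stop at 4, notify at 8 failing test units (assumed tuple ordering)
validation_3b = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_lt(columns="a", value=7, thresholds=(2, 4, 8))
    .interrogate()
)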

Thresholds let you take action at different levels of severity. The next section discusses setting and acting on thresholds in detail.

Using Threshold Levels

Thresholds enable you to signal failure at different severity levels. In the near future, thresholds will be able to trigger custom actions (should those actions be defined).

Here’s an example where we check a column for Null/missing values with col_vals_not_null(). This is a case where we want to warn on any Null values and stop if 20% or more of the values in the column are Null.

validation_4 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_not_null(columns="c", thresholds=(1, 0.2))
    .interrogate()
)

validation_4
Pointblank Validation
2025-01-20 18:20:45 UTC | Polars

STEP  VALIDATION           COLUMNS  VALUES  UNITS  PASS       FAIL      W  S  N
1     col_vals_not_null()  c        –       13     11 (0.85)  2 (0.15)  ●  ○  –

2025-01-20 18:20:45 UTC | < 1 s | 2025-01-20 18:20:45 UTC

In this case, the thresholds= argument in the col_vals_not_null() step was set to (1, 0.2), meaning the warn threshold is 1 failing test unit and the stop threshold is a 0.2 fraction (20%) of all test units failing.
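In other words, integer threshold values are absolute counts of failing test units, while values between 0 and 1 are proportions of all test units. For this 13-row table the 0.2 stop threshold is first met at 3 failing units (3/13 ≈ 0.23), so a roughly equivalent plan could state the stop threshold as a count instead:

import pointblank as pb

# Same intent as validation_4, but with the stop threshold as a count
validation_4b = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_not_null(columns="c", thresholds=(1, 3))
    .interrogate()
)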

For more on thresholds, see the Thresholds article.