Intro

The Pointblank library is all about assessing the state of data quality in a table. You provide the validation rules and the library will dutifully interrogate the data and provide useful reporting. We can use different types of tables like Polars and Pandas DataFrames, Parquet files, or various database tables (thanks to Ibis support). Let’s walk through what table validation looks like in Pointblank.

A Simple Validation Table

This is a validation report table that is produced from a validation of a Polars DataFrame:

Code

import pointblank as pb

validation_1 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_lt(columns="a", value=10)
    .col_vals_between(columns="d", left=0, right=5000)
    .col_vals_in_set(columns="f", set=["low", "mid", "high"])
    .col_vals_regex(columns="b", pattern=r"^[0-9]-[a-z]{3}-[0-9]{3}$")
    .interrogate()
)

validation_1

		STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	S	N	EXT
Pointblank Validation
2025-01-20\|18:20:45 Polars
#4CA64C	1	col_vals_lt()	a	10	✓	13	13 1.00	0 0.00	—	—	—	—
#4CA64C66	2	col_vals_between()	d	[0, 5000]	✓	13	12 0.92	1 0.08	—	—	—
#4CA64C	3	col_vals_in_set()	f	low, mid, high	✓	13	13 1.00	0 0.00	—	—	—	—
#4CA64C	4	col_vals_regex()	b	^[0-9]-[a-z]{3}-[0-9]{3}$	✓	13	13 1.00	0 0.00	—	—	—	—
2025-01-20 18:20:45 UTC< 1 s2025-01-20 18:20:45 UTC

Each row in this reporting table constitutes a single validation step. Roughly, the left-hand side outlines the validation rules and the right-hand side provides the results of each validation step. While simple in principle, there’s a lot of useful information packed into this validation table.

Here’s a diagram that describes a few of the important parts of the validation table:

There are three things that should be noted here:

validation steps: each step is a separate test on the table, focused on a certain aspect of the table
validation rules: the validation type is provided here along with key constraints
validation results: interrogation results are provided here, with a breakdown of test units (total, passing, and failing), threshold states, and more

The intent is to provide the key information in one place, and have it be interpretable by data stakeholders.

Example Code, Step-by-Step

Here’s the code that performs the validation on the Polars table.

import pointblank as pb

validation_2 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_lt(columns="a", value=10)
    .col_vals_between(columns="d", left=0, right=5000)
    .col_vals_in_set(columns="f", set=["low", "mid", "high"])
    .col_vals_regex(columns="b", pattern=r"^[0-9]-[a-z]{3}-[0-9]{3}$")
    .interrogate()
)

validation_2

		STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	S	N	EXT
Pointblank Validation
2025-01-20\|18:20:45 Polars
#4CA64C	1	col_vals_lt()	a	10	✓	13	13 1.00	0 0.00	—	—	—	—
#4CA64C66	2	col_vals_between()	d	[0, 5000]	✓	13	12 0.92	1 0.08	—	—	—
#4CA64C	3	col_vals_in_set()	f	low, mid, high	✓	13	13 1.00	0 0.00	—	—	—	—
#4CA64C	4	col_vals_regex()	b	^[0-9]-[a-z]{3}-[0-9]{3}$	✓	13	13 1.00	0 0.00	—	—	—	—
2025-01-20 18:20:45 UTC< 1 s2025-01-20 18:20:45 UTC

Note these three key pieces in the code:

the Validate(data=...) argument takes a DataFrame or database table that you want to validate
the methods starting with col_* specify validation steps that run on specific columns
the interrogate() method executes the validation plan on the table

This common pattern is used in a validation workflow, where Validate() and interrogate() bookend a validation plan generated through calling validation methods. And that’s data validation with Pointblank in a nutshell! In the next section we’ll go a bit further by introducing a means to gauge data quality with failure thresholds.

Understanding Test Units

Each validation step will execute a separate validation test on the target table. For example, the col_vals_lt() validation tests that each value in a column is less than a specified number. A key thing that’s reported is the number of test units that pass or fail a validation step.

Test units are dependent on the test being run. The col_vals_* tests each value in a column, so each value will be a test unit (and the number of test units is the number of rows in the target table).

This matters because you can set thresholds that signal warn, stop, and notify states based the proportion or number of failing test units.

Here’s a simple example that uses a single col_vals_lt() step along with thresholds set in the thresholds= argument of the validation method.

validation_3 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_lt(columns="a", value=7, thresholds=(2, 4))
    .interrogate()
)

validation_3

		STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	S	N
Pointblank Validation
2025-01-20\|18:20:45 Polars
#FFBF00	1	col_vals_lt()	a	7	✓	13	11 0.85	2 0.15	●	○	—
2025-01-20 18:20:45 UTC< 1 s2025-01-20 18:20:45 UTC

The code uses thresholds=(2, 4) to set a warn threshold of 2 and a stop threshold of 4. If you look at the validation report table, we can see:

the FAIL column shows that 2 tests units have failed
the W column (short for warn) shows a filled yellow circle indicating those failing test units reached that threshold value
the S column (short for stop) shows an open red circle indicating that the number of failing test units is below that threshold

The one final threshold, N (for notify), wasn’t set so it appears on the validation table as a long dash.

Thresholds let you take action at different levels of severity. The next section discusses setting and acting on thresholds in detail.

Using Threshold Levels

Thresholds enable you to signal failure at different severity levels. In the near future, thresholds will be able to trigger custom actions (should those actions be defined).

Here’s an example where we test if a certain column has Null/missing values with col_vals_not_null(). This is a case where we want to warn on any Null values and stop where there are 20% Nulls in the column.

validation_4 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_not_null(columns="c", thresholds=(1, 0.2))
    .interrogate()
)

validation_4

		STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	S	N
Pointblank Validation
2025-01-20\|18:20:45 Polars
#FFBF00	1	col_vals_not_null()	c	—	✓	13	11 0.85	2 0.15	●	○	—
2025-01-20 18:20:45 UTC< 1 s2025-01-20 18:20:45 UTC

In this case, the thresholds= argument in the cols_vals_not_null() step was set to (1, 0.2), indicating 1 failing test unit is set for warn and a 0.2 fraction of all failing test units is set to stop.

For more on thresholds, see the Thresholds article.