Validate.interrogate

Validate.interrogate(
    collect_extracts=True,
    collect_tbl_checked=True,
    get_first_n=None,
    sample_n=None,
    sample_frac=None,
    sample_limit=5000,
)

Execute each validation step against the table and store the results.

Once a validation plan has been built from a series of validation steps, invoke interrogate() to run it. Interrogation evaluates each validation step against the table and stores the results.

The interrogation process will collect extracts of failing rows if the collect_extracts option is set to True (the default). We can control the number of rows collected using the get_first_n=, sample_n=, and sample_frac= options. The sample_limit= option will enforce a hard limit on the number of rows collected when using the sample_frac= option.
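To make the interplay of these options concrete, here is a toy model in plain Python. It is a sketch for illustration, not pointblank's implementation: the function name `limit_extract` and its `seed` argument are hypothetical, and the order in which the options take precedence (get_first_n, then sample_n, then sample_frac) and the truncation of fractional row counts are assumptions.

```python
import random


def limit_extract(failing_rows, get_first_n=None, sample_n=None,
                  sample_frac=None, sample_limit=5000, seed=None):
    """Toy model of how the row-limiting options could narrow an extract.

    Hypothetical helper: the precedence of the options is an assumption.
    """
    rows = list(failing_rows)
    rng = random.Random(seed)

    if get_first_n is not None:
        # Keep the first n failing rows, preserving the original order.
        return rows[:get_first_n]

    if sample_n is not None:
        # Sample n rows; if n exceeds the rows available, return all of them.
        if sample_n >= len(rows):
            return rows
        return rng.sample(rows, sample_n)

    if sample_frac is not None:
        # Sample a fraction of the rows, then cap the result at sample_limit.
        k = int(len(rows) * sample_frac)
        return rng.sample(rows, k)[:sample_limit]

    # No limiting option given: keep every failing row.
    return rows


failing = list(range(20_000))  # stand-in for 20,000 failing rows

first_ten = limit_extract(failing, get_first_n=10)
half_capped = limit_extract(failing, sample_frac=0.5, seed=0)

print(len(first_ten))    # 10
print(len(half_capped))  # 5000 (half would be 10,000, capped by sample_limit)
```

In this model sample_limit only matters on the sample_frac path, which matches the description above: a fraction of a very large set of failing rows can itself be very large, so a hard cap is applied afterwards.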

After interrogation is complete, the Validate object holds the results for every validation step, and we can use methods like n_passed(), f_failed(), etc., to understand how the table performed against the validation plan. A visual representation of the validation results can be viewed by printing the Validate object; this will display the validation table in an HTML viewing environment.

Parameters

collect_extracts : bool = True: An option to collect rows of the input table that didn’t pass a particular validation step. The default is True and further options (i.e., get_first_n=, sample_*=) allow for fine control of how these rows are collected.
collect_tbl_checked : bool = True: The processed data frames produced by executing the validation steps are collected and stored in the Validate object if collect_tbl_checked=True. This information is necessary for some methods (e.g., get_sundered_data()), but it can make the object grow to a large size. To opt out of attaching this data, set this argument to False.
get_first_n : int | None = None: If the option to collect non-passing rows is chosen, there is the option here to collect the first n rows. Supply an integer number of rows to extract from the top of the subset table containing non-passing rows (the ordering of data from the original table is retained).
sample_n : int | None = None: If the option to collect non-passing rows is chosen, this option allows for the sampling of n rows. Supply an integer number of rows to sample from the subset table. If n happens to be greater than the number of non-passing rows, then all such rows will be returned.
sample_frac : int | float | None = None: If the option to collect non-passing rows is chosen, this option allows for the sampling of a fraction of those rows. Provide a number between 0 and 1. The number of rows returned could be very large; however, the sample_limit= option applies a hard limit to the returned rows.
sample_limit : int = 5000: A value that limits the possible number of rows returned when sampling non-passing rows using the sample_frac= option.

Returns

Validate: The Validate object with the results of the interrogation.

Examples

Let’s use a built-in dataset ("game_revenue") to demonstrate some of the options of the interrogation process. A series of validation steps will populate our validation plan. After setting up the plan, the next step is to interrogate the table and see how well it aligns with our expectations. We’ll use the get_first_n= option so that any extracts of failing rows are limited to the first n rows.

import polars as pl
import pointblank as pb

validation = (
    pb.Validate(data=pb.load_dataset(dataset="game_revenue"))
    .col_vals_lt(columns="item_revenue", value=200)
    .col_vals_gt(columns="item_revenue", value=0)
    .col_vals_gt(columns="session_duration", value=5)
    .col_vals_in_set(columns="item_type", set=["iap", "ad"])
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}")
)

validation.interrogate(get_first_n=10)

Pointblank Validation
2024-12-20|15:09:06	Polars

	STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	S	N	EXT
1	col_vals_lt()	item_revenue	200	✓	2000	2000 1.00	0 0.00	—	—	—	—
2	col_vals_gt()	item_revenue	0	✓	2000	2000 1.00	0 0.00	—	—	—	—
3	col_vals_gt()	session_duration	5	✓	2000	1982 0.99	18 0.01	—	—	—
4	col_vals_in_set()	item_type	iap, ad	✓	2000	2000 1.00	0 0.00	—	—	—	—
5	col_vals_regex()	player_id	[A-Z]{12}\d{3}	✓	2000	2000 1.00	0 0.00	—	—	—	—

2024-12-20 15:09:06 UTC	< 1 s	2024-12-20 15:09:06 UTC

The validation table shows that step 3 (checking for session_duration greater than 5) has 18 failing test units. This means that 18 rows in the table are problematic. We’d like to see the rows that failed this validation step and we can do that with the get_data_extracts() method.
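The number next to each count in the PASS and FAIL columns appears to be that count divided by UNITS, shown to two decimal places. This reading is inferred from the values in the table rather than taken from a documented rule; a quick check for step 3 is consistent with it:

```python
# Counts for step 3, read from the validation table
units, n_pass, n_fail = 2000, 1982, 18

f_pass = n_pass / units  # fraction of passing test units
f_fail = n_fail / units  # fraction of failing test units

print(f"{f_pass:.2f}")  # 0.99
print(f"{f_fail:.2f}")  # 0.01
```

So a displayed fraction of 0.01 here corresponds to 18 failing test units out of 2000, not exactly one percent.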

validation.get_data_extracts(i=3, frame=True)

shape: (10, 11)

player_id	session_id	session_start	time	item_type	item_name	item_revenue	session_duration	start_day	acquisition	country
str	str	datetime[μs, UTC]	datetime[μs, UTC]	str	str	f64	f64	date	str	str
"QNLVRDEOXFYJ892"	"QNLVRDEOXFYJ892-lz5fmr6k"	2015-01-10 16:44:17 UTC	2015-01-10 16:45:29 UTC	"iap"	"gold3"	3.493	3.7	2015-01-09	"crosspromo"	"Australia"
"RMOSWHJGELCI675"	"RMOSWHJGELCI675-t4y8bjcu"	2015-01-11 07:24:24 UTC	2015-01-11 07:25:18 UTC	"iap"	"offer4"	17.991	5.0	2015-01-10	"other_campaign"	"France"
"RMOSWHJGELCI675"	"RMOSWHJGELCI675-t4y8bjcu"	2015-01-11 07:24:24 UTC	2015-01-11 07:26:24 UTC	"iap"	"offer5"	26.091	5.0	2015-01-10	"other_campaign"	"France"
"RMOSWHJGELCI675"	"RMOSWHJGELCI675-t4y8bjcu"	2015-01-11 07:24:24 UTC	2015-01-11 07:28:36 UTC	"ad"	"ad_15sec"	0.53	5.0	2015-01-10	"other_campaign"	"France"
"GFLYJHAPMZWD631"	"GFLYJHAPMZWD631-i2v1bl7a"	2015-01-11 16:13:24 UTC	2015-01-11 16:14:54 UTC	"iap"	"gems2"	3.996	3.6	2015-01-09	"organic"	"India"
"BFNLURISJXTH647"	"BFNLURISJXTH647-6o5hx27z"	2015-01-12 17:37:39 UTC	2015-01-12 17:39:27 UTC	"iap"	"offer5"	11.596	4.1	2015-01-10	"organic"	"India"
"BFNLURISJXTH647"	"BFNLURISJXTH647-6o5hx27z"	2015-01-12 17:37:39 UTC	2015-01-12 17:41:45 UTC	"iap"	"gems3"	9.996	4.1	2015-01-10	"organic"	"India"
"KILWZYHRSJEG316"	"KILWZYHRSJEG316-uke7dhqj"	2015-01-13 22:16:29 UTC	2015-01-13 22:17:35 UTC	"iap"	"offer2"	10.989	3.2	2015-01-04	"organic"	"Denmark"
"JUBDVFHCNQWT198"	"JUBDVFHCNQWT198-9h4xs2pb"	2015-01-14 16:08:25 UTC	2015-01-14 16:08:43 UTC	"iap"	"offer5"	8.697	3.3	2015-01-14	"organic"	"Philippines"
"JUBDVFHCNQWT198"	"JUBDVFHCNQWT198-9h4xs2pb"	2015-01-14 16:08:25 UTC	2015-01-14 16:11:01 UTC	"iap"	"offer4"	5.997	3.3	2015-01-14	"organic"	"Philippines"

The get_data_extracts() method will return a Polars DataFrame with the first 10 rows that failed the validation step. There are actually 18 rows that failed but we limited the collection of extracts with get_first_n=10.