Validate.interrogate

Validate.interrogate(
    collect_extracts=True,
    collect_tbl_checked=True,
    get_first_n=None,
    sample_n=None,
    sample_frac=None,
    sample_limit=5000,
)

Execute each validation step against the table and store the results.

Once a validation plan has been defined as a series of validation steps, call interrogate() to run it. Interrogation evaluates each validation step against the table and stores the results.

The interrogation process will collect extracts of failing rows if the collect_extracts option is set to True (the default). We can control the number of rows collected using the get_first_n=, sample_n=, and sample_frac= options. The sample_limit= option will enforce a hard limit on the number of rows collected when using the sample_frac= option.

After interrogation is complete, the Validate object holds the results for every step, and we can use methods like n_passed(), f_failed(), etc., to understand how the table performed against the validation plan. A visual representation of the validation results can be viewed by printing the Validate object; this will display the validation table in an HTML viewing environment.

Parameters

collect_extracts : bool = True

An option to collect rows of the input table that didn’t pass a particular validation step. The default is True and further options (i.e., get_first_n=, sample_*=) allow for fine control of how these rows are collected.

collect_tbl_checked : bool = True

The processed data frames produced by executing the validation steps are collected and stored in the Validate object if collect_tbl_checked=True. This information is necessary for some methods (e.g., get_sundered_data()), but it can make the object grow to a large size. To opt out of attaching this data, set this argument to False.

get_first_n : int | None = None

If the option to collect non-passing rows is chosen, this option allows for collecting only the first n rows. Supply an integer number of rows to extract from the top of the subset table containing non-passing rows (the ordering of data from the original table is retained).

sample_n : int | None = None

If the option to collect non-passing rows is chosen, this option allows for the sampling of n rows. Supply an integer number of rows to sample from the subset table. If n happens to be greater than the number of non-passing rows, then all such rows will be returned.

sample_frac : int | float | None = None

If the option to collect non-passing rows is chosen, this option allows for the sampling of a fraction of those rows. Provide a number between 0 and 1. The number of rows to return could be very large; however, the sample_limit= option will apply a hard limit to the returned rows.

sample_limit : int = 5000

A value that limits the possible number of rows returned when sampling non-passing rows using the sample_frac= option.
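
The way these extract options limit the collected rows can be sketched in plain Python. The helper below, limit_extract(), is hypothetical and not part of pointblank; it only mirrors the behavior described above. The precedence among the options when more than one is supplied, the rounding used for sample_frac=, and the seed argument are all assumptions made for illustration.

```python
import random
from typing import Optional, Sequence


def limit_extract(
    failing_rows: Sequence[dict],
    get_first_n: Optional[int] = None,
    sample_n: Optional[int] = None,
    sample_frac: Optional[float] = None,
    sample_limit: int = 5000,
    seed: Optional[int] = None,
) -> list:
    """Hypothetical helper mirroring the documented extract options."""
    rows = list(failing_rows)
    rng = random.Random(seed)

    if get_first_n is not None:
        # Take rows from the top; the original ordering is retained.
        return rows[:get_first_n]

    if sample_n is not None:
        # Asking for more rows than exist returns every failing row.
        if sample_n >= len(rows):
            return rows
        return rng.sample(rows, sample_n)

    if sample_frac is not None:
        if not 0 <= sample_frac <= 1:
            raise ValueError("sample_frac must be between 0 and 1")
        # Rounding down is an assumption; the hard cap is the documented part.
        k = min(int(len(rows) * sample_frac), sample_limit)
        return rng.sample(rows, k)

    # No limiting option supplied: keep every failing row.
    return rows


# 18 stand-in failing rows (made-up placeholder data).
failing = [{"row": i} for i in range(18)]

first_ten = limit_extract(failing, get_first_n=10)
sampled = limit_extract(failing, sample_n=50, seed=1)
capped = limit_extract(failing, sample_frac=1.0, sample_limit=5, seed=1)

print(len(first_ten), len(sampled), len(capped))
```

With 18 failing rows, get_first_n=10 keeps the first ten in order, sample_n=50 returns all 18, and sample_frac=1.0 with sample_limit=5 is capped at five rows.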

Returns

: Validate

The Validate object with the results of the interrogation.

Examples

Let’s use a built-in dataset ("game_revenue") to demonstrate some of the options of the interrogation process. A series of validation steps will populate our validation plan. After setting up the plan, the next step is to interrogate the table and see how well it aligns with our expectations. We’ll use the get_first_n= option so that any extracts of failing rows are limited to the first n rows.

import polars as pl
import pointblank as pb

validation = (
    pb.Validate(data=pb.load_dataset(dataset="game_revenue"))
    .col_vals_lt(columns="item_revenue", value=200)
    .col_vals_gt(columns="item_revenue", value=0)
    .col_vals_gt(columns="session_duration", value=5)
    .col_vals_in_set(columns="item_type", set=["iap", "ad"])
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}")
)

validation.interrogate(get_first_n=10)

Pointblank Validation
2024-12-20|15:09:06
Polars

STEP                 COLUMNS           VALUES          UNITS  PASS         FAIL
1 col_vals_lt()      item_revenue      200             2000   2000 (1.00)  0 (0.00)
2 col_vals_gt()      item_revenue      0               2000   2000 (1.00)  0 (0.00)
3 col_vals_gt()      session_duration  5               2000   1982 (0.99)  18 (0.01)
4 col_vals_in_set()  item_type         iap, ad         2000   2000 (1.00)  0 (0.00)
5 col_vals_regex()   player_id         [A-Z]{12}\d{3}  2000   2000 (1.00)  0 (0.00)

2024-12-20 15:09:06 UTC | < 1 s | 2024-12-20 15:09:06 UTC

The validation table shows that step 3 (checking for session_duration greater than 5) has 18 failing test units. This means that 18 rows in the table are problematic. We’d like to see the rows that failed this validation step and we can do that with the get_data_extracts() method.
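
The PASS and FAIL fractions reported for step 3 follow directly from the unit counts. The short calculation below reproduces them; the variable names are plain Python chosen for illustration and are not pointblank attributes.

```python
# Step 3 counts as reported in the validation table.
units = 2000
passed = 1982
failed = units - passed

# Fractions of passing and failing test units.
f_passed = passed / units
f_failed = failed / units

print(failed)             # 18
print(f"{f_passed:.2f}")  # 0.99
print(f"{f_failed:.2f}")  # 0.01
```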

validation.get_data_extracts(i=3, frame=True)

shape: (10, 11)
player_id          session_id                  session_start            time                     item_type  item_name   item_revenue  session_duration  start_day   acquisition       country
str                str                         datetime[μs, UTC]        datetime[μs, UTC]        str        str         f64           f64               date        str               str
"QNLVRDEOXFYJ892"  "QNLVRDEOXFYJ892-lz5fmr6k"  2015-01-10 16:44:17 UTC  2015-01-10 16:45:29 UTC  "iap"      "gold3"     3.493         3.7               2015-01-09  "crosspromo"      "Australia"
"RMOSWHJGELCI675"  "RMOSWHJGELCI675-t4y8bjcu"  2015-01-11 07:24:24 UTC  2015-01-11 07:25:18 UTC  "iap"      "offer4"    17.991        5.0               2015-01-10  "other_campaign"  "France"
"RMOSWHJGELCI675"  "RMOSWHJGELCI675-t4y8bjcu"  2015-01-11 07:24:24 UTC  2015-01-11 07:26:24 UTC  "iap"      "offer5"    26.091        5.0               2015-01-10  "other_campaign"  "France"
"RMOSWHJGELCI675"  "RMOSWHJGELCI675-t4y8bjcu"  2015-01-11 07:24:24 UTC  2015-01-11 07:28:36 UTC  "ad"       "ad_15sec"  0.53          5.0               2015-01-10  "other_campaign"  "France"
"GFLYJHAPMZWD631"  "GFLYJHAPMZWD631-i2v1bl7a"  2015-01-11 16:13:24 UTC  2015-01-11 16:14:54 UTC  "iap"      "gems2"     3.996         3.6               2015-01-09  "organic"         "India"
"BFNLURISJXTH647"  "BFNLURISJXTH647-6o5hx27z"  2015-01-12 17:37:39 UTC  2015-01-12 17:39:27 UTC  "iap"      "offer5"    11.596        4.1               2015-01-10  "organic"         "India"
"BFNLURISJXTH647"  "BFNLURISJXTH647-6o5hx27z"  2015-01-12 17:37:39 UTC  2015-01-12 17:41:45 UTC  "iap"      "gems3"     9.996         4.1               2015-01-10  "organic"         "India"
"KILWZYHRSJEG316"  "KILWZYHRSJEG316-uke7dhqj"  2015-01-13 22:16:29 UTC  2015-01-13 22:17:35 UTC  "iap"      "offer2"    10.989        3.2               2015-01-04  "organic"         "Denmark"
"JUBDVFHCNQWT198"  "JUBDVFHCNQWT198-9h4xs2pb"  2015-01-14 16:08:25 UTC  2015-01-14 16:08:43 UTC  "iap"      "offer5"    8.697         3.3               2015-01-14  "organic"         "Philippines"
"JUBDVFHCNQWT198"  "JUBDVFHCNQWT198-9h4xs2pb"  2015-01-14 16:08:25 UTC  2015-01-14 16:11:01 UTC  "iap"      "offer4"    5.997         3.3               2015-01-14  "organic"         "Philippines"

The get_data_extracts() method will return a Polars DataFrame with the first 10 rows that failed the validation step. There are actually 18 rows that failed but we limited the collection of extracts with get_first_n=10.