import polars as pl
import pointblank as pb

# Build a validation plan against the built-in "game_revenue" dataset
validation = (
    pb.Validate(data=pb.load_dataset(dataset="game_revenue"))
    .col_vals_lt(columns="item_revenue", value=200)
    .col_vals_gt(columns="item_revenue", value=0)
    .col_vals_gt(columns="session_duration", value=5)
    .col_vals_in_set(columns="item_type", set=["iap", "ad"])
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}")
)

# Run the plan, keeping at most the first 10 failing rows per step
validation.interrogate(get_first_n=10)
Validate.interrogate
Validate.interrogate(
    collect_extracts=True,
    collect_tbl_checked=True,
    get_first_n=None,
    sample_n=None,
    sample_frac=None,
    sample_limit=5000,
)
Execute each validation step against the table and store the results.
Once a validation plan has been built from a series of validation steps, call interrogate() to run it. Interrogation evaluates each validation step against the table and stores the results.
The interrogation process will collect extracts of failing rows if the collect_extracts= option is set to True (the default). We can control the number of rows collected using the get_first_n=, sample_n=, and sample_frac= options. The sample_limit= option enforces a hard limit on the number of rows collected when using the sample_frac= option.
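To make the interplay of these options concrete, here is a minimal pure-Python model of how an extract could be selected from a list of failing rows. It is a sketch, not pointblank's implementation: the function name `select_extract`, the `seed` argument, the precedence order (`get_first_n`, then `sample_n`, then `sample_frac`), and the choice to keep sampled rows in their original order are all assumptions made for illustration.

```python
import random


def select_extract(failing_rows, get_first_n=None, sample_n=None,
                   sample_frac=None, sample_limit=5000, seed=0):
    """Illustrative model of extract selection (hypothetical helper)."""
    rows = list(failing_rows)

    # get_first_n: take rows from the top, preserving the original order
    if get_first_n is not None:
        return rows[:get_first_n]

    rng = random.Random(seed)

    # sample_n: sample n rows; if n exceeds the row count, return all rows
    if sample_n is not None:
        k = min(sample_n, len(rows))
        picked = sorted(rng.sample(range(len(rows)), k))
        return [rows[i] for i in picked]

    # sample_frac: sample a fraction of rows, capped by sample_limit
    if sample_frac is not None:
        k = min(int(len(rows) * sample_frac), sample_limit)
        picked = sorted(rng.sample(range(len(rows)), k))
        return [rows[i] for i in picked]

    # No option given: keep every failing row
    return rows


failing = list(range(18))
print(len(select_extract(failing, get_first_n=10)))
print(len(select_extract(failing, sample_n=50)))
print(len(select_extract(range(100_000), sample_frac=0.5)))
```

In this model, asking for more sampled rows than exist simply returns all of them, and a large fractional sample is truncated to `sample_limit` rows.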
After interrogation is complete, the Validate object holds the results of every step, and we can use methods like n_passed(), f_failed(), etc., to understand how the table performed against the validation plan. A visual representation of the validation results can be viewed by printing the Validate object; this will display the validation table in an HTML viewing environment.
Parameters
collect_extracts : bool = True
An option to collect rows of the input table that didn’t pass a particular validation step. The default is True, and further options (i.e., get_first_n=, sample_*=) allow for fine control of how these rows are collected.

collect_tbl_checked : bool = True
The processed data frames produced by executing the validation steps are collected and stored in the Validate object if collect_tbl_checked=True. This information is necessary for some methods (e.g., get_sundered_data()), but it can make the object grow to a large size. To opt out of attaching this data, set this argument to False.

get_first_n : int | None = None
If the option to collect non-passing rows is chosen, this option collects only the first n rows. Supply an integer number of rows to extract from the top of the subset table containing non-passing rows (the ordering of data from the original table is retained).

sample_n : int | None = None
If the option to collect non-passing rows is chosen, this option allows for the sampling of n rows. Supply an integer number of rows to sample from the subset table. If n happens to be greater than the number of non-passing rows, then all such rows will be returned.

sample_frac : int | float | None = None
If the option to collect non-passing rows is chosen, this option allows for the sampling of a fraction of those rows. Provide a number between 0 and 1. The number of rows to return could be very large; however, the sample_limit= option will apply a hard limit to the returned rows.

sample_limit : int = 5000
A value that limits the possible number of rows returned when sampling non-passing rows using the sample_frac= option.
Returns
Validate
The Validate object with the results of the interrogation.
Examples
Let’s use a built-in dataset ("game_revenue") to demonstrate some of the options of the interrogation process. A series of validation steps will populate our validation plan. After setting up the plan, the next step is to interrogate the table and see how well it aligns with our expectations. We’ll use the get_first_n= option so that any extracts of failing rows are limited to the first n rows.
The validation table shows that step 3 (checking for session_duration greater than 5) has 18 failing test units. This means that 18 rows in the table are problematic. We’d like to see the rows that failed this validation step, and we can do that with the get_data_extracts() method.
validation.get_data_extracts(i=3, frame=True)
player_id | session_id | session_start | time | item_type | item_name | item_revenue | session_duration | start_day | acquisition | country |
---|---|---|---|---|---|---|---|---|---|---|
str | str | datetime[μs, UTC] | datetime[μs, UTC] | str | str | f64 | f64 | date | str | str |
"QNLVRDEOXFYJ892" | "QNLVRDEOXFYJ892-lz5fmr6k" | 2015-01-10 16:44:17 UTC | 2015-01-10 16:45:29 UTC | "iap" | "gold3" | 3.493 | 3.7 | 2015-01-09 | "crosspromo" | "Australia" |
"RMOSWHJGELCI675" | "RMOSWHJGELCI675-t4y8bjcu" | 2015-01-11 07:24:24 UTC | 2015-01-11 07:25:18 UTC | "iap" | "offer4" | 17.991 | 5.0 | 2015-01-10 | "other_campaign" | "France" |
"RMOSWHJGELCI675" | "RMOSWHJGELCI675-t4y8bjcu" | 2015-01-11 07:24:24 UTC | 2015-01-11 07:26:24 UTC | "iap" | "offer5" | 26.091 | 5.0 | 2015-01-10 | "other_campaign" | "France" |
"RMOSWHJGELCI675" | "RMOSWHJGELCI675-t4y8bjcu" | 2015-01-11 07:24:24 UTC | 2015-01-11 07:28:36 UTC | "ad" | "ad_15sec" | 0.53 | 5.0 | 2015-01-10 | "other_campaign" | "France" |
"GFLYJHAPMZWD631" | "GFLYJHAPMZWD631-i2v1bl7a" | 2015-01-11 16:13:24 UTC | 2015-01-11 16:14:54 UTC | "iap" | "gems2" | 3.996 | 3.6 | 2015-01-09 | "organic" | "India" |
"BFNLURISJXTH647" | "BFNLURISJXTH647-6o5hx27z" | 2015-01-12 17:37:39 UTC | 2015-01-12 17:39:27 UTC | "iap" | "offer5" | 11.596 | 4.1 | 2015-01-10 | "organic" | "India" |
"BFNLURISJXTH647" | "BFNLURISJXTH647-6o5hx27z" | 2015-01-12 17:37:39 UTC | 2015-01-12 17:41:45 UTC | "iap" | "gems3" | 9.996 | 4.1 | 2015-01-10 | "organic" | "India" |
"KILWZYHRSJEG316" | "KILWZYHRSJEG316-uke7dhqj" | 2015-01-13 22:16:29 UTC | 2015-01-13 22:17:35 UTC | "iap" | "offer2" | 10.989 | 3.2 | 2015-01-04 | "organic" | "Denmark" |
"JUBDVFHCNQWT198" | "JUBDVFHCNQWT198-9h4xs2pb" | 2015-01-14 16:08:25 UTC | 2015-01-14 16:08:43 UTC | "iap" | "offer5" | 8.697 | 3.3 | 2015-01-14 | "organic" | "Philippines" |
"JUBDVFHCNQWT198" | "JUBDVFHCNQWT198-9h4xs2pb" | 2015-01-14 16:08:25 UTC | 2015-01-14 16:11:01 UTC | "iap" | "offer4" | 5.997 | 3.3 | 2015-01-14 | "organic" | "Philippines" |
The get_data_extracts() method will return a Polars DataFrame with the first 10 rows that failed the validation step. There are actually 18 rows that failed, but we limited the collection of extracts with get_first_n=10.