API Reference

Validate

Data validation starts with the Validate class. You give it the target table and can optionally provide metadata and failure thresholds (using the Thresholds class or its shorthand forms). The Validate class has numerous methods for defining validation steps and for obtaining post-interrogation metrics and data.

Validate	Workflow for defining a set of validations on a table and interrogating for results.
Thresholds	Definition of threshold values.
Schema	Definition of a schema object.
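
To make the idea of failure thresholds concrete, the following is a minimal conceptual sketch in plain Python. It is not the library's Thresholds class: the ThresholdSketch name, its field names, the fraction-versus-count rule, and the "greater than or equal" comparison are all assumptions made for illustration. It only shows how a step's failure count could be mapped to the warn, stop, and notify states.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union

Number = Union[int, float]


@dataclass(frozen=True)
class ThresholdSketch:
    """Hypothetical stand-in for a thresholds definition (not the library's class)."""

    warn_at: Optional[Number] = None
    stop_at: Optional[Number] = None
    notify_at: Optional[Number] = None


def _reached(level: Optional[Number], n_failed: int, n_total: int) -> Optional[bool]:
    # A level that was never set has no status at all.
    if level is None:
        return None
    # Assumption: values below 1 are fractions of test units,
    # anything else is an absolute count of failing test units.
    if level < 1:
        return n_total > 0 and n_failed / n_total >= level
    return n_failed >= level


def threshold_status(
    thresholds: ThresholdSketch, n_failed: int, n_total: int
) -> Dict[str, Optional[bool]]:
    """Map one step's failure count to the three threshold states."""
    return {
        "warn": _reached(thresholds.warn_at, n_failed, n_total),
        "stop": _reached(thresholds.stop_at, n_failed, n_total),
        "notify": _reached(thresholds.notify_at, n_failed, n_total),
    }


# 2 failing test units out of 13 is about 15%: past the warn level, below the stop level.
status = threshold_status(ThresholdSketch(warn_at=0.1, stop_at=0.25), n_failed=2, n_total=13)
print(status)
```

Consult the Thresholds entry itself for the actual parameter names and comparison rules.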

Validation Steps

Validation steps can be thought of as sequential validations on the target data. We call Validate’s validation methods to build up a validation plan: a collection of steps that, in the aggregate, provides good validation coverage.

Validate.col_vals_gt	Are column data greater than a fixed value or data in another column?
Validate.col_vals_lt	Are column data less than a fixed value or data in another column?
Validate.col_vals_ge	Are column data greater than or equal to a fixed value or data in another column?
Validate.col_vals_le	Are column data less than or equal to a fixed value or data in another column?
Validate.col_vals_eq	Are column data equal to a fixed value or data in another column?
Validate.col_vals_ne	Are column data not equal to a fixed value or data in another column?
Validate.col_vals_between	Do column data lie between two specified values or data in other columns?
Validate.col_vals_outside	Do column data lie outside of two specified values or data in other columns?
Validate.col_vals_in_set	Validate whether column values are in a set of values.
Validate.col_vals_not_in_set	Validate whether column values are not in a set of values.
Validate.col_vals_null	Validate whether values in a column are NULL.
Validate.col_vals_not_null	Validate whether values in a column are not NULL.
Validate.col_vals_regex	Validate whether column values match a regular expression pattern.
Validate.col_vals_expr	Validate column values using a custom expression.
Validate.col_exists	Validate whether one or more columns exist in the table.
Validate.rows_distinct	Validate whether rows in the table are distinct.
Validate.col_schema_match	Do columns in the table (and their types) match a predefined schema?
Validate.row_count_match	Validate whether the row count of the table matches a specified count.
Validate.col_count_match	Validate whether the column count of the table matches a specified count.
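
The notion of a validation plan as a sequence of steps can be sketched without the library. The plain-Python example below is a conceptual illustration only: the run_plan function, the row dictionaries, and the step labels are hypothetical and are not how Validate is implemented. Each step checks every row once, so each row contributes one test unit per step.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Step = Tuple[str, Callable[[Row], bool]]

# A tiny made-up table, one dictionary per row.
rows: List[Row] = [
    {"a": 5, "b": 2, "c": "x"},
    {"a": 3, "b": 3, "c": None},
    {"a": 4, "b": 7, "c": "y"},
]

# The plan: an ordered list of (label, row-level check) pairs.
plan: List[Step] = [
    ("a > 2", lambda r: r["a"] > 2),                  # compare with a fixed value
    ("a >= b", lambda r: r["a"] >= r["b"]),           # compare with another column
    ("c is not null", lambda r: r["c"] is not None),  # null check
]


def run_plan(plan: List[Step], rows: List[Row]) -> List[Dict[str, Any]]:
    """Run every step against every row and tally passing and failing test units."""
    results = []
    for i, (label, check) in enumerate(plan, start=1):
        outcomes = [check(row) for row in rows]  # one test unit per row
        n_passed = sum(outcomes)
        results.append(
            {
                "step": i,
                "label": label,
                "n": len(outcomes),
                "n_passed": n_passed,
                "n_failed": len(outcomes) - n_passed,
            }
        )
    return results


results = run_plan(plan, rows)
for result in results:
    print(result)
```

Keeping the steps in an ordered list mirrors the way steps are numbered in the order they are added to the plan.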

Column Selection

A flexible way to select columns for validation is to use the col() function along with column selection helper functions. Combining col() with starts_with(), matches(), etc., selects multiple target columns, so that a single validation method call is mapped across many steps (one per selected column). Furthermore, the col() function can be used to declare a comparison column (e.g., for the value= argument in many col_vals_*() methods) when you can’t use a fixed value for comparison.

col	Helper function for referencing a column in the input table.
starts_with	Select columns that start with specified text.
ends_with	Select columns that end with specified text.
contains	Select columns that contain specified text.
matches	Select columns that match a specified regular expression pattern.
everything	Select all columns.
first_n	Select the first `n` columns in the column list.
last_n	Select the last `n` columns in the column list.
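
As a rough illustration of how selection helpers resolve to concrete columns, here is a plain-Python sketch. The *_sketch functions and expand_steps are hypothetical names invented for this example and do not reproduce the library's selectors; the column names are made up. The point is only that a selector is evaluated against the table's column list and that one validation call then becomes one step per matching column.

```python
import re
from typing import Callable, List

# A selector takes the table's column names and returns the ones it selects.
Selector = Callable[[List[str]], List[str]]


def starts_with_sketch(text: str) -> Selector:
    return lambda cols: [c for c in cols if c.startswith(text)]


def matches_sketch(pattern: str) -> Selector:
    rx = re.compile(pattern)
    return lambda cols: [c for c in cols if rx.search(c)]


def first_n_sketch(n: int) -> Selector:
    return lambda cols: cols[:n]


def expand_steps(check_name: str, selector: Selector, columns: List[str]) -> List[str]:
    """Expand one validation call into one labelled step per selected column."""
    return [f"{check_name}({col})" for col in selector(columns)]


columns = ["id", "paid_2023", "paid_2024", "due_2024", "notes"]

steps = expand_steps("col_vals_gt", starts_with_sketch("paid_"), columns)
year_columns = matches_sketch(r"_\d{4}$")(columns)
leading = first_n_sketch(2)(columns)

print(steps)
print(year_columns)
print(leading)
```

Because selectors are resolved against the actual column list, the same plan can be reused on tables whose matching columns differ.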

Interrogation and Reporting

The validation plan is put into action when interrogate() is called. The workflow for performing a comprehensive validation is then: (1) create a Validate object, (2) add validation steps, and (3) call interrogate(). After interrogation of the data, we can view a validation report table (by printing the object or using get_tabular_report()), extract key metrics, or split the data based on the validation results (with get_sundered_data()).

Validate.interrogate	Execute each validation step against the table and store the results.
Validate.get_tabular_report	Validation report as a GT table.
Validate.get_step_report	Get a detailed report for a single validation step.
Validate.get_json_report	Get a report of the validation results as a JSON-formatted string.
Validate.get_sundered_data	Get the data that passed or failed the validation steps.
Validate.get_data_extracts	Get the rows that failed for each validation step.
Validate.all_passed	Determine if every validation step passed perfectly, with no failing test units.
Validate.n	Provides a dictionary of the number of test units for each validation step.
Validate.n_passed	Provides a dictionary of the number of test units that passed for each validation step.
Validate.n_failed	Provides a dictionary of the number of test units that failed for each validation step.
Validate.f_passed	Provides a dictionary of the fraction of test units that passed for each validation step.
Validate.f_failed	Provides a dictionary of the fraction of test units that failed for each validation step.
Validate.warn	Provides a dictionary of the warning status for each validation step.
Validate.stop	Provides a dictionary of the stopping status for each validation step.
Validate.notify	Provides a dictionary of the notification status for each validation step.
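
The difference between splitting the data and extracting failing rows per step can be shown with a small plain-Python sketch. This is a conceptual illustration under assumptions, not the library's implementation: sunder, extracts_by_step, and the sample rows are hypothetical, and the sketch assumes that a row counts as passing only when it passes every row-level check.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]

rows: List[Row] = [
    {"id": 1, "qty": 5, "price": 2.0},
    {"id": 2, "qty": 0, "price": 3.5},
    {"id": 3, "qty": 4, "price": None},
    {"id": 4, "qty": 2, "price": 1.0},
]

# Row-level checks keyed by a label (dictionaries keep insertion order).
checks: Dict[str, Callable[[Row], bool]] = {
    "qty > 0": lambda r: r["qty"] > 0,
    "price is not null": lambda r: r["price"] is not None,
}


def sunder(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Split rows into those passing every check and those failing at least one."""
    passed: List[Row] = []
    failed: List[Row] = []
    for row in rows:
        target = passed if all(check(row) for check in checks.values()) else failed
        target.append(row)
    return passed, failed


def extracts_by_step(rows: List[Row]) -> Dict[str, List[Row]]:
    """Collect, for each check, the rows that failed that particular check."""
    return {
        label: [row for row in rows if not check(row)]
        for label, check in checks.items()
    }


passed_rows, failed_rows = sunder(rows)
extracts = extracts_by_step(rows)
all_passed = len(failed_rows) == 0

print([r["id"] for r in passed_rows])
print([r["id"] for r in failed_rows])
print({label: [r["id"] for r in rs] for label, rs in extracts.items()})
print(all_passed)
```

The split answers "which rows are clean overall?", while the per-step extracts answer "which rows broke this specific rule?".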

Utilities

The utilities group contains functions that are helpful for the validation process. We can load datasets with load_dataset(), preview a table with preview(), and set global configuration parameters with config().

load_dataset	Load a dataset hosted in the library as a specified DataFrame type.
preview	Display a table preview that shows some rows from the top, some from the bottom.
get_column_count	Get the number of columns in a table.
get_row_count	Get the number of rows in a table.
config	Configuration settings for the pointblank library.