API Reference

Validate

Data validation starts with the Validate class. You give it the target table and can optionally provide some metadata and/or failure thresholds (using the Thresholds class or a shorthand form). The Validate class has numerous methods for defining validation steps and for obtaining post-interrogation metrics and data.

Validate Workflow for defining a set of validations on a table and interrogating for results.
Thresholds Definition of threshold values.
Schema Definition of a schema object.
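
To make the idea of failure thresholds concrete, here is a conceptual sketch in plain Python. It is not pointblank's implementation: the SketchThresholds class, its warn_at, stop_at, and notify_at fields, and the evaluate() function are hypothetical names chosen to mirror the warn, stop, and notify statuses reported after interrogation. Treating each threshold as a fraction of failing test units is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SketchThresholds:
    """Hypothetical stand-in: each level is a failing fraction in [0, 1], or None if unset."""

    warn_at: Optional[float] = None
    stop_at: Optional[float] = None
    notify_at: Optional[float] = None


def evaluate(thresholds: SketchThresholds, n: int, n_failed: int) -> Dict[str, Optional[bool]]:
    """Report, for one validation step, whether each level was reached.

    A level that was never set is reported as None rather than False.
    """
    f_failed = n_failed / n if n else 0.0
    levels = {
        "warn": thresholds.warn_at,
        "stop": thresholds.stop_at,
        "notify": thresholds.notify_at,
    }
    return {
        name: (None if limit is None else f_failed >= limit)
        for name, limit in levels.items()
    }


# 3 of 20 test units failed (15%): above the warn level, below the stop level.
status = evaluate(SketchThresholds(warn_at=0.1, stop_at=0.25), n=20, n_failed=3)
print(status)
```

Keeping an unset level distinct from a level that was set but not reached lets a report show the difference between "no threshold" and "threshold not exceeded".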

Validation Steps

Validation steps can be thought of as sequential validations on the target data. We call Validate’s validation methods to build up a validation plan: a collection of steps that, in the aggregate, provides good validation coverage.

Validate.col_vals_gt Are column data greater than a fixed value or data in another column?
Validate.col_vals_lt Are column data less than a fixed value or data in another column?
Validate.col_vals_ge Are column data greater than or equal to a fixed value or data in another column?
Validate.col_vals_le Are column data less than or equal to a fixed value or data in another column?
Validate.col_vals_eq Are column data equal to a fixed value or data in another column?
Validate.col_vals_ne Are column data not equal to a fixed value or data in another column?
Validate.col_vals_between Do column data lie between two specified values or data in other columns?
Validate.col_vals_outside Do column data lie outside of two specified values or data in other columns?
Validate.col_vals_in_set Validate whether column values are in a set of values.
Validate.col_vals_not_in_set Validate whether column values are not in a set of values.
Validate.col_vals_null Validate whether values in a column are NULL.
Validate.col_vals_not_null Validate whether values in a column are not NULL.
Validate.col_vals_regex Validate whether column values match a regular expression pattern.
Validate.col_vals_expr Validate column values using a custom expression.
Validate.col_exists Validate whether one or more columns exist in the table.
Validate.rows_distinct Validate whether rows in the table are distinct.
Validate.col_schema_match Do columns in the table (and their types) match a predefined schema?
Validate.row_count_match Validate whether the row count of the table matches a specified count.
Validate.col_count_match Validate whether the column count of the table matches a specified count.
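
As a rough mental model of a validation plan, the sketch below treats each step as a check that yields one boolean per test unit. It is plain Python and not pointblank's implementation: the table layout (a dict of column lists) and the names Step, col_gt, col_between, row_count_is, and run_plan are hypothetical. It assumes that a row-wise check contributes one test unit per row, that a table-level check contributes a single test unit, and that the bounds of a "between" check are inclusive.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Table = Dict[str, List[float]]


@dataclass
class Step:
    """One validation step: a label plus a check returning one boolean per test unit."""

    label: str
    check: Callable[[Table], List[bool]]


def col_gt(column: str, value: float) -> Step:
    # Row-wise check: one test unit per row.
    return Step(f"{column} > {value}", lambda t: [v > value for v in t[column]])


def col_between(column: str, left: float, right: float) -> Step:
    # Row-wise check; inclusive bounds are an assumption of this sketch.
    return Step(
        f"{left} <= {column} <= {right}",
        lambda t: [left <= v <= right for v in t[column]],
    )


def row_count_is(count: int) -> Step:
    # Table-level check: a single test unit for the whole table.
    return Step(
        f"row count == {count}",
        lambda t: [len(next(iter(t.values()), [])) == count],
    )


def run_plan(table: Table, plan: List[Step]) -> Dict[int, Dict[str, object]]:
    """Run the steps in order and collect per-step counts, keyed by step number."""
    results: Dict[int, Dict[str, object]] = {}
    for i, step in enumerate(plan, start=1):
        units = step.check(table)
        results[i] = {"label": step.label, "n": len(units), "n_passed": sum(units)}
    return results


table = {"a": [1, 5, 8, 3], "b": [10, 20, 30, 40]}
plan = [col_gt("a", 2), col_between("b", 10, 30), row_count_is(4)]
results = run_plan(table, plan)
for number, result in results.items():
    print(number, result)
```

Building the plan as a list before running it mirrors the separation between defining steps and interrogating the data.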

Column Selection

A flexible way to select columns for validation is to use the col() function along with column selection helper functions. A combination of col() + starts_with(), matches(), etc., allows for the selection of multiple target columns (mapping a validation across many steps). Furthermore, the col() function can be used to declare a comparison column (e.g., for the value= argument in many col_vals_*() methods) when you can’t use a fixed value for comparison.

col Helper function for referencing a column in the input table.
starts_with Select columns that start with specified text.
ends_with Select columns that end with specified text.
contains Select columns that contain specified text.
matches Select columns that match a specified regular expression pattern.
everything Select all columns.
first_n Select the first n columns in the column list.
last_n Select the last n columns in the column list.
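
The selection logic behind these helpers can be pictured with a small sketch in plain Python. The sketch_* functions below are hypothetical stand-ins written for illustration, not the library's helpers, and their exact matching rules (case-sensitive text matching, regular-expression search anywhere in the name) are assumptions. Each one returns a selector that picks names from a list of columns, which is how a single validation can fan out into one step per selected column.

```python
import re
from typing import Callable, List

# A selector takes the table's column names and returns the chosen subset, in order.
Selector = Callable[[List[str]], List[str]]


def sketch_starts_with(text: str) -> Selector:
    return lambda cols: [c for c in cols if c.startswith(text)]


def sketch_ends_with(text: str) -> Selector:
    return lambda cols: [c for c in cols if c.endswith(text)]


def sketch_contains(text: str) -> Selector:
    return lambda cols: [c for c in cols if text in c]


def sketch_matches(pattern: str) -> Selector:
    compiled = re.compile(pattern)
    return lambda cols: [c for c in cols if compiled.search(c)]


def sketch_everything() -> Selector:
    return lambda cols: list(cols)


def sketch_first_n(n: int) -> Selector:
    return lambda cols: cols[:n]


def sketch_last_n(n: int) -> Selector:
    # max() guards against n larger than the number of columns.
    return lambda cols: cols[max(len(cols) - n, 0):]


columns = ["id", "date_start", "date_end", "amount_usd", "amount_eur"]

date_cols = sketch_starts_with("date")(columns)
amount_cols = sketch_matches(r"^amount_(usd|eur)$")(columns)
tail_cols = sketch_last_n(2)(columns)

# One logical validation expands into one step per selected column.
expanded_steps = [f"not-null check on {name}" for name in date_cols]
print(date_cols, amount_cols, tail_cols, expanded_steps, sep="\n")
```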

Interrogation and Reporting

The validation plan is put into action when interrogate() is called. The workflow for performing a comprehensive validation is then: (1) create the Validate object with Validate(), (2) add validation steps, and (3) call interrogate(). After interrogation, we can view a validation report table (by printing the object or using get_tabular_report()), extract key metrics, or split the data based on the validation results (with get_sundered_data()).

Validate.interrogate Execute each validation step against the table and store the results.
Validate.get_tabular_report Validation report as a GT table.
Validate.get_step_report Get a detailed report for a single validation step.
Validate.get_json_report Get a report of the validation results as a JSON-formatted string.
Validate.get_sundered_data Get the data that passed or failed the validation steps.
Validate.get_data_extracts Get the rows that failed for each validation step.
Validate.all_passed Determine if every validation step passed perfectly, with no failing test units.
Validate.n Provides a dictionary of the number of test units for each validation step.
Validate.n_passed Provides a dictionary of the number of test units that passed for each validation step.
Validate.n_failed Provides a dictionary of the number of test units that failed for each validation step.
Validate.f_passed Provides a dictionary of the fraction of test units that passed for each validation step.
Validate.f_failed Provides a dictionary of the fraction of test units that failed for each validation step.
Validate.warn Provides a dictionary of the warning status for each validation step.
Validate.stop Provides a dictionary of the stopping status for each validation step.
Validate.notify Provides a dictionary of the notification status for each validation step.
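
The per-step metrics and the idea of splitting data by validation outcome can be illustrated with plain Python. This is a conceptual sketch, not pointblank's implementation: the step_results structure is hypothetical, the dictionaries are assumed to be keyed by step number starting at 1, and a row is assumed to land in the passing piece only if it passed every row-wise step.

```python
from typing import Dict, List

# Hypothetical interrogation results: step number -> one boolean per row (True = passed).
step_results: Dict[int, List[bool]] = {
    1: [True, True, False, True],
    2: [True, False, False, True],
}
rows = ["row0", "row1", "row2", "row3"]

# Counts and fractions of test units, one entry per step.
n = {step: len(units) for step, units in step_results.items()}
n_passed = {step: sum(units) for step, units in step_results.items()}
n_failed = {step: n[step] - n_passed[step] for step in n}
f_passed = {step: n_passed[step] / n[step] for step in n}
f_failed = {step: n_failed[step] / n[step] for step in n}

# True only when no step has a failing test unit.
all_passed = all(count == 0 for count in n_failed.values())

# Split the rows: a row passes overall only if it passed in every step.
passed_everywhere = [
    all(units[k] for units in step_results.values()) for k in range(len(rows))
]
pass_piece = [row for row, ok in zip(rows, passed_everywhere) if ok]
fail_piece = [row for row, ok in zip(rows, passed_everywhere) if not ok]

print(n, n_passed, n_failed, f_passed, f_failed, all_passed, pass_piece, fail_piece, sep="\n")
```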

Utilities

The utilities group contains functions that are helpful for the validation process. We can load datasets with load_dataset(), preview a table with preview(), and set global configuration parameters with config().

load_dataset Load a dataset hosted in the library as the specified DataFrame type.
preview Display a table preview that shows some rows from the top, some from the bottom.
get_column_count Get the number of columns in a table.
get_row_count Get the number of rows in a table.
config Configuration settings for the pointblank library.
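
To show what a head-and-tail preview amounts to, here is a minimal sketch in plain Python. It is not the library's preview() function: sketch_preview, its n_head and n_tail parameters, the list-of-dicts table layout, and the 1-based row numbers are all assumptions made for illustration. The row and column counts are computed the simplest possible way for the same hypothetical layout.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def sketch_preview(rows: List[Row], n_head: int = 2, n_tail: int = 2) -> List[Tuple[int, Row]]:
    """Return (row_number, row) pairs for the first n_head and last n_tail rows.

    If the two ranges would overlap, every row is returned exactly once.
    """
    total = len(rows)
    if n_head + n_tail >= total:
        keep = list(range(total))
    else:
        keep = list(range(n_head)) + list(range(total - n_tail, total))
    return [(index + 1, rows[index]) for index in keep]


table_rows: List[Row] = [{"id": i, "value": i * 10} for i in range(1, 7)]

shown = sketch_preview(table_rows)
row_count = len(table_rows)
column_count = len(table_rows[0]) if table_rows else 0

for number, row in shown:
    print(number, row)
print(row_count, column_count)
```

Carrying the original row number alongside each row keeps the gap between the top and bottom portions visible to the reader.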