API Reference
Validate
When performing data validation, you’ll need the Validate class to get the process started. It’s given the target table, and you can optionally provide some metadata and/or failure thresholds (using the Thresholds class or through shorthands for this task). The Validate class has numerous methods for defining validation steps and for obtaining post-interrogation metrics and data.
Validate | Workflow for defining a set of validations on a table and interrogating for results. |
Thresholds | Definition of threshold values. |
Schema | Definition of a schema object. |
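To make the idea of failure thresholds concrete, here is a minimal pure-Python sketch. It is not pointblank's implementation: the `SketchThresholds` class, its `evaluate()` method, and the field names `warn`, `stop`, and `notify` are hypothetical, chosen only to mirror the status methods listed later in this reference. It also assumes that a level counts as met when the fraction of failing test units reaches or exceeds it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SketchThresholds:
    """Hypothetical stand-in for a set of failure thresholds (fractions in [0, 1])."""

    warn: Optional[float] = None
    stop: Optional[float] = None
    notify: Optional[float] = None

    def evaluate(self, n_failed: int, n_total: int) -> Dict[str, bool]:
        """Report which levels are met for a single validation step."""
        if n_total <= 0:
            raise ValueError("n_total must be positive")
        f_failed = n_failed / n_total
        levels = (("warn", self.warn), ("stop", self.stop), ("notify", self.notify))
        # Assumption: a level is met when the failing fraction reaches or exceeds it.
        # A level left as None is never met.
        return {name: level is not None and f_failed >= level for name, level in levels}


thresholds = SketchThresholds(warn=0.1, stop=0.25)
status = thresholds.evaluate(n_failed=2, n_total=13)
print(status)  # {'warn': True, 'stop': False, 'notify': False}
```

Here 2 of 13 test units fail (about 15%), which crosses the 10% `warn` level but not the 25% `stop` level; `notify` was never set, so it stays `False`.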
Validation Steps
Validation steps can be thought of as sequential validations on the target data. We call Validate’s validation methods to build up a validation plan: a collection of steps that, in the aggregate, provides good validation coverage.
Validate.col_vals_gt | Are column data greater than a fixed value or data in another column? |
Validate.col_vals_lt | Are column data less than a fixed value or data in another column? |
Validate.col_vals_ge | Are column data greater than or equal to a fixed value or data in another column? |
Validate.col_vals_le | Are column data less than or equal to a fixed value or data in another column? |
Validate.col_vals_eq | Are column data equal to a fixed value or data in another column? |
Validate.col_vals_ne | Are column data not equal to a fixed value or data in another column? |
Validate.col_vals_between | Do column data lie between two specified values or data in other columns? |
Validate.col_vals_outside | Do column data lie outside of two specified values or data in other columns? |
Validate.col_vals_in_set | Validate whether column values are in a set of values. |
Validate.col_vals_not_in_set | Validate whether column values are not in a set of values. |
Validate.col_vals_null | Validate whether values in a column are NULL. |
Validate.col_vals_not_null | Validate whether values in a column are not NULL. |
Validate.col_vals_regex | Validate whether column values match a regular expression pattern. |
Validate.col_vals_expr | Validate column values using a custom expression. |
Validate.col_exists | Validate whether one or more columns exist in the table. |
Validate.rows_distinct | Validate whether rows in the table are distinct. |
Validate.col_schema_match | Do columns in the table (and their types) match a predefined schema? |
Validate.row_count_match | Validate whether the row count of the table matches a specified count. |
Validate.col_count_match | Validate whether the column count of the table matches a specified count. |
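The notion of a validation plan built up step by step can be sketched in plain Python. The `SketchPlan` class, its `vals_gt()`, `vals_not_null()`, and `run()` methods, and the `ColumnRef` marker are all hypothetical names invented for this illustration and are not part of pointblank. The sketch only shows the general pattern: each method call appends a step and returns the plan for chaining, nothing is evaluated until the plan is run, and a comparison can target either a fixed value or another column.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class ColumnRef:
    """Hypothetical marker meaning 'compare against this column, not a literal'."""

    name: str


class SketchPlan:
    """Hypothetical builder: each method appends a step and returns self."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        self.steps: List[Dict[str, Any]] = []

    def _add(self, label: str, check: Callable[[Row], bool]) -> "SketchPlan":
        self.steps.append({"label": label, "check": check})
        return self

    def vals_gt(self, column: str, value: Any) -> "SketchPlan":
        def check(row: Row) -> bool:
            # Resolve the right-hand side per row when it is a column reference.
            rhs = row[value.name] if isinstance(value, ColumnRef) else value
            return row[column] > rhs

        return self._add(f"{column} > {value}", check)

    def vals_not_null(self, column: str) -> "SketchPlan":
        return self._add(f"{column} is not null", lambda row: row[column] is not None)

    def run(self) -> Dict[int, int]:
        """Run the steps in order; return passing test units keyed by step number."""
        return {
            i: sum(1 for row in self.rows if step["check"](row))
            for i, step in enumerate(self.steps, start=1)
        }


rows = [
    {"a": 3, "b": 1, "c": "x"},
    {"a": 5, "b": 4, "c": None},
    {"a": 8, "b": 2, "c": "z"},
]

plan = (
    SketchPlan(rows)
    .vals_gt("a", 4)               # compare against a fixed value
    .vals_gt("a", ColumnRef("b"))  # compare against another column, row by row
    .vals_not_null("c")
)
n_passed = plan.run()
print(n_passed)  # {1: 2, 2: 3, 3: 2}
```

Each row counts as one test unit per step, so the result maps step numbers to the count of rows that satisfied that step.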
Column Selection
A flexible way to select columns for validation is to use the col() function along with column selection helper functions. A combination of col() + starts_with(), matches(), etc., allows for the selection of multiple target columns (mapping a validation across many steps). Furthermore, the col() function can be used to declare a comparison column (e.g., for the value= argument in many col_vals_*() methods) when you can’t use a fixed value for comparison.
col | Helper function for referencing a column in the input table. |
starts_with | Select columns that start with specified text. |
ends_with | Select columns that end with specified text. |
contains | Select columns that contain specified text. |
matches | Select columns that match a specified regular expression pattern. |
everything | Select all columns. |
first_n | Select the first n columns in the column list. |
last_n | Select the last n columns in the column list. |
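As a rough illustration of how selection helpers can expand into several target columns, the following pure-Python sketch resolves selectors against a list of column names. The `sketch_*` functions are hypothetical stand-ins written for this example, not pointblank's helpers, and the regular expression matching assumes the pattern may match anywhere in the column name (`re.search`).

```python
import re
from typing import Callable, List

# A selector takes the table's column names and returns the ones it selects.
Selector = Callable[[List[str]], List[str]]


def sketch_starts_with(text: str) -> Selector:
    return lambda cols: [c for c in cols if c.startswith(text)]


def sketch_matches(pattern: str) -> Selector:
    regex = re.compile(pattern)
    # Assumption: the pattern may match anywhere in the name.
    return lambda cols: [c for c in cols if regex.search(c)]


def sketch_first_n(n: int) -> Selector:
    return lambda cols: cols[:n]


columns = ["id", "paid_2022", "paid_2023", "rev_2023", "notes"]

print(sketch_starts_with("paid")(columns))  # ['paid_2022', 'paid_2023']
print(sketch_matches(r"_\d{4}$")(columns))  # ['paid_2022', 'paid_2023', 'rev_2023']
print(sketch_first_n(2)(columns))           # ['id', 'paid_2022']

# Mapping one validation across the selection yields one step per column.
steps = [("greater than 0", name) for name in sketch_starts_with("paid")(columns)]
print(len(steps))  # 2
```

Resolving the selector against the actual column list at run time is what lets a single validation call fan out into many steps.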
Interrogation and Reporting
The validation plan is put into action when interrogate() is called. The workflow for performing a comprehensive validation is then: (1) Validate(), (2) adding validation steps, (3) interrogate(). After interrogation of the data, we can view a validation report table (by printing the object or using get_tabular_report()), extract key metrics, or split the data based on the validation results (with get_sundered_data()).
Validate.interrogate | Execute each validation step against the table and store the results. |
Validate.get_tabular_report | Validation report as a GT table. |
Validate.get_step_report | Get a detailed report for a single validation step. |
Validate.get_json_report | Get a report of the validation results as a JSON-formatted string. |
Validate.get_sundered_data | Get the data that passed or failed the validation steps. |
Validate.get_data_extracts | Get the rows that failed for each validation step. |
Validate.all_passed | Determine if every validation step passed perfectly, with no failing test units. |
Validate.n | Provides a dictionary of the number of test units for each validation step. |
Validate.n_passed | Provides a dictionary of the number of test units that passed for each validation step. |
Validate.n_failed | Provides a dictionary of the number of test units that failed for each validation step. |
Validate.f_passed | Provides a dictionary of the fraction of test units that passed for each validation step. |
Validate.f_failed | Provides a dictionary of the fraction of test units that failed for each validation step. |
Validate.warn | Provides a dictionary of the warning status for each validation step. |
Validate.stop | Provides a dictionary of the stopping status for each validation step. |
Validate.notify | Provides a dictionary of the notification status for each validation step. |
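The relationship between per-step metrics and splitting the data by validation outcome can be sketched without the library. The function `sketch_interrogate()` below is hypothetical and is not how pointblank works internally; it assumes that a row belongs to the "passed" pile only if it passes every step, and that all remaining rows go to the "failed" pile.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Check = Callable[[Row], bool]

rows: List[Row] = [
    {"id": 1, "amount": 20.0, "code": "A1"},
    {"id": 2, "amount": -5.0, "code": "B2"},
    {"id": 3, "amount": 12.5, "code": None},
    {"id": 4, "amount": 40.0, "code": "C3"},
]

checks: List[Check] = [
    lambda r: r["amount"] > 0,        # step 1
    lambda r: r["code"] is not None,  # step 2
]


def sketch_interrogate(
    rows: List[Row], checks: List[Check]
) -> Tuple[Dict[int, float], List[Row], List[Row]]:
    """Return per-step failing fractions plus the rows split into two piles."""
    if not rows:
        raise ValueError("rows must not be empty")
    n = len(rows)
    f_failed = {
        i: sum(1 for r in rows if not check(r)) / n
        for i, check in enumerate(checks, start=1)
    }
    # Assumption: a row is "passed" only if it passes every step.
    passed = [r for r in rows if all(check(r) for check in checks)]
    failed = [r for r in rows if not all(check(r) for check in checks)]
    return f_failed, passed, failed


f_failed, passed, failed = sketch_interrogate(rows, checks)
all_passed = all(fraction == 0 for fraction in f_failed.values())

print(f_failed)                   # {1: 0.25, 2: 0.25}
print([r["id"] for r in passed])  # [1, 4]
print([r["id"] for r in failed])  # [2, 3]
print(all_passed)                 # False
```

Keying the metrics by step number mirrors the dictionaries described in the table above, where each entry corresponds to one validation step.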
Utilities
The utilities group contains functions that are helpful for the validation process. We can load datasets with load_dataset(), preview a table with preview(), and set global configuration parameters with config().
load_dataset | Load a dataset hosted in the library as a specified DataFrame type. |
preview | Display a table preview that shows some rows from the top, some from the bottom. |
get_column_count | Get the number of columns in a table. |
get_row_count | Get the number of rows in a table. |
config | Configuration settings for the pointblank library. |
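The idea of a preview that shows rows from both ends of a table can be sketched in a few lines of plain Python. The function `sketch_preview()` and its `n_head` and `n_tail` parameters are hypothetical names used only here; this is an illustration of the concept, not the library's preview() function.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def sketch_preview(rows: List[Row], n_head: int = 5, n_tail: int = 5) -> List[Row]:
    """Return the first n_head and last n_tail rows of a table."""
    if n_head < 0 or n_tail < 0:
        raise ValueError("n_head and n_tail must be non-negative")
    # Short tables are returned whole so that no row appears twice.
    if n_head + n_tail >= len(rows):
        return list(rows)
    return rows[:n_head] + rows[len(rows) - n_tail:]


table = [{"id": i} for i in range(10)]
print([r["id"] for r in sketch_preview(table, n_head=2, n_tail=3)])  # [0, 1, 7, 8, 9]
print(len(sketch_preview(table[:3])))                                # 3
```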