Get the rows that failed for each validation step.
After the interrogate() method has been called, the get_data_extracts() method can be used to extract the rows that failed in each row-based validation step (e.g., col_vals_gt()). The method returns a dictionary of tables containing the rows that failed in each of those steps. If frame=True and i= is a scalar, the result is returned as a single table rather than a dictionary.
Parameters
i : int | list[int] | None = None
The validation step number(s) from which the failed rows are obtained. Can be provided as a list of integers or a single integer. If None, all steps are included.
frame : bool = False
If True and i= is a scalar, return the value as a DataFrame instead of a dictionary.
Returns
dict[int, FrameT | None] | FrameT | None
Either a dictionary, keyed by step number, of tables containing the rows that failed in each row-based validation step, or a single DataFrame when frame=True and i= is a scalar.
Validation Methods that are Row-Based
The following validation methods are row-based and will have rows extracted when there are failing test units.
col_vals_gt()
col_vals_ge()
col_vals_lt()
col_vals_le()
col_vals_eq()
col_vals_ne()
col_vals_between()
col_vals_outside()
col_vals_in_set()
col_vals_not_in_set()
col_vals_null()
col_vals_not_null()
col_vals_regex()
An extracted row means that a test unit failed for that row in the validation step. The extracted rows are a subset of the original table and are useful for further analysis or for understanding the nature of the failing test units.
Examples
Let’s perform a series of validation steps on a Polars DataFrame. We’ll use col_vals_gt() in the first step, col_vals_lt() in the second step, and col_vals_ge() in the third step. The interrogate() method executes the validation; then, we can extract the rows that failed for each validation step.
The get_data_extracts() method returns a dictionary of tables, where each table contains a subset of rows from the original table. These are the rows that failed for each validation step.
In the first step, the col_vals_gt() method was used to check if the values in column a were greater than 4. The extracted table shows the rows where this condition was not met; look at the a column: no value there is greater than 4.
In the second step, the col_vals_lt() method was used to check if the values in column c were less than 5. In the extracted two-row table, we see that the values in column c are greater than 5.
The third step (col_vals_ge()) checked if the values in column b were greater than or equal to 1. There were no failing test units, so the extracted table is empty (i.e., has columns but no rows).
The i= argument can be used to narrow down the extraction to one or more steps. For example, to extract the rows that failed in the first step only:
Note that the first validation step is indexed at 1 (not 0). This 1-based indexing is in place here to match the step numbers reported in the validation table. What we get back is still a dictionary, but it only contains one table (the one for the first step).
If you want to get the extracted table as a DataFrame, set frame=True and provide a scalar value for i. For example, to get the extracted table for the second step as a DataFrame:
validation.get_data_extracts(i=2, frame=True)
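shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 6   ┆ 2   ┆ 7   │
│ 3   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘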
The extracted table is now a DataFrame, which can serve as a more convenient format for further analysis or visualization.