Validate.get_data_extracts

Validate.get_data_extracts(i=None, frame=False)

Get the rows that failed for each validation step.

After the interrogate() method has been called, the get_data_extracts() method can be used to extract the rows that failed in each row-based validation step (e.g., col_vals_gt(), etc.). The method returns a dictionary of tables, keyed by step number, containing the rows that failed in each row-based validation step. If frame=True and i= is a scalar, the single table is returned directly (forgoing the dictionary structure).

Parameters

i : int | list[int] | None = None

The validation step number(s) from which the failed rows are obtained. Can be provided as a list of integers or a single integer. If None, all steps are included.

frame : bool = False

If True and i= is a scalar, return the value as a DataFrame instead of a dictionary.

Returns

: dict[int, FrameT | None] | FrameT | None

A dictionary of tables containing the rows that failed in each row-based validation step, or a single DataFrame if frame=True and i= is a scalar.

Validation Methods that are Row-Based

The following validation methods are row-based and will have rows extracted when there are failing test units.

  • col_vals_gt()
  • col_vals_ge()
  • col_vals_lt()
  • col_vals_le()
  • col_vals_eq()
  • col_vals_ne()
  • col_vals_between()
  • col_vals_outside()
  • col_vals_in_set()
  • col_vals_not_in_set()
  • col_vals_null()
  • col_vals_not_null()
  • col_vals_regex()

An extracted row means that a test unit failed for that row in the validation step. The extracted rows are a subset of the original table and are useful for further analysis or for understanding the nature of the failing test units.

Examples

Let’s perform a series of validation steps on a Polars DataFrame. We’ll use col_vals_gt() in the first step, col_vals_lt() in the second step, and col_vals_ge() in the third step. The interrogate() method executes the validation; afterward, we can extract the rows that failed in each validation step.

import polars as pl
import pointblank as pb

tbl = pl.DataFrame(
    {
        "a": [5, 6, 5, 3, 6, 1],
        "b": [1, 2, 1, 5, 2, 6],
        "c": [3, 7, 2, 6, 3, 1],
    }
)

validation = (
    pb.Validate(data=tbl)
    .col_vals_gt(columns="a", value=4)
    .col_vals_lt(columns="c", value=5)
    .col_vals_ge(columns="b", value=1)
    .interrogate()
)

validation.get_data_extracts()
{1: shape: (2, 3)
 ┌─────┬─────┬─────┐
 │ a   ┆ b   ┆ c   │
 │ --- ┆ --- ┆ --- │
 │ i64 ┆ i64 ┆ i64 │
 ╞═════╪═════╪═════╡
 │ 3   ┆ 5   ┆ 6   │
 │ 1   ┆ 6   ┆ 1   │
 └─────┴─────┴─────┘,
 2: shape: (2, 3)
 ┌─────┬─────┬─────┐
 │ a   ┆ b   ┆ c   │
 │ --- ┆ --- ┆ --- │
 │ i64 ┆ i64 ┆ i64 │
 ╞═════╪═════╪═════╡
 │ 6   ┆ 2   ┆ 7   │
 │ 3   ┆ 5   ┆ 6   │
 └─────┴─────┴─────┘,
 3: shape: (0, 3)
 ┌─────┬─────┬─────┐
 │ a   ┆ b   ┆ c   │
 │ --- ┆ --- ┆ --- │
 │ i64 ┆ i64 ┆ i64 │
 ╞═════╪═════╪═════╡
 └─────┴─────┴─────┘}

The get_data_extracts() method returns a dictionary of tables, where each table contains a subset of rows from the original table. These are the rows that failed in each validation step.

In the first step, the col_vals_gt() method was used to check if the values in column a were greater than 4. The extracted table shows the rows where this condition was not met; look at the a column: neither value (3 and 1) is greater than 4.

In the second step, the col_vals_lt() method was used to check if the values in column c were less than 5. In the extracted two-row table, we see that the values in column c (7 and 6) are not less than 5.

The third step (col_vals_ge()) checked if the values in column b were greater than or equal to 1. There were no failing test units, so the extracted table is empty (i.e., has columns but no rows).

The i= argument can be used to narrow down the extraction to one or more steps. For example, to extract the rows that failed in the first step only:

validation.get_data_extracts(i=1)
{1: shape: (2, 3)
 ┌─────┬─────┬─────┐
 │ a   ┆ b   ┆ c   │
 │ --- ┆ --- ┆ --- │
 │ i64 ┆ i64 ┆ i64 │
 ╞═════╪═════╪═════╡
 │ 3   ┆ 5   ┆ 6   │
 │ 1   ┆ 6   ┆ 1   │
 └─────┴─────┴─────┘}

Note that the first validation step is indexed at 1 (not 0). This 1-based indexing is in place here to match the step numbers reported in the validation table. What we get back is still a dictionary, but it only contains one table (the one for the first step).

If you want to get the extracted table as a DataFrame, set frame=True and provide a scalar value for i. For example, to get the extracted table for the second step as a DataFrame:

validation.get_data_extracts(i=2, frame=True)
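shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 6   ┆ 2   ┆ 7   │
│ 3   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘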

The extracted table is now a DataFrame, which can serve as a more convenient format for further analysis or visualization.