Validate.get_sundered_data

Validate.get_sundered_data(type='pass')

Get the data that passed or failed the validation steps.

Validation of the data is one thing but, sometimes, you want to use the best part of the input dataset for something else. The get_sundered_data() method works with a Validate object that has been interrogated (i.e., the interrogate() method was used). We can get either the ‘pass’ data piece (rows with no failing test units across all row-based validation functions), or, the ‘fail’ data piece (rows with at least one failing test unit across the same series of validations).

Details

There are some caveats to sundering. The validation steps considered for this splitting will only involve steps where:

  • of certain check types, where test units are cells checked row-by-row (e.g., the col_vals_*() methods)
  • active= is not set to False
  • pre= has not been given an expression for modify the input table

So long as these conditions are met, the data will be split into two constituent tables: one with the rows that passed all validation steps and another with the rows that failed at least one validation step.

Parameters

type : = 'pass'

The type of data to return. Options are "pass" or "fail", where the former returns a table only containing rows where test units always passed validation steps, and the latter returns a table only containing rows had test units that failed in at least one validation step.

Returns

: FrameT

A table containing the data that passed or failed the validation steps.

Examples

Let’s create a Validate object with three validation steps and then interrogate the data.

import polars as pl
import pointblank as pb

tbl = pl.DataFrame(
    {
        "a": [7, 6, 9, 7, 3, 2],
        "b": [9, 8, 10, 5, 10, 6],
        "c": ["c", "d", "a", "b", "a", "b"]
    }
)

validation = (
    pb.Validate(data=tbl)
    .col_vals_gt(columns="a", value=5)
    .col_vals_in_set(columns="c", set=["a", "b"])
    .interrogate()
)

validation
Pointblank Validation
2024-12-20|15:09:20
Polars
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT
#4CA64C66 1
col_vals_gt
col_vals_gt()
a 5 6 4
0.67
2
0.33
#4CA64C66 2
col_vals_in_set
col_vals_in_set()
c a, b 6 4
0.67
2
0.33
2024-12-20 15:09:20 UTC< 1 s2024-12-20 15:09:20 UTC

From the validation table, we can see that the first and second steps each had 4 passing test units. A failing test unit will mark the entire row as failing in the context of the get_sundered_data() method. We can use this method to get the rows of data that passed the during interrogation.

validation.get_sundered_data()
shape: (2, 3)
abc
i64i64str
910"a"
75"b"

The returned DataFrame contains the rows that passed all validation steps. From the six-row input DataFrame, the first two rows and the last two rows had test units that failed validation. Thus the middle two rows are the only ones that passed all validation steps and that’s what we see in the returned DataFrame.