Validate.rows_distinct

Validate.rows_distinct(
    columns_subset=None,
    pre=None,
    thresholds=None,
    active=True,
)

Validate whether rows in the table are distinct.

The rows_distinct() method checks whether each row in the table is distinct from every other row. The number of test units equals the number of rows in the table (determined after any pre= mutation has been applied).

Parameters

columns_subset : str | list[str] | None = None

A single column or a list of columns to use as a subset for the distinct comparison. If None, then all columns in the table will be used for the comparison. If multiple columns are supplied, the distinct comparison will be made over the combination of values in those columns.

pre : Callable | None = None

A pre-processing function or lambda to apply to the data table for the validation step.

thresholds : int | float | bool | tuple | dict | Thresholds = None

Failure threshold levels so that the validation step can react accordingly when exceeding the set levels for different states (warn, stop, and notify). This can be created simply as an integer or float denoting the absolute number or fraction of failing test units for the ‘warn’ level. Otherwise, you can use a tuple of 1-3 values, a dictionary of 1-3 entries, or a Thresholds object.

active : bool = True

A boolean value indicating whether the validation step should be active. Using False will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged).

Returns

: Validate

The Validate object with the added validation step.

Examples

For the examples here, we’ll use a simple Polars DataFrame with three string columns (col_1, col_2, and col_3). The table is shown below:

import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "col_1": ["a", "b", "c", "d"],
        "col_2": ["a", "a", "c", "d"],
        "col_3": ["a", "a", "d", "e"],
    }
)

pb.preview(tbl)
     col_1    col_2    col_3
     String   String   String
1    a        a        a
2    b        a        a
3    c        c        d
4    d        d        e

Let’s validate that the rows in the table are distinct with rows_distinct(). We’ll determine if this validation had any failing test units (there are four test units, one for each row). A failing test unit means that a given row is not distinct from every other row.

validation = (
    pb.Validate(data=tbl)
    .rows_distinct()
    .interrogate()
)

validation
STEP   TYPE               COLUMNS       UNITS   PASS       FAIL
1      rows_distinct()    ALL COLUMNS   4       4 (1.00)   0 (0.00)

From this validation table we see that there are no failing test units. All rows in the table are distinct from one another.

We can also use a subset of columns to determine distinctness. Let’s specify the subset using columns col_2 and col_3 for the next validation.

validation = (
    pb.Validate(data=tbl)
    .rows_distinct(columns_subset=["col_2", "col_3"])
    .interrogate()
)

validation
STEP   TYPE               COLUMNS        UNITS   PASS       FAIL
1      rows_distinct()    col_2, col_3   4       2 (0.50)   2 (0.50)

The validation table reports two failing test units. The first and second rows are duplicated when considering only the values in columns col_2 and col_3. There’s only one set of duplicates but there are two failing test units since each row is compared to all others.
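The counting rule described above can be illustrated without pointblank. What follows is a minimal plain-Python sketch of that rule, not the library's implementation: count_failing_units is a hypothetical helper, and the list-of-dicts layout is an assumption made only to keep the example free of dependencies. It marks every row whose key (the tuple of values in the chosen columns) occurs more than once, which is why one pair of duplicates yields two failing test units.

```python
from collections import Counter

# Same data as the example table, written as one dict per row
# (an assumed layout for illustration; pointblank works on DataFrames).
rows = [
    {"col_1": "a", "col_2": "a", "col_3": "a"},
    {"col_1": "b", "col_2": "a", "col_3": "a"},
    {"col_1": "c", "col_2": "c", "col_3": "d"},
    {"col_1": "d", "col_2": "d", "col_3": "e"},
]


def count_failing_units(rows, columns_subset=None):
    """Hypothetical helper: count rows whose key occurs more than once."""
    if not rows:
        return 0
    # Mirror the documented parameter: None means all columns,
    # a single string means one column, otherwise a list of columns.
    if columns_subset is None:
        columns = list(rows[0].keys())
    elif isinstance(columns_subset, str):
        columns = [columns_subset]
    else:
        columns = list(columns_subset)
    # Build one key per row from the selected columns.
    keys = [tuple(row[c] for c in columns) for row in rows]
    occurrences = Counter(keys)
    # Every member of a duplicated group counts as a failing unit.
    return sum(1 for key in keys if occurrences[key] > 1)


failing_all = count_failing_units(rows)
failing_subset = count_failing_units(rows, ["col_2", "col_3"])
print(failing_all, failing_subset)  # 0 2
```

With all columns, the four keys are unique, so nothing fails. With col_2 and col_3 only, rows 1 and 2 share the key ("a", "a"), so both are counted.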