import pointblank as pb
import polars as pl
tbl = pl.DataFrame(
    {
        "col_1": ["a", "b", "c", "d"],
        "col_2": ["a", "a", "c", "d"],
        "col_3": ["a", "a", "d", "e"],
    }
)
pb.preview(tbl)
Validate.rows_distinct
Validate.rows_distinct(
    columns_subset=None,
    pre=None,
    thresholds=None,
    active=True,
)
Validate whether rows in the table are distinct.
The rows_distinct() method checks whether rows in the table are distinct. This validation operates over a number of test units equal to the number of rows in the table (determined after any pre= mutation has been applied).
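The idea of one test unit per row can be sketched in plain Python. This is only an illustration of the concept described above, not pointblank's implementation: the helper `distinct_flags` is hypothetical, and rows are modeled as a list of dictionaries rather than a DataFrame.

```python
from collections import Counter


def distinct_flags(rows, columns_subset=None, pre=None):
    """Return one boolean per row (test unit): True if the row is distinct.

    Hypothetical helper for illustration; not part of pointblank.
    """
    # Apply the optional pre-processing step first. The number of test
    # units is the number of rows that remain afterwards.
    if pre is not None:
        rows = pre(rows)
    # Accept a single column name or a list of names; default to all columns.
    if columns_subset is None:
        columns = list(rows[0].keys()) if rows else []
    elif isinstance(columns_subset, str):
        columns = [columns_subset]
    else:
        columns = list(columns_subset)
    # A row is distinct when its combination of values occurs exactly once.
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return [counts[key] == 1 for key in keys]


rows = [
    {"col_1": "a", "col_2": "a", "col_3": "a"},
    {"col_1": "b", "col_2": "a", "col_3": "a"},
    {"col_1": "c", "col_2": "c", "col_3": "d"},
    {"col_1": "d", "col_2": "d", "col_3": "e"},
]

# All columns: four test units, one per row.
all_cols = distinct_flags(rows)
# Dropping the first row before the check leaves three test units.
after_pre = distinct_flags(rows, columns_subset="col_2", pre=lambda r: r[1:])
```

Here `all_cols` has four entries and `after_pre` has three, which mirrors how the count of test units follows the table as it stands after any `pre=` step.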
Parameters
columns_subset : str | list[str] | None = None
    A single column or a list of columns to use as a subset for the distinct comparison. If None, then all columns in the table will be used for the comparison. If multiple columns are supplied, the distinct comparison will be made over the combination of values in those columns.
pre : Callable | None = None
    A pre-processing function or lambda to apply to the data table for the validation step.
thresholds : int | float | bool | tuple | dict | Thresholds = None
    Failure threshold levels so that the validation step can react accordingly when exceeding the set levels for different states (warn, stop, and notify). This can be created simply as an integer or float denoting the absolute number or fraction of failing test units for the 'warn' level. Otherwise, you can use a tuple of 1-3 values, a dictionary of 1-3 entries, or a Thresholds object.
active : bool = True
    A boolean value indicating whether the validation step should be active. Using False will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged).
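To make the integer-versus-float distinction for a single-number threshold concrete, here is a minimal sketch. The function `reaches_warn` is hypothetical and not part of pointblank, and the use of a greater-than-or-equal comparison is an assumption made for illustration; consult the library itself for the exact rule.

```python
from typing import Union


def reaches_warn(n_failed: int, n_units: int, threshold: Union[int, float]) -> bool:
    """Sketch: does a step reach a single-number 'warn' threshold?

    Hypothetical helper; the >= comparison is an assumption.
    """
    if isinstance(threshold, float):
        # A float is read as a fraction of all test units.
        return n_units > 0 and n_failed / n_units >= threshold
    # An integer is read as an absolute number of failing test units.
    return n_failed >= threshold


# Two failing rows out of four test units.
print(reaches_warn(2, 4, 0.5))  # fraction 0.50 against 0.5
print(reaches_warn(2, 4, 3))    # count 2 against 3
```

Under these assumptions the first call reports that the level is reached and the second reports that it is not.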
Returns
Validate
    The Validate object with the added validation step.
Examples
For the examples here, we’ll use a simple Polars DataFrame with three string columns (col_1, col_2, and col_3). The table is defined by the code at the top of this page.
Let’s validate that the rows in the table are distinct with rows_distinct(). We’ll determine if this validation had any failing test units (there are four test units, one for each row). A failing test unit means that a given row is not distinct from every other row.
validation = (
    pb.Validate(data=tbl)
    .rows_distinct()
    .interrogate()
)
validation
| | STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | rows_distinct() | | | | ✓ | 4 | 4 (1.00) | 0 (0.00) | — | — | — | — |
From this validation table we see that there are no failing test units. All rows in the table are distinct from one another.
We can also use a subset of columns to determine distinctness. Let’s specify the subset using columns col_2 and col_3 for the next validation.
validation = (
    pb.Validate(data=tbl)
    .rows_distinct(columns_subset=["col_2", "col_3"])
    .interrogate()
)
validation
| | STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | rows_distinct() | | | | ✓ | 4 | 2 (0.50) | 2 (0.50) | — | — | — | — |
The validation table reports two failing test units. The first and second rows are duplicated when considering only the values in columns col_2 and col_3. There’s only one set of duplicates but there are two failing test units since each row is compared to all others.
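The arithmetic behind the two failing test units can be checked by hand with a short plain-Python tally. This is a sketch that mirrors the reasoning above rather than pointblank's internals; the variable names are illustrative and row positions are zero-based.

```python
from collections import Counter

# Values of the two subset columns, taken from the example table.
col_2 = ["a", "a", "c", "d"]
col_3 = ["a", "a", "d", "e"]

# One (col_2, col_3) combination per row, i.e. one per test unit.
keys = list(zip(col_2, col_3))
counts = Counter(keys)

# Every row whose combination occurs more than once is a failing unit.
failing = [i for i, key in enumerate(keys) if counts[key] > 1]
duplicate_sets = [key for key, n in counts.items() if n > 1]

n_units = len(keys)
n_fail = len(failing)
n_pass = n_units - n_fail

print(duplicate_sets)                      # [('a', 'a')]
print(failing)                             # [0, 1]
print(n_pass / n_units, n_fail / n_units)  # 0.5 0.5
```

A single duplicated combination, ('a', 'a'), accounts for both failing rows, which matches the 0.50 pass and 0.50 fail fractions in the validation table.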