Validate

Validate(self, data, tbl_name=None, label=None, thresholds=None)

Workflow for defining a set of validations on a table and interrogating for results.

The Validate class is used for defining a set of validation steps on a table and interrogating the table with the validation plan. This class is the main entry point for the data quality reporting workflow. The overall aim of this workflow is to generate comprehensive reporting information to assess the level of data quality for a target table.

We can supply as many validation steps as needed, and having a large number of them should increase the validation coverage for a given table. The validation methods (e.g., col_vals_gt(), col_vals_between(), etc.) translate to discrete validation steps, where each step will be sequentially numbered (useful when viewing the reporting data). This process of calling validation methods is known as developing a validation plan.

The validation methods, when called, are merely instructions up to the point the concluding interrogate() method is called. That kicks off the process of acting on the validation plan by querying the target table getting reporting results for each step. Once the interrogation process is complete, we can say that the workflow now has reporting information. We can then extract useful information from the reporting data to understand the quality of the table. For instance get_tabular_report() method which will return a table with the results of the interrogation and get_sundered_data() allows for the splitting of the table based on passing and failing rows.

Parameters

data : FrameT

The table to validate. Can be any of the table types described in the Supported Input Table Types section.

tbl_name : str | None = None

A optional name to assign to the input table object. If no value is provided, a name will be generated based on whatever information is available. This table name will be displayed in the header area of the HTML report generated by using the get_tabular_report() method.

label : str | None = None

An optional label for the validation plan. If no value is provided, a label will be generated based on the current system date and time. Markdown can be used here to make the label more visually appealing (it will appear in the header area of the HTML report).

thresholds : int | float | bool | tuple | dict | Thresholds | None = None

Generate threshold failure levels so that all validation steps can report and react accordingly when exceeding the set levels. This is to be created using one of several valid input schemes: (1) single integer/float denoting absolute number or fraction of failing test units for the ‘warn’ level, (2) a tuple of 1-3 values, (3) a dictionary of 1-3 entries, or a Thresholds object.

Returns

: Validate

A Validate object with the table and validations to be performed.

Supported Input Table Types

The data= parameter can be given any of the following table types:

  • Polars DataFrame ("polars")
  • Pandas DataFrame ("pandas")
  • DuckDB table ("duckdb")*
  • MySQL table ("mysql")*
  • PostgreSQL table ("postgresql")*
  • SQLite table ("sqlite")*
  • Parquet table ("parquet")*

The table types marked with an asterisk need to be prepared as Ibis tables (with type of ibis.expr.types.relations.Table). Furthermore, the use of Validate with such tables requires the Ibis library v9.5.0 and above to be installed. If the input table is a Polars or Pandas DataFrame, the Ibis library is not required.

Examples

Creating a validation plan and interrogating

Let’s walk through a data quality analysis of an extremely small table. It’s actually called small_table and it’s accessible through the load_dataset() function.

import pointblank as pb

# Load the small_table dataset
small_table = pb.load_dataset()

small_table
shape: (13, 8)
date_timedateabcdef
datetime[μs]datei64stri64f64boolstr
2016-01-04 11:00:002016-01-042"1-bcd-345"33423.29true"high"
2016-01-04 00:32:002016-01-043"5-egh-163"89999.99true"low"
2016-01-05 13:32:002016-01-056"8-kdg-938"32343.23true"high"
2016-01-06 17:23:002016-01-062"5-jdo-903"null3892.4false"mid"
2016-01-09 12:36:002016-01-098"3-ldm-038"7283.94true"low"
2016-01-20 04:30:002016-01-203"5-bce-642"9837.93false"high"
2016-01-20 04:30:002016-01-203"5-bce-642"9837.93false"high"
2016-01-26 20:07:002016-01-264"2-dmx-010"7833.98true"low"
2016-01-28 02:51:002016-01-282"7-dmx-010"8108.34false"low"
2016-01-30 11:23:002016-01-301"3-dka-303"null2230.09true"high"

We ought to think about what’s tolerable in terms of data quality so let’s designate proportional failure thresholds to the warn, stop, and notify states. This can be done by using the Thresholds class.

thresholds = pb.Thresholds(warn_at=0.10, stop_at=0.25, notify_at=0.35)

Now, we use the Validate class and give it the thresholds object (which serves as a default for all validation steps but can be overridden). The static thresholds provided in thresholds will make the reporting a bit more useful. We also need to provide a target table and we’ll use small_table for this.

validation = (
    pb.Validate(
        data=small_table,
        tbl_name="small_table",
        label="`Validate` example.",
        thresholds=thresholds
    )
)

Then, as with any Validate object, we can add steps to the validation plan by using as many validation methods as we want. To conclude the process (and actually query the data table), we use the interrogate() method.

validation = (
    validation
    .col_vals_gt(columns="d", value=100)
    .col_vals_le(columns="c", value=5)
    .col_vals_between(columns="c", left=3, right=10, na_pass=True)
    .col_vals_regex(columns="b", pattern=r"[0-9]-[a-z]{3}-[0-9]{3}")
    .col_exists(columns=["date", "date_time"])
    .interrogate()
)

The validation object can be printed as a reporting table.

validation
Pointblank Validation
`Validate` example.
Polarssmall_tableWARN0.1STOP0.25NOTIFY0.35
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT
#4CA64C 1
col_vals_gt
col_vals_gt()
d 100 13 13
1.00
0
0.00
#CF142B 2
col_vals_lte
col_vals_le()
c 5 13 5
0.38
6
0.46
#4CA64C66 3
col_vals_between
col_vals_between()
c [3, 10] 13 12
0.92
1
0.08
#4CA64C 4
col_vals_regex
col_vals_regex()
b [0-9]-[a-z]{3}-[0-9]{3} 13 13
1.00
0
0.00
#4CA64C 5
col_exists
col_exists()
date 1 1
1.00
0
0.00
#4CA64C 6
col_exists
col_exists()
date_time 1 1
1.00
0
0.00
2024-12-20 15:08:52 UTC< 1 s2024-12-20 15:08:52 UTC