Workflow for defining a set of validations on a table and interrogating for results.
The Validate class is used for defining a set of validation steps on a table and interrogating the table with the validation plan. This class is the main entry point for the data quality reporting workflow. The overall aim of this workflow is to generate comprehensive reporting information to assess the level of data quality for a target table.
We can supply as many validation steps as needed, and having a large number of them should increase the validation coverage for a given table. The validation methods (e.g., col_vals_gt(), col_vals_between(), etc.) translate to discrete validation steps, where each step will be sequentially numbered (useful when viewing the reporting data). This process of calling validation methods is known as developing a validation plan.
The validation methods, when called, are merely instructions up to the point the concluding interrogate() method is called. That kicks off the process of acting on the validation plan by querying the target table getting reporting results for each step. Once the interrogation process is complete, we can say that the workflow now has reporting information. We can then extract useful information from the reporting data to understand the quality of the table. For instance get_tabular_report() method which will return a table with the results of the interrogation and get_sundered_data() allows for the splitting of the table based on passing and failing rows.
Parameters
data:FrameT
The table to validate. Can be any of the table types described in the Supported Input Table Types section.
tbl_name:str | None=None
A optional name to assign to the input table object. If no value is provided, a name will be generated based on whatever information is available. This table name will be displayed in the header area of the HTML report generated by using the get_tabular_report() method.
label:str | None=None
An optional label for the validation plan. If no value is provided, a label will be generated based on the current system date and time. Markdown can be used here to make the label more visually appealing (it will appear in the header area of the HTML report).
Generate threshold failure levels so that all validation steps can report and react accordingly when exceeding the set levels. This is to be created using one of several valid input schemes: (1) single integer/float denoting absolute number or fraction of failing test units for the ‘warn’ level, (2) a tuple of 1-3 values, (3) a dictionary of 1-3 entries, or a Thresholds object.
Returns
:Validate
A Validate object with the table and validations to be performed.
Supported Input Table Types
The data= parameter can be given any of the following table types:
Polars DataFrame ("polars")
Pandas DataFrame ("pandas")
DuckDB table ("duckdb")*
MySQL table ("mysql")*
PostgreSQL table ("postgresql")*
SQLite table ("sqlite")*
Parquet table ("parquet")*
The table types marked with an asterisk need to be prepared as Ibis tables (with type of ibis.expr.types.relations.Table). Furthermore, the use of Validate with such tables requires the Ibis library v9.5.0 and above to be installed. If the input table is a Polars or Pandas DataFrame, the Ibis library is not required.
Examples
Creating a validation plan and interrogating
Let’s walk through a data quality analysis of an extremely small table. It’s actually called small_table and it’s accessible through the load_dataset() function.
import pointblank as pb# Load the small_table datasetsmall_table = pb.load_dataset()small_table
shape: (13, 8)
date_time
date
a
b
c
d
e
f
datetime[μs]
date
i64
str
i64
f64
bool
str
2016-01-04 11:00:00
2016-01-04
2
"1-bcd-345"
3
3423.29
true
"high"
2016-01-04 00:32:00
2016-01-04
3
"5-egh-163"
8
9999.99
true
"low"
2016-01-05 13:32:00
2016-01-05
6
"8-kdg-938"
3
2343.23
true
"high"
2016-01-06 17:23:00
2016-01-06
2
"5-jdo-903"
null
3892.4
false
"mid"
2016-01-09 12:36:00
2016-01-09
8
"3-ldm-038"
7
283.94
true
"low"
…
…
…
…
…
…
…
…
2016-01-20 04:30:00
2016-01-20
3
"5-bce-642"
9
837.93
false
"high"
2016-01-20 04:30:00
2016-01-20
3
"5-bce-642"
9
837.93
false
"high"
2016-01-26 20:07:00
2016-01-26
4
"2-dmx-010"
7
833.98
true
"low"
2016-01-28 02:51:00
2016-01-28
2
"7-dmx-010"
8
108.34
false
"low"
2016-01-30 11:23:00
2016-01-30
1
"3-dka-303"
null
2230.09
true
"high"
We ought to think about what’s tolerable in terms of data quality so let’s designate proportional failure thresholds to the warn, stop, and notify states. This can be done by using the Thresholds class.
Now, we use the Validate class and give it the thresholds object (which serves as a default for all validation steps but can be overridden). The static thresholds provided in thresholds will make the reporting a bit more useful. We also need to provide a target table and we’ll use small_table for this.
Then, as with any Validate object, we can add steps to the validation plan by using as many validation methods as we want. To conclude the process (and actually query the data table), we use the interrogate() method.