import pointblank as pb
import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "height": [5.6, 6.0, 5.8]
})
Schema
Schema(self, columns=None, tbl=None, **kwargs)
Definition of a schema object.
The schema object defines the structure of a table. Once it is defined, the object can be used in a validation workflow, using Validate and its methods, to ensure that the structure of a table matches the expected schema. The validation method that works with the schema object is called col_schema_match().
A schema for a table can be constructed with the Schema class in a number of ways:
- providing a list of column names to columns= (to check only the column names)
- using a list of two-element tuples in columns= (to check both column names and dtypes; the list should have the form [(column_name, dtype), ...])
- providing a dictionary to columns=, where the keys are column names and the values are dtypes
- providing individual column arguments in the form of keyword arguments (constructed as column_name=dtype)
The schema object can also be constructed by providing a DataFrame or Ibis table object (using the tbl= parameter), and the schema will be collected from either type of object. The schema object can be printed to display the column names and dtypes. Note that if tbl= is provided, no other inputs should be provided through either columns= or **kwargs.
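The input forms listed above all describe the same thing: an ordered set of column names, each with an optional dtype. The following is a minimal plain-Python sketch of how those forms could be reduced to one representation. The helper name normalize_columns is hypothetical and this is not Pointblank's internal code; the tbl= route is left out, and the rule that columns= wins over keyword arguments is taken from the parameter descriptions on this page.

```python
def normalize_columns(columns=None, **kwargs):
    """Hypothetical helper: reduce any accepted input form to (name, dtype) pairs.

    A dtype of None stands for "check the column name only".
    """
    if columns is None:
        # Only keyword arguments were given, each as column_name=dtype
        return list(kwargs.items())

    # columns= was given, so keyword arguments are ignored from here on
    if isinstance(columns, str):
        return [(columns, None)]
    if isinstance(columns, dict):
        return list(columns.items())

    pairs = []
    for item in columns:
        if isinstance(item, str):
            # A bare column name
            pairs.append((item, None))
        elif len(item) == 1:
            # A one-element tuple: name without a dtype
            pairs.append((item[0], None))
        else:
            # A two-element tuple: (column_name, dtype)
            name, dtype = item
            pairs.append((name, dtype))
    return pairs


names_only = normalize_columns(columns=["name", "age", "height"])
from_tuples = normalize_columns(columns=[("name", "String"), ("age", "Int64")])
from_dict = normalize_columns(columns={"name": "String", "age": "Int64"})
from_kwargs = normalize_columns(name="String", age="Int64")
```

In this sketch the tuple, dictionary, and keyword forms produce identical pairs, while the names-only form carries no dtype information, which is why a schema built that way can only check column names.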
Parameters
columns : str | list[str] | list[tuple[str, str]] | list[tuple[str]] | dict[str, str] | None = None
    A list of strings (representing column names), a list of tuples (for column names and column dtypes), or a dictionary containing column and dtype information. If any of these inputs are provided here, they will take precedence over any column arguments provided via **kwargs.
tbl : any | None = None
    A DataFrame (Polars or Pandas) or an Ibis table object from which the schema will be collected.
**kwargs : = {}
    Individual column arguments that are in the form of [column]=[dtype]. These will be ignored if the columns= parameter is not None.
Returns
: Schema
    A schema object.
Examples
A schema can be constructed via the Schema class in multiple ways. Let’s use the Polars DataFrame df, defined in the code at the top of this page, as a basis for constructing a schema.
You could provide Schema(columns=) a list of tuples containing column names and data types:
schema = pb.Schema(columns=[("name", "String"), ("age", "Int64"), ("height", "Float64")])
Alternatively, a dictionary containing column names and dtypes also works:
schema = pb.Schema(columns={"name": "String", "age": "Int64", "height": "Float64"})
Another input method involves using individual column arguments in the form of keyword arguments:
schema = pb.Schema(name="String", age="Int64", height="Float64")
Finally, you could also provide a DataFrame (Polars or Pandas) or an Ibis table object to tbl= and the schema will be collected:
schema = pb.Schema(tbl=df)
Whichever method you choose, you can verify the schema inputs by printing the schema object:
print(schema)
Pointblank Schema
name: String
age: Int64
height: Float64
The Schema object can be used to validate the structure of a table against the schema. The relevant Validate method for this is col_schema_match(). In a validation workflow, you’ll have a target table (defined at the beginning of the workflow) and you might want to ensure that your expectations of the table structure are met. Here’s an example of how you could use the col_schema_match() method with a Schema object in a validation workflow:
# Define the schema
schema = pb.Schema(name="String", age="Int64", height="Float64")

# Define a validation that checks the schema against the table (`df`)
validation = (
    pb.Validate(data=df)
    .col_schema_match(schema=schema)
    .interrogate()
)

# Display the validation results
validation
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT
---|---|---|---|---|---|---|---|---|---|---|---
1 | | | | ✓ | 1 | 1 (1.00) | 0 (0.00) | — | — | — | —
The col_schema_match() validation method will validate the structure of the table against the schema during interrogation. If the structure of the table does not match the schema, the single test unit will fail. In this case, the defined schema matched the structure of the table, so the validation passed.
We can also choose to check only the column names of the target table. This can be done by providing a simplified Schema object, which is given a list of column names:
schema = pb.Schema(columns=["name", "age", "height"])

validation = (
    pb.Validate(data=df)
    .col_schema_match(schema=schema)
    .interrogate()
)

validation
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | S | N | EXT
---|---|---|---|---|---|---|---|---|---|---|---
1 | | | | ✓ | 1 | 1 (1.00) | 0 (0.00) | — | — | — | —
Here, interrogation compares only the column names of the table with those in the schema; dtypes are not checked. If the column names of the table do not match the schema, the single test unit will fail. The defined schema matched the column names of the table, so the validation passed.