Schema

Schema(self, columns=None, tbl=None, **kwargs)

Definition of a schema object.

The schema object defines the structure of a table. Once it is defined, the object can be used in a validation workflow, using Validate and its methods, to ensure that the structure of a table matches the expected schema. The validation method that works with the schema object is called col_schema_match().

A schema for a table can be constructed with the Schema class in a number of ways:

  1. providing a list of column names to columns= (to check only the column names)
  2. using a list of two-element tuples in columns= (to check both column names and dtypes; the list should take the form [(column_name, dtype), ...])
  3. providing a dictionary to columns=, where the keys are column names and the values are dtypes
  4. providing individual column arguments in the form of keyword arguments (constructed as column_name=dtype)
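
As a rough illustration of how these four input forms relate, the sketch below reduces each one to a single list of (column_name, dtype) pairs, using None where no dtype was given. The helper normalize_columns() is hypothetical and is not part of pointblank; it only mirrors the documented rule that columns= takes precedence over keyword arguments.

```python
from typing import Optional

# Hypothetical helper (not a pointblank function): reduce every accepted
# input form to one list of (column_name, dtype_or_None) pairs.
ColumnSpec = list[tuple[str, Optional[str]]]


def normalize_columns(columns=None, **kwargs) -> ColumnSpec:
    if columns is None:
        # Form 4: keyword arguments, used only when columns= is absent
        return list(kwargs.items())
    if isinstance(columns, str):
        # A single column name given as a bare string
        return [(columns, None)]
    if isinstance(columns, dict):
        # Form 3: dictionary of column name -> dtype
        return list(columns.items())
    normalized: ColumnSpec = []
    for item in columns:
        if isinstance(item, str):
            # Form 1: names only, no dtype to check
            normalized.append((item, None))
        elif len(item) == 1:
            # One-element tuple: name only
            normalized.append((item[0], None))
        else:
            # Form 2: (column_name, dtype) tuple
            name, dtype = item
            normalized.append((name, dtype))
    return normalized


names_only = normalize_columns(["name", "age"])
from_tuples = normalize_columns([("name", "String"), ("age", "Int64")])
from_dict = normalize_columns({"name": "String", "age": "Int64"})
from_kwargs = normalize_columns(name="String", age="Int64")
```

In this sketch the tuple, dictionary, and keyword forms all produce the same list, while the names-only form carries None in place of each dtype.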

The schema object can also be constructed by providing a DataFrame or Ibis table object (using the tbl= parameter) and the schema will be collected from either type of object. The schema object can be printed to display the column names and dtypes. Note that if tbl= is provided then there shouldn’t be any other inputs provided through either columns= or **kwargs.

Parameters

columns : str | list[str] | list[tuple[str, str]] | list[tuple[str]] | dict[str, str] | None = None

A list of strings (representing column names), a list of tuples (for column names and column dtypes), or a dictionary containing column and dtype information. If any of these inputs are provided here, it will take precedence over any column arguments provided via **kwargs.

tbl : any | None = None

A DataFrame (Polars or Pandas) or an Ibis table object from which the schema will be collected.

**kwargs

Individual column arguments that are in the form of [column]=[dtype]. These will be ignored if the columns= parameter is not None.

Returns

Schema

A schema object.

Examples

A schema can be constructed via the Schema class in multiple ways. Let’s use the following Polars DataFrame as a basis for constructing a schema:

import pointblank as pb
import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "height": [5.6, 6.0, 5.8]
})

You could provide Schema(columns=) a list of tuples containing column names and data types:

schema = pb.Schema(columns=[("name", "String"), ("age", "Int64"), ("height", "Float64")])

Alternatively, a dictionary containing column names and dtypes also works:

schema = pb.Schema(columns={"name": "String", "age": "Int64", "height": "Float64"})

Another input method involves using individual column arguments in the form of keyword arguments:

schema = pb.Schema(name="String", age="Int64", height="Float64")

Finally, you could also provide a DataFrame (Polars or Pandas) or an Ibis table object to tbl= and the schema will be collected:

schema = pb.Schema(tbl=df)

Whichever method you choose, you can verify the schema inputs by printing the schema object:

print(schema)
Pointblank Schema
  name: String
  age: Int64
  height: Float64

The Schema object can be used to validate the structure of a table. The relevant Validate method is col_schema_match(). In a validation workflow, you’ll have a target table (defined at the beginning of the workflow), and you might want to ensure that your expectations of its structure are met. Here’s an example of how you could use col_schema_match() in a validation workflow:

# Define the schema
schema = pb.Schema(name="String", age="Int64", height="Float64")

# Define a validation that checks the schema against the table (`df`)
validation = (
    pb.Validate(data=df)
    .col_schema_match(schema=schema)
    .interrogate()
)

# Display the validation results
validation
[Validation report: step 1, col_schema_match(), VALUES = SCHEMA, UNITS = 1, PASS = 1 (1.00), FAIL = 0 (0.00)]

The col_schema_match() validation method will validate the structure of the table against the schema during interrogation. If the structure of the table does not match the schema, the single test unit will fail. In this case, the defined schema matched the structure of the table, so the validation passed.

We can also choose to check only the column names of the target table. This can be done by providing a simplified Schema object, which is given a list of column names:

schema = pb.Schema(columns=["name", "age", "height"])

validation = (
    pb.Validate(data=df)
    .col_schema_match(schema=schema)
    .interrogate()
)

validation
[Validation report: step 1, col_schema_match(), VALUES = SCHEMA, UNITS = 1, PASS = 1 (1.00), FAIL = 0 (0.00)]

Here, only the column names of the table are checked against the schema during interrogation. If the column names do not match, the single test unit will fail. The defined schema matched the column names of the table, so the validation passed.
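
To make the difference between a full check and a names-only check concrete, here is a minimal conceptual sketch of the comparison. The function schema_matches() is hypothetical and is not how pointblank implements col_schema_match(); it assumes a strict comparison in which columns must appear in the same order and in the same number, and it skips the dtype comparison whenever the expected dtype is None. The real method may well be more flexible than this.

```python
from typing import Optional

# Each schema is a list of (column_name, dtype) pairs; a dtype of None in
# the expected schema means "check the column name only".
Pairs = list[tuple[str, Optional[str]]]


def schema_matches(expected: Pairs, actual: Pairs) -> bool:
    """Hypothetical strict comparison: same length, same order."""
    if len(expected) != len(actual):
        return False
    for (exp_name, exp_dtype), (act_name, act_dtype) in zip(expected, actual):
        if exp_name != act_name:
            return False
        # Compare dtypes only when the expected schema specifies one
        if exp_dtype is not None and exp_dtype != act_dtype:
            return False
    return True


# Structure of the example table `df`, written out by hand
actual = [("name", "String"), ("age", "Int64"), ("height", "Float64")]

full_ok = schema_matches(
    [("name", "String"), ("age", "Int64"), ("height", "Float64")], actual
)
names_ok = schema_matches(
    [("name", None), ("age", None), ("height", None)], actual
)
wrong_dtype = schema_matches(
    [("name", "String"), ("age", "String"), ("height", "Float64")], actual
)
```

Under these assumptions, the full schema and the names-only schema both match the example table, while a schema that declares age as a String does not, which corresponds to the single test unit failing.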