Planning and Prep

To understand your data a bit more, use the scan_data() function. It provides a comprehensive report on the table you plan to validate. To carry out a data quality analysis, create an agent with create_agent(): tell the agent which table is the target, then devise a stepwise validation plan with validation functions. If collecting and publishing table information is your goal, begin with the create_informant() function instead.

scan_data()
Thoroughly scan a table to better understand it
create_agent()
Create a pointblank agent object
create_informant()
Create a pointblank informant object
validate_rmd()
Perform pointblank validation testing within R Markdown documents
action_levels() warn_on_fail() stop_on_fail()
Set action levels: failure thresholds and functions to invoke
db_tbl()
Get a table from a database
file_tbl()
Get a table from a local or remote file
tbl_store()
Define a store of tables with table-prep formulas: a table store
tbl_source()
Obtain a table-prep formula from a table store
tbl_get()
Obtain a materialized table via a table store
draft_validation()
Draft a starter pointblank validation .R/.Rmd file with a data table
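
As a rough sketch of the two starting points described above, the following creates an agent and an informant for the small_table dataset that ships with pointblank. The threshold values, the tbl_name, and the labels are illustrative choices, not recommendations.

```r
library(pointblank)

# Illustrative thresholds: warn when 10% of test units fail, stop at 25%
al <- action_levels(warn_at = 0.10, stop_at = 0.25)

# The agent is told which table is the target; validation steps are added later
agent <- create_agent(
  tbl = small_table,
  tbl_name = "small_table",
  label = "A starter validation plan",
  actions = al
)

# The informant collects table information for later publishing
informant <- create_informant(
  tbl = small_table,
  tbl_name = "small_table",
  label = "Information about small_table"
)
```

Neither object has done any work at this point: the agent still needs validation steps and an interrogation, and the informant still needs info text.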

Validation, Expectation, and Test Functions

Validation functions are used either with an agent object or, more simply, directly with the table of interest. When used with an agent, each validation function adds a step to a validation plan (which is executed with the interrogate() function). If one or more validation functions are used directly on data (that is, no agent is involved whatsoever), then the data is checked and passed through if there are no problems. We can fine-tune the warn_* and/or stop_* thresholds so that if the level of failed validation units exceeds those set levels, we get either a warning or an error. Each validation function is associated with an expectation function (of the form expect_*()). These expectation functions are equivalent in usage and behavior to those in the testthat package. Finally, each validation function has an associated test function (of the form test_*()) that always returns a logical value.

col_vals_lt() expect_col_vals_lt() test_col_vals_lt()
Are column data less than a fixed value or data in another column?
col_vals_lte() expect_col_vals_lte() test_col_vals_lte()
Are column data less than or equal to a fixed value or data in another column?
col_vals_equal() expect_col_vals_equal() test_col_vals_equal()
Are column data equal to a fixed value or data in another column?
col_vals_not_equal() expect_col_vals_not_equal() test_col_vals_not_equal()
Are column data not equal to a fixed value or data in another column?
col_vals_gte() expect_col_vals_gte() test_col_vals_gte()
Are column data greater than or equal to a fixed value or data in another column?
col_vals_gt() expect_col_vals_gt() test_col_vals_gt()
Are column data greater than a fixed value or data in another column?
col_vals_between() expect_col_vals_between() test_col_vals_between()
Do column data lie between two specified values or data in other columns?
col_vals_not_between() expect_col_vals_not_between() test_col_vals_not_between()
Do column data lie outside of two specified values or data in other columns?
col_vals_in_set() expect_col_vals_in_set() test_col_vals_in_set()
Are column data part of a specified set of values?
col_vals_not_in_set() expect_col_vals_not_in_set() test_col_vals_not_in_set()
Are column data not part of a specified set of values?
col_vals_make_set() expect_col_vals_make_set() test_col_vals_make_set()
Is a set of values entirely accounted for in a column of values?
col_vals_make_subset() expect_col_vals_make_subset() test_col_vals_make_subset()
Is a set of values a subset of a column of values?
col_vals_increasing() expect_col_vals_increasing() test_col_vals_increasing()
Are column data increasing by row?
col_vals_decreasing() expect_col_vals_decreasing() test_col_vals_decreasing()
Are column data decreasing by row?
col_vals_null() expect_col_vals_null() test_col_vals_null()
Are column data NULL/NA?
col_vals_not_null() expect_col_vals_not_null() test_col_vals_not_null()
Are column data not NULL/NA?
col_vals_regex() expect_col_vals_regex() test_col_vals_regex()
Do strings in column data match a regex pattern?
col_vals_within_spec() expect_col_vals_within_spec() test_col_vals_within_spec()
Do values in column data fit within a specification?
col_vals_expr() expect_col_vals_expr() test_col_vals_expr()
Do column data agree with a predicate expression?
rows_distinct() expect_rows_distinct() test_rows_distinct()
Are row data distinct?
rows_complete() expect_rows_complete() test_rows_complete()
Are row data complete?
col_is_character() expect_col_is_character() test_col_is_character()
Do the columns contain character/string data?
col_is_numeric() expect_col_is_numeric() test_col_is_numeric()
Do the columns contain numeric values?
col_is_integer() expect_col_is_integer() test_col_is_integer()
Do the columns contain integer values?
col_is_logical() expect_col_is_logical() test_col_is_logical()
Do the columns contain logical values?
col_is_date() expect_col_is_date() test_col_is_date()
Do the columns contain R Date objects?
col_is_posix() expect_col_is_posix() test_col_is_posix()
Do the columns contain POSIXct dates?
col_is_factor() expect_col_is_factor() test_col_is_factor()
Do the columns contain R factor objects?
col_exists() expect_col_exists() test_col_exists()
Do one or more columns actually exist?
col_schema_match() expect_col_schema_match() test_col_schema_match()
Do columns in the table (and their types) match a predefined schema?
row_count_match() expect_row_count_match() test_row_count_match()
Does the row count match that of a different table?
col_count_match() expect_col_count_match() test_col_count_match()
Does the column count match that of a different table?
tbl_match() expect_tbl_match() test_tbl_match()
Does the target table match a comparison table?
conjointly() expect_conjointly() test_conjointly()
Perform multiple rowwise validations for joint validity
serially() expect_serially() test_serially()
Run several tests and a final validation in a serial manner
specially() expect_specially() test_specially()
Perform a specialized validation with a user-defined function
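
The three usage modes described at the top of this section can be sketched as follows, using the bundled small_table dataset. The columns and comparison values are picked only for illustration, and vars() is used for column selection on the assumption that it is accepted by the installed pointblank version.

```r
library(pointblank)

# 1. With an agent: each call adds a step; interrogate() runs the plan
agent <-
  create_agent(tbl = small_table, tbl_name = "small_table") %>%
  col_vals_gt(columns = vars(d), value = 100) %>%
  col_vals_in_set(columns = vars(f), set = c("low", "mid", "high")) %>%
  interrogate()

# 2. Directly on data: the table is passed through when the check succeeds
checked <- small_table %>% col_vals_gt(columns = vars(d), value = 100)

# 3. Test function: always returns a single logical value
d_ok <- test_col_vals_gt(small_table, columns = vars(d), value = 100)
a_ok <- test_col_vals_gt(small_table, columns = vars(a), value = 5)
```

The expect_*() variants follow the same argument pattern and are meant to be called inside testthat tests.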

Information Functions

We can progressively add information to an informant object by using the collection of info_*() functions. We can add table-level properties with the info_tabular() function, details about the nature of each column with the info_columns() function, and sections of our own choosing (along with the info that makes sense for those sections) with the info_section() function. Snippets of information can be gleaned from the target table by using the info_snippet() function. These bits of information can be incorporated in text defined by the other info_*() functions via { }. Some great snip_*() functions are included to make info-snipping as easy (and useful!) as can be.

info_tabular()
Add information that focuses on aspects of the data table as a whole
info_columns()
Add information that focuses on aspects of a data table's columns
info_columns_from_tbl()
Add column information from another data table
info_section()
Add information that focuses on some key aspect of the data table
info_snippet()
Generate a useful text 'snippet' from the target table
snip_list()
A fn for info_snippet(): get a list of column categories
snip_stats()
A fn for info_snippet(): get an inline statistical summary
snip_lowest()
A fn for info_snippet(): get the lowest value from a column
snip_highest()
A fn for info_snippet(): get the highest value from a column
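
A minimal sketch of how the info_*() functions and a snippet fit together is shown below. The snippet name max_d, the section name, and all of the info text are hypothetical; the { } placeholder is only filled in once the informant is later passed through incorporate().

```r
library(pointblank)

informant <-
  create_informant(
    tbl = small_table,
    tbl_name = "small_table",
    label = "An informant example"
  ) %>%
  # Information about the table as a whole
  info_tabular(description = "A small table that is useful for testing.") %>%
  # Information about a single column
  info_columns(columns = vars(a), info = "An integer column.") %>%
  # A snippet named max_d (hypothetical name) drawn from the target table
  info_snippet(snippet_name = "max_d", fn = snip_highest(column = "d")) %>%
  # A custom section that refers to the snippet via { }
  info_section(
    section_name = "notes",
    summary = "The highest value in column d is {max_d}."
  )
```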

Emailing

Sometimes we want to email a report of a validation because of the importance of the information contained therein. The email_blast() function can be used within the end_fns argument of create_agent(), giving us options to send a customizable message only if specified conditions are met.

email_blast()
Send email at a validation step or at the end of an interrogation
email_create()
Create an email object from a pointblank agent or informant
stock_msg_body()
Provide simple email message body components: body
stock_msg_footer()
Provide simple email message body components: footer
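
The arrangement described above might look like the sketch below. The email addresses and the credentials key id are placeholders, and sending mail additionally assumes that the blastula package is installed and that an SMTP credentials key has been stored beforehand. Nothing is sent here because the agent is never interrogated.

```r
library(pointblank)

# Illustrative threshold: enter the 'notify' state when 25% of test units fail
al <- action_levels(notify_at = 0.25)

agent <-
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Validation with an email step",
    actions = al,
    end_fns = list(
      # Evaluated only at the end of an interrogation
      ~ email_blast(
        x,
        to = "recipient@example.com",                            # placeholder
        from = "sender@example.com",                             # placeholder
        msg_subject = "A pointblank validation occurred",
        credentials = blastula::creds_key(id = "smtp_example"),  # placeholder key id
        send_condition = ~ TRUE %in% x$notify
      )
    )
  ) %>%
  col_vals_gt(columns = vars(d), value = 100)
```

With this send_condition, a message would go out only if at least one step reached the notify state.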

Logging

Logging validation failure conditions is good practice during data quality analysis. The log4r_step() function allows for simple generation of log entries and specification of logging destinations.

log4r_step()
Enable logging of failure conditions at the validation step level
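
One way to wire log4r_step() into a validation plan is through the fns argument of action_levels(), as sketched here. The thresholds are illustrative, and the plan is deliberately not interrogated: doing so assumes that the log4r package is installed and would append entries to a log file in the working directory.

```r
library(pointblank)

# Call log4r_step() whenever a step enters the 'warn' or 'stop' state
al <- action_levels(
  warn_at = 0.1,
  stop_at = 0.2,
  fns = list(
    warn = ~ log4r_step(x),
    stop = ~ log4r_step(x)
  )
)

# A plan whose steps inherit the logging actions defined above
agent <-
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Validation with logging",
    actions = al
  ) %>%
  col_vals_gt(columns = vars(d), value = 300) %>%
  col_vals_in_set(columns = vars(f), set = c("low", "high"))
```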

Agent: Interrogate and Report

If we have an agent object that has a plan (i.e., validation steps), the interrogate() function instructs the agent to interrogate the target table. The agent will go to work and also perform specified side-effect functions at the step level and upon completion (if those functions are defined). After interrogation, we can get a report through printing; however, we can take advantage of more options by using the get_agent_report() function.

interrogate()
Given an agent that has a validation plan, perform an interrogation
get_agent_report()
Get a summary report from an agent
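
As a small illustration of the flow above, the sketch below interrogates a two-step plan and then asks for the report as a data frame rather than a display table. The choice of validation steps is arbitrary.

```r
library(pointblank)

agent <-
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Interrogate and report"
  ) %>%
  col_vals_not_null(columns = vars(c)) %>%
  rows_distinct() %>%
  interrogate()

# display_table = FALSE returns a data frame with one row per validation step
report_tbl <- get_agent_report(agent, display_table = FALSE)
```

Printing the agent itself, or calling get_agent_report() with its defaults, gives the formatted report instead.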

Informant: Incorporate and Report

If we have an informant object that has been loaded with information by using the info_*() functions, the incorporate() function works to regenerate snippets and integrate those into the info text. After refreshing the table information, we can get an information report through printing or by using the get_informant_report() function.

incorporate()
Given an informant object, update and incorporate table snippets
get_informant_report()
Get a table information report from an informant object
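
The sequence above can be sketched as follows. The snippet name f_levels and the info text are hypothetical; incorporate() is what evaluates the snippet against the target table and fills in the { } placeholder.

```r
library(pointblank)

informant <-
  create_informant(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Incorporate and report"
  ) %>%
  # A snippet named f_levels (hypothetical name) listing categories in column f
  info_snippet(snippet_name = "f_levels", fn = snip_list(column = "f")) %>%
  info_columns(columns = vars(f), info = "Categories seen: {f_levels}.") %>%
  # Regenerate snippets and integrate them into the info text
  incorporate()

# The information report is a display table
report <- get_informant_report(informant)
```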

Post-interrogation

The agent always has a special list called an x-list. Access it by calling get_agent_x_list() and you’ll have a smorgasbord of information about how the validation went down. Table extracts are collected by default for failed rows (up to a limit) and we can access those with get_data_extracts(). Table rows can be sundered into ‘pass’ and ‘fail’ pieces. Access those table fragments with the get_sundered_data() function. Want to know if all validation steps have passed with flying colors? Sometimes that could happen; use the all_passed() function to find out.

get_agent_x_list()
Get the agent's x-list
get_data_extracts()
Collect data extracts from a validation step
get_sundered_data()
Sunder the data, splitting it into 'pass' and 'fail' pieces
all_passed()
Did all of the validations fully pass?
write_testthat_file()
Transform a pointblank agent to a testthat test file
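
The post-interrogation helpers can be tried out with a single step that is expected to have some failing rows, since column c of small_table contains missing values. This is only a sketch; the element names available in the x-list are best checked with names(x).

```r
library(pointblank)

agent <-
  create_agent(tbl = small_table, tbl_name = "small_table") %>%
  col_vals_not_null(columns = vars(c)) %>%
  interrogate()

# Did every step pass completely?
passed <- all_passed(agent)

# The x-list holds detailed results of the interrogation
x <- get_agent_x_list(agent)

# Rows that failed in step 1
failed_rows <- get_data_extracts(agent, i = 1)

# Split the table into rows that passed all steps and rows that did not
pass_piece <- get_sundered_data(agent, type = "pass")
fail_piece <- get_sundered_data(agent, type = "fail")
```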

Object Ops

We have options for writing an agent or informant to disk with the x_write_disk() function. The on-disk object can be retrieved with the x_read_disk() function. You can export a report as an HTML file with export_report(). A set of functions is also available for setting a data table to an existing object and for editing an agent’s validation steps.

x_write_disk()
Write an agent, informant, multiagent, or table scan to disk
x_read_disk()
Read an agent, informant, multiagent, or table scan from disk
export_report()
Export an agent, informant, multiagent, or table scan to HTML
set_tbl()
Set a data table to an agent or an informant
activate_steps()
Activate one or more of an agent's validation steps
deactivate_steps()
Deactivate one or more of an agent's validation steps
remove_steps()
Remove one or more of an agent's validation steps
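
A possible round trip through these functions is sketched below. The file name and the use of a temporary directory are arbitrary. Because the table is assumed not to be stored with the agent by default, set_tbl() is used to reattach it after reading.

```r
library(pointblank)

# Build a plan, switch off the first step, then interrogate
agent <-
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Object ops"
  ) %>%
  col_exists(columns = vars(a)) %>%
  col_vals_gt(columns = vars(d), value = 100) %>%
  deactivate_steps(i = 1) %>%
  interrogate()

# Write the agent to a temporary directory and read it back
tmp <- tempdir()
x_write_disk(agent, filename = "agent-small_table.rds", path = tmp)
agent_from_disk <- x_read_disk(filename = "agent-small_table.rds", path = tmp)

# Reattach the target table to the agent that was read from disk
agent_from_disk <- set_tbl(agent_from_disk, tbl = small_table)
```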

The Multiagent

The multiagent is a group of agents, all bundled together into a single object. With this grouping, we can generate unified reporting across the component agents with the get_multiagent_report() function. One style of reporting ("long") provides a serial listing of agent reports. The other option ("wide") is useful for tracking the evolution of data quality checks over time since common steps across all interrogations will form individual rows (and each interrogation will form a column).

create_multiagent()
Create a pointblank multiagent object
read_disk_multiagent()
Read pointblank agents stored on disk as a multiagent
get_multiagent_report()
Get a summary report using multiple agents
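
To make the grouping concrete, the sketch below bundles two interrogated agents and requests the "wide" report. The two agents stand in for interrogations made at different times; here they simply share one validation step.

```r
library(pointblank)

agent_1 <-
  create_agent(tbl = small_table, tbl_name = "small_table", label = "Run 1") %>%
  col_vals_gt(columns = vars(d), value = 100) %>%
  interrogate()

agent_2 <-
  create_agent(tbl = small_table, tbl_name = "small_table", label = "Run 2") %>%
  col_vals_gt(columns = vars(d), value = 100) %>%
  col_vals_not_null(columns = vars(c)) %>%
  interrogate()

# Bundle the agents into a single multiagent object
multiagent <- create_multiagent(agent_1, agent_2)

# Common steps form rows; each interrogation forms a column
report_wide <- get_multiagent_report(multiagent, display_mode = "wide")
```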

pointblank YAML

YAML files can be used in pointblank for two distinct purposes: (1) to define agents and their validation plans, and (2) to define information for tables. The yaml_write() function allows us to write agent or informant YAML from the namesake objects. We can read them back from disk by using the yaml_read_agent() and yaml_read_informant() functions. As a nice shortcut, we can read agent YAML and interrogate immediately with yaml_agent_interrogate(); in a similar manner, we can read informant YAML and incorporate table information with yaml_informant_incorporate().

yaml_write()
Write pointblank objects to YAML files
yaml_read_agent()
Read a pointblank YAML file to create an agent object
yaml_read_informant()
Read a pointblank YAML file to create an informant object
yaml_agent_interrogate()
Get an agent from pointblank YAML and interrogate()
yaml_agent_string()
Display pointblank YAML using an agent or a YAML file
yaml_agent_show_exprs()
Display validation expressions using pointblank YAML
yaml_informant_incorporate()
Get an informant from pointblank YAML and incorporate()
yaml_exec()
Execute all agent and informant YAML tasks
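
The agent side of this workflow can be sketched as below. It assumes a pointblank version in which the target table may be given as a table-prep formula (tbl = ~ small_table), so that the YAML file records how to obtain the table. The file name and temporary directory are arbitrary.

```r
library(pointblank)

# Give the table as a formula so the YAML can record how to fetch it
agent <-
  create_agent(
    tbl = ~ small_table,
    tbl_name = "small_table",
    label = "A YAML example"
  ) %>%
  col_vals_gt(columns = vars(d), value = 100)

# Write the agent and its validation plan to a YAML file
tmp <- tempdir()
yaml_write(agent, filename = "agent-small_table.yml", path = tmp)

# Read the YAML back and interrogate in one step
agent_from_yaml <-
  yaml_agent_interrogate(filename = "agent-small_table.yml", path = tmp)
```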

Table Transformers

Table Transformer functions can radically transform a data table and either provide a wholly different table (like a summary table or table properties table) or do some useful filtering in a single step. This can be useful for preparing the target table for validation, creating a temporary table for a few validation steps, or even as something used outside of the pointblank workflows. As a nice bonus, these transformer functions will work equally well with data frames, database tables, and Spark tables.

tt_summary_stats()
Table Transformer: obtain a summary stats table for numeric columns
tt_string_info()
Table Transformer: obtain a summary table for string columns
tt_tbl_dims()
Table Transformer: get the dimensions of a table
tt_tbl_colnames()
Table Transformer: get a table's column names
tt_time_shift()
Table Transformer: shift the times of a table
tt_time_slice()
Table Transformer: slice a table with a slice point on a time column
get_tt_param()
Get a parameter value from a summary table
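
A short sketch of pulling single values out of transformed tables follows. It assumes the parameter names "rows" (for the dimensions table) and "max" (for the summary stats table) are available, as in the package's own examples.

```r
library(pointblank)

# Dimensions table, then extract the row count
n_rows <-
  tt_tbl_dims(tbl = small_table) %>%
  get_tt_param(param = "rows")

# Summary stats for numeric columns, then extract the maximum of column d
max_d <-
  tt_summary_stats(tbl = small_table) %>%
  get_tt_param(param = "max", column = "d")
```

Values obtained this way can then serve as comparison values in validation steps.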

Utility and Helper Functions

col_schema()
Generate a table column schema manually or with a reference table
has_columns()
Determine if one or more columns exist in a table
affix_date()
Put the current date into a file name
affix_datetime()
Put the current datetime into a file name
stop_if_not()
The next generation of stopifnot()-type functions: stop_if_not()
from_github()
Specify a file for download from GitHub

Datasets

small_table
A small table that is useful for testing
small_table_sqlite()
An SQLite version of the small_table dataset
specifications
A table containing data pertaining to various specifications
game_revenue
A table with game revenue data
game_revenue_info
A table with metadata for the game_revenue dataset