The create_agent() function creates an agent object, which is used in a data quality reporting workflow. The overall aim of this workflow is to generate useful reporting information for assessing the level of data quality for the target table. We can supply as many validation functions as we wish, thereby increasing the level of validation coverage for that table. The agent returned by the create_agent() call takes validation functions, which expand to validation steps (each one is numbered). This process is known as developing a validation plan.

The validation functions, when called on an agent, are merely instructions up to the point the interrogate() function is called. That kicks off the process of the agent acting on the validation plan and getting results for each step. Once the interrogation process is complete, we can say that the agent has intel. Calling the agent itself will result in a reporting table. This reporting of the interrogation can also be accessed with the get_agent_report() function, where there are more reporting options.

create_agent(
  tbl = NULL,
  read_fn = NULL,
  tbl_name = NULL,
  label = NULL,
  actions = NULL,
  end_fns = NULL,
  embed_report = FALSE,
  lang = NULL,
  locale = NULL
)

Arguments

tbl

The input table. This can be a data frame, a tibble, a tbl_dbi object, or a tbl_spark object. Alternatively, a function can be used to read in the input data table with the read_fn argument (in which case, tbl can be NULL).

read_fn

A function that's used for reading in the data. Even if a tbl is provided, this function will be invoked to obtain the data (i.e., the read_fn takes priority). There are two ways to specify a read_fn: (1) using a function (e.g., function() { <table reading code> }) or, (2) with an R formula expression (e.g., ~ { <table reading code> }).

tbl_name

An optional name to assign to the input table object. If no value is provided, a name will be generated based on whatever information is available. This table name will be displayed in the header area of the agent report generated by printing the agent or calling get_agent_report().

label

An optional label for the validation plan. If no value is provided, a label will be generated based on the current system time. Markdown can be used here to make the label more visually appealing (it will appear in the header area of the agent report).

actions

An option to include a list with threshold levels so that all validation steps can react accordingly when exceeding the set levels. This is to be created with the action_levels() helper function. Should an action levels list be used for a specific validation step, the default set specified here will be overridden for that step.
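
The override behavior described above can be sketched as follows. This is a minimal illustration rather than a recommended setup: the threshold values and the label are arbitrary assumptions, and the table is the small_table dataset used in the Examples section.

```r
library(pointblank)

# Default thresholds, applied to every step that does not set its own
# (the values here are illustrative assumptions)
al_default <- action_levels(warn_at = 0.10, stop_at = 0.25)

agent <- create_agent(
  tbl = small_table,
  tbl_name = "small_table",
  label = "Default and step-level action levels",
  actions = al_default
) %>%
  # This step uses the agent-level defaults
  col_vals_gt(vars(d), 100) %>%
  # This step supplies its own action levels, which replace
  # the defaults for this step only
  col_vals_lte(
    vars(c), 5,
    actions = action_levels(warn_at = 0.50, stop_at = 0.75)
  ) %>%
  interrogate()
```

Setting defaults once on the agent keeps the plan short, while a step-level actions value lets a step that is known to be noisy use looser (or stricter) thresholds.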

end_fns

A list of function calls that should be performed at the end of an interrogation. Each function call should be in the form of a one-sided R formula expression, so overall this construction should be used: end_fns = list(~ <R statements>, ~ <R statements>, ...). An example of a function that can be sensibly used here is email_blast(), where an email of the validation report is generated and sent based on a sending condition.
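
As a rough sketch of the construction above, the following plan registers a single end function. The message() call is only a stand-in chosen for illustration (an assumption, not something this page prescribes); in practice a call such as email_blast() would take its place.

```r
library(pointblank)

agent <- create_agent(
  tbl = small_table,
  tbl_name = "small_table",
  label = "Plan with an end function",
  # Each list element is a one-sided formula; the statement on its
  # right-hand side is meant to run once the interrogation is done
  end_fns = list(
    ~ message("Interrogation of `small_table` has finished.")
  )
) %>%
  rows_distinct() %>%
  interrogate()
```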

embed_report

An option to embed a gt-based validation report into the ptblank_agent object. If FALSE (the default), the table object will not be generated and made available with the agent upon returning from the interrogation.

lang

The language to use for automatic creation of briefs (short descriptions for each validation step) and for the agent report (a summary table that provides the validation plan and the results from the interrogation). By default, NULL will create English ("en") text. Other options include French ("fr"), German ("de"), Italian ("it"), Spanish ("es"), Portuguese ("pt"), Chinese ("zh"), and Russian ("ru").

locale

An optional locale ID to use for formatting values in the agent report summary table according to the locale's rules. Examples include "en_US" for English (United States) and "fr_FR" for French (France); more simply, this can be a language identifier without a country designation, like "es" for Spanish (Spain, same as "es_ES").
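
To show how lang and locale fit together, here is a short sketch that requests French briefs and report text with French (France) value formatting. The label and the single validation step are illustrative assumptions.

```r
library(pointblank)

# Ask for French text in the briefs and the agent report, and
# format reported values according to the "fr_FR" locale
agent_fr <- create_agent(
  tbl = small_table,
  tbl_name = "small_table",
  label = "Un exemple.",
  lang = "fr",
  locale = "fr_FR"
) %>%
  col_vals_gt(vars(d), 100) %>%
  interrogate()
```

Printing agent_fr, or calling get_agent_report(agent_fr), should then produce the report in the chosen language.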

Value

A ptblank_agent object.

Details

A very detailed list object, known as the x-list, can be obtained by using the get_agent_x_list() function on the agent. This font of information can be taken as a whole or broken down by step number (with the i argument).

Sometimes it is useful to see which rows were the failing ones. By using the get_data_extracts() function on the agent, we either get a list of tibbles (for those steps that have data extracts) or one tibble if the validation step is specified with the i argument.

If we just need to know whether all validations completely passed (i.e., all steps had no failing test units), the all_passed() function could be used on the agent. However, in practice, it's not often the case that all data validation steps are free from any failing units.
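
A minimal sketch of that check, using two of the validation steps from the Examples section on small_table (the label is an illustrative assumption):

```r
library(pointblank)

agent <- create_agent(
  tbl = small_table,
  tbl_name = "small_table",
  label = "Checking for a clean pass"
) %>%
  col_vals_gt(vars(d), 100) %>%
  rows_distinct() %>%
  interrogate()

# Returns a single logical value: TRUE only when no step
# has any failing test units
passed <- all_passed(agent)
passed
```

Because small_table contains a pair of duplicated rows, the rows_distinct() step has failing test units, so the result here is FALSE even though the col_vals_gt() step passes completely.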

Function ID

1-2

See also

Other Planning and Prep: action_levels(), create_informant(), db_tbl(), file_tbl(), scan_data(), validate_rmd()

Examples

# Let's walk through a data quality
# analysis of an extremely small table;
# it's actually called `small_table` and
# we can find it as a dataset in this
# package
small_table
#> # A tibble: 13 x 8
#>    date_time           date           a b             c      d e     f
#>    <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
#>  1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high
#>  2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low
#>  3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high
#>  4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid
#>  5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low
#>  6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid
#>  7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high
#>  8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low
#>  9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high
#> 10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high
#> 11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low
#> 12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low
#> 13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high

# We ought to think about what's
# tolerable in terms of data quality so
# let's designate proportional failure
# thresholds to the `warn`, `stop`, and
# `notify` states using `action_levels()`
al <-
  action_levels(
    warn_at = 0.10,
    stop_at = 0.25,
    notify_at = 0.35
  )

# Now create a pointblank `agent` object
# and give it the `al` object (which
# serves as a default for all validation
# steps which can be overridden); the
# static thresholds provided by `al` will
# make the reporting a bit more useful
agent <-
  create_agent(
    read_fn = ~ small_table,
    label = "An example.",
    actions = al
  )

# Then, as with any `agent` object, we
# can add steps to the validation plan by
# using as many validation functions as we
# want; then, we use `interrogate()` to
# physically perform the validations and
# gather intel
agent <-
  agent %>%
  col_exists(vars(date, date_time)) %>%
  col_vals_regex(
    vars(b), "[0-9]-[a-z]{3}-[0-9]{3}"
  ) %>%
  rows_distinct() %>%
  col_vals_gt(vars(d), 100) %>%
  col_vals_lte(vars(c), 5) %>%
  col_vals_equal(
    vars(d), vars(d),
    na_pass = TRUE
  ) %>%
  col_vals_between(
    vars(c),
    left = vars(a), right = vars(d),
    na_pass = TRUE
  ) %>%
  interrogate()

# Calling `agent` in the console
# prints the agent's report; but we
# can get a `gt_tbl` object directly
# with `get_agent_report(agent)`
report <- get_agent_report(agent)
class(report)
#> [1] "gt_tbl" "list"

# What can you do with the report?
# Print it from an R Markdown code
# chunk, use it in a **blastula** email,
# put it in a webpage, or further
# modify it with the **gt** package

# From the report we know that Step
# 4 had two test units (rows, really)
# that failed; we can see those rows
# with `get_data_extracts()`
agent %>% get_data_extracts(i = 4)
#> # A tibble: 2 x 8
#>   date_time           date           a b             c     d e     f
#>   <dttm>              <date>     <int> <chr>     <dbl> <dbl> <lgl> <chr>
#> 1 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high
#> 2 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high

# We can get an x-list for the whole
# validation (8 steps), or, just for
# the 4th step with `get_agent_x_list()`
xl_step_4 <- agent %>% get_agent_x_list(i = 4)

# And then we can peruse the different
# parts of the list; let's get the
# fraction of test units that failed
xl_step_4$f_failed
#> [1] 0.15385

# Just printing the x-list will tell
# us what's available therein
xl_step_4
#> ── The x-list for `` ───────────────────────────────────────────────── STEP 4 ──
#> $time_start $time_end (POSIXct [1])
#> $label $tbl_name $tbl_src $tbl_src_details (chr [1])
#> $tbl (spec_tbl_df, tbl_df, tbl, and data.frame)
#> $col_names $col_types (chr [8])
#> $i $type $columns $values $label $briefs (mixed [1])
#> $eval_error $eval_warning (lgl [1])
#> $capture_stack (list [1])
#> $n $n_passed $n_failed $f_passed $f_failed (num [1])
#> $warn $stop $notify (lgl [1])
#> $lang (chr [1])
#> ────────────────────────────────────────────────── NO INTERROGATION PERFORMED ──

# An x-list not specific to any step
# will have way more information and a
# slightly different structure; see
# `help(get_agent_x_list)` for more info
# get_agent_x_list(agent)