Introduction to the Expectations in Unit Tests Workflow (VALID-III)
The VALID-III: Expectations in Unit Tests workflow is all
about checking your data alongside your usual testthat
tests. The functions used for this workflow all begin with the
expect_ prefix and are based on the set of validation
functions used in the VALID-I and VALID-II
workflows. Here’s the complete list of functions, with a short description of each:
expect_col_vals_lt(): Expect that column data are less than a specified value.
expect_col_vals_lte(): Expect that column data are less than or equal to a specified value.
expect_col_vals_equal(): Expect that column data are equal to a specified value.
expect_col_vals_not_equal(): Expect that column data are not equal to a specified value.
expect_col_vals_gte(): Expect that column data are greater than or equal to a specified value.
expect_col_vals_gt(): Expect that column data are greater than a specified value.
expect_col_vals_between(): Expect that column data are between two specified values.
expect_col_vals_not_between(): Expect that column data are not between two specified values.
expect_col_vals_in_set(): Expect that column data are part of a specified set of values.
expect_col_vals_not_in_set(): Expect that column data are not part of a specified set of values.
expect_col_vals_null(): Expect that column data are NULL/NA.
expect_col_vals_not_null(): Expect that column data are not NULL/NA.
expect_col_vals_regex(): Expect that strings in column data match a regex pattern.
expect_col_vals_expr(): Expect that column data agree with a predicate expression.
expect_conjointly(): Expect that multiple rowwise validations result in joint validity.
expect_rows_distinct(): Expect that row data are distinct.
expect_col_is_character(): Expect that the columns contain character/string data.
expect_col_is_numeric(): Expect that the columns contain numeric values.
expect_col_is_integer(): Expect that the columns contain integer values.
expect_col_is_logical(): Expect that the columns contain logical values.
expect_col_is_date(): Expect that the columns contain R Date objects.
expect_col_is_posix(): Expect that the columns contain POSIXct date-times.
expect_col_is_factor(): Expect that the columns contain R factor objects.
expect_col_exists(): Expect that one or more columns actually exist.
expect_col_schema_match(): Expect that columns in the table (and their types) match a predefined schema.
Now that we know that the functions are similar in intent but different in name, let’s learn how to use these functions effectively.
The testthat package has a collection of functions
that begin with
expect_. It’s no coincidence that
pointblank adopts the same naming convention for the VALID-III
workflow. The idea is to use these functions
interchangeably with those from testthat in the
standard testthat workflow (in a
test-<name>.R file, inside the
tests/testthat folder). The big difference here is that
instead of testing function outputs, we are testing data tables.
However, tables may be returned from function calls and the
expect_*() functions offered up by
pointblank might offer more flexibility for testing
that data. For instance,
expect_col_vals_between() allows us
to write an expectation with fine control over boundary values (and
whether they are inclusive bounds) and over whether NA values
should be ignored, and we can even set a failure threshold if that makes
sense for the expectation.
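As a rough sketch of how these controls fit into a test file, the block below mixes a plain testthat expectation with a pointblank one. The file name, the tbl data frame, and its values are made up for illustration; inclusive and na_pass are arguments of expect_col_vals_between(), and the test_*() variants are used only to show the effect of each setting as a single logical value.

```r
# Hypothetical file: tests/testthat/test-readings.R
library(testthat)
library(pointblank)

# Made-up table for illustration: one value on each boundary and one NA
tbl <- data.frame(x = c(1, 5, 10, NA))

test_that("x stays within its expected range", {

  # A regular testthat expectation on the object itself
  expect_s3_class(tbl, "data.frame")

  # A pointblank expectation on the data: both bounds are inclusive
  # and NA values are allowed to pass
  expect_col_vals_between(
    tbl, vars(x), left = 1, right = 10,
    inclusive = c(TRUE, TRUE),
    na_pass = TRUE
  )
})

# With an exclusive upper bound, the value 10 is out of range,
# so this should return FALSE
test_col_vals_between(
  tbl, vars(x), left = 1, right = 10,
  inclusive = c(TRUE, FALSE),
  na_pass = TRUE
)

# With na_pass = FALSE, the NA counts as a failing test unit,
# so this should return FALSE as well
test_col_vals_between(tbl, vars(x), left = 1, right = 10, na_pass = FALSE)
```

Keeping the bounds, the NA handling, and the column selection in one call means the test body does not need any subsetting of its own.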
Here’s an example of how to generate some tests on data with
testthat and also with pointblank. For the
small_table dataset, let’s write expectations that show
that non-NA values in column
c are between 2 and 9.
testthat::expect_true(all(na.omit(small_table$c) >= 2))
testthat::expect_true(all(na.omit(small_table$c) <= 9))
There is no testthat function that tests whether values lie
between two bounds. The original strategy was to pass
small_table$c directly as the
object in both expectations;
however, that doesn’t work because the comparison results in a logical vector
of length greater than 1. Also, there is no allowance for NA
values to be skipped. The best I could do was the above.
The pointblank version of this task makes for a more succinct and understandable expectation expression:
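A minimal sketch of such an expectation, assuming the same bounds of 2 and 9 used in the testthat version and that NA values should be let through with na_pass:

```r
library(pointblank)

# Expect every non-NA value in column `c` of small_table to lie in [2, 9];
# na_pass = TRUE lets the NA values in `c` pass instead of failing
expect_col_vals_between(
  small_table, vars(c),
  left = 2, right = 9,
  na_pass = TRUE
)
```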
The arguments of
expect_col_vals_between() give us
everything we need to check tabular data without having to do subsetting
and perform other transformations. There are a few added benefits.
Should data come from a data source other than a local data frame, the
SQL expressions are handled internally and they have been tested
extensively across all the supported database types and in Spark
DataFrames as well.
All of the
expect_*() functions have the same leading
arguments as their validation function counterparts, but they omit
some arguments at the end of their signatures, including actions.
While we lose the
actions argument, we get in its place a
threshold argument. This is a simple failure threshold
value for use with the expectation
(expect_*()) and the test
(test_*()) functions. By default, threshold is set to
1, which means that any single test unit failing will
result in an overall failure (i.e., the expectation will fail).
As with the thresholds set in
action_levels() (or the shortcut functions warn_on_fail() and
stop_on_fail()), whole numbers beyond 1
indicate that any failing units up to that absolute threshold value will
result in a succeeding expectation. Likewise, fractional values (between
0 and 1) act as a proportional failure
threshold, where 0.25 means that having 25% of test units fail
results in a failed expectation.
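The effect of the threshold argument can be sketched with the test_*() variants, which return a single logical value. The tbl data frame below is made up for illustration: it has 8 test units, exactly one of which fails the check.

```r
library(pointblank)

# Made-up data: 8 test units, exactly one of which (20) is not below 10
tbl <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 20))

# Default threshold of 1: a single failing unit fails the test (FALSE)
test_col_vals_lt(tbl, vars(x), value = 10)

# Absolute threshold of 2: one failing unit is tolerated (TRUE)
test_col_vals_lt(tbl, vars(x), value = 10, threshold = 2)

# Proportional threshold of 0.25: 1 of 8 units failing (12.5%)
# stays under the threshold (TRUE)
test_col_vals_lt(tbl, vars(x), value = 10, threshold = 0.25)

# Proportional threshold of 0.1: 12.5% failing is over the threshold (FALSE)
test_col_vals_lt(tbl, vars(x), value = 10, threshold = 0.1)

# The expect_*() counterpart takes the same argument inside a testthat test;
# here it succeeds because one failing unit is below the threshold of 2
expect_col_vals_lt(tbl, vars(x), value = 10, threshold = 2)
```

The example deliberately avoids failure counts that sit exactly on a proportional threshold, since how such boundary cases are rounded is not covered here.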
The preconditions argument can be used to transform the
input data before evaluation of the expectation. This is useful in some
cases where you might need to summarize the input data table, mutate
columns, perform some filtering, or even perform table joins