This vignette introduces the vmc package, which facilitates model check visualizations (i.e., comparing observed data to model predictions) for Bayesian models in R. It follows a standard model checking workflow to show how to use vmc. For a more detailed description of geoms and comparative layouts, see vignette("uncertainty-representation") and vignette("comparative-layout").
The following libraries are required to run this vignette:
library(dplyr)
library(purrr)
library(vmc)
library(ggplot2)
library(ggdist)
library(cowplot)
library(rstan)
library(brms)
library(gganimate)
theme_set(theme_tidybayes() + panel_border())
These options help Stan run faster:
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
Let’s fit a simple model on datasets::mtcars with default priors and four predictors:
model = brm(
bf(mpg ~ disp + carb + vs + am),
init = "0",
data = mtcars,
iter = 6000,
sample_prior = TRUE,
file = "models/get-started_model.rds" # cache model (can be removed)
)
The result looks like this:
model
#> Family: gaussian
#> Links: mu = identity; sigma = identity
#> Formula: mpg ~ disp + carb + vs + am
#> Data: mtcars (Number of observations: 32)
#> Draws: 4 chains, each with iter = 6000; warmup = 3000; thin = 1;
#> total post-warmup draws = 12000
#>
#> Regression Coefficients:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept 26.36 2.98 20.52 32.22 1.00 6901 7124
#> disp -0.02 0.01 -0.04 -0.01 1.00 7061 8104
#> carb -1.26 0.39 -2.05 -0.50 1.00 8987 7811
#> vs 1.21 1.67 -2.13 4.45 1.00 7621 8050
#> am 4.13 1.43 1.31 6.95 1.00 7844 8398
#>
#> Further Distributional Parameters:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma 2.76 0.40 2.12 3.66 1.00 8124 7810
#>
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
Now that we have the model, we can begin to construct model checks. This often starts with a density plot comparing the posterior predictive distribution to the observed data distribution. By default, mcplot() generates such a density plot when no other options are specified.
model %>%
mcplot()
#> Warning: No shared levels found between `names(values)` of the manual scale and the
#> data's fill values.
vmc puts the response variable on the y-axis by default. You can flip the plot to generate a canonical posterior predictive check using a density plot. vmc enables this through mc_gglayer(), which adds a ggplot2 layer to the model check.
model %>%
mcplot() +
mc_gglayer(coord_flip())
#> Warning: No shared levels found between `names(values)` of the manual scale and the
#> data's fill values.
mcplot() uses sensible defaults to build a model check visualization when users don’t explicitly define everything. The lines above are roughly equivalent to the following:
model %>%
mcplot() +
mc_draw("prediction") +
mc_model_line(stat = "density", alpha = .1) +
mc_obs_line(stat = "density") +
mc_layout_superposition() +
mc_gglayer(coord_flip())
#> Warning: No shared levels found between `names(values)` of the manual scale and the
#> data's fill values.
mc_draw() defines how vmc draws the posterior distribution: "prediction" says to draw from the posterior predictive distribution, and newdata = NULL says to generate predictions from the observed dataset. mc_model_line() and mc_obs_line() define the geoms for model predictions and data observations; vmc chooses a suitable geom based on the data types. mc_layout_superposition() sets the comparative layout of the model check to superposition, i.e., it overlays model predictions and data observations. mc_gglayer() adds a ggplot2::layer() to the model check visualization. We discuss these functions in more detail in later sections.
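You can override any of these defaults individually. Below is a minimal sketch that limits the number of posterior draws and adjusts the transparency of the model density lines; it assumes mc_draw() accepts ndraws together with the draw type, combining the two separate uses of mc_draw() shown in this vignette.
model %>%
  mcplot() +
  mc_draw("prediction", ndraws = 100) +  # use 100 posterior predictive draws (assumed argument combination)
  mc_model_line(stat = "density", alpha = .25) +
  mc_obs_line(stat = "density")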
Model checking is not restricted to a single variable; it can also condition on other variables to check their marginal effects on the response variable. For example, suppose we want to check the impact of disp on mpg in the model above. With vmc, this takes just one extra line of code:
model %>%
mcplot() +
mc_condition_on(x = vars(disp))
#> Warning: No shared levels found between `names(values)` of the manual scale and the
#> data's fill values.
You can then apply uncertainty representations to the model predictions in vmc. For example, the lineribbon and point+interval representations are shown below.
model %>%
mcplot() +
mc_model_lineribbon() +
mc_condition_on(x = vars(disp))
model %>%
mcplot() +
mc_model_pointinterval() +
mc_condition_on(x = vars(disp))
#> Warning: No shared levels found between `names(values)` of the manual scale and the
#> data's fill values.
You can also animate across individual draws using hypothetical outcome plots (HOPs).
model %>%
mcplot() +
mc_model_point(group_sample = "hops") +
mc_condition_on(x = vars(disp))
#> Warning: No shared levels found between `names(values)` of the manual scale and the
#> data's fill values.
Or you can use HOPs and other uncertainty representations together.
model %>%
mcplot() +
mc_model_pointinterval(group_sample = "hops", n_sample = 50) +
mc_condition_on(x = vars(disp))
#> Warning: No shared levels found between `names(values)` of the manual scale and the
#> data's fill values.
#> `nframes` and `fps` adjusted to match transition
Since model check visualizations assess a model’s performance by comparing model predictions with observed data, vmc also enables users to change the comparative layout of the model predictions and data observations via the mc_layout_*() functions. vmc has four predefined comparative layouts: juxtaposition, superposition, nested juxtaposition, and explicit encoding.
Juxtaposition puts the model predictions and data observations side by side. It is especially useful when model predictions and data observations overlap so heavily in superposition that they cannot be distinguished.
model %>%
mcplot() +
mc_model_eye() +
mc_obs_eye() +
mc_condition_on(x = vars(carb)) +
mc_layout_juxtaposition()
Superposition is the default comparative layout in vmc. It superimposes the model predictions and data observations in a single plot.
model %>%
mcplot() +
mc_model_eye() +
mc_obs_eye() +
mc_condition_on(x = vars(carb)) +
mc_layout_superposition()
When checking a model conditional on a discrete variable, you may want to avoid superposition, since the data points would overlap, as well as juxtaposition, since the visual distance between the model predictions and data observations would be too large. In these cases, nested juxtaposition is a good choice: it plots model predictions and data observations in one plot but dodges them to avoid overlap.
model %>%
mcplot() +
mc_model_eye() +
mc_obs_eye() +
mc_condition_on(x = vars(carb)) +
mc_layout_nested()
Explicit encoding reveals the relationship between model predictions and data observations by explicitly encoding and plotting their difference. Common examples are residual plots, Q-Q plots, and worm plots. vmc supports explicit encoding by letting users pass the desired operation as the transform argument of mc_layout_encoding(). vmc predefines several common encodings (e.g., the residual plot and Q-Q plot), but users can also pass in a customized operation function.
Residual plot with a lineribbon uncertainty representation:
model %>%
mcplot() +
mc_model_lineribbon() +
mc_layout_encoding(transform = "residual") +
mc_condition_on(x = vars(disp)) +
mc_gglayer(geom_hline(yintercept = 0))
Q-Q plot:
model %>%
mcplot() +
mc_draw(ndraws = 1) +
mc_layout_encoding(transform = "qq") +
mc_gglayer(geom_abline())
#> Warning: No shared levels found between `names(values)` of the manual scale and the
#> data's fill values.
You can also customize the operation function to turn the explicit encoding into any desired explicit comparison. The customized function should take as input a data frame containing two columns, prediction and observation, and return a data frame with a column named y_axis and, optionally, a column named x_axis.
Standardized residual plot (using a customized operation function):
std_res_func = function(data) {
  data %>%
    mutate(y_axis = prediction - observation) %>%  # raw residuals
    mutate(y_axis = y_axis / sd(y_axis))           # standardize by their SD
}
model %>%
mcplot() +
mc_model_lineribbon() +
mc_layout_encoding(transform = std_res_func) +
mc_condition_on(x = vars(disp)) +
mc_gglayer(geom_hline(yintercept = 0))
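Because the customized function may also return an x_axis column, the same mechanism extends to other explicit comparisons. As a hypothetical illustration (pred_obs_func is not part of vmc), the sketch below plots predictions against observations, with geom_abline() marking perfect agreement:
pred_obs_func = function(data) {
  data %>%
    mutate(x_axis = observation,  # observed values on the x-axis
           y_axis = prediction)   # predicted values on the y-axis
}
model %>%
  mcplot() +
  mc_model_pointinterval() +
  mc_layout_encoding(transform = pred_obs_func) +
  mc_gglayer(geom_abline())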