Execution of multiverse analysis --- Sequential and Parallel
Abhraneel Sarma
2024-10-06
Source:vignettes/execution-multiverse.Rmd
execution-multiverse.Rmd
Introduction
As some user-specificed multiverses can be quite large, it is only
natural that we should make use of the rich parallel processing
resources that are widely available. multiverse
makes use
of futures using the future library,
which “provides a very simple and uniform way of evaluating R
expressions asynchronously…”. This allows both the user and us (as the
creators of this library) greater flexibility in supporting how
execution and parallel execution may be supported. How and when
evaluation takes place depends on the strategy chosen by the user of
executing the multiverse. These strategies include sequential execution
in the current R session, or asynchronous parallel execution on the
current machine or on a compute cluster.
In this document, we show how you can execute each of the distinct analyses in your specified multiverse.
Data and analysis
We will use the hurricane dataset that has been discussed in greater detail in the README as well as in the vignette hurricane. If you are already familiar with this dataset and analysis, please feel free to skip this section, and continue from the next section
data(hurricane)
hurricane_data <- hurricane |>
# rename some variables
rename(
year = Year,
name = Name,
dam = NDAM,
death = alldeaths,
female = Gender_MF,
masfem = MasFem,
category = Category,
pressure = Minpressure_Updated_2014,
wind = HighestWindSpeed
) |>
# create new variables
mutate(
post = ifelse(year>1979, 1, 0),
zcat = as.numeric(scale(category)),
zpressure = -scale(pressure),
zwind = as.numeric(scale(wind)),
z3 = as.numeric((zpressure + zcat + zwind) / 3)
)
M = multiverse()
We implement the same analysis as in the
vignette(hurricane)
. Below we briefly outline the steps
involved.
Outlier exclusion: We implement different alternative choices on how to exclude outliers based on extreme observations of death and damages.
inside(M, {
df <- hurricane_data |>
filter(branch(death_outliers,
"no_exclusion" ~ TRUE,
"most_extreme_deaths" ~ name != "Katrina",
"most_extreme_two_deaths" ~ ! (name %in% c("Katrina", "Audrey"))
)) |>
filter(branch(damage_outliers,
"no_exclusion" ~ TRUE,
"most_extreme_one_damage" ~ ! (name %in% c("Sandy")),
"most_extreme_two_damage" ~ ! (name %in% c("Sandy", "Andrew")),
"most_extreme_three_damage" ~ ! (name %in% c("Sandy", "Andrew", "Donna"))
))
})
Identifying independent variables: How is femininity of the name of a hurricane operationalised? Simonsohn et al. identify two distinct ways. First, using the 11 point scale that was used in the original analysis; or second using a binary scale.
Data transformations: The dollar amount in damages
caused by a hurricane follows a long tailed, positive only valued
distribution. The decision involved is whether or not to transform
damages
.
inside(M, {
df <- df |>
mutate(
femininity = branch(femininity_calculation,
"masfem" ~ masfem,
"female" ~ female
),
damage = branch(damage_transform,
"no_transform" ~ identity(dam),
"log_transform" ~ log(dam)
)
)
})
Alternative specifications of regression model: The next step is to
fit the model. We can use either a log-linear model or a poisson model
for the step. Both are reasonable alternatives for this dataset. We also
have to make a choice on whether we want to include an interaction
between femininity
and damage
. This results in
the following specification:
inside(M, {
fit <- glm(branch(model, "linear" ~ log(death + 1), "poisson" ~ death) ~
branch(main_interaction,
"no" ~ femininity + damage,
"yes" ~ femininity * damage
) + branch(other_predictors,
"none" ~ NULL,
"pressure" %when% (main_interaction == "yes") ~ femininity * zpressure,
"wind" %when% (main_interaction == "yes") ~ femininity * zwind,
"category" %when% (main_interaction == "yes") ~ femininity * zcat,
"all" %when% (main_interaction == "yes") ~ femininity * z3,
"all_no_interaction" %when% (main_interaction == "no") ~ z3
) + branch(covariates, "1" ~ NULL, "2" ~ year:damage, "3" ~ post:damage),
family = branch(model, "linear" ~ "gaussian", "poisson" ~ "poisson"),
data = df)
res <- broom::tidy(fit)
})
Execution
Sequential
The most simple execution strategy would be to perform each
computation sequentially, on one’s current machine. This is the default
strategy, which can be used by simply calling
execute_multiverse()
on the current multiverse object.
However, some studies have created multiverse analyses with thousands or even millions of unique specifications (universes). In such cases, the optimisation to avoid redundant computation that we have built into our solution for execution is insufficient, and sequential execution fails to make use of the embarrassingly abundant parallel processing resources.
Parallel: separate R sessions
To process multiverses in parallel, we make use of the future library.
future
allows the user to declare different strategies for
resolving futures asynchronously and in parallel using
future::plan()
. The most general approach, which would work
on both unix and non-unix based systems is to use a
multisession
future, which resolves futures asynchronously
(in parallel) in separate R sessions running in the background on the
same machine:
plan(multisession, workers = availableCores())
execute_multiverse(M, parallel = TRUE)
plan(sequential) # explicitly closes multisession workers by switching plan
Note: this vignette uses the inside()
syntax to
implement a multiverse. However, futures can be used with multiverse
code blocks as well, and the steps involved in setup of asynchronous
futures and execution would remain the same.
Parallel: separate forked processes
This strategy is similar to the mc*apply suite of functions in the
parallel
library. It resolves futures “asynchronously (in
parallel) in separate forked R processes running in the background on
the same machine”. However, this functionality is not supported on
Windows (non-unix based system). Thus, we recommend using
multisession
instead.
What if I want to execute only a subset of the specifications?
For debugging purposes or otherwise, one might wonder if it is
possible to execute only a small subset of N
universes from
the larger specified multiverse. Although we do not provide a specific
function which supports this behavior, such a behavior can be easily
achieved using the following workflow using the lapply()
functions.
N = 5
lapply(1:5, function(x) execute_universe(M, x))
Alternatively, the same result can be obtained using the
purrr::map()
function:
purrr::map(1:N, ~ execute_universe(M, .))
If we want to perform this operation in parallel, we could use
furrr::future_map()
as follows:
plan(multisession, workers = availableCores())
furrr::future_map(1:5, function(x) execute_universe(M, x))
plan(sequential)
For a detailed description on asynchronous resolution of futures and
how to set up clusters, please refer to the documentation of the
future
library.