Execution of multiverse analysis --- Sequential and Parallel

Introduction

As some user-specificed multiverses can be quite large, it is only natural that we should make use of the rich parallel processing resources that are widely available. multiverse makes use of futures using the future library, which “provides a very simple and uniform way of evaluating R expressions asynchronously…”. This allows both the user and us (as the creators of this library) greater flexibility in supporting how execution and parallel execution may be supported. How and when evaluation takes place depends on the strategy chosen by the user of executing the multiverse. These strategies include sequential execution in the current R session, or asynchronous parallel execution on the current machine or on a compute cluster.

In this document, we show how you can execute each of the distinct analyses in your specified multiverse.

Data and analysis

We will use the hurricane dataset that has been discussed in greater detail in the README as well as in the vignette hurricane. If you are already familiar with this dataset and analysis, please feel free to skip this section, and continue from the next section

data(hurricane)

hurricane_data <- hurricane |>
  # rename some variables
  rename(
      year = Year,
      name = Name,
      dam = NDAM,
      death = alldeaths,
      female = Gender_MF,
      masfem = MasFem,
      category = Category,
      pressure = Minpressure_Updated_2014,
      wind = HighestWindSpeed
  ) |>
  # create new variables
  mutate(
      post = ifelse(year>1979, 1, 0),
      zcat = as.numeric(scale(category)),
      zpressure = -scale(pressure),
      zwind = as.numeric(scale(wind)),
      z3 = as.numeric((zpressure + zcat + zwind) / 3)
  )

M = multiverse()

We implement the same analysis as in the vignette(hurricane). Below we briefly outline the steps involved.

Outlier exclusion: We implement different alternative choices on how to exclude outliers based on extreme observations of death and damages.

inside(M, {
  df <- hurricane_data |>
    filter(branch(death_outliers, 
        "no_exclusion" ~ TRUE,
        "most_extreme_deaths" ~ name != "Katrina",
        "most_extreme_two_deaths" ~ ! (name %in% c("Katrina", "Audrey"))
    )) |>
    filter(branch(damage_outliers,
        "no_exclusion" ~ TRUE,
        "most_extreme_one_damage" ~ ! (name %in% c("Sandy")),
        "most_extreme_two_damage" ~ ! (name %in% c("Sandy", "Andrew")),
        "most_extreme_three_damage" ~ ! (name %in% c("Sandy", "Andrew", "Donna"))
    ))
})

Identifying independent variables: How is femininity of the name of a hurricane operationalised? Simonsohn et al. identify two distinct ways. First, using the 11 point scale that was used in the original analysis; or second using a binary scale.

Data transformations: The dollar amount in damages caused by a hurricane follows a long tailed, positive only valued distribution. The decision involved is whether or not to transform damages.

inside(M, {
  df <- df |>
    mutate(
        femininity = branch(femininity_calculation,
          "masfem" ~ masfem,
          "female" ~ female
        ),
        damage = branch(damage_transform,
          "no_transform" ~ identity(dam),
          "log_transform" ~ log(dam)
        )
    )
})

Alternative specifications of regression model: The next step is to fit the model. We can use either a log-linear model or a poisson model for the step. Both are reasonable alternatives for this dataset. We also have to make a choice on whether we want to include an interaction between femininity and damage. This results in the following specification:

inside(M, {
  fit <- glm(branch(model, "linear" ~ log(death + 1), "poisson" ~ death) ~ 
            branch(main_interaction,
                "no" ~ femininity + damage,
                "yes" ~ femininity * damage
            ) + branch(other_predictors,
                "none" ~ NULL,
                "pressure" %when% (main_interaction == "yes") ~ femininity * zpressure,
                "wind" %when% (main_interaction == "yes") ~ femininity * zwind,
                "category" %when% (main_interaction == "yes") ~ femininity * zcat,
                "all" %when% (main_interaction == "yes") ~ femininity * z3,
                "all_no_interaction" %when% (main_interaction == "no") ~ z3
            ) + branch(covariates, "1" ~ NULL, "2" ~ year:damage, "3" ~ post:damage), 
            family = branch(model, "linear" ~ "gaussian", "poisson" ~ "poisson"),  
            data = df)
  
  res <- broom::tidy(fit)
})

Execution

Sequential

The most simple execution strategy would be to perform each computation sequentially, on one’s current machine. This is the default strategy, which can be used by simply calling execute_multiverse() on the current multiverse object.

execute_multiverse(M)

However, some studies have created multiverse analyses with thousands or even millions of unique specifications (universes). In such cases, the optimisation to avoid redundant computation that we have built into our solution for execution is insufficient, and sequential execution fails to make use of the embarrassingly abundant parallel processing resources.

Parallel: separate R sessions

To process multiverses in parallel, we make use of the future library. future allows the user to declare different strategies for resolving futures asynchronously and in parallel using future::plan(). The most general approach, which would work on both unix and non-unix based systems is to use a multisession future, which resolves futures asynchronously (in parallel) in separate R sessions running in the background on the same machine:

plan(multisession, workers = availableCores())
execute_multiverse(M, parallel = TRUE)
plan(sequential) # explicitly closes multisession workers by switching plan

Note: this vignette uses the inside() syntax to implement a multiverse. However, futures can be used with multiverse code blocks as well, and the steps involved in setup of asynchronous futures and execution would remain the same.

Parallel: separate forked processes

This strategy is similar to the mc*apply suite of functions in the parallel library. It resolves futures “asynchronously (in parallel) in separate forked R processes running in the background on the same machine”. However, this functionality is not supported on Windows (non-unix based system). Thus, we recommend using multisession instead.

Parallel: compute clusters

Future also supports resolution in separate R sessions running on a compute cluster. “A cluster future is a future that uses cluster evaluation, which means that its value is computed and resolved in parallel in another process.”

What if I want to execute only a subset of the specifications?

For debugging purposes or otherwise, one might wonder if it is possible to execute only a small subset of N universes from the larger specified multiverse. Although we do not provide a specific function which supports this behavior, such a behavior can be easily achieved using the following workflow using the lapply() functions.

N = 5
lapply(1:5, function(x) execute_universe(M, x))

Alternatively, the same result can be obtained using the purrr::map() function:

purrr::map(1:N, ~ execute_universe(M, .))

If we want to perform this operation in parallel, we could use furrr::future_map() as follows:

plan(multisession, workers = availableCores())
furrr::future_map(1:5, function(x) execute_universe(M, x))
plan(sequential)

For a detailed description on asynchronous resolution of futures and how to set up clusters, please refer to the documentation of the future library.

Abhraneel Sarma

2024-10-06