Example multiverse implementation: Effect of fertility on religiosity and political attitudes

Multiverse case study #1

In this document, we outline an initial approach to conducting a multiverse analysis in R. We will show how our package can be used to perform the multiverse analysis outlined by Steegen et al. in Increasing Transparency Through a Multiverse Analysis.

Introduction

Data analysis can involve several decisions involving two or more options. In most statistical analysis, these decisions are taken by the researcher based on some reasonable justification. However, for several decisions, there can be more than one reasonable options to choose from. A multiverse analysis is a form of analysis which makes all such decisions explicit and conducts the complete analysis for all combinations of options (of each decision).

Below, we illustrate an example of a single analysis for a dataset. And then extend it to a multiverse analysis.

library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)
library(broom)
library(gganimate)
library(cowplot)
library(multiverse)

The data

The first step is to read the raw data from the file and store it as a tibble. We will be following the tidy data format here. The data is stored in two text files, and we can use readr to read the files into R. In this example, we will use the data collected by Durante et al. in The fluctuating female vote: Politics, religion, and the ovulatory cycle, which investigated the effect of fertility on religiosity and political attitudes. We will focus on their second study (which we store in data.raw.study2).

data("durante")

data.raw.study2 <- durante %>%
  mutate(
    Abortion = abs(7 - Abortion) + 1,
    StemCell = abs(7 - StemCell) + 1,
    Marijuana = abs(7 - Marijuana) + 1,
    RichTax = abs(7 - RichTax) + 1,
    StLiving = abs(7 - StLiving) + 1,
    Profit = abs(7 - Profit) + 1,
    FiscConsComp = FreeMarket + PrivSocialSec + RichTax + StLiving + Profit,
    SocConsComp = Marriage + RestrictAbortion + Abortion + StemCell + Marijuana
  )

The data look like this:

data.raw.study2 %>%
  head(10)

The original paper looked at the relationship between fertility, relationship status, and religiosity. But there are many reasonable ways to have defined each of these three variables from this dataset, so it is a good candidate for multiverse analysis.

A single data set analysis: one possible analysis among many

The data collected needs to be processed before it can be modeled. Preparing the data set for analysis can involve several steps and decisions regarding how to encode the different raw values. The following is one example of data processing that can be performed for this study.

one_universe = data.raw.study2 %>%
  mutate( ComputedCycleLength = StartDateofLastPeriod - StartDateofPeriodBeforeLast ) %>%
  mutate( NextMenstrualOnset = StartDateofLastPeriod + ComputedCycleLength ) %>%
  mutate(
    CycleDay = 28 - (NextMenstrualOnset - DateTesting),
    CycleDay = ifelse(WorkerID == 15, 11, ifelse(WorkerID == 16, 18, CycleDay)),
    CycleDay = ifelse(CycleDay > 1 & CycleDay < 28, CycleDay, ifelse(CycleDay < 1, 1, 28))
  ) %>%
  mutate(
    Relationship = factor(ifelse(Relationship==1 | Relationship==2, "Single", "Relationship"))
  ) %>%
  filter( ComputedCycleLength > 25 & ComputedCycleLength < 35) %>%
  filter( Sure1 > 6 | Sure2 > 6 ) %>%
  mutate( Fertility = factor( ifelse(CycleDay >= 7 & CycleDay <= 14, "high", ifelse(CycleDay >= 17 & CycleDay <= 25, "low", "medium")) ) )

The transformed data for this one universe looks like this:

one_universe %>%
  select( NextMenstrualOnset, Relationship, Sure1, Sure2, Fertility, everything() ) %>%
  head(10)

one_universe %>%
  ggplot(aes(x = Relationship, y = Rel1 + Rel2 + Rel3, color = Fertility)) +
  stat_summary(position = position_dodge(width = .1), fun.data = "mean_se")

However, there also exists other valid processing options: instead of calculating NextMenstrualOnset = StartDateofLastPeriod + ComputedCycleLength, it can also be calculated as StartDateofLastPeriod + ReportedCycleLength. Such alternate processing options can exist for several decisions that a researcher makes in the data processing, analysis and presentation stages. This can thus result in a multiverse of analysis, with the one described above representing a single universe.

Below, we describe how our package allows you to conduct a multiverse analysis with ease.

Multiverse implementation

multiverse provides flexible functions which can be used to perform a multiverse analysis.

The first step is to define a new multiverse. We will use the multiverse object to create a set of universes, each representing a different way of analysing our data.

M <- multiverse()

The next step is to define our possible analyses inside the multiverse. The multiverse package includes functions that aim to make it easy to write multiverse analyses in as close a way to a single universe analysis as possible (as seen in the single analysis shown above).

Consider these first few lines from the transformation code in the single analysis above:

df <- data.raw.study2 %>%
  mutate(ComputedCycleLength = StartDateofLastPeriod - StartDateofPeriodBeforeLast) %>%
  mutate(NextMenstrualOnset = StartDateofLastPeriod + ComputedCycleLength)

But NextMenstrualOnset could be calculated in at least two other reasonable ways:

NextMenstrualOnset = StartDateofLastPeriod + ReportedCycleLength
NextMenstrualOnset = StartDateNext

To create a multiverse that includes these three possible processing options, we can use the branch() function. The branch() function defines a parameter (here menstrual_calculation) and the different options that the parameter can take (here, "mc_option1", "mc_option2", "mc_option3"). Each option corresponds to a different chunk of code that would be executed in a different universe.

NextMenstrualOnset = branch(menstrual_calculation, 
  "mc_option1" ~ StartDateofLastPeriod + ComputedCycleLength,
  "mc_option2" ~ StartDateofLastPeriod + ReportedCycleLength,
  "mc_option3" ~ StartDateNext
)

The branch() function indicates that, in our multiverse, NextMenstrualOnset can take either of the three options (here, "mc_option1", "mc_option2", "mc_option3"). Thus, we need to declare this data processing step inside the multiverse. We do this by using the inside() function. The inside() function takes in two arguments: 1. the multiverse object, M; and 2. the code for the analysis (including branches). Note that if you are passing multiple expressions, they should be enclosed within {}.

# here we just create the variable `df` in the multiverse
inside(M, df <- data.raw.study2)

# here, we perform two `mutate` operations in the multiverse.
# although they could have been chained, this illustrates 
# how multiple variables can be declared together using the `{}`
inside(M, {
  df <- df %>%
    mutate( ComputedCycleLength = StartDateofLastPeriod - StartDateofPeriodBeforeLast )
  
  df <- df %>%
    mutate( NextMenstrualOnset = branch(menstrual_calculation, 
      "mc_option1" ~ StartDateofLastPeriod + ComputedCycleLength,
      "mc_option2" ~ StartDateofLastPeriod + ReportedCycleLength,
      "mc_option3" ~ StartDateNext)
    )
})

Note

In this vignette, we make use of the function which is more suited for a script-style implementation. Keeping consistency with the interactive programming interface of RStudio, we also offer the user multiverse code chunks, a custom engine designed to work with the multiverse package, to implement the multiverse analyses. multiverse code chunks can be used instead of the r code block to write code inside a multiverse object. See or refer to the vignette (vignette("multiverse-in-rmd")) for more details on using the multiverse with RMarkdown.

The multiverse, with declared code and branches

Once you add the code to the multiverse, it automatically parses the code to identify the parameters and the corresponding options that have been defined for each parameter.

Once the code has been added, the multiverse object will have the following attributes:

parameters, which is a list of parameters

parameters(M)

conditions, which is a list of conditions (we’ll define this later)
expand, which is a tibble consisting of all possible combination of values for the multiverse

expand(M)

code, which is the code that the user passes to the multiverse to conduct a multiverse analysis. However, we do not execute this code and it is stored unevaluated. The user can interactively edit and rewrite this code, and can execute it for the current analysis or the entire multiverse using dedicated functions.

code(M)

Running a single analysis from the multiverse

At this point, we have defined three possible processing options (three universes) in our multiverse. Although we don’t execute all the universes in the multiverses once they are defined, we do run the default analysis (i.e. the first row in the multiverse table). We can extract objects from the default analysis using the $ operator.

M$df

A multiverse with all possible combinations specified

Besides calculating the onset of the next menstruation cycle, there are other variables which have multiple valid and reasonable processing options. These include defining Relationship and Fertility, and exclusion criteria based on the values for cycle length and certainty of responses. The next code chunk illustrates how this can be added to the multiverse object defined above.

inside(M, {
  df <- df %>%
      mutate(RelationshipStatus = branch( relationship_status, 
        "rs_option1" ~ factor(ifelse(Relationship==1 | Relationship==2, 'Single', 'Relationship')),
        "rs_option2" ~ factor(ifelse(Relationship==1, 'Single', 'Relationship')),
        "rs_option3" ~ factor(ifelse(Relationship==1, 'Single', ifelse(Relationship==3 | Relationship==4, 'Relationship', NA))) )
      ) %>%
      mutate(
        CycleDay = 28 - (NextMenstrualOnset - DateTesting),
        CycleDay = ifelse(WorkerID == 15, 11, ifelse(WorkerID == 16, 18, CycleDay)),
        CycleDay = ifelse(CycleDay > 1 & CycleDay < 28, CycleDay, ifelse(CycleDay < 1, 1, 28))
      ) %>%
      filter( branch(cycle_length, 
        "cl_option1" ~ TRUE,
        "cl_option2" ~ ComputedCycleLength > 25 & ComputedCycleLength < 35,
        "cl_option3" ~ ReportedCycleLength > 25 & ReportedCycleLength < 35
      )) %>%
      filter( branch(certainty,
        "cer_option1" ~ TRUE,
        "cer_option2" ~ Sure1 > 6 | Sure2 > 6
      )) %>%
      mutate( Fertility = branch( fertile,
        "fer_option1" ~ factor( ifelse(CycleDay >= 7 & CycleDay <= 14, "high", ifelse(CycleDay >= 17 & CycleDay <= 25, "low", "medium")) ),
        "fer_option2" ~ factor( ifelse(CycleDay >= 6 & CycleDay <= 14, "high", ifelse(CycleDay >= 17 & CycleDay <= 27, "low", "medium")) ),
        "fer_option3" ~ factor( ifelse(CycleDay >= 9 & CycleDay <= 17, "high", ifelse(CycleDay >= 18 & CycleDay <= 25, "low", "medium")) ),
        "fer_option4" ~ factor( ifelse(CycleDay >= 8 & CycleDay <= 14, "high", "low") ),
        "fer_option5" ~ factor( ifelse(CycleDay >= 8 & CycleDay <= 17, "high", "low") )
      ))
})

Since the multiverse object has already been created and the one parameter has already been defined, the inside function will add to the previous code.

code(M)

The expand will contain all the possible combinations of the parameter options that have been identified.

expand(M) %>%
  head()

In our multiverse we have identified 5 options for calculating fertility, 3 options for calculating menstrual calculation and relationship status each, 3 wyas of excluding participants based on their cycle length and 2 ways of excluding participants based on the self-reported certainty of their responses.

This results in $ 5 = 270$ possible combinations.

expand(M) %>% nrow()

We can then inspect the default analysis the default single universe analysis from this multiverse:

M$df %>%
  head()

Specifying conditions in the multiverse analysis

In a multiverse analysis, it may occur that the value of one variable might depend on the value of another variable defined previously. For example, in our example, depending on how we filter participants based on cycle length, we can only the corresponding value for calculating participants’ NextMenstrualOnset. In other words, if we are using ComputedCycleLength to exclude participants, this means that we should not calculate the variable NextMenstrualOnset (date for the onset of the next menstrual cycle) using the ReportedCycleLength value. Similarly, if we are using ReportedCycleLength to exclude participants it is inconsistent to calculate NextMenstrualOnset using ComputedCycleLength.

We can express these conditionals in the multiverse (See vignette(“Conditions”) for more details). Below, we use the %when% operator:

df <- data.raw.study2  %>%
    mutate( ComputedCycleLength = StartDateofLastPeriod - StartDateofPeriodBeforeLast ) %>%
    mutate(NextMenstrualOnset = branch(menstrual_calculation,
            "mc_option1" ~ (StartDateofLastPeriod + ComputedCycleLength) %when% (cycle_length != "cl_option3"),
            "mc_option2" ~ (StartDateofLastPeriod + ReportedCycleLength) %when% (cycle_length != "cl_option2"),
            "mc_option3" ~ StartDateNext)
    )

Putting it all together

Specifying these conditions allows us to exclude inconsistent combinations from our analyses. Let’s update our example by including these conditions:

M = multiverse()

inside(M, {
  df <- data.raw.study2 %>%
    mutate( ComputedCycleLength = StartDateofLastPeriod - StartDateofPeriodBeforeLast )  %>%
    dplyr::filter( branch(cycle_length,
        "cl_option1" ~ TRUE,
        "cl_option2" ~ ComputedCycleLength > 25 & ComputedCycleLength < 35,
        "cl_option3" ~ ReportedCycleLength > 25 & ReportedCycleLength < 35
    )) %>%
    dplyr::filter( branch(certainty,
        "cer_option1" ~ TRUE,
        "cer_option2" ~ Sure1 > 6 | Sure2 > 6
    )) %>%
    mutate(NextMenstrualOnset = branch(menstrual_calculation,
        "mc_option1" %when% (cycle_length != "cl_option3") ~ StartDateofLastPeriod + ComputedCycleLength,
        "mc_option2" %when% (cycle_length != "cl_option2") ~ StartDateofLastPeriod + ReportedCycleLength,
        "mc_option3" ~ StartDateNext)
    )  %>%
    mutate(
      CycleDay = 28 - (NextMenstrualOnset - DateTesting),
      CycleDay = ifelse(WorkerID == 15, 11, ifelse(WorkerID == 16, 18, CycleDay)),
      CycleDay = ifelse(CycleDay > 1 & CycleDay < 28, CycleDay, ifelse(CycleDay < 1, 1, 28))
    ) %>%
    mutate( Fertility = branch( fertile,
        "fer_option1" ~ factor( ifelse(CycleDay >= 7 & CycleDay <= 14, "high", ifelse(CycleDay >= 17 & CycleDay <= 25, "low", NA)) ),
        "fer_option2" ~ factor( ifelse(CycleDay >= 6 & CycleDay <= 14, "high", ifelse(CycleDay >= 17 & CycleDay <= 27, "low", NA)) ),
        "fer_option3" ~ factor( ifelse(CycleDay >= 9 & CycleDay <= 17, "high", ifelse(CycleDay >= 18 & CycleDay <= 25, "low", NA)) ),
        "fer_option4" ~ factor( ifelse(CycleDay >= 8 & CycleDay <= 14, "high", "low") ),
        "fer_option5" ~ factor( ifelse(CycleDay >= 8 & CycleDay <= 17, "high", "low") )
    )) %>%
    mutate(RelationshipStatus = branch(relationship_status,
        "rs_option1" ~ factor(ifelse(Relationship==1 | Relationship==2, 'Single', 'Relationship')),
        "rs_option2" ~ factor(ifelse(Relationship==1, 'Single', 'Relationship')),
        "rs_option3" ~ factor(ifelse(Relationship==1, 'Single', ifelse(Relationship==3 | Relationship==4, 'Relationship', NA))) )
    )
})

After excluding the inconsistent choice combinations, $270 − 2 \times (5 \times 1 \times 3 \times 1 \times 2) = 210$ choice combinations remain:

expand(M) %>% nrow()

Now, we’ve created the complete multiverse that was presented as example #2 from Steegen et al.’s paper.

Modeling

Steegen et al. create 6 models. The first model uses data from example #1. The other five models use the data from example #2, which we’ve using so far.

Model #2: Effect of Fertility and Relationship status on Religiosity

The authors compute a composite score of Religiosity by calculating the average of the three Religiosity items.

inside(M, {
  df <- df %>%
  mutate( RelComp = round((Rel1 + Rel2 + Rel3)/3, 2))
})

The authors perform an ANOVA to study the effect of Fertility, Relationship and their interaction term, on the composite Religiosity score. We fit the linear model using the call: lm( RelComp ~ Fertility * RelationshipStatus, data = df ) inside our multiverse and save the result to a variable called fit_RelComp.

inside(M, {
  fit_RelComp <- lm( RelComp ~ Fertility * RelationshipStatus, data = df )
})

To extract the results from the analysis, we first create a tidy data-frame of the results of the model, using broom::tidy. Recall that declaring a variable in the multiverse only executes it in the default universe, and hence we need to call execute_multiverse() to execute our analysis in each multiverse.

inside(M, {
  summary_RelComp <- fit_RelComp %>% 
    broom::tidy( conf.int = TRUE )
})

execute_multiverse(M)

Now that we have performed the analysis in each universe of the multiverse, we need to plot the data. To plot the data, we need to extract the relevant result data-frame from each universe into a single data-frame. The following code does this, by extracting the variable where the estimates of the model are stored, summary_RelComp and creating a single tidy data-frame that can be accessed easily.

expand(M) %>%
  extract_variables(summary_RelComp) %>%
  unnest( cols = c(summary_RelComp) ) %>%
  head( 10 )

We then take this data frame and plot the results as a confidence interval and point estimate using ggplot2. As you can see, this is similar to how you would plot a point estimate and confidence intervals for a regular analysis. We then use gganimate to animate through the results of each universe to quickly get an overview of the robustness of the results.

Note: we discuss extracting results from the multiverse and visualising them in more detail in vignette(visualising-multiverse)

p <- expand(M) %>%
  extract_variables(summary_RelComp) %>%
  unnest( cols = c(summary_RelComp) ) %>%
  mutate( term = recode( term, 
                 "RelationshipStatusSingle" = "Single",
                 "Fertilitylow:RelationshipStatusSingle" = "Single:Fertility_low"
  ) ) %>%
  filter( term != "(Intercept)" ) %>%
  ggplot() + 
  geom_vline( xintercept = 0,  colour = '#979797' ) +
  geom_point( aes(x = estimate, y = term)) +
  geom_errorbarh( aes(xmin = conf.low, xmax = conf.high, y = term), height = 0) +
  theme_minimal() +
  transition_manual( .universe )

animate(p, nframes = 210, fps = 4, res = 72)

Animation of linear regression coefficients

Abhraneel Sarma, Northwestern University

2023-11-25