misl implements Multiple Imputation by Super
Learning, a flexible approach to handling missing data
described in:
Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super learning. Statistical Methods in Medical Research. 31(10):1904–1915. doi: 10.1177/09622802221104238
Rather than relying on a single parametric imputation model,
misl builds a stacked ensemble of machine learning
algorithms for each incomplete column, producing well-calibrated
imputations for continuous, binary, categorical, and ordered categorical
variables.
We simulate a small dataset with four types of incomplete variables
to demonstrate misl across all supported outcome types.
library(misl)
set.seed(42)
n <- 300
sim_data <- data.frame(
# Continuous predictors (always observed)
age = round(rnorm(n, mean = 50, sd = 12)),
bmi = round(rnorm(n, mean = 26, sd = 4), 1),
# Continuous outcome with missingness
sbp = round(120 + 0.4 * rnorm(n, mean = 50, sd = 12) +
0.3 * rnorm(n, mean = 26, sd = 4) + rnorm(n, sd = 10)),
# Binary outcome with missingness (0 = no, 1 = yes)
smoker = rbinom(n, 1, prob = 0.3),
# Unordered categorical outcome with missingness
group = factor(sample(c("A", "B", "C"), n, replace = TRUE,
prob = c(0.4, 0.35, 0.25))),
# Ordered categorical outcome with missingness
health = factor(sample(c("Poor", "Fair", "Good", "Excellent"), n,
replace = TRUE, prob = c(0.1, 0.2, 0.5, 0.2)),
levels = c("Poor", "Fair", "Good", "Excellent"),
ordered = TRUE)
)
# Introduce missing values
sim_data[sample(n, 40), "sbp"] <- NA
sim_data[sample(n, 30), "smoker"] <- NA
sim_data[sample(n, 30), "group"] <- NA
sim_data[sample(n, 30), "health"] <- NA
# Summarise missingness
sapply(sim_data, function(x) sum(is.na(x)))
#> age bmi sbp smoker group health
#> 0 0 40 30 30 30Use list_learners() to see the available named learners,
optionally filtered by outcome type:
| learner | description | continuous | binomial | categorical | ordinal | package | installed |
|---|---|---|---|---|---|---|---|
| glm | Linear / logistic regression | TRUE | TRUE | FALSE | FALSE | stats | TRUE |
| mars | Multivariate adaptive regression splines | TRUE | TRUE | FALSE | FALSE | earth | TRUE |
| multinom_reg | Multinomial regression | FALSE | FALSE | TRUE | FALSE | nnet | TRUE |
| polr | Proportional odds logistic regression | FALSE | FALSE | FALSE | TRUE | MASS | TRUE |
| rand_forest | Random forest | TRUE | TRUE | TRUE | TRUE | ranger | TRUE |
| boost_tree | Gradient boosted trees | TRUE | TRUE | TRUE | TRUE | xgboost | TRUE |
| learner | description | package | installed |
|---|---|---|---|
| polr | Proportional odds logistic regression | MASS | TRUE |
| rand_forest | Random forest | ranger | TRUE |
| boost_tree | Gradient boosted trees | xgboost | TRUE |
The primary function is misl(). Supply a dataset and
specify:
m – the number of multiply imputed datasets to
createmaxit – the number of Gibbs sampling iterations per
datasetcon_method, bin_method,
cat_method, ord_method – learners for each
outcome typeRunning misl() with a single named learner per outcome
type is the fastest option and is well suited for exploratory work. Note
that ordered factors are automatically detected and routed to
ord_method:
misl_imp <- misl(
sim_data,
m = 5,
maxit = 5,
con_method = "glm",
bin_method = "glm",
cat_method = "multinom_reg",
ord_method = "polr",
seed = 42,
quiet = TRUE
)misl() returns a list of length m. Each
element contains:
$datasets – the fully imputed tibble$trace – a long-format tibble of convergence
statisticsIn addition to the built-in named learners, misl v2.0
accepts any parsnip-compatible model spec directly. This allows you to
use any learner available in the tidymodels ecosystem without waiting
for it to be added to the built-in registry.
Simply pass a parsnip model spec in place of (or alongside) a named
string. misl will automatically enforce the correct mode
(regression vs classification) regardless of what is set on the
spec:
library(parsnip)
# A random forest with custom hyperparameters
custom_rf <- rand_forest(trees = 500, mtry = 3) |>
set_engine("ranger")
misl_custom <- misl(
sim_data,
m = 5,
maxit = 5,
con_method = list(custom_rf),
bin_method = list(custom_rf),
cat_method = "multinom_reg",
ord_method = "polr",
seed = 42,
quiet = TRUE
)Named strings and parsnip specs can be freely mixed in the same list.
When multiple learners are supplied, misl uses
cross-validation to build a stacked ensemble:
library(parsnip)
# Mix a named learner with a custom tuned spec
custom_xgb <- boost_tree(trees = 200, learn_rate = 0.05) |>
set_engine("xgboost")
misl_mixed <- misl(
sim_data,
m = 5,
maxit = 5,
con_method = list("glm", custom_xgb),
bin_method = list("glm", custom_xgb),
cat_method = list("multinom_reg", "rand_forest"),
ord_method = list("polr", "rand_forest"),
cv_folds = 3,
seed = 42
)Any parsnip-supported engine can be used. For example, a support
vector machine via the kernlab package:
library(parsnip)
# SVM - not in the built-in registry but works via parsnip
svm_spec <- svm_rbf() |>
set_engine("kernlab")
misl_svm <- misl(
sim_data,
m = 5,
maxit = 5,
con_method = list("glm", svm_spec),
bin_method = list("glm", svm_spec),
cat_method = "multinom_reg",
ord_method = "polr",
cv_folds = 3,
seed = 42
)For ordered categorical variables, misl automatically
detects ordered factors and routes them to ord_method. The
default learner is "polr" (proportional odds logistic
regression from the MASS package), which respects the
ordering of the levels:
# Ensure your ordered variable is an ordered factor
sim_data$health <- factor(sim_data$health,
levels = c("Poor", "Fair", "Good", "Excellent"),
ordered = TRUE
)
misl_ordinal <- misl(
sim_data,
m = 5,
maxit = 5,
con_method = "glm",
bin_method = "glm",
cat_method = "multinom_reg",
ord_method = "polr", # proportional odds model for ordered categories
seed = 42,
quiet = TRUE
)
# Imputed values respect the ordering
is.ordered(misl_ordinal[[1]]$datasets$health)
levels(misl_ordinal[[1]]$datasets$health)When multiple learners are supplied (whether named strings, parsnip
specs, or a mix), misl uses cross-validation to learn
optimal ensemble weights via the stacks package. Use
cv_folds to reduce the number of folds and speed up
computation:
After imputation, fit your analysis model to each of the
m datasets and pool the results using Rubin’s rules. Here
we implement pooling manually using base R:
# Fit a linear model to each imputed dataset
models <- lapply(misl_imp, function(imp) {
lm(sbp ~ age + bmi + smoker + group + health, data = imp$datasets)
})
# Pool point estimates and standard errors using Rubin's rules
m <- length(models)
ests <- sapply(models, function(fit) coef(fit))
vars <- sapply(models, function(fit) diag(vcov(fit)))
Q_bar <- rowMeans(ests) # pooled estimate
U_bar <- rowMeans(vars) # within-imputation variance
B <- apply(ests, 1, var) # between-imputation variance
T_total <- U_bar + (1 + 1 / m) * B # total variance
pooled <- data.frame(
term = names(Q_bar),
estimate = round(Q_bar, 4),
std.error = round(sqrt(T_total), 4),
conf.low = round(Q_bar - 1.96 * sqrt(T_total), 4),
conf.high = round(Q_bar + 1.96 * sqrt(T_total), 4)
)
print(pooled)For a full implementation of Rubin’s rules including degrees of
freedom and p-values, the mice package provides
pool() and can be used directly with misl
output:
The plot_misl_trace() function plots the mean or
standard deviation of imputed values across iterations for a given
variable, with one line per imputed dataset. Stable traces that mix well
across datasets indicate convergence. Note that trace statistics are
only computed for continuous and numeric binary columns — they are not
available for categorical or ordinal columns.
# Plot mean of imputed sbp values across iterations for each dataset
plot_misl_trace(misl_imp, variable = "sbp", ylab = "Mean imputed sbp (mm Hg)")Stable traces that mix well across datasets indicate the algorithm has converged.
The m imputed datasets are generated independently and
can be run in parallel using the future framework. Set a
parallel plan before calling misl():
library(future)
# Use all available cores
plan(multisession)
misl_parallel <- misl(
sim_data,
m = 10,
maxit = 5,
con_method = c("glm", "rand_forest"),
bin_method = c("glm", "rand_forest"),
cat_method = c("multinom_reg", "rand_forest"),
ord_method = c("polr", "rand_forest"),
seed = 42
)
# Always reset the plan when done
plan(sequential)To limit the number of cores:
The largest speedup comes from running the m datasets
simultaneously. Check how many cores are available with:
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.4
#>
#> Matrix products: default
#> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] misl_2.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.7.2 cli_3.6.5 knitr_1.50
#> [4] rlang_1.1.7 xfun_0.54 Formula_1.2-5
#> [7] data.table_1.18.2.1 jsonlite_2.0.0 glue_1.8.0
#> [10] ranger_0.18.0 nnet_7.3-20 htmltools_0.5.8.1
#> [13] sass_0.4.10 rmarkdown_2.30 grid_4.5.2
#> [16] evaluate_1.0.5 jquerylib_0.1.4 tibble_3.3.1
#> [19] MASS_7.3-65 earth_5.3.5 fastmap_1.2.0
#> [22] yaml_2.3.10 lifecycle_1.0.5 compiler_4.5.2
#> [25] Rcpp_1.1.1 pkgconfig_2.0.3 rstudioapi_0.17.1
#> [28] lattice_0.22-7 digest_0.6.39 xgboost_3.2.1.1
#> [31] R6_2.6.1 pillar_1.11.1 magrittr_2.0.4
#> [34] Matrix_1.7-4 bslib_0.9.0 tools_4.5.2
#> [37] plotmo_3.7.0 plotrix_3.8-14 cachem_1.1.0