Multiple imputation by Gaussian-mixture conditioning

Fits a Gaussian mixture to a numeric dataset that contains missing values and draws m completed datasets from the mixture conditional \(p(x_{\mathrm{missing}} \mid x_{\mathrm{observed}})\). Because the mixture can be multimodal and heteroscedastic, the imputations follow the shape of the joint distribution rather than that of a single Gaussian, which helps keep downstream inference valid on data for which a single-Gaussian or linear-Gaussian imputer is mis-specified.

Usage

gmm_impute(
  data,
  N = NULL,
  m = 20L,
  mechanism = mar(),
  seed = NULL,
  max_iter = 100L,
  tol = 1e-06,
  ridge_eps = 1e-06
)

Arguments

data: A numeric matrix or data frame with NA for missing entries.
N: Number of mixture components. NULL (the default) selects it by the Bayesian information criterion over 1:6.
m: Number of completed datasets to draw. Default 20L.
mechanism: A missingness mechanism: mar(), censored(), or mnar(). The string "mar" is also accepted. Default mar().
seed: Optional integer seed. When supplied, the result is reproducible and the ambient random-number state is restored on exit.
max_iter: Maximum EM iterations per fit. Default 100L.
tol: Relative log-likelihood tolerance for EM convergence. Default 1e-6.
ridge_eps: Ridge added to each component covariance at every M-step. Default 1e-6.

Value

A gmm_imputation object.

Details

Imputation is conditioning. For a row with some coordinates observed, the missing coordinates follow the closed-form mixture conditional (the same Schur-complement algebra as gmm_conditionalise()). The mixture is fitted to the incomplete data by expectation-maximisation: the E-step uses each row's observed margin, and the M-step restores the conditional covariance of the filled-in entries, so component variances are not underestimated.
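The conditioning step can be sketched in base R for a single row. This is a minimal illustration, not the package's implementation: draw_conditional() and its argument layout (a weight vector plus lists of component means and covariances) are hypothetical names chosen for the example. It reweights the components by their density on the observed coordinates, picks one component, and draws from that component's Gaussian conditional.

```r
# Hypothetical helper, not part of the package: draw the missing coordinates
# of one row x from the mixture conditional p(x_missing | x_observed).
# weights: length-K vector; means: list of K mean vectors;
# covs: list of K covariance matrices.
draw_conditional <- function(x, weights, means, covs) {
  mis <- is.na(x)
  obs <- !mis
  if (!any(mis)) return(x)
  K <- length(weights)
  p_obs <- sum(obs)

  # log(pi_k) + log N(x_obs; mu_k[obs], Sigma_k[obs, obs]) per component.
  log_resp <- vapply(seq_len(K), function(k) {
    if (p_obs == 0L) return(log(weights[k]))
    S_oo <- covs[[k]][obs, obs, drop = FALSE]
    r <- x[obs] - means[[k]][obs]
    log_det <- as.numeric(determinant(S_oo, logarithm = TRUE)$modulus)
    quad <- sum(r * solve(S_oo, r))
    log(weights[k]) - 0.5 * (p_obs * log(2 * pi) + log_det + quad)
  }, numeric(1))

  # Normalise to responsibilities; subtracting the max avoids underflow.
  resp <- exp(log_resp - max(log_resp))
  resp <- resp / sum(resp)

  # Pick a component, then condition it on the observed coordinates.
  k <- sample.int(K, 1L, prob = resp)
  mu <- means[[k]]
  S <- covs[[k]]
  if (p_obs == 0L) {
    cond_mean <- mu[mis]
    cond_cov <- S[mis, mis, drop = FALSE]
  } else {
    S_oo <- S[obs, obs, drop = FALSE]
    S_mo <- S[mis, obs, drop = FALSE]
    cond_mean <- mu[mis] + as.numeric(S_mo %*% solve(S_oo, x[obs] - mu[obs]))
    # Schur complement of S_oo in S.
    cond_cov <- S[mis, mis, drop = FALSE] - S_mo %*% solve(S_oo, t(S_mo))
  }
  cond_cov <- (cond_cov + t(cond_cov)) / 2   # remove round-off asymmetry
  U <- chol(cond_cov)                        # t(U) %*% U equals cond_cov
  x[mis] <- cond_mean + as.numeric(t(U) %*% rnorm(sum(mis)))
  x
}

# Two well-separated components; the observed value 2.1 makes the
# component centred at (2, 2) far more likely.
set.seed(1)
draw_conditional(c(2.1, NA), c(0.5, 0.5),
                 list(c(-2, -2), c(2, 2)), list(diag(2), diag(2)))
```

Sampling a component first and then drawing from its conditional yields a draw from the full mixture conditional, so a bimodal conditional produces imputations in both modes across repeated draws.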

Proper multiple imputation requires the fitted parameters themselves to carry uncertainty; otherwise the pooled intervals are too narrow. Each of the m imputations is therefore drawn under a mixture fitted to an independent bootstrap resample of the rows, so proxy_pool() reflects both imputation and parameter uncertainty.
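To show why between-imputation spread matters, here is a minimal sketch of Rubin's rules, the standard way to combine estimates across completed datasets. Whether proxy_pool() uses exactly this formula is an assumption; pool_rubin() is a hypothetical helper written for this illustration only.

```r
# Hypothetical helper: pool m point estimates and their squared standard
# errors with Rubin's rules. That proxy_pool() works this way is assumed.
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  stopifnot(m >= 2L, length(variances) == m)
  q_bar <- mean(estimates)            # pooled point estimate
  w_bar <- mean(variances)            # within-imputation variance
  b     <- var(estimates)             # between-imputation variance
  total <- w_bar + (1 + 1 / m) * b    # total variance
  list(estimate = q_bar, within = w_bar, between = b,
       total = total, se = sqrt(total))
}

pool_rubin(c(1, 2, 3), c(1, 1, 1))$total
#> [1] 2.333333
```

If every imputation were drawn under the same fitted parameters, the between-imputation term would miss the parameter uncertainty and the total variance would come out too small, which is the narrowing that the bootstrap refits are meant to avoid.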

The mechanism describes how an entry came to be missing, which determines the conditional distribution the missing value is drawn from: mar() (the default) for missing at random, censored() for a known interval such as a detection limit, or mnar() for a value-dependent selection model. The interval and value-dependent gates act on a single coordinate, and a row missing that coordinate must have its other coordinates observed. Numeric data only; categorical variables are out of scope.
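As an illustration of the interval case, the sketch below draws from a univariate normal restricted to a known interval by inverting the CDF. It covers only a single Gaussian conditional (for example N = 1); with several components, each component's weight would additionally have to be rescaled by its probability mass on the interval. draw_truncated_normal() is a hypothetical helper and is not claimed to be how censored() is implemented.

```r
# Hypothetical helper: draw n values from N(mean, sd^2) restricted to
# [lower, upper] by inverse-CDF sampling.
draw_truncated_normal <- function(n, mean, sd, lower = -Inf, upper = Inf) {
  stopifnot(sd > 0, lower < upper)
  p_lo <- pnorm(lower, mean, sd)   # CDF at the lower bound
  p_hi <- pnorm(upper, mean, sd)   # CDF at the upper bound
  u <- runif(n, p_lo, p_hi)        # uniform on the admissible CDF range
  qnorm(u, mean, sd)               # map back to the value scale
}

# A value known only to lie below a detection limit of -1, with
# conditional mean 0 and standard deviation 1.
set.seed(1)
draws <- draw_truncated_normal(5, mean = 0, sd = 1, upper = -1)
all(draws <= -1)
```

Inverse-CDF sampling is simple but loses precision when the interval sits far in a tail of the conditional, so it is best read as a demonstration of the idea.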

Examples

set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200)
x2[runif(200) < plogis(x1)] <- NA          # missing at random on x1
imp <- gmm_impute(cbind(x1, x2), N = 1L, m = 10L, seed = 1L)
proxy_pool(imp, "x2")$estimate             # pooled mean of x2
#> [1] -0.06410561