Fits a Gaussian mixture to a numeric dataset that contains missing values
and draws m completed datasets from the mixture conditional
\(p(x_{\mathrm{missing}} \mid x_{\mathrm{observed}})\). Because the
mixture can be multimodal and heteroscedastic, the imputations follow the
shape of the joint distribution rather than that of a single Gaussian. This
helps keep downstream inference valid on data that a single-Gaussian or
linear-Gaussian imputer would mis-specify.
Usage
gmm_impute(
  data,
  N = NULL,
  m = 20L,
  mechanism = mar(),
  seed = NULL,
  max_iter = 100L,
  tol = 1e-06,
  ridge_eps = 1e-06
)
Arguments
- data
  A numeric matrix or data frame with NA for missing entries.
- N
  Number of mixture components. NULL (the default) selects it by the Bayesian information criterion over 1:6.
- m
  Number of completed datasets to draw. Default 20L.
- mechanism
  A missingness mechanism: mar(), censored(), or mnar(). The string "mar" is also accepted. Default mar().
- seed
  Optional integer seed. When supplied, the result is reproducible and the ambient random-number state is restored on exit.
- max_iter
  Maximum EM iterations per fit. Default 100L.
- tol
  Relative log-likelihood tolerance for EM convergence. Default 1e-6.
- ridge_eps
  Ridge added to each component covariance at every M-step. Default 1e-6.
Value
A gmm_imputation object.
Details
Imputation is conditioning. For a row with some coordinates observed, the
missing coordinates follow the closed-form mixture conditional (the same
Schur-complement algebra as gmm_conditionalise()). The mixture is fitted
to the incomplete data by expectation-maximisation: the E-step uses each
row's observed margin, and the M-step adds back the conditional covariance
of the filled-in entries so that component variances are not under-estimated.
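The conditioning step can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: draw_mixture_conditional() and its arguments (weights, means, covs) are hypothetical names, and the mixture parameters are taken as already fitted. Each component is reweighted by the density of the row's observed margin, one component is sampled, and the missing block is drawn from that component's Gaussian conditional via the Schur complement.

```r
# Hypothetical helper, not part of the package: fill the NA entries of one
# row `x` with a draw from the mixture conditional p(x_missing | x_observed).
# `weights` is a vector of mixing proportions, `means` a list of mean
# vectors, `covs` a list of covariance matrices (one entry per component).
draw_mixture_conditional <- function(x, weights, means, covs) {
  mis <- which(is.na(x))
  obs <- which(!is.na(x))
  if (length(mis) == 0L) return(x)
  n_comp <- length(weights)

  # Log posterior weight of each component given the observed margin only.
  # The (2 * pi) constant is identical across components and is dropped.
  log_w <- vapply(seq_len(n_comp), function(k) {
    if (length(obs) == 0L) return(log(weights[k]))
    d <- x[obs] - means[[k]][obs]
    R <- chol(covs[[k]][obs, obs, drop = FALSE])   # S_oo = t(R) %*% R
    z <- backsolve(R, d, transpose = TRUE)         # solves t(R) z = d
    log(weights[k]) - sum(log(diag(R))) - 0.5 * sum(z^2)
  }, numeric(1))
  w <- exp(log_w - max(log_w))
  w <- w / sum(w)

  # Sample a component, then draw from its Gaussian conditional.
  k <- sample.int(n_comp, size = 1L, prob = w)
  mu <- means[[k]]
  S <- covs[[k]]
  if (length(obs) > 0L) {
    S_mo <- S[mis, obs, drop = FALSE]
    gain <- S_mo %*% solve(S[obs, obs, drop = FALSE])    # S_mo S_oo^{-1}
    cond_mean <- mu[mis] + drop(gain %*% (x[obs] - mu[obs]))
    cond_cov <- S[mis, mis, drop = FALSE] - gain %*% t(S_mo)  # Schur complement
  } else {
    cond_mean <- mu[mis]
    cond_cov <- S[mis, mis, drop = FALSE]
  }
  L <- t(chol(cond_cov))
  x[mis] <- cond_mean + drop(L %*% rnorm(length(mis)))
  x
}

# Two well-separated components; the observed value 3.2 points to the second.
set.seed(1)
weights <- c(0.5, 0.5)
means <- list(c(-3, -3), c(3, 3))
covs <- list(diag(2), diag(2))
completed <- draw_mixture_conditional(c(3.2, NA), weights, means, covs)
```

Because the component is sampled rather than chosen by maximum weight, repeated draws for a row that sits between two modes land in both modes, which a single-Gaussian conditional cannot reproduce.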
Proper multiple imputation requires the fitted parameters themselves to
carry uncertainty; otherwise the pooled intervals are too narrow. Each of
the m imputations is therefore drawn under a mixture fitted to an
independent bootstrap resample of the rows, so proxy_pool() reflects both
imputation and parameter uncertainty.
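To see why parameter uncertainty matters for pooling, here is a minimal sketch of Rubin's rules in base R. It assumes the pooling follows the standard Rubin combination; proxy_pool() may differ in detail, and pool_rubin() is a hypothetical name used only for illustration.

```r
# Hypothetical helper illustrating Rubin's rules; not the package's proxy_pool().
# `estimates` holds one point estimate per completed dataset and `variances`
# the matching squared standard errors.
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  stopifnot(m >= 2L, length(variances) == m)
  q_bar <- mean(estimates)      # pooled point estimate
  within <- mean(variances)     # average within-imputation variance
  between <- var(estimates)     # between-imputation variance
  total <- within + (1 + 1 / m) * between
  list(estimate = q_bar, within = within, between = between,
       total = total, se = sqrt(total))
}

pooled <- pool_rubin(estimates = c(1.0, 1.2, 0.8),
                     variances = c(0.04, 0.05, 0.03))
```

If every imputation were drawn under the same fitted mixture, the between-imputation term would capture only the noise of the conditional draws. Refitting on a bootstrap resample, e.g. rows indexed by sample.int(n, n, replace = TRUE), lets the estimates also vary with the fitted parameters, which widens the total variance accordingly.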
The mechanism says how an entry came to be missing, which sets the
conditional the missing value is drawn from: mar() (the default) for missing
at random, censored() for a known interval such as a detection limit, or
mnar() for a value-dependent selection model. The interval and
value-dependent gates act on a single coordinate, and a row missing that
coordinate must have its other coordinates observed. Numeric data only;
categorical variables are out of scope.
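As an illustration of how an interval gate changes the draw, the sketch below samples a univariate Gaussian restricted to a known interval by inverting the CDF. It assumes that a censored value is drawn from its conditional truncated to the interval; how censored() does this internally is not stated here, and rtruncnorm_interval() is a hypothetical name.

```r
# Hypothetical helper: draw n values from N(mean, sd^2) restricted to
# the interval [lower, upper] by inverse-CDF sampling.
rtruncnorm_interval <- function(n, mean, sd, lower = -Inf, upper = Inf) {
  stopifnot(lower < upper, sd > 0)
  p_lo <- pnorm(lower, mean, sd)   # probability mass below the interval
  p_hi <- pnorm(upper, mean, sd)   # probability mass up to the upper bound
  u <- runif(n, min = p_lo, max = p_hi)
  qnorm(u, mean, sd)
}

# A value known only to lie below a detection limit of 0.5.
set.seed(42)
draws <- rtruncnorm_interval(1000, mean = 1, sd = 2, upper = 0.5)
```

Inverse-CDF sampling is simple but loses accuracy when the interval lies far in a tail, where p_lo and p_hi are nearly equal; a rejection sampler is the usual alternative in that case.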
See also
gmm_complete() to extract completions, proxy_pool() to pool an
estimand across them, gmm_conditionalise() for the conditioning algebra.
Other imputation:
as_mids(),
gmm_complete(),
gmm_imputation(),
mechanism,
proxy_fmi(),
proxy_mnar_sensitivity(),
proxy_pool()