Changelog
Source: NEWS.md
proxymix 0.15.1
Housekeeping
- Author list reduced to the package author-maintainer. Credit for the foundational theory remains with van der Hoek and Elliott (2024) through the package CITATION, the DESCRIPTION references, and the documentation.
- Typography pass over the earlier vignettes (dashes normalised); no content changes.
proxymix 0.15.0
New features
- The proxy’s own error budget. gmm_fit_ensemble() quantifies the sampling variability of the fitted mixture itself by a Bayesian (weighted) bootstrap: each replicate re-weights the fit’s own observations with Dirichlet weights and refits by a warm-started weighted EM. In regime "kld" the observations are the fit’s cached importance draws, so a replicate costs zero new target evaluations. proxy_functional_ci() turns the ensemble into a percentile interval for any functional of the proxy (moments, tail probabilities via the operator calculus, entropies, conditional means) – functional-space intervals sidestep component label-switching entirely. Coverage validated against analytic truths on a correlated Gaussian target (validation/ensemble_coverage_study.R, R = 200 replications): nominal 90% intervals cover the two mean coordinates and a tail probability at 0.885 / 0.855 / 0.855.
- New flagship vignette, Compressing a Bayesian posterior you can evaluate but not sample: a real unnormalised logistic posterior compressed end to end – adaptive fit, quality certificate, evidence with a Laplace cross-check, closed-form tail reads, and bootstrap error bars. Finite-sample cautions for gmm_eos_test() (estimated-parameter over-rejection of the chi-square calibration; the subsampling p-value floor) are now stated in the manual and vignette, and the README positions the package against its nearest CRAN neighbour (AdMit) and the MCMC-then-fit route explicitly.
- select_N(): component-count selection across regimes. With samples, the smallest BIC (agreeing with mclust::mclustBIC on shared data); with an evaluable-only target, each candidate is scored on its held-out validation draw (whose Monte Carlo standard error is now reported as validation_mc_se) and chosen by the one-standard-error rule. On an evaluable three-component target the selector recovered N = 3 in ten out of ten seeds. The scored table is returned, so callers who prefer to choose by eye can ignore the recommendation.
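The weighted-bootstrap idea behind the ensemble entry above can be sketched outside the package. The following Python snippet is only an illustration under stated assumptions: a weighted mean stands in for the warm-started weighted EM refit, and every name in it (dirichlet_weights, bayesian_bootstrap_ci, and so on) is hypothetical rather than part of proxymix.

```python
import random

def dirichlet_weights(n, rng):
    """Flat Dirichlet(1, ..., 1) weights: normalised Exp(1) variates."""
    g = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(g)
    return [x / total for x in g]

def weighted_mean(xs, ws):
    """Stand-in functional; the package refits a mixture instead (assumption)."""
    return sum(w * x for w, x in zip(ws, xs))

def bayesian_bootstrap_ci(xs, functional, replicates=500, level=0.90, seed=1):
    """Percentile interval of a functional over Dirichlet-reweighted replicates."""
    rng = random.Random(seed)
    stats = sorted(
        functional(xs, dirichlet_weights(len(xs), rng)) for _ in range(replicates)
    )
    alpha = (1.0 - level) / 2.0
    lo = stats[int(alpha * (replicates - 1))]
    hi = stats[int(round((1.0 - alpha) * (replicates - 1)))]
    return lo, hi

# Illustrative data only: 200 draws from N(2, 1).
data_rng = random.Random(0)
data = [data_rng.gauss(2.0, 1.0) for _ in range(200)]
lo, hi = bayesian_bootstrap_ci(data, weighted_mean)
print(lo, hi)
```

Because the interval is taken on the value of the functional rather than on component parameters, the ordering of mixture components never enters the calculation.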
proxymix 0.14.0
New features
- Adaptive proposals for regime (iii): fit_kld_em(adapt = "pmc"). The importance proposal is refreshed from the current EM iterate as a defensive mixture (the fitted components with inflated covariances plus a defensive_gamma share of the original proposal as a heavy-tailed anchor) – the mixture population-Monte-Carlo scheme of Cappé et al. with the Owen–Zhou (2000) defensive safeguard. While a batch is degenerate the refresh fires every iteration with an escalating inflation floored at a growing fraction of the batch’s sample covariance, so a collapsed iterate walks back out to the target; convergence is only accepted on an adapted batch. The per-batch ESS trace ships in diagnostics$ess_history and the total evaluation cost in diagnostics$n_target_evals.
  Validation (validation/adaptive_pmc_study.R in the development repository): with a centred but badly over-wide proposal, adaptation restores the relative effective sample size from 0.006 to 0.26 at p = 20 (5–40x across dimensions) with exact moment recovery; with a proposal mislocated by three standard deviations per coordinate it restores 0.0003 to 0.36 on the warped p = 10 target and beats the fixed proposal by 6–65x through p = 15. The documented boundary: a badly mislocated light anchor at p = 20 leaves too few effective draws to carry directional signal at these budgets – supply a located proposal there. On the low-dimensional shipped scenarios adaptation is uniformly at least as good as the fixed proposal (held-out validation KLD), and adapt = "none" (the default) is byte-identical to the historical behaviour.
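Two quantities in the entry above have simple closed forms: the relative effective sample size of a weighted batch, and the density of a defensive mixture. The Python sketch below is illustrative only; the function names are hypothetical, and the inflation schedule and refresh logic of the package are not reproduced.

```python
import math

def norm_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def relative_ess(log_weights):
    """Relative ESS, (sum w)^2 / (n * sum w^2), from log importance weights."""
    top = max(log_weights)
    w = [math.exp(lw - top) for lw in log_weights]  # rescaling cancels in the ratio
    s1 = sum(w)
    s2 = sum(x * x for x in w)
    return (s1 * s1) / (len(w) * s2)

def defensive_density(x, adapted_pdf, anchor_pdf, gamma):
    """Defensive mixture: (1 - gamma) * adapted proposal + gamma * anchor."""
    return (1.0 - gamma) * adapted_pdf(x) + gamma * anchor_pdf(x)

# Hypothetical narrow adapted proposal and wide anchor, for illustration.
adapted = lambda x: norm_pdf(x, 0.0, 0.01)
anchor = lambda x: norm_pdf(x, 0.0, 25.0)

uniform_batch = relative_ess([0.0] * 10)               # every draw counts equally
collapsed_batch = relative_ess([0.0] + [-1000.0] * 9)  # one draw carries all weight
tail_floor = defensive_density(5.0, adapted, anchor, gamma=0.1)
print(uniform_batch, collapsed_batch, tail_floor)
```

The anchor share bounds the proposal density below by gamma times the anchor density, so an importance weight can never exceed the target-to-anchor ratio divided by gamma, however far the adapted components have collapsed.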
proxymix 0.13.0
New features
- The algebra is completed. Four operations whose mathematics already lived inside the package are now first-class operators: gmm_product() (the normalised pointwise product of two mixtures – the conjugate Bayes update, with the marginal evidence returned as metadata$log_integral and an optional reduce to cap the K1 * K2 component growth), gmm_convolve() (the exact distribution of the sum of independent mixture variables), gmm_mix() (model averaging / mixture-of-mixtures), and gmm_mean() / gmm_cov() (the exact first two moments). pgmm() / qgmm() complete the d/p/q/r quartet in one dimension. All follow the operator metadata policy: the quality certificate travels and the provenance chain grows.
- gmm_evidence(): the normalising constant as a first-class output. With the fitted proxy as the importance proposal, log Z of the target (the log marginal likelihood, for a posterior handed over as likelihood x prior) is estimated in the log domain with a delta-method standard error and a heavy-tail diagnostic (classed warning proxymix_heavy_tail when the proxy’s tails look lighter than the target’s). Validated against known constants for unimodal and multimodal targets.
- Accessors and tidiers. gmm_weights(), gmm_means(), gmm_covariances() read the component parameters without reaching into the @ property layout; broom-style tidy() (component table) and glance() (one-row fit summary) register against the generics package when installed.
- proposal_uniform(), proposal_mvn(), proposal_mvt() are the preferred names of the importance-proposal constructors (the historical is_* prefix reads as a logical predicate); the is_* names remain as aliases.
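The pointwise product described above has a short closed form in one dimension. The Python sketch below is a hedged illustration, not the package implementation: the tuple layout (weight, mean, variance) and the function name gmm_product_1d are assumptions made for the example.

```python
import math

def log_norm_pdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def gmm_product_1d(a, b):
    """Normalised product of two 1-D mixtures given as (weight, mean, variance).

    Returns the K1 * K2 product components and the log of the integral of the
    unnormalised product (the marginal evidence in a Bayes update).
    """
    raw = []
    for wa, ma, va in a:
        for wb, mb, vb in b:
            var = 1.0 / (1.0 / va + 1.0 / vb)       # precisions add
            mean = var * (ma / va + mb / vb)        # precision-weighted mean
            # Pairwise mass: the Gaussian-product integral N(ma; mb, va + vb).
            log_w = math.log(wa) + math.log(wb) + log_norm_pdf(ma, mb, va + vb)
            raw.append((log_w, mean, var))
    top = max(lw for lw, _, _ in raw)
    log_integral = top + math.log(sum(math.exp(lw - top) for lw, _, _ in raw))
    components = [(math.exp(lw - log_integral), m, v) for lw, m, v in raw]
    return components, log_integral

# Illustrative bimodal prior times a single Gaussian likelihood.
prior = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]
likelihood = [(1.0, 1.5, 0.5)]
posterior, log_z = gmm_product_1d(prior, likelihood)
print(posterior, log_z)
```

The component count of the result is the product of the two input counts, which is why a reduction step is offered alongside the operator.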
API changes
- The four planned-interface placeholders (from_aggregate_likelihood(), fit_kld_em_collider(), to_apsim_scenarios(), from_simulator()) and the posterior-producer seam helpers (from_fb_posterior(), fb_log_posterior_spec(), fb_producer_available(), mock_fb_posterior()) are internal as of this release: an exported function whose only behaviour is to error, or whose contract awaits an unreleased counterpart, does not belong on the public surface. Their signatures and behaviour are unchanged and remain under test; the general-purpose S3 generic gmm_target_from_posterior() remains the public route for external posteriors.
- The imputation mechanism layer is explicitly sealed: .as_gate() rejects unknown gate types with a clear error instead of letting a third-party gate object fail deep inside the engine.
Behaviour changes
- gmm_impute() on complete data now does what its warning says: the completions are the data verbatim, with no bootstrap refits, no imputation draws, and no random-number consumption (previously it fitted and drew anyway).
- proxy_pool(method = "analytic") on a gated (MNAR / censored) mechanism announces the downgrade to Rubin’s rules instead of switching silently.
proxymix 0.12.0
New features
- Fit-quality certificate. Every fitter now stamps a quality certificate into the result’s metadata (regime, convergence, degeneracy, the effective-sample-size profile including a new per-component ESS, the support fraction, and the held-out validation gap), readable with the new gmm_fit_quality(). Every closed-form operator carries the certificate through unchanged, together with a provenance vector recording the chain of operations, so the certificate can be read off a marginal, a filtered belief, or any other derived mixture. Downstream verbs (gmm_entropy(), gmm_mutual_information(), gmm_independence_graph(), gmm_intervene(), gmm_counterfactual(), gmm_filter()) read the same certificate and raise a one-shot advisory (class proxymix_low_quality) when the source fit is flagged.
- Degeneracy is a state, not a footnote. An importance-sampling collapse (ESS < min_ess) now flags the fit as degenerate: the warning is classed (proxymix_low_ess), converged is forced to FALSE, and the new on_low_ess = "abort" refuses to return the fit at all (classed error proxymix_degenerate_fit). Previously a fit sitting on one effective draw could report converged = TRUE with a warning as the only trace.
- Held-out validation on by default. validation_size now defaults to ceiling(is_size / 4) rather than 0, so the overfit-vs-generalise diagnostic exists on every regime-(iii) fit. Pass validation_size = 0L to disable.
- autoplot() is now registered for plain gmm objects too, so operator-calculus results (marginals, conditionals, filtered beliefs) plot directly rather than only freshly fitted proxies.
- A dimension disclosure now lives at the core fitter: fit_kld_em() notes p > 5 and warns (classed proxymix_high_dimension) beyond p = 10, instead of only the wrapper entry points knowing the scaling story.
- Classed conditions throughout the fitting path: proxymix_low_ess, proxymix_degenerate_fit, proxymix_high_dimension, proxymix_support, proxymix_nonmonotone, proxymix_low_quality – pipelines can condition-handle instead of matching message text.
Behaviour changes
- Seeded fits are now reproducible end-to-end. fit_kld_em(seed =) previously seeded only the importance-sampling draw; the initialisation resample, the kmeans pass, and empty-component reseeds consumed the ambient random-number stream, so two calls with the same seed could return different fits. All internal draws now derive from the seed. Fits under a given seed therefore differ from 0.11.x. fit_em_samples(seed =) likewise now drives its multi-start, which previously used hard-coded restart seeds.
- from_objective() derives a distinct seed per cooling step: previously the importance-sampling stream was re-seeded identically at every temperature, so with the default exploration = 0.5 half of the draws were byte-identical across the whole ladder and a basin missed by the first exploration draw was never probed again.
- Data-scaled ridge in the fitters. The EM fitters scale their covariance ridge by the data’s covariance scale (previously an absolute constant), so the same default regularises identically at data scale 1e-8 and 1e+8. The floor is constant within a fit – a ridge relative to the component’s own diagonal would shrink together with a collapsing component and stop flooring exactly when needed. At unit data scale the behaviour is unchanged to first order; the operators’ tiny hygiene ridge is unchanged.
- An empty (dead) EM component now has its weight reset alongside its mean and covariance (previously the weight stayed at zero, the reseeded mean was unreachable, and the reseed re-fired every iteration); the reseed covariance is at data scale. Both EM fitters also warn (classed proxymix_nonmonotone) if their objective decreases beyond numerical tolerance, which the documentation previously claimed and the code did not do.
Bug fixes
- Mixture-reduction merge costs are computed in the log domain: the raw Gaussian-product density underflows for large dimension times scale (reaching exactly zero near p = 115 at scale 1e5), which made every merge cost 0/Inf/NaN and the merge order arbitrary.
- hellinger_mc() reports the standard error of the self-normalised weighted estimator; the previous sd(ratio)/sqrt(n) treated skewed importance weights as uniform and could understate the Monte Carlo error by orders of magnitude.
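The difference between the two standard errors in the last entry can be seen on a toy batch. The Python sketch below uses the usual delta-method formula for a self-normalised importance-sampling mean; it is an assumption that this is the formula the package adopted, and the function names are hypothetical.

```python
import math

def snis_mean_and_se(values, weights):
    """Self-normalised weighted mean with its delta-method standard error."""
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    # Squared normalised weights times squared residuals about the weighted mean.
    var = sum((w / total) ** 2 * (v - mean) ** 2 for w, v in zip(weights, values))
    return mean, math.sqrt(var)

def unweighted_se(values):
    """Plain sd / sqrt(n), which is only appropriate for equal weights."""
    n = len(values)
    m = sum(values) / n
    s2 = sum((v - m) ** 2 for v in values) / (n - 1)
    return math.sqrt(s2 / n)

# Illustrative batch: the last draw carries almost all of the weight.
values = [0.0, 1.0, 2.0, 3.0]
skewed = [1.0, 1.0, 1.0, 97.0]
uniform = [1.0, 1.0, 1.0, 1.0]

mean_skewed, se_skewed = snis_mean_and_se(values, skewed)
mean_uniform, se_uniform = snis_mean_and_se(values, uniform)
print(mean_skewed, se_skewed, mean_uniform, se_uniform, unweighted_se(values))
```

With equal weights the two formulas agree up to the factor sqrt((n - 1) / n); they separate only when the weights are skewed, which is the case the entry describes.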
Performance
- Greedy mixture reduction caches pairwise merge costs and re-costs only the pairs touching the merged component (previously all O(K^2) pairs were re-costed, each with its own Cholesky, after every single merge – the bottleneck inside long Gaussian-sum-filter runs).
- The log-sum-exp row maximum is computed by a vectorised pmax reduce rather than a per-row loop, on every E-step’s hot path.
Tests
- The shipped conformance case set now executes during test runs (off CRAN) through its built-in driver, so the two-sided contract sweep gates checks rather than living only in an external harness; the two remaining stubs gained negative cases.
- New independent oracles: closed-form Gaussian KL and quadrature for gmm_kld() at K = 1 (the package’s namesake divergence previously had only positivity checks), closed-form Gaussian Hellinger for hellinger_mc(), quadrature for the logit-link gated moments (the shipped default link previously had no moment-level oracle), and a probit-link end-to-end MNAR recovery.
- New metamorphic invariants: Renyi-2 affine equivariance (H(AX + b) = H(X) + log|det A|), marginal-vs-joint quadrature consistency, gmm_observe with a near-noiseless selection row against gmm_conditionalise, post-operator mass preservation, and an rgmm / dgmm Kolmogorov-Smirnov cross-check.
- The uplift Monte Carlo studies are skipped on CRAN and grade monotonicity with slack plus a rank correlation, instead of strict positivity of every adjacent difference from a local-optimum EM.
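The Renyi-2 equivariance invariant listed above can be checked by hand in one dimension, where log|det A| reduces to log|a|. This Python sketch is an independent illustration using the Gaussian-product integral; the tuple layout (weight, mean, variance) and function names are assumptions, not the package's test code.

```python
import math

def norm_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def renyi2_entropy(mix):
    """Order-2 Renyi entropy, -log of the integral of the squared density.

    For a mixture the integral is a finite double sum of Gaussian-product
    integrals N(m_i; m_j, v_i + v_j).
    """
    integral = sum(
        wi * wj * norm_pdf(mi, mj, vi + vj)
        for wi, mi, vi in mix
        for wj, mj, vj in mix
    )
    return -math.log(integral)

def affine(mix, a, b):
    """Pushforward of the mixture under y = a * x + b."""
    return [(w, a * m + b, a * a * v) for w, m, v in mix]

mix = [(0.3, -1.0, 0.5), (0.7, 2.0, 1.5)]
a, b = -3.0, 4.0
lhs = renyi2_entropy(affine(mix, a, b))
rhs = renyi2_entropy(mix) + math.log(abs(a))
print(lhs, rhs)
```

The shift b drops out entirely and the scale enters only through log|a|, which makes the identity a cheap metamorphic check that needs no reference value.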
proxymix 0.11.6
Bug fixes
- The regime-(iii) stopping rule is now invariant to the target’s normalising constant. fit_kld_em() previously judged convergence on the relative change of the importance-sampled KLD trace, whose magnitude carries the arbitrary -log Z(f) offset of an unnormalised target – so the same target, shifted by a constant, could stop after two iterations where the unshifted run used a hundred. Convergence is now judged on the importance-weighted EM objective Q(theta) = sum_n W_n log g(x_n), which never touches the constant. Iteration counts (and therefore fits that previously stopped early or late for this reason) can change; a regression test pins the invariance.
- gmm_eos_test() now resolves its model specification through the same machinery as gmm_filter(): dynamics and measurement offsets (b, d) are honoured (they were previously ignored without notice, silently changing the innovations), function-valued (time-varying) specifications are accepted, and Gaussian-sum (mixture) noise is rejected with an informative error since both calibrations assume Gaussian innovations.
- The internal log-sum-exp kernel now excludes NaN entries from the sum (previously they were neutralised only in the row maximum and could still poison the result) and returns +Inf for a row containing +Inf (previously -Inf, the exact inversion).
- The smooth-gate moment computation in gated (MNAR) imputation floors its normaliser exactly as the censored path does, so a component conditional sitting deep in the never-missing region can no longer send NaN through the M-step and abort the EM with an opaque Cholesky error.
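The edge-case behaviour described for the log-sum-exp kernel can be written down compactly. The Python sketch below mirrors the stated rules (NaN entries excluded from the sum, +Inf for a row containing +Inf); the return value for a row with no usable entries is an assumption of this example, and the function is not the package's kernel.

```python
import math

def log_sum_exp(row):
    """log(sum(exp(x))) over one row, with the edge cases handled explicitly."""
    usable = [x for x in row if not math.isnan(x)]  # NaN excluded from max and sum
    if not usable:
        return float("-inf")  # assumption: an empty sum is treated as log(0)
    top = max(usable)
    if top == float("inf"):
        return float("inf")   # a +Inf entry dominates the sum
    if top == float("-inf"):
        return float("-inf")  # every usable entry is log(0)
    return top + math.log(sum(math.exp(x - top) for x in usable))

nan = float("nan")
inf = float("inf")
print(log_sum_exp([0.0, 0.0]))          # log(2)
print(log_sum_exp([0.0, nan]))          # NaN ignored
print(log_sum_exp([1.0, inf]))          # +Inf
print(log_sum_exp([-1000.0, -1000.0]))  # no underflow
```

Subtracting the row maximum before exponentiating is what keeps the last case finite; the explicit branches exist because the shifted form is undefined when the maximum itself is infinite.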
Documentation
- Documentation, NEWS, and shipped validation scripts have been reworded to remove internal development vocabulary and references to unreleased companion packages; the posterior-producer probe fb_producer_available() now reads the producer package name from getOption("proxymix.producer_package") instead of hard-coding one.
- README gains a GitHub installation section, a “Why not MCMC?” positioning note with the dimensional limits of regime (iii) stated up front, a “Which function do I need?” routing table, and the full eleven-vignette list; the quick-start example now runs to convergence.
- The missing-data vignette’s scope section now reflects the mnar() / censored() mechanisms available since 0.11.0 and cross-references the companion vignette.
- inst/validation/contract-conformance.R is now self-contained: a minimal serial sweep driver ships in the file, so the two-sided case set is executable anywhere (an external driver can still be injected via PROXYMIX_CONFORMANCE_ENGINE).
- The pkgdown reference index gains the missing interoperability section and the articles index gains the MNAR vignette (both previously broke the site build).
proxymix 0.11.5
Fixes
- withr is now a hard dependency (Imports, previously Suggests). It is called on the core seeded-sampling paths (annealing, entropy estimation, importance-sampled KLD-EM fitting, mixture reduction, the gmm_target sample constructors, initialisation, and the seeded diagnostics), so a clean installation without withr would have failed at runtime. Declaring it as the runtime dependency it already is removes that failure mode.
Internal
- Metadata coherence: the _pkgdown.yml site url: now resolves to the package’s GitHub Pages host (max578.github.io/proxymix), and CITATION.cff version and date-released track the released version.
- The censored-imputation recovery check now grades the estimator on its mean absolute error across several simulated draws rather than a single draw. With more than half of the target column left-censored, finite-sample recovery carries a small expected positive bias, so one draw could legitimately sit further from the truth than a tight per-draw bound allowed; the comparison against the LOD/2 substitute is unchanged.
proxymix 0.11.4
Internal
- Ships the durable two-sided conformance case set in inst/validation/ (the authored validation cases an external conformance harness executes to measure coverage of the documented function surface). No change to package code; this gives the previously local-only validation material a tracked, travelling home.
proxymix 0.11.3
New features
- gmm_independence_graph() returns the undirected second-order conditional-independence (Gaussian graphical model) structure of a fitted mixture – the partial-correlation skeleton of its overall covariance, in closed form. Composed with fit_kld_em(), it recovers the dependency structure of a target that can only be evaluated (an unnormalised energy / Gibbs density), where no sample exists to drive a sampling-based estimator. It is a graphical-model diagnostic, not a causal-discovery method: it recovers the undirected Markov skeleton, not edge directions.
proxymix 0.11.2
Bug fixes
- gmm() now rejects a covariance that is not positive-definite. A component covariance with a negative variance (or, more generally, a negative eigenvalue beyond numerical tolerance) was previously accepted silently, deferring the failure to a later evaluation. The validator now checks the smallest eigenvalue of each finite covariance against a scale-relative tolerance, so a clearly indefinite matrix is caught at construction while a numerically near-singular fit still constructs.
proxymix 0.11.1
Bug fixes
- gmm_impute(mechanism = censored(...)) is now numerically stable when a mixture component sits well past the censoring bound. The fit could previously produce NA responsibilities and abort, because it initialised the missing entries from the observed mean – the wrong side of a one-sided bound – and the truncated-conditional moments could overshoot the interval. It now initialises inside the censored interval and clamps the truncated moments to it.
proxymix 0.11.0
New features
- Missing-not-at-random and censored imputation. gmm_impute() now takes a missingness mechanism. mnar() specifies a selection model in which an entry of a coordinate is missing with a probability that depends on its own unobserved value, fitted jointly with the mixture so the imputations are not biased by an ignorable model. The slope is a sensitivity parameter, supplied rather than estimated, and the intercept is calibrated to the observed missingness rate. censored() handles a known interval, such as a detection limit, by drawing from the mixture conditional truncated to that interval in closed form. mar() is the default and reproduces the earlier behaviour.
- proxy_mnar_sensitivity() sweeps the sensitivity slope over a grid and pools the coordinate mean at each value, the input to a tipping-point analysis.
proxymix 0.10.0
New features
- End-of-sample instability testing. gmm_eos_test() asks whether the last m observations of a series are consistent with a linear-Gaussian state-space model fitted on the rest, in the small-m regime (even m = 1) where ordinary structural-break tests are undefined because the post-break parameters cannot be estimated. It scores the last m standardised one-step filter innovations and calibrates the score either parametrically (a chi-square reference, exact under Gaussian innovations) or by the distribution-free subsampling test of Andrews (2003), which stays valid under heavy-tailed observation noise. A new vignette, Testing the last observation for instability, works through a local-level example.
proxymix 0.9.0
New features
- Multiple imputation by Gaussian-mixture conditioning. gmm_impute() fits a Gaussian mixture to a numeric dataset that contains missing values and draws m completed datasets from the mixture conditional p(x_missing | x_observed), the same Schur-complement algebra as gmm_conditionalise(). Because the mixture can be multimodal and heteroscedastic, the imputations follow the shape of the joint distribution, which keeps downstream inference valid on data a single-Gaussian or linear-Gaussian imputer mis-specifies. The mixture is fitted to the incomplete data by an expectation-maximisation that uses each row’s observed margin and restores the conditional covariance of the filled entries; each completion is drawn under a mixture fitted to a bootstrap resample of the rows, so the pooled inference reflects both imputation and parameter uncertainty. This release covers numeric data missing at random.
- Pooling, diagnostics, and mice interoperability. gmm_complete() extracts the completed datasets. proxy_pool() pools a column mean in closed form – the exact large-sample limit of the between-imputation variance, with no Monte-Carlo noise and an imputation / parameter variance split – and proxy_fmi() reports its fraction of missing information. For a regression or any other model estimand, as_mids() packages the completions as a mice object so the joint mixture imputations flow into mice::pool() unchanged: proxymix supplies the imputation model, mice the pooling. A new vignette, Imputing missing data with a mixture, works through a multimodal example.
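The conditional p(x_missing | x_observed) named above has a short closed form for a two-coordinate mixture. The Python sketch below is illustrative only: the component layout (weight, (m1, m2), (s11, s12, s22)) and the function names are assumptions, and the EM fit and bootstrap layers of the package are omitted.

```python
import math
import random

def log_norm_pdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def conditionalise(mix, x1):
    """Mixture law of x2 given x1 for components (w, (m1, m2), (s11, s12, s22))."""
    raw = []
    for w, (m1, m2), (s11, s12, s22) in mix:
        log_w = math.log(w) + log_norm_pdf(x1, m1, s11)  # re-weight by the observed margin
        mean = m2 + s12 / s11 * (x1 - m1)                # regression of x2 on x1
        var = s22 - s12 * s12 / s11                      # Schur complement
        raw.append((log_w, mean, var))
    top = max(lw for lw, _, _ in raw)
    norm = sum(math.exp(lw - top) for lw, _, _ in raw)
    return [(math.exp(lw - top) / norm, m, v) for lw, m, v in raw]

def draw(cond, rng):
    """One imputation draw: pick a component by weight, then sample from it."""
    u, acc = rng.random(), 0.0
    for w, m, v in cond:
        acc += w
        if u <= acc:
            return rng.gauss(m, math.sqrt(v))
    return rng.gauss(cond[-1][1], math.sqrt(cond[-1][2]))

# Illustrative two-regime joint with opposite correlations.
joint = [(0.5, (-2.0, -2.0), (1.0, 0.8, 1.0)), (0.5, (2.0, 2.0), (1.0, -0.8, 1.0))]
cond = conditionalise(joint, x1=1.5)
imputations = [draw(cond, random.Random(seed)) for seed in range(5)]
print(cond, imputations)
```

Because the component weights are re-evaluated at the observed value, the conditional can switch regime as x1 moves, which a single-Gaussian imputer cannot do.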
proxymix 0.8.0
New features
- A mixture proxy for the optima of an objective. from_objective() fits a Gaussian-mixture proxy to the Gibbs measure exp(-f(x) / T) of a user-supplied objective f over a bounded box, by cooling a short temperature ladder through regime-(iii) KLD-EM (fit_kld_em()). The Gibbs measure can be evaluated point-wise but not directly sampled, so this is regime (iii) applied to an objective: the returned mixture is a closed-form map over the low regions of f, and a multimodal objective is recovered as a whole rather than one optimum at a time. A new vignette, Mapping the optima of an objective, works through the bimodal and Himmelblau cases.
- Modes of a Gaussian mixture. gmm_modes() returns the distinct local modes of a mixture density by Gaussian mean-shift from each component mean (Carreira-Perpinan 2000), with the mixture density at each mode. It resolves the map from from_objective() into the recovered optima and applies to any gmm or gmm_fit.
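The mean-shift iteration mentioned above is a fixed-point update on the stationarity condition of the mixture density. The one-dimensional Python sketch below illustrates the idea; the names, tolerances and merge rule are assumptions of this example rather than the package's choices.

```python
import math

def norm_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mean_shift(mix, start, tol=1e-10, max_iter=1000):
    """Iterate x <- sum(r_k m_k / v_k) / sum(r_k / v_k), r_k = w_k N(x; m_k, v_k)."""
    x = start
    for _ in range(max_iter):
        r = [w * norm_pdf(x, m, v) / v for w, m, v in mix]
        new_x = sum(ri * m for ri, (_, m, _) in zip(r, mix)) / sum(r)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

def modes_1d(mix, merge_tol=1e-4):
    """Distinct modes reached by mean-shift started from every component mean."""
    modes = []
    for _, m, _ in mix:
        mode = mean_shift(mix, m)
        if all(abs(mode - other) > merge_tol for other in modes):
            modes.append(mode)
    return sorted(modes)

separated = [(0.5, -3.0, 1.0), (0.5, 3.0, 1.0)]   # two well-separated bumps
overlapping = [(0.5, -0.5, 1.0), (0.5, 0.5, 1.0)]  # two components, one bump
print(modes_1d(separated), modes_1d(overlapping))
```

The second mixture shows why modes are not the same thing as component means: two overlapping components can share a single mode between them.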
proxymix 0.7.0
New features
- Closed-form entropy diagnostics. gmm_entropy() returns the closed-form quadratic (order-2) Renyi entropy of a Gaussian mixture, or a Monte-Carlo Shannon estimate bracketed by an analytic upper bound. gmm_divergence() returns the closed-form, symmetric Cauchy-Schwarz divergence between two mixtures (type = "cs"), with type = "kl" delegating to gmm_kld(). Both rest on the Gaussian-product integral, so the closed-form quantities are exact finite sums of Gaussian-density evaluations. A new vignette, Entropy diagnostics with proxymix, demonstrates the layer.
- Mutual information and conditional predictive entropy. gmm_mutual_information() returns the closed-form, non-negative Cauchy-Schwarz mutual information between two coordinate blocks of a fitted joint (the divergence between the joint and the product of the marginals; zero exactly under independence). gmm_conditional_entropy() returns the order-2 Renyi entropy of the conditional mixture from gmm_conditionalise() – the predictive uncertainty of the target coordinates given the conditioned ones, evaluated row-by-row.
- Deterministic-annealing fitting and phase-transition component discovery. fit_em_samples() and fit_kld_em() gain an opt-in anneal = TRUE argument that locates the mixture components by deterministic annealing – cooling a temperature from a high value toward one – before the unchanged cold EM loop polishes the fit. The annealed warm-start is markedly less sensitive to local optima than a cold multi-start. The companion diagnostic gmm_anneal_path() tracks the number of distinct centroids as the temperature falls, a physics-derived component-count read whose first phase transition has the closed-form critical temperature lambda_max(Sigma^{-1} C).
- Maximum-entropy targets and the ICL criterion. maxent_target() constructs the least-committal gmm_target consistent with the supplied constraints: the Gaussian under second-moment constraints on full support, the uniform under a support constraint alone, and a truncated Gaussian under second-moment constraints on a box (fit via regime (iii) under the automatically selected support-matched proposal). bic_aic() now also returns the integrated completed likelihood icl and the classification_entropy of the fitted responsibilities (Biernacki, Celeux and Govaert 2000), which penalises overlapping components and equals the BIC for a single component.
- Kernel regression in the unifying-primitive vignette. The One mixture, many methods vignette now covers Nadaraya-Watson kernel regression as the nonparametric end of the conditioning axis: the conditional mean of a one-component-per-datum kernel density estimate equals the Nadaraya-Watson estimator exactly, so a single conditioning operation spans ordinary least squares (K = 1) through fully-local kernel smoothing (K = n).
- Kalman filtering over time in the operator-calculus vignette. The Affine-Gaussian operator calculus vignette now shows the predict (gmm_affine) and update (gmm_observe) operators run as a filter over a time series: at one component the recursion is exactly the classical Kalman filter (verified against a textbook implementation on a constant-velocity track), and at several components it is the Gaussian-sum filter.
- Mixture reduction. gmm_reduce() collapses a Gaussian mixture to a budget of at most k_max components. The default method = "merge" is a greedy, moment-preserving pairwise merge, using either the Runnalls (2007) Kullback-Leibler bound (cost = "kl") or a closed-form Cauchy-Schwarz cost (cost = "cs"); every merge preserves the combined weight, mean and covariance, so the reduced mixture has the same global mean and covariance as the original, and reducing to one component returns the moment-matched Gaussian. method = "anneal" refines the merge with an annealed re-fit of a budget-sized proxy and keeps it when it improves on the merge (never worse), which helps for smooth, over-parameterised mixtures. Reduction bounds the component count of a Gaussian-sum filter built from gmm_affine() and gmm_observe().
- Bounded Gaussian-sum filtering. gmm_filter() runs a filter over an observation series by alternating the predict operator (gmm_affine()), the update operator (gmm_observe()) and an optional reduction (gmm_reduce()). At one component it is the Kalman filter (verified against a textbook implementation); with a Gaussian-sum process or measurement noise – a gmm supplied in place of a covariance matrix – it is the Gaussian-sum filter of Alspach and Sorenson (1972), and the k_max cap holds the component count at budget over a long horizon. It returns the filtered mixture at each step, the filtered means and covariances, and a tidy per-step summary including the log marginal evidence. Constant and time-varying dynamics and measurements are both supported. A new section of the Affine-Gaussian operator calculus vignette demonstrates the verb.
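The moment-preserving merge used in the reduction entry above can be verified in a few lines. The Python sketch below works in one dimension with (weight, mean, variance) tuples; it illustrates the merge step only, with hypothetical names, and does not implement the Runnalls or Cauchy-Schwarz merge costs.

```python
def merge_pair(c1, c2):
    """Merge two components, preserving their combined weight, mean and variance."""
    (w1, m1, v1), (w2, m2, v2) = c1, c2
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    # Within-component variance plus the spread of the two means about m.
    v = (w1 * (v1 + (m1 - m) ** 2) + w2 * (v2 + (m2 - m) ** 2)) / w
    return (w, m, v)

def global_moments(mix):
    """Overall mean and variance of a one-dimensional mixture."""
    mean = sum(w * m for w, m, _ in mix)
    var = sum(w * (v + (m - mean) ** 2) for w, m, v in mix)
    return mean, var

mix = [(0.2, -1.0, 0.5), (0.3, 0.5, 1.0), (0.5, 4.0, 2.0)]
reduced = [merge_pair(mix[0], mix[1]), mix[2]]   # merge the two closest components
single = [merge_pair(reduced[0], reduced[1])]    # reduce all the way to one

print(global_moments(mix), global_moments(reduced), global_moments(single))
```

Reducing to a single component returns the moment-matched Gaussian, so its mean and variance equal the global moments of the original mixture.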
proxymix 0.6.0
New features
- New: a closed-form decision layer (uplift / next-best-action). One joint Gaussian-mixture proxy over (outcome, treatment, covariates) is read – in closed form, from that single fit – as prediction, heterogeneous treatment effects, optimal per-unit actions, off-line policy value, and an identification audit. fit_uplift() assembles the joint fit and returns an uplift_model; the proxy_* verbs score it without re-fitting: proxy_predict() (the response / risk-scoring rung), proxy_cate() / proxy_uplift() (heterogeneous effects with a delta-method or resampling standard error), proxy_decide() (the revenue-maximising action plus an action-flip probability), proxy_policy_value() (off-line value of a targeting policy), proxy_confounding_gap() (the sensitivity to a latent confounder), proxy_retrospective_uplift() (counterfactual-mean uplift for observed units), proxy_regime_segments() (the fitted regimes as a segment table), proxy_overlap() (per-unit positivity / mass coverage), and proxy_identification_report() (the executive one-pager). The decision layer rides only identified quantities – the conditional average treatment effect and counterfactual means.
- New: the do-operator and the counterfactual as first-class operators. gmm_intervene() returns the interventional law p(. | do(.), .) – it sets the intervened coordinates inside every component without re-weighting the regime gate (the graph surgery that distinguishes do(T = t) from T = t). gmm_counterfactual() returns the K-atom counterfactual law of one observed unit by abduction, action, and prediction. Only the counterfactual mean is identified: the new gmm_counterfactual_law object exposes gmm_cf_mean(), while gmm_cf_variance() and gmm_cf_tail_prob() deliberately error (proxymix_not_identified) – the individual counterfactual law depends on an unidentified cross-world coupling.
- New: binary outcomes via latent-scale fitting with a discretised predictive. fit_uplift(outcome_type = "binary") fits on the latent continuous scale; proxy_cate(scale = "response") and proxy_predict(scale = "response") report effects and predictions on the discretised predictive probability P(Y > threshold). Count outcomes are supported on the same latent-scale reading.
Documentation
- New vignette One mixture, many methods – using one fitted Gaussian mixture in place of regression (lm), clustering (kmeans / mclust), principal components (prcomp), and ridge regression, with the trade-off of each substitution stated. At N = 1 the conditional mean is exactly least squares, the covariance eigenvectors are exactly the principal components, and a covariance ridge is exactly the L2 penalty.
- Vignette figures refreshed to publication quality – proper mathematical subscripts and Greek symbols in titles and axis labels, shared contour levels, and a legend on the KDE-vs-proxy comparison.
Internal and tests
- New data.table import for the decision-verb return tables.
- New test files lock the operators against hand-built linear structural causal models (test-intervene.R), the K = 1 reduction of proxy_cate() to the ordinary-least-squares treatment coefficient and the asymptotic agreement of the delta standard error with the lm coefficient standard error (test-uplift-cate.R), the audit verbs including recovery of a planted latent confounder (test-uplift-audit.R), and a six-process synthetic validation battery graded against known ground truth (test-uplift-validation.R).
proxymix 0.5.0
New features
- New: support-aware importance proposals for bounded and one-sided targets. gmm_target() gains an optional support argument (list(lower = , upper = ), with -Inf / Inf for unbounded coordinates). When a regime-(iii) fit is given such a target and no explicit proposal, fit_kld_em() now selects a support-matched is_uniform() proposal automatically – inset inside a compact box, or data-derived for a one-sided coordinate – instead of the default multivariate-t, which placed importance mass where the target log-density is -Inf and produced non-finite weights. The automatic choice is announced with a one-line message, never silently. Unbounded targets are unaffected: they keep the heavy-tailed default.
- New: epanechnikov_target(). A compact-support fixture (the Epanechnikov kernel (3/4)(1 - u^2) on a box) joining banana_target() / donut_target() / mixture_target(). It declares its support, so it fits via regime (iii) under the auto-selected uniform proposal with no NaN weights – the canonical case where no mixture of full-support Gaussians can have compact support.
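Why a support-matched proposal avoids non-finite weights can be shown on the one-dimensional Epanechnikov kernel. The Python sketch below is a hedged illustration: the inset value, the helper names and the use of a plain importance-sampling average are assumptions of the example, not the package's proposal-selection logic.

```python
import math
import random

def log_epanechnikov(u):
    """Log of (3/4)(1 - u^2) on (-1, 1); -inf outside the support."""
    if abs(u) >= 1.0:
        return float("-inf")
    return math.log(0.75 * (1.0 - u * u))

def uniform_proposal(lower, upper, inset):
    """Uniform proposal on a box shrunk by `inset` at each end (hypothetical helper)."""
    lo, hi = lower + inset, upper - inset
    log_q = -math.log(hi - lo)
    return (lambda rng: rng.uniform(lo, hi)), log_q

rng = random.Random(42)
draw, log_q = uniform_proposal(-1.0, 1.0, inset=1e-6)

# Every draw lies strictly inside the support, so every log weight is finite.
log_weights = [log_epanechnikov(draw(rng)) - log_q for _ in range(20000)]
normaliser = sum(math.exp(lw) for lw in log_weights) / len(log_weights)
print(normaliser)  # Monte Carlo estimate of the kernel's integral, close to 1
```

A full-support proposal such as a multivariate-t would place some draws where the log-density is -inf; the inset box keeps every draw where the target is strictly positive.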
Internal and tests
- inst/validation/regime_iii_pinned_fits.R gains a pinned Epanechnikov bounded-support fit (ESS, support fraction, no NaN weights).
- New regression tests lock the no-NaN-weight guarantee on compact and one-sided targets (test-support-aware-proposal.R), the exact K = 1 conditional-mean and conditional-variance match against lm (test-gmr-k1-lm.R), and class / constructor / diagnostic contract branches. Line coverage raised to >= 90%.
proxymix 0.4.0
New features
New: a consumer seam for external Bayesian posteriors. from_fb_posterior() takes a posterior addressed only through its (unnormalised) log-density and returns a closed-form Gaussian-mixture proxy via importance-sampled KLD-EM, generalising the input source from a kernel-density estimate (from_kde()) to any Bayesian posterior. The producer interface the seam expects is materialised and documented by fb_log_posterior_spec(); fb_producer_available() is a capability probe (degrades to FALSE, never errors, when no producer package is installed); and mock_fb_posterior() is a synthetic producer (known Gaussian or banana log-density) for testing the path with no producer package present. proxymix never Imports: a producer package – the seam is a soft contract and R CMD check is clean with none installed.

New: autoplot() method for gmm_fit. Render a fitted proxy with ggplot2::autoplot(fit): a marginal density curve in one dimension, or a viridis density raster with per-component ellipses in two. Any ambient dimension is supported: the requested coordinates are reduced through the package’s own closed-form gmm_marginalise() before plotting (e.g. autoplot(fit, dims = c(1L, 3L))). ggplot2 stays an optional dependency – the method registers only when ggplot2 is installed, so R CMD check remains clean with no sibling package present.
proxymix 0.3.0 (2026-05-14)
Second methodological extension. Brings a complete affine-Gaussian operator calculus to Gaussian-mixture proxies — pushforward, Bayesian update on a noisy linear observation, aggregation, missing-data conditioning — each closed-form and component-wise.
User-visible changes
- New: gmm_affine(g, A, b, noise_cov) – closed-form pushforward of a Gaussian mixture through y = A x + b + epsilon, epsilon ~ N(0, noise_cov). Returns the mixture in R^m with mu'_k = A mu_k + b, Sigma'_k = A Sigma_k A' + noise_cov, weights unchanged.
- New: gmm_observe(g, A, y, noise_cov) – Bayesian update on a noisy linear observation. Applies the Kalman gain per component and reweights component weights by per-component marginal evidence. The finite-mixture analogue of a Kalman update.
- New: gmm_aggregate(g, A, noise_cov) – named alias for gmm_affine() aimed at downscaling / aggregation pipelines.
- New: gmm_missing(g, observed, values) – Schur-complement conditioning routed through an integer-index API for missing-data pipelines.
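The per-component Kalman update and evidence reweighting that gmm_observe() is described as performing can be sketched in base R. This is an illustrative sketch on plain vectors and lists, not the package's implementation: observe_sketch and its argument layout (weights w, list of means mu, list of covariances Sigma) are hypothetical, and the numerical-hygiene steps the package applies are omitted.

```r
# Hypothetical sketch of a Gaussian-mixture update on y = A x + eps,
# eps ~ N(0, noise_cov). Not the proxymix API.
observe_sketch <- function(w, mu, Sigma, A, y, noise_cov) {
  K <- length(w)
  log_ev <- numeric(K)
  for (k in seq_len(K)) {
    S <- A %*% Sigma[[k]] %*% t(A) + noise_cov      # innovation covariance
    r <- y - drop(A %*% mu[[k]])                    # innovation
    gain <- Sigma[[k]] %*% t(A) %*% solve(S)        # Kalman gain
    # log N(y; A mu_k, S): marginal evidence of component k
    log_ev[k] <- -0.5 * (length(y) * log(2 * pi) +
      as.numeric(determinant(S, logarithm = TRUE)$modulus) +
      drop(t(r) %*% solve(S, r)))
    mu[[k]] <- mu[[k]] + drop(gain %*% r)           # posterior mean
    Sigma[[k]] <- Sigma[[k]] - gain %*% A %*% Sigma[[k]]  # posterior covariance
  }
  lw <- log(w) + log_ev                             # reweight by evidence
  w_new <- exp(lw - max(lw))                        # stabilised exponentiation
  list(w = w_new / sum(w_new), mu = mu, Sigma = Sigma)
}

# One component, prior N(0, 1), unit noise, observation y = 2.
out <- observe_sketch(1, list(0), list(matrix(1)), matrix(1), 2, matrix(1))
```

With a single component this reduces to the ordinary Kalman update. With several, the weights move toward the components that explain y best.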
Design and validation
- docs/design/operator_calculus_v0.3.md – pre-implementation design note: maths, scope constraints (no non-affine fallbacks, no approximate closed form), public API freeze, test obligations, performance budget, pre-release gate.
- vignettes/operator_calculus.Rmd – educational vignette with Kalman parity check, sequential vs stacked observations, aggregation through a coarsening matrix, and a comparison to a Gaussian-process latent.
- inst/validation/operator_calculus_pinned.R – three pinned reference pipelines (Kalman parity, sequential vs stacked, aggregate-then-observe) with hand-coded acceptance ranges.
Tests
- test-operator-calculus.R (12 tests, 46 expectations): A0–A2 (affine of moments), O0–O2 (Kalman parity, vanishing-evidence guard, Bayes consistency), G0 (aggregate alias), M0 (missing vs conditionalise), C0 (composition with marginalise), plus full input-validation coverage.
Internal
- R/operator_calculus.R consolidates the four operators with shared validation helpers (.validate_A, .validate_b, .validate_noise_cov) and a single numerical-hygiene policy (ridge after each output covariance, symmetrisation, chol-based inverse with retry on near-singular matrices).
- gmm_observe() issues a proxymix_observe_no_update warning when the marginal evidence is numerically zero at every component and returns the prior unchanged with a metadata flag.
proxymix 0.2.0 (2026-05-14)
Two methodological extensions that compose cleanly with regime (iii):
User-visible changes
- New: from_kde(). Compiles a kernel density estimate over an n by p sample matrix into a closed-form Gaussian-mixture proxy via regime-(iii) KLD-EM. Supports scalar and diagonal bandwidths ("silverman", "scott", numeric scalar, or per-coordinate numeric vector). Dimensional guard: p <= 5 recommended, p <= 10 allowed with warning, p > 10 rejected. The KDE-target is normalised = TRUE by construction, so downstream KLD and Hellinger diagnostics report absolute values. Companion vignette: vignettes/from_kde.Rmd.
- New: gmm_target_from_posterior() (Contract A constructor). S3 generic that compiles an (unnormalised) Bayesian posterior into a gmm_target. The function method accepts a bare vectorised callable with required parameter_names; the default method points users at either a registered Bayesian-package method (brms, Stan, …) or the function-based path. Vectorisation contract is enforced at construction by a probe call.
- URL and BugReports. DESCRIPTION now ships the canonical GitHub namespace at github.com/max578/proxymix.
Tests
- test-from-kde.R (8 tests, 25 expectations): end-to-end recovery, bandwidth selection branches, dimensional guards, normalisation declaration, default proposal sanity, metadata pass-through.
- test-from-posterior.R (7 tests, 21 expectations): vectorisation contract enforcement, log-normalizer pass-through, default-method hinting, name validation, round-trip through fit_proxymix(regime = "kld"), attribute-based parameter-name support.
- inst/validation/from_kde_pinned_fits.R: pinned validation across three reference KDE -> GMM pipelines (bimodal, banana, mixture) with MC-SE-aware acceptance ranges.
Documentation
- vignettes/from_kde.Rmd: educational walk-through covering scope, bandwidth sensitivity, recovery on a known mixture, and the contrast between KDE and proxy log-densities.
Internal
- gmm_target_from_posterior registers an S3 generic, paving the way for class-specific methods registered by Bayesian fitting packages (brms, Stan, pymc-via-reticulate) without coupling proxymix to any specific Bayesian backend.
- From-KDE log-density evaluation uses chunked matrix builds so that peak memory stays bounded for large IS samples.
proxymix 0.1.1 (2026-05-14)
Scientific hardening pass: regime (iii) is made harder to misuse.
User-visible changes
- Normalisation-aware targets. gmm_target gains two new properties, normalised (logical or NA) and log_normalizer (numeric or NA), so that an unnormalised log_density can be supplied without making downstream KLD or Hellinger diagnostics misleading. All three built-in targets (banana_target(), donut_target(), mixture_target()) declare normalised = TRUE; the unnormalised case is now explicitly documented at the target level.
- Canonical component ordering. A new gmm_canonicalise() function reorders the components of a gmm (or gmm_fit) by descending weight, then by descending ||mu|| as a tiebreaker. fit_proxymix() and the regime-specific fitters now canonicalise their outputs by default (canonicalise = TRUE), making prints, snapshot tests, and cross-run comparisons reproducible. Set canonicalise = FALSE to retain the raw EM-order parameters.
- Held-out importance-sample validation. fit_kld_em() (and therefore fit_proxymix(regime = "kld")) accepts validation_size and validation_proposal. When validation_size > 0, a second independent IS sample is drawn and the fit’s diagnostics list records validation_kld, validation_ess, and validation_max_weight. This lets users tell the difference between in-sample overfit and a fit that generalises across IS draws.
- Richer IS diagnostics. fit_kld_em() now records ess_relative (ESS / is_size), max_weight (largest self-normalised weight), support_fraction (fraction of IS draws with finite log-density under target and proposal), and a Monte-Carlo standard error for the final KLD estimate (mc_se_kld). A new ess_summary() helper returns the headline numbers as a small list.
- Shifted-KLD labelling. Diagnostics now record kld_is_shifted and kld_shift_explanation whenever the target is unnormalised or its normalisation is unknown, so users do not silently read a shifted MC integral as an absolute divergence.
- Hellinger guard. hellinger_mc() now warns when the target is not declared normalised = TRUE – the squared Hellinger distance is not meaningful against an unnormalised target.
- Proposal-support warning. fit_kld_em() issues a cli warning when more than 5% of importance-sample draws fall outside the proposal’s support or carry non-finite weights. The most common trigger is an is_uniform() proposal whose box does not cover the target’s mass.
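The importance-sampling diagnostics named above (ess_relative, max_weight, support_fraction) can be illustrated from a vector of log importance weights. The sketch below is a stand-alone approximation. is_diagnostics_sketch is a hypothetical name. The Kish formula 1 / sum(w^2) is an assumed definition of ESS, since the changelog does not state which estimator the package uses. A finite log-weight stands in for "finite log-density under target and proposal".

```r
# Hypothetical sketch: headline IS diagnostics from log importance weights
# (log target density minus log proposal density, one entry per draw).
is_diagnostics_sketch <- function(log_w) {
  finite <- is.finite(log_w)               # draws with usable weights
  lw <- log_w[finite]
  w <- exp(lw - max(lw))                   # stabilise before exponentiating
  w <- w / sum(w)                          # self-normalised weights
  ess <- 1 / sum(w^2)                      # Kish effective sample size (assumed)
  list(
    ess = ess,
    ess_relative = ess / length(log_w),    # ESS divided by the IS sample size
    max_weight = max(w),                   # largest self-normalised weight
    support_fraction = mean(finite)        # share of draws with finite weight
  )
}

# Four draws, one of which fell where the target log-density is -Inf.
diag <- is_diagnostics_sketch(c(0, 0, -Inf, 0))
```

In this toy case a quarter of the draws are wasted, which shows up as support_fraction = 0.75 and is well above the 5% threshold the proposal-support warning uses.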
Validation corpus
- inst/validation/regime_iii_pinned_fits.R – a runnable validation script that fits the three built-in targets with pinned seeds and records final KLD, ESS, max weight, validation KLD, and runtime; intended as the seed of a growing inst/validation/ corpus.
Tests
- New: test-canonicalise.R, test-normalisation.R, test-validation-split.R, test-support-warning.R, and test-monotone-objective.R. The last asserts monotonicity of the fixed IS-weighted objective (sum_n W_n g(x_n)) under exact KLD-EM updates, which is a tighter check than the previous generic “trace decreases” test.
Internal
- gmm_canonicalise() is the single source of truth for component ordering – used by all three fitters and the dispatcher.
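The ordering rule stated for gmm_canonicalise() (descending weight, then descending ||mu|| as a tiebreaker) amounts to a two-key sort. A minimal sketch, assuming the Euclidean norm and a plain weights-plus-list-of-means representation; canonical_order_sketch is a hypothetical helper, not the package function.

```r
# Hypothetical sketch: permutation that sorts components by descending weight,
# breaking ties by descending Euclidean norm of the component mean.
canonical_order_sketch <- function(w, mu) {
  norms <- vapply(mu, function(m) sqrt(sum(m^2)), numeric(1))
  order(-w, -norms)
}

w <- c(0.2, 0.4, 0.4)
mu <- list(c(0, 0), c(1, 0), c(3, 4))
ord <- canonical_order_sketch(w, mu)   # components 2 and 3 tie on weight
w[ord]
```

Applying the same permutation to the weights, means, and covariances keeps the mixture unchanged as a density while making its printed form reproducible across runs.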
proxymix 0.0.1 (2026-05-13)
Initial development release. Local-only; not yet on CRAN.
Implemented
- fit_proxymix() top-level dispatcher with three fitting regimes: "moment" (closed-form moment matching), "sample" (classical EM on i.i.d. samples), and "kld" (importance-sampled KLD-EM against an evaluable-only target density). The "auto" regime picks the cheapest applicable regime from the structure of the supplied gmm_target.
- S7 class hierarchy: gmm_target, gmm_fit, is_proposal, with print() / format() methods and validators.
- Closed-form GMM operators in gmm_ops.R: dgmm(), rgmm(), gmm_marginalise(), gmm_conditionalise() (Schur complement), gmm_kld() (Monte Carlo estimator with variational upper / lower bounds for sanity).
- Importance-sampling proposals in proposals.R: is_uniform(), is_mvn(), is_mvt(); all wrap an is_proposal instance.
- Diagnostics: kld_trace(), ess_trace(), hellinger_mc(), bic_aic().
- Multi-start best-of (Karlis & Xekalaki) initialisation in init.R, plus init_random(), init_kmeans(), init_moment_seed().
- Built-in target factories used in the vignettes: banana_target(), donut_target(), mixture_target(), plus the from-samples and from-function constructors gmm_target_from_samples() and gmm_target().
- Four vignettes: quickstart, three_regimes, density_shapes (the regime-(iii) demonstration), and roadmap (planned interfaces).
Provisioned stubs
The following functions ship with stable signatures, full documentation, and signature-stability tests; the body raises a “not yet implemented” condition with a pointer to vignettes/roadmap.Rmd.
- from_kde() – KDE to GMM proxy via KLD-EM.
- from_aggregate_likelihood() – aggregate-likelihood downscaling (Sejdinovic et al. kernel-downsizing framework).
- fit_kld_em_collider() – KLD-EM under DAG-implied conditional independence constraints.
- to_apsim_scenarios() – Gaussian-mixture samples to APSIM scenario tables.
- from_simulator() – wrap an expensive simulator as a gmm_target via kernel-density or empirical-likelihood bridges.