Skip to contents

[Experimental]

Companion to janusplot() returning the raw list of GAM fits plus per-cell metrics (EDF, F-test p-value, deviance explained, asymmetry index, pairwise correlations, shape descriptors) without constructing the ggplot. Useful for custom rendering or downstream analysis.

Usage

janusplot_data(
  data,
  vars = NULL,
  adjust = NULL,
  method = NULL,
  k = -1L,
  bs = "tp",
  na_action = c("pairwise", "complete"),
  parallel = FALSE,
  keep_fits = FALSE,
  derivatives = integer(),
  derivative_ci = c("pointwise", "none", "simultaneous"),
  derivative_ci_nsim = 1000L,
  n_grid = NULL,
  shape_cutoffs = janusplot_shape_cutoffs(),
  k_check_thresholds = NULL,
  auto_refit_k = FALSE,
  k_max_iter = 2L,
  engine = c("bam", "gam"),
  discrete = FALSE,
  nthreads = 1L,
  ...
)

Arguments

data

A data frame with numeric columns to include.

vars

Character vector of column names to use. NULL (default) uses all numeric columns in data. Non-numeric columns trigger an error listing offenders.

adjust

A one-sided formula RHS giving additional covariates and/or random effects to include in every pairwise GAM. For example, adjust = ~ s(age) + s(site, bs = "re") fits gam(y ~ s(x) + s(age) + s(site, bs = "re")) for each pair. Default NULL fits unadjusted pairwise smooths.

method

Smoothing-parameter estimation method passed to the chosen fitting backend. Default NULL resolves per-engine: "fREML" for engine = "bam" (mgcv's recommended at scale), "REML" for engine = "gam" (the v0.1.0 behaviour). Pass any valid mgcv method string to override.

k

Integer, or named list mapping variable names to integers. Basis dimension for s(). Default -1L (mgcv's automatic choice).

bs

Basis type for s(). Default "tp" (thin plate).

na_action

One of "pairwise" (default; per-cell complete observations) or "complete" (listwise; all cells use the same rows).

parallel

Logical. If TRUE, use future.apply::future_mapply() to fit pairs in parallel. Requires the future.apply package and a user-configured future::plan(). Default FALSE.

keep_fits

Logical. If TRUE, retain full mgcv::gam() model objects in the return (large memory footprint for k above ~15). Default FALSE — retains summary metrics and prediction grids only.

derivatives

Integer vector of derivative orders to compute on every pair (subset of 1:2). Default integer() — no derivatives. Unlike janusplot(), the data companion can return multiple orders from a single call for programmatic analysis; pass c(1L, 2L) to surface both.

derivative_ci

One of "none" (default), "pointwise", or "simultaneous". Controls whether — and how — a 95% confidence ribbon is drawn underneath the derivative curve when display %in% c("d1", "d2"). Ignored when display = "fit".

  • "none" — no ribbon. The curve and the zero reference line are all you see. Default, because pointwise ribbons overshoot nominal coverage as a joint region and can invite over-reading of local features.

  • "pointwise" — 95% pointwise ribbon from \(\sqrt{\mathrm{diag}(D V_p D^\top)}\) (Wood 2017 §7.2.4). Valid marginally; not a simultaneous statement.

  • "simultaneous" — 95% simultaneous band via the Monte Carlo construction of Ruppert, Wand & Carroll (2003) popularised for GAMs by Simpson (2018, Frontiers Ecol. Evol. 6:149): draw \(B\) samples \(\tilde{\boldsymbol\beta} \sim \mathcal{N}(\hat{\boldsymbol\beta}, V_p)\), compute \(\max_x |D_i(\tilde{\boldsymbol\beta} - \hat{\boldsymbol\beta})| / \mathrm{se}_i\), and use the \((1-\alpha)\) quantile as a critical multiplier on the pointwise SE. Valid for feature localisation ("where is \(\hat f'(x)\) significantly non-zero").

derivative_ci_nsim

Integer. Number of Monte Carlo samples used when derivative_ci = "simultaneous". Default 1000L — a compromise between coverage accuracy (Simpson 2018 uses 10000) and CPU budget across every pair in a medium-sized matrix. Ignored for any other derivative_ci.

n_grid

Integer or NULL. Number of equally-spaced points used to evaluate each fitted smooth (and its derivatives). Default NULL resolves to 100 when display = "fit" and 200 otherwise, because finite-difference second derivatives visibly degrade below \(\sim 150\) points on moderate-k smooths. Supplying n_grid directly overrides both defaults. Larger grids shift the numerical shape-metric values (\(M\), \(C\), turning / inflection counts) slightly because they are computed on this same grid. Shapes and asymmetry are the primary reading; M, C and the counts are secondary diagnostics and the grid-induced drift is tolerable.

shape_cutoffs

Named list of classification thresholds used to map the continuous shape indices (monotonicity_index, convexity_index) into discrete shape_category labels. Defaults from janusplot_shape_cutoffs().

k_check_thresholds

Named list giving the three flag-criterion thresholds used by mgcv::k.check()-style basis-dimension diagnostics. Required entries: edf_ratio (Wood's \(\widehat{\mathrm{edf}}/k'\) ratio above which the smooth is too close to its basis cap), k_index (residual-difference variance ratio below which the basis appears underspecified), and p (the simulation p-value below which the basis-deficiency signal is significant). Defaults — edf_ratio = 0.9, k_index = 1.0, p = 0.05 — track mgcv::gam.check() and Wood (2017) §5.9.

auto_refit_k

Logical. If TRUE, every cell whose Wood trifecta flags an underfit is refit with a doubling-k loop until either the flag clears, the per-cell unique-x cap is reached, or k_max_iter iterations have passed. Default FALSE — the diagnostic (k_check_status, k_flag, k_prime, k_index, k_p) is always computed and surfaced regardless of this flag, but the refit is opt-in because it can multiply wall time on pathological data.

k_max_iter

Integer. Maximum number of doublings allowed per cell when auto_refit_k = TRUE. Default 2L (so a cell that starts at the mgcv default k = 10 will visit at most k = 20 and then k = 40, capped by the per-cell unique-x limit). Ignored when auto_refit_k = FALSE.

engine

One of "bam" (default, new in v0.1.1) or "gam". Selects mgcv's fitting backend:

  • "bam"mgcv::bam(). Block-Lanczos solve + fREML estimation + lower memory. ~3-10x speedup at janusplot's scale (k = 15-25 vars, 600+ pairwise fits per call). The default, and the one non-byte-identical change in v0.1.1: fREML differs from REML by ~1-3% in EDF on identical data, so the asymmetry index may shift by similar amounts vs v0.1.0 output. Recoverable verbatim via engine = "gam".

  • "gam"mgcv::gam(). The v0.1.0 backend. Use for backward-compat reproduction, very small n (< 200) where bam's setup overhead exceeds its solve gain, or methodologically sensitive contexts that require REML rather than fREML.

discrete

Logical. bam-only opt-in to mgcv's covariate-discretisation optimisation. Further ~2-5x speedup at the cost of small (sub-pixel at typical janusplot resolution) prediction-shift. Default FALSE. Ignored when engine = "gam".

nthreads

Integer. bam-only intra-fit threading. Default 1L to avoid oversubscription when combined with parallel = TRUE (future.apply fans out pair fits across cores; nthreads > 1 within each pair would double-book CPUs). Raise above 1 only when parallel = FALSE. Ignored when engine = "gam".

...

Additional arguments passed to mgcv::gam().

Value

A list with components:

vars

Character vector of variables used, in plotted order.

pairs

List of per-pair results. Each element has i, j, var_i, var_j, fit_yx, fit_xy (NULL if keep_fits = FALSE), pred_yx, pred_xy (data frames with x, fit, se, lo, hi), edf_yx, edf_xy, pvalue_yx, pvalue_xy, dev_exp_yx, dev_exp_xy, n_used, asymmetry_index, plus Pearson / Spearman / Kendall correlations (cor_pearson, cor_spearman, cor_kendall), the maximum tie ratio across x and y (tie_ratio), and per-direction shape descriptors (monotonicity_index_yx, convexity_index_yx, monotonicity_index_xy, convexity_index_xy, n_turning_yx, n_inflect_yx, n_turning_xy, n_inflect_xy, shape_yx, shape_xy). When derivatives is non-empty, each pair additionally carries deriv_yx and deriv_xy, each a named list keyed by order ("1", "2") whose entries are data frames with columns x, fit, se, lo, hi, ci_type matching the schema of pred_yx / pred_xy. The ci_type column records whether the lo / hi columns are "pointwise" (default), "simultaneous" (Ruppert–Wand–Carroll / Simpson 2018 critical-multiplier bands), or "none". When derivative_ci = "simultaneous", each derivative frame also carries a "crit_multiplier" attribute giving the MC-derived critical multiplier used. See janusplot_shape_metrics() for the definition of the monotonicity and convexity indices.

call

Match call.

See also

janusplot() for the ggplot front-end, janusplot_shape_metrics() for the shape-metric primitives.

Other smooth-associations: janusplot()

Examples

# Per-pair fits + metrics on a small mtcars slice
out <- janusplot_data(mtcars[, c("mpg", "hp", "wt")])
out$pairs[[1L]]$asymmetry_index
#> [1] 0.006028205
out$pairs[[1L]]$cor_spearman
#> [1] -0.8946646
out$pairs[[1L]]$shape_yx
#> [1] "s_shape"