Skip to contents

[Experimental]

Render a pairwise, asymmetric matrix of smoothed associations between numeric variables. Each cell [i, j] where i != j shows the fitted spline from mgcv::gam():

  • Upper triangle (i < j): gam(x_j ~ s(x_i) + <adjust>).

  • Lower triangle (i > j): gam(x_i ~ s(x_j) + <adjust>).

  • Diagonal: blank panel when labels live on the border (default), or a variable-name label when labels = "diagonal".

The two triangles intentionally differ — the asymmetry reveals heteroscedasticity, leverage, and directional non-linearity that a single scalar correlation hides.

Usage

janusplot(
  data,
  vars = NULL,
  adjust = NULL,
  method = NULL,
  k = -1L,
  bs = "tp",
  engine = c("bam", "gam"),
  discrete = FALSE,
  nthreads = 1L,
  order = c("original", "hclust", "alphabetical"),
  show_data = TRUE,
  show_ci = TRUE,
  display = c("fit", "d1", "d2"),
  derivative_ci = c("none", "pointwise", "simultaneous"),
  derivative_ci_nsim = 1000L,
  n_grid = NULL,
  colour_by = c("pearson", "spearman", "kendall", "edf", "deviance_gap", "none"),
  fill_by = NULL,
  palette = NULL,
  annotations = c("edf", "A"),
  shape_cutoffs = janusplot_shape_cutoffs(),
  show_shape_legend = TRUE,
  glyph_style = c("ascii", "unicode"),
  labels = c("border", "diagonal", "none"),
  diagonal = c("auto", "blank", "name", "density"),
  label_srt = 45,
  label_cex = 1,
  signif_glyph = TRUE,
  show_asymmetry = NULL,
  na_action = c("pairwise", "complete"),
  parallel = FALSE,
  with_data = FALSE,
  text_scale_diag = 1,
  text_scale_off_diag = 1,
  show_glossary = TRUE,
  glossary_scale = 1,
  k_check_thresholds = NULL,
  auto_refit_k = FALSE,
  k_max_iter = 2L,
  compact = c("auto", "always", "never"),
  compact_threshold = 12L,
  compact_levels = NULL,
  focus_by = NA_character_,
  focus_threshold = "q90",
  focus_dim_alpha = 0.25,
  axes = c("original", "standardised", "centred", "rank"),
  save_as = NULL,
  save_width = NULL,
  save_height = NULL,
  save_dpi = 300,
  ...
)

Arguments

data

A data frame with numeric columns to include.

vars

Character vector of column names to use. NULL (default) uses all numeric columns in data. Non-numeric columns trigger an error listing offenders.

adjust

A one-sided formula RHS giving additional covariates and/or random effects to include in every pairwise GAM. For example, adjust = ~ s(age) + s(site, bs = "re") fits gam(y ~ s(x) + s(age) + s(site, bs = "re")) for each pair. Default NULL fits unadjusted pairwise smooths.

method

Smoothing-parameter estimation method passed to the chosen fitting backend. Default NULL resolves per-engine: "fREML" for engine = "bam" (mgcv's recommended at scale), "REML" for engine = "gam" (the v0.1.0 behaviour). Pass any valid mgcv method string to override.

k

Integer, or named list mapping variable names to integers. Basis dimension for s(). Default -1L (mgcv's automatic choice).

bs

Basis type for s(). Default "tp" (thin plate).

engine

One of "bam" (default, new in v0.1.1) or "gam". Selects mgcv's fitting backend:

  • "bam"mgcv::bam(). Block-Lanczos solve + fREML estimation + lower memory. ~3-10x speedup at janusplot's scale (k = 15-25 vars, 600+ pairwise fits per call). The default, and the one non-byte-identical change in v0.1.1: fREML differs from REML by ~1-3% in EDF on identical data, so the asymmetry index may shift by similar amounts vs v0.1.0 output. Recoverable verbatim via engine = "gam".

  • "gam"mgcv::gam(). The v0.1.0 backend. Use for backward-compat reproduction, very small n (< 200) where bam's setup overhead exceeds its solve gain, or methodologically sensitive contexts that require REML rather than fREML.

discrete

Logical. bam-only opt-in to mgcv's covariate-discretisation optimisation. Further ~2-5x speedup at the cost of small (sub-pixel at typical janusplot resolution) prediction-shift. Default FALSE. Ignored when engine = "gam".

nthreads

Integer. bam-only intra-fit threading. Default 1L to avoid oversubscription when combined with parallel = TRUE (future.apply fans out pair fits across cores; nthreads > 1 within each pair would double-book CPUs). Raise above 1 only when parallel = FALSE. Ignored when engine = "gam".

order

One of "original" (default), "hclust" (reorder by hierarchical clustering of Pearson correlations), or "alphabetical".

show_data

Logical. If TRUE (default), overlay raw data points (low alpha) behind each spline. Only applies when display = "fit"; derivative panels never overlay raw data.

show_ci

Logical. If TRUE (default), overlay the 95% confidence envelope from predict(gam, se.fit = TRUE) on the fit panel (i.e. when display = "fit"). CI rendering on derivative panels is controlled separately by derivative_ci.

display

One of "fit" (default), "d1", or "d2". Selects which single quantity is rendered in every off-diagonal cell of the matrix.

  • "fit" — the fitted smooth \(\hat f(x)\); default, behaviour identical to the pre-derivative release.

  • "d1" — the first derivative \(\hat f'(x)\) of the fitted smooth. Zero crossings localise turning points of \(\hat f\).

  • "d2" — the second derivative \(\hat f''(x)\). Zero crossings localise inflection points of \(\hat f\).

A single matrix shows a single quantity by design: stacked multi-panel cells crowd the matrix at any realistic variable count. To compare fit against derivative, render two or three janusplot() calls side-by-side; each call keeps its own with_data = TRUE summary table tagged with the display column.

Orders \(k \ge 3\) are not exposed — higher-order derivatives of penalised regression splines amplify noise and rarely carry usable signal at realistic sample sizes. See vignette("janusplot") for the theoretical justification and applied use-cases.

derivative_ci

One of "none" (default), "pointwise", or "simultaneous". Controls whether — and how — a 95% confidence ribbon is drawn underneath the derivative curve when display %in% c("d1", "d2"). Ignored when display = "fit".

  • "none" — no ribbon. The curve and the zero reference line are all you see. Default, because pointwise ribbons overshoot nominal coverage as a joint region and can invite over-reading of local features.

  • "pointwise" — 95% pointwise ribbon from \(\sqrt{\mathrm{diag}(D V_p D^\top)}\) (Wood 2017 §7.2.4). Valid marginally; not a simultaneous statement.

  • "simultaneous" — 95% simultaneous band via the Monte Carlo construction of Ruppert, Wand & Carroll (2003) popularised for GAMs by Simpson (2018, Frontiers Ecol. Evol. 6:149): draw \(B\) samples \(\tilde{\boldsymbol\beta} \sim \mathcal{N}(\hat{\boldsymbol\beta}, V_p)\), compute \(\max_x |D_i(\tilde{\boldsymbol\beta} - \hat{\boldsymbol\beta})| / \mathrm{se}_i\), and use the \((1-\alpha)\) quantile as a critical multiplier on the pointwise SE. Valid for feature localisation ("where is \(\hat f'(x)\) significantly non-zero").

derivative_ci_nsim

Integer. Number of Monte Carlo samples used when derivative_ci = "simultaneous". Default 1000L — a compromise between coverage accuracy (Simpson 2018 uses 10000) and CPU budget across every pair in a medium-sized matrix. Ignored for any other derivative_ci.

n_grid

Integer or NULL. Number of equally-spaced points used to evaluate each fitted smooth (and its derivatives). Default NULL resolves to 100 when display = "fit" and 200 otherwise, because finite-difference second derivatives visibly degrade below \(\sim 150\) points on moderate-k smooths. Supplying n_grid directly overrides both defaults. Larger grids shift the numerical shape-metric values (\(M\), \(C\), turning / inflection counts) slightly because they are computed on this same grid. Shapes and asymmetry are the primary reading; M, C and the counts are secondary diagnostics and the grid-induced drift is tolerable.

colour_by

One of "pearson" (default), "spearman", "kendall", "edf", "deviance_gap", or "none". Encodes the per-cell fill colour by the chosen scalar. Correlation choices use a diverging palette with limits c(-1, 1) and a shared corr colour-bar title; "edf" and "deviance_gap" use a sequential palette labelled by the metric.

fill_by

Deprecated alias for colour_by. If supplied emits a single soft deprecation warning and is forwarded to colour_by.

palette

Character. Colour palette for the cell fill scale. Defaults to "RdBu" when colour_by is a correlation and "viridis" otherwise. Sequential choices: "viridis", "magma", "inferno", "plasma", "cividis", "mako", "rocket", "turbo" (not CB-safe), "YlOrRd", "YlGnBu", "Blues", "Greens". Diverging choices: "RdYlBu", "RdBu", "PuOr", "Spectral" (not CB-safe). Passing a sequential palette while colour_by is a correlation silently upgrades to the default diverging palette.

annotations

Character vector, a subset of c("edf", "A", "shape", "code"). Controls which corner annotations appear on each off-diagonal cell:

  • "code" — 2-letter ASCII shape code, top-left corner.

  • "A" and "edf" — asymmetry index and effective degrees of freedom, stacked bottom-left.

  • "shape" — shape glyph (Unicode or ASCII per glyph_style), bottom-right corner.

Default c("edf", "A"). "code" and "shape" occupy distinct corners so both can be requested together. See janusplot_shape_hierarchy() for the full code list.

shape_cutoffs

Named list of classification thresholds used to map the continuous shape indices into discrete shape_category labels; see janusplot_shape_cutoffs().

show_shape_legend

Logical. If TRUE (default), attach a standing shape-types legend plate below the matrix that illustrates every category in the taxonomy as a canonical thumbnail spline. Independent of annotations.

glyph_style

One of "ascii" (default) or "unicode". Controls how cell shape glyphs render when "shape" is included in annotations. Default is "ascii" for maximum portability across typesetting pipelines; switch to "unicode" only when the target font is known to cover the curve glyph set.

labels

One of "border" (default), "diagonal", or "none". Controls where variable names are rendered:

  • "border" — names along the top (rotated per label_srt) and left margins of the matrix; diagonal cells are left blank. Mirrors corrplot's tl.pos = "lt" convention.

  • "diagonal" — names centred on the diagonal cells (the pre-0.1 layout).

  • "none" — labels suppressed entirely; diagonal cells blank.

diagonal

One of "auto" (default), "blank", "name", or "density". Controls what is rendered in the diagonal cells of the matrix.

  • "auto" — preserves the historical behaviour: variable name when labels = "diagonal", blank otherwise.

  • "blank" — empty bordered panel (uniform grid reading).

  • "name" — variable name centred in the cell, bold.

  • "density" — kernel density of the variable filled in translucent grey, with a rug of raw values along the bottom edge. Mirrors the GGally::ggpairs convention; surfaces tail weight, bimodality, and support clipping that the pairwise smooths alone cannot reveal. Variable names should come from the border (labels = "border", the default) when this mode is on.

label_srt

Numeric. Rotation (degrees) of top labels when labels = "border". Default 45; set to 0 for horizontal or 90 for vertical. Ignored when labels != "border".

label_cex

Positive numeric multiplier on the border-label font size. Default 1. Ignored when labels = "none".

signif_glyph

Logical. If TRUE (default), annotate cells with · / * / ** reflecting the smooth's F-test p-value.

show_asymmetry

Deprecated. Use annotations instead ("A" %in% annotations). When supplied, a soft deprecation warning fires and the argument is merged into annotations.

na_action

One of "pairwise" (default; per-cell complete observations) or "complete" (listwise; all cells use the same rows).

parallel

Logical. If TRUE, use future.apply::future_mapply() to fit pairs in parallel. Requires the future.apply package and a user-configured future::plan(). Default FALSE.

with_data

Logical. If TRUE, return a two-element list list(plot, data) where data is a flat per-cell summary (one row per off-diagonal cell) of everything the plot displays. The data element is always a plain data.frame (base R — no data.table dependency). Default FALSE — in which case only the ggplot is returned.

text_scale_diag

Positive numeric multiplier applied to the diagonal variable-name labels. Default 1. Diagonal labels additionally auto-shrink for long variable names (nchar(var) > 10) so they fit the cell regardless of this value.

text_scale_off_diag

Positive numeric multiplier applied to all off-diagonal annotations (n / EDF readouts, significance glyphs, asymmetry-index labels). Default 1. Use < 1 when cells are small and the annotations crowd the fit line; use > 1 for presentation plots.

show_glossary

Logical. If TRUE (default), attach a multi-line caption below the matrix describing the on-plot abbreviations (n, EDF, A, fill encoding, significance glyphs). Only keys actually displayed are listed.

glossary_scale

Positive numeric multiplier on the glossary caption font size. Default 1.

k_check_thresholds

Named list giving the three flag-criterion thresholds used by mgcv::k.check()-style basis-dimension diagnostics. Required entries: edf_ratio (Wood's \(\widehat{\mathrm{edf}}/k'\) ratio above which the smooth is too close to its basis cap), k_index (residual-difference variance ratio below which the basis appears underspecified), and p (the simulation p-value below which the basis-deficiency signal is significant). Defaults — edf_ratio = 0.9, k_index = 1.0, p = 0.05 — track mgcv::gam.check() and Wood (2017) §5.9.

auto_refit_k

Logical. If TRUE, every cell whose Wood trifecta flags an underfit is refit with a doubling-k loop until either the flag clears, the per-cell unique-x cap is reached, or k_max_iter iterations have passed. Default FALSE — the diagnostic (k_check_status, k_flag, k_prime, k_index, k_p) is always computed and surfaced regardless of this flag, but the refit is opt-in because it can multiply wall time on pathological data.

k_max_iter

Integer. Maximum number of doublings allowed per cell when auto_refit_k = TRUE. Default 2L (so a cell that starts at the mgcv default k = 10 will visit at most k = 20 and then k = 40, capped by the per-cell unique-x limit). Ignored when auto_refit_k = FALSE.

compact

One of "auto" (default), "always", or "never". Controls scale-aware content suppression per cell:

  • "auto" — tier 0 at n_var < compact_threshold (the v0.1.0 behaviour); progressively suppresses scatter, then CI, then annotations, then the spline itself as n_var crosses the compact_levels ladder. The matrix remains readable as k grows toward 25–30 by trading detail for legibility.

  • "always" — force at least tier 1 regardless of n_var. Useful for very dense fixed-size renders.

  • "never" — force tier 0 regardless of n_var. Useful for reproducing v0.1.0 figures on large matrices.

compact_threshold

Integer. The n_var value at which tier 1 (drop scatter) auto-activates under compact = "auto". Default 12L, anchored on the 150 × 150 px-per-cell pixel budget at typical 6"×6" 300 DPI R Journal figures.

compact_levels

Optional named list with entries t1, t2, t3 overriding the auto-tier ladder. Defaults derive from compact_threshold: t1 = compact_threshold, t2 = compact_threshold + 6, t3 = compact_threshold + 13. NULL (default) uses these derived defaults.

focus_by

One of NA (default — no filter), "asymmetry", "edf", "k_flag", or "non_linearity" (defined as edf - 1). When set, cells whose chosen metric falls below focus_threshold are rendered in grey85 at alpha focus_dim_alpha; the matrix shape is preserved so attention drains visually to high-metric cells. This is a visual filter, not a statistical one — the underlying fits are unchanged and the with_data table carries every cell.

focus_threshold

Either a quantile-string like "q90" (default) or a numeric cutoff. The quantile is taken over the non-NA distribution of focus_by across all off-diagonal cells in the matrix. Ignored when focus_by = NA.

focus_dim_alpha

Numeric in [0, 1]. Alpha applied to the grey85 wash on unfocused cells. Default 0.25. Ignored when focus_by = NA.

axes

One of "original" (default), "standardised", "centred", or "rank". Rendering-only knob — the underlying mgcv::gam fits are byte-identical across all four modes (verifiable via digest::digest() on the fit list); the transformation lives entirely inside the cell renderer and propagates to (a) the raw scatter, (b) the spline prediction grid, (c) the CI ribbon, and (d) the variable label on the matrix border. Use:

  • "original" — raw units. Maximum interpretability per cell. v0.1.0 behaviour.

  • "standardised"(x - mean(x)) / sd(x) per variable. Border label becomes e.g. "mpg (z)". Pairs scaled into a comparable visual range; useful at k >= 15 when raw-unit panels look disparate.

  • "centred"x - mean(x) per variable. Border label becomes e.g. "mpg (centred)". Preserves units while anchoring the origin.

  • "rank" — empirical-CDF-based rank, scaled to [0, n] per variable. Border label becomes "rank(mpg)". Sanity-check view: collapses outliers; if the smooth changes shape vs "original" the relationship is monotone-but-not-linear.

At compact tier 3 (n_var >= 25 under compact = "auto"), the cells render only colour fill + shape-class glyph — no curve, no scatter — so axes becomes a documented no-op (the border labels still pick up the mode suffix).

save_as

Optional file path with extension. When set, the final assembled plot is written to this path via ggplot2::ggsave(); the device is inferred from the extension. Supported extensions: .png, .pdf, .svg, .jpg / .jpeg, .tif / .tiff, .eps, .ps, .bmp. Default NULL (no file written; janusplot() still returns the ggplot). Width / height default to pmax(6, 0.65 * k_n) inches each — square aspect, scaling with matrix dimension.

save_width

Numeric. Override width (inches) for save_as. Default NULL uses the auto-resolved square.

save_height

Numeric. Override height (inches) for save_as. Default NULL uses the auto-resolved square.

save_dpi

Integer. Raster DPI for save_as when the inferred device is bitmap (png/jpg/tiff/bmp). Default 300, matching the R Journal target.

...

Additional arguments passed to mgcv::gam().

Value

If with_data = FALSE (default), a ggplot2::ggplot object (via patchwork::wrap_plots()) carrying a top-of-matrix title that names the displayed quantity ("Direct fit", "First derivative f'", or "Second derivative f''"). If with_data = TRUE, a list with two elements: plot (the ggplot) and data (a tidy table with columns var_x, var_y, position, n_used, edf, pvalue, signif, dev_exp, asymmetry_index, cor_pearson, cor_spearman, cor_kendall, tie_ratio, monotonicity_index, convexity_index, n_turning_points, n_inflections, flat_range_ratio, shape_category, colour_value, display, one row per off-diagonal cell). The display column tags which quantity the call rendered, so separate calls for fit / d1 / d2 yield comparable, stackable tables. Derivative curves themselves (grid of \(x\), fitted \(\hat f^{(k)}\), SE) live on janusplot_data() — see there.

See also

janusplot_data() for the raw per-cell fits + metrics.

Other smooth-associations: janusplot_data()

Examples

# Minimal runnable example — 3 variables, 6 asymmetric pairwise GAM fits.
janusplot(mtcars[, c("mpg", "hp", "wt")])


# \donttest{
# Heteroscedastic DGP: Pearson r is ~ 0.9 but the inverse fit is
# clearly non-linear, yielding asymmetry index > 0.5.
set.seed(2026L)
n  <- 200L
x1 <- stats::runif(n, 0, 10)
x2 <- x1 + stats::rnorm(n, sd = 0.2 * x1)
janusplot(data.frame(x1 = x1, x2 = x2, x3 = stats::rnorm(n)))


# A single matrix renders a single quantity. To compare the fit
# against its derivatives, render three calls and place them
# side-by-side; each call's title makes the quantity explicit.
set.seed(2026L)
xs <- stats::runif(300L, -3, 3)
df <- data.frame(
  x  = xs,
  y1 = sin(xs)  + stats::rnorm(300L, sd = 0.3),
  y2 = xs^2     + stats::rnorm(300L, sd = 0.6)
)
janusplot(df, display = "fit")

janusplot(df, display = "d1")

janusplot(df, display = "d2")


# Simultaneous CI bands on a derivative panel, per Simpson (2018).
janusplot(df, display = "d1", derivative_ci = "simultaneous")

# }