Propose role classifications for the columns of a data frame
Source: R/propose_roles.R
Generates a heuristic role tibble for every column of df. The user is
expected to inspect this tibble and edit it before passing it to mask().
Heuristics are seeds, not law.
Arguments
- df
A data frame. Must have at least one column.
- detect
Logical scalar (default TRUE). When TRUE, run detect_design() and overlay its recommended role hints on the name-based heuristic. Stash the design_summary as attr(roles, "design"). When FALSE, only the v0.2.x name-based heuristic runs.
Value
A tibble with one row per column, containing:
- col: column name.
- role: one of design, treatment, outcome, covariate, ignore.
- kind: storage kind (numeric, integer, factor, character, logical, date, datetime, other).
- freq_or_range: brief summary string (range for numeric, level count for factor, etc.).
- pii_suspected: TRUE if column name matches a PII pattern.
- notes: short explanation of the auto-classification.
Details
Roles are exactly one of:
- design
Byte-identical pass-through. Trial / site / replicate / block / plot / row / column / year etc.
- treatment
Same factor cardinality and per-level frequency; optional label aliasing or seeded permutation.
- outcome
Re-simulated via Gaussian copula. Multiple allowed.
- covariate
Numeric: Gaussian copula (joint with outcomes). Categorical: row-permuted, levels preserved (local) or aliased (collaborate).
- ignore
Dropped or passed through depending on mask() options; auto-assigned for date/time, free text, and PII-pattern names.
Default classification rules, applied in order:
1. PII-pattern column names (contact, email, phone, gps, latitude / longitude, postcode, ssn, password, owner, farmer, operator, etc., case-insensitive substring) -> ignore with pii_suspected = TRUE.
2. Date / POSIXct / POSIXlt / difftime columns -> ignore.
3. ID-pattern names (\\bid\\b, _id$, ^id_) -> ignore.
4. Design-pattern names (rep, block, row, col(umn)?, range, plot(no)?, site, env(ironment)?, trial, year, season, colrep, tos) -> design.
5. Treatment-pattern names (treatment, variety, cultivar, genotype, ^trt, ^dose) -> treatment.
6. Character columns with > 50% unique values on non-NA -> ignore (likely free text).
7. Everything else -> covariate. The user re-classifies one or more columns as outcome.
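The ordered rules above can be approximated in base R. This is a minimal sketch, not masque's implementation: guess_role() is a hypothetical helper, the PII list covers only the names spelled out above (the source's "etc." is left unfilled), and it assumes that design patterns match the whole lower-cased column name while treatment patterns match as substrings or prefixes. The real matching may differ.

```r
# Hypothetical re-implementation of the name-based rules; not masque's code.
guess_role <- function(x, name) {
  nm <- tolower(name)

  # Rule 1: PII names, matched as case-insensitive substrings.
  pii <- c("contact", "email", "phone", "gps", "latitude", "longitude",
           "postcode", "ssn", "password", "owner", "farmer", "operator")
  if (grepl(paste(pii, collapse = "|"), nm)) return("ignore")

  # Rule 2: date / time classes.
  if (inherits(x, c("Date", "POSIXct", "POSIXlt", "difftime"))) return("ignore")

  # Rule 3: ID-like names.
  if (grepl("\\bid\\b|_id$|^id_", nm, perl = TRUE)) return("ignore")

  # Rule 4: design names (ASSUMPTION: anchored to the whole name).
  design <- paste0("^(rep|block|row|col(umn)?|range|plot(no)?|site|",
                   "env(ironment)?|trial|year|season|colrep|tos)$")
  if (grepl(design, nm, perl = TRUE)) return("design")

  # Rule 5: treatment names (ASSUMPTION: substring or prefix match).
  if (grepl("treatment|variety|cultivar|genotype|^trt|^dose", nm, perl = TRUE)) {
    return("treatment")
  }

  # Rule 6: character columns that look like free text.
  if (is.character(x)) {
    v <- x[!is.na(x)]
    if (length(v) > 0 && length(unique(v)) / length(v) > 0.5) return("ignore")
  }

  # Rule 7: everything else.
  "covariate"
}

# Made-up trial data, for illustration only.
demo <- data.frame(
  plot_id      = 1:6,
  block        = rep(1:2, each = 3),
  variety      = rep(c("A", "B", "C"), times = 2),
  yield        = c(4.1, 3.8, 5.0, 4.4, 3.9, 5.2),
  farmer_email = paste0("p", 1:6, "@example.org"),
  sown         = as.Date("2024-05-01") + 0:5,
  comment      = c("ok", "lodged", "bird damage", "ok", "late", "weedy"),
  stringsAsFactors = FALSE
)

roles <- vapply(names(demo), function(nm) guess_role(demo[[nm]], nm), character(1))
roles
```

Because the rules are applied in order, farmer_email is caught by the PII rule before any later rule is consulted, and yield falls through to covariate until it is re-classified by hand.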
Failing to designate at least one outcome is a hard error at mask()
time (via roles_validate()).
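The re-classification step is an ordinary edit of the role column. The sketch below uses a hand-built stand-in for the roles tibble, and has_outcome() is a hypothetical helper that mimics only the "at least one outcome" condition; it is not roles_validate(), whose full set of checks is not described here.

```r
# Stand-in for the tibble returned by propose_roles(); values are made up.
roles <- data.frame(
  col  = c("block", "variety", "yield"),
  role = c("design", "treatment", "covariate"),
  stringsAsFactors = FALSE
)

# Hypothetical check: is at least one column marked as an outcome?
has_outcome <- function(roles) any(roles$role == "outcome")

before <- has_outcome(roles)

# Promote the response variable from the default covariate role.
roles$role[roles$col == "yield"] <- "outcome"

after <- has_outcome(roles)
c(before = before, after = after)
```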
Since masque 0.3.0, propose_roles() also calls detect_design() by
default (detect = TRUE) and applies the detected design's
recommended_roles on top of the name-based heuristic. This promotes
structurally-identified block / treatment columns even when the
column names do not match the design / treatment regexes. The
resulting design summary is stashed as attr(roles, "design") so the
user can plot() it or inspect alternates. Pass detect = FALSE to
recover the v0.2.x name-only behaviour byte-for-byte.
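One way to see what the detection overlay changed is to run both modes and compare the role column. This sketch assumes the package is installed under the name masque and exports propose_roles(); it is guarded so that it does nothing otherwise, and no output is shown because the result depends on the installed version.

```r
changed <- character(0)

# Assumes an installed package named 'masque' that exports propose_roles().
if (requireNamespace("masque", quietly = TRUE)) {
  name_only <- masque::propose_roles(iris, detect = FALSE)  # v0.2.x behaviour
  detected  <- masque::propose_roles(iris)                  # detect = TRUE

  # Columns whose role differs between the two modes.
  changed <- detected$col[detected$role != name_only$role]

  # The design summary stashed by the detect = TRUE run.
  design <- attr(detected, "design")
}

changed
```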
See also
roles_validate() for the fail-closed validation applied at
mask() time.
Examples
propose_roles(iris)
#> # A tibble: 5 × 6
#>   col          role      kind    freq_or_range pii_suspected notes
#>   <chr>        <chr>     <chr>   <chr>         <lgl>         <chr>
#> 1 Sepal.Length covariate numeric [4.3, 7.9]    FALSE         Default -> covaria…
#> 2 Sepal.Width  covariate numeric [2, 4.4]      FALSE         Default -> covaria…
#> 3 Petal.Length covariate numeric [1, 6.9]      FALSE         Default -> covaria…
#> 4 Petal.Width  covariate numeric [0.1, 2.5]    FALSE         Default -> covaria…
#> 5 Species      treatment factor  n=3 levels    FALSE         detect_design: cov…