Frequently Asked Questions

Do I need to use MU-referencing in my model definitions, like in NONMEM / nlmixr2?

No. You do not need to write an explicit MU_i intermediate variable — ferx detects the same structure automatically from the right-hand side of each [individual_parameters] line.

In NONMEM and nlmixr2, MU-referencing is a convention in which each random effect ETA(i) is linearly associated with a single MU_i term (typically MU_i = LOG(THETA(i))), so individual parameters look like:

MU_1 = LOG(THETA(1))
CL   = EXP(MU_1 + ETA(1))

This structure is required by NONMEM’s SAEM implementation, whose conjugate-Gibbs E-step is only valid when the MU_i → ETA(i) relationship is strictly linear (typically on a log scale). Deviating from it — for example writing CL = THETA(1) * EXP(ETA(1)) without going through an intermediate MU_1 — causes NONMEM SAEM to reject the model or silently produce biased estimates.

How ferx handles it

In ferx, you simply write the individual-parameter line in the form that makes sense for your model. The parser inspects each line and records which THETA acts as the “anchor” for each ETA, then uses that information to re-centre the estimation search at every outer iteration. The following patterns are detected automatically:

CL = TVCL * exp(ETA_CL)              # multiplicative exp — most common
CL = exp(log(TVCL) + ETA_CL)         # canonical MU form
CL = TVCL * (WT/70)^0.75 * exp(ETA_CL)   # with covariate adjustment
CL = TVCL + ETA_CL                   # additive eta (linear shift)

The parser records ETA_CL → (TVCL, log_transformed) and at each outer step computes the shift

mu[i] = log(theta[i])   # for multiplicative/exp patterns
mu[i] = theta[i]        # for additive patterns

The inner optimizer (and the SAEM exploration-phase MH proposal) is then centred on this mu shift rather than on eta = 0. This matches the convergence behaviour that NONMEM / nlmixr2 get from an explicit MU_i = LOG(THETA(i)) block, without requiring you to write it.

Patterns that ferx does not auto-detect (and therefore falls back to the plain eta = 0 start) include:

Parameters with two or more free thetas in the product (CL = TVCL * TVSCALE * exp(ETA_CL)) — ambiguous anchor.
Compound eta expressions inside the exp (CL = TVCL * exp(ETA_CL + ETA_OCC)).
Non-standard transforms the parser doesn’t recognise.

Mu-referencing being absent is not an error — it just means that eta for that parameter is initialised to zero at each outer step, same as the pre-automatic behaviour. When a fit runs, ferx records the list of auto-detected mu-referenced ETAs in FitResult.warnings (as a line starting with mu-ref: …) so you can confirm what the parser picked up.

Why it matters

Mu-referencing mostly affects convergence speed and robustness, not the final MLE:

FOCE / FOCEI: each outer step re-optimises ETA by BFGS. With a mu-shift, the warm-started BFGS sees a much better starting point when THETA moves away from the initial estimate, so fewer inner iterations are needed and pathological steps are less likely.
SAEM: during the exploration phase, Metropolis–Hastings proposals are centred on mu_k instead of on the current chain state. This helps the chain escape the (incorrect) eta = 0 basin when TVCL is still far from the true value. During the convergence phase the proposal reverts to a symmetric random walk so that detailed balance holds.

Because the estimator is MLE in both cases, models that converge without mu-referencing will converge to the same estimates with it — just (usually) in fewer iterations.

Disabling it

If you want to benchmark against the pre-2026 behaviour, set:

[fit_options]
  mu_referencing = false

or, from the Rust API, FitOptions { mu_referencing: false, .. }. The default is true.

Porting from NONMEM

If you have a NONMEM model that uses an explicit MU_1 = LOG(THETA(1)) line, just drop the MU_i intermediate and write the individual parameter directly — ferx will detect the equivalent mu-reference automatically and the fit will be equivalent.

Can I use `if` / `else` statements like in NONMEM or nlmixr2?

Yes. ferx supports two forms of conditional logic in both [individual_parameters] and [odes] blocks:

# Block form
if (WT > 70) {
  CL = TVCL * (WT / 70)^0.75 * exp(ETA_CL)
} else if (SEX == 1) {
  CL = TVCL * 1.2 * exp(ETA_CL)
} else {
  CL = TVCL * exp(ETA_CL)
}

# Inline (ternary) form
CL = if (SEX == 1) TVCL * 1.5 else TVCL

Conditions support comparisons (<, <=, >, >=, ==, !=) and logical operators (&&, ||, !). Either form may be combined with arbitrary arithmetic, including covariate references and other ETAs.

Compared to NONMEM: NONMEM uses Fortran-style IF (cond) THEN ... ELSE ... ENDIF and Fortran comparison aliases (.GT., .LE., etc.). ferx uses C-style braces and operator symbols. The semantics are otherwise the same — exactly one branch fires per evaluation, the rest are ignored.

Compared to nlmixr2: nlmixr2’s if (cond) { ... } else { ... } syntax is a near-verbatim match for ferx’s block form. Drop the <- assignments in favour of = and the model translates directly.

Caveat: when a parameter is assigned inside an if block, ferx skips mu-reference detection for that parameter (the (ETA → THETA) link is no longer unconditional). See Individual Parameters for the workaround if you need both conditional logic and mu-referencing on the same parameter.

A note on == and !=: these operators do an exact bitwise float comparison. They work as you’d expect for integer-coded covariates (SEX == 1, STUDY != 3) but are unreliable on continuous values, where floating-point round-off can flip the result. For continuous thresholds prefer a bracketed comparison (WT >= 70 && WT < 80) over WT == 75.

Which outer optimizer should I pick?

bobyqa (the default) is the right choice for most models — derivative-free quadratic trust-region, robust to noisy FD gradients, and consistently reaches a lower OFV than slsqp on ODE/PD models, sparse data, and Hill-ridge identifiability problems. Previously the default was slsqp; the change is one line away if you need the old behaviour (optimizer = slsqp in [fit_options]).

Reach for a different optimizer when the default misbehaves:

slsqp — gradient-based; faster per iteration on smooth, well-conditioned analytical PK models with many parameters. Can stall above the true minimum on ill-conditioned fits unless paired with reconverge_gradient_interval = 1 (5–6× cost per gradient).
trust_region — second-order Newton trust-region with the analytic outer gradient and BHHH approximate Hessian. Can be faster near convergence because it uses curvature information; the CG budget defaults to ceil(sqrt(n_params)) (~5 for standard NLME models), but you can pin it with steihaug_max_iters if you have many packed parameters and want more aggressive sub-problem solves.
lbfgs / bfgs — fall back to these only when NLopt is unavailable.

See Fit Options for the full list.

Does ferx have a `LAPLACIAN` estimation method, like NONMEM?

There is no separate LAPLACE / LAPLACIAN switch in ferx — and you don’t need one. In NONMEM, METHOD=COND (FOCE/FOCEI) and METHOD=COND LAPLACE differ only in how the curvature of the individual objective in the random effects (η) is formed: FOCE uses a first-order (Gauss–Newton) approximation that drops the second derivative of the prediction, whereas LAPLACE computes the exact second derivative of the individual −2·log L in η. LAPLACE is required whenever a subject’s likelihood is not the built-in Gaussian-residual form — i.e. a user-defined likelihood (NONMEM’s F_FLAG): categorical, count, and time-to-event data.

ferx’s conditional engine is itself a Laplace approximation, and it selects the right curvature per endpoint automatically:

Endpoint	NONMEM	ferx
Continuous (Gaussian residual error)	FOCE / FOCEI — first-order Hessian	FOCEI — same first-order Laplace form (matches nlmixr2)
Time-to-event (`[event_model]` hazard)	`LAPLACIAN` required	exact second-order η-Hessian — applied automatically

So for a survival model you still write method = focei; ferx detects the [event_model] endpoint and, because there is no Gaussian residual to linearize, computes the true second derivative of the hazard −log L with respect to η and folds it into the Laplace marginal. That is exactly NONMEM’s LAPLACIAN computation — which is why ferx TTE fits are validated against NONMEM CONDITIONAL LAPLACIAN INTERACTION (see Time-to-event estimation). It is not a user-selectable option because no alternative is meaningful: a hazard likelihood always uses it.

For continuous data, ferx (like nlmixr2 FOCEI) keeps the first-order Hessian and does not offer a full-Laplace-on-Gaussian toggle — the extra curvature term it would add is rarely worth it. When FOCE/Laplace accuracy is genuinely the concern (strong nonlinearity, non-Gaussian η posteriors), reach for the importance-sampling (imp / impmap) or saem / bayes methods, which integrate over η rather than approximating around its mode. See FOCE/FOCEI and Fit Options.

The Laplace approximation has limits either way. Because Laplace is exact only for Gaussian integrands, frailty variances on nonlinear hazard parameters can be over-estimated under FOCEI-Laplace — in ferx and in NONMEM. Prefer SAEM/IMP there; the TTE page has the cross-tool comparison.

How do I fit on the log scale, like NONMEM’s `Y = LOG(F) + EPS(1)`?

Use a log-transform-both-sides (LTBS) error model. Two forms are available in the [error_model] block, depending on the scale of your DV column:

# DV on the natural scale — engine log-transforms DV and the prediction:
log(DV) ~ additive(ADD_LOG)

# DV already log-transformed in the data (e.g. ported from NONMEM):
DV ~ log_additive(ADD_LOG)

Both compare log(prediction) to a log-scale observation with additive error on the log scale — exactly NONMEM’s IPRED = LOG(F); Y = IPRED + EPS(1). Under LTBS, IPRED/PRED, IWRES/CWRES, and simulated DV are all reported on the log scale (back-transform with exp() for natural-scale values).

Relationship to proportional error. For a small residual CV, additive-on-log and proportional error coincide. On the warfarin dataset the two parameterizations agree closely: examples/warfarin.ferx (DV ~ proportional(PROP_ERR)) fits PROP_ERR ≈ 0.0106 with TVCL ≈ 0.133, TVV ≈ 7.69, TVKA ≈ 0.76, and examples/warfarin_ltbs.ferx (log(DV) ~ additive(ADD_LOG)) fits ADD_LOG ≈ 0.0106 with TVCL ≈ 0.133, TVV ≈ 7.74, TVKA ≈ 0.81 — the same structural parameters and the same residual magnitude. See Error Model for the full reference and restrictions.

How do I scale predictions to match my data’s units, like NONMEM’s `S1`/`S2`?

Use the [scaling] block. The convention is divisive (pred_scaled = pred_raw / scale), matching the natural reading of obs_scale = V/1000.

[scaling]
  obs_scale = 1000          # mg/L → mg/mL

The block also supports expression-form scales (obs_scale = WT / 70) and a full Form C readout that replaces the built-in output entirely — on ODE and analytical (closed-form) PK models (#650):

[structural_model]
  ode(states=[depot, central])   # or: pk one_cpt_iv(cl=CL, v=V)

[scaling]
  y = central / V           # central holds the amount; observe as concentration

On an analytical model the readout can express a nonlinear multi-DV output such as a saturable free-vs-total protein-binding correction switched by a per-row covariate, and stays analytic in the FOCEI/FOCE gradient (including non-structural readout parameters like a binding BMAX/KD):

[scaling]
  y = if (FREE == 0) central / V + BMAX * (central / V) / (KD + central / V) \
      else           central / V

For models with multiple observed compartments (parent + metabolite, sum-of-moieties, free vs. total), specify a scale per CMT:

[scaling]
  obs_scale[CMT=1] = 1000
  obs_scale[CMT=2] = 1
  y[CMT=1] = parent / V
  y[CMT=2] = metab  / VM

Every observed CMT must have a matching [CMT=N] entry — fit() errors at startup with the list of missing CMTs.

See Scaling for the full reference, including how this compares to NONMEM’s S1/S2 and nlmixr2’s cmt(central); f = ....

My DV data is on the log scale (or I want to fit on the log scale). How do I do that?

Use the log-transform-both-sides (LTBS) error model. There are two forms depending on the scale of the DV column:

DV is on the natural (concentration) scale — use log(DV) ~ additive(SIGMA). ferx log-transforms the DV column at load time and compares it to log(prediction):

[parameters]
  sigma ADD_LOG ~ 0.1     # additive SD on the log scale (≈ CV if small)

[error_model]
  log(DV) ~ additive(ADD_LOG)

DV is already log-transformed in the dataset — use DV ~ log_additive(SIGMA). ferx takes DV as-is and only log-transforms the prediction:

[error_model]
  DV ~ log_additive(ADD_LOG)

Both produce IPRED, PRED, IWRES, and CWRES on the log scale (back-transform with exp() for natural-scale values). BLOQ/M3 is supported; multi-endpoint and SDE models are not. See Error Model for full details.

Which outer optimizer should I use?

Default (bobyqa) works well for most models — including ODE/PD models, sparse data, and Hill-ridge problems where gradient-based optimizers stall. If you’re unsure, start here.

slsqp is the right pick when bobyqa is too slow on a smooth, well-conditioned model with many parameters (it can take many quadratic-interpolation samples to triangulate a high-dimensional surface). Pair with reconverge_gradient_interval = 1 if it stalls above an expected OFV.

trust_region shines on high-parameter-count models (many thetas/omegas) or when combined with inits_from_nca — the second-order curvature helps when starting values are good. Set steihaug_max_iters if you want to pin the CG budget.

gn / gn_hybrid for fast iteration during model development: Gauss-Newton converges in 10–30 steps vs 100+ for gradient methods. gn_hybrid adds a FOCEI polish pass for robustness; that polish stage runs with bobyqa by default and inherits any optimizer override.

Gradient-based (lbfgs, mma) are rarely needed; prefer bobyqa or slsqp. See FOCE/FOCEI — Optimizer Options for a full comparison.

How do I validate a model file without running a full fit?

Use ferx check:

ferx check model.ferx                  # parse + structural validation
ferx check model.ferx --data data.csv  # also run data-dependent checks
ferx check model.ferx --data data.csv --json

It runs the parser and every validation step that normally fires at the start of a fit — missing blocks, missing covariates, per-CMT scaling / error-model coverage, steady-state and lag-time sanity — then reports the findings and exits, without fitting. This turns a multi-second fit-to-find-a-typo into a sub-second loop.

--json emits a structured report (stable code per finding, optional block / line / suggestion) that tooling and coding agents can consume directly, rather than parsing prose. The exit code is 0 when no errors are found, 1 when there are errors. See the check report reference for the JSON schema and the full code table.

How do I exclude records or subjects, like NONMEM’s `$DATA IGNORE=`?

Use the [data_selection] block:

[data_selection]
  ignore = DV < 0.001          # drop any obs where DV is below detection
  ignore_subjects = [3, 17]    # drop subjects 3 and 17 entirely

The ignore key works like NONMEM’s $DATA IGNORE=: a record is excluded when the expression is true. The accept key is the complement: a record is kept only when the expression passes (equivalent to NONMEM’s $DATA ACCEPT=).

NONMEM	ferx
`$DATA IGNORE=(BW.GT.80)`	`ignore = BW > 80`
`$DATA ACCEPT=(DV.GE.0.001)`	`accept = DV >= 0.001`
`$DATA IGNORE=(ID.EQ.3) IGNORE=(ID.EQ.17)`	`ignore_subjects = [3, 17]`

Multiple ignore lines mean “exclude if any condition matches”; multiple accept lines mean “exclude unless all conditions pass”. Conditions within a single line can be joined with && (both must hold to trigger the rule). || within a single expression is not supported — use two separate lines instead.

After the fit, the CLI prints a --- Data Selection --- block and the YAML output file includes an exclusions: section with record counts and the expressions that fired. See Data Selection for the full reference.

What about NONMEM’s coded `RATE` values (`-1`, `-2`)?

NONMEM overloads the RATE column. A positive value is a constant infusion rate, but negative values are codes that tell NONMEM to take the rate or duration from a $PK parameter rather than the data:

`RATE`	NONMEM meaning	ferx
`0`	Bolus — route set by the dose compartment	supported
`> 0`	Constant-rate infusion, duration = `AMT/RATE`	supported
`-1`	Infusion rate modeled — `R1` defined in `$PK`	supported, both engines (#324)
`-2`	Infusion duration modeled — `D1` defined in `$PK`	supported, both engines (#324, #394)

RATE = -2 (duration) and RATE = -1 (rate) are both supported on the analytical pk(...) engine and ode(...) models. Declare the matching $PK-style parameter — D{cmt} for -2 (ferx infuses AMT over that duration, rate AMT / D{cmt}) or R{cmt} for -1 (ferx infuses at that rate, duration AMT / R{cmt}) — evaluated per iteration and occasion, matching NONMEM’s $PK D{n} / R{n}. A coded RATE with no matching parameter is a loud error (E_MODELED_DURATION_NO_PARAM / E_MODELED_RATE_NO_PARAM), never a silent bolus. Both codes are parameter-driven; neither reads a separate DURATION data column. Any other negative or non-finite RATE is rejected — earlier versions silently misread the coded forms as a bolus (#324).

For bioavailability F ≠ 1, ferx reshapes an infusion the NONMEM way (#419): a rate-defined infusion (RATE>0 data and RATE=-1 → R{cmt}) holds the rate and scales the duration to F·AMT/RATE, while a duration-defined infusion (RATE=-2 → D{cmt}) holds the duration and scales the rate to F·AMT/D{cmt}. Total exposure (F·AMT) is identical; only the infusion shape differs between the two modes, and only when F ≠ 1. At F = 1 they coincide.

RATE=0 denotes a bolus, not specifically an intravenous dose: the route follows the dose compartment — a bolus into the central compartment is intravenous, while a bolus into a depot compartment with first-order absorption is extravascular (oral).

Can I model zero-order absorption (zero-order input into the depot)?

Yes — on the analytical engine, with no ode(...) block. A RATE=-2 modeled duration D1 (or an explicit positive RATE) into compartment 1 of an analytical oral model (pk one_cpt_oral / two_cpt_oral / three_cpt_oral) releases the dose into the depot at a constant zero-order rate over D1, which is then absorbed first-order into central via KA — the standard zero-order-into-depot absorption model. This mirrors NONMEM’s ADVAN2 with a $PK D1 (the depot is compartment 1 in both). D2 into the same model is a depot-bypassing infusion straight into central.

This stays on the closed form (it’s a linear system with piecewise-constant forcing) — ferx does not silently rewrite it into an ODE. The one limitation: per-compartment amounts in sdtab / [derived] are not available for these subjects (predictions are exact; a W_DERIVED_CMT_ORAL_DEPOT_INFUSION_ANALYTICAL warning notes it). Use an ode(...) model if you need the compartment amounts.

NONMEM comparison

A one_cpt_oral model with CL=5, V=50, KA=1, and a modeled depot duration D1=5 (so a 100-unit dose is released into the depot at rate 20 over 5 h), versus NONMEM 7.5.1 ADVAN2 TRANS2 with $PK D1=THETA(4) (MAXEVAL=0, eta=0). Control file and data: tests/nonmem/oral_depot_d1.ctl / .csv.

time (h)	ferx `pk one_cpt_oral`	NONMEM `ADVAN2` PRED
1	0.14200	0.14200
2	0.42135	0.42135
3	0.72960	0.72960
5	1.30730	1.30730
8	1.27350	1.27350
12	0.86800	0.86800
18	0.47659	0.47659
24	0.26156	0.26156

ferx matches NONMEM to the 5 significant figures tabulated (the analytical_oral_depot_modeled_duration_matches_nonmem regression test asserts this).