FOCE / FOCEI

Maturity: stable — see Feature Maturity for what this means.

First-Order Conditional Estimation (FOCE) and its interaction variant (FOCEI) are the primary estimation methods in ferx-core. They use a two-level nested optimization to find maximum likelihood estimates of population parameters.

Algorithm Overview

Outer Loop (Population Parameters)

The outer loop optimizes the population parameters $\theta$, $\Omega$, and $\sigma$ by minimizing the population objective function value (OFV):

\[ \text{OFV} = -2 \log L = \sum_{i=1}^{N} \text{OFV}_i \]

Parameters are internally transformed for unconstrained optimization: - Theta: Log-transformed (ensures positivity) - Omega: Cholesky factorization (ensures positive-definiteness) - Sigma: Log-transformed (ensures positivity)

Inner Loop (Empirical Bayes Estimates)

For each subject, the inner loop finds the empirical Bayes estimate (EBE) of the random effects $\hat{\eta}_i$ by minimizing the individual negative log-likelihood:

\[ -\log p(\eta_i | y_i, \theta, \Omega, \sigma) = \frac{1}{2} \left[ \eta_i^T \Omega^{-1} \eta_i + \log|\Omega| + \sum_j \left( \frac{(y_{ij} - f_{ij})^2}{V_{ij}} + \log V_{ij} \right) \right] \]

The inner loop uses BFGS, falling back to Nelder-Mead simplex if BFGS fails.

Gradient route (analytic vs FD)

The BFGS gradient is the exact analytic η-gradient computed from hand-rolled forward sensitivities (the Dual2 type over the closed-form PK solutions) when the model is in scope, and central finite differences (FD) otherwise. The analytic route serves the analytical 1-/2-/3-cpt PK models and — since #410 — in-scope user-[odes] models (via a lighter Dual1 forward walk over the ODE solver), including multi-endpoint per-CMT Form-C readouts (y[CMT=N] = …, since #439) and Form-C readouts that reference a θ/η directly (e.g. y = central/V1 * (1 + ETA_CL) + TVBASE, since #486 — each bare θ/η in the readout is desugared into a hidden individual parameter so its ∂y/∂θ/∂y/∂η rides the same individual-parameter sensitivity chain; a readout referencing a neural-network output still falls back to FD). A structural parameter that reads the event-time built-in TIME/time (a NONMEM $PK IF (TIME.GE.45) CL=… time-dependent switch) is analytic since #486 too: the per-event time is threaded into the same event-driven Dual2 (outer) / Dual1 (inner EBE) walk used for time-varying covariates, so its ∂f/∂θ/∂f/∂η are exact rather than finite-differenced — covering closed-form (non-IOV and IOV — each occasion’s stacked [η_bsv, κ] seeding is evaluated at the event time) and ODE (non-IOV and IOV) models alike. TIME composes with the other event-driven-walk features too: a direct pk(...=TIME) mapping (desugared into a synthetic individual parameter, #486), a built-in absorption input-rate forcing (the walk carries R_in alongside the per-event time, #486), and a non-zero ODE init(...) baseline (the walk seeds the init state, #486) — none of these remains a TIME-specific fallback. It is resolved per subject, so individual subjects fall back to FD for features outside that scope. Log-transform-both-sides (LTBS) on analytical models is now analytic on both loops for the plain closed-form path — the inner EBE gradient applies the same g = ln(f) jet as the outer since #665, with the covariance step reconverging the EBEs tighter (cov_inner_tol, default min(inner_tol, 1e-8)) and, for a closed-form non-IOV LTBS fit, the fit’s inner loop tightened to at least 1e-6 so the standard errors of weakly-identified variances are reproducible. LTBS combined with time-varying covariates, a TIME-built-in structural parameter, or an η-dependent ExpressionScale obs_scale now also takes the analytic inner gradient — the event-driven / static inner walks apply the same ln(f) jet last, matching the outer (#486). For LTBS × IOV, the closed-form outer θ/Ω/σ gradient is now analytic too (#486 — the IOV sensitivity walk applies the ln(f) jet after its in-walk scale quotient, reproducing production’s scale-then-log order, including combined with an ExpressionScale obs_scale); only its inner EBE gradient stays on finite differences (analytic_inner_common_bail declines log_transform under IOV, since the Dual1 inner walk carries no ln jet). Other per-subject FD fallbacks include an input_rate forcing combined with an SDE [diffusion] block, or weibull combined with an estimated lagtime (its onset can diverge for shape β < 1) — plus, for the inner EBE gradient only, IIV-on-residual-error combined with M3 BLOQ on ODE models (the outer θ/Ω/σ gradient still serves it; plain ODE-LTBS and plain ODE iiv_on_ruv are analytic on both loops since #474, as the Dual1 ODE walk shares solve_ode_g with the objective). On analytical (closed-form 1-/2-/3-cpt) models, IIV-on-residual-error (iiv_on_ruv) is analytic on both loops, including combined with inter-occasion variability (the residual-variance scaling exp(2·η_ruv) and the η_ruv covariance column ride the stacked [η_bsv, κ] assembly), with a custom / time-varying residual-error magnitude (the magnitude’s direct-θ terms thread the residual-eta d/R coupling on the FOCEI outer gradient — non-IOV since #673, under IOV since #486, as the κ-augmented residual-eta assembly is dimension-generic), and with M3 BLOQ (the censored data term contributes its η_ruv first-order column, and the censored second-order cross-terms enter both the inner Hessian and the Laplace H̃ consistently with quantified rows, at FOCEI order — since #486); the non-IOV ODE iiv_on_ruv × M3 combination is analytic too (since #623). M3 BLOQ combined with inter-occasion variability is analytic across the full estimator matrix on closed-form models (FOCEI since #580; non-interaction FOCE and the triple M3 + IOV + iiv_on_ruv since #591): the censored −logΦ((LLOQ−f)/√v) data term and its f-derivatives ride the stacked [η_bsv, κ] assembly — entering the data gradient, the true inner Hessian, and (since #486) the Laplace H̃ at FOCEI order, consistently with quantified rows and exactly as on the non-IOV M3 path. Under FOCEI the inner EBE and the outer θ/Ω/σ gradient match reconverged FD of the censored FOCEI marginal; under FOCE the censored rows leave the augmented Sheiner–Beal marginal and re-enter as the marginal tail probability −logΦ((LLOQ−f0)/√R̃ⱼⱼ), R̃ⱼⱼ = HⱼΣ_bHⱼᵀ + R⁰ⱼ — the same linearized-marginal moments the quantified rows use (#646), matching the way Monolix’s linearization likelihood treats censored data — so plain FOCE stays a consistent Sheiner–Beal objective for the whole subject (no silent promotion to FOCEI). FOCEI (a conditional first-order method) instead keeps the censored term at the conditional f(η̂)/R⁰, so FOCE-M3 and FOCEI-M3 are genuinely different optima. FOCEI is the conditional-family method NONMEM’s METHOD=1 LAPLACE M3 belongs to (NONMEM cannot run M3 without LAPLACE) — but ferx’s FOCEI drops the ∂²f/∂η² second-order term that full Laplace keeps, so it matches NONMEM’s Laplace M3 only up to that FOCEI-vs-Laplace term (ferx has no full-Laplace M3 mode yet). Only the ODE M3 + IOV + iiv_on_ruv triple still falls back to FD. On analytical (closed-form 1-/2-/3-cpt) models, modeled-duration/rate doses (RATE=-1/-2 → D{cmt}/R{cmt}) are analytic on both loops since #486, closing the largest closed-form-vs-ODE gap (the ODE path had it via #530/#635): the modeled infusion window resolves from the PK parameters, so the infusion end is a moving boundary in D/R that the event-driven walk carries exactly — to second order — via the dual window length (the sign-mirror of the dose-start lagtime handling), for non-IOV and IOV (each occasion resolves its own window from the per-occasion PK jet, including a κ-coupled modeled slot), on both the outer θ/Ω/σ gradient and the inner EBE η-gradient. Modeled-dose × steady state is analytic too (#486): the closed-form dual SS equilibration threads the modeled-window jet (rate, dur) into each cycle’s active/quiet split, so the moving infusion-end flows through the SS trough exactly as it does through the current pulse (matching the ODE path). Only a rate-defined (RATE=-1) infusion under F ≠ 1 stays on FD (as on the ODE path). An η-dependent expression obs_scale is fully analytic on both loops (since #486): on analytical models the inner EBE gradient applies the η-block of the outer’s scale quotient rule; on ODE models the same quotient is applied on the static walk and on the event-driven walk (time-varying covariates and/or the TIME built-in) — the divisor is subject-static (production evaluates it at the subject covariate snapshot), so the non-IOV event-driven walk applies one subject-static jet since #486 — and the IOV ODE walk evaluates one scale jet per occasion group so IOV + time-varying covariates (and TIME) remains analytic since #590 (combined with LTBS it still falls back to FD). A constant ScalarScale output divisor (f/k) is analytic under IOV on both engines since #486 — the trivial covariate/η/κ-independent case of the same quotient: on analytical (closed-form) models it divides the final jet uniformly, and on ODE models the in-walk readout already divides p/k over the stacked dual. On ODE models the event-driven dual walk (since #439) is analytic for bolus dosing — with or without time-varying covariates, the TIME built-in, and IOV — and now also covers an estimated lagtime (bare LAGTIME/ALAG or per-compartment ALAG{n}, injected as an event-time saltation), finite-duration infusions, modeled-duration/rate doses (RATE=-1/-2 → D{cmt}/R{cmt}, whose infusion window is a moving boundary in the modeled parameter carried by a rate-off saltation — analytic since #530, and combined with IOV since #486, where each occasion resolves its own window from the per-occasion PK jet, including a κ-coupled modeled slot), EVID 3/4 resets, and EVID=2 covariate-only breakpoints on both the non-IOV and IOV ODE paths (κ fixed at zero on the IOV path), for both the outer θ/Ω/σ gradient and the inner EBE η-gradient (since #449/#590/#486). A built-in absorption input-rate forcing (igd/transit/weibull/first_order) combined with an EVID 3/4 reset or time-varying covariates is analytic since #486: the reset case threads the tracked reset_floor into the shared forcing helper, and the TV-cov case hoists the forcing’s dose-invariant constants fresh per segment as the PK snapshot changes. Combined with an estimated lagtime, igd/transit/first_order are also analytic since #486 — the continuous ∂R_in/∂lag flows through the walk’s dual time-after-dose, and the forcing’s onset at the dose’s lagged arrival is injected as an exact rate-on saltation. zero_order(dur) is also analytic combined with time-varying covariates or an estimated lagtime since #486: the event-driven walk now carries its moving-boundary window (delivered as a per-segment constant, with rate-on / rate-off saltations at the moving window start / end — the rate-off using the general g⁻ − g⁺ saltation so a covariate that varies across the window end is exact). Because the IOV walk is the same event-driven walk, every input-rate kind the non-IOV walk carries is analytic under IOV too (#486): the smooth densities igd/transit/weibull/first_order, the moving-window zero_order, and any composition (mixed, parallel) — each forcing’s rate/window is rebuilt from its dose’s own occasion-seeded PK jet, so κ rides through. Compartment-indexed bioavailability F{cmt} and lagtime ALAG{cmt} are analytic under IOV as well (the walk resolves each dose’s own compartment slot). The only input-rate holdouts under IOV are weibull + estimated lagtime (its onset diverges for shape β < 1, on every path) and any built-in absorption forcing combined with a steady-state dose (the dual SS equilibration does not spread a periodic zero-order window / first-order tail over the cycle). Steady-state (SS=1) dosing is analytic too (#473), including combined with a modeled-duration/rate dose, a rate-defined infusion under F ≠ 1, or an estimated lagtime (each closed by #486 — see Steady-state dosing); only SS combined with a non-autonomous RHS (reads TIME/TAFD/TAD) stays on FD. A time-varying-covariate + init(...) ODE model is analytic for the plain-bolus subset — the initial state is seeded at the subject’s first-record covariate snapshot (matching production, #486); init(...) combined with a reset / estimated lagtime / finite infusion / input-rate forcing / steady-state / modeled-dose still falls back to FD. It is the default (gradient_method = auto); one provider evaluation replaces FD’s ~2·n_eta+1 predictions per inner step, so it is exact and faster as the parameter count grows. Use gradient_method = fd only when you want to force finite differences.

The startup banner reports the route actually resolved across the population — not just the requested setting — so a silent analytic→FD fallback is visible:

  gradient: analytic (Dual2)  [requested: auto]

When the population splits across routes, the banner shows per-route subject counts, e.g. analytic (Dual2) ×118, FD ×3.

This gradient: line appears for gradient-driven estimators (FOCE/FOCEI/GN) and for imp, which reuses the EBE Hessian built via the same route. SAEM is sampling-based and reports its E-step kernel on a sampler: line instead — see SAEM.

If the EBE search wanders into a region where the individual NLL evaluates to a non-finite value (for example, an ODE model whose integration blows up at extreme $\eta$), that point is treated as the worst possible objective rather than aborting the fit. The subject is reported as non-converged and estimation continues for the remaining subjects.

Gradient accuracy vs cost (reconverged EBEs)

By default the population gradient holds each subject’s EBEs ($\hat\eta$) and FOCE Hessian fixed while perturbing the population parameters — the fixed-EBE gradient. This is cheap, but it omits the response of the inner solution to $\theta$ and $\Omega$. On well-conditioned problems that omitted term is negligible and a gradient optimizer matches the derivative-free bobyqa (see Outer Optimizers). On ill-conditioned problems it is not: the omitted term is what separates a true descent direction from a flat-looking plateau, and slsqp can report converged at an OFV far above the bobyqa optimum — which is exactly why the default optimizer = auto falls back to bobyqa on the FD-only fits where this bias bites.

Set reconverge_gradient_interval = 1 (in [fit_options]) to re-solve the inner EBE loop at every finite-difference perturbation, recovering the full surface. This costs roughly 5–6× per gradient — reserve it for fits whose OFV looks suspiciously high. IOV models (kappa/block_kappa) always reconverge, so the setting is a no-op there; it only changes non-IOV fits.

To amortize that cost, reconverge_gradient_interval = N reconverges only every N-th gradient evaluation and uses the cheap fixed-EBE gradient in between — the periodic correction is often enough to keep the optimizer off the plateau at a fraction of the always-on cost. The default 0 disables reconverging entirely (cheap fixed-EBE gradient).

Validation — cefepime pediatric population PK, 2-compartment IV infusion, combined error, no IOV (5937 subjects / 17380 observations). All ferx rows use FOCEI and the same likelihood convention, so OFVs are directly comparable:

Configuration	OFV	Wall time
`slsqp` (fixed-EBE gradient)	68,252	390 s
`slsqp` + reconverge every 10 (`interval = 10`)	66,118	633 s
`slsqp` + reconverge every 5 (`interval = 5`)	66,056	1,004 s
`slsqp` + reconverge every eval (`interval = 1`)	65,485	1,871 s
`bobyqa` (derivative-free, default)	65,598	315 s
ferx OFV evaluated at NONMEM’s final estimates	67,514	—

The fixed-EBE gradient stalls slsqp ~2,650 OFV units above bobyqa. Reconverging the EBEs on every gradient evaluation closes the entire gap and reaches an optimum marginally below bobyqa’s — and below the point NONMEM converged to (NONMEM’s estimates score 67,514 under ferx’s likelihood), confirming the stall was a gradient-bias artifact, not a worse model.

A larger reconverge_gradient_interval trades that accuracy back for speed: reconverging every 5th or 10th gradient still closes ~80% of the stall, but the OFV degrades monotonically as the interval grows (the cheap fixed-EBE gradients in between bias the direction enough that slsqp declares convergence slightly early). On this problem every interval setting is dominated by bobyqa on both OFV and wall time — which is why the default optimizer = auto selects bobyqa on FD-only fits like this one. The reconverged-slsqp path earns its keep when a gradient optimizer is required (e.g. parameter count high enough that derivative-free search scales poorly, or downstream tooling that consumes the optimizer’s gradient).

FOCE vs FOCEI

Standard FOCE

Uses linearized predictions around the EBEs:

\[ f_0 = f(\hat{\eta}) - H \hat{\eta} \]

where $H$ is the Jacobian matrix $\partial f / \partial \eta$. The per-subject objective is:

\[ \text{OFV}_i = (y - f_0)^T \tilde{R}^{-1} (y - f_0) + \log|\tilde{R}| \]

where $\tilde{R} = H \Omega H^T + R(f(\eta{=}0))$. The residual variance $R$ is evaluated at the population prediction $f(\eta{=}0)$ — matching NONMEM’s METHOD=1 (no INTER) semantics — not at the linearized point $f_0$. On a nonlinear model (e.g. oral absorption) the linearization $f_0 = f(\hat\eta) - H\hat\eta$ can extrapolate to a near-zero or negative concentration where the true typical-individual prediction is healthy; evaluating a proportional variance $(f\sigma)^2$ there collapses the weight and makes the objective multimodal with an indefinite covariance. $f(\eta{=}0)$ is always physically sensible, so FOCE+proportional converges deterministically and matches NONMEM FOCE. Additive error is unaffected ($R$ does not depend on $f$).

FOCEI (Interaction)

Uses individual predictions directly without linearization. ferx-core implements the Laplace approximation form from Almquist et al. (2015):

\[ \text{OFV}_i = (y - \hat{f})^T V^{-1} (y - \hat{f}) + \hat{\eta}^T \Omega^{-1} \hat{\eta} + \log|\tilde{R}| \]

FOCEI is more accurate when the residual variance depends on the predicted value (proportional or combined error models), because it captures the interaction between random effects and residual error.

Almquist, J., et al. (2015). Comparison of maximum a posteriori and conditional likelihood estimation in the FOCE method. PAGE 24, Abstr 3516.

Optimizer Options

The outer optimizer is configured independently of the estimation method. See Outer Optimizers for a full description of all available algorithms (SLSQP, BOBYQA, trust-region, BFGS, L-BFGS, MMA) and guidance on when to use each.

Set via [fit_options]:

[fit_options]
  method    = focei
  optimizer = slsqp    # override the default (bobyqa)

Global Search

Enable gradient-free pre-search to help escape local minima:

[fit_options]
  method        = foce
  global_search = true
  global_maxeval = 2000

The pre-search uses NLopt’s CRS2-LM algorithm (Controlled Random Search) to explore the parameter space before handing off to the local optimizer. The number of evaluations auto-scales with model complexity if global_maxeval is not set.

Covariance Step

When covariance = true, ferx-core computes the variance-covariance matrix of the parameter estimates (the R-matrix: the inverse observed Fisher information) at the converged solution. This provides:

Standard errors (SE) for all parameters
Relative standard errors (%RSE) for assessing estimation precision
Omega SEs via delta method from the Cholesky parameterization

How the Hessian is built

The covariance is 2·H⁻¹, where H is the finite-difference Hessian of the objective. Two details make the SEs match NONMEM’s $COVARIANCE:

The EBEs are reconverged at every perturbed point. Like NONMEM’s covariance step, ferx re-solves the inner conditional-estimation loop (warm-started from the converged EBEs) at each finite-difference point. Holding the EBEs fixed omits the response of the conditional optimum to the population parameters, which gives a Hessian with the wrong curvature — indefinite even on well-conditioned surfaces such as warfarin — and was the source of the spurious eigenvalue clipping in issue #129.
The Hessian is a reconverged second difference of the objective (issue #209): a 3-point diagonal / 4-point off-diagonal difference of the marginal OFV, with the EBEs reconverged at every perturbed point. Because it re-evaluates the full marginal end-to-end (recomputing a = ∂f/∂η and the log|H̃| EBE-response at each point), it captures the complete curvature with no envelope approximation and no held-fixed a, and serves FOCE, FOCEI and IOV — and additive/proportional/combined error — uniformly. An earlier release also offered a covariance_ofv_hessian = false path that took a central difference of the analytical population gradient (2·n_free gradient evaluations instead of ~2·n_free² objective evaluations); that stencil was a-fixed and only exact to first order in the log|H̃| term (it carried a ~9% TVKA bias and could explode SEs at non-converged points), so it was removed in #639 — the reconverged-OFV difference is now the sole stencil.

The factor of two is because the objective is −2·logL: its Hessian is twice the observed information, so the covariance needs 2·H⁻¹ to be on the right scale.

The covariance objective is 2·pop_nll for both FOCE and FOCEI — the Ω penalty is already inside each per-subject marginal and must not be added again. FOCEI’s Almquist–Laplace marginal carries η̂ᵀΩ⁻¹η̂ + log|Ω| explicitly; FOCE’s Sheiner–Beal marginal carries it through R̃ = HΩHᵀ + R (equivalent by the Woodbury identity). An earlier version added the Ω prior a second time on the FOCE path, double-counting Ω and under-stating the FOCE omega SEs by ~31% — fixed in issue #243.

For a mixed block + diagonal Ω, the structural-zero cross-block off-diagonals (which are not estimated) are excluded from the covariance parameter set like FIX parameters; otherwise their flat Hessian diagonal aborts the step. (The same #243 change exposed and fixed this for both FOCE and FOCEI.)

NONMEM cross-check

On data/warfarin.csv (10 subjects, 1-cpt oral, proportional error) against NONMEM 7.5.1 $COVARIANCE MATRIX=R, FOCEI ($EST METHOD=1 INTER):

Parameter	ferx SE	NONMEM SE	rel. diff
TVCL	0.00712	0.00710	+0.3%
TVV	0.2350	0.2401	−2.1%
TVKA	0.1506	0.1486	+1.3%
PROP_ERR (SD)	0.000832	0.000835	−0.4%
ω²(CL)	0.01291	0.01279	+0.9%
ω²(V)	0.00403	0.00431	−6.4%
ω²(KA)	0.1530	0.1504	+1.8%

and FOCE ($EST METHOD=1, no INTER), where ferx’s OFV/estimates already match NONMEM (OFV −280.17 vs −280.36) and, after the #243 covariance fix, the omega SEs do too:

Parameter	ferx SE	NONMEM SE	rel. diff
TVCL	0.00661	0.00663	−0.3%
TVV	0.2375	0.2340	+1.5%
TVKA	0.1209	0.1245	−2.8%
PROP_ERR (SD)	0.000922	0.000941	−2.0%
ω²(CL)	0.01312	0.01280	+2.5%
ω²(V)	0.00454	0.00430	+5.7%
ω²(KA)	0.1510	0.1607	−6.1%

(Analytic-Jacobian route; the FD-Jacobian fallback agrees to within ~10% on the ω block.) Both are guarded by tests/warfarin_covariance_nonmem.rs within a 20% band.

Covariance estimator: R, S, or sandwich

By default ferx reports the R-matrix covariance R⁻¹ (the inverse observed information), which assumes the model is correctly specified. covariance_method selects an alternative, mirroring NONMEM’s $COVARIANCE MATRIX=:

`covariance_method`	Estimator	NONMEM	Use when
`r` (default)	`R⁻¹`	`MATRIX=R`	Model is well-specified; you want the model-based SEs.
`s`	`S⁻¹`	`MATRIX=S`	You want the empirical-information (cross-product) SEs.
`rsr`	`R⁻¹ S R⁻¹`	`MATRIX=RSR`	You want SEs robust to model mis-specification (the Huber–White “sandwich”).

Here S = Σᵢ gᵢgᵢᵀ is the cross-product of the per-subject score vectors gᵢ = ∂(−logLᵢ)/∂θ — the same per-subject gradients the Gauss–Newton optimizer uses for its BHHH step. At the MLE of a correctly-specified model the information-matrix equality gives R ≈ S, so all three estimators converge to the same SEs; they diverge when the model is mis-specified, where rsr is the conservative choice.

[fit_options]
  method           = focei
  covariance       = true
  covariance_method = rsr

s and rsr are available for FOCEI, FOCE, and IOV fits. Note that s requires a positive-definite FD Hessian (it reuses r_inv; a non-PD R is not yet rerouted to S⁻¹ directly — see TODO in outer_optimizer.rs).

NONMEM cross-check (S / RSR, FOCEI)

Same warfarin FOCEI fit as the MATRIX=R table above, with two additional NONMEM covariance runs ($COVARIANCE MATRIX=S and MATRIX=RSR); SEs are the .ext ITERATION = -1000000001 row (#266):

Parameter	ferx `s`	NONMEM `MATRIX=S`	rel. diff	ferx `rsr`	NONMEM `MATRIX=RSR`	rel. diff
TVCL	0.01031	0.00930	+10.9%	0.00742	0.00710	+4.6%
TVV	0.4250	0.4602	−7.7%	0.2418	0.2403	+0.6%
TVKA	0.2157	0.2268	−4.9%	0.1555	0.1487	+4.5%
PROP_ERR (SD)	0.001519	0.001545	−1.7%	0.000790	0.000796	−0.7%
ω²(CL)	0.02005	0.01764	+13.7%	0.01170	0.01093	+7.1%
ω²(V)	0.007139	0.008287	−13.9%	0.003699	0.003978	−7.0%
ω²(KA)	0.2265	0.2596	−12.8%	0.1311	0.1396	−6.1%

(ci/FD build.) The s cross-product matches NONMEM within ~14% and rsr within ~7%. Guarded by covariance_se_matches_nonmem_s_rsr in tests/covariance_method_sandwich.rs (20% band for s, 15% for rsr). The per-subject score S is on the −logL (NLL) scale and, unlike R, carries no factor of two; the close agreement confirms that scaling is correct.

NONMEM cross-check (S / RSR, FOCE)

Same warfarin model, $EST METHOD=1 (no INTER), $COVARIANCE MATRIX=S and MATRIX=RSR (#250):

Parameter	ferx `s`	NONMEM `MATRIX=S`	rel. diff	ferx `rsr`	NONMEM `MATRIX=RSR`	rel. diff
TVCL	0.00824	0.00838	−1.6%	0.00720	0.00757	−4.9%
TVV	0.3822	0.3996	−4.4%	0.2552	0.2457	+3.8%
TVKA	0.1445	0.1495	−3.3%	0.1373	0.1328	+3.4%
PROP_ERR (SD)	0.001593	0.001573	+1.3%	0.000942	0.000947	−0.6%
ω²(CL)	0.01604	0.01771	−9.4%	0.01038	0.01092	−5.0%
ω²(V)	0.008768	0.008387	+4.5%	0.004057	0.003957	+2.5%
ω²(KA)	0.2443	0.2336	+4.5%	0.1460	0.1549	−5.7%

(ci/FD build.) All within ~10%. Guarded by covariance_se_matches_nonmem_foce_s_rsr in tests/covariance_method_sandwich.rs.

Note that ferx defaults to covariance_method = r whereas NONMEM’s $COVARIANCE default is the rsr sandwich; set covariance_method = rsr when reconciling SEs against a default NONMEM run.

Hessian regularization

Because the EBEs reconverge, the Hessian is positive-definite on well-conditioned surfaces and no regularization is needed. As a safety net for genuinely ill-conditioned or near-singular problems, ferx-core still inverts via a symmetric eigendecomposition and clips eigenvalues below max(λ_max · 1e-10, 1e-12) to that floor before reconstructing H⁻¹, adding a warning to FitResult.warnings:

Covariance step regularized: eigenvalue floor applied to FD Hessian
  (1 of 7 free-block eigenvalues clipped; min eig = -3.21e-08, floor = 4.56e-09).
Standard errors should be interpreted with care.

SEs in the regularised directions are inflated (since 1/λ_clipped is larger than the would-be true 1/λ). With the reconverging covariance step this warning should now be rare; if it fires, the surface is genuinely near-identifiable and the SEs in those directions warrant scrutiny. A Hessian whose spectrum is genuinely indefinite — every eigenvalue ≤ 0 — still fails outright with the standard Covariance step failed message.

SIR fallback for non-PD Hessians

When a model is poorly identified or converges to a saddle point, the FD Hessian can have every free-block eigenvalue ≤ 0, making Hessian inversion impossible. Setting covariance_fallback = sir triggers a SIR (Sampling Importance Resampling) run instead of leaving the covariance step as failed:

[fit_options]
  covariance_fallback = sir

ferx-core constructs a proposal covariance by taking the absolute values of the Hessian eigenvalues, inflating by 4× (heavier tails to account for the non-PD correction), and running the same SIR sampler used by sir = true. The result is reported as 95% credible intervals in the fit YAML; covariance_status is set to sir_fallback.

Use this option when: - The Hessian is non-PD at convergence due to model mis-specification or weak identifiability - You want distributional uncertainty estimates without re-fitting with a more constrained model - You need to proceed to a downstream analysis despite a non-PD Hessian

The SIR fallback inherits all sir_* settings (sir_samples, sir_df, sir_resamples, etc.) from [fit_options].

Convergence

The outer loop terminates when any of: - The gradient norm falls below outer_gtol (default 1e-6) - The maximum number of iterations (maxiter) is reached - The optimizer reports convergence (NLopt XtolReached or FtolReached)

The inner loop terminates when the gradient norm falls below inner_tol (default 1e-5, tight enough that residual EBE noise does not perturb the marginal objective the outer optimizer minimises — see fit-options.md for tuning guidance) or inner_maxiter (default 200) iterations are reached.

Warm Starting

The inner loop warm-starts EBE estimation from the previous outer iteration’s EBEs. This significantly reduces computation time, especially in later iterations when parameters change slowly.

Configuration Example

[fit_options]
  method     = focei
  maxiter    = 500
  covariance = true
  # optimizer omitted → uses the default (auto); set explicitly to switch