Importance Sampling (IMP)
Maturity: beta — see Feature Maturity for what this means.
imp is, by default, a Monte-Carlo EM estimator — the equivalent of NONMEM $EST METHOD=IMP. It maximises the importance-sampled marginal likelihood, updating θ/Ω/σ each iteration. Set imp_eval_only = true (NONMEM EONLY=1) to instead evaluate the marginal log-likelihood
\[ -2 \log L(\theta) \;=\; -2 \sum_i \log p(y_i \mid \theta) \;=\; -2 \sum_i \log \int p(y_i \mid \eta, \theta)\, p(\eta \mid \theta)\, d\eta \]
at the fixed input parameters without estimating them.
⚠️ Behaviour change. Before this release
imponly evaluated−2 log Lat fixed parameters (theimp_eval_onlybehaviour). It is now an estimator by default. If you usedmethod = imp(or[focei, imp]) purely to score a fit, addimp_eval_only = trueto keep the old behaviour.
Mapping to NONMEM
| ferx | NONMEM |
|---|---|
method = imp (default) |
$EST METHOD=IMP (estimator) |
method = imp + imp_eval_only = true |
$EST METHOD=IMP EONLY=1 (evaluate) |
method = impmap |
$EST METHOD=IMPMAP (estimator) |
The IS Monte-Carlo estimator of p(yᵢ|θ) is unbiased; the reported −2 log L carries a small Jensen bias from the log(·) transform that vanishes as K → ∞ and is typically dominated by the much larger Laplace-approximation bias of ofv in the sparse-data / strongly-nonlinear regime IMP targets.
How the estimator works
IMP is a Monte-Carlo EM (MCEM) loop sharing all of its machinery with IMPMAP except the proposal re-centering strategy:
- On the first iteration it finds each subject’s conditional mode and first-order variance via a FOCE inner loop, and centres the importance-sampling proposal there.
- On every subsequent iteration it re-centres the proposal from the previous iteration’s importance-sample mean and covariance — it does not re-run the inner loop. This is the one-line difference from IMPMAP (which re-derives the mode/variance every iteration) and the source of IMP’s lower per-iteration cost.
Each iteration’s M-step updates θ/Ω/σ from the importance-weighted posterior moments (closed-form Ω and log-mu-referenced θ; a small BOBYQA step for σ and non-mu-ref θ), exactly as in IMPMAP. The reported estimate is the running mean of the parameter vector over the final imp_averaging iterations, and ofv is a final FOCE Laplace pass for AIC/BIC comparability.
Which number matches NONMEM’s reported OFV? NONMEM
METHOD=IMPreports the importance-sampling Monte-Carlo marginal−2 log L(the.ext/.lst#OBJV), not a Laplace value. ferx’sofvis a Laplace pass, so it will not equal NONMEM’s IMP#OBJVon data where the Laplace approximation and the true marginal diverge (sparse / strongly nonlinear). The NONMEM-comparable number is the marginal, evaluated at the final estimates and surfaced onFitResult.importance_sampling.minus2_log_likelihood(with its Monte-Carlo SE on.mc_standard_error) for estimatingimp/impmapruns too — not only the evaluation-only path. If IMPMAP is configured with a Gaussian proposal (impmap_proposal_df = normal), it is replaced by a finite-tproposal for this final marginal eval, so the heavier tails keep the importance weights bounded.
Rich data: prefer IMPMAP or warm-start
Because IMP’s proposal lags one iteration behind the parameters, it is fragile on rich data: when the conditional posterior of η is razor-sharp (many observations per subject), a large early M-step moves the posterior past the lagged proposal and the effective sample size collapses. This is the documented weakness of METHOD=IMP and the reason NONMEM offers IMPMAP. On rich data, either use impmap (re-centres every iteration, robust) or warm-start IMP from FOCEI (methods = [focei, imp]) so the per-iteration steps are small and the lagged proposal stays overlapped. IMP is well-suited to sparse data, where the broad posterior keeps the proposal overlapped.
[fit_options]
method = imp # estimate by importance-sampling MCEM
imp_iterations = 200 # MCEM iterations
imp_samples = 1000 # K, samples per subject per iteration
imp_averaging = 50 # terminal iterations to average
imp_proposal_df = 5 # ν, Student-t tail weight (or `normal` → MVN)
imp_seed = 12345
[fit_options]
method = [focei, imp] # robust on rich data: warm-start IMP from FOCEI
Evaluation-only mode (imp_eval_only = true)
With imp_eval_only = true, imp does not estimate: it evaluates the IS −2 log L at the fixed input parameters and reports it on FitResult.importance_sampling (handy for scoring imported / fixed parameter sets, or comparing models on a lower-bias likelihood than the Laplace OFV).
When evaluation-only is useful
The Laplace approximation behind the FOCE/FOCEI OFV assumes each subject’s posterior of η is Gaussian-shaped — fine for well-sampled PK (≥ 5–6 obs per subject) but biased for sparse data, strong nonlinearity (Michaelis–Menten, TMDD, transit absorption), or categorical/binary PD. There the FOCE OFV is typically under-stated, so naïve AIC/BIC favours over-parameterised models; the IS −2 log L is much lower-bias.
[fit_options]
method = [focei, imp] # estimate with FOCEI, then score the IS-LL
imp_eval_only = true
imp_samples = 2000
imp_proposal_df = 5
[fit_options]
method = imp # standalone: IS-LL at the initial parameters
imp_eval_only = true
Rules for the evaluation-only mode:
- Standalone is allowed.
method = imp+imp_eval_onlyevaluates the IS-LL at the initial parameters — deriving the EBEs/Jacobian (the proposal centre/scale) there via a FOCE inner loop. When chained after an estimator ([focei, imp]), it uses that stage’s EBEs/Jacobian instead. - Must appear at most once (true for both modes).
- Must be terminal.
methods = [imp, focei]withimp_eval_onlyis rejected — a following stage would overwrite the parameters and make the IS result meaningless. (The estimatingimphas no such restriction; it may lead or sit mid-chain like any estimator.)
For the estimating imp, FitResult.method reports IMP. For an evaluation-only imp after an estimator it reports that estimating stage (e.g. FOCEI); standalone evaluation-only reports IMP. Either way FitResult.method_chain preserves the full chain, and the IS-LL lands on FitResult.importance_sampling.
Algorithm
This is the per-subject importance-sampling kernel — the E-step of the estimator and the whole of the evaluation-only mode. The estimating loop simply re-runs this kernel each iteration (re-centring the proposal from the previous iteration’s sample moments) and feeds the weighted moments into the M-step.
For each subject i with proposal centre η̂ᵢ (the EBE on the first iteration / in evaluation-only mode) and inner-loop Jacobian Jᵢ = ∂f/∂η at η̂ᵢ:
Build proposal scale Σᵢ:
- Compute Sheiner–Beal posterior Hessian \(H_i = J_i^\top R_i^{-1} J_i + \Omega^{-1}\) at η̂ᵢ (\(R_i\) = diagonal residual variance at the EBE).
- Add a small ridge \(\lambda I\) for numerical safety (λ = max(10⁻⁶ · trace(H)/d, 10⁻¹⁰)).
- Cholesky-factor; on failure fall back to Σᵢ = Ω (broad prior-scale proposal — gives a valid but noisy LL estimate).
Draw K samples \(\eta_{ik} \sim t_\nu(\hat\eta_i, \Sigma_i)\) (multivariate Student-t).
Compute log importance weights: \[ \log w_{ik} = \log p(y_i \mid \eta_{ik}, \theta) + \log p(\eta_{ik} \mid \theta) - \log q(\eta_{ik}). \]
Subject marginal LL via log-sum-exp: \[ \log \hat p(y_i \mid \theta) \;=\; \operatorname{lse}_k \log w_{ik} - \log K. \]
Per-subject effective sample size: \(\mathrm{ESS}_i = 1 / \sum_k \tilde w_{ik}^2\) (normalised weights). The result reports the across-subject min and median of ESS/K, plus any subjects below
imp_low_ess_threshold(default 10%).Monte-Carlo standard error. For a self-normalised IS estimator the asymptotic per-subject variance (Geweke 1989) is \[ \operatorname{Var}\bigl(\log \hat p(y_i \mid \theta)\bigr) \;\approx\; \frac{1}{K}\left(\frac{1}{\mathrm{ESS}_i / K} - 1\right). \] Aggregating across subjects (LL is a sum of independent per-subject log-marginals) and converting to the
−2 log Lscale, \[ \operatorname{SE}(-2 \log L_{IS}) \;=\; 2 \sqrt{\sum_i \operatorname{Var}\bigl(\log \hat p(y_i \mid \theta)\bigr)}. \] Subjects with degenerate ESS (\(\mathrm{ESS}_i / K = 0\) — complete proposal collapse) fall back to a per-subject variance of1.0, which produces a finite but inflated SE rather than a NaN. A separate warning is emitted onFitResult.warningslisting the collapsed subjects; the IS-LL itself remains usable as a point estimate.Population −2 log L =
−2 · Σᵢ log p̂(yᵢ | θ), reported onFitResult.importance_sampling.minus2_log_likelihoodtogether withmc_standard_error.
Tuning
imp_iterations/imp_averaging(estimator only) set the number of MCEM iterations and how many terminal iterations are averaged into the reported estimate. Defaults 200 / 50. Under-running shows up as estimates still drifting with the MC noise; raiseimp_iterationsif the trace hasn’t stabilised.- K =
imp_samplescontrols accuracy. The MC SE scales as \(1/\sqrt{K}\); halve it by quadrupling K. Default 1000 is fine for a smoke test; bump to 2000–5000 for any reported LL. - ν =
imp_proposal_dfcontrols the proposal tails. The default of 5 is robust to mild proposal misspecification (Geweke 1989); raise toward 30+ to recover a near-Gaussian proposal when the posterior is known to be light-tailed. - Low-ESS subjects signal a proposal that doesn’t match the posterior well — usually a sign the EBE didn’t converge or the Hessian was near-singular. The IS estimate is still unbiased, just noisier. Investigate by re-running with a tighter
inner_tolor inspecting theFitResult.subjects[i]diagnostics. - Reproducibility.
imp_seeddefaults to42when unset (so two fits with the sameimp_samplesandimp_proposal_dfproduce identical−2 log L). Set it explicitly to vary the RNG stream. The full set ofimp_*defaults is listed in the[fit_options]reference.
Defensive mixture
A weakly-identified subject — e.g. an analytical [initial_conditions] baseline whose V cancels in the amplitude — can produce a proposal so badly centred that a single importance sample carries almost all the weight. That one sample then hijacks the weighted M-step and walks θ to the parameter bounds; the run “diverges” to a finite-but-enormous OFV. (This is issue #528; the OFV blow-up is now caught by the convergence check, which flags an objective ≥ 1e15 as not converged for both IMP/IMPMAP and SAEM.)
The optional defensive mixture (imp_defensive_alpha, default 0) hardens the sampler against this. With weight \(\alpha \in [0,1)\) the proposal becomes
\[ q(\eta) = (1-\alpha)\,q_\text{narrow}(\eta) + \alpha\,q_\text{broad}(\eta), \]
where \(q_\text{broad} = \mathcal N(0, \Omega)\) is the prior (for the FREM Rao-Blackwell path it is the conditional PK prior \(\mathcal N(\mu, P_{pp}^{-1})\)). An \(\alpha\) fraction of draws come from \(q_\text{broad}\), and every sample is scored under the full mixture density (the balance heuristic). Because the prior provably covers the conditional posterior, this bounds each importance weight by \(p(y\mid\eta)/\alpha\), so no single sample can dominate — keeping the population estimates identifiable even when the narrow proposal is poor.
Practical notes:
- Opt-in / NONMEM parity. The default
0reproduces the single-proposal sampler exactly (same numbers, same RNG stream) and stays bit-comparable with NONMEM, which has no defensive mixture. Enable it (e.g.0.1) only for models that exhibit the runaway above. Changing \(\alpha\) changes the reported IS−2 log L, so re-validate against your reference when you turn it on. - It bounds the weights, not the raw ESS. A sharp interior likelihood spike can still keep ESS low; the mixture only guarantees no single sample runs away.
- ESS-floor side effect. Roughly \(\alpha\) of each subject’s samples now come from the broad prior, which raises the per-subject ESS fraction. A genuinely weakly-identified subject that would have been listed under
imp_low_ess_thresholdmay no longer trip it — read the low-ESS diagnostic with that in mind when \(\alpha > 0\). - Sobol QMC is disabled while the mixture is active (the per-sample broad/narrow branch breaks the quasi-random sequence); IMPMAP emits a warning if you set both
impmap_sobol = trueandimp_defensive_alpha > 0. - For an IMPMAP stage the option may also be written
impmap_defensive_alpha(an alias for the same field).
SDE / [diffusion] models (not supported)
IMP refuses to run on models with a [diffusion] block. The EKF process-noise variance that inflates the residual variance for SDE models is not yet threaded through the IS observation-likelihood path, so the marginal would be silently biased. Use FOCE / FOCEI for the Laplace OFV on SDE models; IMP for SDE is tracked as a follow-up.
IOV (inter-occasion variability)
IOV is supported only in evaluation-only mode (
imp_eval_only = true). The estimatingimp(likeimpmap) refuses IOV models for now — the κ M-step is a planned follow-up. Use SAEM or FOCEI to estimate IOV models, then score with an evaluation-onlyimp.
For models with kappa declarations, evaluation-only IMP performs joint sampling of both η (between-subject variability) and κ (between-occasion variability). The proposal is built from the full (n_eta + n_kappa × n_occasions) posterior Hessian, and the IS weights include both the η and κ priors.
The result struct flags this via kappa_treatment = "marginalized" (YAML) and the CLI prints a notice. The reported −2 log L integrates over both η and κ uncertainty, making it directly comparable to NONMEM $EST METHOD=IMP LAPLACIAN=1 on κ.
Cost
K × n_subjects predict calls per IMP run. For warfarin (32 subjects) with K=2000 this is ≈ 64k predicts — seconds to a minute on a typical laptop. The cost is linear in K and trivially parallel over subjects (uses the same Rayon worker pool as the inner loop).
Output
In the CLI:
--- Importance Sampling (marginal log-likelihood) ---
-2 log L (IS): 1284.52 (MC SE = 0.41, K = 2000, ν = 5)
ESS / K: min = 0.18, median = 0.62
Low-ESS subjects (3): 12=0.18, 27=0.21, 34=0.22
In the fit YAML:
importance_sampling:
minus2_log_likelihood: 1284.524300
mc_standard_error: 0.412700
n_samples: 2000
proposal_df: 5.0000
ess_min: 0.1800
ess_median: 0.6200
kappa_treatment: marginalized # or not_applicable if no IOV
low_ess_subjects:
- id: "12"
ess_fraction: 0.1800
- id: "27"
ess_fraction: 0.2100
- id: "34"
ess_fraction: 0.2200The Rust FitResult.importance_sampling field carries the same data typed as Option<ImportanceSamplingResult>.
NONMEM comparison
ferx’s estimating imp is validated against NONMEM 7.5.1 METHOD=IMP on the warfarin example (10-subject extract, data/warfarin.csv). Both engines are warm-started from a FOCE/FOCEI pass (ferx methods = [focei, imp]; NONMEM $EST METHOD=COND INTERACTION → $EST METHOD=IMP INTERACTION NITER=100 ISAMPLE=1000 SEED=12345, control stream tests/nonmem/warfarin_imp.ctl).
| Parameter | NONMEM METHOD=IMP |
ferx [focei, imp] |
|---|---|---|
| TVCL | 0.1264 | 0.1327 |
| TVV | 7.723 | 7.737 |
| TVKA | 0.8857 | 0.811 |
| ω²(CL) | 0.0304 | 0.0286 |
| ω²(V) | 0.0096 | 0.0096 |
| ω²(KA) | 0.3405 | 0.336 |
| σ (SD) | 0.0105 | 0.0106 |
IMP marginal −2 log L |
−285.69 | −285.93 ± 0.07 |
Laplace ofv |
(−286.00 at COND) | −286.00 |
Both engines start from the same FOCEI basin (NONMEM METHOD=COND OFV −286.00, identical to ferx’s FOCEI). NONMEM’s IMP MCEM then drifts slightly off it toward the importance-sampled marginal optimum (TVKA up to 0.886), while ferx’s warm-started IMP holds near the FOCEI optimum — both stable and agreeing within the cross-engine + Monte-Carlo margin. TVKA is the least-identified parameter on this small extract (ETA_KA variance ≈ 0.34, high shrinkage); CL/V and the variance components agree to a few percent.
Compare like with like: NONMEM’s reported IMP #OBJV (−285.69) is the importance-sampling marginal −2 log L, so the matching ferx number is importance_sampling.minus2_log_likelihood (−285.93 ± 0.07), not ferx’s Laplace ofv (−286.00, which instead coincides with NONMEM’s COND OBJ). The small residual gap is the parameter-estimate difference (ferx’s TVKA sits a little below NONMEM’s), not an OFV-definition difference — both objectives drop the same Nobs·log(2π) constant (NONMEM “WITHOUT CONSTANT”). The cross-check lives in tests/warfarin_imp_nonmem.rs.