Model calibration: general concepts
ACBICI is a suite of Python routines that can be used to calibrate (complex) models, finding the optimal probability distributions of their unknown parameters, given some experimental data and the model itself. The library design favors clarity and simplicity, and thus users can adapt it to fit their own needs with minor modifications.
Before describing how to set up a calibration process, let us introduce a few concepts.
Models
A model is a mathematical expression, equation, or code that embodies the behavior of a system, a person, an object, etc. More formally, a model defines a mapping between an input space and a target space, usually with the goal of predicting or analyzing the behavior of the modeled system.
The input and target spaces can be, in general, abstract sets. In ACBICI, however, we restrict the input space to \(\mathbb{R}^d\), \(d\ge 1\), and the target space to \(\mathbb{R}\). That is, at this moment, the library accepts multidimensional inputs in a real vector space and only scalar outputs.
Often, models depend on parameters which we refer to as \(\theta\). These parameters belong to yet another space which in ACBICI can only be \(\mathbb{R}^p\) with \(p\ge1\). Both parameters and input variables determine the value of a model output but their roles are different. Input variables have values that are known, at least to a certain extent, coming from measurements, controls, pools, actuators, etc. In contrast, parameters are artificial variables introduced in the model to make it predictive, but need not be measurable. In addition, once the parameters are set in a model, they remain fixed for all inputs and predictions. We say that the model is calibrated and the accuracy of its predictions depends both on the goodness of the model itself as well as in the quality of the selected parameters.
The point of view of Bayesian calibration – and thus, ACBICI – pivots on the assumption that parameters are always random variables. Their probability distribution is often unknown and the goal of inference is to determine its optimal form, given the experimental evidence available and possibly some prior knowledge. In Bayesian statistics, the criterion for optimality comes from Bayes rule, which combines in a precise way the information from the experimental data, the prior information, and the model.
Calibrating a model
As indicated before, the goal of ACBICI is to calibrate models. To be specific, the model is denoted as \(g\), and it can refer to a simple mathematical function that can be written in analytical form or a complex code whose inner workings are not fully known. In any case, a model is a mapping \(\mathbb{R}^d\times \mathbb{R}^p \ni (x,\theta)\mapsto y=g(x;\theta)\in \mathbb{R}\), where \(x\) is referred to as the input variables and \(\theta\) are the parameters.
The goal of calibration is to find the optimal value of the parameters of a given model \(g\), when there is some experimental data in the form of pairs \((x_i,y_i)\in \mathbb{R}^d\times \mathbb{R}, i=1,2,...N\). Different notions of optimality produce different calibration results. In particular, in ACBICI, the library looks for the probability distribution for \(\theta\) that best combines the model, the prior information we have about this distribution, and the experimental data.
To calibrate a model in ACBICI three things are needed:
A model, that is, a Python function that can be called with some values for \(x\) and \(\theta\) and can calculate \(y=g(x;\theta)\). In some cases it is enough to have the results of evaluating the model for a rich enough set of pairs \((x_i,\theta_i)\). More about this later.
Experimental data stored in a data file containing pairs \((x_i,y_i)\).
Prior information about the probability distribution of the parameters \(\theta\). This information can be very precise or vague, but due to the Bayesian structure of the code some prior information is always required. In ACBICI, it is enough to select the type of prior information that one would like to consider since the library will provide helper functions to implement them.
To calibrate a model with ACBICI, thus, the user must prepare the three things enumerated above. In addition, the user must make three decisions that will determine the type of calibration that will be performed:
All experimental data has some error, irrespective of its origin. In ACBICI, experimental data is assumed to have a Gaussian probability distribution with zero mean. The user must decide whether the covariance of this error is known, and thus provide it, or it is unknown and should be estimated.
The user must assess whether the model we would like to calibrate is considered to be costly to run or now.
Finally, since all models are incomplete to a certain extent, the user might decide to accept the proposed model as good enough to represent the data or would like to estimate, in addition to all calibration tasks, the discrepancy of the model with the experimental data.
Let us provide some additional explanations regarding the two last topics.
Expensive or inexpensive models
To understand the need for this choice, we recall that in all kinds of Bayesian inference the goal is to obtain a posterior probability distribution which, in most cases, has no analytical expression. Rather, instead of a closed-form formula, in Bayesian calibration we obtain a (large) sample of points whose probability distribution is precisely the one we are looking for. To obtain it, a Markov chain Monte Carlo (MCMC) method needs to be employed. This type of methods requires evaluating the model hundreds or even thousands times. If the model that needs to be calibrated takes a long time to be evaluated, it might be prohibitively expensive to run MCMC with it and we then consider it as expensive.
In ACBICI, an expensive model is replaced by a Gaussian process that acts as surrogate. This meta-model is much faster to evaluate than the original model but its hyper-parameters need to be calibrated alongside the original parameters \(\theta\). Everything comes with a price: either you use the original model in the MCMC computations or you evaluate it a few times to calibrate the surrogate. For very expensive models it might be mandatory to use a meta-model at the expense of losing some accuracy in the overall calibration process. In these situations, the user has to supply a file with data triplets \((x_i,\theta_i,g(x_i;\theta_i))\) that are obtained evaluating the model \(g\) at carefully selected pairs \((x_i,\theta_i)\). These pairs must cover the input and parameter space well enough. More details will be provided later.
Discrepancy error
Every model, irrespective of its sophistication, ignores deliberatively or not details of the true system it represents, be it for simplicity or lack of knowledge. As a result, all models are inexact or, in other words, “All models are wrong, but some are useful” (G.E.P Box).
When performing the calibration of a model, we can ignore the limitations of the latter and insist in obtaining the best possible distribution for each of its parameters. It is also possible, however, to model also the discrepancy between the experimental results and the model predictions. This discrepancy error in ACBICI is also represented by a Gaussian process that, if present, provides additional hyperparameters that need to be calibrated, thus increasing the cost of the process.
In summary, depending on the calibration choices we might need to build a surrogate and the discrepancy model. The following table summarizes all the calibration types in ACBICI:
Name |
Expensive model |
Discrepancy surrogate |
Hyperparameters |
|---|---|---|---|
Type A |
NO |
NO |
None |
Type B |
YES |
NO |
3 (meta-model) |
Type C |
NO |
YES |
2 (discrepancy) |
Type D |
YES |
YES |
3 (metamodel) + 2 (discrepancy) |
Let us note that the complexity of calibrating a model grows with the number of parameters and hyperparameters. In fact, the larger this number is, the more costly is to evaluate the calibration model and the larger the number of MCMC steps to be performed. This comment should caution the user from using calibration strategies C and D. On the other hand, the knowledge of the discrepancy error can be a useful quantity to ascertain, in a global way, the adequacy of a model to reproduce experimental data.
GUI
ACBICI can be run using a graphical user interface (GUI) with all default options.
To do so, create a new folder inside the examples directory, for example:
examples/myexamples/subexample
In addition, you must provide a model file defining your model, following the same
structure as in example0/model.py.
Your experimental data file must be placed inside a folder named data in the
current directory. The data format must match that of the example experimental data
files provided with ACBICI.
Once the model and data files are prepared, run gui.py. From the GUI, you can
select the model file, the data file, and the type of calibration to perform, and
specify the model parameters as defined in your model class.
Outputs from the calibration
Once a model has been calibrated, there is a wealth of information that can be obtained from the posterior probability distribution. In fact, ACBICI computes the joint probability distribution of all the parameters and hyperparameters so any statistic can be, in principle, obtained from the former.
Parameter statistics, including the mean, variance, and maximum a posteriori estimate (MAP).
Corner plots: the outcome of the calibration is a multi-dimensional probability distribution for all the parameters of the model and hyperparameters of the meta-model. To better understand the correlations between these variables, a corner plot is depicted that includes all the one- and two-dimensional projections of the sample.
Correlation plots: when calibrating a model with multiple parameters and/or hyperparameters, it might be interesting to assess the strength of the correlation among variables. A correlation plot shows with a color scheme the value of the Pearson correlation parameter for each pair of calibrated variables. Highly correlated variables might indicate that they are not independent and thus might lead to issues in the calibration.
Comparison plots: a plot is generated that depicts the experimental data and the output of the calibrated model, as a curve. This plot is not shown when the dimension of the input space is larger than one.
Pearson plot: The correlation between all parameters and hyperparameters is shown as a heat map. Large correlations (+1 or -1) in off-diagonal blocks are indication that the corresponding parameter are not independent, signaling potential errors.
Prior/posterior plots: This plot shows the prior distribution functions for each parameter and hyperparameter next to its posterior probability densities. One should pay attention to situations in which the posterior is very similar to the prior in which case the sampled data might not have provided enough information to update the former.
Prediction plot: This plot shows a histogram of the errors made by the calibrated model as compared with the experimental data. Also the mean error in the predictions is indicated.
Trace plot: This plot shows, for each parameter and hyperparameter, its value along the MCMC chain. Users should pay attention to results that are not well-mixed, meaning that the variables do not explore their potential values. In these situations, the MCMC algorithm may have failed to identify regions of large likelihood and thus failing to sample the posterior.
Convergence
MCMC diagnostics
The calibration report includes several standard Markov chain Monte Carlo (MCMC) diagnostics that help assess whether the posterior samples are reliable.
Effective Sample Size (ESS): The report provides the ESS for each parameter as well as a mean ESS value. ESS estimates how many effectively independent samples the chain contains after accounting for autocorrelation. High ESS values generally indicate that the posterior summaries (means, credible intervals, etc.) are well estimated.
split-:math:`hat{R}` convergence diagnostic: For each parameter we report the split-\(\hat{R}\) statistic. Values very close to \(1\) indicate that multiple chains have mixed well and are consistent with sampling from the same stationary distribution.
Autocorrelation time (:math:`tau`): The mean integrated autocorrelation time \(\tau\) describes how quickly the sampler explores the posterior distribution. Shorter \(\tau\) implies better sampling efficiency and efficient exploration. We also use \(\tau\) as a practical convergence / run-length diagnostic: the chain should be long enough compared to \(\tau\) to ensure stable estimates.
More details are available in the emcee autocorrelation tutorial and in Goodman & Weare (2010), Ensemble samplers with affine invariance.
Note
In addition to the scalar diagnostics above, we generate autocorrelation plots and trace plots.
Autocorrelation plots: should show that autocorrelation decays with lag and that the estimated integrated autocorrelation time stabilizes as the chain length increases.
Trace plots: show each parameter value versus step. Trace plots are used as a quick visual diagnostic to assess mixing, stationarity after burn-in, and the presence of long-term trends or poor exploration.
Gelman-Rubin statistics plots: summarize the split-\(\hat{R}\) convergence diagnostics for all calibrated and surrogate-related parameters. Values close to one and below the indicated threshold lines, indicate that no parameter exhibits problematic convergence behavior and that the MCMC chains have mixed satisfactorily.
Warning
If you see an error like:
emcee.autocorr.AutocorrError: The chain is shorter than 50 times the integrated autocorrelation time for 1 parameter(s).
Use this estimate with caution and run a longer chain! N / 50 = 22;
tau: [22.14675566 23.67994754]
the chain is too short relative to the estimated \(\tau\). In that case, rerun the calibration with at least:
n_mcmc = max(tau) * len(p) * (1 + burnFraction) * nwalkers
For example:
24 * 2 * 1.2 * 32 = 1843
(rounded to full digits) gives \(n_\mathrm{mcmc}=1843\) steps. Repeat until the run is long enough that the autocorrelation-time estimate is valid and stable.
VBMC diagnostics
For variational inference with VBMC, convergence is assessed using diagnostics that are specific to optimization-based posterior approximation, rather than MCMC chain behavior. VBMC iteratively refines a variational posterior and monitors several quantities to determine when the approximation has stabilized.
Evidence Lower Bound (ELBO): During optimization, VBMC tracks the evidence lower bound (ELBO), which serves as the objective function for variational inference. Convergence is indicated when the ELBO reaches a plateau and subsequent changes become small relative to the estimated ELBO uncertainty (standard deviation). A stabilized ELBO suggests that further optimization is unlikely to substantially improve the posterior approximation.
Symmetrized KL divergence (sKL): VBMC monitors the symmetrized Kullback–Leibler (sKL) divergence between variational posteriors from successive iterations. As optimization progresses, the sKL should decrease toward zero, indicating that the variational posterior is no longer changing appreciably and has reached a stable solution.
Reliability (convergence) index (:math:`r_mathrm{index}`): The ELBO stability and inter-iteration sKL are combined into a single reliability (convergence) index, \(r_\mathrm{index}\). In :texttt{PyVBMC}, inference is terminated only once \(r_\mathrm{index} < 1\) and remains below this threshold for a prescribed stability window. This criterion ensures that both the optimization objective and the variational posterior have converged.
Note
VBMC diagnostics play a role analogous to ESS, \(\hat{R}\), and autocorrelation time in MCMC. While MCMC diagnostics assess sampling efficiency and mixing, VBMC diagnostics assess the stability and accuracy of the variational approximation during optimization.
In practice, stabilized ELBO values, near-zero inter-iteration sKL, and a low \(r_\mathrm{index}\) collectively indicate robust convergence for VBMC. In well-behaved analyses, the resulting marginal posterior distributions should exhibit clear peaks for most parameters, signaling that the algorithm has accurately approximated the dominant modes of the posterior distribution.
Multi-Output Calibration
When not explicitly stated, ACBICI operates on single-output calibration problems. Nevertheless, the framework naturally extends to multi-output calibration.
With the exception of the comparison plot, all features described above are also available in the multi-output setting.
Multi-output calibration in ACBICI is performed by augmenting the input space with a discrete output (task) index, which reduces the multi-output problem to an equivalent single-output formulation. The augmented input space is
where \(n_{\text{tasks}}\) denotes the number of calibration outputs.
Each output is associated with a task-specific model \(m_i : \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}\), for \(i = 0,\ldots,n_{\text{tasks}}-1\). These are combined into a single effective model on the augmented space,
defined as
where \(\delta_{ij}\) is the Kronecker delta. For a fixed task index \(i\), the corresponding restriction of \(m\) coincides with the original model \(m_i\).
In practice, vector-valued experimental and synthetic datasets are mapped to the augmented input space by replicating each input once per task and assigning the corresponding task index. The multi-output observations are then stacked into a single scalar response vector, enabling calibration with a joint likelihood.
When surrogate models are used (types B, C, and D), this augmentation requires a task-dependent covariance function. In ACBICI, a similarity kernel is employed, where a Matérn 3/2 kernel is used when the task indices match (\(i=j\)), and the covariance is zero otherwise:
Here, \(k_{\text{m32}}\) denotes the Matérn 3/2 kernel parameterized by \(\lambda\) and \(\beta\).
In multi-output calibration, nondimensionalization and output scaling are especially important. Since all outputs are weighted equally in this formulation, they must be normalized to comparable scales to ensure balanced and meaningful calibration results.