# Bayesian Optimization & Its Applications Part I

## Introduction

Imagine being given an optimization problem, such that, we can obtain a sample $x_i(\in \mathbb{R}^n)$ from the problem and obtain a cost $f(x_i)$. The problem is noisy and non-linear (i.e. has multiple local and global minima) and is costly to evaluate. Basically, we have no idea about what function $f$ to fit to such a problem, but we really want a global optimum! How do we approach such a problem?

Well, we could start by randomly-sampling from the problem, until we obtain a optimum. This approach is not, very intelligent or interesting or practical (sample evaluation is costly). Can we do better?

We could apply an intelligent sampling strategy which cover a lot of optimization landscape, for example, Monte-Carlo Methods. This makes things a little more interesting, but now can we exploit some structure in our samples (e.g. sample covariances) or could we evolve a better sample, given $n$ good samples?

Sure, Evolutionary Algorithms, esp, CMA-ES can be applied. But wouldn’t it be great if we had someone (an acquisition function) who tells us to sample more (explore) in those areas of the optimization landscape, where we have no samples and at the same time also sample more (exploit) in those areas, where we have local optimums? Yes, that would be great!

Keeping all these requirements in mind, we introduce Bayesian Optimization (BO). Since, this is an elaborate topic we divide it into a two part series. This part builds up the basics required for BO i.e. Bayes’ Theorem and Gaussian Processes (GPs). The part II of the blog will introduce BO, the use of acquisition functions, exploration and exploitation strategies and BO’s applications.

## Introduction to Bayes’ Theorem

Building up on the context described in Introduction, let us assume that our optimization problem can be modeled using a function $f$. Also, lets say we are given a few samples (data) $x_i$, where $i \to 1:N$ and $\mathbf{X}^T$ is a vector of such samples. Then the Bayes’ Theorem suggests that:

$\underbrace{P(f|\mathbf{X}^T)}_\text{Posterior} \propto \underbrace{P(\mathbf{X}^T|f)}_\text{Likelihood} \underbrace{P(f)}_\text{Prior}$

Here although we do not know the true function which models our optimization problem, we assume it and the prior represents this, our confidence in $f$. While the posterior represents our confidence about the function $f$, given (conditioned) the samples $\mathbf{X}^T$.

More details in Wikipedia article on Bayes’ Theorem and Brochu 2010 et. al.

## Multi-Variate Normal (MVN) Distribution

Before we formally define GPs, lets take a detour into some results about MVN distributions (See Murphy 2012 el. al., Chapter 4). If $x = (x_1, x_2)$ are jointly Gaussian with parameters, $\mu = \begin{pmatrix} \mu_{1}\\ \mu_{2} \end{pmatrix}$, $\Sigma = \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix}$ and $\wedge = \Sigma^{-1}$. Then,

$\underbrace{p(x_1)}_\text{Marginal} =\mathcal{N}(x_1|\mu_1, \Sigma_{11})$,
$\underbrace{p(x_1|x_2)}_\text{Posterior}=\mathcal{N}(x_1,|\mu_{1|2}, \Sigma_{1|2})$
$\mu_{1|2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)$
$\Sigma_{1|2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$

## Introduction to Gaussian Process (GP)

If a MVN represents scalar and vectors, then intuitively, a GP represents a distributions which models functions. These functions have a Gaussian prior. Say, that we are given samples as follows $\{(x_i, f_i), i=1:N\}$ and a test sample $X_{*}$ of size $N_*$, then GP says that,

$\begin{pmatrix} f\\f_{*} \end{pmatrix} \sim \mathcal{N}\begin{pmatrix} \begin{pmatrix} \mu\\\mu_* \end{pmatrix}, \begin{pmatrix} K&K_*\\K_*^T&K_{**} \end{pmatrix}\end{pmatrix},$

where $K=\kappa(X, X)$ is $N\times N$, $K_*=\kappa(X, X_*)$ is $N\times N$, and $K_{**}=\kappa(X_*, X_*)$ is $N_*\times N_*$. $\kappa$ is also called the Kernel function and for example can be given by,
$\kappa(x_1, x_2) = \sigma_f^2 e^{\frac{-1}{2l^{2}}(x_1-x_2)^2},$ where $\sigma_f^2$ and $l^2$ are the parameter of the kernel function. Then, using definitions from Section on MVN, we get:

$P(f_*|X_*, X, f) = \mathcal{N}(f_*|\mu_*, \Sigma_*)$
$\mu_*=\mu(X_*) + K_*^T K^{-1}(f-\mu(X))$
$\Sigma_* = K_{**}-K_*^T K^{-1}K_*$

Hence, we see that a new the function $f_*$ can be estimated, given the old data $\{(x_i, f_i)\}$ and new sample $X_*$. This process of modeling data at MVNs and deriving a distribution (function) for new data is collectively GP modeling. Also, see lectures on Gaussian Processes by Nando de Fretias.