In Young Cho, Vincent Li

Since we have reached the conclusion of this course, I think it’s only proper to make a post that surveys the popular optimization packages to discover which algorithms are actually being used by researchers and the industry.

**Unconstrained Optimization**

*Nelder Mead Method*

This is a gradient-less method. It relies on the use of simplices, i.e. polytopes of dimension $n+1$. It updates the vertices of the simplices at each iteration. When tested on the Rosenbrock function, it took 85 iterations.

*Powell*

Also a gradient-less method, the Powell method works by searching along directional vectors then updating these vectors by the displacement vectors.

*quasi-Newton BFGS*

BFGS refers to the Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula for updating the approximation of the Hessian matrix. A variant is also popular: the L-BFGS (Limited memory BFGS)

*Newton Conjugate-Gradient*

This is a modification of the Newton’s method where Newton system is only approximated solved using conjugate gradients.

*Conjugate Gradient*

This is an exact implementation of what we mentioned in the theory section.

*Trick: Trust Region*

The trust-region algorithm is used for unconstrained nonlinear problems and is especially useful for large-scale problems where sparsity or structure can be exploited. A trust region is where the approximation to the objective function is close to the objective function. An example would be to restrict the approximation region of the Levenberg–Marquardt algorithm so that any one step does not go berserk, resulting in smaller but more trusted steps.

**Bounded Optimization**

*Truncated Newton Conjugate-Gradient*

This is a slight variant of Newton Conjugate-Gradient by truncating each step in order to keep each varaible within their bounds.

*L-BFGS-B*

Limited Memory Algorithm for Bound Constrained Optimization. This is a modification over the L-BFGS algorithm menntioned above. It deals well with large-scale problems.

**Constrained Optimization**

*Constrained Optimization BY Linear Approximation (COBYLA)*

This algorithm is based on linear approximations to the objective function and each constraint. This gives a linear program to solve. However the linear approximations are likely only good approximations near the current simplex, so the linear program is given the further requirement that the solution must be close to the previous solution.

*Sequential Least Squares Programming (SLSQP)*

As the name suggests, this is an improvement over COBYLA by approximing the objectives with quadratic formulation. The method is used on problems for which the objective function and the constraints are twice continuously differentiable. It approximates the objective with a quadratic model and approximates the constraints with linear models.

*Interior Point Methods*

The interior point algorithm is especially useful for large-scale problems that have sparsity or structure, and tolerates user-defined objective and constraint function evaluation failures. The Hessian can be approximated with BFGS or L-BFGS.

*Trick: active-set*

Active set algorithms determine which constraints relate to the optimum, therefore reducing the complexity of the search. SQP can be considered an active-set algorithm.

**Global Optimization**

*simulated annealing*

This a one of the family of Monte Carlo methods. They work by hopping around regions with possible local minima. In the case of simulated annealing, the hopping around is controlled by each region’s temperature (or distance t o termination criteria).

*basin hopping*

Another member that brings pride to our Monte Carlo family. The algorithm is iterative with each cycle composed of the following features, random perturbation of the coordinates, local minimization, accept or reject the new coordinates based on the minimized function value.

The evaluation of these algorithms on a variety of functions give these heuristics for applying these algorithms to optimization problems. In general, when the Hessian is known, the Newton-method is preferred. When the gradient is known, BFGS or L-BFGS is preferred – the computational overhead of BFGS is larger than that for L-BFGS, which is in turn larger than that for CG. Hwoever, BFGS usualy needs fewer functions evaluations than CG, so that CG is better than BFGS at optimizing computationally cheap functions, and vice versa. When the gradient is not known, BFGS or L-BFGS is still preferred, even if gradients have to be numerically approximated. Powell and Nelder-Mead work well in high-dimensions, but do quite poorly with ill-conditioned problems.