I decided to share something that I’ve been working on in the field of dimensionality reduction.

In 1987, Peña and Box published a simple dimensionality reduction technique that, in my opinion, has been vastly underrated. Assuming some minimal knowledge of time series analysis, I’ll present the beautiful results of this antique.

is observable data, is underlying factors of lower dimension than . We believe that the observed data comes from a linear combination of components of the factor. And the goal is to recover the underlying factors.

We assume .

We assume the components of the factor are independent. An important corollary is that the autoregressive and moving-average coefficient matrices of the ARMA model are all diagonal.

We assume to eliminate indeterminacy. This does not alter the time series structure of the problem and is simply a rescaling of the . The proof follows from the SVD.

Theorem 1: Representation of

Theorem 2: Autocovariance of

The autocovariance matrix of has rank r,

Theorem 3: Canonical Transformation

Suppose we are given matrix . Then we can transform into by the following procedure.

Define Transformation Matrix

Then applying to , , we have

is our holy grail – the underlying factors plus some noise. is the non-informational dimension of and should be discarded.
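The eigendecomposition behind this canonical transformation can be sketched numerically. The following is only an illustrative reconstruction: the function name, the choice to pool lagged autocovariance products, and the lag cutoff are my assumptions, not the paper’s exact estimator.

```python
import numpy as np

def pena_box_factors(z, r, max_lag=5):
    """Sketch of a Pena-Box style canonical transformation.

    z: (T, m) array of observed series; r: assumed number of factors.
    Pools lagged autocovariance matrices, rotates the data onto the
    eigenvectors, and treats the trailing m - r directions as noise.
    """
    z = z - z.mean(axis=0)
    T, m = z.shape
    # pooled, symmetrized lagged autocovariances (PSD by construction)
    C = np.zeros((m, m))
    for k in range(1, max_lag + 1):
        gamma_k = z[k:].T @ z[:-k] / T
        C += gamma_k @ gamma_k.T
    # eigenvectors sorted by decreasing eigenvalue form the transformation
    eigvals, eigvecs = np.linalg.eigh(C)
    M = eigvecs[:, np.argsort(eigvals)[::-1]]
    transformed = z @ M
    factors = transformed[:, :r]   # informative directions
    noise = transformed[:, r:]     # non-informational, to be discarded
    return factors, noise, M
```

With a strongly autocorrelated factor driving several observed series, the leading direction recovers the factor up to scale and sign.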

Theorem 4: Non-Uniqueness of Representation of

Representation of is not unique.

This preserves the time series structure for any nonsingular matrix A of appropriate dimension. The transformed ARMA matrices are:


Since we have reached the conclusion of this course, I think it’s only proper to make a post that surveys the popular optimization packages to discover which algorithms are actually being used by researchers and the industry.

**Unconstrained Optimization**

*Nelder Mead Method*

This is a gradient-less method. It relies on the use of simplices, i.e. polytopes with $n+1$ vertices in dimension $n$, and updates the vertices of the simplex at each iteration. When tested on the Rosenbrock function, it took 85 iterations.
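For instance, with SciPy (iteration counts depend on the starting point and tolerances, so they need not match the figure quoted above):

```python
import numpy as np
from scipy.optimize import minimize, rosen

# Minimize the Rosenbrock function with the gradient-free Nelder-Mead method.
x0 = np.array([1.3, 0.7])
res = minimize(rosen, x0, method='Nelder-Mead',
               options={'xatol': 1e-8, 'fatol': 1e-8})
print(res.x, res.nit)  # converges to the minimum at (1, 1)
```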

*Powell*

Also a gradient-less method, the Powell method works by searching along directional vectors then updating these vectors by the displacement vectors.

*quasi-Newton BFGS*

BFGS refers to the Broyden-Fletcher-Goldfarb-Shanno formula for updating the approximation of the Hessian matrix. A variant is also popular: L-BFGS (limited-memory BFGS).

*Newton Conjugate-Gradient*

This is a modification of Newton’s method where the Newton system is only approximately solved using conjugate gradients.

*Conjugate Gradient*

This is an exact implementation of what we mentioned in the theory section.

*Trick: Trust Region*

The trust-region algorithm is used for unconstrained nonlinear problems and is especially useful for large-scale problems where sparsity or structure can be exploited. A trust region is where the approximation to the objective function is close to the objective function. An example would be to restrict the approximation region of the Levenberg–Marquardt algorithm so that any one step does not go berserk, resulting in smaller but more trusted steps.

**Bounded Optimization**

*Truncated Newton Conjugate-Gradient*

This is a slight variant of Newton Conjugate-Gradient that truncates each step in order to keep each variable within its bounds.

*L-BFGS-B*

Limited-Memory Algorithm for Bound Constrained Optimization. This is a modification of the L-BFGS algorithm mentioned above. It deals well with large-scale problems.
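A short SciPy sketch, assuming the standard `scipy.optimize` interface; the box here is chosen so the unconstrained minimum (1, 1) lies outside it:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Minimize Rosenbrock subject to simple box bounds with L-BFGS-B.
# Since the bound x <= 0.9 excludes the unconstrained minimum,
# the solution lands on the boundary at (0.9, 0.81).
bounds = [(-2.0, 0.9), (-2.0, 2.0)]
res = minimize(rosen, x0=np.array([-1.0, 1.0]), jac=rosen_der,
               method='L-BFGS-B', bounds=bounds)
print(res.x)
```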

**Constrained Optimization**

*Constrained Optimization BY Linear Approximation (COBYLA)*

This algorithm is based on linear approximations to the objective function and each constraint. This gives a linear program to solve. However the linear approximations are likely only good approximations near the current simplex, so the linear program is given the further requirement that the solution must be close to the previous solution.

*Sequential Least Squares Programming (SLSQP)*

As the name suggests, this is an improvement over COBYLA: it approximates the objective with a quadratic model while still approximating the constraints with linear models. The method is used on problems for which the objective function and the constraints are twice continuously differentiable.
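A minimal SLSQP example with SciPy; the toy objective and constraint are mine, for illustration:

```python
from scipy.optimize import minimize

# Minimize (x - 1)^2 + (y - 2.5)^2 subject to x + y <= 2 and x, y >= 0.
# The unconstrained minimum (1, 2.5) violates x + y <= 2, so the optimum
# is its projection onto the line x + y = 2, namely (0.25, 1.75).
objective = lambda v: (v[0] - 1) ** 2 + (v[1] - 2.5) ** 2
cons = ({'type': 'ineq', 'fun': lambda v: 2 - v[0] - v[1]},)  # 2 - x - y >= 0
res = minimize(objective, x0=[0.0, 0.0], method='SLSQP',
               bounds=[(0, None), (0, None)], constraints=cons)
print(res.x)
```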

*Interior Point Methods*

The interior point algorithm is especially useful for large-scale problems that have sparsity or structure, and tolerates user-defined objective and constraint function evaluation failures. The Hessian can be approximated with BFGS or L-BFGS.

*Trick: active-set*

Active-set algorithms determine which constraints are active at the optimum, thereby reducing the complexity of the search. SQP can be considered an active-set algorithm.

**Global Optimization**

*simulated annealing*

This is one of a family of Monte Carlo methods. They work by hopping around regions with possible local minima. In the case of simulated annealing, the hopping is controlled by a temperature (which decreases toward the termination criterion).

*basin hopping*

Another member that brings pride to our Monte Carlo family. The algorithm is iterative, with each cycle composed of the following steps: random perturbation of the coordinates, local minimization, and acceptance or rejection of the new coordinates based on the minimized function value.
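A small SciPy illustration on a 1-D function with several local minima (the test function resembles the standard `basinhopping` docs example; the function choice is an assumption on my part):

```python
import numpy as np
from scipy.optimize import basinhopping

# Each basin-hopping cycle: random perturbation -> local minimization ->
# Metropolis-style accept/reject of the new coordinates.
f = lambda x: np.cos(14.5 * x - 0.3) + (x + 0.2) * x
res = basinhopping(f, x0=1.0, niter=200)
print(res.x, res.fun)  # global minimum near x ~ -0.195
```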

The evaluation of these algorithms on a variety of functions gives these heuristics for applying them to optimization problems. In general, when the Hessian is known, Newton’s method is preferred. When the gradient is known, BFGS or L-BFGS is preferred – the computational overhead of BFGS is larger than that of L-BFGS, which is in turn larger than that of CG. However, BFGS usually needs fewer function evaluations than CG, so CG is better than BFGS at optimizing computationally cheap functions, and vice versa. When the gradient is not known, BFGS or L-BFGS is still preferred, even if gradients have to be numerically approximated. Powell and Nelder-Mead work well in high dimensions, but do quite poorly on ill-conditioned problems.


How do we go about solving the Knight’s Tour then?

**Warnsdorff’s Rule**

One approach to solving the Knight’s Tour is Warnsdorff’s rule, which stipulates that we always move the knight to the square from which it will have the fewest onward moves. Drawing from an excellent paper by Doug Squirrel, I was able to calculate a Knight’s Tour for a board of standard size 8×8.

**Extensions**

Warnsdorff’s Rule works well on boards of size 50 or less but experiences difficulty finding a tour past that point. Two improvements (W+ and W2) have been introduced as better methods that draw on the same principles as Warnsdorff’s Rule.

Additionally, in the event of a tie between two squares, various tie-breaking algorithms have been developed to address that situation.

```python
# Knight's Tour using Warnsdorff's Rule
# http://en.wikipedia.org/wiki/Knight's_tour
import random
from heapq import heappush, heappop

# height and width of the chessboard
height = 8
width = 8
cb = [[0 for x in range(width)] for y in range(height)]  # chessboard

# possible direction combinations for the knight
dx = [-2, -1, 1, 2, -2, -1, 1, 2]
dy = [1, 2, 2, 1, -1, -2, -2, -1]

# start the knight randomly on the board
kx = random.randint(0, width - 1)
ky = random.randint(0, height - 1)

for k in range(width * height):
    cb[ky][kx] = k + 1
    p_queue = []  # available neighbors queue
    for i in range(8):
        nx = kx + dx[i]
        ny = ky + dy[i]
        if 0 <= nx < width and 0 <= ny < height and cb[ny][nx] == 0:
            # count the available neighbors of the potential next stop
            ctr = 0
            for j in range(8):
                ex = nx + dx[j]
                ey = ny + dy[j]
                if 0 <= ex < width and 0 <= ey < height and cb[ey][ex] == 0:
                    ctr += 1
            heappush(p_queue, (ctr, i))
    # move to the neighbor that has the minimum number of available neighbors
    if p_queue:
        (p, m) = heappop(p_queue)
        kx += dx[m]
        ky += dy[m]
    else:
        break

# print the chessboard
for cy in range(height):
    print(' '.join(str(cb[cy][cx]).rjust(2) for cx in range(width)))
```


For our final project, James Yu and I (Irene Chen) examined hierarchical text classification and its application to Wikipedia article data. Inspired by a related Kaggle challenge, we were interested in expanding beyond our problem set, which classified Twitter data into binary categories: instead, each article can be classified into one or many categories. As James explained earlier, Wikipedia uses hierarchies as the classification method for organizing text documents. That is, an article can be labeled as “France” or “Germany,” and knowledge about which of those country labels it has makes it more or less likely that “Paris” or “Berlin” is also a label. It is important to note that not every category nests neatly, but hierarchical classification algorithms use pre-existing knowledge about classifications to make category predictions. In our project, we wanted to improve upon multiclass SVM performance by incorporating principles outlined by Wang and Casasent (2008).

**Hierarchical Model**

Although hierarchical models have the potential to become much more complex, we implemented an exploratory approach by training a classifier on higher-level categories (of which there are fewer) so that we could detect early on if an article does not belong to a higher-level category and therefore discard all lower-level categories. As we can see in the image below, the Wikipedia dataset to which we applied our model consists of a few higher-level categories and many lower-level leaves.

We take a bottom-up approach to resolving the hierarchy structure of the classes. Starting from each leaf node , we determine $latex p_{ij}^1 \in P : p_{ij}^1 \to c_i$; that is to say, we find the set of corresponding one-level-up parents of each leaf node $latex c_i$. The process can be repeated until each path reaches a terminal parent node so that

$latex p_{ij}^k = \emptyset : p_{ij}^k \to p_{ij}^{k-1}$

After reaching all terminal parent nodes, then by transforming the result classification vector from $latex Y = C \to P^k$ where $latex k$ is the highest degree of the terminal parent, we can train our SVM using the transformed vector to get a classifier that will map feature vectors of the test set onto posterior probability space. The overall idea here is that for each node, we can convert the outputs from the classifier into posterior probabilities. Based on some threshold requirement, we can determine which path along the hierarchy tree to follow, and thus ultimately arrive at the final leaf classes. In the case of an SVM, we can map the SVM outputs for the validation set data to a probability $latex p \in [0,1]$. To reduce the effect of outliers, we can use a sigmoid mapping

$latex P(y=1|t) = \dfrac{1}{1 + \exp(at+b)}$

where $latex t$ is the SVM output, and $latex y$ is the known class label. To estimate $latex a,b$, we can use the maximum likelihood estimate since the validation data is drawn independently.

$latex \arg \max_{a,b} \prod_i P(y^i | t^i,a,b)$

Pragmatically, we can maximize the log likelihood and use gradient ascent to find $latex a,b$. Further, leveraging the hierarchy structure allows us to avoid calculations for the feature vectors we already ruled out at the previous parent level, and thus each subsequent descent of the hierarchy requires fewer calculations. Nevertheless, this approach is still afflicted with the curse of dimensionality due to the sheer number of features as well as the depth of the hierarchy and size of the class set. Since our goal is ultimately to correctly classify the feature vectors based on leaf classes only, once we arrive at $latex P^1$, the one-level-up parent of the leaf class, then again based on threshold requirements we narrow down the samples of feature vectors. We perform a multilabel classification scheme upon the narrowed set of feature vectors to receive the final classification.
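As a concrete toy sketch of the gradient-ascent fit described above — the data, learning rate, and iteration count here are illustrative assumptions, not our project’s actual settings:

```python
import numpy as np

def fit_platt(t, y, lr=0.01, n_iter=5000):
    """Fit P(y=1|t) = 1 / (1 + exp(a*t + b)) by gradient ascent on the
    log-likelihood (a toy version of Platt scaling for SVM outputs)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(a * t + b))
        # gradient of sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)];
        # with this sigmoid's sign convention, dL/da = sum (p - y) * t
        grad_a = np.sum((p - y) * t)
        grad_b = np.sum(p - y)
        a += lr * grad_a
        b += lr * grad_b
    return a, b
```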

One thing to note is that this approach, while more computationally intensive, gives the flexibility of classifying according to all nodes in the hierarchy, not just the leaf nodes, which could prove useful in another context. As we considered how to apply this to our dataset given our memory constraints (see below), we chose to only use the hierarchical approach for $latex k=2$ to avoid computational intensity while still comparing the usefulness and accuracy of the method.

Due to the limitation in sample size, as well as the vastness of the hierarchy tree, it appeared that almost all leaf classes had a set of parents, where the number of parents any leaf node had in common with other leaf nodes in the sample was minimal. This of course does not help us reduce computational complexity. To simplify the problem further, we determined the $latex p$ most frequent parents and discarded the remainder, transforming our result vector to reflect only those $latex p$ parents. This transformation allows us to finally leverage the hierarchy structure and thus narrow our feature vector set with the process described above.

**Evaluation Metrics**

We use the Macro F1-score to measure the performance of all the methods. The Macro F1-score is a conventional metric used to evaluate classification decisions. Let $latex tp_{c_i}, fp_{c_i}, fn_{c_i}$ be the true positives, false positives, and false negatives respectively for class $latex c_i$. The macro-averaged precision and recall are

$latex MaP = \dfrac{\sum_{i=1}^{|C|} \dfrac{tp_{c_i}}{tp_{c_i} + fp_{c_i}}}{|C|} $

$latex MaR = \dfrac{\sum_{i=1}^{|C|} \dfrac{tp_{c_i}}{tp_{c_i} + fn_{c_i}}}{|C|}$

and the macro-averaged F1 $latex MaF$ is

$latex MaF = \dfrac{2 \cdot MaP \cdot MaR}{MaP + MaR}$

Note that $latex MaP$ and $latex MaR$ correspond to the precision and recall of the classification method, respectively. Precision ($latex MaP$) refers to the number of correct results divided by the number of all returned results, while recall ($latex MaR$) is the number of correct results divided by the number of results that should have been returned.

Our chosen evaluation metric can then be interpreted as the harmonic mean of the precision and recall, where the Macro F1-score is at best 1 and at worst 0.
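As a minimal sketch (the function name is mine), the metric can be computed from per-class counts by macro-averaging precision and recall and then combining them:

```python
import numpy as np

def macro_f1(tp, fp, fn):
    """Macro-averaged F1 from per-class true positives, false positives,
    and false negatives (arrays indexed by class); averages the per-class
    precisions and recalls first, then takes their harmonic mean."""
    tp, fp, fn = map(np.asarray, (tp, fp, fn))
    map_ = np.mean(tp / (tp + fp))   # macro precision
    mar = np.mean(tp / (tp + fn))    # macro recall
    return 2 * map_ * mar / (map_ + mar)
```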

**Results**

On our reduced subset of the Wikipedia dataset, we found that the unadjusted sample performed with a , the Multiclass SVM model performed with , the simplistic Hierarchical model performed with . Although our hierarchical model did not perform as well as expected, we believe with a more significant selection of data or more advanced exploitation of the hierarchical structure (e.g. ), it is possible we could achieve better results; however, because of the constraint on computational power, we were not able to explore that realm.


**Abstract**

Modeling in atmospheric transport, astrophysics, diagnostics, genomics, materials science, engineering, and a variety of other systems faces a significant constraint in optimizing the distribution of computing power and process sequences in order to appropriately represent forward and inverse model states. Once data is obtained, a vast amount of computational power is necessary to study these relationships. Improved methods of appropriately discerning optimal substructures have the potential to significantly reduce required computational resources while preserving a high degree of overall model accuracy.

Dimension reduction has been well-received as a means of producing cost-effective representations of large-scale systems. In order to reduce the total number of calculations necessary to model a time series in high-dimensional space, methods of extrapolation are commonly used by practitioners as a means by which to reduce computational expense. In these instances, the discretization of partial differential equations (PDEs) aids in optimal control, probability analysis, and inverse problems requiring multiple iterations of system simulation. This iterative approach necessitates that the reduced model produce an accurate representation of the system across a large number of parameters. Dynamic programming has contributed to these model acceleration methods through the isolation of identical problems normally solved many times by the model, known as overlapping sub-problems.

An illustration of the advantages afforded by the implementation of select methods described is simulated using an atmospheric chemical transport application. A global atmospheric model is constructed using empirical time-series data from the Emissions Database for Global Atmospheric Research (EDGAR), and the concentration of chemical species is determined using inverse modeling. In order to improve computation performance, the Jacobian matrix is optimized using a quasi-Newton variable metric algorithm. The model then produces a representation of chemical species concentration in different hemispheres and atmospheric layers across time at significantly reduced computational cost.

Use of a Jacobian matrix and spatial reduction both compartmentalize a given model into two separate regions, with independent variables being solved at each time step and dependent variables either relying upon a system of PDEs or being identified as overlapping sub-problems and subsequently eliminated entirely by dynamic programming. As one or more regions of the system run the model at a normal speed (“fast”) using conventional modeling methods and solving the full mechanism of all equations, other sections run pre-specified components of the model at an accelerated speed using extrapolation (“slow”). These functions both lead to PDEs, requiring numerical analysis techniques to arrive at an optimal solution. The separate sections are then paired in order to extrapolate basic knowledge of the system without having to run every component of the model, substantially reducing computational strain and overall run-time.

The most pronounced drawback of spatial reduction models is that the nature of extrapolation produces additional error, and can lead to highly inaccurate results over the course of the model due to compounding. When applied to models involving time series in high dimensional space, the error produced by extrapolation is often too high to justify the computational savings. For this reason, successful implementation of spatial reduction models has been limited to those with an advanced understanding of the underlying system, in this case primarily geophysicists with atmospheric transport modeling expertise. Improvements in constraining this error without excessive manual adjustments to the model can be achieved by isolating optimal substructures and introducing a trainer.

**Application**

A large-scale system with some dynamical substructures is desirable for the purposes of examining the range and degree of impact of the algorithm framework relative to both basic extrapolation and standard computation methods. Though generally applicable to many natural sciences systems, this analysis shall focus upon a global atmospheric modeling application due to the expansive size and diverse nature of the system, constant geophysical laws, variety of chemical species, recognizable regions, presence of overlapping substructures, and ease of discretization.

This simulation considers the long-lived greenhouse gas . The radiative forcing of is one of the largest of all greenhouse gases, third behind only and . In the case of a tropospheric species harmful to the human respiratory system, point source attribution is essential to maintaining or improving public health in the affected area.

Atmospheric concentration of is estimated using a five box model with linear transport and removal that has been fit to a time series of observations. In order to evaluate model performance, chemical impulses were used to simulate a change in emissions during periods which would not have ordinarily been predicted by an extrapolation method.

Linear transport coefficients were optimized by in-situ concentration observations of , which was selected because it is known to be correlated to concentrations of . This was deliberate not simply in the hope of ensuring an accurate final model output, but also due to the fact that the two species likely have a variety of overlapping sub-problems and are less likely to diverge significantly during accelerated extrapolation. Transport between boxes in the model is fully constrained by four parameters (stratospheric turnover time, interhemispheric exchange time, intrahemispheric exchange time, and the fraction of stratospheric air extruded into the Northern Hemisphere), global mass conservation, and the assumption that the mass of each box remains constant in time.

The Jacobian is ideal in dynamical systems for which and for inverse functions, and can be plainly written as . Given that the Jacobian generalizes the gradient of a scalar-valued function of multiple variables, even if a function is not fully differentiable at a point , the Jacobian may still be defined there, since only the partial derivatives are required. In that case the Jacobian would still give the coordinates of the derivative, .

The Jacobian relates to a first order Taylor expansion in that is a point in , and given that is differentiable at , , enabling a linear map to approximate in the neighborhood of s.t. the following holds, where and the distance between and = :
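For reference, the standard first-order expansion being invoked here, written in generic notation (the symbols below are mine, since the original ones were not specified): if $latex f$ is differentiable at a point $latex a$, then

$latex f(x) = f(a) + J_f(a)(x - a) + o(\lVert x - a \rVert)$

so the linear map $latex v \mapsto J_f(a)\,v$ is exactly the best linear approximation of $latex f$ in a neighborhood of $latex a$.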

For these reasons, the Jacobian is used commonly in global physical sciences modeling. In the application used for this project, the first and second order derivatives of independent variables are known. In such scenarios, most mainstream atmospheric models simply use the Jacobian to create a PDE matrix, solve the inverse problem, and adjust parameters as necessary to obtain a highly accurate fit.

After an initial estimate of the parameter values, the Jacobian matrix is constructed, and the standard objective function is minimized in order to solve the inverse problem. The gradient can then be expressed in a manner equivalent to ordinary least squares:

Here the Jacobian itself is performing a reduction function in that the configuration of partial derivatives creates a system in which dependent variables are already being partitioned as overlapping sub-problems and computational resources are not being expended upon optimizing them. A Nelder-Mead algorithm is then implemented in order to find the optimal initial starting chemical species concentration in each of the atmospheric layers.

Given the inverse has now been established and the optimal parameters are known, the model is run and compared to empirical data from the Intergovernmental Panel on Climate Change (IPCC). The values are found to be consistent for concentrations, and therefore the transport values are known to be accurate. The system is then applied to forcing values, and the model-predicted figures again align with EDGAR observations and third party calculations.

Evaluating means of reducing computational expense, a scenario is presented in which no in-depth familiarity with the system is assumed. A trainer and Greedy algorithm are then used in a vector autoregression (VAR) scenario in an attempt to reduce computational expense.

**Findings**

The inverse model accurately reflects the time-series concentrations of , and therefore optimization of the transport parameters and initial concentrations has been solved for both chemical species. Implementation of VAR and the Greedy algorithm performed as expected, though the exercise highlighted the fact that in most scenarios, the analyst must have a high degree of familiarity with the technical aspects of the system in order to observe and remedy any anomalies in the data or model trajectory.

Recommendations for spatial reduction techniques going forward include increased specification in trainers, ranking/weighting depending upon past performance and known correlations or other period/parameter-specific relationships, use of partial derivatives as a prompt to sample from true observations in areas of decreased stability, and improved means of isolating optimal substructures.


Hello! In 2013, Madeleine Udell and Stephen Boyd defined a class of nonconvex, NP-hard problems called `sigmoidal programs`, which resemble convex programs but allow a controlled deviation from convexity in the objective function. Maximizing the sum of sigmoidal functions over convex sets is a problem with many applications, including problems dealing with decreasing marginal returns to investment, mathematical marketing, network bandwidth allocation, revenue optimization, optimal bidding, and lottery design. In this paper, we concentrate on maximizing the sum of sigmoidal functions as related to the Affordable Care Act. We demonstrate the hardness of sigmoidal programming and discuss possible approximation algorithms. Additionally, in their paper, Udell and Boyd described an approximation algorithm to find a globally optimal approximate solution to the problem of maximizing a sum of sigmoidal functions over a convex constraint set. To demonstrate the power of the algorithm, we compute the optimal state ACA allocations which may allow states in the US to maximize their overall level of health, as measured by birth rates, number of cancer cases, number of cancer deaths, and number of nursing homes.
**Sigmoids**

We consider the sigmoidal programming problem, as considered by Udell and Boyd, namely

where is a sigmoidal function for each , and the variable is constrained to lie in a nonempty bounded closed convex set .

A continuous function is defined to be *sigmoidal* if it is either convex, concave, or convex for and concave for . Equivalently, we call a function sigmoidal if it can be written as the integral of a quasi-concave, bounded function on the same domain.

**Motivation for Sigmoids**

The effect of funding on societal health can be described using sigmoidal functions: society expects to enjoy increasing marginal returns as the amount of money given to each individual increases and their level of health increases. However, as more resources are allocated to a single individual, the marginal return for that individual will eventually diminish. Further, it is not a bad assumption that eventually the marginal return for that individual will increase again — at very large increases in money allocated to each individual, the funds available begin to enable very expensive cures to be possible, completely eradicating issues and permanently increasing the level of health of that individual. In particular, in this project, we look at one specific type of sigmoidal function: the `logistic function`.

The logistic function is given by the equation

$latex f(x) = \dfrac{L}{1 + e^{-k(x - x_0)}}$

where $latex L$ is the curve’s maximum value, $latex k$ the growth rate, and $latex x_0$ the midpoint.

The initial state of growth is approximately exponential; then, as saturation begins, the growth slows, and at long time, the growth stops. The logistic function has a large number of applications, particularly within economics. In particular, it describes the impact of ACA funding on the general state of health well: when funding is introduced there is dramatic improvement in quality of healthcare received, which leads to a period of rapid growth in the effect on health indicators. Eventually, dramatic improvement opportunities are exhausted, however, and the effect of increased funding stabilizes.

**Solving the Sigmoidal Program with ACA Data**

Madeleine Udell developed a Python package to solve sigmoidal programs based on this method, which we use to solve our problem.

To use Udell’s solver, we first fit coefficients for each of the four logistic functions, corresponding to the four indicators, for each state. Then, we averaged the coefficients for each state to acquire one set of coefficients for a single objective function representing that state. The solution to the objective function then represents the fraction of each dollar of government grant that would maximize the impact of the grant in that state.
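The per-state fitting step can be sketched with SciPy; the arrays and coefficient values below are hypothetical placeholders, not our actual ACA indicator data:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, L, k, x0):
    """General logistic curve: maximum L, growth rate k, midpoint x0."""
    return L / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical (funding, health-indicator) observations for one state.
funding = np.linspace(0, 10, 50)
observed = logistic(funding, 3.0, 1.2, 5.0) \
    + 0.05 * np.random.default_rng(0).normal(size=50)

# Best-fit coefficients for this indicator; repeated per indicator and
# then averaged per state, as described above.
coeffs, _ = curve_fit(logistic, funding, observed,
                      p0=[1.0, 1.0, np.median(funding)])
```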

Please see below for the Python script for Alabama, which we adapted from Udell’s original script:

In the code, we defined both a logistic function for each state using the averaged coefficients as well as that function’s first derivative as parts of the input for the solver.

And using this script, we acquired the following results:

The first column signifies the optimal solution, i.e. the percentage of each dollar of ACA grant that would have the optimal impact on the health outcomes of a particular state. Summing the solutions across 50 states, we get a number greater than 1, which makes some sense because ideally states would like more than they have been receiving to further improve current healthcare outcomes. We normalized the solutions so that they add up to exactly 1. In this way, the `Normalized` column represents the recommendation we give for how large a fraction of the ACA grant could be given to any U.S. state to maximize the improvement in health outcomes of that state.

**Future Work**

When our datasets get much larger, we run the risk of being unable to solve the problem because of its NP-hardness. Because certain sigmoidal optimization problems are generalized forms of MAXCUT, some approximation algorithms for MAXCUT, such as the `Simple Randomized 0.5-Approximation Algorithm` and the `Semidefinite Programming Algorithm`, could be used to solve our problem when the number of years or factors considered for optimization rises in the future.

Thank you very much!

Jasmine and Jinzhao


We explore optimization problems under partial differential equation constraints. We describe the adjoint sensitivity method, which allows efficient computation of the objective function gradient. We further describe the conjugate gradient method and analyze its convergence properties. Finally, we implement both algorithms toward solving the 1D and 2D time-independent Schrödinger equations.

On an abstract level, the adjoint sensitivity method can be derived using Lagrange multipliers. We define the Lagrangian as

In most cases, the objective function only depends on , so we make the simplification that . Since , we have

This expression can be simplified into

Note that is difficult to evaluate, hence for convenience we set

This is called the adjoint equation. After we solve for , we can evaluate

This method is efficient because it avoids the costly evaluation of . In addition, this method gives the exact value of .
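
In generic notation (state $latex u$, parameters $latex p$, PDE constraint $latex g(u,p) = 0$, objective $latex J(u,p)$; the symbols are mine), the standard derivation proceeds as follows. Define

$latex \mathcal{L}(u, p, \lambda) = J(u, p) + \lambda^{T} g(u, p)$

Since $latex g(u, p) = 0$ for every feasible $latex u$,

$latex \dfrac{dJ}{dp} = \dfrac{\partial J}{\partial p} + \dfrac{\partial J}{\partial u}\dfrac{\partial u}{\partial p} + \lambda^{T}\left(\dfrac{\partial g}{\partial p} + \dfrac{\partial g}{\partial u}\dfrac{\partial u}{\partial p}\right)$

Choosing $latex \lambda$ to satisfy the adjoint equation

$latex \left(\dfrac{\partial g}{\partial u}\right)^{T} \lambda = -\left(\dfrac{\partial J}{\partial u}\right)^{T}$

eliminates the expensive sensitivity $latex \partial u / \partial p$, leaving

$latex \dfrac{dJ}{dp} = \dfrac{\partial J}{\partial p} + \lambda^{T} \dfrac{\partial g}{\partial p}$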

Several conditions need to be imposed on the function . First of all, must be smooth so that is well defined over the domain . We haven’t specified ; in this general setting it can be both space and time. Also, the ODE/PDE problems are usually initial value or boundary value problems that are defined on the domain boundary . We require the system to be well posed. In addition, as we will show, the adjoint equation needs to be solved backward in time, hence we impose that the time-dependent ODE/PDE system must not exhibit sensitive dependence on initial conditions.

In most physical systems, these conditions are satisfied. The Navier-Stokes equations are extensively studied, and their adjoint equation is linear. Hence, the adjoint method gives a framework for optimization of the Navier-Stokes equations.

Note that the adjoint method only gives the gradient, but not the optimal solution. To search for an optimum, we need to use another algorithm that works on the objective function . Simple choices are gradient descent or steepest descent. We know that if the objective function is convex, then there exists a unique optimum. However, in most physical systems, convexity is usually not guaranteed. We discuss the conjugate gradient method in the following section.

We compared four optimization methods: steepest descent algorithm, Newton’s method, BFGS method, and Fletcher-Reeves algorithm (conjugate gradient method).

Among the four algorithms, the steepest descent and Fletcher-Reeves methods are gradient-based methods, while Newton’s method and the BFGS method are Newton-based methods. Gradient-based methods have linear convergence while Newton’s methods have locally quadratic convergence. We can clearly see that the gradient-based methods were trapped in a certain region for a while at the beginning and then converged at a linear rate, with the conjugate gradient method converging much faster than the steepest descent method. The Newton-based methods converge slowly (or even zigzag for a long time) at the beginning, but as they approach the optimal point, they converge very rapidly at a quadratic rate.

Rosenbrock's function, used in this example, is a particularly good case to illustrate this difference. The minimum of Rosenbrock's function lies in a very narrow curved valley. The gradient-based methods only have information from the first-order derivative, so they often miss the valley and take a long time searching along it to find the optimal point. The Newton-based methods, which take second-order information into account, can adjust the search using the curvature information from the Hessian and thus find the search direction and step length much more accurately. We know Newton's method can locate the optimal point of a quadratic function in a single step, and most functions can be approximated locally by a quadratic function, so the Newton-based methods worked especially well in this example, where the function is locally close to quadratic.
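To make this comparison concrete, here is a minimal sketch of steepest descent versus damped Newton on the Rosenbrock function. This is my own illustrative implementation (with a simple Armijo backtracking line search and a steepest-descent fallback when the Hessian is not positive definite), not the code used in the experiments above:

```python
import numpy as np

def f(x):
    """Rosenbrock function: f(x, y) = 100 (y - x^2)^2 + (1 - x)^2."""
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0] ** 2),
    ])

def hess(x):
    return np.array([
        [1200.0 * x[0] ** 2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
        [-400.0 * x[0], 200.0],
    ])

def backtrack(x, d, g, t=1.0, beta=0.5, c=1e-4):
    # Armijo backtracking line search along the descent direction d
    while f(x + t * d) > f(x) + c * t * (g @ d) and t > 1e-14:
        t *= beta
    return t

def minimize(x0, newton=False, tol=1e-6, max_iter=20000):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        if newton:
            d = np.linalg.solve(hess(x), -g)
            if g @ d >= 0.0:        # Hessian not positive definite here:
                d = -g              # fall back to a steepest-descent step
        else:
            d = -g
        x = x + backtrack(x, d, g) * d
    return x, max_iter

x_sd, it_sd = minimize([-1.2, 1.0])               # linear convergence: slow
x_nt, it_nt = minimize([-1.2, 1.0], newton=True)  # quadratic near the optimum
print(it_sd, it_nt)
```

Even with the same line search, the steepest descent run needs far more iterations than the Newton run, mirroring the behavior described above.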

In terms of computational cost, the conjugate gradient method shows a significant advantage. First, the Fletcher-Reeves method only needs the function value and the residual vector, so it does not need to store the matrix. This is particularly suitable for very large-scale sparse matrices. Second, the conjugate gradient method does not need to calculate the Hessian, which can be very computationally expensive for very large-scale problems. Last, although we need the conjugate set of directions to drastically reduce the number of iterations, we do not need to store all of them: from our construction, we can calculate the direction of the next step using just the previous step's direction. These features give the conjugate gradient method a very broad scope of applications in physics, economics, and operations research.
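The storage argument can be made concrete with a textbook linear conjugate gradient sketch for a symmetric positive-definite system (my own minimal implementation; the names and the 2×2 example are illustrative, not course code):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A.

    Only the current residual r and direction d are kept in memory; the
    next direction is built from the previous one alone, as noted above.
    """
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x          # residual
    d = r.copy()           # first search direction
    for _ in range(max_iter or n):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)  # Fletcher-Reeves-style update
        d = r_new + beta * d              # new direction from old one only
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))  # classic 2x2 textbook example
```

In exact arithmetic the method terminates in at most n iterations, which is why it converges in two steps on this 2×2 system.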

One of the most significant results of quantum mechanics is revealing the wave-particle duality of a photon. In this case, we solve the Schrödinger equation in 2D to emulate a quantum pinhole camera. We assume the probability distribution of many photons passing through a quantum pinhole to be a Gaussian pulse, and we use the adjoint method to solve for the quantum potential in 2D. The problem is set up on a 2D square domain with Dirichlet boundary conditions. The continuous Laplace operator is discretized using a standard stencil representation. A sparse matrix of size is assembled at each time step. The figure below shows the designed wave function and the simulated quantum potential.

This 2D optimization problem has 1600 spatial variables, and the conjugate gradient method is able to converge after approximately 150 iterations. This fast convergence reduces the computational cost and illustrates the power of the Krylov subspace method.
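As a sketch of the assembly step, here is one common way to build a 2D Laplacian with Dirichlet boundaries from Kronecker products; a 40×40 grid gives the 1600 spatial variables mentioned above. The five-point stencil and dense `numpy` matrices are my own illustrative choices (the actual solver would use a sparse format):

```python
import numpy as np

def laplacian_2d(n, h=1.0):
    # 1D second-difference matrix with Dirichlet boundary conditions
    T = (np.diag(-2.0 * np.ones(n))
         + np.diag(np.ones(n - 1), 1)
         + np.diag(np.ones(n - 1), -1)) / h ** 2
    I = np.eye(n)
    # Five-point stencil in 2D via Kronecker products
    return np.kron(I, T) + np.kron(T, I)

L = laplacian_2d(40)
print(L.shape)  # (1600, 1600)
```

The resulting matrix is symmetric, which is what makes conjugate gradient (a Krylov subspace method) applicable to the linear solves.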

]]>

In this blog post I focus on interesting characteristics of convexity that we didn’t get to cover in class, borrowing problems and examples from Kreyszig’s “Introductory Functional Analysis with Applications” and Rockafellar’s famous “Convex Analysis” text. I’m particularly excited to discuss a theorem from Rockafellar’s text, as this text has been cited in virtually every analysis course I’ve ever taken, though I’ve not yet had an opportunity to study it. I assume the reader is familiar with basic notions of analysis.

I borrow this example from Problem 3 of Kreyszig’s Chapter 6.2. For a normed space and a finite dimensional subspace, and for a given , we want to best approximate it with an element . To do this, the most sensible approach will be to find a basis for and approximate using a linear combination, for scalars chosen to minimize the distance between and our linear combination approximation. So, putting

we look to find that minimizes . Before we proceed, we would like to be sure that is continuous in , so we can be sure that doesn’t `jump’ at a point, e.g., at a possible minimum (so that the minimum is well defined).

So, taking to be arbitrary and fixed, writing and , and putting , we see that whenever and so ; by the triangle inequality we have

But was arbitrary, so we’re done. So is continuous in , as desired.

Now we proceed to the minimization of . Indeed, we recall that convexity gives us that any local minimum is also a global minimum, guaranteeing that any minimum we find is *the* minimum.

I claim that is convex in . That is, I plan to show that for and , we have for fixed . To see this, noticing and again applying the triangle inequality, we have

as desired! This result is particularly useful, e.g., in linear regression with a normed cost function!
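For concreteness, the triangle-inequality step can be written out as follows, in my own notation (fixed $x$, basis $e_1, \dots, e_n$ of the subspace, coefficient vectors $a, b$, $g(a) = \lVert x - \sum_i a_i e_i \rVert$, and $\lambda \in [0,1]$):

```latex
g\bigl(\lambda a + (1-\lambda) b\bigr)
  = \Bigl\lVert \lambda \Bigl(x - \sum_i a_i e_i\Bigr)
      + (1-\lambda) \Bigl(x - \sum_i b_i e_i\Bigr) \Bigr\rVert
  \le \lambda\, g(a) + (1-\lambda)\, g(b),
```

using the identity $x = \lambda x + (1-\lambda)x$ together with the triangle inequality and homogeneity of the norm.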

I now change gears from functional analysis to convex analysis. Indeed, students of analysis often like to know if properties of functions are preserved when taken to the limit. In this way, it’d be particularly nice to know that \emph{convexity} is preserved in the limit. I first make this claim precise, and then I prove it.

Take to be a sequence of finite convex functions on an open convex subset of a finite dimensional normed space. I claim that if pointwise (i.e., for and arbitrary fixed, we may find large enough such that whenever ), then is convex. I prove this claim.

Let be arbitrary, and take . Since pointwise, clearly we have

and

But since is convex, we have

and so

as desired. So, as we’d expect, convexity is preserved when taken to the limit. Rockafellar proves a more general form of this statement (in his Theorem 10.8), using weaker assumptions and more ideas from analysis. But since most of these ideas are outside the scope of this course, and the assumptions I use are reasonable or `practical’ ones, I leave it here.
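Written out in my own notation ($f_n \to f$ pointwise on the open convex set, $x, y$ in the set, $\lambda \in [0,1]$), the chain of inequalities in this proof is:

```latex
f\bigl(\lambda x + (1-\lambda) y\bigr)
  = \lim_{n\to\infty} f_n\bigl(\lambda x + (1-\lambda) y\bigr)
  \le \lim_{n\to\infty} \bigl[\lambda f_n(x) + (1-\lambda) f_n(y)\bigr]
  = \lambda f(x) + (1-\lambda) f(y).
```

The only fact used beyond convexity of each $f_n$ is that non-strict inequalities are preserved under pointwise limits.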

]]>

Facility placement is a class of problems investigating strategies for locating facilities such that some objective function, like market share, market capture from competitors, or overall profit, is maximized. We apply methods from this area of the optimization literature to reconsider grocery store placement in Boston.

The current organization of grocery stores in Boston and the surrounding area is ostensibly suboptimal. Grocery stores are inaccessible or extremely inconvenient from some areas, but ubiquitous in others. Can we improve the distribution of grocery stores using a mathematical model? If we were to build a new grocery store in Boston, where would we place it to optimize market share?

We frame our model in keeping with the network formulation of Marianov *et al.*, with refinements that adapt the model to our particular facility placement scenario. As such, we envision Boston as a grid, with stores and consumers located at nodes on the grid. To ease modeling, we consider the center of each zip code as the node of origin for all residents of that zip code, and do not consider any maximum facility capacity constraints on new or existing stores.

Our model has parameters as defined in the table below.

Parameter | Description
---|---
 | The set of proposed locations for new facilities.
 | The set of locations for existing facilities.
 | The total number of nodes in the system.
 | A binary variable reflecting placement of a store at node . Optimization occurs over .
 | Quantifies the uniformity in consumer preference, with formulation taken from the Marianov *et al.* paper.
 | Relative importance of distance over store rating.
 | Distance from node to node .
 | Rating of an existing store at node .
 | Standardized rating of an existing store at node .
 | Cost to a consumer at node going to a store at node .
 | Demand generation rate at node .
 | Probability that a consumer at node will patronize a store placed at node .
 | Consumer demand at node .

Marianov *et al.* consider a facility’s market share as a function of two factors: the distance consumers must travel to reach the facility, and the total time spent being served at the facility. Since service time is not relevant to the grocery store case, we instead consider the store’s consumer rating as determined by the popular website Yelp!. Thus, we structured our facility placement problem as a nonlinear integer program, defined below.

This model seeks to maximize the total consumer demand at the new facilities. The variable represents the total demand at node and is parameterized as a sum over the product of the demand generated at the source node , a zip code, and the probability that a consumer who lives at node would venture to the facility placed at node .

The formulation of consumer probabilities given in the above program incorporates the decision variables , which we will also represent as , variation in consumer behavior, and costs into the objective function as a reflection of the relative cost to a consumer at node of patronizing a store at node . The use of is an artifact of the queueing theory component central to the modeling of waiting times in Marianov *et al.*‘s specification.

We apply this model to Boston population and grocery store data collected from the US Census Bureau and Yelp! respectively. The data comprise the 112 stores and 41 zip codes that fall within a five-mile radius of central Boston.

It is evident from the figure below that most stores in the Boston area serve up to 10,000 patrons. Some stores, however, serve as many as 30,000 patrons; the areas where these stores are located are thus the strongest candidates for additional facilities.

Geographical factors may also affect the parameters of our model; a grocery store placed in the middle of the Charles, for example, while possibly very close to a densely populated area, is not a valid optimum. The figures below superimpose the location data obtained from Yelp! and the Census Bureau on a map of Boston, showing the 41 zip codes, the 112 grocery stores, and both together. Opaque store nodes are actually overlapping points, indicating several stores in very close proximity.

This visualization of the data reaffirms the intuition that motivates the project and confirms the mismatch between population and store placement that the analysis of the combined data suggests.

We created maps to show the results of a 1-store placement and a 5-store placement. New stores are shown in cyan.

While our algorithm is able to return a reasonable optimum for the Boston data, it is not immediately obvious that our model specification and corresponding code are selecting the “correct” node. To this end, we generated a synthetic test dataset.

We considered the simplest case, where we place only one store in a relatively small network. In this scenario, we need only evaluate the objective function at all possible nodes, then ensure that the algorithm selected the node with maximal objective function value.

Within the synthetic dataset, our city is represented as a two-dimensional array. There is high population density in the southwestern quadrant of the city and no population density in the northeastern quadrant. Store placement is determined by flipping a coin with probability inversely proportional to the population density; thus, there is a high density of stores in the northeastern quadrant of the city. The clearly “correct” placement is the southwestern quadrant, which boasts the suboptimal combination of high population density and no grocery stores. As we see in the figure below, our optimization procedure produced exactly this result, placing a store in the southwestern corner of the synthetic dataset.
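A toy version of this validation can be sketched as follows. The grid, the logit-style choice probabilities, and every parameter value below are my own illustrative assumptions, not the project's data or code; the point is only that a brute-force argmax over candidate nodes lands in the dense, store-free southwestern quadrant:

```python
import numpy as np

n = 10  # 10x10 toy city grid; x grows eastward, y grows northward
xs, ys = np.meshgrid(np.arange(n), np.arange(n))
nodes = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)

sw = (nodes[:, 0] < n // 2) & (nodes[:, 1] < n // 2)    # southwest quadrant
ne = (nodes[:, 0] >= n // 2) & (nodes[:, 1] >= n // 2)  # northeast quadrant
pop = np.ones(len(nodes))
pop[sw] = 100.0            # dense population, no stores
pop[ne] = 0.0              # no population, many stores
existing = nodes[ne][::3]  # existing stores clustered in the northeast

def captured_demand(j, theta=1.0):
    # Logit-style shares: closer stores get exponentially higher utility.
    u_new = np.exp(-theta * np.linalg.norm(nodes - nodes[j], axis=1))
    d_old = np.linalg.norm(nodes[:, None, :] - existing[None, :, :], axis=2)
    u_old = np.exp(-theta * d_old).sum(axis=1)
    return float(np.sum(pop * u_new / (u_new + u_old)))

# Brute force: evaluate the objective at every node and take the argmax.
best = max(range(len(nodes)), key=captured_demand)
print(nodes[best])  # lands in the southwestern quadrant
```

With only one store to place and a small network, exhaustive evaluation like this is cheap and gives a ground truth against which the optimizer's choice can be checked.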

As formulated by Marianov et al., the model contains an equilibrium condition that accounts for the relationship between overcrowding and consumer preference. As market share increases, stores become more crowded, deterring consumers until the system reaches equilibrium. Initial heuristics suggested that new stores’ market share would not be high enough to induce overcrowding, so we did not implement this aspect of the model. A more complex model could include an overcrowding penalty in the cost function:

for the maximum capacity of a store and a constant to ensure the logarithmic argument is nonnegative, with the remaining parameters as defined in the table at the start of this post. This formulation penalizes excessive market share while retaining the emphasis on convenience and store rating.

]]>

The differential equation formulation of the Bass Model is as follows:

where is the cumulative number of adopters at time . is the coefficient of innovation, implying that a certain percentage of researchers independently begin to adopt the innovation at each time step. is the coefficient of imitation, representing the “word of mouth” effect whereby adopters spread the innovation to potential adopters at each time step. is the total number of adopters over the whole period of the innovation.

The discrete analog of the Bass model allows the formulation to be written accordingly:

Then we can estimate parameters $p$, $q$, and $m$ by employing the OLS method to estimate parameters , , and , given the analog . The residual sum of squares can be written as

It is well known that , the instance of that minimizes the , has an analytical solution

so we can compute it very easily.
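The OLS route can be sketched end to end as follows, on synthetic noiseless data (all names and parameter values are my own illustrative choices; the discrete Bass recursion and the quadratic-root mapping from the regression coefficients $a$, $b$, $c$ back to $p$, $q$, $m$ are the standard ones):

```python
import numpy as np

# Simulate adoption data from known (p, q, m) via the discrete Bass recursion
#     S(t) = p*m + (q - p) * N(t-1) - (q/m) * N(t-1)**2
p_true, q_true, m_true = 0.03, 0.38, 1000.0
T = 20
N = np.zeros(T + 1)   # cumulative adopters
S = np.zeros(T)       # new adopters per period
for t in range(T):
    S[t] = (p_true * m_true + (q_true - p_true) * N[t]
            - (q_true / m_true) * N[t] ** 2)
    N[t + 1] = N[t] + S[t]

# OLS: regress S(t) on [1, N(t-1), N(t-1)^2] to get (a, b, c)
X = np.column_stack([np.ones(T), N[:-1], N[:-1] ** 2])
(a, b, c), *_ = np.linalg.lstsq(X, S, rcond=None)

# Map back using a = p*m, b = q - p, c = -q/m  =>  c*m^2 + b*m + a = 0
m_hat = (-b - np.sqrt(b ** 2 - 4 * a * c)) / (2 * c)
p_hat, q_hat = a / m_hat, -c * m_hat
print(p_hat, q_hat, m_hat)
```

Because the synthetic data are generated by the same recursion, the regression recovers the true parameters exactly; with real, noisy data the same pipeline gives least-squares estimates instead.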

This approach seeks to minimize the objective function:

where is the actual number of total adopters at time and is the predicted number of total adopters based on the Bass model, given the parameter vector .

A genetic algorithm (GA) is a search technique for finding exact or approximate solutions to optimization and search problems, and is considered a global search heuristic. GAs use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover. A typical genetic algorithm requires a genetic representation of the solution domain and a fitness function to evaluate candidate solutions. In a GA, an abstract representation of a candidate solution is called a chromosome, and a population of chromosomes evolves toward better solutions. Solutions are represented in some encoding, such as binary encoding. A fitness function is a particular type of objective function that prescribes the optimality of a solution, so that a particular chromosome may be ranked against all the other chromosomes. The evolution usually starts from a population of randomly generated individuals. In each generation, the fitness of every individual in the population is evaluated; based on their fitness, the fittest individuals are selected and, through reproduction, crossover, or mutation, form a new population. The new population is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population.
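As a sketch of how a GA could be applied to the Bass fitting problem above, here is a minimal real-coded GA with tournament selection, blend crossover, Gaussian mutation, and elitism. All operators, hyperparameters, and the synthetic data are my own illustrative choices; the closed-form cumulative Bass curve used to generate the data is the standard one:

```python
import numpy as np

rng = np.random.default_rng(42)

def bass_cumulative(theta, t):
    # Standard closed-form cumulative Bass curve N(t) = m * F(t)
    p, q, m = theta
    e = np.exp(-(p + q) * t)
    return m * (1.0 - e) / (1.0 + (q / p) * e)

t = np.arange(1, 21, dtype=float)
theta_true = np.array([0.03, 0.38, 1000.0])
data = bass_cumulative(theta_true, t)   # noiseless synthetic adoption data

def sse(theta):
    # Sum-of-squares objective; lower is fitter
    return np.sum((data - bass_cumulative(theta, t)) ** 2)

lo = np.array([1e-3, 1e-3, 100.0])      # search box for (p, q, m)
hi = np.array([0.5, 1.0, 5000.0])
pop = rng.uniform(lo, hi, size=(200, 3))
initial_best_sse = min(sse(ind) for ind in pop)

for gen in range(300):
    scores = np.array([sse(ind) for ind in pop])
    elite = pop[np.argmin(scores)].copy()
    # Tournament selection: each slot gets the better of two random parents
    i, j = rng.integers(len(pop), size=(2, len(pop)))
    parents = np.where((scores[i] < scores[j])[:, None], pop[i], pop[j])
    # Blend crossover between consecutive parents
    w = rng.uniform(size=(len(pop), 1))
    children = w * parents + (1.0 - w) * np.roll(parents, 1, axis=0)
    # Gaussian mutation with a slowly shrinking scale, clipped to the box
    scale = 0.1 * (hi - lo) * 0.98 ** gen
    children = np.clip(children + rng.normal(scale=scale, size=children.shape),
                       lo, hi)
    children[0] = elite                 # elitism: keep the best so far
    pop = children

best = pop[np.argmin([sse(ind) for ind in pop])]
print(best, sse(best))
```

With elitism, the best fitness can never get worse from one generation to the next, so the final fit is at least as good as the best random starting point.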

Let denote the probability density function of adoption at time and the cumulative distribution function. Then the Bass equation can be written as

This is an ordinary differential equation that can be solved with the initial condition . Taking the integral of both sides of the above equation, we get

where and . However, this solution assumes all consumers in the population eventually adopt the new product, i.e., the cumulative distribution function ends at 1. The Bass equation describes a diffusion system where the probability of eventually adopting is , so the analytical solution for the cumulative distribution function should be

Note that the population in this system is , where is the population size in our previous models. We know the likelihood function is

where are the time points of the data, , and is the number of people who did not adopt by time , i.e., . Thus the analytical form of the log-likelihood function is given by

After solving for , and , we can easily get the estimators of , and by the one-to-one correspondence between them.

]]>