# Got Milk? Grocery Store Placement in Boston

by Sarah Anoke and Lauren Vollmer

### The Problem

Facility placement is a class of problems investigating strategies for locating facilities such that some objective function, like market share, market capture from competitors, or overall profit, is maximized. We apply methods from this area of the optimization literature to reconsider grocery store placement in Boston.

The current organization of grocery stores in Boston and the surrounding area is ostensibly suboptimal. Grocery stores are inaccessible or extremely inconvenient from some areas, but ubiquitous in others. Can we improve the distribution of grocery stores using a mathematical model? If we were to build a new grocery store in Boston, where would we place it to optimize market share?

### The Model

We frame our model in keeping with the network formulation of Marianov et al., with refinements that adapt the model to our particular facility placement scenario. As such, we envision Boston as a grid, with stores and consumers located at nodes on the grid. To ease modeling, we consider the center of each zip code as the node of origin for all residents of that zip code, and do not consider any maximum facility capacity constraints on new or existing stores.

Our model has parameters as defined in the table below.

Parameter Description
$N$ The set of proposed locations for new facilities.
$N'$ The set of locations for existing facilities.
$J = |N \cup N'|$ The total number of nodes in the system.
$y_j$ A binary variable reflecting placement of a store at node $j$. Optimization occurs over ${\boldsymbol y} \in \mathbb{R}^J$.
$\gamma = \frac{\pi}{\sigma\sqrt{6}}$ Quantifies the uniformity in consumer preference, with formulation taken from Marianov {\em et al.} paper.
$\alpha \in(0,1)$ Relative importance of distance over store rating.
$d_{ij}$ Distance from node $i$ to node $j$.
$r_j$ Rating of an existing store at node $j$.
$R_j = \frac{r_j - \bar{r}}{\sigma_r}$ Standardized rating of an existing store at node $j$.
$c_{ij}= \alpha d_{ij} - (1-\alpha)R_j$ Cost to a consumer at node $i$ going to a store at node $j$.
$h_i$ Demand generation rate at node $i$.
$x_{ij}$ Probability that a consumer at node $i$ will patronize a store placed at node $j$.
$\lambda_j = \sum_{i=1}^m h_i x_{ij}$ Consumer demand at node $j$.

Marianov et al. consider a facility’s market share as a function of two factors, the distance consumers must travel to reach the facility, and the total time spent being served at the facility. Since service time is not relevant to the grocery store case, we instead consider the store’s consumer rating as determined by the popular website Yelp!. Thus, we structured our facility placement problem as a nonlinear integer program, defined in below.
$\max \sum_{j \in N} \lambda_j = \max \sum_{j \in N}\sum_{i=1}^m h_i x_{ij} = \max \sum_{j \in N}\sum_{i=1}^J \frac{h_i y_j e^{-\gamma c_{ij}}}{\sum_{k \in N} y_k e^{-\gamma c_{ik}} + \sum_{k \in N'} e^{-\gamma c_{ik}}}$
$\text{such that}$
$\sum_{j \in N \cup N'} x_{ij} =1 \enspace \forall \, i \in N$
$\phantom{such that} x_{ij} \in [0, 1] \enspace \forall \, i \in N, \; \forall j \in N \cup N'$
$\phantom{such that} \sum_{j \in N} y_j = p$
$\phantom{such that} \sum_{j \in N'} y_j = |N^\prime|$
$\phantom{such that} y_j \in \{0, 1\} \enspace \forall j \in N$
This model seeks to maximize the total consumer demand at the new facilities. The variable $\lambda_j$ represents the total demand at node $j$ and is parameterized as as a sum over the product of demand generated at the source node $i$, a zip code, and the probability $x_{ij}$ that a consumer who lives at node $i$ would venture to the facility placed at node $j$.

The formulation of consumer probabilities given in the above program incorporates the decision variables $y_j$, which we will also represent as ${\boldsymbol y} = (y_1, \ldots, y_J)^\top$, variation in consumer behavior, and costs into the objective function as a reflection of the relative cost to a consumer at node $i$ of patronizing a store at node $j$. The use of $\exp\{ \cdot \}$ is an artifact of the queueing theory component central to the modeling of waiting times in Marianov et al.‘s specification.

### Data

We apply this model to Boston population and grocery store data collected from the US Census Bureau and Yelp! respectively. The data comprise the 112 stores and 41 zip codes that fall within a five-mile radius of central Boston.

It is evident from the figure below that most stores in the Boston area serve up to 10,000 patrons. Some stores, however, serve as many as 30,000 patrons; the areas where these stores are located are thus the strongest candidates for additional facilities.

### Geographic Data Analysis

Geographical factors may also affect the parameters of our model; a grocery store placed in the middle of the Charles, for example, while possibly very close to a densely populated area, is not a valid optimum. The figures below superimpose the location data obtained from Yelp! and the Census Bureau on a map of Boston. These are plots demonstrating the 41 zip codes, 112 grocery stores, and both the zip codes and stores. Opaque store nodes are actually overlapping points, indicating several stores in very close proximity.
This visualization of the data reaffirms the intuition that motivates the project and confirms the sense of disassociation between population and store placement that the analysis of the combined data suggests.

### Results

We created maps to show the results of a 1-store placement, and a 5-store placement. The store is a cyan color.

### Benchmarking

While our algorithm is able to return a reasonable optimum for the Boston data, it is not immediately obvious that our model specification and corresponding code are selecting the “correct” node. To this end, we generated a test synthetic dataset.

We considered the simplest case, where we place only one store in a relatively small network. In this scenario, we need only evaluate the objective function at all possible nodes, then ensure that the algorithm selected the node with maximal objective function value.

Within the synthetic dataset, our city is represented as a two-dimensional array. There is high population density in the southwestern quadrant of the city and no population density in the northeastern quadrant. Store placement is determined by flipping a coin with probability inversely proportional to the population density; thus, there is a high density of stores in the northeastern quadrant of the city. The clearly “correct” placement is the southwestern quadrant, which boasts the suboptimal combination of high population density and no grocery stores. As we see in the figure below below, our optimization procedure produced exactly this result, placing a store in the southwestern corner of the synthetic data set.

### Further Directions

As formulated by Marianov et al., the model contains an equilibrium condition that accounts for the relationship between overcrowding and consumer preference. As market share increases, stores become more crowded, deterring consumers until the system reaches equilibrium. Initial heuristics suggested that new stores’ market share would not be high enough to induce overcrowding, so we did not implement this aspect of the model. A more complex model could include an overcrowding penalty in the cost function:
$c_{ij} = \alpha d_{ij} - (1-\alpha)R_j + \log(\lambda_j - T + c)$
for $T$ the maximum capacity of a store and $c$ a constant to ensure the logarithmic argument is nonnegative, with the remaining parameters as defined in the table at the start of this post. This formulation penalizes excessive market share while retaining the emphasis on convenience and store rating.