Matching in Observational Studies

In observational studies, we do not know the treatment assignment mechanism, so naive estimates of the treatment effect — where `effect' is meant in a causal sense — are biased. A major source of this bias is covariate imbalance between treatment groups.
Randomization of treatment balances both known and unknown covariates across treatment groups on average, facilitating unbiased estimates of the treatment effect. When it is unethical or financially infeasible to assign treatment, we must rely on observational data, which are notoriously imbalanced.

One remedy for this imbalance is to balance on all known influential covariates. If the completeness of this list can be reasonably justified, then the problem can be considered mitigated.

One common approach to achieving balance is to match individuals in the treatment and control groups (one-to-one or one-to-many matching) by minimizing some distance metric (e.g., a difference in propensity scores). The drawback is that data from unmatched individuals are discarded. So, we would like to optimize our sample selection: which subset of individuals yields the best balance and the maximal sample size?

BOSS: Balance Optimization Subset Selection

Cho et al. have proposed a method called Balance Optimization Subset Selection (BOSS), which takes a more holistic view of matching. Instead of matching individuals, the problem is reframed as a best-subset-selection optimization problem: the treatment effect is examined across different subsets of the treatment and control groups that achieve identical balance. Considering the distribution of the treatment effect across these possibilities is arguably more statistically sound (from a frequentist perspective) and yields a natural framework for standard error calculation. Below I summarize results from their paper.


The goal is to find a subset of the treatment pool \mathcal{S}^T and a subset of the control pool \mathcal{S}^C so that a measure of balance b(\mathcal{S}^T, \mathcal{S}^C) is maximized or, equivalently, some measure of distance is minimized. This measure of balance or distance is the objective function. Common distance measures include Mahalanobis metric matching on the propensity score, Mahalanobis metric matching with calipers, and the difference in propensity scores itself.

Description of the Method

Imagine creating a set of B uniformly sized bins for each covariate and assigning each covariate value to the bin that contains it. A small B leads to a simple optimization problem; a large B leads to a harder problem, but more similar covariate distributions between the treatment and control groups.

Consider covariate X_p. Within the treatment group, this covariate takes values in the closed interval [\min_{\mathcal{T}} X_p, \max_{\mathcal{T}} X_p] = [L_p, U_p]. We can partition this range into B bins using B+1 breakpoints, which induces a covariate distribution (imagine a histogram).
The BOSS method then selects control units such that the control covariate distribution and the treatment covariate distribution are as similar as possible.
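A minimal sketch of this binning step in Python, using hypothetical simulated data (NumPy's histogram stands in for the bin-assignment described above; the sample sizes and distributions are arbitrary choices):

```python
import numpy as np

# Hypothetical data: one covariate X_p for the treatment and control pools.
rng = np.random.default_rng(0)
x_treat = rng.normal(0.0, 1.0, size=50)
x_control = rng.normal(0.5, 1.2, size=200)

B = 5  # number of bins

# B + 1 breakpoints spanning the treatment range [L_p, U_p].
edges = np.linspace(x_treat.min(), x_treat.max(), B + 1)

# Treatment covariate distribution: counts of treated units per bin.
treat_counts, _ = np.histogram(x_treat, bins=edges)

# Distribution for an arbitrary candidate control subset of equal size;
# control values outside the treatment range fall in no bin.
subset = x_control[:50]
control_counts, _ = np.histogram(subset, bins=edges)
```

BOSS would then search over control subsets to make `control_counts` as close as possible to `treat_counts`.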

For a set of P covariates, there are K = P + \binom{P}{2} + \binom{P}{3} + \cdots + \binom{P}{P} = 2^P - 1 joint and marginal distributions:

  • \binom{P}{1} = P marginal distributions
  • \binom{P}{2} joint distributions of 2 covariates
  • \ldots
  • \binom{P}{P} = 1 joint distribution of all P covariates
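The count K can be checked directly (P = 4 here is an arbitrary choice):

```python
from math import comb

P = 4  # hypothetical number of covariates
# One distribution for each non-empty subset of the P covariates.
K = sum(comb(P, k) for k in range(1, P + 1))
print(K)  # 15, which equals 2**P - 1
```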

The BOSS method described above can be repeated for all, or any subset, of the K possible covariate distributions. One usually doesn’t optimize over all K distributions because of redundancy of information.

Let b^* represent the fixed total number of bins we're using, and arbitrarily order the bins 1, 2, \ldots, b^*. Let \#(\mathcal{S}_b) represent the number of elements of a set \mathcal{S} with values in bin b, let \mathcal{T} represent the treatment group of N individuals, and let P be the set of pre-treatment covariates to balance on. Our objective function (to be minimized) is
\displaystyle \sum_b \frac{[\#(\mathcal{S}_b^C) - \#(\mathcal{T}_b)]^2}{\max\{\#(\mathcal{T}_b), 1\}}

  • \#(\mathcal{S}_b^C) represents the number of observations in bin b from some subset \mathcal{S}^C \subseteq \mathcal{C} of control observations,
  • \#(\mathcal{T}_b) represents the number of observations in bin b across all members of the treatment group,
  • \max\{\#(\mathcal{T}_b), 1\} is to ensure we don’t divide by zero when there are no observations in bin b from the treatment group.
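The objective above translates directly into code. A sketch (the function name is my own; the helper takes per-bin counts as arrays):

```python
import numpy as np

def boss_objective(control_counts, treat_counts):
    """Chi-square-style imbalance between a control subset's bin counts
    and the treatment group's bin counts; 0 means identical histograms."""
    control_counts = np.asarray(control_counts, dtype=float)
    treat_counts = np.asarray(treat_counts, dtype=float)
    # max{#(T_b), 1} guards against division by zero in empty treatment bins.
    denom = np.maximum(treat_counts, 1.0)
    return float(np.sum((control_counts - treat_counts) ** 2 / denom))

# A perfectly balanced subset scores 0; any imbalance raises the score.
print(boss_objective([3, 5, 2], [3, 5, 2]))  # 0.0
print(boss_objective([4, 4, 2], [3, 5, 2]))  # 1/3 + 1/5 ≈ 0.533
```

The hard part of BOSS is not evaluating this objective but searching over the exponentially many control subsets that could feed into it.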

Limitations and Discussion

“Optimizing over subsets” can be difficult to explain to a collaborator. One might instead prefer more familiar measures of imbalance, such as the two-sample t-statistic for the difference in means.

However, this method is really nice in that one doesn’t have to choose a particular measure of distance between the two groups, and one doesn’t have to stress over a good model for the propensity score. “Human bias is replaced with computational constraints… instead, the quality of treatment effect estimation is now limited just by the complexity of an NP-Hard optimization problem and available computational power.”


Cho WKT, Sauppe JJ, Nikolaev AG, Jacobson SH, Sewell EC.
An optimization approach to matching and causal inference.

