Propensity score matching as an evaluation technique

Propensity score matching is a popular evaluation technique that tries to recreate the conditions of a randomised trial. A familiar context for exploring these ideas is the clinical drug trial, so let’s suppose a drug company wants to find out whether a drug for heart disease is effective.

In this context, we could run a trial and randomly split participants into a treatment group, which receives the drug, and a control group, which receives a placebo. If we then follow these groups over time and measure an outcome (say, risk of heart attack), we can see what effect the drug has.

This is generally a high standard of evidence because randomisation roughly balances the two groups across key characteristics such as age, diet and hereditary factors. If the two groups aren’t balanced, then any difference in outcome might simply reflect the fact that the groups are different. For example, older people are more likely to have heart attacks, so if they are overrepresented in one group this will skew the results.
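To make the balancing effect concrete, here is a minimal sketch in Python that randomly splits a set of made-up participants into two groups and compares their average ages. The ages, the sample size and the names `treatment`, `control` and `age_gap` are all assumptions invented for illustration, not data from a real trial.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical participants: ages drawn uniformly between 40 and 80.
ages = [rng.randint(40, 80) for _ in range(10_000)]

# Randomly split the participants into two equal-sized groups.
shuffled = ages[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:5_000], shuffled[5_000:]

# With groups this large, the two mean ages should be very close.
age_gap = abs(mean(treatment) - mean(control))
print(f"mean age, treatment group: {mean(treatment):.2f}")
print(f"mean age, control group:   {mean(control):.2f}")
print(f"difference in mean age:    {age_gap:.2f}")
```

Nothing in the split looks at age, yet the two groups should end up with nearly the same average age. The same would be expected for any other characteristic, whether or not we recorded it.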

As a ‘toy’ example of how randomisation can work to achieve balance, we can think of the humble coin toss. It turns out that if you flip a fair coin 100 times there is about a 96% probability that the number of heads will fall between 40 and 60 inclusive, i.e., pretty close to the balanced ideal of 50. So with a large enough sample, randomisation tends towards balance.
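The coin-toss figure can be checked exactly from the binomial distribution. The sketch below is one way to do it in Python; the function name `prob_heads_between` is made up for this example, and the 1,000-flip case is an added comparison to show the effect of a larger sample.

```python
from math import comb

def prob_heads_between(n, low, high):
    """Exact probability that n fair coin flips give between
    low and high heads (inclusive)."""
    favourable = sum(comb(n, k) for k in range(low, high + 1))
    return favourable / 2 ** n

# 100 flips: heads within 10 of the balanced ideal of 50.
p_100 = prob_heads_between(100, 40, 60)

# 1,000 flips: the same proportional range (40% to 60% heads).
p_1000 = prob_heads_between(1000, 400, 600)

print(f"100 flips, 40 to 60 heads:     {p_100:.4f}")
print(f"1000 flips, 400 to 600 heads:  {p_1000:.6f}")
```

For 100 flips the probability comes out at a little over 0.96, in line with the rough figure quoted above. For 1,000 flips it is almost indistinguishable from 1, which is the sense in which larger samples tend towards balance.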

That’s the randomised trial, but what happens if we don’t have the option to set up a trial but already have a group of people who have been treated with the drug? Well, all is not lost, because while a prospective trial (looking forward) is ideal, we can also do a retrospective study (looking back) if we can find an appropriate group to compare against. Perhaps we know of a group of similar patients who haven’t had the drug. Might they be used as a control?

The short answer is yes, but we still have the problem of accounting for any differences between the two groups that might bias the result. This is where propensity score matching comes in. Propensity score matching attempts to rebalance any differences between the treatment and control groups that might affect the outcome (so-called confounding effects).

A variety of methods go by the name propensity score matching, but a common one achieves balance by assigning ‘weights’ to each individual based on their likelihood of being in the treatment group (this likelihood is the propensity score, and the weighting approach is often called inverse probability weighting). These weights are like sample weights in a survey, where we adjust the survey responses to try to match the population, i.e. upweighting groups that are underrepresented in the survey and downweighting those that are overrepresented. In our example, if people in a particular age group are much less likely to be in the treatment group than in the control group, then we can rebalance the groups by ‘upweighting’ the few treated individuals from that age group.
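The weighting idea can be sketched with a small made-up dataset. In the Python example below, the propensity score is simply the share of each age group that was treated; in practice it is usually estimated with a statistical model such as logistic regression over several characteristics. Each treated individual gets a weight of 1/p and each control a weight of 1/(1 - p), which is the inverse probability weighting form. The data and all function names are assumptions invented for illustration.

```python
from collections import Counter

# Hypothetical toy data, invented for illustration: (age_group, treated) pairs.
people = (
    [("young", True)] * 20 + [("young", False)] * 80
    + [("old", True)] * 60 + [("old", False)] * 40
)

def propensity_by_group(records):
    """Share of each age group that received the treatment."""
    total = Counter(group for group, _ in records)
    treated = Counter(group for group, is_treated in records if is_treated)
    return {group: treated[group] / total[group] for group in total}

def inverse_probability_weight(is_treated, propensity):
    """1/p for treated individuals, 1/(1 - p) for controls.

    Assumes 0 < propensity < 1, i.e. every age group contains both
    treated and untreated individuals.
    """
    return 1 / propensity if is_treated else 1 / (1 - propensity)

def share_old(records, weights, treated_flag):
    """Weighted share of the 'old' age group within one arm of the study."""
    arm = [(group, w) for (group, is_treated), w in zip(records, weights)
           if is_treated == treated_flag]
    return sum(w for group, w in arm if group == "old") / sum(w for _, w in arm)

propensity = propensity_by_group(people)
weights = [inverse_probability_weight(t, propensity[g]) for g, t in people]
unweighted = [1.0] * len(people)

for label, w in (("unweighted", unweighted), ("weighted", weights)):
    print(label,
          f"treated: {share_old(people, w, True):.2f}",
          f"control: {share_old(people, w, False):.2f}")
```

Before weighting, older people make up 75% of the treated group but only about 33% of the control group. After weighting, they make up 50% of both, matching their share of the combined sample, so a comparison of outcomes is no longer skewed by age.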

Ideally we have enough data on potential sources of bias between the groups. Propensity score matching can rebalance on known characteristics, but it can’t ensure balance on characteristics we haven’t measured. If there’s no reason to suspect that these unmeasured factors are unbalanced between the two groups, i.e. if we can assume they are randomly distributed, then we can have reasonable confidence in our results.