How does a policy analyst impute missing public benefits data?

Last week, Scioto Analysis released our updated Ohio Poverty Measure, a report that we’ve been working on since November. In this measure, we use publicly available data to understand the state of poverty in Ohio. Our methods are based on a wide range of other state and city poverty reports, all of which are heavily influenced by the Census Bureau’s Supplemental Poverty Measure.

To calculate the Ohio Poverty Measure, we primarily used data from the American Community Survey. The American Community Survey is one of the most useful datasets for this purpose because it has a larger sample size than the Current Population Survey, which is used to calculate the Supplemental Poverty Measure. This makes it the best available way to estimate what poverty looks like at smaller geographic resolutions. Despite its wide reach, the American Community Survey does have a few important drawbacks.

The most important limitation of the American Community Survey is that it doesn’t ask as many questions as other surveys do. It succeeds in providing detailed information about things like employment and income, but it doesn’t ask about things like medical expenses, which we need to know for our poverty report.

For this information, we turn to the Current Population Survey. The Current Population Survey is similar to the American Community Survey, but it asks a smaller number of people a larger number of questions. Here, the tradeoff is a smaller sample size in exchange for more detailed responses.

While we could have used the Current Population Survey as the base data for our analysis, as the Supplemental Poverty Measure does, we’d be relying on a much smaller sample to make claims about all of Ohio. Since we performed our analysis at the Public Use Microdata Area level (the smallest identifiable geographic area in these datasets), this would subject our results to much larger sampling error.

So how do we use data from the Current Population Survey to fill in the missing data from the American Community Survey? Formally, this process is called “data imputation,” and there is a great deal of statistical research on the topic.

There are many ways to conduct data imputation. One simple approach is to assign every person the average value of a missing variable. In our context, this would work poorly, since something like medical expenses will be zero for many people and quite large for a small portion of people. It does have the desirable characteristic that the imputed data will have the same mean as the original data.
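To make the problem with average-value imputation concrete, here is a minimal sketch using ten made-up expense values (these are not figures from the report). It shows that the mean survives while the share of zeros and the largest value do not.

```python
from statistics import mean

# Hypothetical annual medical expenses for ten respondents (made-up numbers).
observed = [0, 0, 0, 0, 0, 0, 0, 200, 800, 9000]

# Mean imputation: every record missing the variable gets the observed average.
average = mean(observed)
imputed = [average] * len(observed)

def share_zero(values):
    """Fraction of records whose value is exactly zero."""
    return sum(v == 0 for v in values) / len(values)

print("mean:      ", mean(observed), "vs", mean(imputed))
print("share zero:", share_zero(observed), "vs", share_zero(imputed))
print("maximum:   ", max(observed), "vs", max(imputed))
```

Every imputed record ends up with the same moderate expense, so nobody looks like they have zero costs and nobody looks like they have very high costs. Since a poverty measure depends on each household's own expenses, that loss matters even though the mean is unchanged.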

For the Ohio Poverty Measure, we follow the same steps for imputing missing data that other poverty reports before us have. We use a two-step modeling process: first we determine who is likely to have a non-zero value for the missing variable, and then, looking only at that group, we estimate what the value is.

To do this, we build two regression models from the Current Population Survey data. The first is a binary outcome regression that predicts the probability of an individual response having a non-zero value. The second looks only at those responses that have non-zero values and predicts the size of the missing variable.
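The two models can be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind the report: the donor records are made up, age is the only predictor, the fitting routines are hand-rolled so the example runs without extra packages, and modeling the amount on a log scale is our choice here to keep predictions positive. In practice, a statistics package and many more predictors would be used.

```python
import math

# Hypothetical donor records standing in for Current Population Survey rows:
# (age, annual medical expenses in dollars). All values are made up.
donor = [(22, 0), (25, 0), (31, 0), (34, 150), (40, 0), (45, 300),
         (52, 0), (58, 900), (63, 1200), (67, 0), (71, 2500), (78, 4000)]

def scale(age):
    """Center and shrink age so the gradient steps below stay stable."""
    return (age - 50) / 10

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps=5000, rate=0.1):
    """Fit P(y = 1) = sigmoid(a + b * x) by gradient ascent on the log-likelihood."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [y - sigmoid(a + b * x) for x, y in zip(xs, ys)]
        a += rate * sum(residuals) / n
        b += rate * sum(r * x for r, x in zip(residuals, xs)) / n
    return a, b

def fit_line(xs, ys):
    """Ordinary least squares for y = c + d * x with a single predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    d = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - d * mx, d

# Model 1: binary outcome, does the record have any expenses at all?
xs_all = [scale(age) for age, _ in donor]
has_expense = [1 if amount > 0 else 0 for _, amount in donor]
a, b = fit_logistic(xs_all, has_expense)

# Model 2: size of the expense, fit only on records with a non-zero value.
positive = [(scale(age), math.log(amount)) for age, amount in donor if amount > 0]
c, d = fit_line([x for x, _ in positive], [y for _, y in positive])

def predict(age):
    """Return (probability of a non-zero value, predicted amount if non-zero)."""
    x = scale(age)
    return sigmoid(a + b * x), math.exp(c + d * x)

for age in (25, 70):
    prob, amount = predict(age)
    print(f"age {age}: P(non-zero) = {prob:.2f}, amount if non-zero = {amount:,.0f}")
```

Keeping the two questions separate lets the zeros and the long right tail each get a model suited to them, which is what the average-value approach could not do.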

We then apply these two regression models to the American Community Survey data. The first model gives each respondent a predicted probability of having a non-zero value, and the second gives an estimate of the size of the missing variable.

Then, we rank the American Community Survey respondents by their predicted probability of having a non-zero value for the missing variable. We want to make sure that the same percentage of people in the American Community Survey have non-zero values as in the Current Population Survey, so we assign non-zero values only to the most likely people, stopping once the proportion in the American Community Survey matches that in the Current Population Survey.
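The ranking step might look something like the sketch below. The respondent records, field names, and the 50 percent target share are all hypothetical, and for simplicity the example counts respondents equally rather than using survey weights, which a real analysis would need to account for.

```python
# Hypothetical American Community Survey respondents with the two predictions
# already attached (made-up values; field names are illustrative only).
respondents = [
    {"id": "a", "p_nonzero": 0.91, "predicted_amount": 2400.0},
    {"id": "b", "p_nonzero": 0.12, "predicted_amount": 300.0},
    {"id": "c", "p_nonzero": 0.67, "predicted_amount": 1100.0},
    {"id": "d", "p_nonzero": 0.45, "predicted_amount": 650.0},
    {"id": "e", "p_nonzero": 0.83, "predicted_amount": 1900.0},
    {"id": "f", "p_nonzero": 0.30, "predicted_amount": 500.0},
    {"id": "g", "p_nonzero": 0.58, "predicted_amount": 900.0},
    {"id": "h", "p_nonzero": 0.05, "predicted_amount": 200.0},
]

# Hypothetical share of Current Population Survey records with a non-zero value.
target_share = 0.5

# Number of respondents who should receive a non-zero value to match that share.
n_nonzero = round(target_share * len(respondents))

# Rank from most to least likely, then keep the predicted amount only for
# the top of the ranking; everyone else is assigned zero.
ranked = sorted(respondents, key=lambda r: r["p_nonzero"], reverse=True)
for rank, person in enumerate(ranked):
    person["imputed"] = person["predicted_amount"] if rank < n_nonzero else 0.0

for person in ranked:
    print(person["id"], person["p_nonzero"], person["imputed"])
```

Because the cutoff is set by the target share rather than by a fixed probability such as 0.5, the imputed data reproduces the proportion of non-zero values seen in the donor survey by construction.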

Making predictions is one of the most important parts of policy analysis. We often think of predictions as the outputs of our work, not part of the input. However, with some clever statistical thinking, we can give ourselves access to rich data like the American Community Survey, even if it doesn’t have all the information we need. As long as we can find a good way to impute what is missing, we can take advantage of everything else it has to offer.