Estimating ACS Data for Smaller Counties

A Proposed Methodology

John Blodgett, Missouri Census Data Center
(Not yet for general public access)

Overview of the Problem

It is really exciting that we can now get freshly updated county-level data every year from the American Community Survey. Of course, as this is written (August 2008) we only get that data for the 16 counties here in Missouri with a population of at least 65,000. That is 16 out of 115. Those 16 counties do have about two thirds of the state's population (for the whole country the figure is much higher: just over 83% of the country's population lived in counties with 65,000+ population per the 2006 estimates). On the other hand, the 16 largest Missouri counties comprise only about 1/8th of the land area of the state. So if we were to do a statewide map attempting to show a spatial trend in some interesting data item, such as the poverty rate or the percentage of the population living alone, we would have mostly (about 87%) white space indicating "No Data Available". Similar problems occur when trying to generate the routine reports that researchers and the general public are used to getting with the latest demographic updates. They are used to seeing county-level reports, frequently organized into county-based regions such as MSAs or RPCs. There are a lot of things that we are used to being able to do with decennial census data, or even with intercensal population estimates, that we cannot do with the ACS data.

At least not yet. At the end of this year we shall be getting some 3-year period estimates, which will give us data for another 39 Missouri counties accounting for another 22% of the state's population. The 55 counties with data will then represent about half the land area of the state, so the county-level data map is still going to have a lot of white space. At the national level, the 3-year period estimates will give us data for about 58% of all counties, representing over 95% of the population and just under 50% of the nation's land area. So getting the new estimates is a major step in the right direction, but it still leaves a lot to be desired. In addition to being left only about halfway there in terms of county and land-area coverage, we also now face the prospect of dealing with 3-year period estimates, and of developing a strategy for melding these with the single-year estimates for those counties where we have both.

If we wait another two years, we'll be getting 5-year period estimates, which means we'll have some kind of data for all of the counties. The key phrase in that statement is "some kind". Some kind of data should be fine for counties where nothing much is happening, but much less so for those that are undergoing change. Going back to our opening sentence regarding the "really exciting" prospect of "freshly updated county-level data every year", we need to temper that excitement considerably. Because 5-year period estimates are not really new data for the most recent year; they are just 20% new data packaged with 80% older data that may do a fair job of getting the data right for the majority of smaller counties in the majority of data years. But what they will not do is tell you with any degree of certainty what just happened in that smaller county during the most recent year. That data is simply not available from the ACS.

So is there any way we can make it available? We think there might be. The purpose of this document is to describe a methodology that we propose to use to fill in at least some of the unavailable data so that we can have reasonable estimates of what is happening in every county during every year. We are aware of the inherent limitations of our methodology, but still think it would allow us to have reasonable data for every county, for each single year of ACS data. This would allow us to generate the usual reports, maps, regional summaries, etc. with an acceptable degree of reliability in most cases.

Overview of the Solution

We propose to estimate values for key variables in smaller counties ("these areas") by utilizing three sources of data:

  1. We have 2000 census data for these areas and for PUMA regions (see more below), with many items that are very similar, if not identical, to items collected and reported in the ACS.

  2. We have the official FSCPE (Census) current estimates for the counties, broken down by 5-year age cohorts, sex, race and Hispanic origin. These data are crucial in order for us to get quantity estimates, just as they are used at the larger-county level as controls to effect the quantities being reported in the ACS. From the same source we also have housing unit estimates at the county level.

  3. We have ACS data for a region that is made up, in almost all cases, of a set of smaller counties. We are talking about the Public Use Microdata Area (PUMA) geographic entities. We have readily available tools (MABLE/Geocorr, for example) to build the equivalency data needed to get the precise relationships of counties and their "parent PUMAs". (Many/most larger counties do not have such parent PUMAs, but that is not a problem, since we do not have to estimate data for those counties, only for the smaller ones.)

(These are the sources we have available as of August, 2008. We may want to substitute other data as it becomes available later on from the ACS. Specifically, we may want to substitute recent period estimates for the 2000 census data.)
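As a rough illustration of the county-to-PUMA equivalency data mentioned in item 3, the sketch below holds the relationship as a simple lookup table. The six county names and the PUMA code are those of Missouri PUMA 00400, discussed later in this document; the structure and the names `COUNTY_TO_PUMA` and `counties_by_puma` are our own assumptions for illustration and do not reflect the actual layout of a MABLE/Geocorr output file.

```python
from collections import defaultdict

# Hypothetical, hand-built equivalency table: county name -> parent PUMA code.
# In practice this would be generated by a tool such as MABLE/Geocorr and
# keyed by county FIPS code rather than by name.
COUNTY_TO_PUMA = {
    "Marion": "00400",
    "Monroe": "00400",
    "Montgomery": "00400",
    "Pike": "00400",
    "Ralls": "00400",
    "Randolph": "00400",
}


def counties_by_puma(county_to_puma):
    """Invert the lookup: parent PUMA code -> sorted list of member counties."""
    grouped = defaultdict(list)
    for county, puma in sorted(county_to_puma.items()):
        grouped[puma].append(county)
    return dict(grouped)


puma_members = counties_by_puma(COUNTY_TO_PUMA)
print(puma_members["00400"])
```

With a table like this in hand, each smaller county can be paired with the ACS figures published for its parent PUMA.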

We believe these data sources can be combined to create guesstimated values that, while not perfect, will be quite reasonable in most cases. As with most estimation methods, ours relies heavily on some underlying assumptions. It will be important for data users to understand those assumptions and to trust the resulting estimates according to their knowledge of how well the assumptions apply to a specific county/region. The method involves estimating data for a smaller county by looking at values from the 2000 census for the county and for the PUMA region in which it resides. We assume a constant relationship between qualitative measures for the county vs. its "parent PUMA". (By "qualitative measures" we mean things like means, medians and percentages, as opposed to quantitative measures such as the number of persons or households. Most of the latter measures would be estimated either from FSCPE count estimates or by applying estimated percentage measures to an estimated universe count.) Specifically, we assume that:
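In symbols, the constant-relationship assumption described in this paragraph might be written as follows. The notation is ours, introduced only for illustration: Q_c is a qualitative measure (a mean, median or percentage) for the county, Q_P is the same measure for its parent PUMA, and t is the ACS year being estimated.

```latex
\[
\frac{Q_{c}(t)}{Q_{P}(t)} \;\approx\; \frac{Q_{c}(2000)}{Q_{P}(2000)}
\qquad\Longrightarrow\qquad
\hat{Q}_{c}(t) \;=\; \frac{Q_{c}(2000)}{Q_{P}(2000)} \, Q_{P}(t)
\]
```

Here the right-hand expression is the county estimate: the 2000 census ratio of county to PUMA, applied to the current ACS value for the PUMA.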

Specific Examples of the Method

To take this from the abstract to the concrete, we want to look at some very specific examples. As part of this we'll look at a specific PUMA and the smaller counties that make it up. Missouri PUMA 00400 is located in the northeastern portion of the state, north of the St. Louis area and occupying about half the area between St. Louis and the Iowa border. The largest cities in the region are Hannibal and Moberly, and it contains 6 counties: Marion, Monroe, Montgomery, Pike, Ralls, and Randolph. Two of the 6 counties have a little over 25,000 population and so will have 3-year period estimates available soon. The other 4 are all below the 20,000 limit, with only Pike, at 18,500, having any short-term prospects of becoming eligible for 3-year data.

So what can we say or guess about Marion County in 2006 (the latest year for which ACS data are available as of today; this will change to 2007 within a few weeks)? If you had to guess at what the median family income was for the county, what would you say? Would you just go to the 2000 census, note that the median family income then was estimated to be $41,290, and go with this as your "best guess"? This is what people have been doing for decades: using the latest census figures until the next census comes along. This works pretty well early in the decade, and perhaps in times of very low inflation, but it is clearly not the best we can do, not with the new ACS information available.

Even though we have no data published (yet) for Marion County from the ACS, we do have something. We have data for PUMA 00400, and it tells us that in 2006 the median family income in that region was estimated to be $44,288. We also have data regarding the PUMA from the 2000 census: the median family income then for the PUMA was $39,749. Our method for estimating the current value for the county involves calculating the CPR (county PUMA ratio) as CPR = 41,290 / 39,749, i.e. the ratio of the county value to the PUMA value for the last point in time for which we have values for both. We then apply this ratio to the new (ACS) estimate for the PUMA, and we have our current county estimate. The CPR value is 1.03877, and when we multiply this by the 2006 PUMA figure of 44,288 we get $46,005 as our guesstimate.
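The arithmetic of the Marion County example can be sketched in a few lines of Python. This is a minimal illustration, not production code: the function and constant names are ours, and the three input figures are the ones quoted in this section (the 2000 census medians of $41,290 for the county and $39,749 for the PUMA, and the 2006 ACS median of $44,288 for the PUMA).

```python
def county_puma_ratio(county_base: float, puma_base: float) -> float:
    """CPR: county value divided by PUMA value at the base period (here, 2000)."""
    if puma_base == 0:
        raise ValueError("PUMA base value must be non-zero")
    return county_base / puma_base


def estimate_county_value(county_base: float, puma_base: float,
                          puma_current: float) -> float:
    """Apply the base-period CPR to the current PUMA value."""
    return county_puma_ratio(county_base, puma_base) * puma_current


# Median family income figures quoted in the text.
MARION_MFI_2000 = 41_290       # Marion County, 2000 census
PUMA_00400_MFI_2000 = 39_749   # PUMA 00400, 2000 census
PUMA_00400_MFI_2006 = 44_288   # PUMA 00400, 2006 ACS

cpr = county_puma_ratio(MARION_MFI_2000, PUMA_00400_MFI_2000)
estimate = estimate_county_value(
    MARION_MFI_2000, PUMA_00400_MFI_2000, PUMA_00400_MFI_2006
)

print(f"CPR = {cpr:.5f}")
print(f"Estimated 2006 median family income: ${estimate:,.0f}")
```

Rounded, the two printed values reproduce the CPR of 1.03877 and the $46,005 guesstimate given above.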

There are actually two (mathematically equivalent) ways of looking at the method. Our preferred way is to say that at a point in the not-too-distant past Marion County's median family income figure was about 4% higher than that of the PUMA region containing the county (i.e. the CPR was about 1.04). We therefore make an educated guess that the value of the same statistic for the county in 2006 is about the same amount (4%) higher than the current value for the PUMA. The second way of looking at it is to say that the county value has grown or shrunk in the same proportion as its PUMA parent. The growth ratio for the PUMA is 44,288/39,749 (=1.114), and if we apply this same ratio to the county's value in 2000 (41,290) we arrive at the identical county estimate (46,005) for 2006.
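The equivalence of the two views is easy to check numerically. The sketch below, with function names of our own choosing, computes the estimate both ways from the same three inputs: the county's 2000 census figure of $41,290 and the PUMA figures of $39,749 (2000) and $44,288 (2006). Both are the same product, grouped differently.

```python
import math


def estimate_via_cpr(county_base: float, puma_base: float,
                     puma_current: float) -> float:
    """View 1: hold the county/PUMA ratio constant, apply it to the new PUMA value."""
    return (county_base / puma_base) * puma_current


def estimate_via_growth(county_base: float, puma_base: float,
                        puma_current: float) -> float:
    """View 2: grow the county's base value by the PUMA's growth ratio."""
    return county_base * (puma_current / puma_base)


county_2000 = 41_290.0
puma_2000 = 39_749.0
puma_2006 = 44_288.0

growth_ratio = puma_2006 / puma_2000
via_cpr = estimate_via_cpr(county_2000, puma_2000, puma_2006)
via_growth = estimate_via_growth(county_2000, puma_2000, puma_2006)

print(f"PUMA growth ratio: {growth_ratio:.3f}")
print(f"Estimate via CPR:    {via_cpr:,.0f}")
print(f"Estimate via growth: {via_growth:,.0f}")

# The two results can differ only by floating-point rounding.
assert math.isclose(via_cpr, via_growth, rel_tol=1e-12)
```

The choice between the two views is therefore one of interpretation, not of arithmetic.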