Ten More Things to Know (and Do) About the American Community Survey

(Continued from Page 1 containing the First Five Things)
6. Detailed Data via AFF   |  7. Compatibility Issues   |   8. Custom Extracts from MCDC   |  9. ACS PUMS   |   10. Control Totals/Census

  1. Accessing Detailed Characteristics via AFF. (Intermediate users). We have thus far focused on getting access to the commonly accessed "best seller" (profile) data in the ACS. Now we want to turn our attention to the other end of the ACS data spectrum, where the very detailed and often multi-dimensional tables live. If you work in a State Data Center or other agency where you assist the public in trying to answer their questions using the ACS data as a resource, you should find this step one of the most helpful.

    What we are talking about here is accessing the ACS detailed tables (aka "base" tables). The real star of this show, the easily-overlooked resource that we want to lift up here is the Census Bureau's American FactFinder Detailed Tables/Tables search tools. We'll illustrate with an example. A user calls with a question: they would like to know if they can get data on the poverty ratios of senior citizens within the state. Ideally, they would like to know how many are below 130% of the poverty level. Can you help? The appropriate initial response to such a query is "I'm not sure, let me check and get back to you". So how do you "check"?

    Go to the American FactFinder site, the Data Sets options, and choose the appropriate Data Set. Since it is data at the state level you would probably want the latest single year estimates.

    It looks a lot like the screen shot we saw in Item 4 where we accessed the 3-year Data Profiles. In fact, it is the same screen except for the choices being made upon it (2008 instead of 2006-2008 dataset and Detailed Tables instead of Data Profiles). Click on the Detailed Tables option to proceed to the Geography screen. Here we choose our geography (the easy part when using American FactFinder, at least for something as simple as just a state), and then proceed to the Table Select screen.

    This is where you have to do something to justify why you are not next in line for the next round of budget cuts. Can you answer the question regarding the availability of the detailed poverty ratio data for the elderly population of your state? You are not going to be able to do it (not in a reasonable time frame, at least) by accepting the initial show all tables version of the Select Tables window; that is almost always the least useful of the 3 options. Select by keyword instead and try typing the 3 keywords as shown:

    Apparently somebody (you?) must have typed in the keywords age poverty ratio and then clicked Search, resulting in a new display of only three tables to choose from. There is, of course, the usual combining of art and science involved in choosing the keywords. You have to understand that "Persons over 65" translates to "Age", and that "poverty ratio" is a good phrase for locating tables that contain these kinds of data. This is the sort of thing that gets to be pretty easy after you have done it for a few years.

    So what happened here is that we went from over a thousand ACS detailed tables, down to just these 3 that have been associated with our 3 keywords. We still have to figure out which, if any, of these is going to have information suitable for answering the question at hand. What you shall find if and when you spend time dealing with detailed census tables is that the word Age in the title just means that there is some kind of age-based dimension. It could be persons under and over 18. It could be any of lots of age-cohort groupings which may not include a 65+ category, or some series of smaller cohorts that can be added together to get the 65+ grouping we need. In looking at our 3 choices here we can pretty much eliminate the first one:

    because it deals with a universe of families and we want people. We don't even see any mention of Age in the title. That is because the full title is too big for the box it is being displayed in. The full title can be seen by choosing the table and clicking on the What's this button down and to the right. That's when you'll see the full title: RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS OF FAMILIES BY FAMILY TYPE BY PRESENCE OF RELATED CHILDREN UNDER 18 YEARS BY AGE OF RELATED CHILDREN.

    So that leaves a pair of tables that share the same numeric part in their titles, just different first letters. You should recognize the naming convention demonstrated here: the table beginning with "C" is a condensed version of the larger base table beginning with "B". Using the What's this button a couple of times shows you the exact difference between the two. It turns out there is more age detail on the B version than the C version, but one of the age categories on the C17024 table is just what we were looking for: "65 years and over". Next look at the poverty ratio categories. They are: Under 0.50 | 0.50 to 0.99 | 1.00 to 1.24 | 1.25 to 1.99 | 2.00 to 2.99 | 3.00 to 3.99 | 4.00 to 4.99 | 5.00 and over. Not exactly what was requested, but pretty close. We can sum the first three categories to get the number below 125% of poverty instead of the 130% requested. Go ahead and Add this table to the output box and then hit the Next button to display it. You have the data you need to answer the question and save your job for one more quarter.
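
    The summing step can be sketched in a few lines of Python. This is only an illustration: the counts below are invented placeholders rather than ACS estimates, and the dictionary layout is our own assumption about how you might key in the "65 years and over" rows after displaying the table.

```python
# Hypothetical counts for the "65 years and over" rows of a C17024-style table.
# These numbers are invented for illustration; they are NOT ACS estimates.
seniors_by_poverty_ratio = {
    "Under 0.50": 1_000,
    "0.50 to 0.99": 2_000,
    "1.00 to 1.24": 1_500,
    "1.25 to 1.99": 4_000,
    "2.00 to 2.99": 5_000,
    "3.00 to 3.99": 4_500,
    "4.00 to 4.99": 3_000,
    "5.00 and over": 9_000,
}

# The three categories that together cover ratios below 1.25 (125% of poverty).
BELOW_125 = ("Under 0.50", "0.50 to 0.99", "1.00 to 1.24")


def below_125_percent(counts):
    """Return (count below 125% of poverty, share of all persons in the table)."""
    below = sum(counts[category] for category in BELOW_125)
    total = sum(counts.values())
    return below, below / total


below, share = below_125_percent(seniors_by_poverty_ratio)
print(f"{below:,} seniors below 125% of poverty ({share:.1%} of those 65 and over)")
```

    Note that this only adds the estimates. The margins of error published with each cell cannot simply be added the same way.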

    This is a pretty simple example. Often the questions are considerably trickier and require some knowledge of the data to navigate through the maze of available tables. But the Select Tables / by keyword tool is one of the most useful in the universe of things that make the ACS detailed tables actually useful.

  2. Understanding Data Comparability Issues (Anyone interested in comparing the data). This is not the easiest or the most fun thing to worry about regarding the ACS, but it is something that you must look into before you use the data to say anything about trends. The Bureau has created a series of special matrix-format web pages such as the one for 2008 Single-year data. (There is one for each period, e.g. one for 2007 1-year data, one for 2006-2008 3-year data, etc.) They begin with a section containing general guidelines, and follow this with a detailed matrix where each row corresponds to a specific kind of data (such as ancestry, household and family income, marital status, etc.). The columns correspond to the data source being compared, currently the 2000 decennial census and the most recent ACS data for a comparable number of years. They tell you whether or not the data from the two resources can (or should) be compared, with some cases marked "Compare with caution". Links are then provided to get further explanation regarding the nature of the "caution".

    Another very useful feature here for the serious user is the "table crosswalk" links in the above comparison tables. These provide linkages between Summary File 3 table numbers and ACS Base Tables. This comes in extremely handy for serious users trying to find comparable data across these two important data sources.

  3. Downloading Custom Extracts from MCDC Site. (Somewhat advanced users). This item is for those of you who want to be able to get your hands on the ACS data (as much or as little as you need) and take it home with you. Sometimes this can be handled most easily by using Windows select-copy-paste sequences to take data from a report in html or pdf format and pasting it into a spreadsheet; but that technique is obviously for limited cases. What we are talking about here is something more powerful, where you need more control regarding just what data you want and/or the format of that data. For example, suppose you want to get a certain set of key variables for every state in the country or for every PUMA in your state for the most recent two ACS-survey years and put them in Excel spreadsheets for further analysis. The variables you want may be the standard ones found in the data profile reports from the Census Bureau or MCDC web sites, or they might be something more customized requiring access to the base (or "detailed") tables. The tools we describe here allow you to create such extracts in various formats including comma-delimited (with data-defining rows at the top, easily loaded into Excel), SAS dataset, and pdf or html report files. You get to custom choose your rows/geographic areas (no limit of 4 here -- you can have hundreds or even thousands if you like, or you can have just the specific 5 or 10 you need). Finally, you can choose your subject content, or variables. If what you are interested in is poverty-related items it is easy enough to code a query where you choose only those variables. The datasets from which you'll be extracting have good geographic identifiers: FIPS codes, the Census Bureau's Geoid fields, geographic name fields, etc. are all included on the datasets and thus can be easily kept as part of your downloaded extracts.

    Because of its length and because of the somewhat technical nature of this item we have moved it to a separate web page. To continue with this thing access the Item 8 page. (There is a link at the bottom of the Item 8 page that will bring you right back here.)

  4. ACS PUMS. (Intermediate users with a need for custom tabulations and access to a programmer; and advanced users with access to statistical software and programming skills). There are two important aspects of this item, the what and the how of PUMS processing. Any serious user of the ACS needs to be acquainted with what this capability is about, even if they never plan to personally learn the details of how it all works. In the latter case, it may be sufficient to know that something can be done and then to find somebody (such as your student assistant or your local State Data Center PUMS person) who knows how to make it happen.

    PUMS (Public Use Microdata Sample) data are special files released by the Census Bureau containing disclosure-avoidance-enhanced versions of actual ACS survey questionnaires. The other data we have been talking about, the profiles and the base tables data, are all pre-tabulated summaries of geographic areas. The PUMS is microdata, rather than summary data. It is raw data that allows you to create your own custom data tabulations and analyses. It allows you to consider answering questions such as these:

    1. How many families in Missouri with children under 18 are below 130% of the poverty level?

    2. What is the mean poverty ratio of all persons living in mobile homes in the Missouri portion of the Kansas City metro area?

    3. How many and what percentage of persons over 65 in the state of Illinois share a household with another person over 65 of the opposite sex?

    So what do these data questions have in common that makes them amenable to a PUMS-based solution?

    1. They do not require great geographic detail. The only one of the three that needed something finer than an entire state was the one that involved the KC metro area. (The smallest geographic entity that is identified on a PUMS record is the PUMA, an area specifically designed to be the geography of the PUMS file and required to have a population of at least 100,000.)

    2. They involve a somewhat unusual data categorization, one that has not been included even among the over 1000 pre-tabulated base tables that are available with the ACS. Oftentimes, it is the crossing of dimensions (e.g. a certain age cohort with one or more other characteristics). The Census Bureau simply cannot anticipate all the possible combinations that users may find helpful.

    3. They involve some special household-composition characteristic, such as the elderly sharing a household with another elderly person of the opposite sex.

    So how does this work? The Census Bureau has published a special Compass Handbook for ACS PUMS users.

    For those of you wanting to use SAS (or perhaps have it used on your behalf) to access the ACS PUMS datasets and generate custom tabulations you might find it useful to browse the MCDC's ACS PUMS data collection via our uexplore web utility (at http://mcdc.missouri.edu/cgi-bin/uexplore?/pub/data/acspums). Note that we not only have the datasets, but also the data dictionary files, which are indispensable for processing, and a Tools subdirectory. The latter is a library of SAS programs along with the printed outputs which they produce, which are for the most part fairly simple (and real world) examples of using these data to answer user queries.
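
    For readers who do not use SAS, the basic idea behind any PUMS tabulation (filter the person records, then sum their weights) can be sketched in Python. This is a minimal illustration and not one of the programs in our Tools library. The five records are invented, and the field names (AGEP for age, POVPIP for the income-to-poverty ratio expressed as a percent, PWGTP for the person weight) reflect our understanding of the ACS PUMS data dictionary; check the dictionary file for the year you are using before relying on them.

```python
# Hypothetical person records; a real PUMS file has many more fields and rows.
# Field names are assumed from the ACS PUMS data dictionary:
#   AGEP   = age in years
#   POVPIP = income-to-poverty ratio as a percent (e.g. 129 means 1.29)
#   PWGTP  = person weight
records = [
    {"AGEP": 70, "POVPIP": 110, "PWGTP": 20},
    {"AGEP": 82, "POVPIP": 250, "PWGTP": 15},
    {"AGEP": 67, "POVPIP": 129, "PWGTP": 30},
    {"AGEP": 45, "POVPIP": 90, "PWGTP": 25},
    {"AGEP": 66, "POVPIP": 130, "PWGTP": 10},
]


def weighted_count(rows, predicate, weight="PWGTP"):
    """Estimate a population count by summing weights of records that match."""
    return sum(row[weight] for row in rows if predicate(row))


# All persons 65 and over, and those among them below 130% of poverty.
seniors = weighted_count(records, lambda r: r["AGEP"] >= 65)
poor_seniors = weighted_count(
    records, lambda r: r["AGEP"] >= 65 and r["POVPIP"] < 130
)

print(f"Seniors: {seniors}, below 130% of poverty: {poor_seniors} "
      f"({poor_seniors / seniors:.1%})")
```

    Because the cutoff is just a condition in the filter, the 130% threshold that the pre-tabulated tables could not supply is no harder to get than any other.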

  5. Control Totals and the Decennial Census/ACS Link. (Intermediate users with a curiosity about how they come up with these figures). In most instances for most users of survey data, knowing the gruesome details of how sample weights are assigned to the data is not something that needs to be of much concern. In most cases data users trust that the data providers have advanced statistical training and know what they are about. We trust that they will be doing what's best to make the data useful and as accurate as possible. But with the ACS there is so much being done in this area of statistical weight adjustment, and because it involves what some consider to be a somewhat radical make-it-fit approach, that we think it requires that you have some idea of how it works.

    As most of you know, a control total is just a number that exists independent of the survey and represents a known quantity. It is typically used to assign or adjust a weight to the sample data so that when we aggregate our data using our sample weights we get numbers that match the control figure. For example, suppose we knew that the total population of Missouri was 6 million, and that when we processed the ACS survey data and assigned the initial person weights they summed to 5,700,000. Using our control figure we could go back and adjust the weight assigned to each survey, multiplying it by 6,000,000/5,700,000 (about 1.053, an increase of roughly 5.3%). Now when we use the adjusted weights to aggregate our data it comes out with a total population of 6,000,000 to match the figure that we assume to be correct. This is a very simple example of the use of a control total. The ACS sampling scheme is nowhere near as simple, but it is based on the same concept. In the ACS scheme we don't have a simple total population or total households control figure; what we have is an entire array of control totals that involve the basic demographic categories of age, sex, race and Hispanic origin. These controls are imposed at the county level. The Census Bureau does official estimates of county populations using these four basic demographic categories (crossed with each other, i.e. they estimate the number of Under 18-White-Female-Nonhispanic persons in each county). The exact details are not divulged by the Bureau but the general idea is sufficient to scare many of us. It is this adjustment process that causes the total population of all counties (and, of course, of entities composed of counties such as states and metropolitan areas) to exactly match the most recent official county estimates (or, in the case of multi-year period estimates, the average of the estimates for those years). We understand why the Bureau wants to do this, but we have concerns about the ramifications. 
Specifically, three things bother us about this approach:

    1. These adjustments work at the county level, but what does it do to data at other levels, such as place (city) or census tract? Doesn't it have the potential to distort the data for smaller geographic entities?

    2. The official estimates used as the control totals are derived from the most recent decennial census figures and are point-in-time July 1 estimates. As such they reflect census residency rules, which can lead to significant differences between what the population of an area is by ACS definitions vs. the estimates. College towns and areas with large seasonal populations would be the areas where such differences could be the most dramatic.

    3. The official county estimates are a good best guess at the true population of counties but, unlike decennial census enumerations, they are far from perfect and they tend to be a lot worse in 2009 than they were in 2001. So the numbers that will be used to adjust the 2009 data have to be taken with a grain of salt. (There is nothing built into the MOE measures provided with the ACS data that takes into account the amount of error that may be present because the underlying control totals are wrong.)
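
    To make the simple Missouri example concrete, here is a minimal Python sketch of a single-total ratio adjustment. It illustrates only the one-number case described above, not the Bureau's actual procedure, which controls to an array of age, sex, race and Hispanic-origin cells. The function name and the three group subtotals are hypothetical.

```python
def ratio_adjust(weights, control_total):
    """Scale weights so they sum to control_total; return (new weights, factor)."""
    factor = control_total / sum(weights)
    return [w * factor for w in weights], factor


# Hypothetical initial weighted subtotals for three groups of respondents.
# Together they sum to 5,700,000, as in the example in the text.
initial = [1_900_000.0, 2_280_000.0, 1_520_000.0]

adjusted, factor = ratio_adjust(initial, control_total=6_000_000)

print(f"Adjustment factor: {factor:.4f}")
print(f"Adjusted total:    {sum(adjusted):,.0f}")
```

    Every weight is multiplied by the same factor, so the relative sizes of the groups are unchanged; only the total moves to match the control figure. That is also why a wrong control total carries straight through to every estimate built on the adjusted weights.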

    So where does the decennial census fit in? Every ten years the country spends billions of dollars in order to get an extremely accurate and detailed enumeration of its population. Because it is an enumeration (a "complete count") rather than a sample survey, it is not subject to the kinds of sampling error that the ACS is. Unlike the official estimates program, its demographic profiles are not limited to the county level; we have them all the way down to the smallest census block.

    What happens when the 2010 census results are processed in early 2011 and we get to see just how good (or bad) those official county estimates that we've been using to adjust the ACS figures really are? Can we use the new information to improve the ACS figures? Well, yes, but not without some down side.

    Starting with the ACS data collected in 2010, which gets processed and tabulated in 2011, the Census Bureau will have the results of the 2010 decennial census (perhaps slightly adjusted to go forward from April 1 to July 1, 2010 - there are always footnotes) to use as the control figures. Just as the Bureau wants the 2009 ACS data to match the 2009 official county population estimates, they would like the ACS data coming out in 2011 to more or less line up with the 2010 decennial census data that will have been released earlier that year. They could now do controls for geography considerably smaller than the county level -- they could go all the way down to the smaller neighborhood - census tract and block group - levels. And since tracts and block groups sum to counties, they would still be getting numbers controlled to the county level. That works for 2011 with data for 2010, except that the data published for the smaller neighborhoods are not based just on 2010 data; they are based on five years of data, 2006-2010. What about the control totals for those earlier years, 2006 through 2009? Those data have already been processed and weighted using the old control totals of the time. We already have single year data for 2009 and 3-year period estimates for 2007-2009 and 2006-2008, etc. all based on the official post-censal estimates available at the time. So do we use four years of data with the old estimates used as control totals and one year of the decennial-based new estimates? No. The Bureau has indicated that they will be going back and adjusting the county level estimates for the years 2001-2009 to create what they call Intercensal estimates (as distinguished from postcensal estimates). This adjustment process involves taking the old estimates (including an unpublished 2010 figure) as a time series and plugging in the 2010 census figures. 
The old estimates are then mathematically adjusted so that they arrive at the known final figures, with the discrepancies between the estimates and the counts uniformly distributed across the decade. So if this county was 1000 low in its Hispanic count they can go back and adjust the Hispanic estimates for each year of the decade so that the final estimate matches the decennial figure.
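
    One way to picture "uniformly distributed across the decade" is a straight-line proration of the final-year error, sketched below in Python. We are assuming the linear form purely for illustration; the Bureau's actual intercensal formula is not described here and may well differ. The county figures are invented, with the 2010 postcensal estimate coming in 1000 below the census count as in the example above.

```python
def intercensal_adjust(postcensal, census_count):
    """Spread the final-year error linearly over a postcensal time series.

    postcensal[0] is the base (previous census) year and postcensal[-1] is
    the estimate for the new census year. Linear proration is an assumption
    made for illustration only.
    """
    n = len(postcensal) - 1
    error = census_count - postcensal[-1]
    return [est + error * t / n for t, est in enumerate(postcensal)]


# Hypothetical county Hispanic population estimates for 2000 through 2010.
postcensal = [5_000, 5_100, 5_200, 5_300, 5_400, 5_500,
              5_600, 5_700, 5_800, 5_900, 6_000]

# Suppose the 2010 census counts 7,000, so the 2010 estimate was 1,000 low.
adjusted = intercensal_adjust(postcensal, census_count=7_000)

for year, (old, new) in enumerate(zip(postcensal, adjusted), start=2000):
    print(f"{year}: postcensal {old:,}  intercensal {new:,.0f}")
```

    Under this assumption the base year is left alone, the final year lands exactly on the census count, and each year in between absorbs a proportionate share of the error.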

    Once the Bureau has the new and improved Intercensal county estimates, they plan to use these to go back and adjust the weights on all the ACS data throughout the decade. So the data for the 5-year period estimates released in 2011 will use adjusted weights for each of the earlier years 2006-2009 and the decennial-based weights for 2010. These figures (released for all geographic areas including census tracts, block groups, cities of any size, ZIP codes/ZCTAs, state legislative districts, etc.) are expected to differ significantly from the ones published for 2005-2009. The difference is not just because they differ (by 20%) in the sample universe, but because the surveys are going to be assigned new weights. Will we ever see the 2005-2009 figures (or the 2007-2009 three-year estimates) based on the revised weights published? Probably not. The Bureau worries that putting out such revised, alternate sets of data would be very confusing for users.

    We are fairly (but not totally) certain of the exact details regarding what the Bureau plans to do with respect to releasing reweighted estimates but we should be getting more definitive answers some time in mid to late 2010. At the State Data Center steering committee meeting in February of 2010 Mark Asiala of the DSSD office at the Census Bureau did a presentation regarding what to expect in 2010 and 2011. His presentation (ppt file) titled Break in Series Due to Controls provides a good overview of what to expect.

    Bottom line here, the thing that you really need to know based on all this, is that you can expect a rather serious "bump" in the ACS data that are released in 2011 versus those released in 2010 because of the (new and improved) re-weighting. Trends, especially in counts as opposed to means, medians and percentages, are likely to be very misleading across this statistical divide.