A few years ago, we created Ten Things to Know about the American Community Survey, which became one of our most popular pages. Its purpose was to provide a general overview for people just getting started with this relatively new and important data resource. That page is a little bit dated now, but most of what was true and important then is still true and important today.
But some things have changed, a few pretty dramatically, in the three years since it was written. The ACS has gotten a bit more complicated. We now have things called three-year period estimates, and we are less than a year away from the first of the long-awaited five-year period estimates, which will give us all our favorite small-area geography back.
One of the things that has happened regarding the ACS is that a lot more people have started using it. It is not quite the mystery resource it was back then. The Census Bureau has enhanced the way the data can be accessed via their American FactFinder web query tool, and they have also been creating all kinds of metadata and tutorial material to help users. Several of the things we'll be talking about here will involve these new materials.
We here at the Missouri Census Data Center have been busy as well, creating a number of resources to assist users in finding and accessing the American Community Survey data. We have created our own set of data profiles (modeled after but significantly different from those available from the Census Bureau) and, for us data junkies, datasets that can be accessed using our Uexplore/Dexter software tools. Several more of the items being covered here will reference these resources.
The Census Bureau has created an extensive set of reference products for the ACS. Go to the American Community Survey home page at the Bureau's web site and select "Guidance for Data Users". These provide user-friendly information about the ACS and the new multi-year estimates available in 2008. Each handbook targets a specific user group, including first time ACS data users. There are now a dozen of these handbooks available with specially targeted materials aimed at a certain class of user (such as general data users, business community, or state and local governments).
These are PDF files that run about 65 pages, including about 40 pages taken up with a glossary and 8 appendices, which vary widely in general relevance and degree of difficulty. You certainly do not have to read the entire booklet to get a good handle on what the ACS is all about. But it wouldn't hurt if you read (or at least skimmed) most of it at least once. You can always come back to the sections written by, if not exactly for, statisticians.
Like most things coming from the Census Bureau, most ACS data products consist of data aggregated to various geographic areas: states, counties, cities, census tracts, school districts, etc. You really cannot use the products effectively unless you have a good basic understanding of the geographic dimension of the data. The Census Bureau provides an overview of geographic areas page that is must-read material for anyone hoping to get data out of the ACS. The page currently (early 2010) deals with the units for which there are 1-year and 3-year estimates. At the end of 2010 there will begin to be 5-year estimates, at which time this page is going to get considerably larger as a whole new set of geographic entities (such as ZIP codes, state legislative districts, census tracts and even block groups) will be added. The first summary table ("1a") is a very informative document:
If you understand what this table is telling you, then you have a reasonable expectation of being able to make productive use of ACS data. It is not the specific count values shown that are so important; it is the understanding of the role of population thresholds in determining which areas get which kind of data. For example, the 5th row of this table has statistics for the county geographic level. It shows you the summary level code for counties (050) and indicates that of the total number of such areas (3,220) only 802, or 24.9%, have a population of 65,000 or more, while 1,887, or 58.6%, have populations of 20,000 or more. Why are these figures so relevant to the ACS? Because this means that less than a fourth of the counties in the country are large enough to have 1-year data published for them. Just under 60% are large enough to get data based on three consecutive survey years. This leaves a little over 40% of the nation's counties that will only get data based on five consecutive years of data (and which currently have no ACS data published for them).
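The threshold logic can be written down in a few lines. This is only a sketch of the rule as described above, not a Census Bureau tool: the function name `available_estimates` is our own invention, and the county counts are the ones quoted from table 1a.

```python
# Population thresholds as described in the text (not an official API).
ONE_YEAR_MIN = 65_000
THREE_YEAR_MIN = 20_000

def available_estimates(population):
    """Return the ACS period estimates an area of this size qualifies for."""
    periods = ["5-year"]  # every area is covered by the 5-year estimates
    if population >= THREE_YEAR_MIN:
        periods.insert(0, "3-year")
    if population >= ONE_YEAR_MIN:
        periods.insert(0, "1-year")
    return periods

# County counts quoted from table 1a.
total_counties = 3220
share_1yr = round(100 * 802 / total_counties, 1)        # counties of 65,000+
share_3yr = round(100 * 1887 / total_counties, 1)       # counties of 20,000+
share_5yr_only = round(100 * (total_counties - 1887) / total_counties, 1)
```

A county of 30,000 people, for instance, would get 3-year and 5-year estimates but no 1-year estimates.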
We have a big jump here, as we go from pages you access to get a handle on the data, to ones that actually let you access the data. A profile, in Census terminology, is a report that focuses on a single geographic area, and provides carefully chosen key indicator data to help you get a quick overview of that area. Profiles typically have a subject area or theme associated with them; so we have economic profiles, demographic profiles, social profiles, etc. We are going to point you to a pair of web sites for accessing similar (but not identical) ACS profile reports, one at the Census Bureau and one at the Missouri Census Data Center site. Though very similar in content, you'll notice significant differences in presentation format and linkages.
If you are almost anywhere on the Missouri Census Data Center (MCDC) web site you should see a light blue navigation box labeled Data Applications. The first link is ACS Profiles; this link takes you to the main menu page for the MCDC ACS Profiles dynamic report generator. The front end here is fairly simple. You get to choose from three drop-down menus across the top: a time period, a state, and an area type. Based on your choices from these three menus, you will see the available choices in the dynamically-generated select list on the left of the screen. So, for example, if you choose the 2006-2008 time period (the default), Missouri as the state (again, it's the default), and Places (Cities) as the type of area, what you should see in the select menu is a list of all cities in Missouri that had 3-year data for this 3-year time period. You can now choose any city or cities (up to 4) from the choices displayed. The chosen areas appear in the "cart" window on the right. The order in which you make your choices is significant, in that this is the left-to-right order in which the data will appear in the profiles. Here is what the menu page looks like after making these choices:
Note that we also have some checkboxes across the bottom corresponding to the four sub-profiles (demographic, economic, social, and housing) that can be used to be selective about which of the four topics are of interest. At this point click on the Generate report button and you should see something like this:
Use this URL to bypass the menu and display this report as specified. Then you can scroll down and see the entire report, which is rather lengthy. There is actually quite a bit that can be said about these profile reports, and we tried to say most of it in the usage notes link at the bottom of the report.
Here is a close-up view of a portion of the economic profile portion of the report, showing a table of household income distribution for the two cities in our report. Columbia is over twice as large as Jefferson City, so you might expect smaller sampling errors in the larger town.
This illustrates the use of fonts to indicate statistical reliability without sacrificing any valuable screen space. Focus on the Jefferson City column of numbers. Of the ten income interval counts ("Less than $10,000" through "$200,000 or more") we see that only the $50,000 to $74,999 interval is displayed using a bold font, while the entry for the last interval is barely readable due to a very light font. This is a quick and easy visual way of warning the user that the 240 estimate of households with the highest income level is not very statistically reliable. If the user places the mouse pointer over the 240 value in the report, the program will display
+/- 47.5% (126, 354), which reports the relative margin of error (i.e., the MOE as a percent of the estimate) followed by the confidence interval. There is a 90% chance that the true count is somewhere between 126 and 354 (that is, 240 plus or minus 114), which is a rather large grain of salt. Details of how this works are provided on the usage notes page.
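The arithmetic behind that pop-up is easy to reproduce. The sketch below is our own illustration, not the code behind the profile application; the function name `moe_summary` is hypothetical, and the margin of error of 114 is derived from the figures above (47.5% of 240).

```python
def moe_summary(estimate, moe):
    """Return (relative MOE in percent, lower bound, upper bound).

    `moe` is the margin of error at the 90% confidence level, the level
    used for published ACS margins of error.
    """
    relative = round(100 * moe / estimate, 1)
    return relative, estimate - moe, estimate + moe

# Jefferson City households with income of $200,000 or more.
relative_moe, lower, upper = moe_summary(240, 114)
```

A relative MOE near 50% means the interval is almost as wide as the estimate itself, which is why the report displays such values in a very light font.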
We have to admit to an obvious bias in terms of which ACS profiles application we prefer. It comes as a result of our looking at what the Bureau did with their profiles and then designing ours to serve as an alternative that kept what we liked about theirs and added lots of stuff we thought would make it better. Their site is official, it will almost certainly always be there, and the data will almost never contain an error. Our site is not official, will be there as long as our state funding holds out, and may occasionally contain an error, especially in the first few days of a new release.
Access the Bureau's data profiles by starting at the American FactFinder site and choosing the data sets option, with American Community Survey selected and whichever year/period you would like. Keep in mind that if you choose the 1-year period instead of the 3-year period, you will have many fewer geographic areas to choose from. On the other hand, the 3-year data are not as timely. These profiles are well formatted and contain extensive footnotes. The footnotes are sometimes helpful, but if you want to print the same profile a lot, they can become a bit of a paper-waster. There is a special narrative profile, where they put highlights into regular English and throw in some bar charts. We are not overly impressed, but they are popular among some users.
The trend reports many people would like to see are where we look at data from the 2000 census and compare that to the latest data from the ACS. But that we do not have, and perhaps never will. There are many issues regarding comparability of data collected from these two surveys (see item 7). So that leaves us with trends within the ACS data based upon different time periods.
There is a general rule of thumb that prohibits comparing multiyear period figures where the two periods overlap (at least for the sake of detecting any trend). If you accept that limitation, then the only trends we have to date (2010) are for single-year data. (We will not be able to do trends using 3-year period estimates until we have two non-overlapping 3-year periods. That will not be the case until they release the 2008-2010 period estimates in late 2011.) For now, you can access the Census Bureau's comparison profile products or the MCDC's ACS Trends application. This application is a companion to the MCDC's ACS Profiles application (see above). It only allows you to select a single geography. Here is a snapshot of a portion of an ACS Trends report for the state of Missouri, comparing 2016 to 2015:
We have thus far focused on getting access to the commonly accessed profile data in the ACS. Now, we want to turn our attention to the other end of the ACS data spectrum, where the very detailed and often multi-dimensional detailed tables live. If you work in a State Data Center or other agency where you assist the public with trying to answer their questions using the ACS data as a resource, you should find this step one of the most helpful.
What we are talking about here is accessing the ACS detailed tables (aka base tables). The real star of this show, the easily-overlooked resource that we want to lift up here is the Census Bureau's American FactFinder detailed tables/tables search tools. We'll illustrate with an example. A user calls with a question: they would like to know if they can get data on the poverty ratios of senior citizens within the state. Ideally, they would like to know how many are below 130% of the poverty level. Can you help?
Go to the American FactFinder site, Advanced Search, the Data Sets option. Choose the appropriate data set. Since it is data at the state level, you would probably want the latest single year estimates.
Proceed to the Geographies selector and add Missouri:
Can you answer the question regarding the availability of the detailed poverty ratio data for the elderly population of your state? Enter the words "age poverty ratio" into the refine box and click Go:
So what happened here is that we went from over a thousand ACS detailed tables, down to just these few that have been associated with our keywords. We still have to figure out which, if any, of these is going to have information suitable for answering the question at hand. What you shall find if and when you spend time dealing with detailed census tables is that the word "Age" in the title just means that there is some kind of age-based dimension.
In looking at our choices here, we can eliminate the ones concerning health insurance. That leaves a few tables that share the same numeric part in their titles, just different first letters. You should recognize the naming convention demonstrated here: The table beginning with "C" is a condensed version of the larger base table beginning with "B". You can then click on table titles to view the data in each, or tick the checkboxes to view several at once.
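If you keep lists of table IDs around, the B/C convention is simple to exploit in a script. The sketch below pairs tables that share a numeric part; the IDs in it are made-up placeholders, not real ACS table numbers, and the function name is our own.

```python
from collections import defaultdict

def group_by_number(table_ids):
    """Group table IDs that share a numeric part, keyed by that part.

    Each group maps the leading letter ("B" for the base table, "C" for
    the condensed version) to the full table ID.
    """
    groups = defaultdict(dict)
    for table_id in table_ids:
        prefix, number = table_id[0], table_id[1:]
        groups[number][prefix] = table_id
    return dict(groups)

# Placeholder IDs for illustration only.
groups = group_by_number(["B99901", "C99901", "B99902"])
has_condensed = {number: "C" in found for number, found in groups.items()}
```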
This is a pretty simple example. Often the questions are considerably trickier and require some knowledge of the data to navigate through the maze of available tables.
This is not the easiest or the most fun thing to worry about regarding the ACS, but it is something that you must look into before you use the data to say anything about trends. The Bureau has created a series of special matrix-format web pages such as the one for 2008 single-year data. They begin with a section containing general guidelines, followed by a detailed matrix where each row corresponds to a specific kind of data (such as ancestry, household and family income, marital status, etc.). The columns correspond to the data source being compared, currently the 2000 decennial census and the most recent ACS data for a comparable number of years. They tell you whether the data from the two resources can (or should) be compared, with some cases indicated "Compare with caution". Links are then provided to get further explanation regarding the nature of the "caution".
Another very useful feature here for the serious user is the "table crosswalk" links in the above comparison tables. These provide linkages between summary file 3 table numbers and ACS base tables. This comes in extremely handy for serious users trying to find comparable data across these two important data sources.
This item is for those of you who want to get your hands on the ACS data and take it home with you, where you need more control regarding just what data you want and/or the format of that data.
For example, suppose you want to get a certain set of key variables for every state in the country or for every PUMA in your state for the most recent two ACS-survey years and put them in Excel spreadsheets for further analysis. The variables you want may be the standard ones found in the data profile reports from the Census Bureau or MCDC web sites, or they might be something more customized requiring access to the base (detailed) tables.
The tools we describe here allow you to create such extracts in various formats, including comma-delimited, SAS dataset, and PDF or HTML report files. You get to choose your rows/geographic areas (no limit of four here). Finally, you can choose your subject content, or variables. The datasets from which you'll be extracting have good geographic identifiers: FIPS codes, the Census Bureau's Geoid fields, geographic name fields, etc. are all included on the datasets and thus can be easily kept as part of your downloaded extracts.
Because of its length and because of the somewhat technical nature of this item, we have moved it to a separate web page.
There are two important aspects of this item: the what and the how of PUMS processing. Any serious user of the ACS needs to be acquainted with what this capability is about.
PUMS (Public Use Microdata Sample) data are special files released by the Census Bureau containing disclosure-avoidance-enhanced versions of actual ACS survey questionnaires. The other data we have been talking about (profiles and base tables) are all pre-tabulated summaries of geographic areas. The PUMS is microdata, rather than summary data. It is raw data that allows you to create your own custom data tabulations and analyses. It allows you to consider answering questions such as these:
So what does each of these data questions have in common that makes it amenable to a PUMS-based solution?
So how does this work? The Census Bureau has published a special Handbook for ACS PUMS users.
Those wanting to use SAS to access the ACS PUMS datasets and generate custom tabulations might find it useful to browse the MCDC's ACS PUMS data collection via our Uexplore utility. Note that we not only have the datasets, but also the data dictionary files, which are indispensable for processing.
In most instances for most users of survey data, knowing the gruesome details of how sample weights are assigned to the data is not something that needs to be of much concern. In most cases, data users trust that the data providers have advanced statistical training and know what they are about. We trust that they will be doing what's best to make the data useful and as accurate as possible. But with the ACS, there is so much being done in this area of statistical weight adjustment, and because it involves what some consider to be a somewhat radical make-it-fit approach, that we think it requires that you have some idea of how it works.
A control total is just a number that exists independent of the survey and represents a known quantity. It is typically used to assign or adjust a weight to the sample data, so that when we aggregate our data using our sample weights, we get numbers that match the control figure.
For example, suppose we knew that the total population of Missouri was 6 million, and that when we processed the ACS survey data and assigned the initial person weights they summed to 5,700,000. Using our control figure, we could go back and adjust the weight assigned to each survey response, multiplying it by 6,000,000/5,700,000, or about 1.053 (an increase of about 5.3%). Now when we use the adjusted weights to aggregate our data, it comes out with a total population of 6,000,000 to match the figure that we assume to be correct. This is a very simple example of the use of a control total. The ACS sampling scheme is nowhere near as simple, but it is based on the same concept.
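Here is the same simple example as a short sketch. The three weights are invented so that they sum to 5,700,000; a real file would have one weight per person record, and the function name `adjust_to_control` is our own.

```python
def adjust_to_control(weights, control_total):
    """Scale every weight by the same factor so they sum to the control."""
    factor = control_total / sum(weights)
    return [w * factor for w in weights], factor

# Invented initial weights that sum to 5,700,000.
initial_weights = [1_900_000.0, 1_900_000.0, 1_900_000.0]
adjusted_weights, factor = adjust_to_control(initial_weights, 6_000_000)

# factor is about 1.053, and the adjusted weights now sum to 6,000,000
# (up to floating-point rounding).
```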
In the ACS scheme, we don't have a simple total population or total households control figure; what we have is an entire array of control totals that involve the basic demographic categories of age, sex, race, and Hispanic origin. These controls are imposed at the county level. The Census Bureau does official estimates of county populations using these four basic demographic categories (crossed with each other, i.e., we estimate the number of Under 18-White-Female-Nonhispanic persons in each county). The exact details are not divulged by the Bureau, but the general idea is sufficient to scare many of us. It is this adjustment process that causes the total population of all counties (and entities composed of counties) to exactly match the most recent official county estimates (or, in the case of multi-year period estimates, the average of the estimates for those years).
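To make the idea of an array of control totals concrete, here is a generic post-stratification sketch: each record's weight is scaled so that every demographic cell sums to its own control. This is emphatically not the Bureau's actual procedure, whose details are not published; the cells, weights, and controls below are all invented for illustration.

```python
from collections import defaultdict

def poststratify(records, controls):
    """Scale weights so each demographic cell sums to its control total.

    records:  list of (cell, weight) pairs
    controls: dict mapping cell -> control total for that cell
    """
    cell_sums = defaultdict(float)
    for cell, weight in records:
        cell_sums[cell] += weight
    return [(cell, weight * controls[cell] / cell_sums[cell])
            for cell, weight in records]

# Hypothetical county with two age-sex-race-Hispanic origin cells.
young = ("Under 18", "Female", "White", "Nonhispanic")
older = ("65 and over", "Male", "Black", "Nonhispanic")
records = [(young, 40.0), (young, 60.0), (older, 50.0)]
controls = {young: 120.0, older: 45.0}

adjusted = poststratify(records, controls)
county_total = sum(weight for _, weight in adjusted)
```

Because every cell is forced to its control, the county total automatically equals the sum of the controls, which is why the ACS totals match the official estimates exactly.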
We understand why the Bureau wants to do this, but we have concerns about the ramifications. Specifically, three things bother us about this approach:
So where does the decennial census fit in? Every ten years the country spends billions of dollars in order to get an extremely accurate and detailed enumeration of its population. Because it is an enumeration (a complete count) rather than a sample survey, it is not subject to the kinds of sampling error that the ACS is. Unlike the official estimates program, these demographic profiles are not limited to the county level; we have them all the way down to the smallest census block.
What happens when the 2010 census results are processed in early 2011 and we get to see just how good (or bad) those official county estimates that we've been using to adjust the ACS figures really are? Can we use the new information to improve the ACS figures? Well, yes, but not without some downside.
Starting with the ACS data collected in 2010, which gets processed and tabulated in 2011, the Census Bureau will have the results of the 2010 decennial census (perhaps slightly adjusted to go forward from April 1 to July 1, 2010 — there are always footnotes) to use as the control figures. Just as the Bureau wants the 2009 ACS data to match the 2009 official county population estimates, they would like the ACS data coming out in 2011 to more or less line up with the 2010 decennial census data that will have been released earlier that year. They could now do controls for geography considerably smaller than the county level: They could go all the way down to the smaller neighborhood (census tract and block group) levels. And tracts and block groups sum to counties; they would still be getting numbers controlled to the county level. That works for 2011 with data for 2010, except that the data published for the smaller neighborhoods is not based just on 2010 data; it is based on five years of data, 2006-2010.
What about the control totals for those earlier years, 2006 through 2009? That data has already been processed and weighted using the old control totals of the time. We already have single year data for 2009 and 3-year period estimates for 2007-2009 and 2006-2008, etc. all based on the official post-censal estimates available at the time. So do we use four years of data with the old estimates used as control totals and one year of the decennial-based new estimates? No. The Bureau has indicated that they will be going back and adjusting the county level estimates for the years 2001-2009 to create what they call intercensal estimates (as distinguished from postcensal estimates). This adjustment process involves taking the old estimates (including an unpublished 2010 figure) as a time series and plugging in the 2010 census figures. The old estimates are then mathematically adjusted so that they arrive at the known final figures, with the discrepancies between the estimates and the counts distributed uniformly across the decade. So if this county was 1000 low in its Hispanic count, they can go back and adjust the Hispanic estimates for each year of the decade so that the final estimate matches the decennial figure.
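One simple way to picture this adjustment is to spread the end-of-decade error evenly over the years, so that the 2000 base is untouched and the 2010 figure lands exactly on the census count. The sketch below assumes that linear spreading; the Bureau's actual formula may well differ, and the county series is invented.

```python
def intercensal_adjust(postcensal, census_count):
    """Spread the end-of-decade error linearly across the series.

    postcensal: estimates for years 0..n, where year 0 is the base year
                (the previous census) and year n is the new census year.
    """
    n = len(postcensal) - 1
    error = census_count - postcensal[-1]
    return [estimate + error * t / n for t, estimate in enumerate(postcensal)]

# Invented series for 2000 through 2010, ending 1,000 below the census.
postcensal = [10_000 + 100 * t for t in range(11)]  # 10,000 ... 11,000
adjusted = intercensal_adjust(postcensal, census_count=12_000)

# The 2005 estimate receives half of the 1,000 correction.
```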
Once the Bureau has the new and improved intercensal county estimates, they plan to use these to go back and adjust the weights on all the ACS data throughout the decade. So the data for the 5-year period estimates released in 2011 will use adjusted weights for each of the earlier years 2006-2009 and the decennial-based weights for 2010. These figures (released for all geographic areas including census tracts, block groups, cities of any size, ZIP codes/ZCTAs, state legislative districts, etc.) are expected to differ significantly from the ones published for 2005-2009. The difference is not just because they differ (by 20%) in the sample universe, but because the surveys are going to be assigned new weights. Will we ever see the 2005-2009 figures (or the 2007-2009 three-year estimates) based on the revised weights published? Probably not. The Bureau worries that putting out such revised, alternate sets of data would be very confusing for users.
We are not certain of the exact details regarding what the Bureau plans to do about releasing reweighted estimates, but we should be getting more definitive answers some time in mid to late 2010. At the State Data Center Steering Committee meeting in February of 2010, Mark Asiala of the DSSD office at the Census Bureau did a presentation regarding what to expect in 2010 and 2011. His presentation (PowerPoint file), titled Break in Series Due to Controls, provides a good overview of what to expect.
Bottom line here, the thing that you really need to know based on all this is that you can expect a rather serious "bump" in the ACS data that are released in 2011 versus those released in 2010 because of the (new and improved) re-weighting. Trends, especially in counts as opposed to means, medians and percentages, are likely to be very misleading across this statistical divide.