Census data is almost all summary data. It usually starts with a survey, but the results are provided as summaries, and the first and universal stratifier of such summaries is a geographic area. We do not get census data telling us that, according to the latest ACS data, Joe and Mary Miller have a household income of $15,000 and are therefore below the poverty level. What we do get is information regarding the median household income for the state of Illinois or the percentage of persons below the poverty level for the city of Chicago. It's always a summary statistic and it always describes a geographic area. The subject of this essay is the extensive collection of geographic area types for which we can get census data, with particular emphasis on the coding scheme used to identify all the types and the way in which they relate to one another. We shall provide examples of how all this is reflected in the Missouri Census Data Center's archive datasets and will conclude the module with a somewhat extended example of how an understanding of the geography and summary level codes is crucial to doing a custom data query.
The Basic Geographies
If you use the Bureau's American FactFinder web-based data access system you have probably followed the link they feature on that
site optimistically labeled "Explain Census Geography". It takes you to the following chart with explanation:
We count 27 different geographic entities on this chart, which they claim represent all the "types for which data are available in FactFinder" (we're pretty sure there are actually more, or at least there should be). The Bureau publishes a slightly different version of this chart at http://www.census.gov/geo/www/geodiagram.pdf (linked to from the top of their very helpful Reference Resources for Understanding Census Bureau Geography page). It's a 2-page pdf and does not have the explanatory text, but it's a somewhat nicer graphic.
There are two basic categories for the entities shown on these charts:
- Those that have legal status and are not controlled by the Bureau (includes counties, places (incorporated cities), congressional districts, state legislative districts, school districts, etc.)
- Those which the Bureau is responsible for defining (census tracts, block groups, blocks, Public Use Microdata Areas, ZCTAs, etc.)
What is not at all apparent in the Bureau's explanation page is just how complicated the world of census geography really can be. The thing to keep in mind is that census geography, once you get far beyond the very basics, is way too complicated to cover in a web page or two. There are gotchas and footnotes and OMB Tech Docs associated with just about every one of the geographies listed on the Bureau's diagram. One of the most complicating factors which is not addressed by this diagram and explanation is the time dimension. Regions, Divisions and States are no problem, since they tend to stay put over time. You might think Counties would fall into that category as well, but not so. There are small changes going on to counties all the time. Since the 2000 census we have had county changes in Colorado, Alaska and Virginia. (See Substantial Changes to Counties and County Equivalent Entities: 1970-Present for details.) The down-the-middle census geography hierarchy of Census Tracts - Block Groups - Blocks is redefined every ten years. So the entities in use when accessing a 1990 STF1 dataset are not the same ones used for tabulating a 2000 SF1 dataset. Same concept, and in many areas you'll see a lot of unchanged tracts, but absolutely a different geographic layer. The 2010 tract-BG-block geography has been mostly defined (the Bureau makes final tweaks based on what they find in taking the census) and will be unveiled next March (or so) when the first results of the 2010 census are released. The Public Use Microdata Areas ("PUMA"s) are similar, although the 2010 PUMAs will not be defined until after the 2010 population counts become available - probably in 2012 or 2013.
Blocks As Atomic Units
In the explanatory text portion of the Bureau's geography chart they say: Notice that many lines radiate from blocks, indicating that most geographic types can be described as a collection of blocks.
Actually there is only one entity shown on the chart that is not built from blocks, and that is ZIP codes. But one could argue that ZIP codes do not belong on the chart, since the Bureau does not really publish any data at the ZIP code level, strictly speaking. They use ZCTA's as a proxy to provide data for users who would really like to have it by true ZIP codes but understand that ZCTA's are close enough.
So blocks are very important geographic entities. Not for data reporting, per se -- there is very little data available at the block level, mostly just basic census counts from the decennial census. But everything-is-made-of-blocks turns out to be a very useful situation for someone trying to do analyses that involve relationships between the various geographic layers. It turns out to be the basis for building a large set of block-level geographic-lookup tables that we call the Master Area Block Level Equivalency ("MABLE") files. These tables, when combined with the geocorr2k web application, provide a tool that lets users generate reports/spreadsheets documenting and measuring geographic relationships (for example, what counties a ZIP code intersects with, and the population of those intersections). See the MABLE/Geocorr application (on this MCDC web site).
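To make the everything-is-made-of-blocks idea concrete, here is a minimal sketch of how a block-level equivalency table can be rolled up to measure the intersections between two larger geographies. It is not the MABLE/Geocorr code; the record layout, the field names (block, county, zcta, pop), the function name correlate and all of the values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical block-level equivalency records: each block carries the codes
# of the larger areas that contain it, plus its population.
# All identifiers and counts are invented for this example.
BLOCKS = [
    {"block": "B0001", "county": "C1", "zcta": "Z1", "pop": 120},
    {"block": "B0002", "county": "C1", "zcta": "Z1", "pop": 80},
    {"block": "B0003", "county": "C2", "zcta": "Z1", "pop": 50},
    {"block": "B0004", "county": "C2", "zcta": "Z2", "pop": 200},
]


def correlate(blocks, source, target, weight="pop"):
    """Sum the weight of every source/target intersection and express each
    intersection as a share of its source area's total weight."""
    cell = defaultdict(int)    # (source code, target code) -> summed weight
    total = defaultdict(int)   # source code -> summed weight
    for b in blocks:
        cell[(b[source], b[target])] += b[weight]
        total[b[source]] += b[weight]
    return [
        {source: s, target: t, weight: w,
         "share": w / total[s] if total[s] else 0.0}
        for (s, t), w in sorted(cell.items())
    ]


# Which counties does each ZCTA intersect, and with how many people?
rows = correlate(BLOCKS, "zcta", "county")
for row in rows:
    print(row)
```

Because no block is split by any of the larger areas, a simple sum over blocks is enough; no spatial overlay is needed.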
History buffs will be interested in knowing that this situation of defining census blocks in such a way that they are not split by any
other census-recognized geographic entities did not start until 1990. Lots of things have changed about blocks over the decades. They started out as 3-digit codes, then became 3-digit with a 1-character alpha suffix, and then finally the current 4-digit numbers. We believe they are going to get 5-character values for 2010.
Geography Over Time
The time factor is particularly troublesome for entities such as places (cities), school districts and ZIP codes, which tend to change all the time. While the Bureau will always tabulate population estimates for places using the latest available boundaries, the data from the previous census will always be frozen to reflect the city's boundaries as of January 1 of the census year. (Which many of us think is a good thing.) We mention ZIP codes, but the Bureau really does not publish very much (nothing?) based on true ZIP codes, since ZIP codes are not the kind of entity that lends itself to use as a geographic unit. Many ZIP codes represent buildings, not communities. The Bureau came up with the ZCTA concept to allow them to tabulate data at some unit that approximates ZIP codes, but ZCTAs differ from true ZIP codes in ways that are not always trivial, depending on the application. See our All About ZIP Codes page for a more detailed discussion of these.
Congressional Districts are subject to change every two years, but for the most part only undergo major changes every ten years following a post-decennial-census reapportionment. But smaller changes can and do occur throughout the decade (especially in Texas) and getting data for the instances of this coverage can be difficult or impossible.
Metropolitan Areas is an umbrella term that actually covers a number of different geographic entities. If you take the term as it has been used over time then there are numerous other entities that could be included. On the alternate chart we mentioned above, they have an entry labeled "Core Based Statistical Areas" instead of Metropolitan Areas. The "CBSA" terminology, used to describe a new system for defining these kinds of urban media-market type areas that was developed around 2000, has been discouraged by the Bureau and is not seen too often any more on their web site or in their documentation. A CBSA is either a Metropolitan Statistical Area or a Micropolitan Statistical Area, depending on the size of the core area (it's complicated). The CBSA entities were created as a replacement for the previous generation of comparable entities used from about 1982 through 2000, which were referred to as MSA's, CMSA's (Consolidated MSA's) and PMSA's (Primary MSA's). MSA's were simple stand-alone metro areas like St. Louis and Kansas City, while CMSA's were much larger urban entities that were made up of adjoining PMSA's. (For example, the Dallas-Fort Worth, TX CMSA is comprised of the Dallas and Ft. Worth PMSA's.) It was a little bit complicated and confusing and the CBSA's were an attempt to improve on the concept, especially by introducing the Micropolitan Statistical Areas, which were just like the MSA's but on a smaller scale. Accompanying the new CBSA entities came two other related entities, Combined Statistical Areas (CSA's - combinations of adjoining CBSA's) and Metropolitan Divisions (sub-areas of CBSA's, which are roughly equivalent to the PMSA's of the earlier system). Beyond just trying to keep up with the various entities and the summary levels and codes that go with them, there is also the variation over time, which makes using these geographic areas challenging. The CBSA's can and do change year by year.
In most years the changes are rather few and somewhat small, but not always. You add a county here or there, occasionally (rarely) a county gets subtracted. New CBSA's can be created at any time during the decade and occasionally one gets decommissioned because it no longer meets the criteria. These changes do not draw much publicity and are easy to miss. OMB issues the formal bulletins that signal when changes occur. The Bureau then incorporates those changes into a text file with complete definitions (sometimes several months following the official OMB release). We currently use the definitions posted on the Bureau's web site at http://www.census.gov/population/www/metroareas/lists/2008/List1.txt, which advertises itself as being updated as of November, 2008 (which is about 20 months ago as we write this, and about 10 months prior to the "Internet Release Date: August 2009" specified in the file header). But as far as we can tell these are the latest definitions.
The 2000 Census Summary File 3 Summary Level Sequence Chart
A familiar sight to anyone who has ventured into the field of Census Bureau technical documentation for their data products (specifically, for the "Summary (Tape) File" products) is the document summarizing the geographic levels being summarized on these files. Here is a partial snapshot (only about a fourth of the entire 2-page document) of the Summary Level Sequence Chart provided with the 2000 Census SF3 data product.
Some things to note about the SLSC:
- The Geographic Component column alerts the user to the existence of special summaries that only consider a subpopulation of the area. For example a geographic component code of 01 indicates an "Urban portion" summary, 02 means an "Urban in central place of Urban Area" summary, etc. For most users all this means is that they are probably going to want to filter out those rows where the value of the Geographic Component code is not 00 (the code indicating that it is not a geographic component summary, but rather a summary of the entire geographic area).
- The summary level codes and their meanings are indicated in the rightmost Summary level column. The first row entry of 040 State says we have state level summaries on this file indicated by a summary level code of 040. The second line indicates County level summaries with a code of 050. Note the indentation, which is critical to understanding these charts, as it indicates a hierarchy of entities. Counties are obviously nested within states. The footnote reference is provided to remind or inform the user that when we say "County" we include other entities that serve as county equivalents, such as parishes in Louisiana or boroughs in Alaska.
- Note the use of hyphens and slashes in the Summary level description, as explained above the table. In the row for level 070, for example, we have 3 dashes which means a 4-level hierarchy (starting with State and ending with Place/Remainder). The "/" in the last level says that this can either be a place (city or Census Designated Place) or it can be the "Remainder" of a county--the portion not within any place.
- The four codes 070, 080, 085 and 090 form a classic census geography hierarchy in which each level is subordinate to the previous one, all of them contained within the County Subdivision level. In Missouri a County Subdivision is called a "township" (in New England it is called a "town", and other names apply in other states). These levels are referred to as "split" (or "hierarchal") geographies, i.e. as "split place" (070), "split tract" (080), and "split block group" (090). There are other summary levels (160, 140 and 150, appearing just below in the chart) which are the "un-split" summaries for these geographies. Confused?
- Being from Missouri, I have a certain bias regarding what summary levels are most useful and which ones we could almost get by without. Because it is rare for Missourians to care about township geography, there is not much interest in data tabulated to any of the geographies in the 070 to 090 hierarchy. These are very voluminous levels which typically can occupy a large majority of the space on a census summary file, and yet people in Missouri almost never care to use them. The exception to the rule is the 090 ("split block group") summaries. These are important summaries not because anyone cares about such data per se, but because they serve as building blocks when aggregating census data to other geographic levels. This is because these are the smallest geographic areas on Summary File 3, which means the smallest unit for long-form (aka "sample") data in the census. If you have only the 090 level summaries and the right programmer/software you can just about recreate through aggregation any of the component geographies (e.g. place, complete census tract, township, etc.).
- When people refer to "tract" and "block group" level data they are almost always referring to the un-split versions - summary levels 140 and 150, respectively. We (and the Census Bureau) refer to these un-split levels as "inventory" levels and the split versions as "hierarchal" levels. This can somewhat explain why within the MCDC data archives you will sometimes see data files with names such as moi and moh. These contain inventory and hierarchal summaries, respectively. In 2000 there were 12,631 inventory summaries for Missouri and 30,172 hierarchal summaries. The moh dataset is 2.5 times larger than the moi dataset and about 1/10th as useful. See this page for a report indicating the SumLev variable values for the moh dataset and how often each occurs on the dataset, and this page for comparable information for the moi dataset. (These datasets do not contain any geographic component summaries - those are stored separately, in a usgeocomps dataset.)
- See the complete Summary Level Sequence Chart document stored in the MCDC data archive at
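Two of the points in the list above, dropping geographic component summaries and using the 090 rows as building blocks, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (sumlev, geocomp, county, cousub, tract, bg, pop), the helper names whole_area and aggregate, and every value in RECORDS are invented, and are not taken from any actual summary file.

```python
from collections import defaultdict

# Invented records in the spirit of a summary file extract.
# geocomp "00" = whole-area summary; "01" = urban portion (per the chart).
RECORDS = [
    {"sumlev": "040", "geocomp": "00", "county": None, "cousub": None,
     "tract": None, "bg": None, "pop": 1100},
    {"sumlev": "040", "geocomp": "01", "county": None, "cousub": None,
     "tract": None, "bg": None, "pop": 700},
    # Split block groups (090): tract 9501, block group 1 is divided
    # between two county subdivisions, A and B.
    {"sumlev": "090", "geocomp": "00", "county": "001", "cousub": "A",
     "tract": "9501", "bg": "1", "pop": 300},
    {"sumlev": "090", "geocomp": "00", "county": "001", "cousub": "B",
     "tract": "9501", "bg": "1", "pop": 150},
    {"sumlev": "090", "geocomp": "00", "county": "001", "cousub": "B",
     "tract": "9501", "bg": "2", "pop": 250},
    {"sumlev": "090", "geocomp": "00", "county": "001", "cousub": "B",
     "tract": "9502", "bg": "1", "pop": 400},
]


def whole_area(records):
    """Keep only rows that summarize an entire area (geocomp 00)."""
    return [r for r in records if r["geocomp"] == "00"]


def aggregate(records, from_sumlev, keys, value="pop"):
    """Add up one additive item over the rows of one summary level,
    grouped by the geographic codes named in keys."""
    totals = defaultdict(int)
    for r in records:
        if r["sumlev"] == from_sumlev:
            totals[tuple(r[k] for k in keys)] += r[value]
    return dict(totals)


usable = whole_area(RECORDS)
# Rebuild un-split tract (140-style) and block group (150-style) counts.
tracts = aggregate(usable, "090", ("county", "tract"))
block_groups = aggregate(usable, "090", ("county", "tract", "bg"))
print(tracts)
print(block_groups)
```

Only additive items such as counts can be rebuilt this way; a median or a percentage has to be recomputed from its underlying counts rather than summed.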
Order Matters: Summary Levels 390 and 381 (for example)
If you follow the link just above to the complete Summary Level Sequence Chart for SF3 you will notice that it is actually comprised of two charts, labeled A: State Summary File 3 and B: National Summary File 3. Experienced census data users will recognize the Bureau's convention of releasing summary file products as a series of alpha-coded "files", such as "Stf 3, File A" and "Stf 3, File B". The different "files" usually contain the same tabular data but they do it for different geographic universes and summary units. In the case of the 2000 census, Summary File 3, there was a set of state-universe files that formed the "File A" series and a single national file ("File B") that presented data for mostly larger geographic areas but for the entire U.S.
On the second page of the A file SLSC you should see the entry:
390 State-Metropolitan Statistical Area/Consolidated Metropolitan Statistical Area
and on the second page of the B file SLSC the entry:
380 Metropolitan Statistical Area/Consolidated Metropolitan Statistical Area
381 Metropolitan Statistical Area/Consolidated Metropolitan Statistical Area-State
If you read carefully you'll notice that the only difference between the descriptions for the 390 and 381 summary levels is the order in which the component geographies are listed. On the A file State comes first, then the MSA/CMSA, while on the B file it is reversed. Of course we are really talking about the same kind of geographic entity -- the state portion of a metro area (which may or may not be in multiple states, by the way). It's not a significant difference unless you think you know what the code is for such an entity and you try to plug the value into your Dexter filter spec and have it fail because you are accessing the national file and using the state-file code. We have used this code-pair as an example of how this works. It also applies to other codes for geographic entities that can cross state lines such as Urbanized Areas, Core Based Statistical Areas and even ZIP (ZCTA) codes.
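The relationship between the 390 and 381 entries quoted above can be checked mechanically. The sketch below parses each entry into its code and its ordered list of component levels, using the chart's convention that hyphens separate hierarchy levels and slashes separate alternatives. The function name parse_description and the STATE_PART_OF_METRO lookup are illustrative names, and the parser assumes that no entity name itself contains a hyphen, which holds for these two entries but is not guaranteed for every row of the chart.

```python
def parse_description(entry):
    """Split an entry such as '390 State-...' into (code, levels), where
    levels is a list of hierarchy levels and each level is a list of
    '/'-separated alternatives. Assumes no hyphens inside entity names."""
    code, _, description = entry.partition(" ")
    levels = [level.split("/") for level in description.split("-")]
    return code, levels


A_FILE = "390 State-Metropolitan Statistical Area/Consolidated Metropolitan Statistical Area"
B_FILE = "381 Metropolitan Statistical Area/Consolidated Metropolitan Statistical Area-State"

code_a, levels_a = parse_description(A_FILE)
code_b, levels_b = parse_description(B_FILE)

# Same components, opposite order.
same_parts_reversed = levels_a == levels_b[::-1]
print(code_a, levels_a)
print(code_b, levels_b)
print(same_parts_reversed)

# An illustrative lookup so a query uses the code that matches the file.
STATE_PART_OF_METRO = {"A": code_a, "B": code_b}
```

Keeping the file-to-code lookup next to the query logic is one way to avoid plugging the state-file code into a filter against the national file.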
Summary Levels and Area Names
There are rigorously followed conventions for how the Bureau attaches names to geographic entities so that they can be readily identified. The name always describes the last entity for a hierarchical summary, sometimes followed by the notation "(part)". So here is what we get when we extract rows from the 2000 SF3 national ("B") file for the St. Louis metropolitan area:
Notice the Areaname values for the two 381 state portion summaries: you get the name of the state rather than the name of the metro area or a combination thereof. When viewed in context as it is here it makes pretty good sense. But when you work with the file and extract all the state-portion summaries and try to do analysis on them, it becomes a problem not having the name of the metro area on the records. Even when the MSA is contained entirely within a single state, the Areaname identifies the state rather than the metro area -- we get "Texas (part)" instead of "Abilene, Texas Metropolitan Statistical Area".
Here is a listing of all the 390 level summaries on the Missouri "A" file:
The names here are much more informative. They even use the parenthetical "(part)" notation to inform the user that the metro area spans states and this is only the part within Missouri. On the national file 381 summaries the word "(part)" is always appended, regardless of whether or not the MSA spans states.
Notice the GeoCode column shown in these extracts. This is not a Census Bureau field; it is one that the Missouri Census Data Center tries to add to most of our multi-summary level datasets. It contains the codes for all the fields that comprise the summary level, separated by dashes. Thus the 381 level has
7040-29 for the Missouri portion of the St. Louis MSA, while the corresponding 390 level value is
Summary Levels by Size and Type
The summary level code for a place (the Bureau uses the term "place" to mean an incorporated city or town, or a Census Designated Place, which is a census-defined entity that has no legal definition but is used as a unit for data reporting) is 160. At least that is the code used on the 2000 SF3 files (per the Summary Level Sequence chart shown above). When the summary is for the portion of a place within a county the summary level code is 155.
So when the Bureau releases their "sub-county" population estimates, which include estimates for multiple geography types including places and Minor Civil Divisions (county subdivisions that are legally recognized governmental units), they use summary level codes to identify the various levels summarized. Both complete place and place-within-county summaries are provided. You might expect to see these codes (155 and 160) used on these files. But here is what we get:
The expected 160 codes are 162's and the expected 155's are 157's. The explanation is that the Bureau uses a different code for these geographic units because the estimates do not include any CDP's (census designated places). A similar situation exists for the data estimated at the MCD level. MCD's are county subdivisions so you might expect 060 summary level codes. But you will see only 061 codes instead. These are a subset of the entities included within the 060 category.
This same thing happens when there is a geographic size limitation. In the 1990 national summary file they reported data at the place (city) level only for cities of at least 10,000 population. The summary level code used on those files was 161 rather than 160.
The moral of the story is that there may be more than one summary level code used to describe a geographic entity, depending on the context of where the summary data appears.
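One way to cope with this in your own code is to keep a small table of variant codes and the general code each one is a subset of. The sketch below includes only the pairs mentioned in this section; the names VARIANT_OF, base_sumlev and is_level are illustrative, and the table is certainly not a complete list of the Bureau's variant codes.

```python
# Variant summary level codes named in the text, mapped to the general
# code whose entities they are a subset of. Not a complete list.
VARIANT_OF = {
    "161": "160",  # places of at least 10,000 (1990 national summary file)
    "162": "160",  # places, excluding CDPs (sub-county estimates)
    "157": "155",  # place-within-county, excluding CDPs
    "061": "060",  # MCDs, a subset of county subdivisions
}


def base_sumlev(code):
    """Return the general code for a variant, or the code itself."""
    return VARIANT_OF.get(code, code)


def is_level(code, wanted):
    """True if code denotes the wanted level or a known subset of it."""
    return base_sumlev(code) == base_sumlev(wanted)


# Codes are kept as strings so that leading zeros survive.
codes_on_file = ["040", "050", "061", "157", "162"]
place_like = [c for c in codes_on_file if is_level(c, "160")]
print(place_like)
```

Whether a variant can safely be treated as its general level depends on the analysis: a 162 row is a place, but a file of 162 rows has no CDPs in it.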
The concept of a geographic summary level code has been around at least as long as the Census Bureau has been producing summary files (or "Counts" as they were called in the 1970 decennial census). For the 1970 and 1980 data products a 2-digit code was used. 01 was the code for a national total, 02 for a region, 03 for a division, 04 for a state and, you might expect, 05 might be a county. But actually 11 was the code for a county summary and all the rest are of no particular logical relationship to the new 3-digit codes that went into effect with the 1990 census products. Like a lot of historical facts, this does not have a lot of practical application. Unless maybe you happen to be required to go back and use some original census summary files from that earlier era. If you use data stored in the MCDC data archive, such as the stf803 and stf803x2 data directories ("filetypes"), you will see 3-digit SumLev variables that we created by converting the original 2-digit codes. If you would like to see some datasets in our collection that do use the old 2-digit SumryLvl codes try our marf2 collection (Master Area Reference File).
The Definitive Master List
Somewhat surprisingly, there does not appear to be a (public) master list of all the summary level codes used by the Census Bureau. We have attempted to compile a complete list of all the ones we know about. See the master list. Note that this list also includes codes that are not numeric (i.e. that contain all or some alpha characters); these are not official Census Bureau codes. They are codes that we have used on our datasets when we created our own new geographic level (e.g. a regional planning commission or a U of Missouri Extension region), or when we were unable to find out what code the Bureau was using for something (e.g. we use 61c as a code for the county portion of a state upper-chamber legislative district; we are pretty sure the Bureau has a code for this but at the time we did not know what it was). It would really be nice if the Bureau would publish a complete and annotated list of all these codes.
Summary Levels and Geoids
With the advent of the American FactFinder online data retrieval system the Census Bureau has developed a new universal geographic-entity code called a geoid. The idea is fundamental data warehouse methodology -- the entities that your warehouse is describing are required to have a unique identifier. The codes are alphanumeric and of varying length and content depending upon the geographic level. Here is what the code for Springfield, MO looks like:
(the actual code is the stuff inside the double quotes). This text is taken from an American FactFinder Saved Query file,
The main point to make here is that the first 3 characters of the ID are "160". Springfield is a city ("place") and 160 is the summary level code for a complete place. The next 2 characters in the geoid are "00" and this turns out to be the Geographic Component code. This is followed by the characters "US" (presumably the Bureau either currently or in the future does data for other countries so that this becomes relevant), and then by the relevant geographic codes. In this case we see "29" indicating the state, and "70000" indicating the place code (see CCC).
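The layout just described can be turned into a small parser. This is a minimal sketch based only on the positions given in this essay (3-character summary level, 2-character geographic component, the literal "US", then the geographic codes); the names Geoid, parse_geoid and split_place are illustrative, and the place-splitting helper assumes the 2-digit state plus 5-digit place layout described for summary level 160. The Springfield value is assembled from the components listed above, and the Chicago value is the one quoted in the Dexter walkthrough later in this essay.

```python
from typing import NamedTuple


class Geoid(NamedTuple):
    sumlev: str    # 3-character summary level code
    geocomp: str   # 2-character geographic component code
    country: str   # the literal "US"
    geocodes: str  # level-specific geographic codes


def parse_geoid(geoid):
    """Split a geoid into its fixed-position parts."""
    if len(geoid) < 7 or geoid[5:7] != "US":
        raise ValueError(f"unexpected geoid layout: {geoid!r}")
    return Geoid(geoid[:3], geoid[3:5], geoid[5:7], geoid[7:])


def split_place(parsed):
    """For a complete-place geoid (160), return (state, place) codes.
    Assumes a 2-digit state code followed by a 5-digit place code."""
    if parsed.sumlev != "160" or len(parsed.geocodes) != 7:
        raise ValueError("not a complete-place geoid")
    return parsed.geocodes[:2], parsed.geocodes[2:]


# Springfield, MO, assembled from the pieces described in the text.
springfield = "160" + "00" + "US" + "29" + "70000"
chicago = "16000US1714000"

print(parse_geoid(springfield))
print(split_place(parse_geoid(chicago)))
```

Other summary levels carry different code sequences after "US", so a fuller parser would need a per-level layout table; none is attempted here.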
So the Census Bureau is using these geoid codes to keep track of geographic entities within their database and these codes start with a 3-digit summary level code. So what? So nothing, for most folks. But if you are one of us who like to know how things work, or who need to process raw ACS data or who might even be tempted to figure out what AFF query files look like and code their own then this is very useful information.
Summary Levels and Keyvals in Dexter
Warning: this section is perhaps more about Dexter and the MCDC data archive than it is about summary levels. But the fact is that using the data archive effectively really requires that you be comfortable with Summary Level codes and using them to select data within Dexter. Which is what we illustrate in this section.
The ability to understand and use summary level codes is fundamental to being able to use the Missouri Census Data Center's archive via Uexplore and Dexter. The large majority of our datasets are summary data describing geographic entities, and the majority of these contain more than one type of geography. To be an efficient user of these data requires that you be able to distinguish between a state level summary, a county level summary or a city level summary. By this we mean that you be able to code a Dexter query that specifies which levels you are interested in having included in your extracted output. We recently ran a check on our metadata to see how many of our datasets contained a summary level code variable which was designated as a "key variable" for the dataset. We found that over half (52%) were so designated. Of course there are additional datasets that contain only a single level of geography and therefore do not require a summary level code. But many/most of these single-level datasets will still have a SumLev code present. They come in handy in cases where you join multiple datasets with different geographic levels.
To see how it works we'll walk you through a typical example.
Let's say you are accessing an ACS Profile report on our site. You have navigated the menu front ends and chosen Chicago and the state of Illinois as the geographic areas. You get a page that looks like this:
Note the strange secret code in parentheses after the name of the Area (16000US1714000) - recognize it? It is the geoid code for the entity being summarized (Chicago). A common way to access one of our datasets is to be referred to Dexter with a specific dataset already selected by an application such as this one (acsprofiles). The "reference" is typically provided as a link somewhere near the bottom of the application output page. If you scroll thru the Chicago (and Illinois) profile page you'll see near the bottom:
The highlighted link can be clicked on to take you to a Dexter data extract form that lets you access a dataset -- not just any dataset but the one that was used by the current application. Why would you want to do that? Maybe you wouldn't. Maybe this is all you need to know. But maybe you wanted to see something more. Maybe you wanted to see how some of these table items compared to other large U.S. cities (New York, LA, Philadelphia, etc.) or maybe how it compared to other cities in the state of Illinois. For the sake of our example we'll say that you are interested in looking at various data related to poverty for all cities in the United States with a population of at least 500,000. We'll walk you through that and you'll see how knowing about SumLev values helps you choose the data of interest.
Here is some of what you should see when you follow the link from the bottom of the ACS Profiles page:
The highlighted link to detailed metadata can be followed to see a page containing:
This is not a Dexter tutorial, per se, so we will not go into all the detail about how using such metadata can make you a better person. Suffice to say that if you spend a quality 3 to 5 minutes reading what is contained on this page, it could greatly enhance your ability to do something useful with this dataset. One of the ways it can help is by providing you with that row of Key variables links. We are particularly interested in the first link in that row - to sumlev. Click it and you'll see this:
(The page actually has a couple more columns to the right, which we truncated for use here.) If you've been reading this document carefully you should already know that the code used to indicate a place (same as city in Census lingo) is 160. Actually it is not the code but just a code that is used - we saw that in some instances they used codes like 161 and 162 to indicate subsets of places. At any rate, in the real world users do not often come to this juncture already knowing the values of SumLev codes. It is not important or necessary that you memorize all the code values. All that you need to remember is what the codes are used for and where to look for this kind of metadata to help you see what levels are present on a dataset in which you have interest.
Armed with the information about 160 being the code for place-level summaries, someone who has studied the on-line documentation for Dexter, especially the part describing the critically important Section II, should now be able to make the following entries in that section of the form:
The first line implements the population size condition, and the second line is the condition that eliminates ("filters") all those rows that summarize some entity other than a place. We only keep rows where SumLev = 160.
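For readers who think more easily in code than in query forms, the two conditions amount to the following row filter. This is not Dexter syntax, just a Python sketch of the same logic; the variable name TotPop, the function name select and every row in ROWS are made up for illustration (only SumLev and the 500,000 threshold come from the example).

```python
# Invented rows standing in for a multi-level dataset; names and
# populations are placeholders, not real data.
ROWS = [
    {"SumLev": "040", "AreaName": "State X", "TotPop": 6000000},
    {"SumLev": "050", "AreaName": "County C", "TotPop": 900000},
    {"SumLev": "160", "AreaName": "City A", "TotPop": 750000},
    {"SumLev": "160", "AreaName": "City B", "TotPop": 120000},
]


def select(rows, min_pop, sumlev):
    """Keep rows that meet the population threshold AND summarize the
    requested geographic level."""
    return [
        r for r in rows
        if r["TotPop"] >= min_pop and r["SumLev"] == sumlev
    ]


big_places = select(ROWS, 500000, "160")
for r in big_places:
    print(r["AreaName"], r["TotPop"])
```

Without the SumLev condition the population test alone would also let the state and county rows through, which is exactly the mistake the second filter line prevents.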
We complete the query by choosing what variables / columns we want in section 3. Something such as
will do it. We find the nearest button and push it to tell Dexter to do its thing. The result is a csv (comma-separated value) file which our IE browser knows can be easily and automatically opened in Excel. It should look something like this:
(after 15 seconds of editing to make the first 2 columns wider and format the row of column labels to wrap text, and after some cropping to make the image fit better here).
What would we get if we changed the code "160" in Section II of the query form to "150"?
Want to see a bunch of SumLev values and some related geographic identifiers, as used by the Bureau in their latest American Community Survey datasets? See a good working example of SumLev values that we extracted from the acs2008 data directory, dataset allgeos3yr. This dataset has the geographic identifier data for all 14,536 entities that qualified for summary data based on 3 years of data ending in 2008. We kept the report to a more reasonable size by filtering out any geographic-component other than
01 Rural and any state-specific areas except for Colorado. The report is sorted by geoid, which means by SumLev and then Geocomp and then whatever specific geocodes apply.
For More Information