Summary File 1 (SF1), 2000 Census
These files provide detailed population counts from the 2000 decennial
census. The data in these files are tabulated from information
collected on the census short form, which was sent to all households (as
opposed to the long form, which went to only about one in six households).
The U.S. Census Bureau has documented these files in great detail in a
600-page pdf document (available at http://www.census.gov/prod/cen2000/doc/sf1.pdf).
(The MCDC has broken this document down into a directory of smaller pdf
documents that can be accessed in the Techdoc subdirectory of the sf12000
data directory.) Everything you will need to know and more is contained
in that document. Our purpose here is to complement that information with
details of what we have done to the data here at the Missouri
Census Data Center.
Briefly, what we have done is transform the Bureau's collection
of 40 ascii files containing the data for a state into a series of database
files (aka tables, datasets, etc.). We are going to assume that you
have read enough of the Bureau's Technical Documentation to understand
what kind of information you can expect to find in these database files.
What we want to do here is provide pointers to help you find what you are
looking for.
We have only processed the complete set of data files for a relatively
small number of states (Missouri and some of its neighbors). However, we
have downloaded and converted the geographic headers files for all states
and have stored the results in the xxgeos subdirectory. These sets
contain just a minimal amount of actual census data (total pop and housing
unit counts) but are useful as geographic reference sets.
Alternative Data Sources
If you simply want the information contained in Summary File
1, learning about and working with these rather complex data files
may be more than you really need or want to deal with. You should be aware
of alternative sources that offer these data in more end-user-friendly
formats. These include a number of sites that offer profiles and other
reports on the web, most notably the Census Bureau's American FactFinder
site. The MCDC offers its own set of reports, which are referenced
in the section on related reports. We also offer an alternate data
collection ("filetype"), which we refer to as "sf12000x" and which is
stored in its own data directory.
The data in the sf12000x collection are distilled from the data in these full sf12000
files. Instead of dealing with over 8000 variables (in sf12000 files) with
names like "pct12i34" that represent cells of multi-dimensional tables,
in sf12000x files you will instead be dealing with just over 200 variables
with names such as TotPop, Over65, Families and pct_Chinese.
Prior to releasing the SF1 data starting in June of 2001, the Census
Bureau released a sort of "preview product" in May. This product was simply
called the "Demographic Profiles", and consisted primarily of nicely formatted
summary reports posted on the web in pdf format (see http://www.census.gov/Press-Release/www/2001/demoprofile.html
for details on this product). There were also comma-delimited data files
containing the data values used in the reports. An important limitation
of these products is that they were available only for governmental units
(states, counties, cities, some county subdivisions, etc.) but NOT for
census tracts, block groups, etc. The MCDC has a complete collection of
these data stored as the separate filetype sf1prof.
Data Files: What Goes Where
We are in the process of changing our strategy for converting and naming
these files (as of 1/14/2002). What follows describes the "new" strategy,
and there may be a short period during which some of our data do not match
this description.
We have created a consistent set of data files (or datasets --
we'll use the two terms interchangeably) for each state that we process
(with "us" serving as a pseudo-state for any national collection we might
process). We identify the state that each of our files relates to by using
the state postal abbreviation as the first 2 characters of the file name.
So any file you see that begins with "mo" you can be assured contains data
relevant to the state of Missouri, while any file that begins with "ks"
contains Kansas data. A file whose name starts with "us" would be a national
collection. We could have made it very simple and just put all the data
for a state into a single dataset, but that would have been far too
wasteful of storage space and -- more importantly -- of the time required
to access the data. So instead we have broken the data
down into smaller subsets, and have tried to segregate some of the
most frequently used data into their own datasets. If all you need to access,
for example, are the data in the "identifiers" section (the "geos" file),
then you can limit your access to the XXgeos dataset. For many applications
you should be able to get by with accessing just the XXph dataset (moph
for Missouri); this has the basic tables - P and H - for all geographic
summary levels except the census block. If you need summaries for blocks,
look in the XXblks dataset, which contains summaries for blocks and blocks
alone. (Here "XX" stands for the state postal code; substitute your state's
code.)
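The naming convention just described can be sketched in a few lines of Python. This is only an illustration: the helper names dataset_name and state_of are hypothetical and are not part of uexplore/xtract or of our SAS code, and the only suffixes used are ones mentioned in the text above.

```python
# Hypothetical helpers illustrating the naming convention described above:
# a 2-letter state postal code (or "us") followed by a suffix for the subset.

def dataset_name(state: str, suffix: str) -> str:
    """Build a dataset name such as 'moph' from a postal code and a suffix."""
    state = state.lower()
    if len(state) != 2 or not state.isalpha():
        raise ValueError(f"expected a 2-letter postal code, got {state!r}")
    return state + suffix.lower()

def state_of(dataset: str) -> str:
    """Return the 2-character prefix that identifies the state (or 'us')."""
    return dataset[:2].lower()

print(dataset_name("MO", "ph"))   # moph
print(state_of("ksgeos"))         # ks
```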
For the state of Missouri (for example) we have the following data files:
SF1 Data Files for a Typical Universe (State of Missouri)
||Contains geographic codes and other identifiers only. For all
geographic entities, incl. blocks.
||Contains the P and H table cell values as well as the geos data; for
all geographic entities except blocks
||Contains the P and H table cell values as well as the geos data for
just census blocks. The id variables geocode and AreaName
are excluded from this set. Very large dataset.
||Contains the PCT table cell values as well as the geos data for all
geographic areas for which PCT tables are available (so nothing at the
block or block group levels.) This does NOT include the collection
of race/hispanic qualified PCT12<r> tables, which are stored separately.
||Contains the PCT12a, PCT12b, ..., PCT12o table cell values as well
as the geos data for all geographic areas for which PCT tables are available
(so nothing at the block or block group levels.) These are very large,
tedious-to-access tables, each with 209 cells.
||An alternative way of storing the information contained in the mopct12r
tables. We create this almost microdata-like view of the data with
1 obs per data cell. The "ng" stands for "no geocodes". There
is only a single geo_id variable that links this data to the mogeos dataset.
See detailed explanation, below.
||A SAS view that just merges the data in the above dataset with the
mogeos, making it look as though we had all the geographic identifiers
As you might be able to guess from the names, these are stored as SAS
data files. SAS is the proprietary software package that we use to process
and store the data. Like most database packages, SAS has its own special
storage format that only it can directly create
and read. Of course, with our uexplore/xtract software you can access
it indirectly -- once you know what you are looking for. So what are these
seven SAS data files all about?
If you have read the SF1 Technical Documentation (which you really ought
to at least skim before attempting to access the data here), you'll know
that the data is organized into tables and that there are a number of different
kinds of table. There are "P" tables that contain basic population data
(age, race, sex, household type, etc.), and "H" tables that have data related
to housing subjects (tenure, vacancy status, etc.). There are also "PCT"
tables that are like P tables except that they are not available
for geographic summaries below the census tract level, i.e. for census
blocks and block groups. (This is a new idea from the Census Bureau for
2000; they never varied the tables within a summary file by geographic
summary level in any prior censuses.) In Missouri, there are 279,300 records
(observations) on the state's SF1 file, and of these 256,811, about 92%,
are either census block or block group summaries. The PCT tables are typically
rather detailed. One of the tables, PCT12, provides 209 cells of information
giving a summary of persons by sex and single years of age (103 age categories
altogether). In addition to all this detail, SF1 also contains a series
of tables named PCT12A, PCT12B, ... PCT12O (that's an "O" as in "Overkill",
not a zero -- there are 15 of these) that repeat the same tabulation, each
for a different race or hispanic subgroup. In total, if you add up
the cells in all the tables for one geographic area you have over 8,000
cells of data! Of these, about 5,000 data cells are in the less frequently
used PCT and PCT12R tables. (Those of you with any experience at all with
databases may see now why storing all the data in a single, simple "rectangular"
dataset would be so wasteful, since over half (5/8) of the data cells would
be missing for over 90% of the rows!)
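The arithmetic behind that remark can be worked through as a rough sketch. The record counts are the Missouri figures quoted above; the cell counts (8,000 and 5,000) are the rounded approximations from the text, so the result is only a ballpark figure.

```python
# Rough estimate of how much of a single "rectangular" dataset would be empty.
total_records = 279_300        # all Missouri SF1 summary records
block_or_bg_records = 256_811  # census block and block group summaries

total_cells = 8_000            # approximate cells per record (rounded in the text)
pct_cells = 5_000              # approximate cells in the PCT and PCT12R tables

# Share of rows for which the PCT tables are not available
share_rows_without_pct = block_or_bg_records / total_records
# Share of columns that belong to the PCT tables
share_cells_pct = pct_cells / total_cells
# Share of the whole rectangle that would hold missing values
empty_share = share_rows_without_pct * share_cells_pct

print(f"rows without PCT data: {share_rows_without_pct:.1%}")  # 91.9%
print(f"cells in PCT tables:   {share_cells_pct:.1%}")         # 62.5%
print(f"empty cells overall:   {empty_share:.1%}")             # 57.5%
```

Under these assumptions well over half of the cells in a single combined dataset would be empty, which is why the data are split by table group.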
In addition to the 8,000 or so table matrix data items on each SF1 summary
record, SF1 also has a wide array (about 70) of geographic and other various
"header" items: geographic codes, internal point coordinates, various area
Restructuring the PCT12 Detailed Age by Sex by Race Tables
The alternative ("ng") data sets are a restructuring of the data contained
in all the PCT12 tables. That includes table PCT12 itself, with 103 ages by
2 sexes plus 3 subtotals, or 209 cells for the total population, and the
collection of 15 PCT12<r> tables, where <r> is a letter from a to
o and indicates that these counts are for some race/hispanic subgroup such
as "Persons reporting White Only for Race" (PCT12a) or "Persons Reporting
Multiple Races and Not Hispanic" (PCT12o). There are thus 16 different PCT12
tables, each with 209 cells, so over 3000 variables in all. In the alternative
data set, the observations (rows, records) represent individual cells in
a 5-dimensional table. The 5 dimensions are geographic area (the census
tract within place and county subdivision, the smallest geographic area
for which PCT tables are available; the exact number of such entities is
large and varies by state), age (103), race (7), sex (2) and hispanic
origin (2). A single numeric variable, Persons, contains the population
count for that cell. Only non-zero cells are stored. A typical observation
might report the number of persons in Boone County, tract 20, city of Centralia,
township of Centralia (geographic dimension), who are (or were on April
1, 2000) exactly 5 years old, female, reporting white alone for race and
indicating that they are not hispanic. This is a small bit of information,
to be sure, and not one that many people would be interested in per se.
What many people should find of interest, however, is the relative ease
with which you can use this data set as input to a tabulation procedure
that can generate a report like the one we created at http://mcdc.missouri.edu/reports/misc/mo/age_by_race_hispanic_for_the_state.html
using fewer than 20 lines of SAS code (which can be viewed by following
the link in the footnote of the report page).
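The restructuring described above (one observation per non-zero cell) can be sketched as follows. This is not our SAS conversion code. The cell names, the CELL_DIMENSIONS lookup, the geo_id value and the way the dimension values are coded are all invented for illustration; only the idea of a single Persons count per cell, linked to the geography by geo_id, comes from the description above.

```python
# Hypothetical lookup from a wide-format cell name to its dimension values.
# The real variable names and codings are NOT reproduced here.
CELL_DIMENSIONS = {
    "cell_a": {"age": 5, "sex": "F", "race": "white alone", "hispanic": "no"},
    "cell_b": {"age": 5, "sex": "M", "race": "white alone", "hispanic": "no"},
    "cell_c": {"age": 6, "sex": "F", "race": "white alone", "hispanic": "no"},
}

def to_long(geo_id, wide_row):
    """Turn one wide record into one observation per non-zero cell."""
    records = []
    for cell, count in wide_row.items():
        if count == 0:
            continue  # only non-zero cells are stored
        records.append({"geo_id": geo_id, **CELL_DIMENSIONS[cell], "Persons": count})
    return records

# One wide record for a hypothetical geographic area
wide = {"cell_a": 12, "cell_b": 0, "cell_c": 7}
long_rows = to_long("G001", wide)
for row in long_rows:
    print(row)
```

The long form trades many sparse columns for a few dense ones, which is what makes it convenient input for a tabulation procedure.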
An important fact to keep in mind when processing this data set is that
the observations are of two basic types, and you should rarely use both
types in a single query:
Observations where the dimension variables race and hispanic are both blank.
Observations where neither race nor hispanic is blank.
The first group should be used for looking at total population (by age,
sex and geography), while the latter group should be used when you need
detail by race and hispanic origin. If you select both kinds and do an
aggregation you will probably get answers that are twice the actual values.
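A small sketch shows why mixing the two kinds of observation doubles the totals. The variable names race, hispanic and Persons follow the description above; the sample values, and the use of an empty string for "blank", are assumptions made for illustration.

```python
# Four sample observations for the same area, age and sex (values invented).
observations = [
    # Type 1: race and hispanic blank, i.e. a total-population cell
    {"age": 5, "sex": "F", "race": "", "hispanic": "", "Persons": 10},
    # Type 2: race and hispanic filled in, i.e. detail cells
    {"age": 5, "sex": "F", "race": "white alone", "hispanic": "no", "Persons": 6},
    {"age": 5, "sex": "F", "race": "white alone", "hispanic": "yes", "Persons": 1},
    {"age": 5, "sex": "F", "race": "black alone", "hispanic": "no", "Persons": 3},
]

def is_total(obs):
    """True for observations where both race and hispanic are blank."""
    return obs["race"] == "" and obs["hispanic"] == ""

from_totals = sum(o["Persons"] for o in observations if is_total(o))
from_detail = sum(o["Persons"] for o in observations if not is_total(o))
naive = sum(o["Persons"] for o in observations)  # mixes both types

print(from_totals)  # 10
print(from_detail)  # 10
print(naive)        # 20, twice the actual value
```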
As a user (or at least potential user) of uexplore/xtract what
does all this mean to you? Basically, that when you are looking for data
you need to know what table or tables you are interested in. The data have
been broken into 3 subsets based on grouping of the tables (ph, pct and
pct12R). The geographic header information is stored in a separate dataset
as well as in the various table-cell datasets. (In an earlier version, we
tried to omit the geography data from the cell-based datasets and use
SAS views to link them together, but this turned out to have some
unanticipated problems and we have abandoned that strategy.)
The xxgeos Geographic Headers Collection
The special subdirectory xxgeos is used to hold geographic headers
data only, i.e. there are no demographic tables here. All we did
was to download the Census Bureau's complete collection of geographic header
files (these .zip files are stored in the subdirectory, at least for now)
and then convert these to SAS data sets. In doing the conversion, we created
2 data sets per state, one for census block headers only and the other
with header info for all other geographic levels. We thus have 51 x 2 or
102 state-level SAS data sets in this directory.
We have also created a series of national level data sets, where we
have selected specific geographic entities from all states and combined
them into us level sets. Thus far we have created these national sets for
counties (uscntys), 5-digit ZIP Code Tabulation Areas (uszctas) and places
(usplaces). For the ZCTAs and places we have included headers for both the
complete areas and the area-within-county summaries. Thus the uszctas
set has levels 871 and 881, while usplaces has levels 155 and 160.
Warning About Size
The sf12000 data directory contains some of the largest files in our data
archive. In particular, you need to watch out for the block
level summary data sets. The mophblks.sas7bdat file, for example,
contains the P and H tables for all 241,532 census blocks in Missouri
(even the ones that have no population and no housing units). This file
is about 374 megabytes, and that is using the SAS compress option, which
is the only reason it is smaller than a gigabyte. A SAS data step (which
includes a Uexplore/xtract access) that reads it may take over a minute of
real time. You can make it go faster if you use a where filter to specify
one or more counties (since this dataset is indexed by county). All of which
means you have to be careful and sometimes patient when accessing these files.
Because of the tremendous size, you never want to run an extract on
one of these without a filter. The result will be way too big to handle,
and you will hit one of the filesize limits built into the xtract program.
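The advice above amounts to "always select counties first". As a language-neutral illustration (this is not how xtract or a SAS where clause is implemented), here is a minimal Python sketch. The function name filter_by_county, the row layout and the field names county, block and pop are all hypothetical.

```python
def filter_by_county(rows, counties):
    """Yield only the block records that fall in the requested counties."""
    wanted = set(counties)
    if not wanted:
        # Mirror the advice above: never extract block data without a filter.
        raise ValueError("refusing to extract block-level data without a county filter")
    return (row for row in rows if row["county"] in wanted)

# Three invented block records in two counties
blocks = [
    {"county": "019", "block": "1000", "pop": 42},
    {"county": "019", "block": "1001", "pop": 0},
    {"county": "510", "block": "2000", "pop": 17},
]

selected = list(filter_by_county(blocks, ["019"]))
print(len(selected))  # 2
```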
The Census Bureau publishes SF1 data for the entire United States and Puerto
Rico. You can access that data via the American Fact Finder
site at http://factfinder.census.gov/java_prod/dads.ui.homePage.HomePage
The Missouri Census Data Center will be making its version of the data
available here for the states of Missouri, Illinois, Kansas and Delaware,
as well as a national collection of higher level geographic summaries.
Data for other states is possible, but not probable. Organizations that
have access to the SAS(r) software package and are interested in creating
their own datasets comparable to what we have here can access the SAS code
we used in the Tools subdirectory of this directory. Specifically, they
should study the code in the http://mcdc.missouri.edu/data/sf12000/Tools/cnvtsf1.sas
SAS source code file.
Geographic Summary Levels
Anyone who does any work at all with census summary files knows that the
most important section of the 600-page Technical Documentation is something
called the Summary Level Sequence Chart. This is Chapter 4 in the manual.
It is 6 pages long, with 2 pages for each of 3 versions of SF1: (A) State
Summary, (B) Advanced National and (C) Final National. (Most of the data
you will see on this site, for now at least, will be from the (A) versions
of the file.)
The Summary Level Sequence Chart (SLSC) is just a way of displaying
all the different geographic levels for which data is aggregated on a summary
file. As you can see if you look at the chart for SF1, there are a lot
of levels available. However, most users are interested in
just a few. Here is a list of the levels that we have found to be most
frequently used, along with their summary level codes:
||State (see also GEOCOMP section)
||County (or county equivalent)
||County subdivision (MCD, township, CCD)
||Place (within county)
||Metropolitan Statistical Area (MSA) or CMSA within state
||Primary MSA within state
||Congressional District (106th)
||3-digit ZIP Code Tabulation Area
||5-digit ZIP Code Tabulation Area
||5-digit ZIP Code Tabulation Area within County
|Hierarchical Census Geography
||Place within MCD
||Tract within Place and MCD
||Block Group within Place and MCD
||Tract within place
Note that, by definition, county subdivisions (MCDs) nest within county,
as do census tracts. Block groups nest within census tracts and are composed
of census blocks. Census blocks are atomic units, meaning they nest within
all other geographies.
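The nesting just described is visible in the standard census block identifier, which concatenates the state (2 digits), county (3), tract (6) and block (4) codes, with the block group given by the first digit of the block number. The sketch below splits such an identifier into its parts. The sample identifier is made up, the function name split_block_id is hypothetical, and we are not claiming that the geocode variable in our files is stored in this 15-digit form.

```python
def split_block_id(block_id: str) -> dict:
    """Split a 15-digit census block identifier into its nested components."""
    if len(block_id) != 15 or not block_id.isdigit():
        raise ValueError(f"expected 15 digits, got {block_id!r}")
    return {
        "state": block_id[0:2],
        "county": block_id[2:5],
        "tract": block_id[5:11],
        "block_group": block_id[11],   # first digit of the block number
        "block": block_id[11:15],
    }

# Invented example: state 29, county 019, tract 0020.00, block 1005
parts = split_block_id("290190020001005")
print(parts)
```

Because each level is a prefix of the next, any two blocks that share the first 12 digits belong to the same block group, tract, county and state.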
Custom Aggregations to Other Geographic Units
You might think that all the summary levels described in the previous section
would be enough. But you'd be wrong. Users are interested in many other
layers of geography for which the Bureau has not provided any summaries.
The Missouri Census Data Center specializes in doing data allocations/aggregation
to create such custom summarizations. When we do this we create a pair
of data sets for each geographic universe-unit pair, one with the ph tables
and the other with the pct tables. We do not create a p12r data
set, and the geographic variables are not split off into a separate
set to be recombined with the data through views.
To date, we have created the following custom aggregations:
Missouri School Districts: moschls[ph/pct] are summaries for complete
districts, while moschlcos[ph/pct] are summaries for school districts split
by county. SUMLEV values on these data sets are "sdu" for unified districts
and "sde" for elementary. Both kinds are included.
Missouri State Legislative Districts: mosenate[ph/pct] are summaries
for MO state senate districts, while mohouse[ph/pct] are summaries for
the state House districts. SUMLEV values on these data sets are the ones
established by the Bureau. These are for the districts as they were
defined when the SF1 files were generated, i.e. as of 2000.
We have also been able to generate similar summaries for the legislative
districts as they have been redefined by the 2001 Missouri redistricting
effort, creating new political geography that is effective starting with
the elections of 2002. These datasets are named with "02" suffixes
in their dataset names to distinguish them from the older geographic versions.
Codebook - Description of the Variables/Columns in Each Data File
Our favorite tools for seeing what variables contain what information are
the SAS source code modules that provide labels for the variables. View
these in the Tools subdirectory, i.e.: