PUMS: Public Use Microdata Sample — 2000 Census


The PUMS data are not typical of the files in the MCDC data archive. The archive consists mostly of summary files, in which each observation/row contains data summarizing (mostly via predefined tables) a geographic entity such as a state, a county or a census tract. Such summaries have been created (typically by the Census Bureau or another source statistical agency) by aggregating the microdata collected in a decennial census or other survey. The PUMS files, on the other hand, let us access some of the raw survey data describing individual housing units and persons. ("Raw" in the sense that they have not been tabulated; but these data have been carefully edited to ensure compliance with Title 13 nondisclosure requirements.) Respondent identifiers, including any small area geographic codes, have been removed from the data. Some responses have been topcoded or substituted to protect privacy. (A detailed discussion of the Bureau's nondisclosure procedures is beyond the scope of this document.) The only geographic identifiers kept are state and PUMAs (Public Use Microdata Areas), which are large geographic entities having a minimum population of 100,000.

Where an observation on a summary file contains summary measures such as counts, averages and medians, records on microdata files contain actual questionnaire responses and/or recodes derived from those responses. So on PUMS files there are variables such as Age, Race1 and Sex that correspond to a person's response to the questions regarding their age, race and gender. As is typical of survey data, most of the responses on PUMS files are in the form of codes. Values for the question on gender are stored as "1" and "2" rather than "male" and "female". Making sense of the data requires use of a codebook that lets you translate these codes into their meanings; it also typically requires that you have access to a statistical software package that helps you turn the microdata into meaningful summaries.
The great thing about PUMS data -- the thing that makes it by far the most popular decennial census data product among academic researchers -- is that it allows you to build your own summary tables and measures. This does not apply to small geographic areas, of course, because of the lack of detailed geography. But if you need to build a table at the state, region or large county level for the poverty status of very specific demographic subgroups (such as foreign-born Hispanic males, or females over 40 who have less than a high school education) then PUMS is what you need. Even if you are not the type of person who would ever build such a custom tabulation yourself, it is important to understand what is possible so that you can have someone else build such summaries for you. The most important thing to remember (which is why we repeat ourselves) is that PUMS files let you build tables that are not on any of the Summary Files.

There are two subcollections of PUMS files: a one percent sample and a five percent sample. The 1% sample files have approximately one record for every 100 persons/households in an area, while the 5% files have approximately one out of 20. The universe from which the PUMS records are drawn is restricted to households/persons who received the long form; i.e., the data are based strictly on the long form questionnaire. The smallest unit of geography on the 1% PUMS files is called a "super-PUMA"; these geographic entities have a minimum population of 400,000. The 5% PUMS file records contain a different set of PUMA geographic areas, each of which has a minimum population of 100,000. See below for more about PUMAs.

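Unless you are trying to save on computer resources, there is very little reason to consider using the 1% PUMS files. Because they have only a fifth of the sample size of the 5% files, sampling errors will be substantially larger using a 1% sample file. As a rough guide, under simple random sampling standard errors vary with the inverse square root of the sample size, so they would be about 2.2 times (the square root of 5) as large.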

Each PUMS record contains a weight field that can be used to weight the case in order to estimate the total population (as opposed to the sample). For example, if you were trying to estimate the number of persons who indicated they were Black or African American alone in Missouri in 2000 you would access the 5% persons data for Missouri (dataset moprecs5.sas7bdat in the MCDC archive). You would select using the filter:
race1 = '2'
(having studied the data dictionary in order to know the variable name and code values). If you just counted the number of observations that were selected using this filter, all you would have is the number of African Americans in the sample. To get the estimated number of African Americans in the complete universe (persons in the state of Missouri on 4-1-2000) you would need to sum the weight values for each observation (stored as the variable PWEIGHT). If this makes absolutely no sense to you, then you are not in the target audience for direct use of the PUMS files. But you may still be able to use them with technical assistance.
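
For SAS users, the difference between the sample count and the weighted estimate described above can be sketched as follows. This is only an illustration: it assumes the libref pums2000 has already been assigned and that race1 is stored as a character variable (as the filter above suggests); the output column names sample_cases and estimated_persons are made up for this example.

```sas
/* Sketch: unweighted vs. weighted count of persons reporting       */
/* Black or African American alone, Missouri 5% person records.     */
/* Assumes libref pums2000 is assigned; column aliases are made up. */
proc sql;
  select count(*)     as sample_cases,       /* cases in the sample    */
         sum(pweight) as estimated_persons   /* estimate for the state */
    from pums2000.moprecs5
    where race1 = '2';
quit;
```

The first column is just the number of sample records selected; the second is the sum of the person weights, which is the estimate of the full population.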

Household and person data come on separate records and thus have separate weights (HWEIGHT is the variable containing the household weight). The person weights are not the same for all persons in a household.

MCDC Holdings

The Missouri Census Data Center has the raw (ASCII format) PUMS files (both one and five percent) in .zip file format for all of the 50 states and the District of Columbia. We have converted (to SAS datasets) the 5 percent sample data for all states and 1 percent data for the states of Missouri, Illinois and Kansas. The collection occupies approximately 6.4 GB of storage space.

Technical Documentation

The Census Bureau has provided extensive technical documentation for the PUMS files in the form of a 714-page pdf document. The complete document can be accessed at the Bureau's web site, or you can access the MCDC's partitioned version in which each chapter/Appendix has been placed into a separate pdf file, and an HTML index page is used for quick and easy access.

A codebook is an indispensable tool when working with survey data. You really cannot do much with the data until you understand how the variables relate to the questions on the questionnaire and how the responses have been encoded. (A facsimile of the long form questionnaire is in Appendix D of the technical documentation.) Chapter 7 of the technical documentation contains the basic codebook information for the 5 percent files (with only slight differences from the 1 percent dictionary in Chapter 6). If you plan to use PUMS much you probably need to make a hard copy of this chapter or at least to have it bookmarked for ready and frequent access. Some of the variables use codes that can have hundreds of different values; the values for these are provided in Appendix G rather than in the Data Dictionary chapter. Note that some of the code lists (only the ones that appear in Appendix G) are different for the two samples. For example, the ancestry codes used on the 5 percent files are not the same as those used on the 1 percent files. In general, there will be less detail (more collapsing of categories) on the 5 percent files.

There is also a plain text version of the data dictionary without the Indexes (which are of very little use for on-line access) that can be accessed at http://mcdc.missouri.edu/pub/data/pums2000/datadict.txt. This is our personal favorite data dictionary reference. But you still need Appendix G.

Note that most of the variables in the codebook come in related pairs. You get the variable ELEC (for example) and its companion allocation flag, ELECA. The latter is a variable that takes on the value 1 or 0 to indicate whether the value of ELEC was allocated for this household. If you are not familiar with what "allocated" means in this context, you can look it up in Chapter 4 of the technical documentation (p. 4-17). Or, like many PUMS users, you can choose to more or less just ignore the allocation flags.
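
If you would rather check how much allocation affects a variable before deciding to ignore the flags, a weighted frequency of the flag is one simple way to do it. The sketch below makes several assumptions: that the libref pums2000 is assigned, that ELEC and ELECA are on the household records dataset, and that a flag value of 1 means the value was allocated.

```sas
/* Sketch: weighted share of households whose ELEC value was allocated. */
/* Assumes eleca is on the household dataset and 1 = allocated.         */
proc freq data=pums2000.mohrecs5 (keep=eleca hweight);
  table eleca;
  weight hweight;
run;
```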

Accessing the Data Via Uexplore With Dexter

You can access the pums2000 datasets stored in the MCDC data archive using our uexplore data exploration web application which takes you to the Dexter data extraction tool. The URL for accessing the pums2000 data directory is http://mcdc.missouri.edu/cgi-bin/uexplore?/data/pums2000. (Most users will get to this page by accessing the uexplore main page--see previous sentence--and following the links there within the 2000 Census section.) There is a good chance that you just came from that page, since this is a common way for users to access this Readme file. The data directory page is also a good place from which to access the various metadata files mentioned in the Technical Documentation section, above.

The first file/link on the pums2000 uexplore directory page (mcdc.missouri.edu/cgi-bin/uexplore?/data/pums2000) is to Datasets.html. This is the best link to follow for easy access to the datasets. (This is not specific to the pums2000 data: access via a Datasets.html index page is always the easiest way to access the data in a directory, if such a page exists.)

Structure of the SAS Datasets

The conversion process we use to read the ASCII files distributed by the Census Bureau and create SAS datasets (think of database tables if you are not familiar with the SAS dataset concept) creates two datasets, one for the Household records data, and one for the Person records data. A simple view is then created that merges the two sets together, creating a rectangularized data table where all the household data are repeated for each of the person records. For vacant units you have the housing unit data with all the person data fields/variables missing. The names of these datasets are of the form (XX)hrecs(S), (XX)precs(S) and (XX)(S), where (XX) is the state postal abbreviation and (S) is the sample code. For example, the 3 datasets for the Missouri 5 percent sample data are mohrecs5, moprecs5 and mo5. The mo5 set is the view that combines the data from the persons dataset (moprecs5) and the households data (mohrecs5).
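
To make the idea of a rectangularizing view concrete, here is a minimal sketch of how such a view could be defined. It is not the MCDC's actual conversion code (that is in the Tools subdirectory). It assumes that both datasets carry a common housing unit key named serialno and are sorted by it; the view name work.mo5_sketch is hypothetical.

```sas
/* Hypothetical sketch of a household/person merge view.          */
/* Assumes a shared key variable named serialno, with both input  */
/* datasets sorted by that key.                                   */
data work.mo5_sketch / view=work.mo5_sketch;
  merge pums2000.mohrecs5 (in=inhh)   /* one record per housing unit */
        pums2000.moprecs5;            /* zero or more persons each   */
  by serialno;
  if inhh;   /* keep every housing unit, including vacant ones */
run;
```

In a one-to-many merge like this, the household variables are repeated on each person record, and a vacant unit comes through as a single record with the person variables missing.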

Entire U.S. SAS Datasets (New - March, 2004)

In addition to having a collection of 51 individual state files for the PUMS 5% sample data, we have also created special SAS views which combine the data from all these datasets into a trio of national files: ushrecs5, usprecs5 and us5. These "virtual" datasets are so large (when invoked) that they cannot be accessed using our uexplore/dexter access tools without special codes being specified to avoid having the application time out. There are over 14 million person records on the usprecs5 file (SAS view) and it can easily take a half hour or more to run an extraction against it. If anyone is in need of a custom national extraction, they should contact the MCDC to have us run a special request. Attempting to do such extracts yourself with Dexter will simply not work. You do not have to understand the difference between a regular dataset and a view in order to use them. But in case you are curious: a view is a virtual dataset that is stored as a small query or program; accessing the view causes the query to be executed, delivering the new data on the fly.

Access Data Via SAS/Share Server

This section is primarily intended for users who want to access the data using SAS and who have access to the SAS/Share server running at mcdc.missouri.edu. This is a relatively small audience. Another possible target group is persons who do not have access to the server running here in Missouri but might be interested in creating such a SAS/Share service at their own site.
If you are not familiar with the general process for setting up and using our SAS/Share server you need to read the documentation at http://mcdc.missouri.edu/jgb/sas9/usgnotes.html#ShareServers. Once you have configured access to the service on your local platform (which can be either a Windows or a Unix system) then you just need to code the following libname statement:
libname pums2000 server=mcdc.mcdcshr;
You must use the libref pums2000. If this is the first reference to the mcdcshr server in your job stream then you will also need to supply the server password:
libname pums2000 server=mcdc.mcdcshr sapw=[pswd];
If you do not know the password (the value to substitute for [pswd] in the above) then you can contact the author to get it.

Access to PUMS Data Via Other Means

The Census Bureau does not provide access to the PUMS data via the normal means - American FactFinder and Advanced Query System. They do provide access to the raw ASCII files via the web, as noted above. They also make the data available on DVD with custom software to assist in data extraction. (See details at http://www.census.gov/main/www/pums.html.)
There are also commercial vendors that provide access to the data using their custom software.

PUMA Geographic Areas

The smallest unit of geography that can be identified on a PUMS data record is the Public Use Microdata Area, or PUMA. There are actually two kinds of PUMAs used on these files: the larger "Super PUMAs", used as the only PUMA code on the 1 percent sample files, and the smaller PUMAs (sometimes referred to as "5 Percent PUMAs" or, by us at least, as "PUMA5s"). The 5 percent sample file records contain both the Super PUMA and the PUMA5 codes. They are stored as variables PUMA1 and PUMA5, respectively, on the datasets.

You can access maps depicting the boundaries of PUMA areas at the Census Bureau web site. There is actually one set per state, such as the set for Missouri. These consist of a state overview map image (sometimes more than one) showing the Super PUMAs, followed by at least one map page for each Super PUMA. This layout works because PUMA5s nest within Super PUMAs. All these maps are in pdf format.

Other maps of Missouri PUMAs (only) were created by the Office of Administration (Ryan Burson) and can be accessed at the /maps/mopumas directory on the mcdc server. The statewide PUMA (i.e. "PUMA5") map is especially useful, as are the more detailed maps of the St. Louis and Kansas City areas that also show tract boundaries and labels. (The labels shown on the 1pctPUMA.pdf map were preliminary and do not reflect the final values used for the Super PUMAs.)

Useful Facts Regarding PUMAs

Tips for SAS Programmers

If you are reading this section we assume you are essentially interested in accessing the PUMS data as SAS datasets that are either exactly as stored on our web site or perhaps in a slightly modified form using slightly modified versions of our conversion code. Otherwise, you're on your own. There is absolutely no requirement that you understand the details of our conversion code in order to use the datasets. That is one of the goals we have in creating the datasets - to provide relatively simple access without users needing to wallow in too much detail. But many programmers like detail and want to see how things work, especially if they are interested in making them work better.

The SAS code used to create and label the SAS datasets stored in our /pub/data/pums2000 data library is, per convention, stored in the Tools subdirectory of that data directory. It can thus be accessed on the web via http://mcdc.missouri.edu/cgi-bin/uexplore?/data/pums2000/Tools. The key files (for the 5 percent data; most have comparable 1 percent counterparts but we are not going to discuss those here) are:

Value Labels

valfmats.sas is one of the more interesting modules we have, and one of the more challenging to create. In fact, we may not be finished with it yet - we reserve the right to make corrections or enhancements in the future as we see fit. The challenge was to create modules that differentiate between code values that differ on the 1 vs. 5 percent files. We originally ran a program that read the data dictionary ASCII file and generated some of the value labels seen in this module. But we have done so much post-editing of the results that we really do not ever want to go back and rerun that original setup. Too many special cases.
Notice that the module contains the statement
proc format library=pums2000.formats;
This causes the procedure to create a permanent version of these format codes in a formats catalog stored in the same data library as the pums2000 data files. This allows application programs to have access to the formats without having to run Proc Format and copy any of this source code (although that is still possible and for many users/applications may still be the best way to go). You can also download and run the entire module, storing the formats catalog in your local data library. The tricky part is then knowing how to get SAS to find these format codes when running an application. The key is knowing how to use the fmtsearch SAS system option. We include such a statement in this (valfmats.sas) module:
options fmtsearch=(pums2000.formats library work);
This allows you to run code like this:
proc freq data=pums2000.moprecs5;
table relate; format relate $relate.; weight pweight; run;
When SAS processes the format statement it will search the catalog pums2000.formats to try to find $relate. If all goes as expected it will find it, and the result is that the frequency report will display value labels instead of meaningless numeric codes. Note that this works even when the formats catalog is stored on a remote server that runs on a different platform than the one your request is running on. We actually ran this example from Windows with the pums2000.formats catalog stored on our AIX server, with no problems. You must be running Version 8 or later of SAS for this to work.

Basic Efficiency Tips

PUMS datasets are very large. The original all-data-for-a-state ones are, anyway. When you actually run an analysis against these datasets, however, you frequently only need to access a relatively small amount of that data. To avoid long waits while the computer processes your requests, you should heed the lesson to be learned in the following program. We ran this program from a SAS Windows session using our SAS/Share server to access the 5 Percent data file for Missouri. Only the Person records, actually (the results would have been even more dramatic if we had accessed the full mo5 view, which would have required reading the complete mohrecs5 dataset as well). Here is the log showing the results of running a simple frequency report on 1 variable for a set of PUMA codes that correspond to St. Louis County, MO:
14 options fmtsearch=(pums2000.formats library);
15 data; set pums2000.moprecs5;
16 if puma5=:'017';
17 run;

NOTE: There were 279675 observations read from the data set PUMS2000.MOPRECS5.
NOTE: The data set WORK.DATA1 has 39126 observations and 165 variables.
NOTE: DATA statement used:
 real time  1:55.55
 cpu time 5.00 seconds

18 proc freq;
19 table relate; format relate $relate.; weight pweight; run;

NOTE: There were 39126 observations read from the data set WORK.DATA1.
 real time  0.35 seconds
 cpu time 0.12 seconds

22 proc freq data=pums2000.moprecs5(keep=relate puma5 pweight
23  where=(puma5=:'017'));
24 table relate; format relate $relate.; weight pweight; run;

NOTE: There were 39126 observations read from the data set PUMS2000.MOPRECS5.
 WHERE puma5=:'017';
 real time  2.13 seconds
 cpu time 0.27 seconds

The log shows 2 methods for doing the proc freq. The first one involves coding a SAS data step to create the subset of the data we want, followed by the Proc Freq step to generate the desired frequency report. Note that this method requires almost 2 minutes of real time to run the data step and just 0.35 seconds to then run Proc Freq. In the second approach we do the entire request with just a Proc Freq step, using a keep= dataset option to specify the variables that we needed and a where clause to indicate the observations we wanted to process. Using this approach we get results in just over 2 seconds of real time. It took more than 50 times longer to do it the first way.

The reason the data step method without where or keep clauses takes so long is that it forces SAS to access the entire dataset, moprecs5. Every variable and every observation has to be accessed and passed from the server to the desktop session. The subsetting if statement throws away roughly 86% of the data that is downloaded (279,675 observations read, 39,126 kept), but by then it is too late. When we went back and reran the same data step, but just changed the if to where, the result was a step that took just under 17 seconds to execute. We save almost 100 out of 115 seconds by using where, because a where clause is executed on the server so none of that unwanted data has to be transferred. Of course, we were still transferring all 165 variables for those cases when all we really needed was 3 of them. Sure enough, when we went back and added the keep= option in the data step, it ran that step in just 2 seconds. So the problem is not using 2 steps, but rather not specifying early on just what observations and variables you are really interested in.

Such differences are by no means trivial. In the real world, PUMS queries tend, of course, to be far more complex than this simple example and it sometimes takes dozens, if not hundreds, of attempts to get exactly the right results. So knowing how to efficiently access the data so that you are not sitting around waiting for service can be crucial. A good general strategy to follow is to try to separate the data access from the data analysis, to the extent possible. This usually involves running SAS data or Proc SQL steps that access the larger pums2000 datasets and just extract the needed observations and variables (using keep= options and where clauses). Then you run your analysis steps against those locally-stored and relatively small subsets.
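
The faster two-step variant described above (a data step that uses both a where clause and a keep= option, followed by Proc Freq on the local subset) might look like the following sketch. The dataset name work.stlco is a hypothetical placeholder; the sketch also assumes the libref pums2000 is assigned and that fmtsearch has been set so the $relate format can be found.

```sas
/* Step 1 (data access): pull only the needed variables and       */
/* observations. The where= and keep= options are applied as the  */
/* data are read, so unwanted data are not transferred.           */
data work.stlco;   /* hypothetical local subset name */
  set pums2000.moprecs5 (keep=relate puma5 pweight
                         where=(puma5 =: '017'));
run;

/* Step 2 (analysis): run against the small local subset. */
proc freq data=work.stlco;
  table relate;
  format relate $relate.;
  weight pweight;
run;
```

Once the subset exists locally, further analysis steps can be rerun against it as often as needed without going back to the server.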