The PL94 Basic Trend Reports are compact summary reports based on 1990 and 2000 Public Law 94-171 data. The collection created by the MCDC is at ./. We expect that most people will come to this page via the link on that one, but others may be sent directly here.
The data used to create these reports comes almost entirely from the Public Law 94-171 (aka "Redistricting") files as released by the U.S. Census Bureau. There are two editions of these files, one for 1990 and one for 2000. All data shown in these reports is identified by year and can be linked to one of the two decennial files. Some reports may only show data for a single year - usually when only one year is available. But the intent of the reports (as suggested by their name) is to provide data from the two censuses.
The one exception to the statement that all data comes from pl94 files is with the national level data for 1990. We did not have a complete collection of pl94 files for the U.S. So we used data from 1990 Summary Tape File 1C to generate the 1990 numbers for the U.S. level reports.
Getting the data for the upper limit counts (see discussion in following paragraph re race data intervals) from the 2000 Public Law files as distributed by the Bureau was not a simple matter of retrieving the count from a cell in one of the tables. These values had to be calculated by summing the set of 63-race-category cells that included the given race. Some people may try to say that the count of persons who were "white alone or in combination with other races, and non-hispanic and over 18" (for example) is not on PL94 for 2000. The fact is that while it was not explicitly reported, it was derivable.
The thorniest issue that data providers and users have to deal with related to the PL94 data (and this is going to be with us for all 2000 data products) is how to deal with the change in race categories. In earlier decades the census questionnaire asked persons to check a box indicating their race. They were instructed to check only one such box. Everybody was either white or black or Japanese or American Indian, etc. But an OMB directive from 1997 has mandated that all federal surveys should now allow respondents to be able to specify multiple races. You can check as many of the race boxes as apply. For the sake of tabulating the data for PL94 there were 6 basic race categories (up from 5 basic categories in 1990; the "Asian and Pacific Islander" category used in 1990 has been split into two categories for 2000: "Hawaiian and Other Pacific Islander" and "Asian".) You can view the results of this new way of assigning race categories when you look at the size and complexity of the 4 "tables" on PL94.
While the additional race detail gathered by allowing the multiple responses is useful for someone studying racial mixing patterns, there is a real problem for most applications of the data. If it were a matter of a single table with counts of the 63 combinations it would be fine. But these 63 categories are not just a single table in the census tabulations. Many census tables (including all of them on PL94) use race as a table dimension. So we have tables of "Race by Age" and "Race by Sex" and "Median Household Income by Race of Householder", etc. Do we really want all of these tables to give us the full 63-category detail? Is more data better? Not in most cases. The data need to be further collapsed to make them useful for most applications.
One way that has been suggested (initially by the Census Bureau, I believe) to deal with the new complexity was to summarize race data by looking at two basic counts associated with a given basic racial group. If you want to look at data for the Asian racial category (for example) you would get data for all persons who indicated that they were Asian and no other race, i.e. "Asian alone". The companion category would be all persons who checked the "Asian" box on their form, i.e. "Asian alone or in combination with other races". These two counts represent an interval in which the value of "Asian persons" falls. The reason we have an interval instead of a simple count as before, is because now being "Asian" is not a simple yes/no question. You can be completely Asian ("Asian alone" - the lower limit of the interval) or you can be at least partly Asian ("Asian alone or in combination..." - the upper limit of the interval.) Of course, you can also be in the partly Asian category; these are the people who make up the interval.
In producing products based on the PL94 data, the MCDC has decided to focus primarily on the use of these racial intervals for reporting race-stratified data. It turns out that this was not exactly a trivial thing to do. If you study the 288 cells of the 4 tables on the PL94 files as distributed by the Census Bureau you will not find any explicit counts of such things. To determine the number of persons who are "Asian alone or in combination..." you have to identify and sum the 32 cells in Table PL1 that indicate persons who are all or partly Asian. Trying to then determine how many of this total are of voting age and/or of Hispanic origin requires even more careful study and aggregation of data in the other PL tables. The process is tedious (even by our standards) and error prone. But we felt it was worth the effort, so we created our own transformed extract files containing explicit counts based on the "racial interval" approach.
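The cell arithmetic described above can be illustrated with a short Python sketch. This is not the MCDC's SAS code; the race labels, the `interval` helper and the sample counts are all hypothetical and exist only to show why 32 of the 63 combinations contribute to one "alone or in combination" count.

```python
from itertools import combinations

# The six basic race categories tabulated on the 2000 PL94 files
# (short labels chosen here for illustration only).
RACES = ("White", "Black", "AIAN", "Asian", "HawnPI", "Other")

def race_combinations():
    """Return all non-empty combinations of the six basic races."""
    return [frozenset(combo)
            for size in range(1, len(RACES) + 1)
            for combo in combinations(RACES, size)]

def interval(counts, race):
    """Return (alone, alone_or_in_combination) for one race.

    `counts` maps a combination (frozenset of race labels) to a person
    count; combinations missing from `counts` are treated as zero.
    """
    lower = counts.get(frozenset([race]), 0)
    upper = sum(n for combo, n in counts.items() if race in combo)
    return lower, upper

combos = race_combinations()
cells_with_asian = [c for c in combos if "Asian" in c]

# Made-up counts for a hypothetical area (not real census data).
sample_counts = {
    frozenset(["Asian"]): 120,
    frozenset(["White"]): 900,
    frozenset(["White", "Asian"]): 15,
    frozenset(["Black", "Asian", "Other"]): 2,
}
asian_lower, asian_upper = interval(sample_counts, "Asian")
print(len(combos), len(cells_with_asian), asian_lower, asian_upper)
```

The same membership test would have to be repeated against the voting-age and Hispanic-origin tables to build the other interval variables.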
Each line of data in the reports is identified by a geographic area, a year and a "population group". The first two categories are assumed to be pretty obvious. The population groups are mostly straightforward except for the race-based groups. This is where we run into the problems created by allowing respondents to the 2000 census to check multiple races. This results in there being no simple category such as "race=white" or "race=black". We now have people who indicated they are both white and black; or perhaps they are white, black and American Indian. There are 63 such racial categories, made up of the various combinations of the 6 basic race categories.
Our approach to reporting data here is to show intervals of data for the race-based categories for the 2000 data. The first number in these intervals (which we refer to as the "lower limit") is the count of persons who reported that race and that race only. The second number ("upper limit") is the number of persons who selected that race - "alone or in combination with other races". Which means they checked the box for that race on their form - period. Whether or not they checked any others is irrelevant when calculating the upper limit. The bottom line is that if respondents had been required to check only one racial category in 2000 instead of being able to choose several, the count of persons we would have obtained for each race category would have been somewhere in this interval. The good news for those wanting to look at racial trends is that these intervals are generally very small. In fact, they are so small that the software that generates these reports checks to see if the lower and upper limits are identical and will print just a single value if it detects such a case. This will usually only happen for relatively small geographic areas.
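The rule of collapsing an interval whose limits coincide can be sketched in a few lines of Python. The function name and the " - " separator are assumptions made for illustration; the actual reports are produced by SAS code and may format the values differently.

```python
def format_interval(lower, upper):
    """Format a race interval, printing one value when both limits match."""
    if lower == upper:
        return f"{lower:,}"
    return f"{lower:,} - {upper:,}"

single_value = format_interval(1200, 1200)   # limits identical
two_values = format_interval(1200, 1350)     # a genuine interval
print(single_value)
print(two_values)
```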
There are really just two fundamental data items ("facts", as opposed to "dimensions") for each line of these reports. These are the counts of all persons (in the Population Subgroup and geographic area for the year), and the count of persons over 18 (voting age). Percentages of the total and voting age population are also given. Where the counts are intervals, so are the percentages -- we just calculate the percentage figures based on the interval limits.
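As a rough sketch of the percentage calculation, each limit of a count interval can simply be divided by the relevant total. The function below is hypothetical, and the one-decimal rounding and the choice of denominator (total population for all-ages counts, total over-18 population for voting-age counts) are assumptions rather than details confirmed by the report code.

```python
def percent_interval(lower, upper, total):
    """Return (lower_pct, upper_pct) of `total`, rounded to one decimal."""
    if total == 0:
        return (0.0, 0.0)  # avoid dividing by zero for empty areas
    return (round(100 * lower / total, 1), round(100 * upper / total, 1))

# Made-up figures: 120 "alone", 137 "alone or in combination", 1,000 persons.
asian_pct = percent_interval(120, 137, 1000)
print(asian_pct)
```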
The files are sorted by geography, then time (year) and then population subgroup. The code could be easily tweaked to give other orders (some might prefer, for example, to have it sorted by population group before time to facilitate seeing trends.) Some reports have geographic subcategories (usually counties) as the major sort field. These will print as subheadings instead of as part of the "Geographic Area" field.
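To make the effect of the sort keys concrete, here is a small Python sketch with made-up rows. It compares the geography, year, subgroup order with one possible alternative that places subgroup ahead of year, so that the 1990 and 2000 lines for a group sit next to each other. Sorting subgroups alphabetically is a simplification; the reports presumably use their own subgroup order.

```python
# Each row is (geographic area, year, population subgroup); values are made up.
rows = [
    ("Boone", 2000, "Total"),
    ("Adair", 2000, "White"),
    ("Adair", 1990, "White"),
    ("Adair", 2000, "Total"),
    ("Adair", 1990, "Total"),
]

# Geography, then year, then subgroup.
by_geo_year_group = sorted(rows, key=lambda r: (r[0], r[1], r[2]))

# Geography, then subgroup, then year: both years of a group are adjacent.
by_geo_group_year = sorted(rows, key=lambda r: (r[0], r[2], r[1]))

for row in by_geo_group_year:
    print(row)
```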
The reports usually display comparable data from the 1990 census. These data have been extracted from 1990 PL94-171 files (although the same values could be extracted from 1990 Summary Tape File 1 files and we may do this for future products.) For geographic units such as counties and places (cities) the data are reported using the geographic definition of the entity at the time of the 1990 census. Thus data for O'Fallon, Mo for 1990 is as reported at the time of the 1990 census using the then-current boundaries of that city. For census tract levels (and below) we have used geographic correspondence files to allocate the 1990 data to 2000 geographic entities so that the data will be for comparable geographic entities. Note that 1990 data for small area geographies will not always be included. This is because we do not have the geographic equivalency files for all states and have therefore not been able to generate the corresponding 1990 counts.
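The allocation of 1990 counts to 2000 tracts can be pictured with the following Python sketch. The layout of the correspondence files is not described here, so the record structure, the tract identifiers and the "allocation factor" (the assumed share of a 1990 tract that falls inside a 2000 tract) are illustrative assumptions only.

```python
from collections import defaultdict

# Hypothetical correspondence records: (1990 tract, 2000 tract, allocation factor).
# The factors for any one 1990 tract are assumed to sum to 1.
correspondence = [
    ("1990-A", "2000-X", 1.0),
    ("1990-B", "2000-X", 0.25),
    ("1990-B", "2000-Y", 0.75),
]
pop_1990 = {"1990-A": 4000, "1990-B": 2000}  # made-up counts

def allocate(counts, links):
    """Spread source-area counts over target areas using the factors."""
    allocated = defaultdict(float)
    for source, target, factor in links:
        allocated[target] += counts.get(source, 0) * factor
    return dict(allocated)

pop_1990_on_2000_tracts = allocate(pop_1990, correspondence)
print(pop_1990_on_2000_tracts)
```

Allocated counts are estimates, which is one reason small-area 1990 figures should be read with more caution than the county and place figures.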
We have placed the 1990 and 2000 data in separate csv files for convenience. The geocode field can be used as the key to link corresponding rows from the two files. When browsing the index if you do not see a 1990.csv file for a geographic universe/summary level (e.g. "Kansas_Tracts") then you can expect that there will be no 1990 data displayed in the corresponding report files.
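Linking the two files on the geocode field might look like the Python sketch below. The in-memory samples stand in for a pair of 1990 and 2000 csv files; the geocode values and counts are made up, and the header spelling `GeoCode` is taken from the field list and may differ in the actual files.

```python
import csv
import io

# Made-up stand-ins for a 1990 and a 2000 csv file.
csv_1990 = io.StringIO(
    "GeoCode,AreaName,Pop100,White1\n"
    "29001,Adair,24000,23000\n"
    "29003,Andrew,14000,13800\n"
)
csv_2000 = io.StringIO(
    "GeoCode,AreaName,Pop100,White1,White2\n"
    "29001,Adair,25000,23500,23800\n"
    "29005,Atchison,6400,6300,6350\n"
)

def index_by_geocode(handle):
    """Read a csv file and key each row by its GeoCode value."""
    return {row["GeoCode"]: row for row in csv.DictReader(handle)}

rows_1990 = index_by_geocode(csv_1990)
rows_2000 = index_by_geocode(csv_2000)

linked = {}
for geocode, row_2000 in rows_2000.items():
    row_1990 = rows_1990.get(geocode)  # None when there is no 1990 row
    linked[geocode] = {
        "AreaName": row_2000["AreaName"],
        "Pop1990": int(row_1990["Pop100"]) if row_1990 else None,
        "Pop2000": int(row_2000["Pop100"]),
    }
print(linked)
```

This sketch keeps every 2000 row and leaves the 1990 value empty when no match exists, mirroring the reports' behavior of omitting 1990 data that is unavailable.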
The reports are usually generated in 3 display formats: plain text ("txt" file extension), html and pdf (Adobe print document format). In addition, for each report the data used to create the report is included in the directory as comma-delimited files, one for each decade. The file extension for these data files is "csv" (comma separated values, a Windows standard extension.) Many or most browsers today are configured to handle such files with a special desktop application such as Excel.
Note that the html files can be quite large and therefore may take some time to load (the file sizes are provided on the index page and should be considered when selecting files to browse.) The report titles will appear immediately, but there can be a long pause before the table itself displays. For fast access, use either the txt or pdf versions. Some reports are not available in html format, since their size would prevent many browsers from being able to load them. Also, the HTML formats display the column headers only at the beginning of the report, while the txt and pdf versions repeat them at the top of each page.
All files for the same report will have the same basic filename and just different extensions. The basic filenames are intended to be as descriptive as possible. They will generally be made up of the geographic universe name, an underscore, and the geographic summary level. For example, "Kansas_Counties" will be a report displaying data for the counties in the state of Kansas; "Missouri_Tracts" will have tract level data for the state of Missouri, etc.
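A script that walks the report directory could take the basic filenames apart as in this Python sketch. Both helper functions are hypothetical, and splitting on the last underscore is an assumption that would need checking against universe names that themselves contain underscores.

```python
def split_report_name(base_name):
    """Split a basic filename into (geographic universe, summary level)."""
    universe, _, level = base_name.rpartition("_")
    return universe, level

def report_files(base_name, extensions=("txt", "html", "pdf")):
    """List the display-format files expected for one report."""
    return [f"{base_name}.{ext}" for ext in extensions]

kansas = split_report_name("Kansas_Counties")
missouri_files = report_files("Missouri_Tracts")
print(kansas, missouri_files)
```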
The files with ".csv" extensions are plain ascii "comma delimited" files ("csv" stands for "comma separated value" and is a standard Microsoft file extension.) The first line of each of these files contains the names of the fields (variables). If you open such a file in most spreadsheet programs (including Excel) the first row is imported as-is and serves as a column header row to label the data. For other applications, such as SAS, you may need to tell the program to ignore the first line of data. (Use the "firstobs=2" option on the infile statement to accomplish this in SAS.) The variables are arrayed within the csv records in more or less the same left-to-right order they appear in the report. Field names are intended to be fairly obvious but there are some exceptions.
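For readers working outside SAS, the same "skip the header line" step can be sketched in Python with the standard csv module. The sample data is made up, and only three of the fields are shown.

```python
import csv
import io

# Made-up sample in the layout described above: a header line, then data.
sample = io.StringIO(
    "GeoCode,AreaName,Pop100\n"
    "29001,Adair,25000\n"
    "29003,Andrew,16500\n"
)

reader = csv.reader(sample)
field_names = next(reader)  # consume the header line, much like firstobs=2
records = [dict(zip(field_names, values)) for values in reader]

# Values arrive as text, so convert before doing arithmetic.
total_population = sum(int(record["Pop100"]) for record in records)
print(field_names, total_population)
```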
The key naming convention has to do with variables related to race counts. For 2000 we use "race intervals" (see above) rather than a single count for each category. The variables black1 and black2 represent the lower and upper limits of the interval for black persons. Thus, black1 is the count of persons indicating they are black alone, and black2 is the count of persons reporting black alone or in combination with other races. Similarly, the variables white1 and white2 represent the lower and upper limits of the white race interval, etc. For the variables representing persons over 18 by race the names are formed by using a 2-character race abbreviation, "Ovr18", and the numeric suffix indicating lower or upper limit. Thus we have variables AsOvr181 and AsOvr182 representing the lower and upper limits of the interval for Asian persons over 18.
For the 1990 file we have used a naming convention consistent with the 2000 files, wherein we use the numeric suffix "1" for the single race count fields. So the count of white persons in 1990 is White1. There is no White2 variable for 1990, because there are no intervals prior to 2000 since persons were only allowed to choose a single race then.
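The over-18 race variable names in the field list can be decoded mechanically, as in this Python sketch. The abbreviation table is read off the field list; the function and the regular expression are hypothetical helpers, and names that do not follow the pattern (such as HisOvr18 or MROvr18) are deliberately left unparsed.

```python
import re

# Two-character abbreviations as they appear in the field list.
RACE_ABBREVIATIONS = {
    "Wh": "White",
    "Bl": "Black",
    "In": "American Indian or Alaska Native",
    "As": "Asian",
    "Ha": "Hawaiian or Pacific Islander",
    "Ot": "Some other race",
    "WN": "White, non-Hispanic",
}
OVER18_NAME = re.compile(r"^([A-Za-z]{2})Ovr18([12])$")

def describe_over18(name):
    """Return (race, 'lower' or 'upper') for an over-18 race variable."""
    match = OVER18_NAME.match(name)
    if match is None:
        return None
    abbreviation, suffix = match.groups()
    race = RACE_ABBREVIATIONS.get(abbreviation)
    if race is None:
        return None
    return race, "lower" if suffix == "1" else "upper"

print(describe_over18("AsOvr181"))
print(describe_over18("WNOvr182"))
```

For a 1990 file only the names ending in "1" would be expected, since there are no upper limits before 2000.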
A complete list of the variables on the csv files and their labels follows:
|GeoCode||Varies with geographic level but contains the codes necessary to uniquely identify the geographic entity. This field can be thought of as the key that identifies the row.|
|desedist||This special code appears only on the two Missouri Schools files. It contains the 6-digit school district code used by the Missouri Dept of Elementary and Secondary Education. The value of geocode on these schools files is the official federal school code ("LEA code") which is what is stored in the school code field in TIGER and the PL94 files.|
|SumLev||Geographic Summary Level code. Usually a constant within a file.|
|AreaName||Name of the geographic area|
|County||May not always be present but when it is, contains the name of the county.|
|geo_id||Ignore it. A numeric key linking to a master geographic reference file (2000 geography only) maintained by the MCDC.|
|White2||White alone or in combination|
|Black2||Black alone or in combination|
|Indian1||American Indian or Alaska Native alone|
|Indian2||American Indian or Alaska Native alone or in combination|
|Asian2||Asian alone or in combination|
|HawnPI1||Hawaiian or Pacific Islander alone|
|HawnPI2||Hawaiian or PI alone or in combination|
|Other1||Some Other race alone|
|Other2||Some other race alone or in combination|
|Whitenh1||White alone, Non Hispanic|
|Whitenh2||White alone or in combination, Non Hispanic|
|HispPop||Total Hispanic/Latino Population|
|Over18||Total population over age 18|
|WhOvr181||White alone, over 18|
|WhOvr182||White alone or in combination, over 18|
|BlOvr181||Black alone, over 18|
|BlOvr182||Black alone or in combination, over 18|
|InOvr181||AIAN alone, over 18|
|InOvr182||AIAN alone or in combination, over 18|
|AsOvr181||Asian alone, over 18|
|AsOvr182||Asian alone or in combination, over 18|
|HaOvr181||Hawaiian or PI alone, over 18|
|HaOvr182||Hawaiian or PI alone or in combination, over 18|
|OtOvr181||Other race alone, over 18|
|OtOvr182||Other race alone or in combination, over 18|
|WNOvr181||White Non-hisp alone, over 18|
|WNOvr182||White Non-hisp alone or in combination, over 18|
|HisOvr18||Hispanic Pop Over 18|
|MultRace||Multi Racial (Total persons checking more than 1 race)|
|MROvr18||Multi Racial Over 18|
|LogRecNo||Logical record number - links to original PL94 file record #|
|AreaLand||Land Area in Square Meters|
|AreaWatr||Water Area Sq Meters|
|Pop100||100 pct Population Count|
|IntPtLat||Internal point latitude coordinate in decimal degrees.|
|IntPtLon||Internal point longitude coordinate in decimal degrees.|
|LandSQMI||Land Area Square Miles|
|AreaSQMI||Total Area Sq Miles|
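The square-mile fields can be related to the square-meter fields with a simple unit conversion, sketched here in Python. That LandSQMI is derived from AreaLand, and AreaSQMI from AreaLand plus AreaWatr, is an assumption based on the field labels; the sample areas are made up.

```python
# One international mile is exactly 1609.344 meters.
SQ_METERS_PER_SQ_MILE = 1609.344 ** 2

def sq_meters_to_sq_miles(sq_meters):
    """Convert an area in square meters to square miles."""
    return sq_meters / SQ_METERS_PER_SQ_MILE

# Made-up AreaLand and AreaWatr values for a hypothetical area.
area_land = 1_500_000_000
area_water = 20_000_000

land_sq_mi = sq_meters_to_sq_miles(area_land)
total_sq_mi = sq_meters_to_sq_miles(area_land + area_water)
print(round(land_sq_mi, 2), round(total_sq_mi, 2))
```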
These reports were generated using the SAS software package, Version 8.1. The code for generating the reports can be viewed via the web at http://mcdc.missouri.edu/data/pl942000/Tools/. This will take you to a directory listing. The pertinent modules are basic_trend_report.sas for an example of code that writes these reports using the macro in pl94trnd.sas, and cnvt1state.sas for the code that converts the original Census Bureau data files and creates the SAS data set(s) (for 2000) used as input.
The pl9490tx data sets, which are the source of the 1990 data displayed in the reports, were created by aggregating 1990 data using setups in http://mcdc.missouri.edu/data/pl9490/Tools/. There are quite a few aggregation setups involved here and in the pl942000 Tools directory. One of the critical software tools used in that aggregation is our aggregation utility macro stored in http://mcdc.missouri.edu/pub/sasmacro/agg.sas.
The SAS code used to create these reports was written by John Blodgett of the Office of Social and Economic Data Analysis, part of University Outreach and Extension, located at the University of Missouri in Columbia. The work was done under a contract with the Missouri Census Data Center. Questions and comments can be addressed to the author (see below).