If you are interested in median family income by the age and/or race of the householder, this is not the place to look. Here, all you will find is the simple median family income. You will also see race and age data, but not cross-classified. We believe that a relatively small, carefully chosen subset of the available data can be used to answer a large percentage of users' questions. So we have spent considerable time considering which variables to include in these extracts. It is like creating a Greatest Hits collection; no one is going to agree with all your choices. We had direct input from a number of people regarding what to include here. If we had included every item that at least one person thought should be included, we would easily have had over a thousand variables. We wanted to keep our count under 250. At last count (we reserve the right to tweak these files by adding new variables from time to time -- we'll never drop one, however) we had 217 independent variables and another 188 derived percentages.
Just because a variable is not included on this dataset does not mean it is not an important piece of information. Census data are used by so many people for so many different kinds of applications that there is no way you can create an extract that serves everyone's needs for all requests. That is not the intention. We still have the more detailed table files to go to, and we have created tools for providing links directly from this extracted data to the more detailed "parent" tables. (Look for these links on the corresponding profile reports, described below.)
The primary intended use of this collection is to generate profile reports based upon the data. These profiles should provide a basic overview of the area. What kind of people live there? What are their ages, their racial breakdown, their income levels and poverty status? What is their propensity to own versus rent, how old is the housing, how long have people lived there, how many hold PhDs and how many never made it through high school? Access to these reports is available using the MCDC's menu-driven dp3_2k web application.
Another series of profile reports combines data from these 2000 demographic profile datasets with comparable data from the 1990 census. The MCDC's menu-driven dp3_2kt web application is one of the most popular on the MCDC web site.
|1. Population Basics - Universe: Total Population|
|Total Persons (Sample Est)||13,092||P1|
|Unweighted Sample Count of Persons||1,598||P2|
|Total Persons (100% Count)||13,243||P3|
|Pct Persons Sampled||12.1||P4|
The "Total Persons (Sample Est)" field is the population of the city based on the sample estimate. The actual enumerated population was 13,243, and 1,598 of these people (the unweighted sample count) got the long form questionnaire, which was 12.1 percent of the total. All the SF3 "sample" data are based on the responses of these 1,598 people. The estimate is off by 151 people, or about 1.1%. The sampling error tends to be higher for smaller geographic areas. It is not a serious problem for Washington, but it could be for a city of fewer than 1,000 people. The Bureau actually oversamples in places of under 2,500, but it can still be a problem.
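The figures quoted in this paragraph can be re-derived from the three counts in the table. The following is a minimal sketch; the Python variable names are ours and are not names on the dataset.

```python
# Counts taken from the Population Basics table above.
sample_estimate = 13_092     # Total Persons (Sample Est)
unweighted_sample = 1_598    # Unweighted Sample Count of Persons
full_count = 13_243          # Total Persons (100% Count)

# Share of the enumerated population that received the long form.
pct_sampled = 100 * unweighted_sample / full_count

# Gap between the complete count and the sample-based estimate.
difference = full_count - sample_estimate
pct_difference = 100 * difference / full_count

print(f"Pct persons sampled: {pct_sampled:.1f}")
print(f"Estimate is off by {difference} persons ({pct_difference:.1f}%)")
```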
Each person who fills out the long form is assigned a "sampling weight" value. The average sampling weight for persons in Washington was about 8.25 (the 12.1% sampling rate is about 1 in 8.25 people). The SF3 count of persons aged 20-24, for example, was derived by counting each person filling out the long form who checked their age as being in that interval not as one person but rather as 8.25 persons (on average -- the sampling weights will actually vary from person to person; it's a very complex sampling scheme). The bottom line here is that the figures on SF1 are somewhat more accurate than those on SF3 (complete count data is "better" than sample-based data).
The problem with using the SF1 data is one of consistency when trying to analyze an area. It can be pretty confusing to users when two tables describing the total population, or even some subset of it, do not add up to the same totals. For example, when we report the Marital Status data in Table 6 (sample data not available on SF1) we use as our universe persons aged 15 and over. This is the sample estimate of such persons, consistent with the five marital status counts that follow it. These differences tend to disappear for higher level summaries (states, larger cities and counties) but can be very significant for smaller geographic areas.
The simple fact of the matter is that you really have to take sample census data for areas of fewer than a few thousand people with a grain of salt. The sampling errors for such areas can be significant. Sample data for areas with fewer than 100 people are basically worthless, as far as knowing the characteristics of those areas. They do have considerable value as building blocks to aggregate to larger entities such as school districts or 10-mile radii of proposed nuclear reactor sites. The sampling error you get when you aggregate 100 areas of 100 persons each is equivalent to the error for a place with 10,000 people - i.e. not bad in most cases.
But watch out for tables with small universes.
Users might want to view the Census Bureau's note regarding this matter as it relates to 2000 census data.
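To get a feel for why small areas are unreliable on their own but still useful as building blocks, here is a rough sketch. It ASSUMES simple random sampling with a single uniform weight of 8.25, which the actual long-form design is not, and the 10% characteristic is hypothetical. Treat the numbers as orders of magnitude only.

```python
import math

AVG_WEIGHT = 8.25  # average sampling weight quoted above for Washington

def standard_error(count, universe, weight=AVG_WEIGHT):
    """Approximate standard error of a sample-based count.

    ASSUMPTION: simple random sampling with one uniform weight; the
    real sampling scheme is more complex.
    """
    return math.sqrt((weight - 1) * count * (1 - count / universe))

# Hypothetical characteristic held by 10% of the population.
small = standard_error(count=10, universe=100)        # one area of 100 persons
large = standard_error(count=1_000, universe=10_000)  # one place of 10,000

# Errors of independent areas add in variance, not in standard error,
# so 100 small areas combine to the same error as the single large place.
aggregated = math.sqrt(100 * small ** 2)

print(f"one area of 100:    +/- {small:.1f} on a count of 10")
print(f"one place of 10000: +/- {large:.1f} on a count of 1000")
print(f"100 areas combined: +/- {aggregated:.1f} on a count of 1000")
```

Under these assumptions the error on the single small area is of the same size as the count itself, while the aggregate of 100 such areas is no worse than a place of 10,000.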
By far the easiest way to access the various datasets within the directory is to click on the Datasets.html page within the sf32000x directory. Here you will find a more logical ordering of the datasets along with much more detailed descriptions and metadata references.
To access datasets with complete SF3 table files you need to access the sf32000 filetype, which means using the same URL as for this sf32000x filetype and just dropping the final x. You may want to access the MCDC's SF3 home page (which is also the Readme file for the sf32000 directory.) That file will provide more general background about the Summary File 3 2000 data, with access to complete Technical Documentation. We understand that for many -- perhaps the great majority -- of users, all that information will be a lot more than they may have the time or interest to digest. This standard extract has been created mostly for those people.
|SumLev code||SumLev Meaning||Inventory
The key to selecting observations from these datasets (and most other datasets based on census data) is the SumLev variable. This 3-digit code is provided by the Census Bureau on all their summary files to allow users to distinguish the type of geographic entity being summarized. Per the report we see that the code indicating a state level summary is 040 (leading zeroes matter here), while the code to indicate a county summary is 050. The "Inventory or Hierarchal" column indicates how the specified level is classified. The levels indicated as being "inventory" (basically these are complete areas, while hierarchal areas are created by intersecting inventory areas) will be found on datasets such as "moi" and "ili". Those indicated as hierarchal will be found on datasets such as "moh" and "ilh". Inventory summaries are by far the most commonly used, but hierarchal summaries are more numerous and take up a lot more space. That is why we like to segregate them - it makes using the inventory data go a lot faster.
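If you work with a CSV extract outside of SAS, the same selection can be sketched in a few lines of Python. The sample data below is invented, AreaName is an assumed column name, and a real Dexter CSV file also carries a row of labels that is omitted here. The point is only that SumLev must be compared as a string so that the leading zero survives.

```python
import csv
import io

# Hypothetical stand-in for a CSV extract; names and counts are invented.
SAMPLE_CSV = """SumLev,AreaName,TotPop
040,Example State,5000
050,Example County A,3000
050,Example County B,2000
160,Example City,1200
"""

def select_level(rows, sumlev):
    """Keep only rows whose SumLev matches the 3-character code exactly."""
    return [row for row in rows if row["SumLev"] == sumlev]

# csv.DictReader leaves every field as a string, so "050" keeps its zero.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counties = select_level(rows, "050")

for row in counties:
    print(row["SumLev"], row["AreaName"], row["TotPop"])
```

Asking for "50" instead of "050" selects nothing, which is the practical meaning of "leading zeroes matter".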
The Frequency Count shown in the report is based on how many times each level occurs on the original sf3 data files for Missouri. The reason there are 94 state level summaries is that the Bureau has something called a geographic component summary. On sf3, you not only get a summary for the whole state but also for "geographic components" such as "Urban", "Rural", "Urban in Rural Cluster" and "Rural Farm", to name just a few of the more interesting ones. Most users hate "geocomps" because they just cause confusion. For that reason, we have omitted them from the standard extract datasets, at least for now. We await a groundswell of user interest to see if we need to make them available. But for now, on sf32000x, the moi.sas7bdat dataset, there will be only a single 040 summary observation. (See http://mcdc.missouri.edu/sas/formats/Sgeocomp.sas for the complete list of geocomp codes if you are interested. The really important one is '00'.)
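For anyone working from the full sf32000 files, where the geographic components are still present, the filtering we applied amounts to keeping only the '00' rows. A minimal sketch follows; GeoComp is an assumed column name, the data is invented, and the non-'00' codes are placeholders rather than values looked up in Sgeocomp.sas.

```python
import csv
import io

# Hypothetical rows; only the '00' code is taken from the text above.
SAMPLE_CSV = """SumLev,GeoComp,AreaName,TotPop
040,00,Example State,5000
040,01,Example State (component 1),3500
040,43,Example State (component 2),1500
050,00,Example County,800
"""

def drop_geocomps(rows):
    """Keep only whole-area summaries, i.e. rows with geocomp code '00'."""
    return [row for row in rows if row["GeoComp"] == "00"]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
whole_areas = drop_geocomps(rows)

states_before = sum(1 for row in rows if row["SumLev"] == "040")
states_after = sum(1 for row in whole_areas if row["SumLev"] == "040")
print(f"040 rows before: {states_before}, after: {states_after}")
```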
Notice that the variables are organized into a series of 29 subgroups called tables. These are the same tables as labeled in the dp3_2k profile reports. The table numbers and titles are shown as subheaders in the Variables report, and the report is sorted in table number order. Here is a sample of the report - describing the variables comprising the Educational Attainment table:
|Table=13. Educational Attainment Universe=Persons Over 25|
|Variable Name||Label||Definition - Code used to derive||Comment||Universe Variable||Weight Variable|
|Over25||Over 25 Yrs of Age||p37i1|| ||TotPop|| |
|LessThan9th||Less Than 9th Grade||sum(of P37i3-P37i6 P37i20-P37i23)|| ||Over25|| |
|SomeHighSchool||9th thru 12th grade, No Diploma||sum(of P37i7-P37i10 P37i24-P37i27)|| ||Over25|| |
|HighSchool||High School Grad or GED||P37i11 + P37i28||These are people with nothing beyond High School||Over25|| |
|NoCollege||Did Not Attend College||LessThan9th + SomeHighSchool + HighSchool|| ||Over25|| |
|SomeCollege||Some College, no degree||sum(of P37i12-P37i14 P37i29-P37i31)||This is not the complement of NoCollege. It is people with some college but no degree except maybe an associates||Over25|| |
|Bachelors||Bachelors||P37i15 + P37i32||People with a Bachelors and no more||Over25|| |
|Masters||Masters||P37i16 + P37i33||People with a Masters and no more. NA on STF3 in 1990.||Over25|| |
|ProfPHD||Prof School Degree or PhD||P37i17 + P37i18 + P37i34 + P37i35||NA on STF3 in 1990||Over25|| |
|GradProf||Graduate or Professional Degree||Masters + ProfPHD||Added together for compatibility with 1990 STF3||Over25|| |
The value displayed in bold in the Variable Name column is the name on the dataset. These are the names you will see displayed on the drop-down variables menu list when running the Dexter program. To run an extract that includes the Educational Attainment table you would just select these 10 variables. Note that the first variable in the table, Over25, is a special table-universe variable. It is not really an education item per se, but it is included because of its importance as a denominator used in calculating Pct variables. What are those? A footnote at the bottom of the report gives a brief explanation, but we need to make it clearer. Most variables on this dataset represent counts of things with a certain property. For example, the variable LessThan9th is the count of persons over 25 with less than a 9th grade education. We also generate a variable, PctLessThan9th, containing the percentage this is of the universe. The "Universe Variable" column tells you what variable, if any, we use as the denominator to create the corresponding Pct variable. Notice that the Over25 row has an entry of "TotPop" in the Universe Variable column. This tells us that the dataset has a variable named PctOver25 and that the value of this variable is the Over 25 population as a percentage of the total population, i.e., PctOver25=100*Over25/TotPop. Note that these percentage variables are the source of the values that appear in the Percent column of the Demographic Profile 3 (dp3_2k) reports. (See sample).
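The rule for Pct variables can be sketched as follows. The universe mapping is copied from a few rows of the Educational Attainment table, the counts are hypothetical, and returning None for an empty universe is our assumption rather than a description of how the datasets handle that case.

```python
# Universe variable for each count, as listed in the table above (subset).
UNIVERSE = {
    "Over25": "TotPop",
    "LessThan9th": "Over25",
    "HighSchool": "Over25",
    "Bachelors": "Over25",
}

def add_pct_variables(record, universe=UNIVERSE):
    """Return a copy of record with Pct<Name> = 100 * Name / Universe."""
    out = dict(record)
    for name, denom_name in universe.items():
        denom = record[denom_name]
        # ASSUMPTION: leave the percentage undefined when the universe is empty.
        out["Pct" + name] = 100 * record[name] / denom if denom else None
    return out

# Hypothetical counts for one area.
area = {"TotPop": 1000, "Over25": 650, "LessThan9th": 65,
        "HighSchool": 260, "Bachelors": 130}

profile = add_pct_variables(area)
for key in ("PctOver25", "PctLessThan9th", "PctHighSchool", "PctBachelors"):
    print(key, profile[key])
```

Note that PctOver25 is a share of the total population, while the other three are shares of the Over25 universe.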
The Label column contains a more extended description of the variable and corresponds to the Label item stored on the SAS dataset. It will appear in the second row of Dexter-generated CSV files and as the column label on html output.
The Definition column is for people who want to know precisely how we derived the variable. It is a SAS numeric expression that was used on the right side of a SAS assignment statement to define the variable. For example, the SAS program that creates these extract datasets (by accessing the full-table sf32000-filetype datasets) contains the statement:
HighSchool=P37i11 + P37i28;
You can verify that the formula is correct by browsing the Plabels.txt file in the Varlabs subdirectory of the sf32000 data directory. That file contains the following text:
/* P37. SEX BY EDUCATIONAL ATTAINMENT FOR THE POPULATION */
/* 25 YEARS AND OVER  */
/* Universe: Population 25 years and over */
P37i1='Total:' /* P037001 */
P37i2=' Male:' /* P037002 */
P37i3=' No schooling completed' /* P037003 */
P37i4=' Nursery to 4th grade' /* P037004 */
P37i5=' 5th and 6th grade' /* P037005 */
P37i6=' 7th and 8th grade' /* P037006 */
P37i7=' 9th grade' /* P037007 */
P37i8=' 10th grade' /* P037008 */
P37i9=' 11th grade' /* P037009 */
P37i10=' 12th grade, no diploma' /* P037010 */
P37i11=' High school graduate (includes equivalency)' /* P037011 */
P37i12=' Some college, less than 1 year' /* P037012 */
P37i13=' Some college, 1 or more years, no degree' /* P037013 */
P37i14=' Associate degree' /* P037014 */
P37i15=' Bachelor''s degree' /* P037015 */
P37i16=' Master''s degree' /* P037016 */
P37i17=' Professional school degree' /* P037017 */
P37i18=' Doctorate degree' /* P037018 */
P37i19=' Female:' /* P037019 */
P37i20=' No schooling completed' /* P037020 */
P37i21=' Nursery to 4th grade' /* P037021 */
P37i22=' 5th and 6th grade' /* P037022 */
P37i23=' 7th and 8th grade' /* P037023 */
P37i24=' 9th grade' /* P037024 */
P37i25=' 10th grade' /* P037025 */
P37i26=' 11th grade' /* P037026 */
P37i27=' 12th grade, no diploma' /* P037027 */
P37i28=' High school graduate (includes equivalency)' /* P037028 */
P37i29=' Some college, less than 1 year' /* P037029 */
P37i30=' Some college, 1 or more years, no degree' /* P037030 */
P37i31=' Associate degree' /* P037031 */
P37i32=' Bachelor''s degree' /* P037032 */
P37i33=' Master''s degree' /* P037033 */
P37i34=' Professional school degree' /* P037034 */
P37i35=' Doctorate degree' /* P037035 */
You can see from this that P37i11 is the male high school graduates and P37i28 is the female high school grads. So the sum is the total high school graduates, as advertised. Users are encouraged to examine these definitions carefully and to report any errors to the author.
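For readers who do not use SAS, the definitions in the table can be read as plain sums over the P37 cells. The sketch below expands the range shorthand used in expressions such as sum(of P37i3-P37i6 P37i20-P37i23). The helper names are ours, the cell values are placeholders (cell N simply holds the value N), and unlike the SAS sum function this sketch makes no attempt to skip missing values.

```python
import re

def split_name(name):
    """Split 'P37i11' into its prefix 'P37i' and trailing number 11."""
    prefix, number = re.fullmatch(r"(.*?)(\d+)", name).groups()
    return prefix, int(number)

def expand(spec):
    """Expand 'P37i3-P37i6' into its member names; pass single names through."""
    if "-" not in spec:
        return [spec]
    first, last = spec.split("-")
    prefix, start = split_name(first)
    last_prefix, stop = split_name(last)
    if prefix != last_prefix:
        raise ValueError(f"range endpoints do not match: {spec}")
    return [f"{prefix}{i}" for i in range(start, stop + 1)]

def sum_of(cells, *specs):
    """Add up the named cells, mimicking sum(of ...) for non-missing data."""
    return sum(cells[name] for spec in specs for name in expand(spec))

# Placeholder data: cell P37iN holds the value N, so sums are easy to check.
cells = {f"P37i{i}": i for i in range(1, 36)}

high_school = sum_of(cells, "P37i11", "P37i28")                 # male + female
less_than_9th = sum_of(cells, "P37i3-P37i6", "P37i20-P37i23")   # male + female

print(expand("P37i3-P37i6"))
print(high_school, less_than_9th)
```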
In some cases, the definition may not be in error exactly, but it may be less than perfectly clear from the name and label what it really represents. This is where the Definition column can be very helpful.
The Comment column is pretty obvious. Where we felt there was some need to clarify something about the variable we added it here. Thus the explanation for SomeCollege is to warn users that it is not all people who have at least some college experience, but rather only those with some college experience but no degree. There is a lot of this "fine print" that has to be understood when dealing with census data.
The Universe Variable column has already been discussed above. If it is blank then there will not be a corresponding Pct variable.
The Weight Variable column is blank for all entries in our sample Educational Attainment table. It will only have a value for variables that can be aggregated by taking a weighted average. If, for example, the Bureau had reported (which they did not) total years of school completed then this table might have had an entry labeled "Average Years of School". The weight variable for such an item would have been the universe variable Over25. An example of this that actually exists on the dataset occurs in Table 21. The PCI variable (Per Capita Income) has TotPop listed as the weight variable. This means when you aggregate a dataset containing the PCI variable you need to take a weighted average of the variable, using the total population count as the weight.
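As an illustration of what the weight variable means in practice, here is a minimal sketch of aggregating two areas. The populations and incomes are hypothetical; only the variable names PCI and TotPop come from the dataset.

```python
def aggregate(areas):
    """Combine areas: TotPop is summed, PCI is averaged weighted by TotPop."""
    total_pop = sum(area["TotPop"] for area in areas)
    weighted_income = sum(area["PCI"] * area["TotPop"] for area in areas)
    pci = weighted_income / total_pop if total_pop else None
    return {"TotPop": total_pop, "PCI": pci}

# Two hypothetical areas of very different size.
areas = [
    {"TotPop": 1000, "PCI": 20000},
    {"TotPop": 3000, "PCI": 30000},
]

combined = aggregate(areas)
simple_mean = sum(area["PCI"] for area in areas) / len(areas)

print("weighted PCI:", combined["PCI"])
print("unweighted mean (wrong for aggregation):", simple_mean)
```

The unweighted mean treats the small area as if it had as many people as the large one, which is why the weight variable is recorded alongside the definition.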
SF32000: The Source
In case you missed it, all the data in these sf32000x extracts are direct derivatives of the complete tabular data stored in the parent filetype, sf32000. Additional information regarding SF3 is available in the SF3 Readme file, which doubles as the MCDC's "home page" for Summary File 3. You can, of course, also use uexplore to extract data from the sf32000 data directory.