Table of Contents

Background and Overview

Input options

Output options

Universe filtering options (Inc. point-and-radius)

Accessing and Understanding Outputs


Background and Overview

Where the application runs

The Geocorr engine is mirrored and can be accessed via the following URLs:

http://blue.census.gov/plue/geocorr
and at
http://oseda.missouri.edu/plue/geocorr
and at
http://plue.sedac.ciesin.org/plue/geocorr.

The "sedac" site is considered the primary mirror but all sites offer the same functionality at all times.

What the application does

The MABLE/Geocorr geographic correspondence engine generates files and/or reports showing the relationships between a wide variety of geographic coverages for the United States. It can, for example, tell you with which county or counties each ZIP code in the state of California shares population. It can tell you, for each of those ZIP/county intersections, what the size of that intersection is (based on 1990 population or other user-specified variable) and what portion of the ZIP's total population is in that intersection. The application permits the user to specify the geographic scope of the correspondence files (typically, one or more complete states, but with the ability to specify counties, cities, or metropolitan areas within those states), and, of course, the specific geographic coverages to be processed. The latter include virtually all geographic units reported in the 1990 U.S. census summary files, and several special "extension coverages" such as 103rd Congress districts and the PUMA areas used in the 1990 PUMS files. The application creates a report file and a comma-delimited ascii file (by default) which the user can then browse and/or save to their local disk.

What is a "Correlation List"?

The output files created by this application are referred to as "correlation lists". Other commonly used terms for such entities are "equivalency files", "crosswalk files", and "geographic correspondence files". A correlation list consists of a set of "source geocodes" specifying the geographic coverage to be related (i.e. the "known" geographic coverage), and a set of "target geocodes" specifying the geographic coverage to which we want to relate the source areas. Frequently (always, in the case of files generated by this application) the correlation list will include a variable to measure the absolute "size" of the correspondence (such as the land area of the intersection or the number of persons living in the intersection). When such an absolute measure is present then there may also be an "allocation factor" variable that indicates what portion of the source area is located within the target area. An entry in a census tract to ZIP correlation list (i.e. a list with "census tract" as the source coverage and ZIP as the target) might contain the population living in the tract/ZIP intersection and a number indicating what decimal portion of the tract's total population also lives within the ZIP. The sum of these allocation factors for any specific value(s) of the source geocode(s) should always be 1.0. For example:

COUNTY   TRACT   ZIP    POP   AFACT
 29510  1101.00 63109  1250   .500
 29510  1101.00 63110   625   .250
 29510  1101.00 63111   625   .250

Here we see 3 entries from a tract-to-ZIP correlation list. All 3 entries are for the same source code, census tract 1101.00 within county 29510 (city of St. Louis, Mo.). The entries show that the tract intersects with 3 different ZIP codes (estimate based on 1990 census) and show the absolute and relative sizes (POP and AFACT, respectively) of the intersections. Note that if we add the 3 POP values we get the total POP value for the tract (2500), while if we add the 3 AFACT values we get (as always) 1.0.
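
The arithmetic in this example can be checked with a few lines of code. The Python sketch below is for illustration only and is not part of geocorr; the rows are copied from the sample listing above, and the variable names are our own.

```python
from collections import defaultdict

# Sample rows copied from the listing above: (county, tract, zip, pop, afact)
rows = [
    ("29510", "1101.00", "63109", 1250, 0.500),
    ("29510", "1101.00", "63110", 625, 0.250),
    ("29510", "1101.00", "63111", 625, 0.250),
]

pop_total = defaultdict(int)
afact_total = defaultdict(float)
for county, tract, zip_code, pop, afact in rows:
    source = (county, tract)      # the source geocodes identify the source area
    pop_total[source] += pop
    afact_total[source] += afact

# For every source area the allocation factors should add up to 1.0
for source in pop_total:
    print(source, pop_total[source], round(afact_total[source], 3))
```

The same check can be applied to any correlation list: group the entries by the source geocodes and confirm that the allocation factors in each group sum to 1.0, allowing for rounding.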

Typically (always in this application unless overridden with an option) correlation lists are sorted first by the source geocodes, and then by the target geocodes within the source codes.

Who/what is MABLE?

"MABLE" is an acronym for Master Area Block Level Equivalency File. This is the name of the massive database that is used by the geocorr engine to create the correlation lists. "Block" here refers to 1990 census blocks, the smallest geographic units used in the 1990 census. It was chosen as the base unit for the application because the Census Bureau uses these blocks as their "atomic unit" for all other census-based geographies. Thus, census blocks will never cross a place (city) or MCD (county subdivision, township, New England town) boundary. While they can and do cross ZIP code boundaries, for the sake of this application (and based on the Census Bureau's official 1990 Block-ZIP Equivalency file) each block is assigned to a unique ZIP code (vintage October, 1991). The MABLE database is actually a collection of 51 state-level datasets containing a total of just under 7 million block entries.

How does GEOCORR work?

The hard part was building the database and the user interface. The actual processing is fairly simple. Once the program has determined the geographic universe that the user specified, as well as the source and target geocodes and the weighting variable, it is a matter of extracting these items from the appropriate entries in the MABLE database. This yields a set of census blocks for the geographic area specified, each one identified by the source and target geocodes and with a measure of its "size" (population, land area or number of housing units). Building the correlation list outputs (listing and/or .csv file) is then a relatively simple process of sorting and aggregating. What this amounts to is using the census blocks as a kind of "geographic pixel", or indivisible geographic unit. All correlations are "rounded off" to the census block level. For a majority of the geographic codes the roundoff error is 0 since most of them are never split by blocks. The resulting file is similar to the sort of result you can get from a GIS by doing a polygon intersection operation. But it goes much faster (and the output is presented in a more convenient format perhaps) because we have already determined all the spatial correspondences and stored the results in MABLE: all we need to do is pull out the subset of the 7 million pre-defined answers and aggregate them.
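
The sort-and-aggregate step can be sketched in a few lines. This Python example is only an illustration of the idea (the application itself is written in SAS); the function name, the tuple layout and the sample block records are all invented for the example, and giving a zero allocation factor to a source area with no weight is an assumption.

```python
from collections import defaultdict

def build_correlation_list(blocks):
    """blocks: iterable of (source_code, target_code, weight), one per census block."""
    cell = defaultdict(float)          # weight summed per source/target intersection
    source_total = defaultdict(float)  # weight summed per source area
    for source, target, weight in blocks:
        cell[(source, target)] += weight
        source_total[source] += weight

    result = []
    for source, target in sorted(cell):   # sorted by source, then target within source
        weight = cell[(source, target)]
        total = source_total[source]
        afact = weight / total if total else 0.0   # assumption: 0 when nothing to allocate
        result.append((source, target, weight, round(afact, 3)))
    return result

# Hypothetical block records for one tract (values invented for illustration)
blocks = [
    ("1101.00", "63109", 700), ("1101.00", "63109", 550),
    ("1101.00", "63110", 625), ("1101.00", "63111", 625),
]
for line in build_correlation_list(blocks):
    print(line)
```

Because each block carries exactly one value for every geocode, no geometry is needed at this stage: the intersections fall out of grouping the blocks by their source and target codes.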

Programming Details

We'll work on a separate module for discussing the real nitty gritty details of the programming and interface tools used to build the application. But basically the application was written in SAS(r) and uses Perl interface scripts to handle the forms output. The MABLE database is a series of SAS datasets and views with a few of the items (so far the 103rd Congressional Districts and the 1990 PUMA codes) implemented as "virtual variables" using SAS format libraries to do lookups "on the fly". Most of the SAS code (and all of the database design) was done by John Blodgett of the Urban Information Center, University of Missouri St. Louis under a contract with CIESIN. The Perl interface routines were written primarily by Hendrik Meij of CIESIN. The HTML design and coding have been a joint effort.


Input Options

These are the options (specified at the top of the MABLE/Geocorr input form) that control the basic nature of the correlation list you want to have built for you. Here you specify the states, the geocodes and the weighting variable.

Note that here, as throughout the form, all items have been assigned default values, so you need to at least consider each one. If you do not, then the default value remains in effect and you need to be sure that this is acceptable. In other words, don't rush through the form assuming that if you fail to fill something out that is important, that you will be prompted for the value. If you do not specify the weighting variable (for example), the program assumes POP and does so without any dialogue with the user.

Selecting state(s)

Click on one or more state names in this select box to indicate the state or states that you wish to process. You must specify at least one state (note that Alabama is "pre-selected" so if you do nothing about it that is what you will get). The instructions on the form state "(max=4)". We put that on there to discourage over-use of the system, which may overload the systems on which the application runs and have a detrimental effect on other users' response times. We would prefer that you honor this limit. But as a bonus for actually taking the time to read the documentation you should know that this limit is not implemented in the code. But we strongly encourage you to abuse this privilege only during off hours and we reserve the right to cancel any application if we see that it may be having an adverse effect on other processes. Be sure you know how your browser works when making multiple selections from a select list like this. Netscape, for example, requires that you hold down the "ctrl" key while clicking on items to get multiple selections; but IBM's Web Explorer does not - each click makes a new selection and you have to click on a selected item to de-select it. Be careful with this.

Weighting variable

Select a single variable to be used to measure the amount of intersection between the source and target geocodes on the output file. By default, the 1990 (complete count) population will be used. On the output this variable will contain the sum for all the blocks used in creating the output record.

Option to ignore blocks with no population

There are many census blocks that occupy space but have no population. When building a correlation list with POP as the weighting variable you may find that leaving these blocks in results in output lines showing a correspondence that has a value of 0 for the POP and AFACT (allocation factor) variables. This indicates some spatial overlap between the areas, but no population in that overlap. If you check this box then those lines with 0 population will not be present on your output. It will also make processing slightly faster since the program will have fewer observations to process.

Link to the MAGGOT file

This application makes use of a lot of different levels of geography ("coverages"), most of them corresponding to standard Census Bureau defined areas. To help users who are not familiar with these types of geography, we have created this auxiliary help file with more detailed descriptions of each of the area types. The link was placed here because the next two select boxes are where you'll be selecting the geocodes you want to process. "MAGGOT" is an acronym for Master Area Geographic Glossary of Terms.

Source geocodes

Click on one or more geographic codes you want to use for the "source" portion of the correlation list. The output will normally be sorted by the values of these codes. For example, if you select COUNTY and TRACT (the defaults) then the output file will be sorted by county and then tract. The sort order is the order in which the variables appear in this select list. Certain geocodes occur in a hierarchy so that selecting them automatically triggers selection of a higher-level qualifying variable. These are MCD (implies COUNTY), TRACT (implies COUNTY), BLOCK GROUP (implies COUNTY and TRACT), and BLOCK (implies COUNTY and TRACT; block group is not selected but it is implicitly present as the 1st digit of the block). Note that COUNTY is a 5-digit code that includes the state code. STATE is always added to the output file, even if it is not explicitly selected as one of the source or target geocodes. All codes are FIPS (Federal Information Processing Standard) if defined, or census codes otherwise. Be sure you understand that selecting multiple source geocodes means that the source areas are the intersection areas formed by looking at values for all the source codes. Thus if you click on COUNTY, MCD and ZIP for the source geocodes then the "source areas" represented on the resulting correlation lists are formed by the intersection of these 3 area types: i.e. a portion of a ZIP code within an MCD within a county. If what you actually want is a correlation list for MCDs and a (separate) correlation list for ZIP codes, then you need to invoke the application twice; geocorr creates only one list per run.

Target geocodes

Most of what was said for the source geocodes, above, applies equally to the target geocodes. Do not select the same geocode in both lists. The codes you select here define an area formed by the intersection of those areas. The correlation list defines the relationship of the source codes to these target areas. The default value (what you'll get if you do not click in this select box at all) is ZIP - the 1991 5-digit ZIP code (as defined in the Census Bureau's ZIP-Block equivalency file).


Output Options

These options specify details about the output generated by geocorr. In most cases you will be able to accept all defaults for these options (unlike the input options where accepting all defaults would be very rare.)

Second allocation factor (afact2): target to source

The standard output correlation list from geocorr has a single AFACT allocation factor variable which indicates the decimal portion of the source geocodes contained within the target geocodes. It may also be useful to know how this works going in the other direction, i.e. to know what portion of the target area (the complete target area, not just the part within the source area) is contained in the source geocodes. Selecting this option causes geocorr to do the extra processing and calculations required to create such a "dual factored list". The best way to see how it works is to select the option once and study the AFACT2 values.
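
To make the two factors concrete, here is a small Python sketch. It is an illustration under stated assumptions, not geocorr's own code: the function name and the sample rows are invented, and the target totals are computed from the rows supplied, so the result is only meaningful if those rows cover every source area that intersects each target.

```python
from collections import defaultdict

def add_afact2(rows):
    """rows: list of (source, target, weight). Returns rows with AFACT and AFACT2 added."""
    source_total = defaultdict(float)
    target_total = defaultdict(float)
    for source, target, weight in rows:
        source_total[source] += weight
        target_total[target] += weight

    out = []
    for source, target, weight in rows:
        # AFACT: portion of the source area that lies in the target area
        afact = weight / source_total[source] if source_total[source] else 0.0
        # AFACT2: portion of the (complete) target area that lies in the source area
        afact2 = weight / target_total[target] if target_total[target] else 0.0
        out.append((source, target, weight, round(afact, 3), round(afact2, 3)))
    return out

# Hypothetical tract-to-ZIP intersections (values invented for illustration)
rows = [
    ("1101.00", "63109", 1250),
    ("1101.00", "63110", 625),
    ("1102.00", "63110", 1875),
]
for line in add_afact2(rows):
    print(line)
```

In this made-up case ZIP 63109 lies entirely within tract 1101.00, so its AFACT2 is 1.0, while ZIP 63110 is split between the two tracts with AFACT2 values of 0.25 and 0.75.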

Sort by target geocodes

Normally the output file is sorted by the source geocodes, then by the target geocodes within the source. This option lets you override the default and have it sorted by the target codes first. An example of where you might want to use this option would be in creating a ZIP to CD103 list. You want to look at which ZIPs and what portions of those ZIPs make up each Congressional District. But you want the results organized by CD first, so that you can focus on the portion of the report relevant to the district you want to mail to. Specifying this option causes the output to be sorted by CD first, then show all the ZIPs within each CD together with allocation factors indicating what portions of the ZIP are in the CD. (If you specified CD103 as the source and ZIP as the target, you would get the sort the way you wanted, but then the AFACT allocation factor values would show the portion of the CD that was within the ZIP, which is typically a very small and relatively useless number.)
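
The effect of this option amounts to re-sorting the same entries. The Python sketch below uses invented ZIP-to-CD103 rows purely for illustration; note that the allocation factors are untouched and still describe the portion of each ZIP that falls in the district.

```python
# Hypothetical ZIP-to-CD103 entries: (zip, cd103, pop, afact); values are invented
rows = [
    ("63105", "01", 4000, 0.250),
    ("63105", "02", 12000, 0.750),
    ("63109", "03", 27000, 1.000),
    ("63110", "01", 9000, 0.600),
    ("63110", "03", 6000, 0.400),
]

# Default order: source (ZIP) first, then target (CD) within source
by_source = sorted(rows, key=lambda r: (r[0], r[1]))

# "Sort by target" order: CD first, then ZIP within CD
by_target = sorted(rows, key=lambda r: (r[1], r[0]))

for zip_code, cd, pop, afact in by_target:
    print(cd, zip_code, pop, afact)
```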

Generate a comma-separated value (CSV) file

This option is selected by default, meaning that geocorr will create an ascii file in comma-delimited format that you'll be able to browse (preview) and then save to your local disk. Generally this is the option to use if you want to do processing of the correlation list back on your platform using your favorite software package. The ".csv" file extension is a standard that is recognized by most Windows programs, making it easier to import the data into those applications. Note that this file will have the variable names as values in the first line (the "header" record), which when imported into a spreadsheet such as Excel or Lotus will become the first row. If you have no interest in obtaining such a file (you only want the report format) then click on this box to turn off the option. It will save processing time.

Add names to output CSV file

Note that this option seems to appear twice: once it pertains to the output .csv file and the second time it pertains only to the listing.

In many cases it will be convenient to carry along names to go with the codes on your output file. If you select option 2 or 3 then, for any geocode for which geocorr has a "name table" and that you select as either a source or target geocode, the program will add a new variable (with name ending with "NM", e.g. PLACENM, COUNTYNM, etc.) to the output ascii (.csv) file. Usually, if you want names, you should select the "codes and names" option, rather than asking for just the names.

Generate a listing (same information as output file but report format)

You'll normally want to leave this option selected so you can at least see a nicely formatted eye-readable version of your output (the .csv file is intended more to be program-readable than eye-readable although you can browse it and count commas.) This is the preferred format for using as a reference report. The lines can be up to 120 characters across and it will print 240 lines before generating a page break with fresh column headers. Source geocodes will always appear first (leftmost) on the report and consecutive duplicate values of the source geocodes will be blanked out to emphasize "breaks" in the value of the source codes. This will normally be the largest output file. If you do not need or want it then you can save processing time by deselecting this option.

Add names to output listing

See the discussion, above, of names for the output CSV file. Generally, you are more likely to want names on the listing output than on the .csv file. The default is no, so you have to select this option to get the names included.

Weighted centroids on output file(s)?

Each of the census block entries in the MABLE database has a pair of latitude, longitude coordinates for an "internal point" of the census block. This is the geometric centroid of the block except in those few cases where the true centroid is not within the block, in which case it is moved to a location just inside the block. When you select this option, geocorr keeps these coordinate values and as it processes the blocks within the source/target geocode groups it takes a weighted average of their values (using the weight variable specified in the INPUT OPTIONS section - usually 1990 population.) The result of this is that on the output files you will have two extra columns of data, INTPTLNG and INTPTLAT (these are terrible names and we may get around to changing them - make sense on the MABLE database, but not on the output). They will be in degrees, with 6 digits after the decimal point kept (if needed.) West longitude is assumed, no minus signs.
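
The weighted average described here can be sketched as follows. This Python example is illustrative only; the function name and block values are invented, and returning no centroid when the weights sum to zero is an assumption of the sketch, not documented geocorr behavior.

```python
def weighted_centroid(blocks):
    """blocks: list of (longitude, latitude, weight) for one source/target group."""
    total = sum(weight for _, _, weight in blocks)
    if total == 0:
        return None   # assumption: no centroid when there is nothing to weight by
    lng = sum(x * weight for x, _, weight in blocks) / total
    lat = sum(y * weight for _, y, weight in blocks) / total
    return round(lng, 6), round(lat, 6)   # 6 digits after the decimal point

# Two hypothetical blocks: internal point (west longitude, latitude) and population
blocks = [(90.25, 38.60, 100), (90.27, 38.62, 300)]
print(weighted_centroid(blocks))
```

The block with three times the population pulls the result three quarters of the way toward its own internal point.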

Specifying a name for the output files

This option is perhaps more trouble than it's worth. You can safely ignore it if you want. It allows you to specify up to a 15-character name for your two output files. They are normally named "geocorr.csv" and "geocorr.lst" for the ascii comma-separated-value file and listing file, respectively. If you type "tr2zip.detroit" in this box, then the files will instead be named "tr2zip.detroit.csv" and "tr2zip.detroit.lst". The only reason this might matter is if you intend to save the files to a local disk and your browser is able to pick up and use the original name as the default for the local copy. If this makes no sense to you, just ignore the option - you won't need it.

Universe filtering options (Limiting the Geographic Universe)

For many applications by the time you get to this point on the input form you'll be ready to click on the "Run Request" button to tell geocorr you are finished with your specifications. With one very minor exception (having to do with adding a distance-to-a-point variable to the output file) all of the options that remain have to do with limiting the set of blocks that will be processed by geocorr. This can be done by specifying either county, place or metro-area level filters, or by specifying a point-and-distance select criteria. We begin with the latter.

Point-and-Distance Criteria

There are 4 closely-related items that can be specified in this section. If you enter values for a specified point location as decimal degrees of longitude and latitude you are telling geocorr that you want it to calculate the distance between that point and the "internal point" of each census block on the MABLE database that is otherwise selected for processing (i.e. that first passes the other geographic filters we'll be discussing, below.) Note that the longitude value entered is assumed to be West longitude and the leading minus sign is optional; if entered, it is ignored. Entering a value of "92.3456" is interpreted as 92.3456 degrees west longitude. Geocorr expresses all coordinates with this convention: longitudes on output files are also expressed as positive values for west longitudes. Many GIS programs will require these values to be negated if these coordinates are to be processed.
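
The longitude convention is easy to get wrong when moving data between geocorr and other software, so a small Python sketch may help. Both helper names are hypothetical and exist only for this example.

```python
def parse_geocorr_longitude(text):
    """Interpret a form entry as degrees west; a leading minus sign is ignored."""
    return abs(float(text))

def to_signed_longitude(west_degrees):
    """Convert a positive-west value to the signed convention (west is negative)."""
    return -abs(west_degrees)

print(parse_geocorr_longitude("92.3456"))    # entered without a sign
print(parse_geocorr_longitude("-92.3456"))   # entered with a sign: same result
print(to_signed_longitude(92.3456))          # value to hand to most GIS software
```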

If the point location you want to use corresponds to a valid street address then you may be able to take advantage of an address-location service provided by the Mapquest corporation. A link to their web page is provided. You'll need to do 2 additional clicks to get to the form where you enter the street address. If the address is found, it will return a map of the area with the address at the center. The lat-long coordinates appear (in very small font) just above the map. You need to write these down, back out of the Mapquest application (this will take about 4 clicks on the "previous" button), and then manually enter the values into the geocorr form. Be sure to keep the latitude and longitude straight. If you enter the latitude in the longitude box and vice-versa (like we did when we first tested this feature) you will not get any geocodes selected. WE HAVE NOT VERIFIED AND CANNOT GUARANTEE THE ACCURACY OF THE MAPQUEST COORDINATES. We did, however, run several test addresses using local addresses and the results appeared to be correct.

You cannot enter just one of the coordinate values: if you specify a longitude value then you must specify a latitude value as well.

The entry for "radius" has a default value of "0", which has a special meaning. When "0" is not overridden it signifies that the distance calculation is not to be used as a filtering mechanism (i.e. no blocks are to be excluded from processing based on their distance from the specified point). In this case, entering a point location simply means that you want to carry along an extra variable in the output file which represents the distance (in miles or kilometers - see the next option) between the weighted centroid of the output record and the specified point. Not a frequently used option, but possibly of some value.

More typically, however, you will enter a non-zero value for the radius option, and when this happens filtering takes place that will limit processing to blocks whose internal points are within the specified distance from the specified point. Using this option has a dramatic effect on the way you interpret the entries in the output correlation list, since everything there has to be qualified by starting with the initial filtering options. Typically, this option will be used with a very large target area (such as a complete state or metro area) and the real correlation is between the n-mile circular area and that large target area or areas. For example, you could specify a place-to-state correlation list (source geocodes=place, target geocodes=state), with a metro area filter (only the portions of the places within the specified metro area are processed) and the coordinates of the metro airport entered with a radius of 3 miles specified. What results is that only blocks within 3 miles of the airport are selected. On output the POP figure shows the total persons living in the specified places and also in blocks that are within 3 miles of the airport, and the AFACT variable will typically be 1.0 since ALL of the blocks in the selected place will be associated with the same state.

It is critical to remember that the POP figure shown is not the total population of the place, but only the population of the portion of the place within 3 miles of the specified point. If you need to know what portion of the total population of the place is within this circle, you will have to do some special postprocessing, since this figure is not readily generated directly by geocorr.
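
The radius filter can be pictured with the following Python sketch. It is an illustration under stated assumptions: this document does not say which distance formula geocorr uses, so the great-circle (haversine) formula, the Earth radius constant, the inclusive comparison and all names and sample values below are our own choices.

```python
import math

EARTH_RADIUS_MILES = 3958.8   # mean Earth radius; an assumption of this sketch
KM_PER_MILE = 1.609344

def distance(lng1_west, lat1, lng2_west, lat2, kilometers=False):
    """Great-circle distance; longitudes are positive degrees west, as in geocorr."""
    lam1, phi1, lam2, phi2 = map(math.radians, (-lng1_west, lat1, -lng2_west, lat2))
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    miles = 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
    return miles * KM_PER_MILE if kilometers else miles

def within_radius(blocks, point_lng, point_lat, radius, kilometers=False):
    """Keep blocks whose internal point is within radius; radius 0 means no filtering."""
    if radius == 0:
        return list(blocks)
    return [b for b in blocks
            if distance(b[0], b[1], point_lng, point_lat, kilometers) <= radius]

# Hypothetical blocks: (west longitude, latitude, population)
blocks = [(90.37, 38.75, 120), (90.37, 38.85, 80)]
selected = within_radius(blocks, 90.37, 38.75, 3)
print(sum(pop for _, _, pop in selected))   # population inside the 3-mile circle
```

Running the same request with and without the radius, and dividing one POP figure by the other, is one way to do the postprocessing mentioned above.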

The "check for kilometers" check box can be used to specify that the radius value entered is to be interpreted as kilometers rather than the default miles. The "Label of point" box allows you to enter text to describe the location represented by the point. This label gets added to the variable label on one of the documentary output files.

Note that whenever you specify the point option a DISTANCE variable is added to the output file. This distance is in miles (or kilometers if you checked that box) and represents the approximate distance between the calculated weighted centroid of the output area (source/target intersection) and the specified point. When you are using the point-and-radius options strictly as a filter you may well have no interest in this item, but it is included in the output nonetheless.

General information re filtering by geographic code lists

Geocorr allows you to specify lists of 3 types of geographic areas that will be used to further limit the geographic universe ("further" meaning, following state-level filtering which is mandatory and is dealt with under INPUT OPTIONS). The first box to check is preceded by the explanation for its use: to specify that if multiple types of geocodes are used to filter that they each be considered as sufficient rather than necessary conditions for inclusion. For example if I do not check this box and I then enter a value in the "place codes" box for Kansas City, Mo and a value in the "county codes" box for Jackson county, Mo, then the universe would be limited to the portion of Kansas City within Jackson County. But when I check this box then the conditions become "or"-ed instead of "and"-ed, meaning I want all blocks that are either in the city of Kansas City or in Jackson County. So now I get all of the city (which I did not before) plus I get the parts of Jackson County that are not inside the city.
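
The difference between the "and"-ed and "or"-ed conditions can be sketched in Python. The function name, the dictionary layout and the sample blocks are hypothetical; the codes are included only to make the example readable and should be looked up before real use.

```python
def passes_filters(block, counties=(), places=(), either=False):
    """block: dict with 'county' and 'place' codes. Empty filter lists are ignored."""
    checks = []
    if counties:
        checks.append(block["county"] in counties)
    if places:
        checks.append(block["place"] in places)
    if not checks:
        return True                       # no code filters were entered
    return any(checks) if either else all(checks)

# Hypothetical blocks (codes shown for illustration only)
blocks = [
    {"county": "29095", "place": "38000"},   # in the city and in the county
    {"county": "29047", "place": "38000"},   # in the city, but in another county
    {"county": "29095", "place": "35000"},   # in the county, but in another place
    {"county": "29047", "place": "99999"},   # in neither
]

both = [b for b in blocks if passes_filters(b, ["29095"], ["38000"])]
either = [b for b in blocks if passes_filters(b, ["29095"], ["38000"], either=True)]
print(len(both), len(either))
```

With the box unchecked only the first block survives; with it checked the first three do.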

To limit the universe based on one or more counties to be selected you can enter their FIPS codes in the box provided. Be careful to enter full 5-digit codes when processing multiple states; 3-digit codes are OK if you only selected a single state for processing. Specifying a code for a state that was not selected will cause an error and geocorr will not complete processing. If you need to look up a county code, simply click on the "County codes" hyperlink. You'll have to note what the codes are and enter them after returning from the linked-to code pages.

Similar processes apply for filtering by place and by metro area, except, of course, that there is no option for entering a state portion of the codes. Simply enter the 5-digit FIPS place codes or the 4-digit MSA, CMSA or PMSA metro codes in the appropriate boxes.

Be sure to specify leading zeroes in all codes.
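
If you prepare code lists with a program before typing them into the form, it helps to treat the codes as text so that leading zeroes survive. The Python sketch below mirrors the county code rules described above; the function name and its error messages are hypothetical and are not produced by geocorr.

```python
def normalize_county_code(code, selected_states):
    """Return a 5-digit state+county FIPS code as a string, or raise ValueError."""
    code = str(code).strip()
    if len(code) == 3:
        # A 3-digit code is only unambiguous when a single state was selected
        if len(selected_states) != 1:
            raise ValueError("3-digit county codes need exactly one selected state")
        code = selected_states[0] + code
    if len(code) != 5 or not code.isdigit():
        raise ValueError("expected a 3- or 5-digit county code: %r" % code)
    if code[:2] not in selected_states:
        raise ValueError("state %s was not selected" % code[:2])
    return code

print(normalize_county_code("095", ["29"]))          # single state: prefix is added
print(normalize_county_code("01001", ["01", "29"]))  # leading zero is preserved
```

Storing a code such as "01001" as a number would silently turn it into 1001, which is exactly the mistake the leading-zero rule is meant to prevent.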

Don't forget to click on the SUBMIT button to tell the application that you have finished with specifications and are ready for processing.


Accessing and Understanding the Output

When and if your request is successfully processed you should see a screen with a series of filenames and descriptions, with each of the filenames being a hyperlink to the file itself. There are four possible output files, depending on what options you select. These are each described.

The summary.log file

This file gives a very brief summary of what you requested and a little about what the program did to satisfy the request. It tells you, for example, how many census blocks were selected for processing and how many lines (records, observations) actually made it to the output files. The first line on this file tells you what your "Process id" was for this request. If you have any problems with your request you need to be sure to save this key number and report it to the authors with a description of what went wrong. In most cases, you'll find that you should be able to safely ignore this file unless you have a problem.

The geocorr.lst file

This is your listing (i.e. report format) file. It is usually the largest of the output files, and often the most important. Note its size before attempting to print or save it to your desktop since it may be quite large. If you filled in that box on the form that let you specify a name other than "geocorr" you should see that name here instead of geocorr. The same applies for the .csv file, next.

The geocorr.csv file

This is your comma-delimited ascii file. You might want to browse/preview it, but you'll most likely want to save this back on your local disk. You should be able to easily load the file into a spreadsheet for further local processing.
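
If you would rather process the file with a program than a spreadsheet, a header line of variable names followed by data rows is easy to read. The Python sketch below uses made-up file contents; the actual column names and their capitalization depend on the options you chose, so treat the ones here as assumptions.

```python
import csv
import io

# Hypothetical file contents: a header record followed by data rows
sample = io.StringIO(
    "county,tract,zip,pop,afact\n"
    "29510,1101.00,63109,1250,0.500\n"
    "29510,1101.00,63110,625,0.250\n"
)

rows = list(csv.DictReader(sample))
for row in rows:
    # Geocodes stay as text so leading zeroes are kept; only measures become numbers
    row["pop"] = int(row["pop"])
    row["afact"] = float(row["afact"])

print(rows[0]["zip"], rows[0]["pop"], rows[0]["afact"])
```

For a saved file, replace the `io.StringIO` object with `open("geocorr.csv", newline="")`.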

The varlst.lst file

This is a very short file that simply provides a little extra information about the variables, as specified in the header record, on your .csv file. If you did not request a comma-delimited file then you will not get this file either: they are a matched set. The report lists each of the variables (fields) on your file and adds a descriptive label to help you identify what each means. You'll note that the variables have a consistent order in this report and on the .csv file with the source geocode fields appearing first, followed by the target codes and then the weight variable, allocation factor(s) and any x-y coordinate and distance-to-specified-point items. If you did not explicitly specify "state" as one of your geocodes you will nonetheless see it added to this file as well, usually after the last target geocode and just before the weighting variable.

These files are all stored in a temporary directory and will remain there for a period of 2 hours or so. But you should retrieve them to your local system before exiting the application.

Last Update: 02-04-97