A common problem encountered with using geography-based data (i.e. data that summarizes a set of geographic entities) is that we have data for some geographic areas describing a universe (such as a county or state) but not for some other sub-units of that same universe which are of interest. For example, we might have 2010 census data for units such as census tracts and block groups within a state, but not for the state legislative districts (house and senate) of that state. We shall refer to the units for which we have data as the source geography, and the units for which we would like to get data as the target geography. This terminology will be familiar to MABLE/Geocorr users, where you get to choose two kinds of geographic layers from select lists labeled as Source and Target geographies. What we would like to be able to do is to process the data at the source geocode level to at least approximate data for the target units.
In this tutorial we want to focus on a few specific examples of doing source-to-target data allocations, outlining the basic algorithm and even pointing to some specific tools (in the form of SAS macros) that we have been using at the Missouri Census Data Center for more than 30 years. Our examples will make use of equivalency files (aka "correlation lists" or "crosswalk files") generated using MABLE/Geocorr. We'll include discussions of how reliable the results may be and look at a way of measuring the reliability of an allocation.
We are not aware of any end-user-friendly applications that handle this allocation process. We assume that anyone wanting/needing to apply these methods will have some programming skills that will allow implementing the algorithm. We shall cite some SAS code because that is the language/resource we use for doing such processing. But obviously there are many other programming options.
We chose this example because it was relatively uncomplicated and illustrates some of the issues you'll encounter doing allocations. It is not practical because the data we'll be allocating to the township ("county subdivision", called townships in Missouri) already exists. But that allows us to compare the approximations we'll be generating with the true values.
For Adair County, Missouri we want to allocate a set of 10 basic demographic items from the 2010 census tabulated at the block group level to the county subdivision level (these are called "townships" in Missouri; it varies by state). There are 25 block groups and 10 townships in the county, which had a total population of 25,607 in the 2010 census. After doing the allocation we would like to do an evaluation step where we attempt to assign a reliability indicator to the output set indicating the extent to which the data for that area was allocated/disaggregated. We will also do a compare between our estimated results and the actual tabulations of the data from the 2010 census.
(MABLE/Geocorr web utility application)
We choose Missouri as the state, Census Block Group 2010 as the source geocode, and County Subdivision/township/MCD 2010 as the target geocode. We took defaults on most of the rest of the form options except that we entered 001 in the County Codes box in the Geographic Filtering section of the form to limit processing to Adair county; and we chose to generate a second allocation factor, showing us what portion of the Target's population was within the Source. Here is the plain-text report generated by geocorr:
Does this look promising? Not great and not terrible. What we would like to see are a bunch of 1.000 allocation factors in the bg to cousubfp alloc factor column. Looking at the first block group (Tract 9501, BG 1) we see that it intersects with 4 townships. This means we'll be disaggregating the data for that block group to 4 areas. One of these, Nineveh, is a trivial piece, with just under 1%. Polk gets almost half of the BG (44%) while Benton and Clay get about 25 and 20%, respectively. We have similar cases for Tract 9501, BG's 2 and 3. These are the neighborhoods within the city of Kirksville. Note that once you get past tract 9502 you are in a much more rural area where the townships are larger and we see mostly 1.000 allocation factors in all these. That is a good thing since it means we can allocate all of those Block Groups to townships with no concern about the error that comes with disaggregation. What is that? It is the error that is introduced when we allocate data proportionally and assume/hope that the subareas are homogeneous. When we disaggregate the 9501/BG 1 data we give about 25% of all the BG data fields to Benton Township. In the case of Total Population, we know that this allocation is exactly right. But we also will allocate data items such as Persons Under 18 and Persons Over 65. Just because a quarter of the BG's total population is in Benton Township does not mean that the same portion of persons over 65 are as well. But that is the way we shall allocate, and that is where there is uncertainty (allocation error).
In our case that means SAS. So we wrote a very short program to read the saved CSV file generated by geocorr. Here we see that code next to a partial display of the resulting data set:
Here is the data we chose to allocate: 3 geographic ID fields and 10 numeric items for all block groups in Adair Co.
We pulled these data from one of our 2010 census SF1 based data sets with Missouri block group level data. We chose a dozen numeric variables for the purpose of illustration. We could have almost as easily pulled hundreds of variables, but we wanted to keep it simple so we could see all the data. Note that the geographic key fields — county, tract and bg — are in the same exact format as one on the equivalency file set. The LandSQMI variable is being rounded to the nearest integer in this display but actually contains fractional values that are typically less than 1.0 for the block group level.
We have an equivalency file that has multiple observations (records) per source area (block group); at least for some of them. And we have a source data set with one and only one observation/record per block group. Both these inputs share the Adair county, MO universe. We shall do a one-to-many join of these two data sets. We use the following SAS code with a merge statement to handle the data-join logic:
The SAS merge statement makes this pretty trivial. SQL users should have no problem coding the equivalent join. Note the code that checks to make sure there are no unmatched geographies. We need to match every block group on the source_data set with at least one observation on the equivalency data set. When there are multiple matches we'll have multiple output observations. The number of output observations should always match the number of observations on the input equivalency file.
The resulting data set (tobeaggregated) looks like this:
We have omitted the County column from this display. It is a constant: 29001. Note how the township (cousubfp — County Subdivision FIPS code) and afact (BG to township allocation factor based on 2010 population) have been added. The LandSQMI field is being displayed as a whole integer, but the value actually stored is much finer and can handle values less than 1 with no problem.
Now all that remains to do is to aggregate the data in this tobeaggregated data set by the target variable(s). The degree of difficulty here is dependent upon what tools are available to do the job. If you have to do it from scratch, without the aid of software tools that handle this type of processing, then you have some work to do. But if, like us, you have access to software that knows how to perform these tasks then the step becomes fairly simple.
Here is what we have to do to in this step:
If you are a SAS user like us, we have a macro that can be used to implement the aggregation process as described above. Even if you are not a SAS user you may be able to borrow from the logic of this program to implement something comparable for other computing environments. We have online documentation for the macro, and the code itself.
Here is the SAS code we used to perform this aggregation process:
And here is the output from the Proc Print step to display our results:
How did we do? A good first step in verifying the results is to look at the last observation here, the Adair county totals. These values are readily available from various sources including the MCDC's 2010 Census SF1 Profile Report product. When we run that app for Adair County we get a report that includes:
When you compare the numbers in our allocated data set with these what do you see? They match almost exactly. Or at least the Total Population, the Persons under 18 and over 65 are exact. The 2000 population estimate differs by 13 (out of almost 25,000) and would be exact except for roundoff issues at the block group level. Not shown here, but also an exact match is the Group Quarters population, as is the Land Area in square miles.
So does this mean our allocations are "perfect"? Not at all. It just means we probably do not have any processing errors. When we do our allocations all we do is chop our block group data into smaller pieces representing their intersections with townships. The "error" (i.e. the difference between our estimate and the true value) occurs at the township level. But all these smaller pieces still add up to the BG values, so when we do our aggregation it makes perfect mathematical sense that all the BG pieces should sum up exactly to the county totals. But what about at the township level?
To help us get a handle on this question we did some custom processing of the data, comparing the results of our allocation with the township level actual data from the 2010 census. We created a data set with four rows per township. The Type variable indicates the meaning of each row. Here is the resulting report that shows us how we did with our estimates.
We omitted the GQPop and PopChange variables from the report. Some things to notice on this report include the following:
We have said that allocations are more reliable for larger areas, and less so for smaller ones. But is there a way we can actually create a measure of the reliability of the data allocated to a specific geography. The uncertainty comes when we have to allocate (i.e., when the afact variable has a value other than 1.0) data. If an area is the result of taking 20% of Geocode A and 30% of Geocode B, vs. an area where we took 100% of C and 90% of D, which would be the better estimate? Clearly, the latter case yields a more reliable result because we are doing less allocation. And we can easily measure this characteristic by simply taking the weighted average of the allocation factors in our correlation list, weighting by the population figure used to create the allocation factor. We did this with the correlation list used in this case study and got the following results:
As we thought, Benton is by far the "best", while Walnut township comes in last. The factors tend to be correlated with size in this case. Here is the SAS code we used to generate these data (the above is just a Viewtable display of the AvgAfacts data set):
You can access all the code shown in this tutorial (and more), as well as some of the SAS data sets created by that code using our Uexplore web application to access the directory (folder) for it here. The code is stored in the bg_twpAdair.sas file. The SAS data sets have extensions of sas7bdat; if you click on one of those it will invoke Dexter to allow you to run a query to see the data in the set.
The first case study was a teaching tool where we kept things very simple so we could focus on the mechanics of doing an allocation. In this second case study we'll be looking at a real world application that we actually coded and ran here at the Missouri Census Data Center. The problems we address here are not at all uncommon with census data for small geographic areas. The best units for small area data analysis of census data are census tracts and block groups. One of the bad things about these units is that they are redefined every ten years, so to get trends you have to figure out a way to get the "old" (i.e., previous decade) data into the "new" (current decade) geographic entities. In the very important case of the American Community Survey (ACS) data, we have an interesting opportunity/challenge regarding census tract level data. We would like to look at trends in ACS data for census tracts in calendar year 2016, where the latest available ACS data vintage is 2014 (at least until sometime in December when the 2015 data will be released.) The problem is that we are warned against using overlapping 5-year intervals to do trend analysis. The good news is that the earliest 5-year data available (2005-2009) does not (for the first time ever) overlap with the latest (2010-2014) data. The bad news for doing this at the census tract level is that back in 2010 when they were creating the 2009 vintage data they were still using the old census 2000 geography for block groups and census tracts. What we want to do is take the 2005-2009 data tabulated to 2000 census tracts and do an allocation of those data to get them into 2010 census tracts, so that we can compare them with the latest available (vintage 2014) data.
To do this allocation we are going to need an equivalency file relating the 2000 census tracts to 2010 census tracts. Since we are going to be allocating ACS data for 2005-2009, we would like to have allocation factors based on the population of the intersections in the latter part of that time period. That would be ideal. The only problem is that we really do not have any publicly available data at small enough geographic units to create an equivalency file based on, say, 2007 estimates. But a list based on 2000 population counts would be pretty good, if not ideal. It turns out that is what we do have, and that is what we shall use. Can we get such a list using MABLE/Geocorr? The answer is "no" — geocorr unfortunately does not provide small-area links across time. So we need to look for another source.
As you would expect, the Census Bureau does provide tools for linking 2000 census blocks to 2010 blocks. They are called block relationship files — see the Bureau's Relationship Files web page. They even have tract relationship files, but we only processed the block relationship files. We were then able to easily create tract-to-tract equivalencies from the block-to-block data. We stored these data in the blkrel10 data directory of our public data archive. To get a true block2k to block10 population-based equivalency file, we would need to know how many people lived in each block intersection. We do not have that, but what we do have is the size of the spatial intersection of these blocks. We then disaggregated the 2000 population of the 2000 blocks to 2010 blocks based on the spatial measure. If 2000 block A had 50% of its area within 2010 block B, then we estimated the population of that intersection to be .500 of the 2000 population. While that is not a particularly reliable assumption, keep in mind that most uses of the derived equivalency files will be for much larger areas, where the entire 2000 block is contained within the target area, so that the block allocation factors do not matter. For people who read SAS code and want to get down to the nitty gritty, you can see the code we used to process these block relationship files in the Tools subdirectory of the blkrel10 data directory, file blkrel2010.sas. (Feel free to jump to the next paragraph to avoid all the SAS code that follows here. It is useful to understand how this works but not necessary if you trust us.) In this program, we create a set of state-based block-to-block relationship files and then a set of corresponding tract-to-tract files. Here is some of the code used:
*---create a block 2k to block 2010 correlation list (equivalency file)----;
%corrwt(setin=blkrel10.&stab.blks,geocds1=state2k cnty2k tract2k blk2k, geocds2=state cnty tract blk, wtvar=Pop2kInt, keepwt=1, afacts=1, exitout=xcode, setout=blkrel10.&stab.bl2k_bl10);
*---create a tract2k to tract10 correlation list (equivalency file)----;
%corrwt(setin=blkrel10.&stab.blks,geocds1=state2k cnty2k tract2k, geocds2=state cnty tract, wtvar=Pop2kInt, keepwt=1, afacts=1, exitout=xcode, setout=blkrel10.&stab.tr2k_tr10);
The corrwt macro does most of the work here. (View the source here.) This is the macro that the geocorr web modules invoke to do the heart of the geography processing. The geocds1 parm specifies the source geocodes, while geocds2 is the target list. The appearance of &stab. in the setin/setout parameters tells you that this code processes data for a specific state whose postal abbreviation is stored in global macro variable stab. We have a utility macro called %dostates that we use to invoke a state-based processing macro to repeat that code for either all or selected states in the US.
In this "real-world" application, things are a bit messier. For example, we do not have just 10 numeric variables to aggregate. We have hundreds of variables. Almost half of these are percentages (if there is a variable called Age0_4 you can just about bet on there being a variable called PctAge0_4). There are also variables that contain margin of error values (MOEs). We opt to not process these variables, but we need to get a list of them so we can specify that they be dropped. We could do it without using special tools to help, but that would be not only very tedious and time consuming but could also affect the reliability of the results.
Some of the most commonly used ACS data items are medians. Median Age, Median Household Income, Median Home Value, etc. are all key indicators. But the data we shall be using on input is a special extract done using more detailed data from the complete set of summary (aka "base") tables provided. We have 11 medians in our standard profile data sets. Of these 11, four of them have other variables that form a distribution table corresponding to that median. For example MedinaHHInc (median household income) can be paired with the eight variables HValUnder50, Hav50, ..., HvalOverMillion that form a distribution table from which a median could be estimated. But the eight intervals are a pretty severe collapsing of the 24 intervals on the more detailed distribution in the base tables. This means that even if we were to make the extra effort to estimate the median by a post-aggregation step where we used data from the aggregated distribution, it would not be all that accurate because of the lack of detail. The same situation applies in each of the four cases where we had some distribution figures. So what we do instead is violate the statistician's code and estimate the median values by taking their weighted averages. The result is a weighted average of the medians of the components, but it is not the median of the aggregate. It may not be the actual median, but it is still a not bad measure of the typical, middle value of the distribution.
A common problem when doing aggregation is to make sure that variables that are not not directly aggregatable get aggregated properly. It is easy to aggregate variable HvalOverMillion (number of homes with value of $1,000,000 or more) but what about PctHvalOverMillion? If you had to aggregate four observations to get an output record you cannot simply add up the four percentage values and get a result that makes any sense. What you really have to do is recalculate the percentage by aggregating numerators and denominators and then recalculating the percentage following the initial aggregation. This turns out to be equivalent to taking the weighted average of the Pct variables, where the weight variable is the denominator used in calculating the percentage. In the case of our PctHvalOverMillion, the denominator is OwnerOcc — number of owner-occupied housing units. Taking weighted averages also works for variables like PCI (per capita income, which is just an average) or AvgFamilyInc. So how does one calculate a weighted average? Simply:
One of the uglier "real world" realities that needs to be dealt with in doing this processing has to do with making sure that the geographic codes that are used on our equivalency file (based on the blkrel10 data sets) match up with those being used in the ACS data. There are four geocodes being used here: the 2000 vintage county and census tract, and the 2010 versions of the same. So why should we have to worry about our codes all matching up? They all come from the same source (the Census Bureau), and that source is known for its reliability and attention to geographic detail. So what could go wrong? Here is an example, extracted from the Bureau's 2009 Geography Changes page:
"There are major changes to the geographic definitions for two of the Census 2000 tracts in Ketchikan Gateway Borough and Prince of Wales - Hyder Census Area, resulting from the substantial annexation of territory by Ketchikan Gateway Borough from the former Prince of Wales - Outer Ketchikan Census Area. However, the remaining Census 2000 tract definitions in these areas are unchanged. The following describes the comparability of Census 2000 tract definitions in the current context, compared to their original definitions in Census 2000.
"Tracts 1, 2 and 4 in Prince of Wales - Hyder Census Area have the same definitions as Tracts 1, 2 and 4 in the former Prince of Wales - Outer Ketchikan Census Area from Census 2000.
"There are substantial changes in the geographic definition of Tract 3 in Prince of Wales - Hyder Census Area, and there is no comparability with Tract 3 in the former Prince of Wales - Outer Ketchikan Census Area from Census 2000. Tracts 2, 3, and 4 in Ketchikan Gateway Borough retain the same Census 2000 tract definitions.
"There are substantial changes in the geographic definition of Tract 1 in Ketchikan Gateway Borough, and there is no comparability with the original Census 2000 definition of Tract 1 in Ketchikan Gateway Borough.
"Lastly, AFF access to ACS 5-year estimates for Tracts 1, 2, 3, and 4 in the former Prince of Wales - Outer Ketchikan Census Area are accessed through the successor Prince of Wales - Hyder Census Area instead."
The changes for the 2000 census tracts were confined to the states of Alaska (worst case) and Colorado (the creation of a new county, Bloomfield, created problems), and Clifton Forge, VA. These are the valid changes. But then you should read the final section on this changes page. The first paragraph of this section explains:
"Census tracts and block groups used to tabulate and present 2005-2009 ACS 5-year estimates are those defined for Census 2000. However, in 19 counties from 8 different states, many of the census tracts and block groups used to tabulate and present the 2005-2009 ACS 5-year estimates are either those submitted to the Census Bureau for the 2010 Census, or a preliminary version of 2010 Census definitions. These census tracts and block groups were inadvertently included in the version of the Census Bureau's geographic database (TIGER) used to produce geographic area information for the 2005-2009 ACS 5-year estimates."
The only good news here is that the problem is well documented. (The Census Bureau does not always correct their mistakes, they just document them.) There are good and precise descriptions of what happened during the years following 2000 in Alaska, Colorado and Virginia to cause significant restructuring of the census tracts used for tabulating the 2000 census. So why is this relevant to our allocation project? Because in the blkrel10 files released circa 2001 the county and tract codes used are the original as-of-the-2000-census codes. But the codes that the Bureau uses to tabulate the tract-level data for the 2005-2009 5-year estimates are those reflecting the valid changes and mistakes described above. This means when we do that potentially simple step of joining/merging the equivalency file (blkrel10) with the source data (acs2009) we are not going to have a perfect match. Not unless we first engage in the tedious task of translating the above narrative descriptions of what changed into a block-level file telling us what the "new" county and tract codes are whenever they do not match what they were at the time of the 2k census. We have done something like this dealing with similar changes to the 2010 census tracts vs. the ones currently being used in the 2014 vintage ACS data. It took us several days to write the code to capture those changes at the block level. And we have not even talked about how to deal with the problems with the wrong tracts used in the 2005-2009 ACS data. They provide spreadsheets to help with seeing the problems, but admit that there are some cases where it will not be possible to overcome the errors and get data by the original 2000 tract geographies, per this excerpt from their description of the problem:
"The purpose of inclusion is to identify the tracts and block groups where this condition is true, and acknowledge that in some cases comparisons to Census 2000 tracts and block groups is not possible."
There are actually two sets of geographic fixes required to make our results be what we want. We need to get updated 2k tracts on our equivalency file so that they will match the tracts used on the 2009 vintage ACS files. But we also need to update the 2010 tracts as well. If we do not then the output data set will not reflect the current tract-level geography being used on the 2014 ACS data. And the whole point of doing this allocation is to get something that we can match up with the 2014 data and look at trends. The task is easier on this side because there have not been as many changes this decade as there were in the previous one, and because we have already done some work on creating a data set that will help us with these updates.
So, how are we going to handle this? We are going to take a cue from the Census Bureau for the time being (until someone funds us to do otherwise or provides us with the block level "corrections"). We are going to document the problem, but not fix it. The errors will occur during the step where we join our equivalency files with our 2005-2009 ACS data. We match using the key variables County and Tract, as of 2000. The problems will occur in places where the 2000 tract has had to revised (only effects 3 states: AK, CO, and VA, as noted above) and in states where the Bureau used post-2000 tracts for tabulating the 2005-2009 data. As we do our join/merge we can document mis-match problems to identify cases where the 2000 county/tract on the block relationship file does not match the County/tract codes on the vintage 2009 ACS data. We can afford to take this somewhat cavalier position since our most frequently used states, Missouri, Illinois, and Kansas, are not affected.
We at the Missouri Census Data Center maintain a public archive of data, most but not all based on data from the Census Bureau. We also provide special web applications to allow users to access and query these data. If you are not familiar with our Uexplore and Dexter applications you might want to follow the link in our blue navigation box to the Data Archive home page, which has lots of links to both the software and to sections of the data archive (categories we call filetypes). One of the filetypes we have is called acs2009 and has data based on the vintage 2009 American Community Survey data. With that directory, we already have a data set containing the 5-year period estimates at the census tract level. This is not a tutorial on how to access our data archive, so we are not going to get into great detail here. But one of the data sets that we have in this collection is a standard extract of more than 300 key data items each of more than 170,000 geographic areas covering the entire U.S. Among those geographic areas there are more than 65,000 at the census tract level. These entities are the 2000 census tracts. These are the data we'll be using as our source. We shall be allocating and aggregating these data to get estimates of these data items at the 2010 census tract level.
The result of this data allocation shall be stored in the public data archive as well, in the same filetype (data directory) — /pub/data/acs2009. Not only shall we be making the results accessible to the public, but we'll be making the code we used to create that data set accessible as well. We do that by putting the SAS program file used to create the new data sets in our usual location: in the Tools subdirectory of the filetype data directory (/pub/data/acs2009/Tools/allocated2trk2010.sas).
(Note: this is a little geeky and of interest mostly to SAS programmers, so feel free to skip this paragraph). We could do this all in one huge step where we read in data for the entire country and then wrote just a single output data set with data for the entire country, similar to the input data set. But we choose not to. Instead we shall do these state by state. It is a bit easier to code that way, and it allows us to bypass states where there might be data problems. We have a simple coding strategy that employ for most of our national data collections. We write code to process a single state in the form of a SAS macro that has a standard interface with two positional parameters called state and stab. The first contains the 2-digit FIPs state code (with leading 0s), while stab contains the 2-character state postal abbreviation. Within the macro we reference these parms rather than specific values. To specify the state whose data we want to extract from the usmcdcprofiles5yr data set, we code
[where] state="&state" (as opposed to the more straightforward
state="29"). Once we get the code working we can then invoke the macro to do selected states (especially valuable while debugging) or we can easily invoke it for all 51 states using our %dostates utility macro.
We have three lists of variables that we shall use for our conversion processing. We need a list of all the variables that are margin-of-error values, so that we can specify that we want these dropped from the input data step. This could be coded manually but it would be very tedious and error prone. Fortunately, it is pretty easy to do this by combining our strict variable naming conventions with a sas utility macro. All such variables have names of the form [variable-name]_moe, where variable_name specifies the name of the variable for whuch this is the corresponding MOE measure. Our utility macro (actually it was written by David Ward and he shared it the SAS-L list many years ago and we use it quite a bit) %varlist allows us to specify a data set and a name pattern such as *_moe and have it return a blank-separated list of all variables on that data set with names matching the pattern. We code this as
The list of more than 340 MOE variables can now be referenced when needed in our program by just coding a reference to &moes.
The other key variable lists we need to save us from tedious and error prone coding are the lists of mean variables and the corresponding mean weights that can be used to weight those variables when doing aggregations. Fortunately we have a metadata data set where we keep such information regarding the variables in our mcdcprofiles-type data set. This is the
data _null_; step beginning around line 30 of the SAS program file. We won't go into program logic here but will tell you what the step accomplishes. It creates two global macro variables named means and meanwts, and the values of these variables are the lists we shall need to tell %agg how to handle these variables. There are 330 such variables, which include all of the Pct variables. In addition to creating this pair of global macro variables, the step also writes a file with a short report showing how the means and their weight variables pair up. This report can be viewed in the meanwts.txt file in the same Tools directory as the conversion program.
Now that we have finished defining some key variable list macro variables we are ready to begin the actual processing of the data. We begin (circa line 70 of the program — note that we are inside the %doit macro) with a data step that defines the acstracts data. Here is the SAS code that accesses the ACS data to be aggregated/allocated and then merges it with the tract 2k to tract 2010 equivalency file to create the data set with 2010 tracts ready to be aggregated.
The key statement here is the
set statement at line 72 referencing the very large acs2009.usmcdcprofiles5yr data set:
set acs2009.usmcdcprofiles5yr(in=inacs drop=&moes where=(state="&state" and sumlev='140') );
set statement is used to read SAS data sets, one observation at a time. The
drop=&moes specification tells SAS not to bother with all those variables that have names ending in _MOE. The where clause —
where=(state="&state" and sumlev='140') — specifies the conditions for having an observation actually read during the data step. It specifies that we only want those observations where the value of the variable state is equal to the value &state — the value we specify as the first positional parameter when invoking the %doit macro. If we are processing Missouri this turns into
The second condition (which is connected to the first with the and operator, meaning we want both conditions to be true) specifies that we only want those observations where the value of variable sumlev is 140. Census data geeks will recognize this as meaning we want summaries at the census tract level. Ignore all those other state, county, place, ZCTA, PUMA, etc. summaries — we only want complete census tracts. When we run the macro for Missouri it will pull out just over 1900 observations (cases, tracts, ...) from the more than 170,000 on the complete referenced data set — and will do this in a matter of a second or less because the input set is indexed by state.
The data step beginning at line 83 (
data agiin&stab;) is the key to doing the allocation. This is the step where we merge (i.e. combine, join) the ACS estimates data with the equivalency file. The SAS
merge statement takes care of the program logic for doing the join. Using
in= variables, it is easy to know what you are dealing with in terms of the two files being matched for a given cycle of the data step. It makes it easy to know when you have a case where one of the two data sets is not matched. The
by statement (line 85) specifies the variables used to link the observations from the two data sets. We have a one-to-many merge here: there can be multiple observations on the blkrel10 equivalency file with the same by variable keys (identifying a 2000 census tract), but there should be only (at most) a single observation with these by variables on the acstracts data set. Each time we have a match an observation is written to the output data set (
aggin&stab). The if statement at line 86 checks to see if we have a match, i.e. if both of the
in= special flag variables as specified on line 84 are set to true to indicate we have a case where both input data sets are contributing. If we have a tract that is in the data but not in the equivalency file, then we do not know where the allocated values are to go, and if we have an equivalency file tract without any matching ACS data, we have no data to allocate. So we really need to have only cases where the tracts match between the data and the equivalency file. When they do not match it is an error that needs to be looked into to avoid invalid output. The code at lines 87 to 94 are error handling. We opt to ignore cases where we have a tract with 0 population (it does happen), and also if we have a case where the equivalency file indicates a zero population intersection. The "bad" case gets documented on the cnvtlog file where we keep track of these problems. This file is the same no matter what state is being processed and gets written to mod, meaning when we start processing Alaska we just pick up from what we wrote to this file while processing Alabama. It becomes a rather important file, because it lets us know which states are going to need some work before we can fully trust the allocated data for them. The other key thing to note in this error handling block is the
goto cklast; statement at the end of the do group. Turns out that we do not really need that statement, but it emphasizes that we do not want to output to aggin when we have a problem case. The statement is not needed because we modified line 95 to specify that the output statement (which is what causes the program to actually write something to the output data set) will only be executed when the condition at line 86 is false, i.e. when we have data from both input sets.
Note that we can easily detect when we are finished processing (i.e. when we are processing the last input cycle) using the
end=_last spec on the
merge statement. When we are processing the last observation the _last flag will be set to true, otherwise it will be false. So we can execute the code starting at line 100 only when we are almost finished. Note that if found no bad cases, we output a line to the cnvtlog to document this fact. We also can output more detailed summary info for states that did have problems to get a better handle on how serious the problems may be. The cnvtlog file is stored in the same data directory as the input and output data sets. It shows that of the 51 states (we treat DC like a state for census data purposes), 39 had no errors (unmatched data), leaving 12 states with problems. One of those 12 was Oregon; it had a single error and the population not matched rounded to zero.
This is perhaps the most challenging step (not counting finding or creating the equivalency files) of the process. This is where we have to do the data allocation, apportioning the data based on the allocation factors from the equivalency files. But if you have a good tool set it can become very routine. We already discussed the %agg macro in our first case study. It won't change for this example. What makes it harder is having to specify those long lists of "mean" variables and their corresponding weights. But we already did this, taking advantage of the fact that we had a metadata set that make it pretty easy. We created global macro variables called _means and _meanwts with the values needed. So all we have to do is invoke the macro, specifying the parameters. As follows:
run; statement is necessary (if running interactively) to cause the generated data step to actually run.
There is nothing too complicated here. We want to add some identifiers to make it easier for us and our Dexter users to be able to understand what these data are about. The ID variables Vintage, Period and SumLev are constants and document the fact that these are vintage 2009 5-year period estimates at the census tract level. The Areaname variable is always helpful and starts with "2010" to emphasize that these are 2010 tracts (not 2000 versions such as are reported elsewhere in the acs2009 data directory). Tract is handy to have as a separate variable. The esriid variable is one we commonly use. It corresponds to the portion of the Bureau's Geoid variable without the first seven characters. For example, for the first tract in Adair county, MO partially displayed below, the value of Geoid would be 14000US29001950100, where the bold portion is the value of esriid.
Specifying these variables in a SAS length statement prior to the set statement causes them to be stored first within the observations on the SAS data set. This makes it easier to work with data extracted from the archive when all the IDs appear first (in the leftmost columns of your Excel spreadsheet or SAS data set, for example). Here is a (partial) look at the first observation on the Missouri output data set:
We use our %dostates utility macro (mentioned earlier) in order to invoke the %doit macro for specified states. You can see the two ways it can be used. At line 136 you see how we used it to run some test states (5 of them) by specifying their codes as the single positional parm. This invocation has been commented out (the %dostates replaced with %* to indicate a macro comment instread of a macro invocation). Line 138 shows the other way to invoke dostates using the n= parameter. We use n=51 to invoke the macro for all 50 states and DC.