Ten More Things to Know (and Do) About the American Community Survey

Item 8: Downloading Custom Extracts from MCDC Site

The Missouri Census Data Center maintains a large collection of public data files, most of them from the Census Bureau or other government agencies. We have created a web-based set of software tools to assist users in navigating this collection and accessing specific datasets in order to do custom extractions. The navigation tool is called Uexplore, and the tool for doing the custom extractions is called Dexter (for data extracter). There is a data archive overview page that describes the system, and a data archive home page where you can start exploring the system. But such a start-from-the-top approach is too much for our purposes here. Instead, we'll focus on the less daunting task of just accessing the ACS data. (We do encourage you to explore these just-referenced pages on your own, but maybe not just right now.)

The Missouri Census Data Center ACS data collection can be broken into three basic categories:

  1. data downloaded directly from the Census Bureau web site, corresponding to the Census Bureau Data Profiles
  2. data we use in our own MCDC ACS data profiles, derived primarily from the data in the Bureau data profiles and augmented with some items derived from the base tables
  3. base tables, the set of 1,200+ tables corresponding to the ones you can see in American FactFinder when you select the Detailed Tables option for an ACS dataset. (In some contexts, such as the FTP-access portion of the Bureau web site, these may be referred to as the "ACS Summary Files". But that is a technical detail that very few data users should ever need to worry about.)

We store the Census profiles and the base tables data in their own separate subdirectories, leaving just our own MCDC custom profiles data in the main ACS data directories (we have separate directories for each data-release year, i.e. acs2006, acs2007, acs2008, ...). We have a complete national collection (excluding Puerto Rico) of data for the survey years 2006 through 2008, with the 3-year period estimates for 2005-2007 and 2006-2008. We keep the 3-year period estimates in the same data directory as the 1-year data for the last year of the period (sometimes referred to as the vintage year). We plan to continue this pattern, so in late 2010 you can expect to see vintage 2009 data stored in an acs2009 data directory, which will contain 2009 single-year data, 2007-2009 3-year data, and (perhaps/probably) 2005-2009 5-year data. Normally someone needing to access one of these datasets would come in through the archive home page and navigate their way to the acs2008 directory (plug in acs2007 or acs2009 if you need a different vintage), but we provide a shortcut (a link labeled "Extract Data via Dexter", as shown just below) that takes you not only to that directory, but also chooses the relevant dataset and drops you into a dynamic web application which we'll refer to as the Dexter query form. Essentially, we have eliminated the uexplore (navigating the archive) portion of the uexplore/dexter navigate/extract process. This puts you in the mine, but you still have to extract the gold.
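For readers who like to script their downloads, the vintage-year convention just described is regular enough to generate programmatically. Here is a minimal Python sketch of that naming pattern (the function names are ours, invented for illustration):

    # A minimal sketch of the vintage-year naming convention described above.
    # Function names are invented for illustration.

    def acs_directory(vintage_year):
        """Directory holding all data released for a vintage year, e.g. acs2008."""
        return "acs%d" % vintage_year

    def period_label(vintage_year, period_years):
        """Span covered by a period estimate ending in the vintage year."""
        if period_years == 1:
            return str(vintage_year)
        return "%d-%d" % (vintage_year - period_years + 1, vintage_year)

    for period in (1, 3, 5):
        print(acs_directory(2009), period_label(2009, period))
    # acs2009 2009
    # acs2009 2007-2009
    # acs2009 2005-2009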

At the bottom of an ACS Profiles page you should see a row of links.

The right-most link is to Complete Metadata (pdf). Right-click on this link and open it in a new window or browser tab. You can peruse it a bit now if you like, but then return to the profile page window/tab. We'll want to return to the metadata when we do the extraction step, below.
The third link is Extract Data via Dexter; clicking on it takes you straight to the Dexter form, which should look something like this:

Click here to see the complete page, "live", instead of just this snapshot of the top portion. What happens next can be a little daunting at first, especially if you have never been here before and are not all that comfortable with phrases like "data query" and "data filter". The keys to success in doing the extraction are the user's understanding of and interest in these data. Basically, this translates into two questions: do you know what's available, and do you know what you want?

The dataset being accessed here is called usmcdcprofiles3yr, stored in the directory acs2008. The mnemonic dataset name (usmcdcprofiles3yr) can be parsed. The us at the beginning of the name tells you that this is a national dataset (has data for geographic areas covering the entire country); the mcdcprofiles part tells us what type of dataset we have (Missouri Census Data Center standard profiles data); and 3yr indicates that this is a set of 3-year period estimates. (As opposed to, for example, a dataset named momcdcprofiles, which would be Missouri-only data of the same type (mcdcprofiles) and based on a single year.) According to the info displayed at the top of the form, this dataset has 14,265 rows/observations and 1,050 columns/variables. Each row corresponds to a geographic area, while each column/variable contains one piece of information regarding that area. These variables correspond to the data items in the MCDC ACS profile reports.
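The mnemonic naming scheme lends itself to mechanical parsing. Below is a small Python sketch of the parsing rule just described; the regular expression reflects only the conventions named in this paragraph (a us/mo prefix, a type, an optional Nyr suffix), not any official MCDC specification:

    import re

    # Parse an MCDC dataset mnemonic such as "usmcdcprofiles3yr" into parts.
    # Pattern reflects only the naming conventions described in the text.
    NAME_RE = re.compile(r"^(?P<geo>us|mo)(?P<type>[a-z]+?)(?P<period>\d+yr)?$")

    def parse_dataset_name(name):
        m = NAME_RE.match(name)
        if not m:
            raise ValueError("unrecognized dataset name: " + name)
        return {
            "coverage": "national" if m.group("geo") == "us" else "Missouri only",
            "type": m.group("type"),               # e.g. mcdcprofiles
            "period": m.group("period") or "1yr",  # default: single year
        }

    print(parse_dataset_name("usmcdcprofiles3yr"))
    # {'coverage': 'national', 'type': 'mcdcprofiles', 'period': '3yr'}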

So what are we going to do with this information? Let's say we want to look at indicators of economic well-being (income and poverty measures and housing values/rents) for all counties in your state. For the sake of our example, we'll use New York as the state, but you should be able to easily substitute your own state. We are going to want to put these data into an Excel spreadsheet so that we can do various manipulations of the data once we have them "in house".

The first section of the query form is titled "I. Choose Output Format(s)". If we leave everything alone here we shall be asking the application to give us a "Delimited File", specifically a comma-delimited (or "csv") file. Excel users know that a comma-delimited file (with filename extension of ".csv") is readily imported and converted to an Excel file. And it helps if the first row or two contain column identifier information. This is exactly what Dexter will produce for you, by default.

The second section of the query form is titled "II. Choose rows (observations) to keep by specifying a filter". This is the most challenging part of the form to complete because it requires some understanding of the dataset being queried, and often requires knowledge of coded values. In this case we need to know that each row of the dataset represents a geographic entity: the nation, a state, a county, etc. Try clicking on the Quick Look button at the top and it will display a page like this:

We don't see too much in the screen shot here, but if you do it live and scroll down to see the first 100 rows you should be able to recognize a pattern. The rows begin with a complete United States summary, followed by an Urban portion of the U.S., a Rural portion, etc. There are 16 rows/observations that are geographic component summaries for the entire U.S. These are followed by summaries for the 4 U.S. regions (rows 18-21) and then by geographic components of those regions (rows 22-65, 11 components per region). The key here is to recognize the meaning and importance of the two fields/variables SumLev and Geocomp. One indicates the kind of geographic area (e.g. 020 is the code for a Region) and the other a special component of the geographic area, such as the Urban or Rural portion. We want to choose rows where the value of SumLev indicates a county-level summary, and where the value of Geocomp indicates "Not a geographic component". It's pretty obvious that the code for the latter is "00", but what about the SumLev code to indicate a county-level summary? These codes tend to be well known by experienced census data users but not by casual or first-time users. We need a method for helping users access a codebook of some sort to help them with such coded values. This is where the link to detailed metadata near the top of the form (just above the Quick Look button) comes in. Clicking on this link takes you to this page (abridged):

There is a lot of useful information on this page, but the specific items we want you to focus on now are the 3 highlighted links in the Key Variables section of the page. Clicking on any of these key variable names takes you to a key values page such as:


What you should be able to learn from this page (among other things) is that there are 1,821 rows/observations on this dataset where the value of the variable SumLev is 050, and that this code means county. Clicking on the geocomp and state key variable links will produce "Values with frequencies..." pages that show the codes for those two variables. Armed with this knowledge you can now proceed to complete the data filter. We define a filter with three conditions. A row will be selected if and only if all three of the conditions are true. The first condition is that the SumLev variable must have a value indicating a county-level summary. The second condition is that the row not be a special geographic component - we just want the "Not a geographic component" summaries. The third and final condition specifies the state we are interested in. The variable State contains FIPS state codes which, if we did not already know it, we could have looked up by following the link to the key values frequencies page for the variable State. The code for New York is 36. We click on the down arrows and select values for the variable names and the logical Operator values, and we type the desired values into the text boxes. Here is what the filter section looks like:

Note that we leave the last two rows as-is and we do NOT do anything with the check boxes for inserting parentheses. These are rarely used.
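If you later find yourself working with a full copy of such a file locally, the same three-condition filter is straightforward to reproduce. Here is a sketch in Python using pandas; the file name is a placeholder, and we read the code columns as strings so leading zeros survive:

    import pandas as pd

    # Reproduce the three-condition Dexter filter on a local copy of the data.
    # "usmcdcprofiles3yr.csv" is a placeholder file name.
    df = pd.read_csv("usmcdcprofiles3yr.csv",
                     dtype={"SumLev": str, "Geocomp": str, "State": str})

    ny_counties = df[
        (df["SumLev"] == "050")     # county-level summary
        & (df["Geocomp"] == "00")   # "Not a geographic component"
        & (df["State"] == "36")     # FIPS code for New York
    ]
    print(len(ny_counties), "New York county rows kept")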

We now proceed to the "III. Choose columns (variables)" section. This is where you will probably want to refer back to the metadata pdf file linked to from the bottom of the profile report (which we asked you to access above), where you'll see


that it is a table with rows that parallel those in the profile reports, and includes a table index panel. This makes it relatively easy to access any portion of the 41-table profile report. The key fields here are the Variable Name and Description columns. These allow you to put a name and label with items viewed in the report; they allow you to see, for example, that the name of the variable containing the median age of persons living in the area is MedianAge, and that the variable containing the count of persons aged 85 and over is Over85 and has a Universe variable of TotPop. So what is a "Universe variable"? It is the variable we associate with another count variable, used as the denominator in calculating a Pct (percentage) variable. For example, the variable Age0_4 has a Universe variable of TotPop; this tells us that the corresponding percent variable (which will be called PctAge0_4) is derived by taking Age0_4 as a percentage of TotPop. The fact that the Universe column is blank for the MedianAge row tells us that we do not have a percentage variable associated with this median (since it would not make sense). We do have an entry in the Weight column for MedianAge, which tells us what variable we might want to use to weight this value if we were to do aggregation of the data. You should use this report to help develop a list of the variables you are interested in keeping on your extract.
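The Universe convention amounts to simple arithmetic. A quick Python illustration with made-up numbers (only the variable-naming convention comes from the codebook; the values and the weighted-average approximation are ours):

    # The Universe convention: a Pct variable is its base variable taken as
    # a percentage of its Universe variable. Values here are made up.
    TotPop = 19500    # universe variable
    Age0_4 = 1250     # count of persons aged 0-4

    PctAge0_4 = 100.0 * Age0_4 / TotPop
    print(round(PctAge0_4, 1))    # 6.4

    # The Weight convention: a median such as MedianAge has no Pct variable,
    # but when aggregating areas it can be approximated as a weighted average
    # using the variable named in the Weight column (an approximation only;
    # medians do not aggregate exactly).
    areas = [(38.2, 19500), (41.0, 8700)]   # (MedianAge, weight) pairs
    approx = sum(m * w for m, w in areas) / sum(w for _, w in areas)
    print(round(approx, 1))    # 39.1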

Section III of the Dexter form is where you specify what columns/variables you wish to keep. Initially it will look like this:


Note that Dexter knows enough about the data to be able to segregate Identifiers (mostly geographic and time period codes) from statistical data, so that you select these items from separate lists. The Numerics select list displays variables that are closely related to the ones shown in the pdf file codebook. In addition to the base variables (i.e. the actual counts) and the corresponding Pct variables you also see variables with _MOE suffixes in their names; these are the Margin of Error measures. The Numerics list is very long (about 1000 entries) and it can be pretty tedious to scroll through and find just the items you want. There are some tools that can be used to make this process a bit easier. Notice the highlighted (on the screen snapshot only) box labeled "Filter by regular expression". This is a text box where you can enter strings that the program will use to modify the Numerics select list, keeping only entries that match the expression you enter. The simplest and most useful kind of expression is a series of keywords separated by vertical bar symbols (without blanks). For example, we can limit our search to variables that have a name or label containing any of the 4 "words" income, poverty, poor or population by entering these as shown below:

After entering our filter expression (and also after selecting 3 variables from the Identifiers list, the geographic ids we want to keep on our output) we click on the Filter button and within a second or so we get a refreshed select list as shown. (Actually, this shot shows the new list after we did some clicking and scrolling on it to choose our numeric variables of interest).
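Under the hood, this kind of keyword filter behaves like an ordinary alternation in a regular expression. Here is a Python sketch of the matching logic, using a few invented (name, label) entries for illustration; case-insensitive matching against both the name and the label is our assumption:

    import re

    # Mimic the "Filter by regular expression" box: keywords joined by
    # vertical bars, matched against each entry's name or label.
    expression = "income|poverty|poor|population"
    pattern = re.compile(expression, re.IGNORECASE)

    # Invented sample entries for illustration only.
    variables = [
        ("MedianHHInc", "Median household income"),
        ("PctPoor", "Percent below poverty level"),
        ("MedianAge", "Median age of persons in area"),
        ("TotPop", "Total population"),
    ]

    kept = [(n, l) for n, l in variables
            if pattern.search(n) or pattern.search(l)]
    for name, label in kept:
        print(name, "-", label)
    # MedianHHInc, PctPoor, and TotPop survive; MedianAge is filtered out.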

Note the highlighted instruction to "hold down the CTRL key to get multiple selects". Be careful - it can take several minutes to carefully scroll through a list of variables, making choices by holding down the Ctrl key while clicking on variables to be kept (if you accidentally click on one you don't want, just click on it again to de-select it). Clicking on any entry without holding down the Ctrl key will result in all previous selections being lost - you will have just selected your one and only choice.

There are two sections of Dexter customizing parameters left on the form, but none of them is essential (things like report titles, sort order, variable formats or labels, etc.). So just go ahead at this point and click on the Extract Data button at the bottom of Section III. If all goes well you should be presented with an intermediate Output Menu page like so:


which should normally take about five seconds to be completely displayed. It should immediately appear with a message indicating that the extraction work is in progress and asking for your patience. What you are waiting for is one or more links to the output file(s) as requested in Section I of the form. In this case we get a link to the requested csv file, or Delimited File.

What happens when you click on the Delimited File link is browser dependent. We recommend that you right-click on this link and ask to open the link in a new window or tab. That way you still have your Output Menu page displayed, and you can use the browser Back button to return to the form. (This is important should you decide to modify your query. Unfortunately, we have had varying experience regarding the browser's ability to retain the form as filled out; it seems to always work just fine with Firefox, but is inconsistent with IE: sometimes when we go back we get the form just the way we left it, and sometimes it has been reset so that all the choices we made are lost. This is why we always try to use Firefox when doing complex Dexter queries, including this one.) Here is what we get when we click on the Delimited File link using Firefox, without any special configuring regarding associating an application with the .csv file extension:

Actually, this is a clipped image. It's enough to see that what we have is an ASCII file in csv format, with the top two rows clearly different from the rest. The first row contains variable names, the second row contains variable labels. Rows 3 and beyond contain the actual data. The values are separated by commas and in most cases enclosed in double quotes (values that contain commas must be so enclosed). We can save this file to our desktop and then open it with Excel. Or, we can take advantage of the fact that our Internet Explorer browser comes preconfigured to automatically recognize files with csv extensions as comma-delimited Excel-importable files, and thus invokes Excel and imports the file. So if I click on the same Delimited File link from IE I get this:

(OK, I did a couple of clicks in Excel to make the columns wider to get this view.) Note how the variable names and labels now occupy the first two rows of the spreadsheet with the remaining rows containing the data. Where you go from here is pretty much determined by your knowledge of Excel and the ACS data. But clearly there are lots of things that can be done rather easily with the data at this point, none of them requiring the skills of a professional programmer or statistician.
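If you would rather work with the file programmatically than in Excel, the two header rows are easy to handle. A minimal Python sketch, assuming the layout just described (names in row 1, labels in row 2, data from row 3 on; the file name is a placeholder):

    import csv
    import pandas as pd

    # Read a Dexter delimited file: row 1 holds variable names, row 2 holds
    # labels, and the data start in row 3. Keep the labels as a lookup table.
    with open("dexter_extract.csv", newline="") as f:
        reader = csv.reader(f)
        names = next(reader)     # variable names
        labels = next(reader)    # variable labels
    label_for = dict(zip(names, labels))

    df = pd.read_csv("dexter_extract.csv", skiprows=[1])  # skip the label row
    print(df.shape)
    print(label_for.get("SumLev"))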

This completes our example of an extraction from an MCDC profiles (standard extract) dataset. We also provide access to the detailed tables via the same uexplore/dexter tools. We shall not be doing a detailed description of how this would be done, but we can tell you where to start. That would be at the Uexplore Home Page on the MCDC web site. Click on American Community Survey in the Major Category Index box near the top of the page. Click on the acs2008 link (or choose a different vintage year if you prefer) which invokes uexplore and lets you browse the ACS data collection (directory) for that year.

Click on the basetbls link to access the subdirectory containing the detailed base tables. In that directory you will see the names of our base table datasets such as ustabs21_24.sas7bdat and ustabs21_243yr.sas7bdat. These datasets contain, respectively, all base/detailed tables in the 21 to 24 topic series for all geographies in the U.S., single year (2008) and 3-year period estimates (2006-2008). If you have not already determined what table(s) you need (we recommend using American FactFinder, with its ability to do Keyword and Subject searches, for most table-searching), you can click on the TableTopicsCodes.txt file. The first 2 digits of a base table number are the topic code. So tables regarding topic 21 = Veteran Status will be found in the 21_24 datasets. To see the names of specific variables stored in these datasets click on the Varlabs subdirectory link and then on the txt file for the topic group of interest. Margin-of-error data are stored in separate datasets with comparable names that include the string _moes. You will need to extract the MOE variables separately, keeping the geoid key variable; the MOEs can then be linked back to the estimates using geoid as the key.
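That geoid linkage is a routine key join. A sketch in Python with pandas, assuming you have extracted both the estimates and the _moes dataset to csv files with Dexter's usual two header rows (the file names are placeholders):

    import pandas as pd

    # Link margin-of-error values back onto the estimates using geoid.
    # File names are placeholders; skiprows=[1] drops Dexter's label row.
    estimates = pd.read_csv("ustabs21_24_extract.csv", skiprows=[1])
    moes = pd.read_csv("ustabs21_24_moes_extract.csv", skiprows=[1])

    merged = estimates.merge(moes, on="geoid", suffixes=("", "_MOE"))
    print(merged.shape)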

The important difference between doing an extraction from one of these base tables datasets vs. a smaller mcdcprofiles dataset is that it is flagged as a table-based dataset, which means that in Section III of the Dexter query form you get a Tables select list where you would otherwise expect a Numeric Variables list. So you get to select entire tables rather than the individual variables comprising those tables. (Advanced users who would prefer to select specific table cells rather than entire tables can get around this by entering the list of variables to be extracted in the text box provided at the bottom of Section III of the form -- the seldom-used "3rd way" to specify which variables to keep.)

We have created a separate PowerPoint tutorial Accessing Large Table Files With Dexter to help users get started with table-level access. The Dexter mechanics are really not much different.

Return to Ten More Things ... document (part 2).