The Missouri Census Data Center maintains a large collection of public data files, most of them from the Census Bureau or other government agencies. We have created a web-based set of software tools to assist users in navigating this collection and accessing specific datasets in order to do custom extractions. The navigation tool is called Uexplore, and the tool for doing the custom extractions is called Dexter (for data extracter). There is a data archive overview page that describes the system, and a data archive home page where you can start exploring the system. But such a start-from-the-top approach is too much for our purposes here. Instead, we'll focus on the less daunting task of just accessing the ACS data. (We do encourage you to explore these just-referenced pages on your own, but maybe not just right now.)
The Missouri Census Data Center ACS data collection can be broken into three basic categories:
We store the Census profiles and the base tables data in their own separate subdirectories, leaving just our own MCDC custom profiles data in the main ACS data directories (we have separate directories for each data-release year, i.e. acs2006, acs2007, acs2008, ...). We have a complete national collection (excluding Puerto Rico) of data for the survey years 2006 through 2008, with 3-year period estimates for 2005-2007 and 2006-2008. (We keep the 3-year period estimates in the same data directory as the 1-year data for the last year of the period, sometimes referred to as the vintage year.) We plan to continue this pattern, so in late 2010 you can expect to see vintage 2009 data stored in an acs2009 data directory, which will contain 2009 single-year data, 2007-2009 3-year data, and (perhaps/probably) 2005-2009 5-year data. Normally, someone needing to access one of these datasets would come in through the archive home page and navigate their way to the acs2008 directory (plug in acs2007 or acs2009 if you need a different vintage). Instead, we provide a shortcut (a link labeled "Extract Data via Dexter" as shown just below) that takes you not only to that directory, but also chooses the relevant dataset and drops you into a dynamic web application which we'll refer to as the Dexter query form. Essentially, we have eliminated the uexplore (navigating the archive) portion of the uexplore/dexter navigate/extract process. This puts you in the mine, but you still have to extract the gold.
At the bottom of an ACS Profiles page you should see a row of links.
The right-most link is Complete Metadata (pdf). Right-click on this link and open it in a new window or browser tab. You can peruse it a bit now if you like, but then return to the profile page window/tab. We'll want to come back to this metadata during the extraction step, below.
The third link is Extract Data via Dexter; clicking on it takes you straight to the Dexter form, which should look something like this:
Click here to see the complete page, "live", instead of just this snapshot of the top portion. What happens next can be a little daunting at first, especially if you have never been here before and are not all that comfortable with phrases like "data query" and "data filter". The keys to success in doing the extraction are the user's understanding of and interest in accessing these data. Basically, this translates into: do you know what's available, and do you know what you want?
The dataset being accessed here is called usmcdcprofiles3yr, stored in the directory acs2008. The mnemonic dataset name (usmcdcprofiles3yr) can be parsed. The us at the beginning of the name tells you that this is a national dataset (has data for geographic areas covering the entire country); the mcdcprofiles part tells us what type of dataset we have (Missouri Census Data Center standard profiles data); and 3yr indicates that this is a set of 3-year period estimates. (As opposed to, for example, a dataset named momcdcprofiles, which would be Missouri-only data of the same type (mcdcprofiles) and based on a single year.) According to the info displayed at the top of the form, this dataset has
14265 rows/observations, 1050 columns/variables. Each row corresponds to a geographic area, while each column/variable contains one piece of information regarding that area. These variables correspond to the data items in the MCDC ACS profile reports.
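The naming convention just described is regular enough to parse mechanically. Here is a minimal Python sketch, purely illustrative: the pattern below covers only the name forms mentioned in this document and is not an official MCDC specification.

```python
import re

def parse_dataset_name(name):
    """Split an MCDC dataset name into its mnemonic parts:
    a two-letter geographic prefix (us, mo, ...), a dataset type,
    and an optional period suffix (absent means single-year)."""
    # Only the name forms mentioned in this tutorial are handled.
    m = re.match(r"^([a-z]{2})(mcdcprofiles|tabs\d+_\d+)(3yr|5yr)?$", name)
    if m is None:
        return None
    geo, dstype, period = m.groups()
    return {"geo": geo, "type": dstype, "period": period or "1yr"}

print(parse_dataset_name("usmcdcprofiles3yr"))
# {'geo': 'us', 'type': 'mcdcprofiles', 'period': '3yr'}
```

The same function also decodes momcdcprofiles (Missouri-only, single-year) and the base-table names like ustabs21_243yr discussed later.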
So what are we going to do with this information? Let's say we want to look at indicators of economic well being (income and poverty measures and housing values/rents) for all counties in your state. For the sake of our example, we'll use New York as the state, but you should be able to easily substitute your own state. We are going to want to put these data into an Excel spreadsheet so that we can do various manipulations of the data once we have them "in house".
The first section of the query form is titled "I. Choose Output Format(s)". If we leave everything alone here, we shall be asking the application to give us a "Delimited File", specifically a comma-delimited (or "csv") file. Excel users know that a comma-delimited file (with a filename extension of ".csv") is readily imported and converted to an Excel file. And it helps if the first row or two contain column identifier information. This is exactly what Dexter will produce for you, by default.
The second section of the query form is titled "II. Choose rows (observations) to keep by specifying a filter." This is the most challenging part of the form to complete because it requires some understanding of the dataset being queried, and often requires knowledge of coded values. In this case we need to know that each row of the dataset represents a geographic entity: the nation, a state, a county, etc. Try clicking on the Quick Look button at the top and it will display a page like this:
We don't see too much in the screen shot here, but if you do it live and scroll down to see the first 100 rows, you should be able to recognize a pattern. The rows begin with a complete United States summary, followed by an Urban portion of the U.S., a Rural portion, etc. The first 16 rows/observations are geographic component summaries for the entire U.S. These are followed by summaries for the 4 U.S. regions (rows 18-21) and then by geographic components of those regions (rows 22-65, 11 components per region). The key here is to recognize the meaning and importance of the two fields/variables SumLev and Geocomp. One indicates the kind of geographic area (e.g. 020 is the code for a Region), and the other indicates a special component of the geographic area, such as the Urban or Rural portion. We want to choose rows where the value of SumLev indicates a county-level summary, and where the value of Geocomp indicates "Not a geographic component". It's pretty obvious that the code for the latter is "00", but what about the SumLev code to indicate a county-level summary? These codes tend to be well known by experienced census data users, but not by casual or first-time users. We need a method for helping users access a codebook of some sort to help them with such coded values. This is where the link to detailed metadata near the top of the form (just above the Quick Look button) comes in. Clicking on this link takes you to this page (abridged):
There is a lot of useful information on this page, but the specific items we want you to focus on now are the 3 highlighted links in the Key Variables section of the page. Clicking on any of these key variable names takes you to a key values page such as:
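To make the row-filter idea concrete, here is a small Python sketch of the same selection Dexter's filter section performs. The rows and area names are made up for illustration; "050" is the Census Bureau's standard summary-level code for a county (the kind of code the key-values pages document), and "00" means the row is not a special geographic component.

```python
# Toy rows mimicking the SumLev/Geocomp structure described above
# (values are illustrative, not real ACS data).
rows = [
    {"SumLev": "010", "Geocomp": "00", "AreaName": "United States"},
    {"SumLev": "010", "Geocomp": "01", "AreaName": "United States -- Urban"},
    {"SumLev": "040", "Geocomp": "00", "AreaName": "New York"},
    {"SumLev": "050", "Geocomp": "00", "AreaName": "Albany County, New York"},
    {"SumLev": "050", "Geocomp": "00", "AreaName": "Bronx County, New York"},
]

# Keep county-level summaries that are not a special geographic
# component -- the same condition we will express in Dexter's filter.
counties = [r for r in rows
            if r["SumLev"] == "050" and r["Geocomp"] == "00"]
```

In Dexter itself you express this by choosing SumLev and Geocomp in the filter boxes and supplying the two code values, rather than writing any code.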
We now proceed to the "III. Choose columns (variables)" section. This is where you will probably want to refer back to the metadata pdf file linked to from the bottom of the profile report (which we asked you to access above), where you'll see Description columns. These allow you to put a name and label with items viewed in the report; they let you see, for example, that the name of the variable containing the median age of persons living in the area is MedianAge, and that the variable containing the count of persons aged 85 and over is Over85, which has a Universe variable of TotPop. So what is a "Universe variable"? It is the variable that we associate with another count variable and that is used as the denominator in calculating a Pct (percentage) variable. For example, the variable Age0_4 has a Universe variable of TotPop; this tells us that the corresponding percent variable (which will be called PctAge0_4) is derived by taking Age0_4 as a percentage of TotPop. The fact that the Universe column is blank for the MedianAge row tells us that we do not have a percentage variable associated with this median (since it would not make sense). We do have an entry in the Weight column for MedianAge, which tells us what variable we might want to use to weight this value if we were to aggregate the data. You should use this report to help develop a list of the variables you are interested in keeping in your extract.
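The Universe relationship is simply a denominator rule. A one-line sketch, using hypothetical counts rather than real ACS values:

```python
def pct(count, universe):
    """Derive a Pct variable from a count and its Universe variable,
    e.g. PctAge0_4 = 100 * Age0_4 / TotPop."""
    return 100.0 * count / universe

# Hypothetical values for a single area (not real data).
TotPop, Age0_4 = 20000, 1500
PctAge0_4 = pct(Age0_4, TotPop)  # 7.5
```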
Section III of the Dexter form is where you specify what columns/variables you wish to keep. Initially it will look like this:
Note the highlighted instruction to "hold down the CTRL key to get multiple selects". Be careful - it can take several minutes to carefully scroll through a list of variables, making choices by holding down the Ctrl key while clicking on variables to be kept (if you accidentally click on one you don't want, just click on it again to de-select it). Clicking on any entry without holding down the Ctrl key will cause all previous selections to be lost - you will have just selected your one and only choice.
There are two sections of Dexter customizing parameters left on the form, but none of them are essential (things like report titles, sort order, variable formats or labels, etc.). So just go ahead at this point and click on the Extract Data button at the bottom of Section III. If all goes well you should be presented with an intermediate Output Menu page like so:
What happens when you click on the Delimited File link is browser dependent. We recommend that you right-click on this link and ask to open it in a new window or tab. That way you still have your Output Menu page displayed, and you can use the browser Back button to return to the form, which is important should you decide to modify your query. (Unfortunately, we have had varying experience regarding the browser's ability to retain the form as filled out; it seems to always work just fine with Firefox, but is inconsistent with IE: sometimes when we go back we get the form just the way we left it, and sometimes it has been reset so that all the choices we made are lost. This is why we always try to use Firefox when doing complex Dexter queries, including this one.) Here is what we get when we click on the Delimited File link using Firefox, without any special configuring to associate an application with the .csv file extension:
Actually, this is a clipped image. It's enough to see that what we have is an ASCII file in csv format, with the top two rows clearly different from the rest. The first row contains variable names, the second row contains variable labels. Rows 3 and beyond contain the actual data. The values are separated by commas and in most cases enclosed in double quotes (values that contain commas must be so enclosed). We can save this file to our desktop and then open it with Excel. Or, we can take advantage of the fact that our Internet Explorer browser comes preconfigured to automatically recognize files with csv extensions as comma-delimited Excel-importable files, and thus invokes Excel and imports the file. So if I click on the same Delimited File link from IE I get this:
(OK, I did a couple of clicks in Excel to make the columns wider to get this view.) Note how the variable names and labels now occupy the first two rows of the spreadsheet with the remaining rows containing the data. Where you go from here is pretty much determined by your knowledge of Excel and the ACS data. But clearly there are lots of things that can be done rather easily with the data at this point, none of them requiring the skills of a professional programmer or statistician.
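If you'd rather process the extract with a script than with Excel, the two-header-row layout just described is easy to handle. A minimal Python sketch using only the standard library; the column names and values here are invented stand-ins for a real Dexter extract:

```python
import csv
import io

# A miniature stand-in for a Dexter .csv extract: row 1 holds
# variable names, row 2 holds variable labels, the rest is data.
sample = io.StringIO(
    '"SumLev","AreaName","MedianHHInc"\n'
    '"Summary level","Area name","Median household income"\n'
    '"050","Albany County, New York","55000"\n'
)
reader = csv.reader(sample)
names = next(reader)    # row 1: variable names
labels = next(reader)   # row 2: variable labels
data = [dict(zip(names, row)) for row in reader]
```

Note that the quoted field containing a comma ("Albany County, New York") comes through intact, because the csv module honors the double-quote enclosures Dexter writes.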
This completes our example of an extraction from an MCDC profiles (standard extract) dataset. We also provide access to the detailed tables via the same uexplore/dexter tools. We shall not be doing a detailed description of how this would be done, but we can tell you where to start. That would be at the Uexplore Home Page on the MCDC web site. Click on American Community Survey in the Major Category Index box near the top of the page. Click on the acs2008 link (or choose a different vintage year if you prefer) which invokes uexplore and lets you browse the ACS data collection (directory) for that year.
Click on the basetbls link to access the subdirectory containing the detailed base tables. In that directory you will see the names of our base table datasets, such as ustabs21_24.sas7bdat and ustabs21_243yr.sas7bdat. These datasets contain, respectively, all base/detailed tables in the 21 to 24 topic series for all geographies in the U.S., as single-year (2008) and 3-year period estimates (2006-2008). If you have not already determined what table(s) you need (we recommend using American FactFinder, with its ability to do Keyword and Subject searches, for most table-searching), you can click on the TableTopicsCodes.txt file. The first 2 digits of a base table number form the topic code; so tables regarding topic 21=Veteran Status will be found in the 21_24 datasets. To see the names of specific variables stored in these datasets, click on the Varlabs subdirectory link and then on the txt file for the topic group of interest. Margin-of-error data are stored in separate datasets with comparable names that include the string _moes. You will need to extract the MOE variables separately, keeping the geoid key variable; the MOEs can then be linked back to the estimates using geoid as the key.
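The estimate/MOE linkage just described is an ordinary key join on geoid. A Python sketch; the geoid values and variable names below are invented for illustration (real variable names are listed in the Varlabs files):

```python
# Hypothetical rows from an estimates extract and from the matching
# _moes dataset, both keeping the geoid key as recommended above.
estimates = [
    {"geoid": "05000US36001", "Vets": 18000},
    {"geoid": "05000US36005", "Vets": 42000},
]
moes = [
    {"geoid": "05000US36001", "Vets_moe": 850},
    {"geoid": "05000US36005", "Vets_moe": 1200},
]

# Index the MOE rows by geoid, then attach them to each estimate row.
moe_by_geoid = {m["geoid"]: m for m in moes}
merged = [{**e, **moe_by_geoid[e["geoid"]]} for e in estimates]
```

In a spreadsheet you would accomplish the same thing with a lookup on the geoid column; the point is simply that geoid is the key that ties the two extracts together.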
The important difference between doing an extraction from one of these base tables datasets vs. a smaller mcdcprofiles dataset is that a base tables dataset is flagged as a table-based dataset, which means that in Section III of the Dexter query form you get a Tables select list where you would otherwise expect a Numeric Variables list. So you get to select entire tables rather than the individual variables comprising those tables. (Advanced users who would prefer to select specific table cells rather than entire tables can get around this by entering the list of variables to be extracted in the text box provided at the bottom of Section III of the form -- the seldom-used "3rd way" to specify which variables to keep.)
We have created a separate PowerPoint tutorial Accessing Large Table Files With Dexter to help users get started with table-level access. The Dexter mechanics are really not much different.
Return to Ten More Things ... document (part 2).