For a simpler overview of this material (with pictures) you may want to check out our Dexter Quick Start Guide.
The Stuff at the Top of the Form
We reference the portion of the page above the first horizontal bar, i.e. prior to Section I. There are two links at the very top, to the Dexter Quick Start Guide and to this Online help document.
Data Set ID Line
The line a thte top of the page identifying the selected Data Set is divided into three parts:
- The data directory (path) is displayed and doubles as a hyperlink in case you want to return to a Uexplore page referencing the current data collection (directory).
- A text box containing the name of the data set (without the extension). This has not always been a text box, it used to just display the name as text. The reason we made it a text box now is because we wanted to allow the user to modify the value. You do that by simply editing the value and hitting
enter. Doing so will not change anything you can see on the page. What it will do is change links on the page (for example, to detailed metadata, just below) and, most importantly, the input dataset that is going to be passed to dexter. It is important to be aware that there is little or no checking to make sure that what you type into the box is valid. For this to work you must enter the name of a dataset that really exists in the specified data directory (spelling definitely matters, case does not; the path/directory cannot be changed from this page) and the new dataset should contain essentially the same variables as the current dataset. The current page contains various select lists (in Sections I and III, for example) that have been generated based on variables (/tables) present in the original dataset. These select lists do not get changed to reflect the variables in the new dataset you have entered. The program does not even check to see if the dataset whose name you just typed in even exists. You will only find this out when you Submit the query and get the message saying Dataset specified does not exist. (from Dexter). So type very carefully. This feature is typically useful when you have a somewhat complicated query that you want to run for a series of similar datasets. You could run it for dataset motracts, then go back and change the dataset and run the identical query for iltracts and again for kstracts, etc. -- the same query to access tract level data for Missouri, Illinois and Kansas.
A new wild card dataset specification capability was added to Dexter (November 2015). You can now enter a value containing two consecutive underscores (__) to specify you want all datasets where two characters can replace these underscores. So if a directory contains a collection of data sets for all states in the country you can access all of them (in sequential alphabetical by dataset name order) by replacing the state abbreviation portion of the name with the two underscores. For example in data directory sf12010x there are datasets alslectedinv, akselectedinv, ... wyselectedinv. You could initially select any of these 52 datasets (we have a PR dataset in this collection) to invoke Dexter but then replace the dataset name in the box with __ as the first 2 characters and your query would then access all 52 data sets. If you only wanted to process say 5 of the 52 you could use the wildcard feature and then code a state-based filter specification in Sec. II of the form to filter all but the states actually desired. This will only work if all these data sets have the same variables as the data set you specified when you invoked dexter. None of the datasets accessed can be SAS views (which should not be a problem when accessing a collection with one dataset per state or a time series with one per time period). You cannot specify the "Go straight to rankster" option when using wildcards. You should avoid using the Quicklook feature or following links to metadata after entering the wildcard value. Probably other restrictions will pop up that we have not tested for yet. Let us know if something does not work.
<.# rows>. rows/observations, <.# cols>. columns/variablesThis is fairly straightforward, useful information about the dataset selected (if you overtype the dataset name in the preceding text box, it does not cause this information to be updated). If the dataset is a view (i.e. a "virtual" dataset that is stored as a program reference rather than a normal dataset) then the
rows/observationscount will be missing (the observation count is stored as part of a regular dataset but not as part of a view).
Detailed Metadata Link
There may or may not be a line just below the Data Set: line that invites you to "
See detailed metadata for this dataset.". MCDC staff have been creating metadata modules for many of our most frequently-used and important datasets but we still have many datasets without such a module. If the link is there and you are new to this dataset (or at least this type of dataset) then you really owe it to yourself to follow the link and see what you can learn about the data. Note that one of the most important aspects of these meta files is to provide you with Key Values pages that tell you know the meaning of important identifier codes used on the dataset. Such knowledge is critical for coding data filters (see Section II).
This is another relatively new feature, added to the application early in 2010. The rankster dynamic web application can be used to create reports and data extracts where the focus is on ranking the data based on specified variables. In many ways rankster parallels dexter in the way that it accesses and processes datasets in the archive. There are two basic ways you can invoke the rankster application, and they both require using this line. One way to do it is to just skip dexter and take the specified dataset (in its entirety, i.e. all rows and all columns) and go straight to rankster. You can do this by clicking the link at the end of this line. The other way to invoke rankster is by specifying (via the
Rankster option:select list) one of the two options other than the default "Not interested" value. The two "yes" options allow you to use dexter as a preprocessor for rankster. You can specify that you want to view the dexter results and then have an option to pass the results of your dexter query to rankster; or, you can take the more I-know-what-I'm-doing approach and skip viewing the dexter results and just go straight to the rankster query form with the results of the dexter query specified as the input source for rankster. By having dexter serve as a front end module for rankster means we can take advantage of its nice filtering and variable selection features without having to reimplement them within the rankster application; it also is a way of sharing the uexplore navigation tool to choose the rankster data source.
The button can be used to generate a quick and easy report showing selected variables for the first few rows of the dataset. You would typically hit this button when you first start your query to see if it helps you get a handle on what the dataset looks like. By default the report will contain all the Identifier variables (as displayed in Section III of the left side select list) for the first 100 rows (unless, of course, the dataset has fewer than this). These defaults can be overriden as follows:
- If you select variables from the Numeric Variables list in Section III and do not select anything from the Identifiers list in that section, then you get these numeric variables in addition to all the IDs.
- If you choose from the ID variables list in Section III then the Quick Look report will contain just those variables chosen (from both ID and Numeric lists).
- Quick Look output will not apply any variable-based filters specified in Section II, but if you specify a value in the box to "limit # or observations/rows on each output..." it will use that value instead of 100.
I. Choose Output Format(s)
This section of the form lets you choose up to 3 output formats. Each of the 3 lines here contain a set of Radio buttons that lets you choose whether you want any output of the specfied type and, if so, which specific format.
The Delimited File buttons let you choose a comma or tab-delimited output file. By default you will get a comma-delimited ("csv") file. Note that "csv" is the filetype extension associated with this format; it stands for "comma separated values" and most browsers - IE especially - will be configured to automatically invoke Excel as a plug-in to handle a file of this type. If you choose Tab-delimited instead (since these are radio buttons you can choose only 1 option from each line) then the fields are delimited by tab characters rather than commas and the file generated by Dexter has a ".txt" filetype extension (instead of the default ".csv"); this will usually result in different behavior by your browser when you click on the output file. In most cases it will just display it in your browser and you can then use File-Save to capture it.
These files will begin with 2 rows of identifiers before the actual data lines begin. The first row contains the names of the variables and the second row contains their labels. (In some files, and for some variables, there will be no labels assigned in which case the variable name is repeated as the default label.)
Choosing "none" in this row indicates that no delimited file is to be generated. (Just be sure that if you make this choice, that you select some option other than "none" from at least one of the remaining 2 rows; otherwise, you will have selected no output at all.) If you are expecting to use this format to load your data into a spreadsheet (and we know that is what most people are doing these days) you need to keep in mind that spreadsheet packages have limits on the number of rows and columns they can handle. Many (older) versions of Excel cannot handle more than 255 columns nor more than 65,000 rows.
The Listing/Report choices allow you to specify an output file that is in a format designed for reading rather than for storage and further manipulation. The rows/observations from the dataset become the rows of the report, and the variables become the columns. If you choose plain text or pdf formats then the program will limit the number of columns to display across the page, and will start a new page (with titles and column headings after every so many lines). An HTML format report only displays the headings at the very top and has an unlimited page width, so all variables are displayed on one line. You really should not request a report format if your extract contains many variables, especially if those variables may be long character strings. Plain text format is the best choice for faster response, since it requires the fewest resources to generate; pdf takes by far the most resources to generate. But pdf is the best choice if you want a report that you want to print and/or share with others.
Most of the options in Section IV and those in Section V-b are relevant to the details of this output file; things such as titles, footnotes, column headers, and even complete styles (font and color choices). Tip: Consider the first checkbox in Section V-b; using labels instead of variable names as column headers can greatly enhance the readability of a report. More ambitious users can combine this with use of the Label option in Section V-c to assign their own variable labels.
The Database File option is probably the least often used. The venerable dbf format is still widely used and is especially popular for use within GIS and desktop mapping systems. These files can also be read into Excel and other spreadsheets, although if that is what you want you are better off going with the default - a csv file. The latter requires less space and also is able to store a row of variable labels, which dbf format does not support.
The SAS dataset option is for those who want to access the data using the SAS(c) software package. We write this dataset in the latest (v9) format for the Windows platform. Many of our datasets have variables that are linked to custom SAS format codes. For example, a variable containing a 5-character FIPS county code may be linked to a $county format code, which causes it to display as the county name associated with the code. In the other output formats you just get the results of applying the format, but with the SAS dataset you get the 5-character variable with the associated format code. This can cause problems when accessing from a SAS environment where $county is not recognized. You can avoid the problem by specifying the SAS system option nofmterr. If you would like to recreate that custom format code for your local SAS environment you can access our public formats library and download the source code. Go to http://mcdc2.missouri.edu/cgi-bin/uexplore?/pub/sasfmats and look for the file with the name corresponding to the format name. Character type formats (which are what we use most of the time) will be named starting with an upper-case "S" instead of a "$"; so the code for the $county format code is in the file Scounty.sas. Once you access the code in your browser you'll need to either save it to a local file or you can copy and paste it directly into your SAS program that accesses the dataset. You may need to add a Proc Format statement, since some of these modules will not include that statement. If you plan to use this format for working with many different data sets you may want to look into the possibility of storing the format code in a permanent SAS catalog. But that is beyond the scope of this help file.
We used to alert people to the option of creating SAS datasets that could be viewed with the SAS Viewer utility from SAS Institute. The viewer was a very useful utility that we thought was akin to the Adobe reader software, a free and easy-to-use viewer for people who did not have access to the parent software. But SAS Institute has apparently not viewed it this way. It has become difficult to impossible to find the software to download, and it requires you to be a "registered user" to do so. We cannot even find a version that will consistently read our latest downloaded datasets. So we no longer recommend this option.
The check box at the end of this section is rarely used. Normally when you invoke Dexter it writes things directly to your browser that summarize the query and provide links to the various output files. But if you are only generating a single output element and that element is either a plain text or HTMl report file then checking this box will cause Dexter to send your report file directly to the browser with no itermediate page.
II. Choose rows (observations) to keep by specifying a filter
This is the section where you get to tell Dexter which rows you want to keep for your output file(s). If you skip this section you get all the rows. If you just fill in the box at the top limiting the number of observations/rows then you just get that many output rows (report lines) from the top of the dataset (which is a good way to run a quick test and see what your output is going to look like.) As a general rule (there are important exceptions) the rows of our datasets correspond to geographic entities such as states, counties, cities, etc. Typically what you are doing in this section is specifying the geographic areas for which you want data. You do this by specifying 1 or more conditions which have to be met before the row is chosen. This process is known as filtering. The form allows for specifying up to 5 conditions and for specifying logical operators (And, OR & And Not) for connecting them; it even allows you to indicate parentheses to insert into the "filter" to control the order in which the conditions are to be evaluated and combined. But it is rare that a filter uses more than 2 lines and even rarer that you ever need to specify any logical operators or parentheses. Each condition is comprised of 3 parts: the Variable/Column, the Operator and the Value. The first 2 of these you select from drop-down menus, but the Value column is where you need to enter something. A typical filter would involve accessing a dataset that has census summary data for every county in the country. You would like to just get the data for California. You do this by selecting the variable State from the Variable/Column select list, then selecting the Equal to(=) value for Operator and then entering the code 06 in the Value text box. This tells Dexter to choose only rows where the value of the State code has a value of 06. How would you know to do this? You need to know 3 things:
- That the variable State contains the FIPS code for the state being summarized on each observation.
- That 06 is the FIPS code for California
- How this condition stuff works.
Item 1 is fairly easy. Whenever you see a variable named State in this archive, you can be almost certain that it contains a 2-character FIPS state code. There may be a label associated with the variable explicitly saying something to this effect (the variable labels are displayed following the variable names separated by a dash on the drop-down Variable select list) but usually this can just be assumed.
The hardest part is knowing what the code is for California (or whatever state you are interested in). These are readily available in lots of places on the web, including our geographic codes lookup web application. Another likely source, which will show you just the values and their meaning for the current dataset, can be accessed by looking at the top of the Dexter input form page to see if there is a link labeled "detailed metadata" for this dataset. Following that link takes you to a page that provides a good deal of information about the current dataset, including a section labeled Key variables. This section consists of a row of hyperlinks referring to variables within the dataset that we think will be the most likely to be needed to create filters for the dataset. State is almost always a key variable, at least for datasets that have data for more than 1 state. Clicking on the State link under Key Variables will display a page showing what values are associated with this variable for the current dataset, and even displays a count of how many times each value occurs.
The third item, knowing "how this condition stuff works", seems to be what separates the people who will become happy Dexter users and those who will not. To the former group, it all seems obvious and reasonable, while to the latter it will forever remain a mystery. These people often hire research assistants to extract their data for them (if they are academics) or pay way too much to buy the data "prepackaged" from a commercial vendor (if they work for a company).
Value of Operator
The values that drop down when you click the down arrow on the select lists under Operator (the middle column of this section) are mostly straightforward relational operators. Equal to, less than, greater than -- these are pretty standard. But a few may not be so obvious:
- In List (Separate items with :) is perhaps the most mysterious. This turns out to be one of the most useful operators and it is really not hard to use once you know the conventions. Let's say you wanted data not just for California but also for Oregon and Washington as well. You would then select this In List operator and in the Values text box you would type
06:41:54. The colons are used to separate the list of values. The condition is true if the value of the variable matches any of the values in the list.
- Short compare (=:)(Char Vars Only) is occasionally used. Suppose you were accessing a dataset that had ZIP code summaries and you wanted data for all the ZIP codes that started with "631". You could select ZIP as the Variable, this (short compare) as the Operator , and enter
631as the Value entry. The "(Char Vars Only)" qualifier indicates that this operator only works on character type variables. This should rarely be an issue since we would never (well, hardly ever) store a geographic code such as a ZIP code in numeric format. We have a very rigorously followed convention of storing all our geographic and other identifying codes in character format (so that they occur in the Identifiers list in Section III, rather than the Numerics - this is a quick and easy way to make sure that a variable is stored as a character string).
- Contains (Character Vars Only) is rarely used. It says to see if the Value entered is found as a substring of the Variable specified. For example, you might have a dataset that contained summaries for places and you only wanted to select those that were Census Designated Places. You displayed some observations and noticed that the Areaname variable included the string "(CDP)" at the end to indicate such a place. Thus you could code the filter using Areaname as the Variable value, Contains as the operator, and
(CDP)as the Value.
- Between (Separate items with :) This can be used to select based on an interval where you specify the lower and upper bounds in the Value box, separated by colons. For example, if you were accessing a dataset with place (city) level data and you just wanted towns with populations between 10,000 and 50,000 you would choose the variable containing the total pop count from the Variable drop-down, choose this Between operator, and then enter
10000:50000in the Value box.
Entering Case Sensitive Values
Normally, whenever the variable selected in Section II is of character type then whatever value is entered in the Value text box will be automatically converted to upper case by Dexter and compared to upper-cased values on the dataset. Once in a long while, you may want to override this default behavior and tell Dexter that you want the program to consider case. To do this you must enter a tilde character (~) as the first character in the Value box followed by the actual value. Thus if I select SumType as the variable, Equal to as the Operator and enter
~ras the Value, Dexter will only select rows where the value of the variable SumType is a lowercase "r"; it will NOT select a row if the value if SumType is an uppercase "R". If you omit the tilde from the value then the generated condition is
upcase(sumtype) = upcase("r")
and you would select rows that had a value of either lowercase or uppercase R.
Specifying Blank Values
You may have occasion to want to select using a condition requiring that a character variable has a blank value. Dexter has a special way of recognizing blank values: entering a single underscore (_) will be interpreted as a blank value when the Operator is
Not Equal Toor
In List. When a list of values is being specified (i.e. when the Operator is
In List) then use the single underscore within the list. For example typing
in the Value text box would result in the 3rd list value being a blank. (Note that the length of a blank character field is not important; do NOT try entering multiple underscores.) Also, note that this only pertains to completely blank values, not to blank characters within text fields. You would enter "San Antonio", NOT "San_Antonio".
Logical Operators and Parentheses
Note: This section can easily be skipped by novice and casual users; learning how to use these features will yield a relatively small gain.
Every once in a long while you have a situation where the filtering condition requires more than the typical simple specifications. Something more than just give me the data for this state or this county or this geographic summary level. Tools are provided to permit creating such complex conditions. These are
- The ability to specify logical operators connecting your Variable-Operator-Value conditions. By default the logical operator is
- Checkboxes down the left and right sides allow specifying opening and closing parentheses. These allow you to create groupings of your conditions so that the logical operators that are inside parentheses get applied first.
The only good way to explain this is not to try, but just to give an example:
- Suppose you are accessing a dataset that has data summarizing states and counties for the U.S. The variable SumLev has the standard geographic summary level codes, while the variable TotPop contains the total population for the geographic area. You want to select data for all states that have a population of over 1 million, and for counties with a population over 50,000. On the first line click and type to create the conditionIf you were to simply enter these 4 conditions and leave the relevant logical operators set at the defaultOn the second line create the condition
SumLev Equal To(=) 040
On the third line create the condition
TotPop Greater Than 1000000
and on the fourth line the condition
SumLev Equal To(=) 050
TotPop Greater Than 50000
Andvalues without any parentheses specified, what you would be creating is a logically impossible condition (and hence a failed query where no rows are selected). This is because the value of the SumLev variable cannot be equal to both 040 and 050 (as specified on the 1st and 3rd lines). Clearly, what you have here is not a simple set of 4 necessary conditions, but rather a pair of sufficient conditions, each consisting of 2 necessary conditions. What we would like to do is this:
- Evaluate the conditions on lines 1 and 2 and, if both are true select the observation (we have a large state).
- Evaluate the conditions on lines 3 and 4 and when both of those are true we also want to select these cases (counties of 50,000 or more population).
What we have here are two compound conditions, and we want to select the observation if either of those conditions is met. So we need to use an
Orlogical connector. We do this by checking the small circle next to the word
Orbetween the 2nd and 3rd rows to choose it as the logical operator. To insure that the program first evaluates the compound expression based on the first 2 lines we check the left parentheses box at the left of line 1 and the right parentheses box at the right of line 2. Similarly we use the left paren and right paren boxes on lines 3 and 4, respectively, to insure that these 2 conditions are logically combined (with the And operator) first, before applying the Or operator. The Or operator is then applied to the result of And'ing the conditions on lines 1 and 2 (the first condition for the Or operator) and the result of And'ing the conditions on lines 3 and 4. If either of these compound conditions is true, the row is selected. Translated into SAS code our condition looks like this:
(SumLev=040 And TotPop > 1000000 ) Or ( SumLev=050 and TotPop > 50000 );
Limiting the Number of Rows SelectedFinally you can enter a number into the text box at the end of Section II to specify a maximum number of rows that can be selected. Dexter counts the rows selected and when the count reaches these number it acts as if it had encountered the end of the input dataset. Notice that this limits the number of observations selected, not the number of observations accessed. It only counts those rows that make it through your filter. A good use of this option is to do quick test runs by entering a smallish number in this box so you only select this many result rows. You can then examine the results to see if they appear to be what you had in mind.
Accessing qmeta Metadata ReportsThe form provides a pair of links to the qmeta utility application that will display a listing of each of the variables (either just the ID variables, or all variables on the dataset - there are separate links for these two cases). These reports display some of the key attributes of the specified variables, including their name, label, format, type (character string vs. numeric) and length. In general all (most - there are a few exceptions such as numeric variables named logrecno, key, or obsno) ID variables are character strings (and all character string variables are IDs). The Format column may be one of the most useful. When you see a variable with a custom format code associated with it (such as $state, $county or $cbsa) this means that the value you will see in your extract (other than a SAS dataset) will be the result of converted the stored code to the value label for that code. For example if the variable County is type C, length 5 and has Format code of $county, then if the value stored on the dataset is
01001then the value you will see in your output file(s) will be
Autauga AL. The $county format code "looks up" the value and returns its label. (You can modify the format associated with a variable in Section V-c.)
III. Choose columns (variables)
In Section II you specify what rows you want to keep; here in Section III you specify the columns (variables) you want to keep. It is a much easier process than coding a filter, which usually involves nothing more than clicking (selecting) values from select lists. It can get tedious when working with datasets that have a lot of variables, but it is conceptually simple. The simplest think to do is check the box at the top that says you want all the variables. It may be simple, but it is not usually what you really want. Most users most of the time will want to be selective. Think about what you'll be doing with the results - do you really need or want hundreds (or thousands in some cases) of variables? Remember the 255 column limit if you are intending to take your extract into Excel.
Note that Section III is the only one where you are required to enter something. All the other parts of the form have builtin defaults that apply if you ignore them. But there is no default for your variable selections - you must do something here, even if it just checking the box saying you want all the variables.
We provide two column select lists, identifiers on the left and numerics on the right. Typically, you will nned to choose just a handful of identifiers - be sure to take enough that you will know what your data pertains to. It is the Numerics list that can sometimes be a challenge, depending on how many there are on the dataset. In a few special cases involving mostly complete-table decennial census files, the Numerics list (which allows you to select variables) is replaced with a Tables list. When this is the case you are only allowed to select entire tables - at least using the select lists. This is a trade off. We determined after considerable experience trying to work with some of these over-sized datasets that going with variables organized by table was the way to go. Examples of filetypes that support table selections are sf32000 and sf12000. An example of a filetype that should support such selections but does not (yet - because we have not created the required metadata files for it) is sf42000. These are all 2000 census Summary Files.
Other filetypes, such as our standard extracts based on 2000 SF3 tables, sf32000x, have enough variables (between 100 and 2000 is the magic interval; for fewer than 100 it is not worth the bother, and for over 2000 it takes too long to respond to searches) to justify some special extra tools for making it easier to find and choose what you want. These tools (select list "filtering") are described in the following paragraph.
Filter by regular expression Processing
When you see the
Filter by regular expressionbox below the Numerics select list you have the option of typing something in this text box - something called a regular expression - that will be used to modify what appears on the Numerics select list. For example, if I am processing a typical sf32000x dataset with its over 400 variables I might be looking for creating an extract of variables related to income or to poverty. I can enter the value
poverty | income
and click on the
Filterbutton. Within seconds the Numerics select list will be regenerated and will now only include variables where the specified keywords "poverty" or "income" are present in either the variable name or its label. The vertical bar (|) in the expression is an "or" operator that indicates that you want to select variables that contain either word. You can now make your choices off the much shorter list. If you do not like the results of your filter simply click on the
Clearbuttom to restore the list to its original all-variables-shown state. Click the
Case-sensitvebuttom in the unlikely case that you want your regular expression matching to take into account case in matching strings - so that "Poverty" does NOT match "poverty".
There is a 3rd wayThe last thing you see in Section III, just before the
Extract Databutton, is a text box where you are invited to enter your "own variable list". Most of you will not (and probably should not) ever do this. It is easy to mess up when you type in your names, that is why we have pull-down select lists. But there are cases where this method can constitute a shortcut, especially for users who are familiar with the syntax of SAS variables lists. Entering a value in this box will cause Dexter to ignore any selections you may have made - the list you type is your entire select list; it is not an addition to previously-selected items. (But if you checked the "ALL columns" box that takes precedent and any list you enter here will then be ignored.)
A reason for using this "third way" is that it lets you control the order in which your variables will appear on all your outputs. You need to code an @ symbol as the first character entered here. Normally the order is based on the order of the variables in the source dataset and cannot be modified. So if I were to enter@name address age city state zip
I could be certain that on my output report and/or delimited file the variables would appear in the order I typed them, with name first and zip last, regardless of their order on the source dataset. This can come in handy for custom reports and even for spreadsheets where you want control over the left-to-right order of your columns. You cannot use double-dash intervals (see below) when using the leading @ feature.
Perhaps the most common use of this text box is to allow selecting specific variables from a dataset where you get a Tables select list instead of a Numerics select. Say I were accessing the SF3 complete-tables dataset for all P and H tables for ZIP codes (ZCTAs) in the United States (file uszipsph.sas7bdat in data directory /pub/data/sf32000) and I had studied the detailed metadata which pointed me to a varlabs file that was the best tool for seeing what variables contained what information. Furthermore, let's say I was interested in calculating a poverty rate for children aged 0-17 for every ZIP, and the data cells I needed were in table p87. I could just select the entire table (which has data on poverty by age with a total of 17 data cells, of which only 8 were relevant to the number I wanted.) I would type in the text box something like this:
geocode state p87i3 p87i4 p87i5 p87i6 p87i11 p87i12 p87i13 p87i14
This is fine since we only had to enter the names of 8 data cells. But similar requests in the real world could possibly have many more data cells required. This is where knowing how to use special list notation comes in handy. There are basically three key variable list shortcuts that can be used:
- Single-dash intervals can be used to specify a list of variables that have a common base name with numeric suffixes. For example the list
p87i3 p87i4 p87i5 p87i6(used in our last example) can be written as
- Double-dash intervals can be used to specify a list of variables based on their relative position in the (SAS) dataset. We are very careful when creating datasets for the MCDC archive to insure that variables are placed in a logical order. Identifiers usually come first, followed by the numerics. When numerics are part of tables they are stored in ascending table number order and the variables are in the expected order within the table: if table p87 has 17 cells they will be named p87i1 thru p87i17 and they will be stored in that order. What this allows you to do is pick long "data intervals", often comprised of consecutive tables. If I wanted to select all the table cells from table p87 through p93 (all tables that deal with poverty - this is sf3, 2000 census - filetype sf32000) I could simply include in my variable list
and all of the variables corresponding to the 7 tables would be specified (kept on output).
- Colon modifier lists are used to specify all variables with names sharing a common root name. This notation is very handy when you want to keep all the variables from a set of tables on one of the MCDC's detailed summary file datasets. If I type
p12i:in my list I am saying I want all variables with names starting with the characters "p12i". Our table-variable naming convention guarantees that this is exactly the set of variables comprising table p12. The variable names are p12i1, p12i2, ..., p12i31. The two lists
p12i1-p12i31yield the same result, but the former is easier to code because you don't have to look up the number of cells in the table as you would to code a single-dash interval.
IV. Other Options, Non-essential
This short section consists of a set of text boxes where you can specify some useful options. The first 3 pertain only to the Listing/Report output type. The sort option applies to all outputs. Special characters, including quotes (apostrophes or double quote symbols), percent sybmols and ampersands, will be removed from any values entered for titles or footnotes in this section.
- Title is used to enter text that you would like to have appear as the title at the top of your report page(s).
- Subtitle is used to enter text for a sub-title for your report. It will appear below the main title in a slightly smaller font.
- Footnote is text that will appear as a footnote at the bottom of your report page(s). There is a default footnote that specifies that the report was generated by Dexter and specifies the date.
- Sort lets you specify how you want your output sorted. By default the output inherits the sort order of the dataset from which you are extracting. You enter variable names here in major to minor sort order. To indicate sorting in descending order on a variable precede its name with a minus sign. For example entering a value ofwould result in output(s) sorted first by State; within state by county; and within county by descending values of pctpoor. If you were to insert a minus sign just before
state county -pctpoor
statein the list it would result in the output sorted by descending state (so that Wyoming might appear first, with Alabama at the end, reversing the normal ascending order).
All variables used for the sort options must be selected for keeping on the output in Section III; you cannot sort on a variable that is not present on the dataset. If you are doing aggregation of the data (Section V-a) then the dataset must be sorted by the aggby variables specified there.
V. Advanced OptionsNote: this section is not intended for the casual user.
Part a: Aggregating the DataYou may find occasion where the data stored in one of the archive datasets is more detailed than what you really need. It may have too many rows; you would prefer to collapse across rows to get numeric data summarized by aggregating (summing) the variables in rows that share a set of common identifier, or aggby variables. For example, let's say you are accessing one of the recent population estimates datasets where each row / observation had data for a county. Let's say that the dataset contains a variable named CBSA that has the code for the current (core based) statistical area (Metroplolitan or Micropolitan Statistical Area) associated with the county. Let's further suppose that what you really want is not the county populations but instead the CBSA figures. So you want to aggregate the data by CBSA, or perhaps by CBSA-within-state, so that you could get the Kansas portion of the KC MSA and the Missouri portion. Here is how you could do that:
- In the text box labeled Aggby you enter the names of the variables by which you want to aggregate. For example, you could enter
State CBSA. That would tell the program to take the initial extract, sort it by CBSA within state and call an aggregation utility that will produce summary rows by aggregating numeric variables. For now, this will only work for character type variables; numeric variables cannot be used in an aggby list. (For the most part, character variables are those that appear in the Identifiers lists (as opposed to the Numerics lists); use the Qmeta report to get the variable type for absolute certain, from the
Typecolumn.) This is usually not a problem, since we are very careful to make sure that category variables on our datasets are character type rather than numeric.
- In the text box labeled Agg Level you tell the program whether you want to produce aggregates for more than 1 level of the by variable(s). Normally (the default) you get just a single level. In our example where we specified 2 variables for AGGBY, the program would produce an output row for each combination of the State and CBSA variables. If you were to enter a value of 2 for Agg Level (the only other valid value, given there are only 2 aggby variables) then the program produce summaries at 2 levels:
- for the portions of CBSA's within states, and
- for states, regardless of CBSA (called a "level 2" summary.
Note that the order of the variables in the aggby list is significant here. If you had typed in
CBSA State(with CBSA entered first instead of second) then the summaries produced would be for State-within-CBSA (a "level 1" summary) and for CBSA's (regardless of State). That's an important difference.
- The select list labeled Grand Totals? lets you specify whether you would like to get data aggregated across all the selected rows of the dataset, an extra summary row added to the end of your output. Default value is NO, but the other option is Yes which must be selected in order to get the extra grand total data. (We implement this as a select list rather than a checkbox to allow future expansion where we may allow a 3rd option that says you want a grand total only.)
- The next 2 input boxes are a matched set and should either be left blank or both filled in. You need to provide a list of variables (in the Means or Percents box) that you are aggregating that are not the kind of values that can be simply summed up. Examples of such variables are Per Capita Income (a mean) or PctAsian (a percentage). The utility module that Dexter uses to do the aggregation is able to handle such variables by taking a weighted average of their values within a by group rather than an arithmetic sum. Each variable that requires special processing must be matched up with another variable on the dataset that will serve as the value to be used to weight the variable. A variable measuring per capita income needs to be weighted by a variable containing the total population. A variable such as PctAsian (percentage of the population that is Asian) would similarly need to be weighted by a variable measuring the total population; in general a percentage variable needs to be weighted by the variable that is the denominator for that percentage calculation. A variable such as PctPoor (on our sf32000x datsets, for example) needs to be weighted not by the total population, but rather by the PovUniv (poverty universe) variable, which is used to keep count of the persons for whom poverty status was assigned. When specifying these parameters you should either have the same number of variables in each of the two lists or, as a special case, you can have just a single variable in the Weights list provided that that sincle variable can be used as the weight for ALL the variables specfified in the Means or Percents list. Note that the variables in the first list (Means or Percents) should all be distinct (no value repeated) while in the weights list it is very common for a variable to be repeated more than once.
- Variables to drop is not something most users will need to worry about. The module that does the aggregation creates two special variables named _lvl_ and _nag_; the former keeps track of something called the summary level, and the latter keeps track of the number of rows that were aggregated to form each sum (output row). If you specified
aggby=State CBSA Countyas the aggby parameter and
Agg Level=3then on your output files the variable _lvl_ would take on a value of 1 for County summaries (within State and CBSA), and would take on the value 2 for CBSA (within State) summaries and a value of 3 for State summaries. When requesting multiple levels you might want to keep the _lvl_ variable to make it easier to select subsets and understand your results. The _nag_ variable could also be useful; in our example where we were aggregating county data to get summaries for CBSA's, for example, the value of _nag_ would tell you how many counties there were within the CBSA.
- The Drop redundant higher level summaries is a rather obscure option but can occasionally come in handy. An example of how it might be used is when doing an aggregation by CBSA and state with Agg Level=2. In many cases a CBSA is in only a single state, so that the summary for the CBSA within the state and the summary for the complete CBSA would yield the same results. You can use this checkbox option to tell the aggregation program to check for such "redundant summaries" and not output them. (Only the one with the lowest _lvl_ value would be generated.)
Part b: Advanced Report Formatting OptionsThese options only make a difference if you chose something other than None in Part I from the Listing/Report format options on the 2nd line. These options have to do with various cosmetic and formatting features that only pertain to data in a report format.
- The Use variable labels as column headers in reports check box could easily be moved to a different section since it does not exactly fit as an advanced feature. It is really a fairly frequently preferred option. The variables in our SAS datasets have names which up until a few years ago (SAS Version 7) were limited to 8 characters and are even now limited to 32 characters, none of which can be blanks or other special characters. But variables can have labels associated with them which can now be up to 256 characters. In many if not most cases, users may find that displaying these labels in the column headers will make their reports more readable than using the variable names. The only down side to using the labels in most cases is that they may require additional horizontal spacing which may force the report to be displayed in multiple parts, instead of having the value for all the variables on a single report line.
- The By variables for report text box allows you to specify 1 or more variables that will be used as By variables in the report. In most cases this means that a special By line will be generated each time the value of (any of the) by variables change. For example if you specify a value of State for this parameter then the program will usually generate a line that will look something like
when it first encounters the value "Alabama" for the variable State. The extract dataset must be sorted in order for this to work, which can be taken care of using the sort text box at the end of Section IV. (You can always append the keyword notsorted after a variable name in this list, in which case the dataset need not be sorted by that variable. But presumably the data would be grouped by it; otherwise, it is not going to look very good to have a large number of by lines in your report. Normally, of course, all variables that are kept on your output are displayed as columns in the report; that is not the case with variables specified as report by variables.
- The ID variables for report text box allows you to specify 1 or more variables that will be used as identifier columns in the report. Identifier columns appear as the leftmost columns of the report and, when there are too many columns to display across a single line, the ID columns are repeated for each tableau or sets of columns which comprise the report. With html and pdf output, special style attributes apply to these columns, with the current default causing them to display with a blue background. Be careful to spell the names correctly and not to use any variable that you did not specify to keep in Section III. You can have multiple ID variables; just separate their names with blanks.
- The Orientation of print output is rarely used and only makes a difference in certain cases. Override the default value (portrait) when you are asking for pdf output where doing landscape will allow fitting all your variables across one line.
- The Style to use for html or pdf output option lets you have some fun with how your report output looks. The default value, sasweb, is our recommended choice but people have different tastes. The values followed by two asterisks (**) are ones we think are the better choices, but you might want to try them all. The best way to see what each one looks like is to define a small query (less than 10 variables, with a maximum observations to select value of 100 or so and html output format selected) and just keep rerunning it with different choices for Style. The names (except for mcdcdefault) are from the software vendor (SAS Institute) and are not very mnemonic. Keep in mind that if you intend to get hard copy of your report that you probably do not want a shaded background if you are using a laserjet printer and don't want to use up too much toner.
Part c: SAS Variable AttributesThe general rule is that you don't have to know SAS to use Dexter. But this is one section where, while not required, knowing the language could come in handy. Here we allow you to type in the body of 3 different SAS statements that are used to modify attributes (names, labels and formats) of any of the variables you have selected. The syntax of these statements is simple enough that even if you have never coded a SAS statement in your life, and hope to never have to, you should not have too much difficulty using these statements.
- The Format statement allows you to control to some extent how the values for your variables are displayed. The most common use is to control the display of digits after the decimal point in numeric variables, or perhaps to use comma formats to have numeric variables with large values display with commas (i.e.
1234567). The value you enter needs to be in the format
variable-name(s) format-code [repeat]. For example:
causes the value of the variables totpop and over65 to display in comma9 format, (x,xxx,xxx) ; and the value of the variable pctpoor to display in 5.1 format, which is xxx.x (it will be rounded to the 1 decimal place). Note well the use of periods in the format codes - the syntax requires these (they allow the program to distinguish variable names from format codes). It is beyond the scope of this document to go into all the possible format codes available in SAS, but 99% of the time all users will be interested in is controlling the insertion or omission of commas and/or the number of decimal digits. Some common format codes that can be used include:
totpop over65 comma9. pctpoor 5.1
7.0- variable displays with 7 digits, no commas and no decimal point, i.e. xxxxxxx. Leading blanks will not be retained on delimited output files. (You might specify this format if the variable had a comma. format associated with it and you wanted to override that so that the commas did not occur in your csv file. You could also use it to round off values with decimal portions to integer values.)
dollar10.This could be used for a variable that is a dollar amount and would cause the value to display in $x,xxx,xxx format. If there was a decimal portion you could use the code
dollar12.2which would display in $x,xxx,xxx.xx format.
$5.This format should be used only with character type variables (to see what variables are of type character vs. numeric you should use the View qmeta Metadata report links at the end of Section II; for the most part, variables appearing in the ID list will be character type, while all variables in the numeric list will be of type numeric.) The $5. format code indicates to simply display the value as 5 characters with leading blanks removed (use the $char5. format if you want to preserve leading blanks). The most common reason for using such a code would be for a variable that had a custom format code associated with it, which you wanted to override. For example, on many of our datasets you will find the character variable county which contains a 5-character FIPS county code, and which has the custom format code $county. associated with it. This custom format code is used to translate the codes to the name - e.g. "01001" displays as "Autauga AL". If you wanted to see the code instead of the name you could do so by specifying the $5. format for the county variable.
$state.If there is a variable that contains a 2-character FIPS state code you can specify the custom $state. format code to cause the value displayed to be the name associated with the code, i.e. instead of "29" you would get "Missouri". Similarly, as discussed above, you could specify a format code of
$county.with a variable that contained a 5-character FIPS county code in order to have it display the name of the county rather than the code.
- The Rename statement allows you to change the name of any of your variables. The syntax requires repeated specifications of the form
current-name=new-name. For example:
fipco=County Areaname=Area_Summarizedwould result in the variable fipco being renamed to County, and the variable Areaname being renamed Area_Summarized. (These names can occur in header rows of Excel spreadsheets created from Dexter-generated csv files, or as column headers in reports.) When assigning new values you need to keep in mind these rules for SAS variable names:
- Names can be at most 32 characters long and comprised only of letters of the alphabet (upper or lower case), numeric digits or the underscore character (_).
- The first character of a name can not be a digit.
- You cannot rename a variable to a name that is already in use on the dataset.
The new names will be used on your output file(s), but you need to use the old names when referencing variables on the Dexter query form. For example, if you should elect to specify your variables-to-keep using the text box at the end of Section III of the form you need to enter the name as it is on the input data set, not the new name specified for that variable here.
- The Label statement allows you to modify the label associated with any of your variables. In many cases there may be no label associated with the variable, in which case the name of the variable appears where the label would go. Label values appear in spreadsheet header rows and can appear as column headers if you specify the Use variable labels as column headers in reports option (in part b. of this Advanced Options section). The syntax requires repeated specifications of the form
variable-name='label-text'. Note that the label has to be enclosed in single quotes. For example:
MetroDiv='Metropolitan Division' popjl13='Current Estimate'
Part d: Transpose OutputUse this section to specify that you want to restructure the output data so that what would normally be rows become columns and what would normally be columns become rows. In the classic (most common) case of a geographic summary dataset you would normally have geographic areas as the rows, with characteristics of those geographic areas going across as columns. For example, if you extract two sf3 summary tables for all the counties in your state you would get one row per county with all the table variables (cells) appearing as columns. But maybe you would prefer to have the variables (table cells) represented as rows in your spreadsheet with attribute values for each county appearing in its own column. You can accomplish this with the options in this section.
For this to work you have to start by checking the box to indicate that the output is to be transposed.
The by variables option is rarely used. It requires that the data being extracted have the same set of rows repeated for the values of these by variables. Maybe if the dataset contained time series data that repeated a cycle for several values of a variable Year, you could specify year as a by variable for the transposition. It would then remain a column instead of a row, and you get a new row for each variable, one for each value of year.
The Name variables on output ... line lets you specify what the names of the transposed variables will be. You do this by specifying an alpha prefix and then specifying a variable in the dataset whose value will be appended to that alpha prefix to form the name. For example, if you were transposing a dataset with county level summaries you might specify a value of "c" as the prefix and the variable fipco for the 2nd option. Then if an observation on the dataset had a value of 01001 for the variable fipco then all the values from that row would be transposed into a column and that column would be named c01001. Note that the resulting name must be unique - there can be only one observation on the extracted dataset with a given value of fipco.
The Assign labels to the output ... line lets you specify the name of a variable in the dataset whose value will become the label of the variables after they are transposed. A frequently used variable for this function is Areaname, i.e. it is a character variable the value of which describes (identifies) the (old) row. The variable label appears as the second row (line) of a csv output file, with the variable name appearing in the first row. In a report output the column headers are by default the variable names, but you can override this by checking the option in section V-b (Advanced Report Formatting) saying to use variable labels as column headers in reports.
Transposing is a tricky feature. As a practical matter it requires that you have a set of numeric data items with similar characteristics and just one or two ID variables which are used to name and/or label the resulting columns.
Part e: Coordinates of specified point (distance calculations and filtering)
This section can be used to provide parameters used to select observations based upon distances from a specified point (calculated by Dexter). This feature only works provided that
- The data set's observations represent geographic areas for which centroid/internal point latitude-longitude coordinates are "available". They could also be "point observations" containing the latitude-longitude coordinates of a specific location, such as you find on a geocoded address file.
- By "available" we mean that
- they are contained as variables named intptlat and intptlon in the data set, which must be selected in Section III of the form. -OR-
- Such coordinates can be "looked up" by means of a geoid key variable or sumlev/geocode variable combination in the current data set (in which case these geographic key variables must be selected in Section III). Common geographic units such as state, county, place, ZIP and PUMA should work, with more to come.
- In Section II. Choose Rows you must have a condition that references the pseudo-variable _distance. Typically, you will specify a condition such as
_distance Less Than 100
which would indicate that you only want to choose geographic areas where the "distance" is under 100 miles. What distance? That is what the following parameters are used to specify.
The two text boxes labeled Latitude and Longitude can be used to enter the coordinates of a location that will be used to calculate the _distance values. These values should be entered as decimal fractions, with as many digits after the decimal point as you think are needed. A typical value for the Latitude would be 38.9898, and for Longitude -92.313400 . The leading minus sign on the longitude coordinate is optional and will be assumed if not entered. (The program only works for locations in the western hemisphere. ) The trailing 0's on the Longitude value here make no difference and could be left off.
You can also enter the coordinates in degree-minutes-seconds notation rather than as decimal fractions. To do so you must use dd.mm.ss format, where dd is the number of whole degrees, mm is the number of minutes and ss the number of seconds. Both mm and ss should include leading 0s if necessary. For example 38.40.04 would represent 38 degrees, 40 minutes and 4 seconds north latitude. A value of 41.40.64 would be invalid; the value of MM and SS must be between 00 and 59. The program will recognize that you are using this format by checking for and seeing two decimal points (periods) in the entered value.
Alternatively, you can enter a 5-digit ZCTA/ZIP code in the Latitude box. The program will recognize a 5-character string without a decimal point as a ZIP value. It will then do a lookup of the internal point coordinates for that ZIP code in a reference data set, and will use those values to do _distance calculations.
There are 2 check-box options that follow. The first can be checked in oder to specify that you want to keep the calculated _distance variable as well as the intptlat and intptlon coordinate variables which may have been looked up. The second checkbox lets you decide how you want to handle cases where the coordinates of the current observation cannot be determined. Do you want to keep those observations or filter them out? The default is to filter them out, but you can easily override that by checking this box.
Additional Options at the Bottom of the FormThese options (following the final
Reset Defaultsbuttons) are intended for use by the persons responsible for developing and debugging the software. Please ignore this section. This does not include the very bottom section starting with
Questions and comments .... We strongly encourage you to take advantage of that section.
This file last modified Wednesday January 27, 2016, 14:38:56