!!!!This page has been copied, pasted and edited to become part of the dexterOnlineHelp.shtml page. This page is
obsolete and should not be edited ===============================================================
Dexter Advanced Options Help
Dexter Advanced Options Help Page
Part a: Aggregating the Data
You may find occasion where the data stored in one of the archive datasets is really more detailed than what you really need. It may have too many rows; you would prefer to collapse across rows to get numeric data
summarized by aggregating (summing) the variables in rows that share a set of common identifier, or aggby variables. For example, let's say you are accessing one of the recent population estimates
datasets where each row / observation had data for a county. Let's say that the dataset contains a variable named CBSA that has the code for the current (core based) statistical area (Metroplolitan or
Micropolitan Statistical Area) associated with the county. Let's further suppose that what you really want is not the county populations but instead the CBSA figures. So you want to aggregate the data by CBSA, or perhaps by CBSA-within-state, so that you could get the Kansas portion of the KC MSA and the Missouri portion. Here is how you could do that:
- In the text box labeled Aggby you enter the names of the variables by which you want to aggregate. For example, you could enter
State CBSA. That would tell the program to take the initial extract, then sort it by CBSA within state and call an aggregation utility that will produce summary rows by aggregating numeric variables. For now, this will only work for character type variables; numeric variables cannot be used in an aggby list. This is not usually a problem, since we are pretty careful to make sure that most category variables on our datasets are character type rather than numeric.
- In the text box labeled Agg Level you tell the program whether you want to produce aggregates for more than 1 level of the by variable(s). Normally (the default) you just a single level. In our example where we specified 2 variables for AGGBY, the program would produce an output row for each single combination of the State and CBSA variables. If you were to enter a value of 2 for Agg Level (the only other valid value given there are only 2 aggby variables) then the program produce summaries at 2 levels
- for the portions of CBSA's within states, and
- for states, regardless of CBSA (called a "level 2" summary.
Note that the order of the variables in the aggby list is significant here. If you had typed in
CBSA State (with CBSA entered first instead of second) then the summaries produced would be for State-within-CBSA (a "level 1" summary) and for CBSA's (regardless of State). That's a very important difference.
- The select list labeled Grand Totals? lets you specify whether you would like to get data aggregated across all the rows of the dataset (filtered), an extra summary row added to the end of your output. Default value is NO, but the other option is Yes which must be selected in order to get the extra grand total data.
- The next 2 input boxes are a matched set and should either be left blank or both filled in. You need to provide a list of variables (in the Means or Percents box) that you are aggregating that are not the kind of values that can be simply summed up. Examples of such variables are Per Capita Income (a mean) or PctAsian (a percentage). The utility module that Dexter uses to do the aggregation is able to handle such variables by taking a weighted average of their values within a by group rather than an arithmetic sum. Each variable that requires special processing must be matched up with another variable on the dataset that will serve as the value to be used to weight the variable. A variable measuring per capita income needs to be weighted by a variable containing the total population. A variable such as PctAsian (percentage of the population that is Asian) would similarly need to be weighted by a variable measuring the total population; in general a percentage variable needs to be weighted by the variable that is the denominator for that percentage calculation. A variable such as PctPoor (on our sf32000x datsets, for example) needs to be weighted not by the total population, but rather by the PovUniv (poverty universe) variable, which is used to keep count of the persons for which poverty status was assigned. When specifying these parameters you should either have the same number of variables in each of the two lists or, as a special case, you can have just a single variable in the Weights list provided that that sincle variable can be used as the weight for ALL the variables specfified in the Means or Percents list. Note that the variables in the first list (Means or Percents) should all be distinct (no value repeated) while in the weights list it is very common for a variable to be repeated more than once.
- Variables to drop is not something most users will need to worry about. The module that does the aggregation creates two special variables named _lvl_ and _nag_; the former keeps track of something called the summary level, and the latter keeps track of the number of rows that were aggregated to form each sum. If you specified
aggby=State CBSA County as the aggby parameter and Agg Level=3 then on your output files the variable _lvl_ would take on a value of 1 for County summaries (within State and CBSA), and would take on the value 2 for CBSA (within State) summaries and a value of 3 for State summaries. When requesting multiple levels you might want to keep the _lvl_ variable to make it easier to subset and understand your results. The _nag_ variable could also be useful; in our example where we were aggregating county data to get summaries for CBSA's, for example, the value of _nag_ would tell you how many counties there were within the CBSA.
- The Drop redundant higher level summaries is a rather obscure option but can occasionally come in handy. An example of how it might be used is when doing an aggregation by CBSA and state with Agg Level=2. In many cases a CBSA is in only a single state, so that the summary for the CBSA within the state and the summary for the complete CBSA would yield the same results. You can use this checkbox option to tell the aggregation program to check for such "redundant summaries" and not output them. (Only the one with the lowest _lvl_ value would be generated.)
Part b: Advanced Report Formatting Options
These options only make a difference if you chose something other than None in Part I from the Listing/Report format options on the 2nd line. These options have to do with various cosmetic and formatting features that only pertain to data in a report format.
- The Use variable labels as column headers in reports check box could easily be moved to a different section since it does not exactly fit as an advanced feature. It is really a fairly frequently preferred option. The variables in our SAS datasets have names which up until a few years ago (SAS Version 7) were limited to 8 characters and are even now limited to 32 characters, none of which can be blanks or other special characters. But variables can have labels associated with them which can now be up to 256 characters. In many if not most cases, users may find that displaying these labels in the column headers will make their reports more readable than using the variable names. The only down side to using the labels in most cases is that they may require additional horizontal spacing which may force the report to be displayed in multiple parts, instead of having the value for all the variables on a single report line.
- The By variables for report text box allows you to specify 1 or more variables that will be used as By variables in the report. In most cases this means that a special By line will be generated each time the value of (any of the) by variables change. For example if you specify a value of State for this parameter then the program will usually generate a line that will look something like
State=Alabama
when it first encounters the value "Alabama" for the variable State. The extract dataset must be sorted in order for this to work, which can be taken care of using the sort text box at the end of Section IV. (You can always append the keyword notsorted after a variable name in this list, in which case the dataset need not be sorted by that variable. But presumably the data would be grouped by it; otherwise, it is not going to look very good to have a large number of by lines in your report. Normally, of course, all variables that are kept on your output are displayed as columns in the report; that is not the case with variables specified as report by variables.
- The ID variables for report text box allows you to specify 1 or more variables that will be used as identifier columns in the report. Identifier columns appear as the leftmost columns of the report and, when there are too many columns to display across a single line, the ID columns are repeated for each tableau or sets of columns which comprise the report. With html and pdf output, special style attributes apply to these columns, with the current default causing them to display with a blue background. Be careful to spell the names correctly and not to use any variable that you did not specify to keep in Section III. You can have multiple ID variables; just separate their names with blanks.
- The Orientation of print output is rarely used and only makes a difference in certain cases. Override the default value (portrait) when you are asking for pdf output where doing landscape will allow fitting all your variables across one line.
- The Style to use for html or pdf output option lets you have some fun with how your report output looks. The default value, sasweb, is our recommended choice but people have different tastes. The values followed by two asterisks (**) are ones we think are the better choices, but you might want to try them all.
The best way to see what each one looks like is to define a small query (less than 10 variables, with a maximum observations to select value of 100 or so and html output format selected) and just keep rerunning it with different choices for Style. The names (except for mcdcdefault) are from the software vendor (SAS Institute) and are not very mnemonic. Keep in mind that if you intend to get hard copy of your report that you probably do not want a shaded background if you are using a laserject printer and don't want to use up too much toner.
Part c: SAS Variable Attributes
The general rule is that you don't have to know SAS to use Dexter. But this is one section where knowing the
language could come in handy. Here we allow you to type in the body of 3 different SAS statements
that are used to modify attributes (names, labels and formats) of any of the variables you have selected. The syntax of these statements is simple enough that even if you have never coded a SAS statement and hope to never have to you should not have too much difficulty using these statements.
- The Format statement allows you to control to some extent how the values for your variables are displayed. The most common use is to control the display of digits after the
decimal point in numeric variables, or perhaps to use comma formats to have variables with large values display with commas (i.e.
1,234,567 instead of 1234567). The value you enter needs to be in the format variable-name(s) format-code [repeat] . For example:
totpop over65 comma9. pctpoor 5.1
causes the value of the variables totpop and over65 to display in comma9 format, which is x,xxx,xxx ; and the value of the variable pctpoor to display in 5.1 format, which is xxx.x (it will be rounded to the 1 decimal place). Notice the use of periods in the format codes - the syntax requires these (they allow the program to distinguish variable names from format codes). It is beyond the scope of this document to go into all the possible format codes available in SAS, but 99% of the time all users will be interested in is controlling the insertion or omission of commas and/or the number of decimal digits. Some common format codes that can be used include:
-
7.0 - variable displays with 7 digits, no commas and no decimal point. Leading blanks will not be retained on delimited output files. (You might specify this format if the variable had a comma. format associated with it and you wanted to override that so that the commas did not occur in your csv file.)
dollar10. This could be used for a variable that is a dollar abount and would cause the
value to display in $x,xxx,xxx format. If there was a decimal portion you could use the code dollar12.2 which would display in $x,xxx,xxx.xx format.
$5. This format should be used only with character type variables (to see what variables are of type character vs. numeric you should use the View qmeta Metadata report links at the end of Section II; for the most part, variables appearing in the ID list will be character type, while all variables in the numeric list will be of type numeric.) The $5. format code indicates to simply display the value as 5 characters with leading blanks removed (use the $char5. format if you want to preserve leading blanks). The most common reason for using such a code would be for a variable that had a custom format code associated with it, which you wanted to disable. For example, on many of our datasets you will find the character variable
county which contains a 5-character FIPS county code, and which has the custom format code $county. associated with it. This custom format code is used to translate the codes to the name - e.g. "01001" displays as "Autauga AL". If you wanted to see the code instead of the name you could do so by specifying the $5. format for the county variable.
$state. If there is a variable that contains a 2-character FIPS state code you can specify the custom $state. format code to cause the value displayed to be the name associated with the code, i.e. instead of "29" you would get "Missouri". Similarly, as discussed above, you could specify a format code of $county. with a variable that contained a 5-character FIPS county code in order to have it display the name of the county rather than the code.
- The Rename statement allows you to change the name of any of your variables. The syntax requires repeated specifications of the form
current-name=new-name. For example: fipco=County Areaname=Area_Summarized would result in the variable fipco being renamed to County, and the variable Areaname being renamed Area_Summarized. (These names can occur in header rows of Excel spreadsheets created from Dexter-generated csv files, or as column headers in reports.) When assigned new names you need to keep in mind these rules:
- Names can be at most 32 characters long and comprised only of letters of the alphabet (upper or lower case), numeric digits or the underscore character (_).
- The first character of a name cannot be a digit.
- You cannot rename a variable to a name that is already in use on the dataset.
- The Label statement allows you to modify the label associated with any of your variables. In many cases there may be no label associated with the variable, in which case the name of the variable appears where the label would go. Label values appear in spreadsheet header rows and can appear as column headers if you specify the Use variable labels as column headers in reports option (in part b. of this Advanced Options section). The syntax requires repeated specifications of the form
variable-name='label-text'. Note that the label has to be enclosed in single quotes. For example:
MetroDiv='Metropolitan Division' popjl04='Current Estimate'
Questions and comments regarding this help page or Dexter in general should be addressed to
John Blodgett
MCDC Home ||
OSEDA Home ||
Dexter tutorial: ppt format |
html format ||
Uexplore Home