Data Ranking Utility Application Online Help Page

Application Overview

This application works with a dataset containing numeric data items ("variables") that measure meaningful aspects of a set of entities (usually geographic areas, but they could be time periods, SIC categories, etc.). These entities are the "rows" of the dataset. The application uses this front-end form page to let the user specify which variables are of interest, what criteria are to be applied, how many rows are to be kept, and what type of output is desired. The application can create output in the form of files (csv files or SAS datasets) and/or custom reports (in html or pdf format).

I. Choose Variables for Output File(s) and/or Report

This section is where you choose the variables of interest. There are two variable selection lists (menus), each of which allows you to make multiple selections. The list on the left contains only numeric variables; here you will not see any geographic code variables, or names, or time periods, or other variables that are not numerically significant. Each variable that you choose from this list will be used to create a ranking of that variable within the dataset. Output files (csv files or SAS datasets can be specified) will contain both the values of these variables from the input dataset and a corresponding rank variable/column. If an output report is requested, then variables that you want to appear in the report as Rank variables (see form section III) must be selected here.

The list on the right contains all the variables that are on the input dataset. All non-numeric "identifier" variables appear first (the notation "** End ID variables**" appears next to the last identifier variable), followed by the same numeric variables that appear in the left-side list. From this list you select all variables that you want kept in the output files and/or reports other than the rank variables that you already chose from the left list. You should avoid choosing the same variable from both lists (which may lead to extra columns in output reports, with ambiguous labels).

II. Ranking and Filtering Options

In this section the user specifies the details regarding how they want the ranking to be done, and how they want to limit the data that gets considered for ranking and/or kept on the output files/report ("filtering").

The first line here lets you choose the ascending vs. descending option when ranking. The default choice is to assign a rank of 1 for the value with the largest numeric value, a rank of 2 for the second largest, etc. If you would prefer that a rank of 1 indicate the smallest numeric value then you should click on the "Rank in ascending order" button. (These are radio buttons, so that choosing one automatically unchooses the other.)

Next come 3 ranking options which apply only if you chose a single ranking variable (from the left menu list in Section I of the form). The first of these options allows you to limit your output to the top- or bottom-ranking cases. Entering a value of 10 in this box says that you are only interested in keeping cases where the rank is in the top 10. This is a handy option for generating "Top 10" (or "Top 50" or "Top 100", etc.) reports. In addition to limiting output based on rank, the next box lets you specify a value above or below which a value must fall in order to be included in the output. For example, you could enter 10000 in this box in conjunction with a value of 25 in the previous box; assuming the default descending order remains specified, this would say that you want to see the 25 top-ranking cases, but only if they have values of at least 10000. If the 12th highest value was 10200 and the 13th highest was 9700, then you would only get the top 12 cases on your output(s).
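The interaction of the two boxes in the example above can be sketched in a few lines of Python. This is only an illustration of the rule as described on this page, not Rankster's actual code; the function name, the `Area` and `Value` column names and the sample data are all hypothetical.

```python
def top_n_with_threshold(rows, key, n, min_value):
    """Keep the n highest-valued rows, then drop any whose value is below min_value."""
    # Default (descending) ranking: rank 1 goes to the largest value.
    ranked = sorted(rows, key=lambda r: r[key], reverse=True)
    return [r for r in ranked[:n] if r[key] >= min_value]

# Hypothetical data: 30 areas whose 12th-highest value is 10200 and 13th is 9700.
values = ([20000 - 500 * i for i in range(11)]
          + [10200, 9700]
          + [9000 - 100 * i for i in range(17)])
rows = [{"Area": f"Area{i + 1}", "Value": v} for i, v in enumerate(values)]

kept = top_n_with_threshold(rows, "Value", n=25, min_value=10000)
print(len(kept))  # 12 cases survive, even though 25 were requested
```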

The last item in this subsection is a checkbox that lets you indicate that you are interested in both ends of the spectrum, both high- and low-ranking cases. So if you were working on a dataset with Missouri counties and you entered 10 in the box to limit output to the top/bottom cases and you checked this option, then you would get output where the rankings were 1 to 10 and 106 to 115 (the 10 lowest ranked values of the 115 Missouri counties). Note that this option cannot be specified if "by groups" (the next option) are specified.
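The arithmetic behind the Missouri example is easy to check. The helper below is a hypothetical sketch (not part of Rankster) that returns the ranks kept at each end, assuming the two ranges do not overlap.

```python
def both_ends(total_cases, n):
    """Return the ranks kept at the top and at the bottom of the ranking."""
    top = list(range(1, n + 1))
    bottom = list(range(total_cases - n + 1, total_cases + 1))
    return top, bottom

top, bottom = both_ends(total_cases=115, n=10)
print(top[0], top[-1])        # 1 10
print(bottom[0], bottom[-1])  # 106 115
```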

You might be working with a dataset that has data for counties within a state for a series of years. You might want to look at a different set of rankings for each year. Or, you might prefer to group them by county and then rank them over time. Assuming the variables identifying the year and the county were Year and County, you would enter the appropriate variable name in this text box. Entering Year (case does not matter here, so you could also type year) would indicate a different set of ranks being assigned for each set of rows grouped within years. Entering County in the box would produce a separate set of ranks (across years) for each county.
The check box can be used to tell the application that the input data are known to already be sorted by the variable(s) specified just above. You should only check this if you know this to be the case. It really makes very little difference unless the dataset is very large, in which case it can save processing time.
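To make the "by groups" idea concrete, here is a minimal Python sketch of ranking within groups. It is an illustration under stated assumptions rather than Rankster's implementation: the sample rows, the `Pop` variable and the `_rank` suffix for the rank column are hypothetical, ranking is descending, and ties are ignored for brevity.

```python
from collections import defaultdict

def rank_within_groups(rows, by, var):
    """Assign descending ranks to var separately within each value of the by variable."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    ranked = []
    for members in groups.values():
        ordered = sorted(members, key=lambda r: r[var], reverse=True)
        for position, row in enumerate(ordered, start=1):
            ranked.append({**row, var + "_rank": position})
    return ranked

rows = [
    {"Year": 2001, "County": "Adair", "Pop": 100},
    {"Year": 2001, "County": "Boone", "Pop": 300},
    {"Year": 2002, "County": "Adair", "Pop": 400},
    {"Year": 2002, "County": "Boone", "Pop": 200},
]

by_year = rank_within_groups(rows, by="Year", var="Pop")      # counties ranked within each year
by_county = rank_within_groups(rows, by="County", var="Pop")  # years ranked within each county
```

With `by="Year"`, Boone is ranked 1 in 2001 and Adair is ranked 1 in 2002. With `by="County"`, each county's two years are ranked against each other.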

You can enter a numeric value to indicate that you want group ranks rather than ordinary ranks. The classic example of a group rank is a percentile. If you enter a value of 100 in this box then you will get percentile rankings.
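This page does not say exactly how group ranks are computed. As a rough sketch, one documented convention (the one SAS PROC RANK uses for its GROUPS= option) assigns group floor(rank*k/(n+1)), where k is the number of groups and n the number of cases, giving group values from 0 to k-1. Whether Rankster follows this convention is an assumption; the function below is hypothetical, ranks in ascending order and ignores ties.

```python
def group_ranks(values, groups):
    """Assign each value a group number 0..groups-1 using floor(rank * groups / (n + 1))."""
    n = len(values)
    # Positions of the values in ascending order; ties are not treated specially here.
    order = sorted(range(n), key=lambda i: values[i])
    result = [0] * n
    for rank, i in enumerate(order, start=1):
        result[i] = (rank * groups) // (n + 1)
    return result

# Quartiles (groups=4) for eight hypothetical values; groups=100 would give percentiles.
print(group_ranks([10, 20, 30, 40, 50, 60, 70, 80], groups=4))  # [0, 0, 1, 1, 2, 2, 3, 3]
```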

The option for processing ties tells the application how you want to handle cases where the ranking variable has the same value for multiple cases. For example, suppose you were processing crime data for US states and there were 3 states that had the same value for the variable Murders, and this value was the largest of any state. Choosing the "Use low value" option would result in a value of 1 being assigned to the Murder_rank variable for all three of the cases; choosing "Use high value" would result in a value of 3 being assigned to Murder_rank; and choosing "Use mean rank" (the default, so you do not have to do anything to choose this option) would result in averaging the ranks so that Murder_rank would be assigned a value of 2 for each of the three states.
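The three tie-handling choices can be illustrated with a short Python sketch. The function and the sample values are hypothetical and only mirror the behavior described above for descending ranks.

```python
def rank_with_ties(values, ties="mean"):
    """Descending ranks; tied values share the low, high, or mean of the ranks they span."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        low = ordered.index(v) + 1          # first rank position occupied by v
        high = low + ordered.count(v) - 1   # last rank position occupied by v
        ranks[v] = {"low": low, "high": high, "mean": (low + high) / 2}[ties]
    return [ranks[v] for v in values]

murders = [50, 50, 50, 30, 20]  # three states tied for the largest value
print(rank_with_ties(murders, "low"))   # [1, 1, 1, 4, 5]
print(rank_with_ties(murders, "high"))  # [3, 3, 3, 4, 5]
print(rank_with_ties(murders, "mean"))  # [2.0, 2.0, 2.0, 4.0, 5.0]
```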

Optional: Filtering the Data

The application allows some limited filtering of the input data. (You can also use the Dexter application to create a filtered dataset which can be passed to Rankster, but that is beyond the scope of this document.) The first item here is designed to let you indicate that you only want cases where the data are significant. If you wanted to rank a dataset of U.S. counties based on a variable that measured the median household income of black households, then it might be useful to limit the ranking to only those counties that had a significant number of black households. In that case you might choose (from the drop-down select list) the (number of) black households variable and then enter a value of, say, 200, for the "at least" value.

Part 2 of this filtering section lets you specify more traditional "Dexter type" filters based on specific values of identifier variables. For example, if you were processing a dataset that had both state and county level summary data for the entire United States and you only wanted to do ranks for the counties of your state, you could use these two Variable/Value(s) boxes to achieve the desired selection. This assumes that the dataset has a variable that lets you distinguish the geographic summary levels and another that identifies which state is being summarized, and also that you know the code values that need to be entered. The archive datasets would almost always contain a variable called SumLev that would contain a 3-digit geographic summary level code, with "050" being the code indicating a county level summary. Our datasets would also almost always have a variable named State that would contain the 2-digit FIPS state code. So you could select SumLev from the first Variable drop-down select list and enter 050 in the first Value(s) text box. Then choose State from the second Variable drop-down and enter (for example) 06 in the second Value(s) text box to indicate you only want data for California. (Dexter users will realize they could also accomplish this via Dexter preprocessing, where they might have easier access to metadata where codes for such key variables would be made available for many datasets.)
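The two kinds of filter described in this section can be pictured as follows. This is a hedged sketch, not Rankster's code: the function, its parameter names and the `BlackHH` variable are hypothetical, while SumLev, State and the code values 050 and 06 come from the example above.

```python
def apply_filters(rows, min_var=None, at_least=None, equals=None):
    """Keep rows that meet the "at least" threshold and match every Variable/Value(s) filter."""
    equals = equals or {}
    kept = []
    for row in rows:
        if min_var is not None and row[min_var] < at_least:
            continue  # too few cases for the data to be considered significant
        if any(row[var] not in allowed for var, allowed in equals.items()):
            continue  # wrong summary level or wrong state
        kept.append(row)
    return kept

rows = [
    {"SumLev": "040", "State": "06", "BlackHH": 900000},  # state-level summary
    {"SumLev": "050", "State": "06", "BlackHH": 150},     # county, too few households
    {"SumLev": "050", "State": "06", "BlackHH": 5200},    # county, kept
    {"SumLev": "050", "State": "29", "BlackHH": 3100},    # county in another state
]

kept = apply_filters(rows, min_var="BlackHH", at_least=200,
                     equals={"SumLev": ["050"], "State": ["06"]})
print(len(kept))  # 1
```

Note that the codes are compared as text, so the leading zero in 050 and 06 matters.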

III. Output Options

In this section the user gets to specify details regarding the output to be produced by Rankster. Most of this section is concerned with specifications related to producing a custom report. But there are also options to specify other outputs such as csv files, SAS datasets and a statistical summary report.

The first two checkboxes allow the user to override the default choices concerning the two kinds of non-report output files. By default a csv file will be produced. "csv" stands for "comma separated value" and this format is an industry standard that is recognized by many programs including Excel. Note that you can have a very large output file (or files) with dozens if not hundreds or even thousands of variables. It is only the report output that has to be limited due to the practical limitations of space in a printed report.
We set the output defaults so that you DO get a csv file but you DO NOT get a SAS dataset. This is because most users these days are Excel users and only a relative few are SAS users. These boxes make it easy for you to alter these defaults. Note that you can easily choose to not have a csv file generated while accepting the default of not having a SAS dataset either. In this case, we assume you are only interested in seeing a report. That should be fairly common.

Options for a custom output report is an important subsection of the form that allows the user to customize the report to be generated. There are three parts to this:

  1. the report format
  2. the specific variables to appear in the report
  3. titles and footnotes to be added at the top and bottom of the report.

The Report Format option allows you to choose among 3 actual formats and a 4th option that says you do not want an output report. Default is to generate the report and do it in standard HTML format for viewing in your web browser. Optionally, you can ask for a special "scrollable table" HTML variant that only really works properly with Microsoft's IE browser. For IE users, it may be a more convenient format because the column headers remain locked as you scroll vertically through the report. Of course, PDF format will be the preferred choice for many, especially if they want to print the report.

Specify variables to be included in the report is perhaps the most daunting section of this form. There are a lot of words here that you might need to read to figure out what you have to do. It can be difficult at first, but once you see a couple of examples you'll probably notice that it's easier than it first appears. Users should also notice that in many (perhaps most) cases nothing needs to be entered in this section because of the program's implementation of implied defaults based on what was selected in Section I.

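Start by recognizing that an output report is divided into three horizontal sections, one for each of the variable lists described below: the Identifiers, the Rank variables and the Additional variables.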

It is important to understand the relationship between the variable choices made in Section I of the form and those entered here as report specifications. The general rules are:

If all you are interested in is a report then you may not have to specify any of these variable lists. Typically you will need/want to specify them if you are getting file output as well as a report and you want to have more data on the files than you display in the report. What you need to do is look at the default values for each of these 3 variable lists to see which ones you are willing to accept. For the Identifiers the default is indicated on the form: (If nothing is entered here then the first (up to) 3 character-type non-ranking variables chosen in Sec. I will be used.) You may want fewer than 3 identifier variables for the report and you also may want them in a different left-to-right order. A common situation is that users will want to keep geographic codes on their output files to facilitate merging the data with other files, but they do not want these codes to appear in their report. For example, you might want to save the variables Year, Fipco and Areaname for your file (even though Year is a constant and Fipco is a FIPS county code that would only be of interest to the researcher wanting to merge this output with other county level data). You could then enter Areaname in the text box for Identifiers to specify that you only need this single item to identify the case (it would contain the name of the county).

For the Rank variables box the default is indicated as: ...the first (up to) 4 ranking variables chosen in Sec I will be used. If you want fewer rank variables or if you want them in a different order or if you specified 12 rank variables in Section I but you only want to see two of them in the report, then here is where you enter the list of those variables, very carefully spelled, in the order in which they are to appear (left to right) in the report.
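The defaults for the Identifiers and Rank variables boxes amount to "take the first few Section I choices unless a list is typed in." A hypothetical Python sketch of that rule follows; the function and the sample rank variable names are invented for illustration, and the first argument is assumed to already contain only character-type non-ranking variables.

```python
def report_defaults(chosen_id_vars, chosen_rank_vars, identifiers=None, rank_vars=None):
    """Return the identifier and rank variable lists the report would use."""
    # A blank box falls back to the first (up to) 3 identifiers / 4 rank variables.
    ids = identifiers if identifiers else chosen_id_vars[:3]
    ranks = rank_vars if rank_vars else chosen_rank_vars[:4]
    return ids, ranks

id_vars = ["Year", "Fipco", "Areaname"]
rank_choices = ["Var1", "Var2", "Var3", "Var4", "Var5", "Var6"]

print(report_defaults(id_vars, rank_choices))
# (['Year', 'Fipco', 'Areaname'], ['Var1', 'Var2', 'Var3', 'Var4'])
print(report_defaults(id_vars, rank_choices, identifiers=["Areaname"], rank_vars=["Var5", "Var2"]))
# (['Areaname'], ['Var5', 'Var2'])
```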

The Additional variables list may be the least likely to be left to its default value. That default value is a bit hard to define. It requires you to understand that we impose a 12-column limit on Rankster reports. Rank variables occupy two columns, the variable value and its rank. So the total number of columns in the report would be NId + 2*Nrank + Nother, where NId is the number of ID variables, Nrank is the number of rank variables and Nother is the number of additional variables selected for the report. If there were 2 ID variables and 4 rank variables that would take up 10 columns. And since we have a 12-column limit it would mean that Nother could be no larger than 2. If you had chosen six other numeric variables from the right list in Section I and left the Additional variables box blank here, it would result in just the first two of those six other variables being included in the report.
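The column arithmetic above can be written out as a small Python sketch. The function names and the sample variable names are hypothetical; only the 12-column limit and the formula NId + 2*Nrank + Nother come from this page.

```python
MAX_COLUMNS = 12  # report column limit stated above

def additional_slots(n_id, n_rank):
    """Columns left for Additional variables; each rank variable uses two columns."""
    return max(0, MAX_COLUMNS - n_id - 2 * n_rank)

def default_additional(other_vars, n_id, n_rank):
    """Default when the Additional variables box is blank: the first variables that fit."""
    return other_vars[:additional_slots(n_id, n_rank)]

others = ["Other1", "Other2", "Other3", "Other4", "Other5", "Other6"]
print(additional_slots(n_id=2, n_rank=4))            # 2
print(default_additional(others, n_id=2, n_rank=4))  # ['Other1', 'Other2']
```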

A checkbox is provided that can be used to override the normal sort order of the output report and/or datasets. Normally these are sorted by the rank of the first (and often only) rank variable. But sometimes you may prefer to see the results in ID variable order. For example, you could do a report with 4 variables ranked for all counties in a state, and you would like to see the report in county order. That is what this checkbox can do for you.

Title and footnotes can be specified, up to two of each. Do not specify title2 unless you specify title1 and, similarly, do not specify footnote2 unless you also specify footnote1. Titles appear at the top of the report/page, footnotes at the bottom. They appear only once (at the beginning and end) of html reports, and on every page of pdf reports.

The final option in this section is a checkbox to indicate if you want to see a statistical summary report for the ranked variables. Try it and see what the report looks like. You may never ask for it again, or you may decide this is a useful set of information. Many news stories summarizing some new set of data are apt to cite this kind of information in order to give readers a feeling for what the "typical" values are, and how they are dispersed.

Click on the Run Rankster button to invoke the application. Until you hit that button you can change your mind all you want, by going back up and entering or erasing choices / text entries in the form fields. If you decide to start over entirely you can click on the Reset Defaults button to set all values back to their defaults. Be careful not to hit this button when you mean to hit the Run button, which is right next to it. After running the application you should be able to use your browser's Back button to get back to this page. When doing so your choices may still be reflected (i.e. text boxes containing your entered text, select lists still showing your choices) or the form may have been reset. This seems to be somewhat browser dependent. We have found that when using Firefox the form is almost always restored to what it looked like when we hit the Run Rankster button, but with IE we have mixed results; it seems to be a function of time: the longer the time between the submit and the return to the form, the more likely it is that the page will have been reset.

Getting assistance with Rankster. If you run into trouble and need help beyond what this Help page has given you, there is a feedback button: "Questions and comments..." can be sent to the author via e-mail. You should typically receive a reply within 24 hours (at least on weekdays).