Tutorial: Basic Descriptive Statistics with PivotTables

Note: If you don't like web pages, there's a pdf file you can download.

I. Making a PivotTable

Microsoft Excel’s PivotTables® are very useful for data analysis and manipulation.  We will use them to create basic graphical statistics but before we can get to the stats, we need to create a table.

Download the ApplicantsData.XLS example file.  Do NOT open the file directly from the web, save it somewhere first. If you open the file directly from the web, you will get a “Data reference not valid error” from the PivotTable. Then open the file.

In order to use PivotTables, the data should be in a contiguous block of cells with the first row containing variable names and the following rows containing data records.

Put your cursor on any single cell in your data block and click on the PivotTable icon on the Insert Ribbon.  Do NOT try to select all of your data before clicking the icon.  The PivotTable engine will find the data for you.  The Window below will appear.   You do not need to make any changes to this Window.  Click OK.

Depending on the configuration of your Office installation, you will get one of the following screens.

For our purposes, the first screen, called the “Classic View”, is easier to work with.  It’s also the screen you would get in older versions of Excel.  Therefore, if you have the second view, you’ll need to change it.  Click on Options beneath PivotTable on the ribbon and select.  In the resulting Window, select the Display tab and click on “Classic PivotTable layout”. Then click OK to return to the PivotTable and it should look like the first view above.

II. Using a PivotTable

Before getting into Statistics, let’s explore some basic PivotTable functions.  We’ll look at the gender distribution in our applicant data set.  In the PivotTable Field List, click and hold on SEX and drag it to the “Drop Row Fields Here” area.  Then drag SEX to the “Drop Data Items Here” area.  The outcome shows that there are 2649 applicants, 615 female, and 2034 male.

For contrast, drag “Count of SEX” back to field list and drag AGE to “Drop Data Items Here” area.  Now it says “Sum of AGE”.

Excel’s default action is to count any variable it identifies as a category and sum any variable it identifies as a number.  This can cause problems when a variable is a numeric code but not really a number (such as ID numbers).  Fortunately, Excel’s decision can be easily changed.  For now, drag AGE back out and SEX back in so that your table looks like the first one.

Absolute counts of gender are useful, but sometime you’d rather have relative frequencies (percents).  To get this, double-click on “Count of SEX”.  The resulting Window shows various initial actions (Sum, Count, etc.).

Click on the “Show values as” tab and click on Normal.  Then scroll down to select “% of column”.  Then click OK.

Although it’s the same data, now it’s easier to see that just over 3/4th of the applicants are male.

Next we want to see a cross-tabulation (aka “two-way table”) of nationality and gender.  Drag CITZ CODE to the open area above the “Total” cell.  Depending on your computer’s settings, this might still be labeled “Drop Column Fields Here”.

Now we can see that just over 1/4th of the US applicants are female and just under 1/5th of the foreign applicants are female.  Of course, we’ve lost the absolute counts so we can’t tell how many foreign vs. US applicants there are.  This can be changed by returning the Count to “Normal”.  We could also see the data as “% of row” or “% of total” instead of “% of column”.

Now, we’ll see what the Page Fields area at the top of the sheet does.  Drag DECISION to that area.

It doesn’t look very different yet but the drop-down arrow by DECISION lets us filter the table.  Click on that arrow and elect Enrolling from the list.

For foreign applicants, the male/female split for those enrolling is similar to the entire applicant pool.  For the US applicants however, there is quite a shift toward females.  You can try other DECISION outcomes to see how the data looks.  Sometime you may also want to switch the “Count of SEX” back to “Normal” to see the absolute counts because a surprising percent outcome can be the result of very low absolute counts.  For example, if there are a total of only 10 people in a particular decision category, one person can have a dramatic impact on percent.

III. Tables and Graphs: Qualitative/Categorical Variables:

We’ll continue with the Applicant data.  You could continue with the same table, but go ahead and delete the sheet with the table so that you can practice starting a new table.

We’ll work with the 1ST CONTACT variable.  Drag 1ST CONTACT to both the Rows area and the Data area to get a count.  If you drag something the Excel thinks is a number, it will be summed but you can change it to count (remember how?). You’ll get the following.

Two things should stand out right away.  There are 148 records with blanks for this variable and, one time, someone couldn’t type “Letter” correctly.  The typing problem is the easy fix.  Go back to the data sheet and click on Find & Select on the Home ribbon.

From here you can select Find or Replace.  I used Replace below.  Fill in the appropriate items and click Find Next.  Then click Replace and then Close.

Now your data is right, but switch back to your table and you’ll see that it’s still wrong.  To fix this, go to the Data ribbon and click on Refresh All.  This will refresh the data behind your table and your table will be updated to reflect the changes.

Now we have to do something about the blank records.  Since this is a categorical variable, it can be argued that “blank” is a valid category and should be included.  In this case, we’re looking for the distribution of how the 1st contact is made.  The blank records don’t tell me anything about that so we are going to remove them from the table (but NOT from the data set).   In an actual report, I would probably include a footnote about ignoring records with missing data.  In real life, you are the statistician and you need to make your own decisions when these situations come up.

To remove the blanks from the table, click on the arrow at the top of the table, scroll down the list, and unclick “(blank)”.  Now the table should look like this.

There are several very small categories that could be combined into an “other” group but that would require modifying the original dataset and it cannot be done through any PivotTable functionality.

If the table is just an intermediate step toward a graph, then we’re ready to go on.  However, if the table itself will be part of a final report, we may want to make it look a little nicer.  While it’s still a PivotTable, we’re pretty much stuck with its appearance.  However, if we copy the table to another worksheet and “Paste Values”, we’ll get a table that can be modified.

 Original After “Paste Values” After formatting changes

Click anywhere in the table and then select the Insert ribbon.  Then click on Column.  Don’t click on Bar – Microsoft programmers and statisticians disagree on several terms and this is one of them.  Microsoft’s Column Charts are what statisticians call Bar Charts.  Select the first 2-D type Column type.  Your complete computer screen gets really filled up at this point.

You can adjust some of the sizes and un-anchor the PivotChart and PivotTable toolbars, but you’ll just have to deal with a lot of “stuff” on the screen.

The chart doesn’t look too impressive yet, but there are two new ribbons that can help out: Design and Layout.  On the Design ribbon you can change colors and chart types.  Play with a few colors if you want.  You can also right-click on a bar and get some other format options.

Go the Layout, then Legend and select None.  With a single data series, there’s no need for a legend, it just takes up space.  There are other options on the Layout ribbon about axis titles and such, but this chart doesn’t really need them.

What it does need is a better title.  Click right on the existing title and you can type in something else.  You can also use the Home ribbon to change the type font or size.

This Chart above shows absolute frequencies because the table’s count is shown as “Normal”.  Go to the table and change it to “% of column”.

Right now, the contact methods are in alphabetical order.  That may or may not make sense.  You might prefer to have them in Pareto order (high to low frequency) or the other way around.  For Pareto order, right click in the numbers (currently in percent format) on the table and select Sort.  Then select Largest to Smallest.  Your chart should change again.

To put it back, right-click in the contact types, select Sort, and select A to Z.

You could change the chart type from Bar to Pie, but we’ll just delete the existing chart (click on it and press the delete key) and start over.

We start the say way.  Select a cell in the table, select the Insert ribbon, and select the first 2-D type of Pie chart.  For this type, we’ll want the legend.

We need to fix the title (again) and we’ll want to consider data labels and category order (again).  The title change is the same as before, click it and change it.  For data labels, there are several options under the Layout ribbon.

You can try any of the options. Lately I’ve liked the “Best Fit”.

The order can be changed the same way as before.  Sort the table as you want and the chart will follow.  Let’s go with Pareto ordering again.

The small frequency categories really mess up the view.  This is why small categories are often combined into an “other” category.   Since we didn’t combine them, we’re stuck with the small categories.   However, we can go into the chart and individually delete the labels that we don’t want.

If we expand the chart size so that we can see better, we can click on individual data labels and delete them.  Be careful to select just the label you want to delete instead of all labels.  After deleting the data labels, we can do the same with the legend entries, again being careful to delete just the ones we want to get rid of.  Afterwards, we can change the font color of the remaining labels to show better against the pie chart.

IV. Tables and Graphs: Quantitative/Numeric Variables

Before we can make an appropriate chart for a quantitative variable, we need to get the frequency table in order first.  Let’s try AGE first.

Create a PivotTable with AGE in the Row Fields and something counted in the Data Fields.   Have the count show as Normal.  There will be a different row for every possible age and that’s generally not what we want.  We usually want numeric variables grouped somehow.

Right click on any number in the Row Fields and select Group. Excel will come up with default values.

With a base ten number system, our brains like to see the first group start at a multiple of 5 or 10 when possible.  Change “Starting at:” to 20 and click OK.

This doesn’t tell us much since most applicants are in their 20’s.  Let’s try grouping by 5’s instead.   Right click on one of the group labels, select Group, and change the options.

This is better but it still doesn’t show the distribution very well in the upper 20’s.  Let’s try one more starting at 22 and grouping by 3’s.

We could try more options but this is good enough for now.  A rough rule of thumb is to end up with four to ten total groups and, when possible, start on a multiple of 5 or 10 and have class widths that are multiples of 5 or 10.  In this example, we’ve got our ten groups, but the other two guidelines didn’t work out.  That’s why they are guidelines rather than rules.

Now we need to make a histogram.  We start with a Bar Chart with no legend.  Go ahead and make the chart and put a reasonable title in it.  It should something like the following.

This is NOT yet a histogram.  One of the rules of histograms is that the bars need to touch.  This represents the distribution of data across the groups in contrast to distinct categories in a categorical bar chart.

To get rid of the spacing between the bars, right click on any bar and select “Format Data Series”.

Move the “Gap Width” slider down to “No Gap”.  Then click “Border Color” and “Solid line”.  Select a color that contrasts with your bar color (usually black is fine) and click Close.

This histogram was fairly simple because the data was “nice”.  Sometimes the data isn’t so nice.  Try the same process with JOB MONTHS instead of AGE.  The Group command won’t work because there is missing data coded as N/A.  This isn’t numeric, so Excel won’t let us group.

Instead we need to copy the original dataset to a new sheet, sort on JOB MONTHS and delete all the records with N/A for JOB MONTHS.  By copying, we preserve the original data set.

Once you’ve deleted the “bad” records, make the PivotTable and group on JOB MONTHS.  Take all the default options for grouping.

This is way too many categories, but it shows a new problem.  There is no group for 270-280.  There were no values there so Excel didn’t show them.  To change this, right click on a group label and select “Field Settings”.

Click on “Show items with no data” and click OK.  In this example, it add three items.

We need to fix two things.  We want the blanks to show 0 and we don’t want the <0 or >290 groups to show (at least not this time).  For this, right click on a group and select PivotTable Options.

Type a 0 into the box after “For empy cells show:” and click OK.  Next, click on the arrow on the table next to JOB MONTH and click off the two groups that we don’t want.

Now we could make a complete histogram but it would still be ugly because there are just too many categories.  Let’s regroup by 25.

Technically in statistics, all classes are supposed to be the same width which means you should never show a “>” or “<” group.  In practice however, this can improve a graph.  Group by 25 but stop at 200.  Then turn off the <0 group.

We can make a nice histogram from this.

Just like for categorical data, the table can be copied elsewhere in the spreadsheet file to improve the formatting.

V. Getting results into Word (or PowerPoint)

Paste Special is your friend.  The Office suite will often copy linkages between applications rather than data or output.  For a final report, you want the output instead of the links.

To copy any table from Excel into Word, it’s best to select the entire table in Excel and Paste Special => Picture (Windows Metafile) to get it where you want it in Word.  As far as I can tell, the Enhanced Metafile will come out the same so you can use either one.

For graphs, Paste Special => Picture (JPEG).  There are also GIF and PNG formats, but I tend to prefer the look of JPEG.