Box & Whisker plot – the under-appreciated chart
Updated: Nov 23, 2020
Analysts tend to under-use certain charts. Sometimes it is because the charts are hard to set up, other times it’s because business people have a hard time reading them. One of those charts is the Box & Whisker plot. I believe, part of its bad reputation comes from the fact that it’s often confused with the Stock chart which is both harder to read, harder to set up and brings less analytical value. On the other hand, the Box & Whisker is actually amazing – it contains ALL the critical analytical information, which you need for the start of any analysis. The usual components of this chart are the mean (average), the median, the 25th and 75th percentile, the interquartile range (a measure of data dispersion) and the outliers. You can do all this without any complicated setup of the data.
Looking at the list of statistics which the chart provides, you might be thinking that the data setup takes time. This couldn’t be further away from the truth. All it takes is two columns of data – one categorical (discrete) and one numerical (continuous). After that, you select the chart type and you’re all set.
In this example, we will be using data from an SQLite loan database loaded into Excel (Google Sheets currently does not support this chart, although you could achieve a similar effect with a combo chart and a lot of shenanigans). If you want to re-create the charts shown below, the way you can get the data in Excel from the database, as well as the data itself, are described in this article.
After all the praises for the Box & Whisker plot so far, I need to make a disclaimer – one of the biggest problems of the chart is that it has to work on your observation level data. This means that you cannot pre-group the data in any way, or the chart simply won’t work. As such if you are working with large datasets Excel is not the optimal tool to create this chart. I have successfully created one out of ~1 million rows of data, but the performance is suboptimal, to say the least.
There is way around this – where you can pre-aggregate all the statistics yourself and then play with combining different chart types, so you can obtain the same results, but I do not think it is worth the effort unless you intend the re-use it frequently and have it fully automated. Finally the much simpler solution – use a different tool than Excel to generate Box & Whisker plots for large datasets.
SQL queries used for data extraction. As you can see both are very simple.
/*The reason why we are filtering the data to only one month is performance*/ select grade, int_rate*1 i_rate /*Reason for *1 is that SQLite tends to return % values as Strings, which gives Excel problems. The multiplication does an implicit coversion*/ from loan where issue_d = 'May-2014'
select issue_d, installment from loan where substr(issue_d,5,4) = '2014' and grade="A"
There is no part of the article for the chart setup because it’s quite simple – select the two columns data and click insert chart. Just make sure the categorical data is in the first column and the numeric data in the second. In other words, you need a very simple list format. You can also set up which chart elements you want there, but the default ones are usually all you need.
If you are still not sure what a Box plot looks like – it’s the first element in the visual to the right, above the bell-shaped curve. Usually, it’s vertical rather than horizontal and sometimes shows the average and the outliers. Let’s go over the main elements:
Average – the sum of all the elements divided by their count.
Median – 50% of the values in the set are larger than this value.
Q1 – first quartile – 25% of the values are smaller than this value.
Q3 – third quartile – 25% of the values are larger than this value.
IQR – Q3 minus Q1.
The whiskers – this is a bit more tricky. On the picture to the right, you see that they are 1.5 times IQR offset from Q1/Q3. While this is technically correct it seems that the length of both whiskers should always be the same. This is not the case. The length of the left whisker is equal to the smallest value (actually present in the set) which is greater than Q1 – 1.5 x IQR (even if it’s equal to Q1). Respectively the reverse for the right whisker.
The relation between a Box plot and normally distributed data N(0, σ2).
By Jhguch at en.Wikipedia, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=14524285
Data is uniformly distributed. This leads to the fact that the Mean(average) is equal to the median. Q3 is exactly 3 times Q1. Due to the uniform distribution plus the fact that the number of entries in our set is even, none of the statistics is exactly equal to values from our set. IQR = Q3 – Q1 = 12.75 – 4.25 = 8.5, thus the maximum length of the top whisker is 12.75 +8.5 = 21.25, but since the largest value int he set is 16, this is where the whisker ends. Due to the same reason, there are no outliers.
Only one number changed from the previous example. Since it’s one of the values in the frontiers of the set it has no effect on Q1, Q3 and Median, but it changes the Average. This means our distribution is “right-skewed”, although this is a bit counter-intuitive with the vertical representation of the plot. Try thinking of it, as if we flipped the picture 90 degrees to the right. Since the new value is greater than Q3+ 1.5 x IQR it is an outlier. Also now the largest value which is not an outlier is 15, so our top whisker is shorter than the bottom one.
Now the top values are capped. Since the average now is less than the median this is a “left-skewed” distribution. Furthermore, as none of the top values is greater than Q3 the top whisker is equal to Q3 and thus not visible. If you try changing the last value in the set to 30 – we get an interesting result – there is one outlier, but still no top whisker.
Going back to our real data, query one. We are examining the interest rate by risk grades. With the simple query provided at the beginning, we extract the data and then with just a few clicks we get the chart to the right. What conclusions can we draw from the chart:
Data from the D grade needs further exploration. It has the only significant outlier plus is doesn’t seem to follow the pattern of the other risk – we see the spread of the IR getting lower and lower, except for D.
The better the risk grade, seem to have higher spread within the middle 50% of their IR, but less outside of it.
Grade G appears to be only a single value.
Grade F is bottom capped.
The second example is looking at instalment amount by months. The conclusions we can draw:
The average instalment amount is increasing. Since Q3 (quartile) seems to be raising more than Q1, plus the average is starting to separate from the median, we can say there are higher amount instalments than before, as opposed to all instalments increasing by an equal percentage. Still, as the mean – the median difference is not huge – both factors are playing a role.
There are less and fewer outliers, which is likely the product of some price unification policy.
I strongly encourage you to incorporate the Box&Whisker plot into your analysis. If not as presentation output for your business colleagues – as least the main pillar of your own analysis. Few other techniques can give you so much initial information with so little effort. Even taking some time and effort to educate your colleagues on how to read and interpret the results of this plot can pay off significantly in the long run.