Interactive Dashboards – data preparation
Updated: Nov 23, 2020
This article is part of the series on creating interactive dashboards. If you have not read the post Interactive Dashboards – why not do it on your own?, I suggest you start from there.
After we have collected all the business requirements for the dashboard, it’s time to prepare the source data. Generally speaking, the process is pretty straightforward: identify the KPIs, extract them in the correct data structure, and finally automate the process. Easy as it may seem, there are a few things you need to keep in mind. How should you optimise the datasets? What aggregation level do you want? Do you care about size and speed? What is your added value to the request?
As discussed previously, Lists are the go-to structure, especially for analysis and dashboards. They are the only structure that allows for easy slicing and exploration of the data.
In practice, this means that all the dimensions we want to explore need to be in separate columns holding discrete values (categories). Similarly, all the measures of those dimensions also need to be in separate columns. The difference is that those values are continuous and pre-summarized accordingly (sum, count, std, etc.). For example, if we are exploring sales over time periods and regions, the time period and the region would sit in separate discretized columns, and the sales volume would be summed over those dimensions in a third column.
For example, let’s take the sample data to the right. If we need to explore loans across states (addr_state) and statuses (loan_status), those would represent our discrete dimensions. Both of them will be in separate columns and can hold a pre-defined set of values: we know all the possible states and all the possible loan statuses. On the other hand, you also need to decide on the measures of those dimensions, also prepared as separate columns. The two most commonly used are counts (for example, the number of loans) and sums (for example, the total loan volume). You will seldom need anything more exotic in your source data.
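To make the dimensions-plus-measures structure concrete, here is a minimal sketch in Python. The column names mirror the sample data (addr_state, loan_status, funded_amnt), but the four rows are invented for illustration:

```python
from collections import defaultdict

# Hypothetical raw rows: one record per loan (values are made up).
raw_loans = [
    {"addr_state": "CA", "loan_status": "Current",     "funded_amnt": 12000},
    {"addr_state": "CA", "loan_status": "Current",     "funded_amnt": 8000},
    {"addr_state": "CA", "loan_status": "Charged Off", "funded_amnt": 5000},
    {"addr_state": "NY", "loan_status": "Current",     "funded_amnt": 15000},
]

# Aggregate into two discrete dimensions and two pre-summarized measures.
agg = defaultdict(lambda: {"num": 0, "funded_amt": 0})
for row in raw_loans:
    key = (row["addr_state"], row["loan_status"])   # the dimensions
    agg[key]["num"] += 1                            # count of loans
    agg[key]["funded_amt"] += row["funded_amnt"]    # sum of loan volume

for (state, status), m in sorted(agg.items()):
    print(state, status, m["num"], m["funded_amt"])
```

Each output row holds one combination of the discrete dimensions, plus the count and sum for that combination – exactly the shape a dashboard tool expects.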
But why aggregate data at all? Can’t you just use the entire source?
The dataset used throughout the dashboard series is Lending Club’s loan data. Detailed information on where to get the data and how to query it can be found here. A sample from the dataset is shown below.
To aggregate or not
If your data source has fewer than 100K rows and around 30 columns, you can skip this section entirely. Chances are that, for almost any intent and purpose, you can take your entire dataset, dump it in your tool, and it will be just fine.
Similarly, some tools (like Tableau) will do the aggregation for you. You can take a 500 MB dataset, dump it into Tableau, build your dashboards and export the final result. More often than not, it will be orders of magnitude smaller than the original. For example, in the Tableau dashboard we are re-creating, the source dataset is 500 MB, while the published dashboard is 75 MB. The compression comes from the fact that Tableau keeps only the data you actually use and aggregates it as needed.
In all other cases, and even in some specific scenarios of the two cases above, you want to aggregate your data. I would go as far as saying that in 95% of the cases where you will update the dashboard frequently, you want to aggregate the data – regardless of whether it is small or your tool does implicit grouping. You will almost always see advantages in terms of speed and size.
Let’s see how file sizes (and, by extension, speed) change. We will take a very simple extract from our dataset: only 3 columns – annual income, loan amount, loan status – and ~900K rows.
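To make the row-count reduction concrete, here is a minimal sketch with synthetic data standing in for that extract (the values and the 100K row count are invented; for brevity it groups by loan status only):

```python
import random
from collections import defaultdict

random.seed(0)
statuses = ["Current", "Fully Paid", "Charged Off"]

# Synthetic stand-in for the 3-column extract
# (annual income, loan amount, loan status).
rows = [(random.randrange(20, 200) * 1000,   # annual income
         random.randrange(1, 36) * 1000,     # loan amount
         random.choice(statuses))            # loan status
        for _ in range(100_000)]

# Group by the discrete dimension and pre-sum the measures.
grouped = defaultdict(lambda: {"num": 0, "income": 0, "amount": 0})
for income, amount, status in rows:
    g = grouped[status]
    g["num"] += 1
    g["income"] += income
    g["amount"] += amount

print(f"{len(rows):,} raw rows -> {len(grouped)} aggregated rows")
```

The same grouping that shrinks the row count is what drives the file-size reductions discussed below: the file only needs to store one row per dimension combination.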
At this point, you are probably thinking: “Wow, Tableau is so much better.” While it does spare us from grouping the data on our own, remember that the export does not store the source data – only the aggregated dimensions which you are using. Excel, on the other hand, has to keep the data exactly as provided, because it is not only a visualisation tool.
The main takeaways here are:
Grouping data in Excel, even only across two dimensions, provided us with a 15x compression factor (this will of course vary based on your data).
Even if the tool has built-in grouping (like Tableau), doing it on your own can bring further benefits (2x in this case).
Excel has some built-in compression (compared to raw TXT file), but it is not stellar.
Discretization and re-discretization
Another thing you should consider is the granularity of your dimensions and how you intend to use them.
Let’s take granularity first. If you are going to show a report month by month, and your sales data is daily, consider re-discretizing your date dimension. By this I mean: a date is already a discrete value (there is some argument on the topic, but in most cases you can treat DATE as discrete), and you are just coarsening the granularity from day to month. For example, you would represent both 01/01/2016 and 21/01/2016 as JAN-2016 and group them together. That is roughly a 30-fold (365/12) reduction in the number of rows in your dataset! And you are not losing information, since one way or another your focus period is one month.
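A one-function sketch of this day-to-month re-discretization (the bucket label format follows the JAN-2016 example above):

```python
from datetime import date

def to_month(d: date) -> str:
    """Re-discretize a daily date to month granularity,
    e.g. 21/01/2016 -> 'JAN-2016'."""
    return d.strftime("%b-%Y").upper()

# Both January dates collapse into the same bucket and group together.
print(to_month(date(2016, 1, 1)))   # JAN-2016
print(to_month(date(2016, 1, 21)))  # JAN-2016
```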
On the other hand, you might have a continuous variable which you do not intend to use as a measure. In our previous example, we are grouping by annual income, which means we intend to use it as a dimension (we are going to use its discrete values, rather than sums and averages of its values). In this case, a good question is: do I need all of the values? Can’t I group them into meaningful groups? Almost always the answer is yes – you should create a discrete dimension from the continuous measure. The groups and the method you should use depend on your goal and data. Let’s see how it turns out.
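A minimal sketch of turning a continuous annual income into a discrete dimension. The cut-offs and labels here are invented for illustration – pick ones that match your business question:

```python
import bisect

# Illustrative cut-offs; choose ones that fit your data and goal.
CUTS   = [40_000, 80_000, 120_000]
LABELS = ["< 40K", "40K-80K", "80K-120K", "120K+"]

def income_band(annual_income: float) -> str:
    """Map a continuous annual income to a discrete band."""
    return LABELS[bisect.bisect_right(CUTS, annual_income)]

print(income_band(55_000))   # 40K-80K
print(income_band(150_000))  # 120K+
```

After this mapping, the income column holds only four possible values, so grouping by it collapses the dataset dramatically.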
As you can see, in all cases the data is now of trivial size. Obviously, you cannot always discretize, or sometimes you will need more groups. Still, you should keep this method in your arsenal for when you need it. In case you are wondering why, at very small file sizes, pure TXT files are smaller: they carry no “overhead”, that is, the additional (non-user) file data associated with the Excel and Tableau files.
Pivot Table compression
This topic applies to Excel only. If you are using another tool, feel free to skip it.
One thing a lot of people don’t know is that Pivot Tables (and even more so Power Pivot) have built-in compression. It applies mainly to numbers; however, you can achieve significant size reductions if you deliver your data as a Pivot directly. In all the examples so far, data was delivered to Excel as an Excel Table. When you are executing a query from Excel, you get to choose the output method: a Table (all examples so far), a Pivot Table or a Pivot Chart (don’t use this one). Let’s take a look at the additional compression of Pivot Tables:
As can be seen, Pivots achieve 3x compression compared to just dumping the data in a Table. When you are really trying to keep your file sizes small, a very useful approach is to keep your source data and your Pivot Table outputs in different files.
However, there is a drawback to this approach. Sometimes you need to manipulate the source data (apply specific formulas) before grouping it with a Pivot. In those cases, you either need the data delivered to a normal Table, or you need a workaround for the transformations – for example, in the SQL fetching the data from the source, or in the source dataset directly.
Selecting the KPI
This is the hardest part of all. Or the easiest one. It all depends on what you want to achieve and the value you want to bring. If you are pressed for time, or frankly don’t care too much, then it’s pretty easy: you have collected the business requirements, and you select your dimensions and measures based on them. The only thought process is deciding which values should be discrete and which continuous. Even this is easily solvable: 95% of the time, anything which needs to be an axis in a chart should be discrete, and anything which needs to be a value in a chart should be continuous. Yes, there are cases where this is not true, but at least you can use it as a starting rule.
In case you wish to bring additional value, the situation is quite different. Great analysts do not just execute requirements – they ask the right questions. Based on those, they understand the business need and add significant value to the dashboard. Remember, business users usually do not know the structure of the data, or even what data is available. Sometimes they do not understand basic chart-building principles. It is up to you to understand what is needed and bring additional value: adding new charts, transforming bad requirements into better ones, adding relevant dimensions and measures which were not part of the requirement, and answering questions even before they are asked.
Putting it all together
Everything described above is an iterative process. Don’t expect to do it all at once and be over with it. No one does. Usually, I start with the very basic data set and minimal aggregation. After I have the first version of my dashboard, I start iterating. Each iteration adds dimensions and measures, which I feel would add value. Likewise, each iteration aggregates and discretizes my dataset further. Every now and then, I look through all the charts and ask – is this really needed?
Finally, I use one simple rule:
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. – Antoine de Saint-Exupéry
In other words – your job is to add the maximum amount of value while using the minimum amount of data. Don’t overcomplicate, keep things simple.
So, what was used in the dashboards we are re-creating in this series, and why?
For the Tableau one, it was a 1:1 data dump: the whole 500 MB dataset was imported. I wanted to give Tableau a chance to show its power as a dedicated dashboard tool, and it did not disappoint – the final output was only 75 MB. Of course, this can be optimised further with the principles we explored above.
For the Excel dashboards, the SQL code is shown below. If you are not sure where the data is coming from or how to connect it to Excel, take a look at this article. Let’s explore the logic behind the aggregation decisions:
Obviously, the goal of those dashboards is to be as small as possible (147 KB each). For the purposes of these articles, they had to be fast to load and accessible to a larger audience. This is an example of limitations applied for technical reasons. On the other hand, all the conditions in the WHERE clause should be business limitations. Restrictions of this kind (subsetting the observations instead of the dimensions) should NEVER be due to technical reasons – otherwise, you risk misleading the people reading the report. It’s all about what the business needs to see and focus on.
The loan term and grades are a great example of truly critical dimensions which you should add even if they are not part of the business requirements. Based on your expertise, you should be able to determine which those dimensions are.
Our data is already on month level, so we do not need to discretize further.
You will notice that all the dimensions here are summarised by sum and count. No averages. This should be your rule of thumb. The reasoning is that an average cannot be further aggregated. We will explore this when talking about the back-end calculations. Similarly, if you look at the wterm (weighted term) and wIR (weighted interest rate) calculations, you might be wondering why they are calculated like this. They are an example of calculations which are critical to perform on the raw data – otherwise, the results would be wrong. Again, we will explore this further when talking about the back-end calculations.
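A tiny worked sketch of why averages cannot be re-aggregated, and why weighted figures must start from raw-data products (all numbers are invented for illustration):

```python
# Two pre-aggregated groups (hypothetical numbers).
groups = [
    {"num": 2, "funded_amt": 30_000},   # loans of 10K and 20K
    {"num": 1, "funded_amt": 60_000},   # one loan of 60K
]

# Wrong: averaging the per-group averages.
avg_of_avgs = sum(g["funded_amt"] / g["num"] for g in groups) / len(groups)

# Right: re-aggregate the sums and counts, then divide once.
true_avg = sum(g["funded_amt"] for g in groups) / sum(g["num"] for g in groups)

print(avg_of_avgs)  # 37500.0 -- misleading
print(true_avg)     # 30000.0 -- correct

# Same idea for a weighted rate: store Sum(term*out_prncp*int_rate) and
# Sum(term*out_prncp) per group, and divide only at display time.
# (Two hypothetical loans: 36m at 10% and 60m at 15%.)
wir_sum = 36 * 10_000 * 0.10 + 60 * 20_000 * 0.15
wterm = 36 * 10_000 + 60 * 20_000
weighted_rate = wir_sum / wterm
```

Because sums and counts re-aggregate cleanly while averages and ratios do not, the query below ships only sums and counts and leaves the division to the front end.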
SQL code used in the example dashboards for Excel.
SELECT l.issue_d, l.term, l.grade, l.emp_length,
       Count(l.member_id) AS 'num',
       Sum(l.funded_amnt) AS 'funded_amt',
       Sum(l.out_prncp) AS 'oustanding',
       Sum(l.term*l.out_prncp) AS 'wterm',
       Sum(l.term*l.out_prncp*l.int_rate) AS 'wIR'
FROM loan l
WHERE (substr(l.issue_d,5,4)>='2014')
  AND (l.grade In ('A','B','C','D'))
  AND (l.emp_length In ('1 year','2 years','3 years','4 years'))
GROUP BY l.issue_d, l.term, l.grade, l.emp_length
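If you want to experiment with the shape of this query without the full dataset, here is a sketch that runs it through Python’s built-in sqlite3 module against a tiny made-up loan table (the three rows are invented; the column names follow the query):

```python
import sqlite3

# Minimal in-memory stand-in for the `loan` table.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE loan (
    member_id INTEGER, issue_d TEXT, term INTEGER, grade TEXT,
    emp_length TEXT, funded_amnt REAL, out_prncp REAL, int_rate REAL)""")
con.executemany("INSERT INTO loan VALUES (?,?,?,?,?,?,?,?)", [
    (1, "Jan-2015", 36, "A", "2 years", 10_000, 8_000, 7.5),
    (2, "Jan-2015", 36, "A", "2 years", 20_000, 15_000, 9.0),
    (3, "Mar-2013", 36, "A", "2 years", 5_000, 0, 11.0),  # dropped by WHERE
])

query = """
SELECT l.issue_d, l.term, l.grade, l.emp_length,
       Count(l.member_id)                     AS num,
       Sum(l.funded_amnt)                     AS funded_amt,
       Sum(l.out_prncp)                       AS oustanding,
       Sum(l.term * l.out_prncp)              AS wterm,
       Sum(l.term * l.out_prncp * l.int_rate) AS wIR
FROM loan l
WHERE substr(l.issue_d, 5, 4) >= '2014'
  AND l.grade IN ('A', 'B', 'C', 'D')
  AND l.emp_length IN ('1 year', '2 years', '3 years', '4 years')
GROUP BY l.issue_d, l.term, l.grade, l.emp_length
"""
rows = con.execute(query).fetchall()
print(rows)  # one aggregated row: the two 2015 loans collapsed together
```

Note how the 2013 loan is subset out by the WHERE clause, while the two qualifying loans collapse into a single row of counts and sums.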
So, what’s next? We have collected the requirements, and we know how we want to structure and aggregate the data. We have considered size and speed limitations. The data source and the tool in which we will display the output are chosen. We have considered the dimensions and the measures, keeping added value in mind. In the next article, it’s time for back-end development and automation. Again, it can range from very simple to fairly complicated, depending on the desired output.