• Hristo Piyankov

Before you start: pick the right dataset structure

Updated: Nov 23, 2020


A problem occurs in a few scenarios. If you try to do your data analysis in your data extraction tool. For example in SQL. Likewise, if you try to dump raw data in your analysis tool before aggregation. In this article, we will go over the basic formats for reporting and analysis. We will do this by answering a simple question from our sample dataset. “What is the company profit over time and region?




Main dataset components


There are a lot of ways to classify data. However, for the purposes of this article, we will consider the following four:

  1. ID (missing if data is grouped)– unique identifier for each row. Also known as Primary key. Example from our data: “Row ID” and “Order ID”. Both are IDs, however, the former is more technical, the latter has business meaning. Thus, we should use the latter in such cases.

  2. External ID (missing if grouped) – an identifier which is not unique in the current list, but is unique in another list (Foreign key). In our example “Customer ID”. We can use it to obtain customer-specific information from another table.

  3. Discrete data (dimensions) – those are variables which can only take a predefined, limited set of values. For example: “Ship date” and “Country”.

  4. Continuous data (analysis KPI)– variables which can take on any numerical value. For example “Sales” and “Profit”. This is what we explore in the analysis. If you need to examine discrete data – create a continuous variable which explains it.

Putting those data types in a different format then produces the three main data structures we use in analytics. Namely the list, the table, and the transaction dataset. Each of them is more or less useful in certain scenarios. As a result, any analyst must be comfortable to work with any of them.


Lists - format


Lists are your bread and butter when you need to analyze data. We can consider a list of any dataset, where the dimensions are separate discretized columns (one each). The KPIs of those dimensions also need to be in separate columns holding numerical values. Data (thankfully) usually comes in this format. Lists have the following basic structure:

Obviously, if your data has a Primary key, you can expect one row per key. Otherwise, if your data is grouped, you can expect one row for each unique combination of your discrete values.

For example: Assume you have data grouped. Your groups are by years (2012,2013,2014) and regions (East, West). In your list data, you can expect 6 rows. That is 3 years X 2 regions.


Examples below are from a live Excel sheet. Use the scroll bars to browse the data.

Full sample data available here, the examples in this article use a subset.


Lists - properties


As stated lists are your go-to format for data analysis. The main reason is that most tools are very good at dealing with this format. This includes Excel, Google Sheets, Tableau, SAS Visual Analytics, and others. The list format is especially well suited to produce Pivot Tables, which are in essence… table format if you want to present your data this way.


There is one caveat though. You might have speed issues in case your dataset is huge. In our example, the data is just 1 000 rows (10 000 in the original dataset). Most tools can handle a lot more than this. With Excel, it is usually OK until half a million rows. Tableau can handle a few million. SAS VA – several million, but it is server-side.


However, if you are not using a server-side solution, you should consider grouping the data first. Let’s assume you are examining the sales data by regions. It’s very likely you do not need to look at individual customer/order level. Or the city level for this matter. Grouping can be achieved in many ways but is usually done during the data extraction. Another way is grouping when importing the data to your analytics tool. Finally, you can use a Pivot Table to group the data and then save it as a separate data source.


Another issue you might face is adding new variables. Once you have an analysis build, not all tools support easy integration of new dimensions. It’s not hard, but not painless as well. Not to mention that each new category (column with discrete values), increases the size of your file. In fact by a factor of the unique elements in the column. For example: Assume you have a pre-grouped list with 10K rows. You decide to add a new dimension to the data – groups by region. If case you have 15 distinct regions, your data will most likely now be 150K rows. In conclusion – always group your data when possible. Just be smart about it.


List format allows you to quickly Pivot the data in order to get the needed answer.



Pros of lists

  1. Best for analysis (usually)

  2. The best format for Pivot Tables

  3. Datasets usually come in this format

Cons of lists

  1. Hard to read without pre-processing

  2. Not good for presentations

  3. Needs grouping for large datasets

  4. Adding new dimensions is tedious

Tables - format


This is the structure most people are familiar with and prefer using. It is surely the most readable format for presenting data. However, it is not great for analyzing it. Usually, it can just compare two-three dimensions, across one-two KPI. If you try really hard you could fit 3+ dimensions or 2+ KPI, but the readability will suffer greatly. Tables have the following basic format:

Tables are great for presenting the data. But there are no real benefits in structuring your data source as one. Even for the simplest of data – use a list.




Tables - properties


It is important to note that list data is easily pivoted into a Table format. However, the conversion does not work the other way around. If you have data in a Table format, unpivoting it requires some effort in most tools. Furthermore, Tables are a form of data aggregation. There is no way to get detailed ID-level information from them. Thus you usually lose data. Let’s look at the simplest example.

Consider the data to the right. Let our pivot shows only the grand total row, then copy/paste the data as text. From this text, there is no way to retrieve the original dimensions. As a result, you have lost the “year” detail.


There are a few advantages to tables outside of being a presentational layer. Generally, they are the structure which takes the least amounts of space. This is due to the reduced repetition of the category labels. That being said, this advantage is becoming less and less relevant, as now most file formats come pre-compressed. If the file size is really a concern for you, try out if a table will bring any improvement. Another advantage is that most charts need a table as a data source. Still, there is no reason, why you can’t have a pivot as your data source. In a lot of cases, it is actually beneficial!

Another fact is that most charts need a table as a data source. As a result, you will need to produce one. Still, there is no reason, why you can’t have a pivot table as your data source. In a lot of cases, it is actually beneficial!


List format allows you to convert your data to Table format. You can apply styles and different calculations in seconds via Pivot tables.



Pros of tables

  1. Best for presenting data

  2. Charts use tables as the data source

  3. Take up less space then lists

Cons of tables

  1. Not really suitable as source data

  2. Can represent aggregations only

  3. Cannot have many dimensions (and remain readable)

Transactions - structure


The transaction structure is one you are not likely to encounter often. You can think of it as a certain (extreme) level of unpivoting your data. The general data structure looks like:




The ID, in the case of transactions, has no specific business meaning. It is just a logical grouping of the data.

As you can see to the right, transactions usually come as name-value pairs, grouped by an ID. In some extreme cases, it can be just values grouped by an ID.


Transactions are quite uncommon. Unless you are often dealing with semi-structured or unstructured data.



Transactions - properties


Although not common in data analysis, transactions have some interesting properties. For example, they can support (semi) unstructured data. On top of this, their structure is flexible to add new attributes.  As a result, you always end up with the same amount of columns. This means that any properly constructed analysis sitting on top of them is flexible. It does not need to be redone upon adding new dimensions.

There is one notable exception with this data structure when it comes to analytics. In fact, transactions are the preferred structure for Market Basket Analysis. You might also know it as Association analysis or Sequence analysis.


Pros of transactions

  1. Very resilient to change

  2. Supports unstructured data


Cons of transactions

  1. Hardest structure to read directly

  2. Takes up the most space

  3. The hardest one to use

Conclusion


Most of the data structures commonly used in analysis and reporting are some form of the three types listed above. They might not be exactly the same, but the logical structure will be.

You can consider the process of transforming data from Transaction to List as a certain degree of pivoting. Likewise from List to Table. While the opposite processes are certain degrees of unpivoting. The pivoting adds structure, readability and compresses the data. On the other hand, unpivoting adds flexibility. You should always choose the structure which best serves your purpose. However, the List structure is commonly referred to as the golden mean for data analysis.


#Analysis #Beginner #Fundamentals #Reporting

9 views