Hristo Piyankov

Automation: fetching data


Any analysis or report begins with getting the data you need into the tool you intend to use. In general, this is a one-time exercise: analyses are repeated infrequently and usually need to be completely redone based on new information and findings. When data mining models require automation, it is for the scoring part, not for the development. Some reports are one-off data deliveries. In all those cases, the data extraction mechanism is not important, as any automation is likely to cost more time than it will ever save.


On the other hand, there are regular reports and dashboards. In both cases, optimising and automating data collection is crucial to ensure smooth generation and distribution. Too often people run a database query, export the results and then copy/paste them into Excel. For obvious reasons, this is quite inefficient. In this article, we will go over data collection from databases, files, the web and via APIs. The examples cover the two most commonly used tools: Excel and Google Sheets. If you are using another tool for data presentation (SAS Visual Analytics, Tableau, SAP BusinessObjects), it most likely has its own data extraction mechanism, but the general principles are usually the same.


Fetching data from a database via ODBC (Oracle, Access, SQLite)


Open Database Connectivity (ODBC) is one of the most commonly used methods to fetch data. It allows your Excel file to connect to a database and execute a query, eliminating the need for a “human interface”. Excel natively supports some connections such as Access, Azure or Microsoft SQL Server (surprise – all Microsoft products), while for others you might need to install an ODBC driver to create the connection. Here we will go over establishing a connection to an SQLite database, which is common for personal use. If you have a server-side solution such as Oracle, it is best to contact your database administrator to enable the proper connection and give you the TNS connection strings.

Excel does not natively support SQLite, so you will need an ODBC driver like the one here. After you install the driver, you can follow the instructions as shown in the pictures below. The example assumes you have some knowledge of SQL, but even if you don’t, the graphical user interface should be enough for most of your needs.


The SQLite database used in this example is available on Kaggle where you can find a lot of other free data sets.


Steps:

  1. Open Excel and go to the Data tab

  2. Select "From Other sources"

  3. Select "From Microsoft Query"

  4. If you installed the ODBC driver, you should have the option to select SQLiteDatabase

  5. Browse for your database file

  6. If no tables show up, open Options and select "Views"

  7. From here you can finish the process in the GUI, but it is better to hit cancel and open the query editor

  8. Click the SQL button and enter the query which you would like to execute

  9. When you are ready, click "Return Data"

  10. Select in what format you want to return the query results


SQL used in the example. Remember to aggregate your data!

SELECT
    l.issue_d,
    l.term,
    l.grade,
    l.emp_length,
    count(l.member_id) as num,
    sum(l.funded_amnt) as funded_amt,
    sum(l.out_prncp) as outstanding,
    sum(l.term) as total_term,
    sum(l.term*l.funded_amnt) as wterm,
    sum(l.term*l.funded_amnt*l.int_rate) as wIR
FROM
    loan l
WHERE
    substr(l.issue_d,5,4) >= '2014'
GROUP BY
    issue_d,
    l.term,
    l.grade,
    l.emp_length
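The wterm and wIR columns may look odd at first: they are sums of term*amount and term*amount*rate, which let you recover weighted averages after aggregation (weighted average term = wterm / funded_amt, weighted average interest rate = wIR / wterm). As a minimal sketch, with made-up illustrative numbers rather than real Lending Club data:

```javascript
// Given aggregated rows like those returned by the query above,
// the weighted averages can be recovered downstream like this
// (illustrative numbers: 36 months at 12% and 60 months at 15%):
var rows = [
  { funded_amt: 100000, wterm: 3600000, wIR: 432000 },
  { funded_amt: 50000,  wterm: 3000000, wIR: 450000 }
];

function weightedAverages(rows) {
  var funded = 0, wterm = 0, wIR = 0;
  rows.forEach(function (r) {
    funded += r.funded_amt;
    wterm  += r.wterm;
    wIR    += r.wIR;
  });
  return {
    avgTerm: wterm / funded, // funded-amount-weighted average term
    avgIR:   wIR / wterm     // term- and amount-weighted average interest rate
  };
}

var result = weightedAverages(rows);
console.log(result.avgTerm, result.avgIR);
```

This is why the article stresses aggregating in the database: carrying the weighted sums instead of the raw rows keeps the extract small while losing none of the averages you will need in the report.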

Some additional points to consider:

  • You can use the GUI to finalize the query if you do not know SQL. The options are limited to filtering and sorting.

  • If data is in a database format usually the size and complexity are not suitable for Excel. In this case, avoid fetching the raw data without aggregating it.

  • The output formats are Table, Pivot Table and Pivot Chart. The table is useful when you need to add some more complicated formulas to the data. In all other cases – use a Pivot table. Pivot charts… I do not recommend as an output format. If you need one, you can always create it on top of the Pivot table.

  • You can set a refresh interval for your query. If it is a fixed period, obviously the file needs to remain open for the query to execute. More commonly used is “Refresh data when opening the file”.


Fetching data from files (TXT, CSV, XLS)


Fetching data from files (Excel files, CSV, txt) is not conceptually different from the ODBC process described above. Actually, they are quite similar, apart from the following differences:

  • CSV and txt files do not support queries or filters. The only advantage you get from importing them is the possibility to set up an automatic refresh when opening the file.

  • If your CSV/txt files have more records than the Excel limits (a little over a million, unless you are using a version older than 2007, then it’s just 65K) consider using the Power Pivot plugin. It can handle large volumes of data.

  • CSV/txt files are imported via Data -> From Text ; Excel files are imported via Data -> From Other Sources -> From Microsoft Query -> Excel files.

  • If you are using Excel 2016 you can take advantage of the new “Get & Transform” menu under the Data tab. It is an intuitive graphical user interface for importing and manipulating data from files.

  • When importing data from Excel files, sometimes you will not be able to see the data range that you are trying to import. You can remedy this by inserting the data range in a table or defining it as a named range.

Getting data from the Web (HTML/XML)


Both Excel and Google Sheets can fetch data from a web source; however, Google Sheets does a much better job of it, being able to target more complicated structures with its ImportXML function. Let’s take a look at both.


Excel has built-in features to fetch data from HTML tables (data enclosed in <table> tags). Furthermore, you can pass parameters to your query, as described on the MS Office support site. In theory, this is a nice feature; in practice, I find it hardly usable. Most of the data available online these days uses custom-created tables and background scripts which give Excel trouble. For the current example, I had to search a bit to find a useful, business-related table in the proper format. Even Microsoft’s own example is not working properly. Luckily, I did find a web page with exchange rates which can be queried; the steps are shown below.


  1. Open Excel

  2. Go to Data

  3. From Web

  4. Paste the URL and hit "Go"

  5. From the options menu, you can select the format in which you want the data returned to Excel

  6. The HTML tables which Excel finds will be marked with an arrow. Select the ones you want and click Import

  7. Select a cell in the produced table and go to Data -> Properties in order to configure the refresh frequency

The same can be achieved via Google Sheets’ ImportHTML formula.
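As an illustrative sketch (the URL and table index here are placeholders, not from the original article), ImportHTML takes the page URL, the element type ("table" or "list") and that element’s index on the page:

=IMPORTHTML("https://example.com/exchange-rates", "table", 1)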


What really makes a difference is the ImportXML formula, which adds more flexibility to what you can import; also, having the output come from a formula gives you additional options to format and filter the result before displaying it. For example, suppose you want to query Google for the top 20 search results in the US for “analytics”. The formula would look like:


=IMPORTXML("http://www.google.com/search?hl=en&gl=US&q=analytics&num=20","//h3")


Keep in mind that the IMPORTXML formula in Google Sheets is a bit “temperamental”: quite often people report that the formula stopped working for no apparent reason, or that they cannot get results in the first place even when it is properly constructed.
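One common workaround (a suggestion on my part, not a guaranteed fix) is to wrap the call in IFERROR, so the sheet degrades gracefully instead of showing an error when the import fails:

=IFERROR(IMPORTXML("http://www.google.com/search?hl=en&gl=US&q=analytics&num=20","//h3"), "Import failed")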


Connecting to a web API (OAuth)


This is a much more common scenario than the HTML tables one, because the data here is designed to be programmatically accessible rather than optimised for human readability. Before we can get the data, however, we might need to authenticate our access. This is very common when the data is sensitive, restricted or personal. In this part of the article, we will go over connecting to an API over OAuth, and in the next part, retrieving the data itself.


In this example, we will connect to the API of a bitcoin marketplace (BitStamp) where we will set up getting the latest BitCoin price every minute. This will also enable us to execute BitCoin transactions on demand.


Here we will switch to using Google Sheets. In this particular case it has a few advantages:

  • Native support for JavaScript, which will simplify our task

  • Since we will be making an API call every minute, Excel is impractical. Google Sheets reside on a server and any code in them is executed server-side. This means that even if you are offline, your code will continue to execute at the predefined intervals.

Let’s first set up our Google Sheet:


  1. Open your Google Spreadsheet

  2. Go to Tools -> Script Editor

  3. Rename your project and hit Save

Whenever you are trying to connect to an API, there will usually be instructions on how to connect. While these will always vary to a certain degree, the general principles should be the same. Based on the Bitstamp API documentation, we should define the functions as described below.



//For simplicity, let's first define a function which will take care of generating our authentication string.
function getPayLoadObject() {                                                            
  try {
    //The nonce is an integer which should always increase with each request. It's easiest to use the current datetime for this purpose.
    var nonce = new Date().getTime().toFixed(0);
    
    //The message consists of the nonce, our user ID on the site and our API key provided by the site (999999 and the empty strings are placeholders - substitute your own)
    var message = nonce + "" + 999999 + "";
    
    //The signature is then HMAC-SHA256 encoded using a secret key, also provided by the site
    var signature = Utilities.computeHmacSha256Signature(message, ""); 
    var signatureStr = '';                                                 
    
    //The site requests that the signature comes in upper-case hexadecimal representation, so we are converting it here 
    for (var i = 0; i < signature.length; i++) { 
      var byte = signature[i]; 
      if (byte < 0) byte += 256; 
      var byteStr = byte.toString(16); 
      if (byteStr.length == 1) byteStr = '0'+byteStr;                               
      signatureStr += byteStr; 
    }    
    
    //We generate the request header as a JavaScript object
    var payLoad =                                                                  
        {
          'key': '',
          'signature': signatureStr.toUpperCase(),
          'nonce': nonce
        };
    
  } catch (e) {
    throw ("Error in generating PayLoad: " + e);
  }
  return payLoad;
}
//Now we can create a generic function for posting requests to the API. The payLoad parameter is the generic payload which we created plus additional options for the API call; the url is the address of the specific API endpoint which we want to invoke
function bitStampPOSTLoad(payLoad, url) {       
  
  //Post the request to the API URL and get the HTTP Response
  var httpResp = UrlFetchApp.fetch(url, { 'method': 'post','payload': payLoad,'muteHttpExceptions':true});
  
  //Parse the JSON response to a JavaScript object
  var obj = JSON.parse(httpResp.getContentText());
  //If there are no errors, return the object. Otherwise put all the error strings together and throw an exception
  if (obj.error == undefined) {return obj;}                                           
  else {
    var errorQueue = "";
    for (var k in obj.error) {
      errorQueue = errorQueue + "-["+k+"="+obj.error[k]+"]-";
    }    
    throw errorQueue;
  }
}

Fetching web data (JSON)


After we have set up our authentication (as described above), now we can access the data. JavaScript Object Notation (JSON) is a very common format for transferring data over the web and querying APIs. You are likely to encounter this format more often than not. The important thing to note here is that the data is semi-structured which means that in order to “translate” it to a table or list format, you need first to understand the data structure, which is most easily done by inspecting it. Below is an example of a very simple JSON output representing the latest BitCoin price on BitStamp. Bear in mind that in general, the structure can be much more complicated, having multiple levels of nested objects.



{"high": "1192.50", "last": "1177.81", "timestamp": "1492310786", "bid": "1174.62", "vwap": "1179.98",
"volume": "1707.75910561", "low": "1165.00", "ask": "1176.30", "open": 1171.00}
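As a minimal sketch of the "translate to a table" step, here is how the response above can be parsed into a row of values (the sample values are copied from the JSON shown, and note that most fields arrive as strings):

```javascript
// Parse the ticker JSON above into a row ready for a worksheet
var raw = '{"high": "1192.50", "last": "1177.81", "timestamp": "1492310786", ' +
          '"bid": "1174.62", "vwap": "1179.98", "volume": "1707.75910561", ' +
          '"low": "1165.00", "ask": "1176.30", "open": 1171.00}';

var ticker = JSON.parse(raw);

// Convert the string fields before doing any arithmetic with them
var row = [
  new Date(Number(ticker.timestamp) * 1000), // Unix seconds -> JS Date
  Number(ticker.last),                       // last traded price
  Number(ticker.vwap)                        // volume-weighted average price
];

console.log(row[1], row[2]); // prints 1177.81 1179.98
```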

Now, let’s define the functions which will be querying and processing the data.



//Simple function calling a public API which does not need authentication. Gets the latest BitCoin price
function bitstampFetchPrice() {                                                         
  try {
    
    //The API URL
    var myUrl = "https://www.bitstamp.net/api/ticker/";
    
    //Fetch the JSON response
    var httpResp = UrlFetchApp.fetch(myUrl, {"muteHttpExceptions" : true});
    
    //Parse it to JavaScript Object
    var obj = JSON.parse(httpResp.getContentText());
    
    //Create a container array for the values
    var tmpArr = new Array(3);
    
    //Get the variables which we need and store them in the Array
    var dt = new Date();
    tmpArr[0] = dt;                                                                
    tmpArr[1] = obj.last;   //Last price                                                       
    tmpArr[2] = obj.vwap;   //Weighted average price                                                  
  } catch (e) {
    throw ("Error in fetching ticker: " + e);
  }
  
  //We can return the array to the calling function, or even paste it directly into the worksheet
  SpreadsheetApp.openById("Google Sheets spreadsheet ID").getSheetByName("Sheet Name").getRange(2, 1, 1, tmpArr.length).setValues([tmpArr]); 
  return tmpArr;
}

And one example using the authentication functions which we created.



//Issuing an order to buy BitCoin. This API call requires authentication. The function receives as parameters
//the amount of Bitcoin we want to purchase and the price at which we want to buy
function bitStampBuyBTC(amount, price) {                                                  
  try {
    //URL of the API
    var myUrl = "https://www.bitstamp.net/api/buy/";
    
    //First get our generic payload containing our authentication
    var payLoad = getPayLoadObject();      
    
    //Add additional parameters to the payload specifying the price and amount
    payLoad.amount = Math.round(amount*10000)/10000;
    payLoad.price = Math.round(price*10000)/10000;
    
    //Call our generic post function with the new payload
    var obj = bitStampPOSTLoad(payLoad, myUrl);
    
  } catch (e) {
    throw ("Error in buy transaction: " + e + " amount: " + amount + " price: " + price);
  }
  return "Posting buy at amount: " + amount + " price: " + price + "/";
}


Conclusion


Automating your data extraction goes a long way in preventing human error, especially when the process needs to be repeated over and over. The techniques covered in this article mainly concern data which was designed to be queried. There is a whole different topic of website scraping for data which was intended as the presentation layer and optimised for human readability. This will be covered in a separate dedicated article.


#Analysis #Beginner #Fundamentals #Reporting
