Thursday, August 20, 2009

Web Analytics

It’s hard to find an area of data mining that deals with more data than the web. The whole web is exploding with information. As of May '09 there were an estimated 109.5 million websites operating on the web. Add to that the information about every click, ad impression and purchase, plus the information captured by the underlying infrastructure (routers, servers, gateways ...), and you'll soon be inundated with more data than you know what to do with.

Web analytics aims to use this information to optimize one's web pages for more clicks, more conversions, better ROI etc.


There are a variety of tools available for analyzing the activity of users on your website. Google Analytics is by far the most popular. It's admired for its simple design and powerful features.



Top 10 Web Analytics Trackers found by Ghostery - June 2009



You can get a wide range of data using these tools: where your users come from, the most popular keywords driving your traffic, bounce rate, demographic distribution, conversion rate, time spent per visit etc. You can even get information about how your peers are doing: by enabling benchmarking in Google Analytics you can view metrics of websites in a similar category.
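
To make a few of these metrics concrete, here is a minimal sketch of how bounce rate, conversion rate and time spent per visit could be computed from raw visit records. The `Visit` record, its field names and the sample data are all hypothetical; real tools collect and define these metrics in their own ways (here a bounce is simply assumed to be a visit with a single page view).

```python
from dataclasses import dataclass


@dataclass
class Visit:
    """One visit to the site (hypothetical record layout)."""
    source: str             # where the visitor came from, e.g. "search"
    pages_viewed: int       # number of pages seen during the visit
    seconds_on_site: float  # time spent during the visit
    converted: bool         # whether the visit ended in a purchase/sign-up


def summarize(visits):
    """Compute a few basic web analytics metrics from a list of visits."""
    total = len(visits)
    if total == 0:
        raise ValueError("no visits to summarize")

    # Assumption: a "bounce" is a visit that viewed exactly one page.
    bounces = sum(1 for v in visits if v.pages_viewed == 1)
    conversions = sum(1 for v in visits if v.converted)

    # Count visits per traffic source.
    by_source = {}
    for v in visits:
        by_source[v.source] = by_source.get(v.source, 0) + 1

    return {
        "bounce_rate": bounces / total,
        "conversion_rate": conversions / total,
        "avg_seconds_per_visit": sum(v.seconds_on_site for v in visits) / total,
        "visits_by_source": by_source,
    }


# Made-up sample data, for illustration only.
sample_visits = [
    Visit("search", 1, 12.0, False),
    Visit("search", 4, 180.0, True),
    Visit("referral", 2, 60.0, False),
    Visit("direct", 1, 8.0, False),
]

report = summarize(sample_visits)
print(report)
```

Numbers like these are exactly the "what" that the tools report.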

However, the tools mentioned above invariably give the answer to “what” rather than “why”. The hard part is to come up with an answer to why things are the way they are. For example, you can have all sorts of statistics about clicks, impressions, bounce rates, average time spent, most viewed pages etc., but unless you tie these to an explanation of why they are happening, web analytics will not be as fruitful as it can be.


One of the good things about web analytics, or the web in general, is its dynamic nature. If you have good traffic on your website you can analyze the effects of a change within hours or a few days. On many occasions this means you don’t have to rack your brain discussing whether something will work or not. You can just try it out and quickly find out whether it’s working.
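
One common way to "try it out" is to show the old and the new version of a page to different visitors and compare their conversion rates. The sketch below uses a standard two-proportion z-test for that comparison; the visitor and conversion counts are made up, and the function name is my own, not part of any analytics tool.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors for the current page (A)
    conv_b / n_b: conversions and visitors for the changed page (B)
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both variants need at least one visitor")

    p_a = conv_a / n_a
    p_b = conv_b / n_b

    # Pooled conversion rate under the assumption that A and B do not differ.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("cannot test: pooled conversion rate is 0 or 1")

    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 10,000 visitors per variant.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be chance alone. With low traffic the test needs more days of data before it says anything useful, which is why good traffic matters here.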

Friday, August 14, 2009

Modeling a Business Problem

In this post, I'll briefly go over how to transform a business problem into a statistical model. At this stage, I assume you have already identified the problem and collected some data that you think is relevant for addressing it. The modeling process can be divided into the following steps:
  1. Data Exploration: This is one of the important, and often neglected, steps in modeling. Many people spend an enormous amount of time thinking about which modeling technique to use and end up spending too little time on this step. Generating frequency tables for categorical / nominal variables and the distribution of values for numeric variables is pretty useful at this stage.
  2. Identifying the Target and the Predictor Variables: The Target variable is identified based on the problem you are trying to model or the goal of the modeling process. For example, the Target variable in modeling credit card fraud can be a binary variable indicating whether the transaction is fraudulent or not. The Predictor variables are those variables whose values are likely to influence the Target variable. It's almost impossible to identify the correct Predictor variables on the first go, and this is often an iterative process.
  3. Data Preparation & Cleansing: Various operations are performed on the raw data in this step to make it suitable for the modeling process. Standardization of the numeric variables, handling of missing values, binning of numeric variables, under / over sampling of data and generation of other derived variables are examples of transformations that might be performed in this step.
  4. Model Training: The first task here is to identify the modeling technique to be used, which in turn depends on the modeling task you are trying to accomplish. For example, Decision Trees, K-Nearest Neighbor and Neural Networks are used for Classification problems. Similarly, standard clustering techniques, Self Organizing Maps etc. are used for Clustering problems, and Regression Models and Neural Networks are very popular for Prediction problems.
  5. Model Evaluation: This again depends on the type of modeling problem. You can use standard approaches like Cross Validation, ROC curves, Divergence and KS, or design your own custom metrics that are indicative of the lift achieved through the model. You may have to go back to steps 2-4 if the results after this step do not seem satisfactory.
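
As a rough illustration of steps 1 to 3, here is a small sketch using the credit card fraud example. The records, field names and bin boundaries are all made up, and median imputation is just one of several ways to handle a missing value.

```python
from collections import Counter
import statistics

# Hypothetical transaction records; "is_fraud" is the Target variable,
# "channel" and "amount" are candidate Predictor variables.
transactions = [
    {"channel": "online", "amount": 120.0, "is_fraud": 0},
    {"channel": "store", "amount": 35.5, "is_fraud": 0},
    {"channel": "online", "amount": None, "is_fraud": 1},  # missing amount
    {"channel": "online", "amount": 980.0, "is_fraud": 1},
    {"channel": "store", "amount": 60.0, "is_fraud": 0},
]

# Step 1 (exploration): frequency table for a categorical variable
# and a simple summary of a numeric variable.
channel_counts = Counter(row["channel"] for row in transactions)
known_amounts = [row["amount"] for row in transactions if row["amount"] is not None]
amount_summary = {
    "min": min(known_amounts),
    "median": statistics.median(known_amounts),
    "max": max(known_amounts),
}

# Step 3 (preparation): fill the missing value with the median (an assumption;
# other strategies exist), then standardize to zero mean and unit variance.
median_amount = amount_summary["median"]
amounts = [
    row["amount"] if row["amount"] is not None else median_amount
    for row in transactions
]
mean_amount = statistics.mean(amounts)
std_amount = statistics.pstdev(amounts)
z_scores = [(a - mean_amount) / std_amount for a in amounts]


def amount_bin(amount):
    """Bin a numeric amount into coarse categories (boundaries are arbitrary)."""
    if amount < 50:
        return "low"
    if amount < 500:
        return "medium"
    return "high"


bins = [amount_bin(a) for a in amounts]

print(channel_counts)
print(amount_summary)
print(bins)
```

Even on a toy table like this, the frequency table and the summary show things worth knowing before any modeling starts, such as the missing value and the one unusually large amount.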
There is enough material available on the web on the standard terms and techniques described above. In my future posts, I'll try to elaborate on a few of the things mentioned above with real life examples.
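
Until then, here is a small sketch of one evaluation metric from step 5, the KS statistic for a binary model: the largest gap between the cumulative score distributions of the two classes. The scores and labels below are made up, and the function is my own illustration rather than a library routine.

```python
from bisect import bisect_right


def ks_statistic(scores, labels):
    """Maximum gap between the score CDFs of class 1 and class 0.

    scores: model scores, one per record
    labels: 1 for the event (e.g. fraud), 0 otherwise
    """
    positives = sorted(s for s, y in zip(scores, labels) if y == 1)
    negatives = sorted(s for s, y in zip(scores, labels) if y == 0)
    if not positives or not negatives:
        raise ValueError("need at least one record of each class")

    best = 0.0
    # The gap can only change at observed score values.
    for threshold in sorted(set(scores)):
        cdf_pos = bisect_right(positives, threshold) / len(positives)
        cdf_neg = bisect_right(negatives, threshold) / len(negatives)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Made-up scores: the first model separates the classes perfectly,
# the second one only partly.
perfect_ks = ks_statistic([0.1, 0.2, 0.3, 0.8, 0.9], [0, 0, 0, 1, 1])
partial_ks = ks_statistic([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(perfect_ks, partial_ks)
```

A KS close to 1 means the model's scores separate the two classes well; a KS close to 0 means they barely separate them at all.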


Sunday, August 9, 2009

What is (and why) Business Analytics

The amount of information captured has been growing exponentially over the past decade or so. There were days when 20 megabytes of storage was considered huge; today even a terabyte is considered normal. Today, information ranging from what people buy, read and listen to, to where they travel, eat and stay, gets captured somewhere.

Often this huge amount of data has interesting secrets hidden inside it. Business Analytics is the use of this plethora of information, together with data analysis and statistical modeling techniques, to gain insights into a problem. These insights can then be converted into actionable plans and decisions to power the business to new heights.

Business analytics finds application across a wide variety of areas, viz. Finance, Marketing, Sales, CRM, Supply Chain Management, Product Lifecycle etc. The ultimate goal everywhere is overall improvement of business processes and hence increased productivity.

Limitations – Business analytics is heavily dependent on historical trends and data. Without significant information about what has happened in the past, one cannot make the evidence-based decisions that form the heart of analytics.