Friday, August 14, 2009

Modeling a Business Problem

In this post, I'll briefly go over how to transform a business problem into a statistical model. At this stage, I assume you have already identified the problem and collected data that you think is relevant to addressing it. The modeling process can be divided into the following steps:
  1. Data Exploration: This is an important, and often neglected, step in modeling. Many people spend an enormous amount of time thinking about which modeling technique to use and end up spending too little time on this step. Generating frequency tables for categorical / nominal variables and examining the distribution of values for numeric variables is very useful at this stage.
  2. Identifying the Target and the Predictor Variables: The Target variable is identified based on the problem you are trying to model or the goal of the modeling process. For example, the Target variable in modeling credit card fraud can be a binary variable indicating whether a transaction is fraudulent or not. The Predictor variables are those whose values are likely to influence the Target variable. It's almost impossible to identify the right Predictor variables on the first attempt, so this is often an iterative process.
  3. Data Preparation & Cleansing: In this step, various operations are performed on the raw data to make it suitable for modeling. Standardizing numeric variables, handling missing values, binning numeric variables, under / over sampling the data, and generating derived variables are examples of transformations that might be performed here.
  4. Model Training: The first task here is to identify the modeling technique to be used, which in turn depends on the modeling task you are trying to accomplish. For example, Decision Trees, K-Nearest Neighbor, and Neural Networks are used for Classification problems. Similarly, standard clustering techniques and Self Organizing Maps are popular for Clustering problems, and Regression Models and Neural Networks for Prediction problems.
  5. Model Evaluation: This again depends on the type of modeling problem. You can use standard approaches like Cross Validation, ROC curves, Divergence, and the KS (Kolmogorov-Smirnov) statistic, or you can design your own custom metrics that indicate the lift achieved through the model. You may have to go back to steps 2-4 if the results of this step are not satisfactory.
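As a rough illustration of steps 1, 3, and 5, here is a minimal Python sketch using only the standard library. The records, the field names, and the helper functions (`frequency_table`, `standardize`, `ks_statistic`) are all made up for this example, and no model is actually trained: the standardized transaction amount simply stands in for a model score so that the KS calculation has something to work on.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical toy records: (card_type, amount, is_fraud).
# These values are invented purely for illustration.
records = [
    ("gold", 120.0, 0),
    ("silver", 40.0, 0),
    ("gold", 900.0, 1),
    ("silver", 35.0, 0),
    ("gold", 700.0, 1),
    ("silver", 60.0, 0),
]


def frequency_table(values):
    """Step 1: count how often each category occurs."""
    return dict(Counter(values))


def standardize(values):
    """Step 3: rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def ks_statistic(scores, labels):
    """Step 5: largest gap between the cumulative score distributions
    of the positive (label 1) and negative (label 0) classes."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


card_types = [r[0] for r in records]
amounts = [r[1] for r in records]
labels = [r[2] for r in records]

print(frequency_table(card_types))  # exploration of a categorical variable
z = standardize(amounts)            # preparation of a numeric variable
print(ks_statistic(z, labels))      # evaluation, with z standing in for a score
```

In a real project the score passed to `ks_statistic` would come from the trained model of step 4, and the evaluation would be done on data held out from training.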
There is plenty of material available on the web about the standard terms and techniques described above. In my future posts, I'll try to elaborate on a few of the things mentioned above with real-life examples.


1 comment:

  1. Nice post, but please don't forget what's the most important/difficult task for every data analyst.

    I talk about this at Highstone Tower blog

    The Most Difficult Task for Every Data Analyst is to…
