- Data Exploration: This is one of the most important, and most often neglected, steps in modeling. Many people spend an enormous amount of time thinking about which modeling technique to use and end up spending too little time on this step. Generating frequency tables for categorical / nominal variables and looking at the distribution of values for numeric variables is very useful at this stage.
- Identifying the Target and the Predictor Variables: The target variable is identified based on the problem you are trying to model, that is, the goal of the modeling process. For example, the target variable in modeling credit card fraud can be a binary variable indicating whether a transaction is fraudulent or not. The predictor variables are those variables whose values are likely to influence the target variable. It's almost impossible to identify the right set of predictor variables on the first attempt, so this is usually an iterative process.
- Data Preparation & Cleansing: In this step, various operations are performed on the raw data to make it suitable for the modeling process. Standardizing numeric variables, handling missing values, binning numeric variables, under / over sampling the data, and generating derived variables are examples of transformations that might be performed here.
- Model Training: The first task here is to identify the modeling technique to be used, which in turn depends on the modeling task you are trying to accomplish. For example, for classification problems, Decision Trees, K-Nearest Neighbor, and Neural Networks are commonly used. Similarly, standard clustering techniques and Self-Organizing Maps are popular for clustering problems, and Regression Models and Neural Networks are popular for prediction problems.
- Model Evaluation: This again depends on the type of modeling problem. You can use standard approaches like Cross Validation, ROC curves, Divergence, and the Kolmogorov-Smirnov (KS) statistic, or you can design your own custom metrics that indicate the lift achieved by the model. You may have to go back to steps 2 to 4 if the results at this point are not satisfactory.
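
To make the steps above a little more concrete, here is a minimal sketch in plain Python (standard library only). Everything in it is an assumption made for illustration: the toy "transactions" are randomly generated, the field names (`amount`, `hour`, `channel`, `is_fraud`) are hypothetical, and K-Nearest Neighbor with 5-fold cross validation is just one of the many combinations mentioned above, not a recommendation.

```python
import math
import random
import statistics
from collections import Counter

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical toy data, invented purely for illustration.
def make_row():
    fraud = random.random() < 0.3
    amount = random.gauss(900.0 if fraud else 100.0, 60.0)
    hour = random.gauss(3.0 if fraud else 14.0, 2.0)
    if random.random() < 0.05:
        amount = None  # simulate a missing value
    return {"amount": amount, "hour": hour,
            "channel": random.choice(["web", "pos", "phone"]),
            "is_fraud": int(fraud)}

rows = [make_row() for _ in range(200)]

# 1. Data exploration: frequency table for a categorical variable,
#    mean and standard deviation for a numeric one.
channel_counts = Counter(r["channel"] for r in rows)
known_amounts = [r["amount"] for r in rows if r["amount"] is not None]
print("channel frequencies:", dict(channel_counts))
print("amount mean / stdev:",
      statistics.mean(known_amounts), statistics.stdev(known_amounts))

# 2. Target and predictor variables (a first guess; expect to iterate).
TARGET = "is_fraud"
PREDICTORS = ["amount", "hour"]

# 3. Data preparation: fill missing values with the median, then
#    standardize each predictor to zero mean and unit variance.
#    For brevity the statistics are computed on all rows; a stricter
#    version would compute them on each training fold only.
def prepare(records):
    columns = []
    for name in PREDICTORS:
        known = [r[name] for r in records if r[name] is not None]
        median = statistics.median(known)
        filled = [median if r[name] is None else r[name] for r in records]
        mean, sd = statistics.mean(filled), statistics.pstdev(filled)
        columns.append([(v - mean) / sd for v in filled])
    features = [list(point) for point in zip(*columns)]
    labels = [r[TARGET] for r in records]
    return features, labels

X, y = prepare(rows)

# 4. Model "training": K-Nearest Neighbor simply stores the training
#    rows and takes a majority vote among the k closest ones.
def knn_predict(train_X, train_y, point, k=5):
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# 5. Model evaluation: k-fold cross validation, reporting mean accuracy.
def cross_val_accuracy(X, y, folds=5, k=5):
    indices = list(range(len(X)))
    random.shuffle(indices)
    scores = []
    for f in range(folds):
        test_idx = set(indices[f::folds])  # every folds-th shuffled row
        train_X = [X[i] for i in indices if i not in test_idx]
        train_y = [y[i] for i in indices if i not in test_idx]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                      for i in test_idx)
        scores.append(correct / len(test_idx))
    return statistics.mean(scores)

accuracy = cross_val_accuracy(X, y)
print("5-fold accuracy:", round(accuracy, 3))
```

Accuracy is used here only because it is easy to compute. On a real fraud problem, where fraudulent transactions are rare, metrics such as ROC curves or KS are usually more informative, and a poor score would send you back to the earlier steps.
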
There is plenty of material available on the web about the standard terms and techniques described above. In my future posts, I'll try to elaborate on a few of the things mentioned above with real-life examples.