A discussion on the Machine Learning Connection group motivated me to write this post.
For classification and regression problems, there are many Machine Learning models to choose from, each of which can be viewed as a black box that solves the same problem. However, each model comes from a different algorithmic approach and will perform differently on different data sets. The best way to choose is to use cross-validation to determine which model performs best on held-out data.
Here I'll try to provide a high-level summary of each model's underlying algorithmic approach, which will hopefully give a sense of whether it is a good fit for your particular problem.
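As a rough illustration of comparing models by cross-validation, here is a minimal sketch in plain Python. The helper names (`k_fold_indices`, `cross_val_mse`, `fit_mean`, `fit_line`) and the toy data are made up for this example; in practice you would likely use a library routine instead.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(fit, xs, ys, k=5):
    """Average held-out mean squared error of a model over k folds.

    `fit(train_xs, train_ys)` must return a function predict(x).
    """
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - predict(xs[i])) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / k

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """One-variable least-squares line y = a*x + b."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - a * mx
    return lambda x: a * x + b

# Toy data (made up): y is roughly 2x + 1 with small Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]

mse_mean = cross_val_mse(fit_mean, xs, ys)
mse_line = cross_val_mse(fit_line, xs, ys)
print("mean baseline:", mse_mean, "line:", mse_line)
```

The model with the lower average held-out error is the one to prefer for this data set; here that should be the line, since the toy data was generated from one.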
Decision Tree based methods
The fundamental learning approach is to recursively divide the training data into buckets of homogeneous members, using the most discriminative dividing criterion at each step. "Homogeneity" is measured on the output label: when the output is numeric, the measure is the variance of the bucket; when it is a category, the measure is the entropy or Gini index of the bucket. During learning, various dividing criteria based on the inputs are tried in a greedy manner. When an input is a category (Mon, Tue, Wed ...), it is first turned into binary variables (isMon, isTue, isWed ...), and true/false is then used as the decision boundary to evaluate homogeneity. When an input is a numeric or ordinal value, lessThan and greaterThan comparisons at each input value seen in the training data are used as candidate decision boundaries. Training stops when splitting the tree further yields no significant gain in homogeneity. The members of the bucket at each leaf node vote for the prediction: the majority wins when the output is a category, and the member average is used when the output is numeric.
The strength of a Tree is that it is very flexible about the data types of its input and output variables, which can be categorical, binary or numeric. The level at which an input variable appears among the decision nodes also indicates its degree of influence: variables used near the root generally matter more. One limitation is that every split is a hard binary decision. Another is that each decision criterion considers only one input attribute at a time, not a combination of multiple input variables. A further weakness is that a Tree, once learned, cannot be updated incrementally: when new training data arrives, you have to throw away the old tree and retrain on all the data from scratch.
However, Trees combined with Ensemble methods (e.g. Random Forest, Boosted Trees) address many of the limitations mentioned above. For example, Gradient Boosting Decision Trees consistently beat other ML models on many problems and are among the most popular methods these days.
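The homogeneity measures and the greedy search for a numeric split described above can be sketched as follows. This is a simplified illustration, not a full tree learner: it evaluates a single split on a single numeric input, and the function names and toy data are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a bucket of categorical labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a bucket of categorical labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def variance(values):
    """Variance of a bucket of numeric outputs."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_threshold(xs, ys, impurity):
    """Try the split 'x < t' at every observed value t.

    Returns (t, weighted impurity of the two buckets), or (None, impurity
    of the unsplit bucket) if no split improves on it.
    """
    n = len(xs)
    best = (None, impurity(ys))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue  # a split must produce two non-empty buckets
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Toy data (made up): does a customer of a given age buy the product?
ages = [22, 25, 31, 38, 45, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, score = best_threshold(ages, bought, gini)
print(threshold, score)
```

For a numeric output, the same search works with `variance` passed as the impurity function. A real tree learner repeats this search over every input variable and then recurses into each resulting bucket.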
Linear regression based methods
The basic assumption is that the output variable (a numeric value) can be expressed as a linear combination (weighted sum) of a set of input variables (which are also numeric).
y = w1x1 + w2x2 + w3x3 + ...
The whole objective of the training phase is to learn the weights w1, w2 ... by minimizing the error function loss(y, w1x1 + w2x2 + ...). Gradient descent is the classical technique for solving this problem; the general idea is to repeatedly adjust w1, w2 ... in the direction opposite to the gradient of the loss function, which is the direction of steepest descent.
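A minimal sketch of gradient descent for this weighted-sum model, assuming mean squared error as the loss. The learning rate, step count and toy data are arbitrary choices for illustration.

```python
def predict(weights, x):
    """Weighted sum w1*x1 + w2*x2 + ..."""
    return sum(w * xi for w, xi in zip(weights, x))

def gradient_descent(X, y, learning_rate=0.05, steps=2000):
    """Minimize mean squared error by stepping against its gradient."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d
    for _ in range(steps):
        errors = [predict(weights, row) - target for row, target in zip(X, y)]
        # d(MSE)/d(w_j) = (2/n) * sum_i error_i * x_ij
        grad = [2.0 / n * sum(e * row[j] for e, row in zip(errors, X))
                for j in range(d)]
        # Move each weight a small step in the direction that lowers the loss.
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights

# Toy data (made up), generated exactly from y = 3*x1 - 2*x2.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [0, 1], [1, 0]]
y = [3 * x1 - 2 * x2 for x1, x2 in X]

weights = gradient_descent(X, y)
print(weights)  # should end up close to [3, -2]
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, convergence is slow. The value above was picked to suit this particular toy data.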
The input variables are required to be numeric. A binary variable is represented as 0 or 1. For a categorical variable, each possible value is represented as a separate binary variable (and hence 0 or 1). For the output, if it is a binary variable (0, 1), then a logistic (sigmoid) function, the inverse of the logit, is used to transform the range of -infinity to +infinity into 0 to 1. This is called logistic regression, and a different loss function (based on maximum likelihood) is used.
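The encoding and the output transformation described above might look like this in code. The weights are hypothetical values chosen only to show the arithmetic, and the helper names are made up for the example.

```python
import math

def one_hot(value, categories):
    """Represent a categorical value as one 0/1 variable per category."""
    return [1 if value == c else 0 for c in categories]

def sigmoid(z):
    """Logistic function: maps any real z into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # this branch avoids overflow for very negative z
    return ez / (1.0 + ez)

def log_loss(y_true, p):
    """Negative log-likelihood of one binary label under probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

days = ["Mon", "Tue", "Wed"]
x = one_hot("Tue", days)                        # [0, 1, 0]
weights = [0.5, -1.0, 2.0]                      # hypothetical learned weights
z = sum(w * xi for w, xi in zip(weights, x))    # weighted sum, here -1.0
p = sigmoid(z)                                  # predicted probability of class 1
print(x, z, p, log_loss(0, p), log_loss(1, p))
```

Since the predicted probability here is below 0.5, the loss is smaller when the true label is 0 than when it is 1, which is the behaviour maximum likelihood training relies on.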
To avoid overfitting, regularization techniques (L1 and L2) are used to penalize large values of w1, w2 ... L1 adds the sum of the absolute values of the weights to the loss function, while L2 adds the sum of their squares. L1 has the property that it tends to drive the weights of redundant or irrelevant features all the way to zero, which makes it a good tool for selecting highly influential features.
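To see why L1 zeroes out small weights while L2 only shrinks them, consider a deliberately simplified setting: a single weight whose unpenalized best value is a, with the fit term written as 0.5*(w - a)^2. This one-weight-at-a-time view is an assumption made for illustration (it matches the full problem only in special cases such as orthonormal features), and the numbers are made up.

```python
def l1_solution(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w), known as soft thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # any weight with |a| <= lam is set exactly to zero

def l2_solution(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*w**2: shrinks but never reaches zero."""
    return a / (1.0 + 2.0 * lam)

unpenalized = [3.0, 0.2, -0.1, -1.5]   # hypothetical unpenalized weights
lam = 0.5                              # regularization strength (arbitrary)

l1_weights = [l1_solution(a, lam) for a in unpenalized]
l2_weights = [l2_solution(a, lam) for a in unpenalized]
print(l1_weights)
print(l2_weights)
```

Under L1 the two small weights drop out entirely, while under L2 every weight is scaled down by the same factor and all features stay in the model.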
The strength of a linear model is that it is very fast in both scoring and learning. The stochastic gradient descent-based learning algorithm is highly scalable and can handle incremental learning.
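Incremental learning with stochastic gradient descent can be sketched as a model that takes one small step per incoming example, so new data never requires retraining from scratch. The class name `OnlineLinearModel`, its `update` method and the simulated data stream are all invented for this illustration.

```python
import random

class OnlineLinearModel:
    """Linear model updated one example at a time with SGD on squared error."""

    def __init__(self, n_features, learning_rate=0.05):
        self.weights = [0.0] * n_features
        self.learning_rate = learning_rate

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, y):
        """Take one gradient step on (predict(x) - y)**2 for a single example."""
        error = self.predict(x) - y
        self.weights = [w - self.learning_rate * 2.0 * error * xi
                        for w, xi in zip(self.weights, x)]

# Simulated stream (made up): examples arrive one by one from y = 3*x1 - 2*x2.
rng = random.Random(0)
model = OnlineLinearModel(n_features=2)
for _ in range(5000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    model.update(x, 3 * x[0] - 2 * x[1])

print(model.weights)  # should end up close to [3, -2]
```

Each update touches only the current example, so memory use stays constant no matter how much data has been seen.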
The weakness of a linear model is its assumption that the output is linear in the input features, which is often false. Therefore, a significant feature engineering effort is required to transform each input feature, which usually involves a domain expert. Another common approach is to try different transformation functions such as 1/x, x^2 and log(x), in the hope that one of them has a linear relationship with the output. One diagnostic is to check whether the residuals (y - predicted_y) are normally distributed, for example with a Q-Q plot against the Gaussian distribution; a plot of residuals against predicted values that shows no systematic pattern is a more direct check of linearity.
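One simple way to try out candidate transformations is to measure how strongly each transformed input correlates linearly with the output. The sketch below uses the Pearson correlation for that purpose; this scoring choice, the helper names and the toy data are assumptions made for the example, not a prescribed procedure.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# Candidate transformations; 1/x and log(x) require strictly positive inputs.
transforms = {
    "x": lambda x: x,
    "1/x": lambda x: 1.0 / x,
    "x^2": lambda x: x * x,
    "log(x)": math.log,
}

# Toy data (made up): the output is actually linear in log(x).
xs = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0]
ys = [2.0 * math.log(x) + 1.0 for x in xs]

scores = {name: abs(pearson([f(x) for x in xs], ys))
          for name, f in transforms.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A correlation close to 1 for one transformation suggests that feeding the transformed feature to the linear model is worth trying, though the choice should still be confirmed on held-out data.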
by Ricky Ho, Software Architect & Data Scientist