In the primary Machine Learning course we explored various algorithms via experimental analysis on two datasets - white wine
and abalone
. Below is an overview of my analysis on supervised learning algorithms.
Choosing Datasets
The popular wine quality score dataset I chose is based on a chemical analysis of wines grown in Italy - and the resulting user "quality" scores - scored from 0 to 10. Key attributes of the wine dataset include the amounts of.:
alcohol, malic acid, magnesium, phenols, and flavonoids
The idea of choosing a dataset like this is that it's a simple classification problem that's well suited for determining if chemical constituents of wine can predict quality scores.. The wine industry is worth billions in California alone, and if wine producers can find out what chemical constituents produce higher quality wines (at least in the mind of customers), then they likely will.
The other dataset I used was a modified version of the ablone age dataset - the idea being to predict the age of an abalone based on physical measurements such as:
length, diameter, shucked weight, and gender
The actual age of the animal is determined by counting the numbers of rings - however this is a time consuming process that requires a microscope. Therefore the goal of this dataset is to see if there's a better proxy attribute to use to count rings.
Algorithm Analysis
The code itself is basically searching all possible hyperparameters
of each algorithm to find optimal results. I used sklearn
's GridSearchCV
function which search over a specified range of hyperparameters, finding the optimal hyperparameter combination. The code splits the data into a 70/30 train/test split, normalizing it to a standard scale, performs 5-fold cross validation on the 70% training data for each hyperparameter combination, and finally calculates the average cross validation score among the 5 folds. The best 5-fold cross validation score from our grid search lets us know what the optimal hyperparameters are.
With the optimal hyperparameters, I can use these settings on the 30% held out test set for each different supervised algorithm and determine which algorithm would perform the best for solving the central idea of each algorithm (predicting quality or predicting age).
Supervised
In the supervised assignment I looked at K-Nearest Neighbors, Decision Trees, Artificial Neural Networks, Boosting, and Support Vector Machines.
K-Nearest Neighbors (KNN)
KNN is a simple instance-based algorithm that simply looks at the k
nearest points in feature space. For classification, this is simply a vote or choosing the mode of the neighbors classes, whereas in a regression setting this would be the weighted mean. Some of the hyperparameters chosen for KNN include k
, distance functions (Manhatta, Euclidean, and Chebyshev), and whether we weight each neighbor equally or by its distance to the point.
As seen in one of the convergence curves where the train and test were plotted against our value of k
- we can see the **bias variance tradeoff**
at work. For low k
values, we have a high variance
situation with a highly complex model that only really works for training data - the model is not "general" enough. An intuitive way to think about this is if we have an outlier "good" wine classification in a normally "bad" wine region of space, if we have k=1, points around that outlier may be labelled incorrectly.
As we increase the value of k
- it can be though of as smoothing out the decision surface - which will decrease variance and increase bias. Increasing k
too far leads to the opposite scenario - a high bias
situation. It's all about finding the right balance point.
Decision Trees
In contrast to KNN, a decision tree is an eager learner as it builds the model on the training data first. Decision trees ask yes or no questions on features to determine which direction to branch. The final leaf node contains the prediction class. Decision trees work well with good splits of the data. In order to measure the quality of splits, I ran Entropy and Gini Index in the hyperparameter grid search. Some decision tree algorithms only look one move ahead to determine the current root of the tree - this is generally ok but usually not the most optimal. You usually want good splits at the top which leads to shorter trees.
In some of the max depth experiments run, increasing the max depth leads to overfitting. If the tree is allowed too many nodes it starts to exactly fit the training data - therefore it generalizes poorly for test data. Pruning or trimming nodes can help alleviate some of these overfitting issues. Error reduction pruning will attempt remove the subtree at nodes, make them leaf and re-assigning a class. If the resulting tree performs at least as good that prune is kept. In both datasets pruning boosted scores from about 75 to 80%.
Artificial Neural Networks (ANN)
Artificial Neural Networks are learning algorithms modeled after brain neurons. The most basic unit is a perceptron
that takes a weighted sum of inputs - if the sum is greater than some threshold (defined by the activation function
) then that output is considered on or activated. ANN can tune the weight vectors (for each node connection) at each layer of the network by iteratively nudging them in small steps. The perceptron training rule
will gaurantee convergence (appropriate score/error) in a finite number of iterations if the data is linearly separable. Data that's not linearly separable will use the backpropagation
rule and gradient descent to iteratively step down the gradient of the cost function. Issues can still occure such as slow convergence or getting stuck in local minima..
In my experiments I used sklearn
's MLPClassifier
which uses stochastic gradient descent - this looks at each training example individually to update weights, offering slightly better runtime than the alternative batch gardient descent.
Some of the experiments here including comparing some of the activation functions (logistic, relu, tanh, identity), comparing varying network dimensions (5, 10, 20 in the first layer versus 5, 10 or 20 in the second layer). Overall ANN needs the fewest number of tarining examples of all the supervised algorithms to converge.
Boosting
Boosting is an ensemble
method of learning - the idea being combining many simple learners leads to a complex system that improves performance. In a common boosting algorithm like Adaboost
each individual learner gets a weighted (depending on its error rate) vote towards the final hypothesis. In addition, each training example is given a variable weight. If the example is modeled poorly in previous learners, its weight increases, if its modeled correctly it decreases. Therefore, even with weak learners, we are always gaining information with learners. In general boosting will reduce the bias (model complexity) in a model that's too general, whereas a related but different method bagging
will reduce reduce variance of a model that is too complex. Boosting can occasionally cause overfitting in a rare case where the first learner perfectly fits the training data causing weights to never be updated.
The experiments run on Adaboost included varying number of estimators, varying learning rate, and comparing learning rate vs the number of estimators in 3d contour plots. Adaboost has default learning rate of 1 - however this will slow down the adaptation of the model to training data. A max Abalone score around 30 estimators and a learning rate of 0.04 indicates that estimators have an equal voting power.
Support Vector Machines (SVM)
Support vector machines attempts to draw an optimal boundary that maximizes the margin width between different classes. Gutter lines define the max margin and go through points closest to the boundary. While ANN tries to reduce the error cost function, an SVM tries to maximize the margin while still classifying properly.
Using sklearn
's support vector classifier, I plotted various hyperparameters including gamma
(complexity of decision boundary), C
(another parameter for decision boundary complexity), and several kernel functions
(linear, rbf, poly).
One of the plots including comparing gamma
vs c on a contour plot with scores. Generally a high gamma (like a low k
in KNN) gives closer points more influence and creating a more complicated decision boundary. C
acts similarly - a large value of C
indicates a more complicated decision boundary. Often times the contour plots perfectly met intuition or expectations, but when I started plotting contour plots it's sometimes tricky for them to exactly match expectations. That's true for the contour plot below - in general the high value of these hyperparameters represents overfit areas - while the opposite corner represents underfit.
Comparison and Conclusion of Supervised Algorithms
No final comparison is complete without consideration of both test performance and runtimes. Boosting was the clear winner and outperforms all other algorithms for both datasets. The only issue is if you look at time, AdaBoost performs the worst. KNN and Decision Trees run the fastest.
A key takeaway is the no free lunch theorem
, which states that an algorithm effective for one problem may not be suitable for others. Each algorithm includes its own inductive biases and complexities. The only way to find the right tool for the job is to find the optimal hyperparameters and plot the performance - in addition to consider running time performance.
|
Inductive Bias |
Pros |
Cons |
Good at |
---|---|---|---|---|
KNN |
-Classes are similar to nearest points |
-Intuitive -Fast runtime -Less prone to overfitting -Limited parameter tuning |
-Difficulty to model complex -Lazy learner is slow to predict new instances -Poor on high dimensional datasets |
-Low-dimensional datasets |
Decision Trees |
-Shorter trees preferred over longer trees -High information gain splits at the top |
-Fast runtime -Robust to noise or missing values -Visually intuitive -Fast training and prediction |
-Possible duplication within tree -Complex trees difficult to interpret and easy to start overfitting |
-Medical diagnosis -Credit risk analysis |
Artificial Neural Networks |
-Smooth interpolation between data points -Bias towards minima via gradient descent |
-Can model complex relationships -Can separate signal from noise |
-Can overfit -Potential for long training times -Black box -Many parameters to tune |
-High dimensional datasets like images |
Boosting (AdaBoost) |
-Ensembles reduce the bias of individual learners |
-Flexibility to choose any type of weak learner -Provable effectively given it picks a weak learner -Increases margin which improves margin with additional learners -Reduces overfitting |
-Weak classifiers that are too complex lead to overfitting -Vulnerable to noise -Training is slow |
-Problems with lack of performance due to individual model instability |
Support Vector Machines |
-Classes separated by wide margins |
-Can model complex relationships -Maximizing margins makes it robust to noise |
-Parameters non-intuitive -Long runtime -Does not perform well for large or imbalanced class datasets |
-Work well in complicated domains with clear margin or separation. |
Future ML algorithm discussion coming...