In machine learning, the problem of classification entails correctly identifying to which class or group a new observation belongs, by learning from observations whose classes are already known. In what follows, I will build a classification experiment in Azure ML Studio to predict wine quality based on physicochemical data. Several classification algorithms will be applied on the data set and the performance of these algorithms will be compared. I will also present a tutorial on how to do similar exercise using MRS (Microsoft R Server, formerly Revolution R Enterprise). I will use wine quality data set from the UCI Machine Learning Repository. The dataset contains quality ratings (labels) for a 1599 red wine samples. The features are the wines’ physical and chemical properties (11 predictors). We want to use these properties to predict the quality of the wine. The experiment is shown below and can be found in the Cortana Intelligence Gallery.
There are several classification algorithms available in Azure ML viz. Multiclass Decision Forest, Multiclass Decision Jungle, Multiclass Logistic regression, Multiclass Neural Network and One-vs-All Multiclass which creates a multiclass classification model from an ensemble of binary classification models. Each of these algorithms have their advantages. The Decision Forest consists of an ensemble of randomly trained decision trees. The ensemble models in general provide better coverage and accuracy than single decision trees. Building multiple random decision trees and training them independently improves generalization and resilience to noisy data. Decision Jungles are a recent extension to decision forests. They require less memory and have considerably improved generalization. Given sufficient number of hidden layers and nodes, neural networks can approximate any function. However, they can be computationally expensive due to a number of hyperparameters. Multiclass Logistic Regression is an extension of Logistic Regression and predicts the probability of an outcome. The best practice for finding which algorithm will perform best is to try them!
The original data had several labels with some of the labels having very few instances.;