
Machine Learning
Gradient boosting works by building weak prediction models sequentially where each model tries to predict the error left over by the previous model.
Gradient Boosting is a machine learning algorithm, used for both classification and regression problems. It works on the principle that many weak learners (eg: shallow trees) can together make a more accurate predictor.
A Concise Introduction to Gradient Boosting. Photo by Zibik
Gradient boosting works by building simpler (weak) prediction models sequentially where each model tries to predict the error left over by the previous model.
But, what is a weak learning model?
A model that does slightly better than random predictions is a weak learner. I will show you the exact formula shortly. But for clearly understanding the underlying principles and working of GBT, it’s important to first learn the basic concept of ensemble learning.
This tutorial will take you through the concepts behind gradient boosting and also through two practical implementations of the algorithm:
Ensemble learning, in general, is a model that makes predictions based on a number of different models. By combining a number of different models, an ensemble learning tends to be more flexible (less bias) and less data sensitive (less variance). The two most popular ensemble learning methods are bagging and boosting.
The application of bagging is found in Random Forests. Random forests are a parallel combination of decision trees. Each tree is trained on random subset of the same data and the results from all trees are averaged to find the classification. The application of boosting is found in Gradient Boosting Decision Trees, about which we are going to discuss in more detail.
Boosting works on the principle of improving mistakes of the previous learner through the next learner.
In boosting, weak learners (ex: decision trees with only the stump) are used which perform only slightly better than a random chance. Boosting focuses on sequentially adding up these weak learners and filtering out the observations that a learner gets correct at every step.
Basically, the stress is on developing new weak learners to handle the remaining difficult observations at each step. One of the very first boosting algorithms developed was Adaboost.
Gradient boosting improvised upon some of the features of Adaboost to create a stronger and more efficient algorithm. Let’s look at a brief overview of Adaboost.
Adaboost uses decision stumps as weak learners.
Decision stumps are nothing but decision trees with only one single split. It also attached weights to observations, adding more weight to ‘difficult-to-classify’ observations and less weight to those that are easy to classify.
The aim is to put stress on the difficult to classify instances for every new weak learner. So, for the next subsequent model, the misclassified observations will receive more weight, as a result, in the new dataset these observations are sampled more number of times according to their new weights, giving the model a chance to learn more of such records and classify them correctly.
As a result, misclassifying the ‘difficult-to-classify’ would be discouraged. Gradient boosting algorithm is slightly different from Adaboost. How? Gradient boosting simply tries to explain (predict) the error left over by the previous model.
And since the loss function optimization is done using gradient descent, and hence the name gradient boosting. Further, gradient boosting uses short, less-complex decision trees instead of decision stumps.
To understand this in more detail, let’s see how exactly a new weak learner in gradient boosting algorithm learns from the mistakes of previous weak learners.
In gradient boosting decision trees, we combine many weak learners to come up with one strong learner.
The weak learners here are the individual decision trees. All the trees are connected in series and each tree tries to minimize the error of the previous tree. Due to this sequential connection, boosting algorithms are usually slow to learn (controllable by the developer using the learning rate parameter), but also highly accurate.
In statistical learning, models that learn slowly perform better. The weak learners are fit in such a way that each new learner fits into the residuals of the previous step so as the model improves. The final model adds up the result of each step and thus a stronger learner is eventually achieved.
A loss function is used to detect the residuals.
For instance, mean squared error (MSE) can be used for a regression task and logarithmic loss (log loss) can be used for classification tasks. It is worth noting that existing trees in the model do not change when a new tree is added.
The added decision tree fits the residuals from the current model.
Hyperparameters are key parts of learning algorithms which effect the performance and accuracy of a model.
The Learning rate and n_estimators are two critical hyperparameters for gradient boosting decision trees. Learning rate, denoted as α, controls how fast the model learns.
This is done by multiplying the error in previous model with the learning rate and then use that in the subsequent trees.
So, the lower the learning rate, the slower the model learns. Each tree added modifies the overall model.
The advantage of slower learning rate is that the model becomes more robust and efficient and avoids overfitting. In statistical learning, models that learn slowly perform better.
However, learning slowly comes at a cost. It takes more time to train the model which brings us to the other significant hyperparameter. n_estimator is the number of trees used in the model. If the learning rate is low, we need more trees to train the model. However, we need to be very careful at selecting the number of trees. It creates a high risk of overfitting to use too many trees.
One problem that we may encounter in gradient boosting decision trees but not random forests is overfitting due to the addition of too many trees. In random forests, the addition of too many trees won’t cause overfitting.
The accuracy of the model doesn’t improve after a certain point but no problem of overfitting is faced. On the other hand, in gradient boosting decision trees we have to be careful about the number of trees we select, because having too many weak learners in the model may lead to overfitting of data.
Therefore, gradient boosting decision trees require very careful tuning of the hyperparameters.
Till now, we have seen how gradient boosting works in theory.
Now, we will dive into the maths and logic behind it so that everything is very clear. Let’s discuss the algorithm step-by-step and make a python program that applies this algorithm to real-time data.
First, let’s go over the basic principle behind gradient boosting once again. Your main aim is to predict a y given a set of x.
The difference between the prediction and the actual value is known as the residual (or in this case, pseudo residuals), on the basis of which the gradient boosting builds successive trees. We know this.
Let’s say the output model y when fit to only 1 decision tree, is given by: $$y = A_1 + (B_1*X) + e_1$$ where, e_1 is the residual from this decision tree.
In gradient boosting, we fit the consecutive decision trees on the residual from the last one. So when gradient boosting is applied to this model, the consecutive decision trees will be mathematically represented as: $$e_1 = A_2 + (B_2 * X) + e_2$$ and $$e_2 = A_3 + (B_3 * X) + e_3$$
Note that here we stop at 3 decision trees, but in an actual gradient boosting model, the number of learners or decision trees is much more. Combining all three equations, the final model of the decision tree will be given by: $$ y = A_1 + A_2 + A_3 + (B_1 * x) + (B_2 * x) + (B_3 * x) + e_3 $$
Let’s consider simulated data as shown in scatterplot below with 1 input (x) and 1 output (y) variables. 
# The above plot has been generated using the following code
x = np.arange(0, 50)
x = pd.DataFrame({'x': x})
# just random uniform distributions in different range
y1 = np.random.uniform(10, 15, 10)
y2 = np.random.uniform(20, 25, 10)
y3 = np.random.uniform(0, 5, 10)
y4 = np.random.uniform(30, 32, 10)
y5 = np.random.uniform(13, 17, 10)
y = np.concatenate((y1, y2, y3, y4, y5))
y = y[:, None]
xi = x # initialization of input
yi = y # initialization of target
# x,y --> use where no need to change original y
ei = 0 # initialization of error
n = len(yi) # number of rows
predf = 0 # initial prediction 0
for i in range(30): # loop will make 30 trees (n_estimators).
# DecisionTree scratch code can be found on www.kaggle.com/grroverpr/gradient-boosting-simplified
tree = DecisionTree(xi, yi)
# It just create a single decision tree with provided min. sample leaf
# For selected input variable, this splits (<n and >n) data so that std. deviation of
tree.find_better_split(0)
# target variable in both splits is minimum as compared to all other splits
# finds index where this best split occurs
r = np.where(xi == tree.split)[0][0]
left_idx = np.where(xi <= tree.split)[0] # index lhs of split
right_idx = np.where(xi > tree.split)[0] # index rhs of split
# predictions by ith decisision tree
predi = np.zeros(n)
# replace left side mean y
np.put(predi, left_idx, np.repeat(np.mean(yi[left_idx]), r))
np.put(predi, right_idx, np.repeat(
np.mean(yi[right_idx]), n-r)) # right side mean y
predi = predi[:, None] # make long vector (nx1) in compatible with y
# final prediction will be previous prediction value + new prediction of residual
predf = predf + predi
ei = y - predf # needed originl y here as residual always from original y
yi = ei # update yi as residual to reloop
The code above is a very basic implementation of gradient boosting trees. The actual libraries have a lot of hyperparameters that can be tuned for better results. This can be better understood by using the gradient boosting algorithm on a real dataset.
For implementation on a dataset, we will be using the PIMA Indians Diabetes dataset, which has information about a an individual’s health parameters and an output of 0 or 1, depending on whether or not he has diabetes.
The task here is classify a individual as diabetic, when given the required inputs about his health. First, let’s import all required libraries.
# Import all relevant libraries
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import preprocessing
import warnings
warnings.filterwarnings("ignore")
Now load the dataset and look at the columns to understand the given information better. < pre># Load the dataset
pima = pd.read_csv('diabetes.csv')
pima.head()
For training and testing our model, the data has to be divided into train and test data. We will also scale the data to lie between 0 and 1.
# Split dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(pima.drop('Outcome', axis=1),
pima['Outcome'], test_size=0.2)
# Scale the data
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)
Now that the data has been sufficiently preprocessed, let’s go ahead with defining the Gradient Boosting Classifier along with it’s hyperparameters. Next, we will fit this model on the training data.
# Define Gradient Boosting Classifier with hyperparameters
gbc=GradientBoostingClassifier(n_estimators=500,learning_rate=0.05,random_state=100,max_features=5 )
# Fit train data to GBC
gbc.fit(X_train_transformed, y_train)
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.05, loss='deviance', max_depth=3,
max_features=5, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500,
n_iter_no_change=None, presort='auto',
random_state=100, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
The model has been trained and we can now observe the outputs as well. Below, you can see the confusion matrix of the model, which gives a report of the number of classifications and misclassifications.
# Confusion matrix will give number of correct and incorrect classifications
print(confusion_matrix(y_test, gbc.predict(X_test_transformed)))
[[82 17]
[25 30]]
The number of misclassifications by the Gradient Boosting Classifier are 42, comapared to 112 correct classifications. The model has performed decently. Let’s check the accuracy
# Accuracy of model
print("GBC accuracy is %2.2f" % accuracy_score(
y_test, gbc.predict(X_test_transformed)))
GBC accuracy is 0.73
The accuracy is 73%, which is average.
This can be improved by tuning the hyperparameters or processing the data to remove outliers.| The discussion above is just the tip of iceberg when it comes to gradient boosting. The underlying concepts can be understood in more detail by starting with the very basics of machine learning algorithms and understanding the working of python code. This however gives us the basic idea behind gradient boosting and its underlying working principles.
As we already discussed above, gradient boosting algorithms are prone to overfitting and consequently poor performance on test dataset. There are some pointers you can keep in mind to improve the performance of gradient boosting algorithm.
Stochastic gradient boosting involves subsampling the training dataset and training individual learners on random samples created by this subsampling. This reduces the correlation between results from individual learners and combining results with low correlation provides us with a better overall result. A few variants of stochastic boosting that can be used:
The predictions of each tree are added together sequentially. Instead, the contribution of each tree to this sum can be weighted to slow down the learning by the algorithm.
This weighting is called a shrinkage or a learning rate. Using a low learning rate can dramatically improve the performance of your gradient boosting model.
Usually a learning rate in the range of 0.1 to 0.3 gives the best results. Keep in mind that a low learning rate can significantly drive up the training time, as your model will require more number of iterations to converge to a final loss value.
L1 and L2 regularization penalties can be implemented on leaf weight values to slow down learning and prevent overfitting. Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees’ terminal nodes
There are a number of ways in which a tree can be constrained to improve performance.
Build a strong Python foundation with hands-on exercises designed for aspiring Data Scientists and AI/ML Engineers.
Start Free Course →Get the exact 10-course programming foundation that Data Science professionals use.