(This article was contributed by the SUTD AISG Student Chapter)
A random forest is commonly defined as ‘a supervised machine learning algorithm that is constructed from decision tree algorithms’. In this article, we explain the working principles of the random forest algorithm.
Random forest is a machine learning algorithm that uses ‘ensemble learning’, which combines several models (here, decision trees) to solve a problem. The trees are commonly combined through a technique called ‘bagging’ (bootstrap aggregating): each tree is trained on a random sample of the training data drawn with replacement, and the trees' predictions are then combined. Bagging reduces variance, which generally makes the combined model more accurate than any single tree.
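The sampling step of bagging (short for bootstrap aggregating) can be sketched in a few lines. This is a minimal illustration rather than a full implementation: the helper name `bootstrap_sample` and the placeholder rows are assumptions made for this example, and the trees themselves are not trained here.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random *with replacement*."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = ["r0", "r1", "r2", "r3", "r4", "r5"]  # stand-ins for training examples

# One bootstrap sample per tree; each tree would then be trained on its own sample.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for i, sample in enumerate(samples):
    print(f"tree {i} trains on {sample}")
```

Because rows are drawn with replacement, a sample usually repeats some rows and leaves others out, so every tree sees a slightly different view of the data.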
A decision tree is a decision support model that uses a tree-like structure of decisions and their possible consequences. It starts at a root node and branches out through internal nodes, each of which tests a feature of the input, until it ends at leaf nodes, which hold the final outputs.
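To make the root, branch and leaf structure concrete, here is a small sketch of a decision tree stored as nested dictionaries. The tree, its feature names and the `predict` helper are hypothetical and hand-written for illustration; a real tree would be learned from data.

```python
# Internal nodes ask a question about one feature; leaves hold the final answer.
tree = {
    "feature": "colour",  # root node
    "branches": {
        "red": {"leaf": "apple"},
        "green": {"leaf": "apple"},
        "yellow": {
            "feature": "shape",  # internal node
            "branches": {
                "long": {"leaf": "banana"},
                "round": {"leaf": "apple"},
            },
        },
    },
}

def predict(node, example):
    """Walk from the root to a leaf, following the branch that matches the example."""
    while "leaf" not in node:
        value = example[node["feature"]]
        node = node["branches"][value]
    return node["leaf"]

print(predict(tree, {"colour": "yellow", "shape": "long"}))  # banana
```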
Supervised learning is a branch of machine learning in which the model learns to map an input to an output from labelled examples (input-output pairs). The model is therefore shown the kinds of input-output matches it is expected to reproduce. The main types of supervised learning are regression and classification.
Regression involves predicting continuous values, such as a stock's price over time. Classification, on the other hand, involves predicting which of several distinct categories an input belongs to, such as whether an animal is a cat or a dog.
Random forest models are commonly used to solve both regression and classification problems.
A random forest works by training several decision trees, each on a random sample of the training data (and typically considering only a random subset of the features at each split). To make a prediction, the input is passed through every decision tree, and the final output of the random forest is the majority output of the decision trees (classification) or the average of all outputs from the decision trees (regression).
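The two ways of combining tree outputs can be sketched as follows. The function names and the sample outputs are assumptions made for this illustration; the lists stand in for what already-trained trees might return.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(votes):
    """Return the label predicted by the most trees."""
    return Counter(votes).most_common(1)[0][0]

def aggregate_regression(outputs):
    """Return the average of the trees' numeric outputs."""
    return mean(outputs)

class_votes = ["cat", "dog", "cat", "cat", "dog"]
print(aggregate_classification(class_votes))  # cat

tree_outputs = [10.0, 12.0, 11.0, 15.0]
print(aggregate_regression(tree_outputs))  # 12.0
```

Using an odd number of trees for a two-class problem avoids tied votes; in this sketch a tie would simply go to whichever label was seen first.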
Here is an example of a simple classification problem to better understand this procedure:
In this example, our input data is a description of a fruit (such as its colour, size, shape and taste) that we would like to classify as an apple or a banana. This data is fed into ‘n’ decision trees, each of which classifies it independently. Once all the decision trees have returned an output, majority voting takes place: the model picks the value returned by the most decision trees. This value (in this case, apple) is the output of the random forest algorithm.
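The fruit example can be mimicked with a toy forest of three trees. Everything here is hypothetical: the three rule functions stand in for trained decision trees, and the feature names and thresholds (such as `size_cm > 12`) are invented purely for illustration.

```python
from collections import Counter

# Three hypothetical "trees", each a hand-written rule on different features.
def tree_colour(fruit):
    return "banana" if fruit["colour"] == "yellow" else "apple"

def tree_shape(fruit):
    return "banana" if fruit["shape"] == "long" else "apple"

def tree_taste_size(fruit):
    if fruit["taste"] == "tart":
        return "apple"
    return "banana" if fruit["size_cm"] > 12 else "apple"

forest = [tree_colour, tree_shape, tree_taste_size]

def classify(fruit):
    """Collect one vote per tree and return the majority label with the votes."""
    votes = [tree(fruit) for tree in forest]
    winner = Counter(votes).most_common(1)[0][0]
    return winner, votes

fruit = {"colour": "yellow", "shape": "round", "taste": "sweet", "size_cm": 8}
label, votes = classify(fruit)
print(votes, "->", label)  # ['banana', 'apple', 'apple'] -> apple
```

One tree is misled by the yellow colour, but the other two outvote it, which is the intuition behind why a forest tends to be more reliable than any single tree.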
Random forest is an efficient algorithm that can handle large datasets, in part because its trees can be trained independently of one another. It is also relatively simple to understand and generally provides greater accuracy than a single decision tree.
The random forest algorithm is actively used in several industries: in banking, to classify customers based on their banking patterns; in healthcare, to predict and diagnose conditions based on patients' medical histories; in trading, to model stock price patterns with regression; and in e-commerce, to predict customer preferences based on user consumption and cookie tracking patterns.
Anirudh Shrinivason, Sahana Katragadda, Dhanush Kumar Manogar, Lai Pin Nean
The views expressed in this article belong to the SUTD AISG Student Chapter and may not represent those of AI Singapore.