Random Forest is an ensemble learning method that combines multiple Decision Trees to improve predictive accuracy and reduce overfitting (many individually trained learners are combined into a stronger one). It is one of the most popular and powerful machine learning algorithms, used for both #ml/classification and #ml/regression tasks. The key idea behind random forests is to introduce randomness into the tree-building process so that the trees are diverse, and then combine their predictions into a more robust overall prediction.
#How it works
- Ensemble Learning:
    - Random forest is a Bagging (Bootstrap Aggregating) algorithm.
    - It builds multiple decision trees independently and combines their predictions (e.g., by majority voting for classification or averaging for regression).
- Training Process (a minimal sketch of these steps follows this list):
    - Step 1: Bootstrap Sampling:
        - Randomly sample the training data with replacement (bootstrapping) to create multiple subsets. Each subset is used to train a separate decision tree.
    - Step 2: Feature Randomness:
        - At each split in a decision tree, instead of considering all features, only a random subset of features is considered. This introduces diversity among the trees.
    - Step 3: Tree Construction:
        - Each tree is grown deep without pruning, which keeps individual-tree bias low; the resulting variance is reduced when the trees are aggregated.
    - Step 4: Aggregation:
        - For classification: the final prediction is the majority vote of all trees.
        - For regression: the final prediction is the average of all trees' predictions.
    - Step 5: Out-of-Bag (OOB) Error:
        - Data points not included in a tree's bootstrap sample are used to evaluate the model internally, without requiring a separate validation dataset.
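A minimal from-scratch sketch of Steps 1-3, assuming NumPy arrays for `X`, `y` and that scikit-learn is available; the function name `fit_forest` is illustrative, not a standard API. Per-split feature randomness is delegated to `max_features="sqrt"` on each tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, random_state=0):
    """Train one unpruned tree per bootstrap sample (Steps 1-3)."""
    rng = np.random.default_rng(random_state)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample -- draw n_samples indices with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        # Steps 2-3: grow an unpruned tree; max_features="sqrt" makes each
        # split consider only a random subset of the features.
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(1_000_000)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees
```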
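Continuing the sketch, Step 4 aggregates the trees' outputs: majority vote for classification, averaging for regression. This assumes `trees` is the list of fitted trees from above and that class labels are non-negative integers; the function names are again illustrative.

```python
import numpy as np

def predict_forest(trees, X):
    """Step 4 (classification): majority vote across the trees' predictions."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    # For each sample, pick the most frequently predicted class.
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

def predict_forest_regression(trees, X):
    """Step 4 (regression): average the trees' numeric predictions."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)
```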
#Preconditions
- Requires tuning hyperparameters: number of trees (`n_estimators`), number of features considered per split, maximum tree depth, minimum samples per leaf, etc. (see the tuning sketch below)
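A hedged sketch of tuning these hyperparameters with scikit-learn's RandomForestClassifier and a grid search; the dataset and grid values are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Illustrative grid over the hyperparameters mentioned above.
param_grid = {
    "n_estimators": [100, 300],      # number of trees
    "max_features": ["sqrt", 0.5],   # features considered at each split
    "max_depth": [None, 10],         # None = grow trees fully (no pruning)
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```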
#Evaluation
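One way to make the OOB idea from Step 5 concrete, assuming scikit-learn: with `oob_score=True`, each sample is scored using only the trees whose bootstrap sample left it out, giving a built-in estimate of generalization accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each sample is evaluated only by the trees that did not see it during
# training (it was out of their bootstrap sample).
clf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
clf.fit(X, y)
print("OOB accuracy:", clf.oob_score_)
```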
#Advantages
- Robust to noise and outliers, since the errors of individual trees tend to average out
- Reduces overfitting compared to a single decision tree, because averaging many decorrelated trees lowers variance
#Limitations
- Computationally expensive: training and prediction scale with the number of trees, and the model uses more memory than a single tree