Chapter 5 Discussion

Relative to the OLS model, whose R-squared explains around 53% of the variation, the random forest model did improve, explaining around 55% of the variation. However, there is still room for improvement, particularly in getting closer to the admittedly arbitrary gold-standard R-squared of 70%.
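For reference, the hold-out R-squared quoted above can be computed directly from observed and predicted prices; the snippet below is a minimal sketch in which `obs` and `pred` are hypothetical placeholders for the test-set prices and the corresponding model predictions.

```r
# Hold-out R-squared: one minus the ratio of residual to total sum of squares.
# `obs` and `pred` are hypothetical placeholders, not objects from the analysis.
rsq <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Usage, e.g.: rsq(test_set$price, predict(model, newdata = test_set))
```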

One improvement to the model would be to include more features from the raw dataset. The random forest algorithm is naturally adept at handling large datasets, so the number of features included in the model could be increased. For example, proximity to the city centre is a variable that could be derived from the raw data and that may well have a significant influence on a listing’s potential price. If the feature set is expanded, the tuning grid would have to be revised, in particular by raising the upper bound of mtry to accommodate the larger number of predictors. That being said, simply increasing the number of features does not guarantee better model performance, and should therefore be approached with careful consideration.
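As an illustration, a revised tuning grid could look as follows. This is a minimal sketch assuming a caret workflow; the object `listings`, the outcome `price`, the added feature `dist_city_centre`, and the grid values are hypothetical and would need to match the actual analysis.

```r
library(caret)

# Widen the mtry grid as more predictors (e.g. dist_city_centre) are added.
rf_grid <- expand.grid(mtry = seq(2, 12, by = 2))

set.seed(123)
rf_fit <- train(
  price ~ .,                                   # hypothetical formula incl. new features
  data      = listings,                        # hypothetical prepared dataset
  method    = "rf",
  tuneGrid  = rf_grid,
  trControl = trainControl(method = "cv", number = 5)
)

rf_fit$bestTune                                # inspect the selected mtry
```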

Breiman points out that, due to the Law of Large Numbers, random forests do not tend to overfit, and that the injection of randomness makes them accurate regressors (Breiman 2001). However, as Lantz points out, it is difficult to inspect the process through which random forests achieve their results (Breiman 2001). In fact, Breiman demonstrates that random inputs and random features produce good results in classification, less so in regression. Similarly, from a theoretical point of view, the analysis of the model remains difficult, particularly when it comes to the dependencies between the induced partitions of the input space and the predictions within those partitions (Lorenzen, Igel, and Seldin 2019). The authors suggest using PAC-Bayesian bounds for the averaging and majority vote classifiers. This is proposed mainly to overcome the deterioration of the C-bound when the individual classifiers are strong and highly correlated, as is the case with random forests; the bounds for the Gibbs classifier are more robust (Lorenzen, Igel, and Seldin 2019). The authors demonstrate that PAC-Bayesian generalisation bounds can be used to obtain robust performance guarantees for the random forest classifier. The bounds for the Gibbs classifier outperformed the majority vote bounds, which take the correlation between ensemble members into consideration; the reason for this, according to the authors, is that the high accuracy of decision trees as classifiers makes the estimation of correlations of errors more difficult (Lorenzen, Igel, and Seldin 2019).
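For context, the two kinds of bounds discussed above can be sketched in their standard textbook forms (binary classification with 0–1 loss), rather than the exact statements of Lorenzen, Igel, and Seldin (2019). For a prior $\pi$ and posterior $\rho$ over the trees, with probability at least $1-\delta$ over a sample $S$ of size $n$, the PAC-Bayes-kl bound for the Gibbs classifier $G_\rho$ states

$$
\mathrm{kl}\!\left(\widehat{L}_S(G_\rho)\,\middle\|\,L_D(G_\rho)\right) \le \frac{\mathrm{KL}(\rho\,\|\,\pi) + \ln\frac{2\sqrt{n}}{\delta}}{n},
$$

while the C-bound for the $\rho$-weighted majority vote $\mathrm{MV}_\rho$ reads

$$
L_D(\mathrm{MV}_\rho) \le 1 - \frac{\bigl(1 - 2\,L_D(G_\rho)\bigr)^2}{1 - 2\,d_D(\rho)}, \qquad \text{provided } L_D(G_\rho) < \tfrac{1}{2},
$$

where $\mathrm{kl}$ denotes the binary KL divergence and $d_D(\rho)$ the expected disagreement between pairs of voters drawn from $\rho$.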

In contrast, Geurts, Ernst, and Wehenkel (2006) propose the extra-trees algorithm. What distinguishes the extra-trees algorithm from other tree-based ensemble methods is that nodes are split using cut-points chosen fully at random and that each tree is grown on the whole learning sample rather than on a bootstrap replica, as a random forest does. In other words, the novelty of this approach is the combination of the attribute randomisation of the random subspace method with a completely random selection of the cut-point (Geurts, Ernst, and Wehenkel 2006).
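As a sketch of this direction, the ranger package exposes an extra-trees-style split rule; the example below is illustrative only, again assuming the hypothetical `listings` data and `price` outcome, and would need to be evaluated against a proper hold-out set.

```r
library(ranger)

set.seed(123)
et_fit <- ranger(
  price ~ .,
  data              = listings,   # hypothetical prepared dataset
  num.trees         = 500,
  splitrule         = "extratrees",
  num.random.splits = 1,          # one random cut-point per candidate feature
  replace           = FALSE,      # grow each tree on the whole learning sample
  sample.fraction   = 1,          # ... rather than a bootstrap replica
  oob.error         = FALSE       # no out-of-bag sample exists in this setting
)

# In-sample predictions, purely to show usage; evaluate on a hold-out set instead.
head(predict(et_fit, data = listings)$predictions)
```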

5.1 Future directions

In conclusion, the random forest algorithm improved on the OLS model’s predictive performance. The next steps involve expanding the scope of the model to include other features present in the raw data and adjusting the tuning grid accordingly. Furthermore, PAC-Bayesian bounds could be applied to the model, and a model based on the extra-trees algorithm could be built, in pursuit of a higher R-squared measure.

References

Breiman, L. 2001. “Random Forests.” Machine Learning 45: 5–32. https://doi.org/10.1023/A:1010933404324.
Geurts, P, D Ernst, and L Wehenkel. 2006. “Extremely Randomized Trees.” Machine Learning 63: 3–42. https://doi.org/10.1007/s10994-006-6226-1.
Lorenzen, S, C Igel, and Y Seldin. 2019. “On PAC-Bayesian Bounds for Random Forests.” Machine Learning 108: 1503–22. https://doi.org/10.1007/s10994-019-05803-4.