Machine Learning
Collin Knezevich 2022-07-17
Of the machine learning methods we learned about, I found Random Forests to be the most interesting. The idea of a Random Forest model is to fit many different models on many different re-samples of data, then average all predictions from these models. In addition, each model will be fit using a random selection of predictor variables. This is done to reduce variance in the model and correlations between predictions. All of this will result in more accurate predictions.
Fitting a Random Forest Model
We will use the randomForest
package to fit a model, and we will use
the mtcars
dataset.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine() masks randomForest::combine()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ ggplot2::margin() masks randomForest::margin()
mtcars <- mtcars
Now we can build our random forest model, with MPG as our response variable.
rfModel <- randomForest(mpg ~ ., data = mtcars,
mtry = ncol(mtcars)/3,
ntree = 200,
importance = TRUE)
Finally, we can evaluate the accuracy of the model by creating predictions from our model, and using them to calculate the root MSE.
preds <- predict(rfModel, newdata = select(mtcars, -mpg))
rmse <- sqrt(mean(preds - mtcars$mpg)^2)
rmse
## [1] 0.004355878