Introduction
This analysis follows chapter 7 of *Machine Learning with R* by Brett Lantz and uses data from the UCI Machine Learning Repository.
In engineering, it is vital to have accurate estimates of building material performance. Such estimates enable the development of safety guidelines for using these materials in construction.
Estimating concrete's strength is a particularly interesting challenge. Concrete is used in almost every construction project, yet its performance varies because its ingredients interact in complex ways, making the strength of the final product difficult to predict. A model that could reliably predict concrete strength could therefore yield safer construction practices.
Exploration and preparation of data
Exploration
The data used for this analysis concerns the compressive strength of concrete and was donated to the UCI Machine Learning Repository by I-Cheng Yeh. The dataset contains 1,030 examples of concrete, with eight features describing the composition and age of each mixture. These features are thought to be related to the final compressive strength. We can get a quick look at the variables using the skim function from the skimr package:
```r
library(tidyverse)
concrete <- read.csv("concrete.csv")
skimr::skim(concrete)
```

| Name | concrete |
|---|---|
| Number of rows | 1030 |
| Number of columns | 9 |
| Column type frequency: | |
| numeric | 9 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| cement | 0 | 1 | 281.17 | 104.51 | 102.00 | 192.38 | 272.90 | 350.00 | 540.0 | ▆▇▇▃▂ |
| slag | 0 | 1 | 73.90 | 86.28 | 0.00 | 0.00 | 22.00 | 142.95 | 359.4 | ▇▂▃▁▁ |
| ash | 0 | 1 | 54.19 | 64.00 | 0.00 | 0.00 | 0.00 | 118.30 | 200.1 | ▇▁▂▂▁ |
| water | 0 | 1 | 181.57 | 21.35 | 121.80 | 164.90 | 185.00 | 192.00 | 247.0 | ▁▅▇▂▁ |
| superplastic | 0 | 1 | 6.20 | 5.97 | 0.00 | 0.00 | 6.40 | 10.20 | 32.2 | ▇▆▁▁▁ |
| coarseagg | 0 | 1 | 972.92 | 77.75 | 801.00 | 932.00 | 968.00 | 1029.40 | 1145.0 | ▃▅▇▅▂ |
| fineagg | 0 | 1 | 773.58 | 80.18 | 594.00 | 730.95 | 779.50 | 824.00 | 992.6 | ▂▃▇▃▁ |
| age | 0 | 1 | 45.66 | 63.17 | 1.00 | 7.00 | 28.00 | 56.00 | 365.0 | ▇▁▁▁▁ |
| strength | 0 | 1 | 35.82 | 16.71 | 2.33 | 23.71 | 34.44 | 46.14 | 82.6 | ▅▇▇▃▁ |
We might be interested in the relationship between the final compressive strength of the mixture and the amount of cement (in kilograms per cubic meter). This is visualised below:
```r
concrete %>%
  ggplot(aes(cement, strength)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  xlab("Cement (kg/cubic meter)") +
  ylab("Strength") +
  ggtitle("Relationship Between Cement Amount and Final Compressive Strength")
```

Preparation
For this analysis, the outcome is strength, predicted using the other eight features. Neural networks perform best when the input data are scaled to a narrow range around zero; in our dataset, however, the features range from 0 to over 1,000.
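We can verify these ranges quickly; a minimal sketch using base R:

```r
# minimum and maximum of every column, confirming the widely varying scales
sapply(concrete, range)
```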
To address this issue we need to normalise the dataset. We define a normalisation function as follows:
```r
normalise <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
```

After defining our function, we need to apply it to every column in the dataset. We can do so using the lapply function:

```r
concrete_norm <- as.data.frame(lapply(concrete, normalise))
```

We may wish to confirm that the data has indeed been normalised:
```r
summary(concrete_norm$strength)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.2664  0.4001  0.4172  0.5457  1.0000
```
Compare this with the original data:

```r
summary(concrete$strength)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2.33   23.71   34.45   35.82   46.13   82.60
```
For this analysis we will partition the data into a training set with 75% of the examples and a testing set with the remaining 25%. The dataset is already sorted randomly, so we only need to split it according to these proportions:

```r
concrete_train <- concrete_norm[1:773, ]
concrete_test <- concrete_norm[774:1030, ]
```

The training data will be used to construct the neural network, and the testing data to evaluate how well the model generalises to future data.
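The split above relies on the rows already being shuffled. If they were not, we could draw a random 75% sample instead; a minimal sketch (the seed value here is arbitrary):

```r
set.seed(123)  # arbitrary seed, chosen only for reproducibility
train_idx <- sample(nrow(concrete_norm), floor(0.75 * nrow(concrete_norm)))
concrete_train <- concrete_norm[train_idx, ]
concrete_test  <- concrete_norm[-train_idx, ]
```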
Training a model on the data
A multilayer feedforward neural network will be used to model the relationship between the ingredients in concrete and the strength of the final mixture. The neuralnet package, developed by Stefan Fritsch and Frauke Guenther, provides an easy-to-use neural network implementation. We begin by constructing the simplest multilayer feedforward network; with the default settings, it uses only a single hidden node:
```r
library(neuralnet)
set.seed(1234)
concrete_model1 <- neuralnet(strength ~ cement + slag + ash + water +
                               superplastic + coarseagg + fineagg + age,
                             data = concrete_train)
```

Let's visualise the network topology of the resulting model object:
```r
plot(concrete_model1, rep = "best")
```

In this simple model, there is one input node for each of the eight features, followed by a single hidden node and a single output node, which predicts concrete strength. The weights for each connection are also included, as are the bias terms, indicated by the nodes labelled with the number 1. The bias terms are numeric constants that allow the value at the indicated nodes to be shifted upward or downward, much like the intercept in a linear equation.
The bottom of the figure reports the number of training steps and an error measure known as the sum of squared errors (SSE): the sum of the squared differences between the actual and predicted values. The lower the SSE, the more closely the model conforms to the training data. This tells us about performance on the training data, but not about how the model would perform on unseen data.
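Rather than reading these values off the plot, they can also be extracted from the fitted object; a quick sketch assuming the result.matrix layout returned by neuralnet:

```r
# reported SSE-based error and number of training steps
concrete_model1$result.matrix[c("error", "steps"), ]
```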
Evaluating model performance
We can generate predictions on the test dataset using the compute function:

```r
model_results <- compute(concrete_model1, concrete_test[1:8])
```

compute works slightly differently from predict in that it returns a list with two components: $neurons, which stores the neurons of each layer, and $net.result, which stores the predicted values. We want the predicted values:

```r
predicted_strength <- model_results$net.result
```

Given that this is not a classification problem, a confusion matrix cannot be used to examine model accuracy. Instead, we examine the correlation between the predicted concrete strength and the true value. If the predicted and actual values are highly correlated, we can consider the model useful for predicting concrete strength:
```r
cor(predicted_strength, concrete_test$strength)
##           [,1]
## [1,] 0.8063505
```
Correlations closer to one indicate a strong linear relationship between two variables. In this case, the correlation between the actual and predicted values is 0.806, a fairly strong relationship for a model with only a single hidden node. Since we used just one hidden node, it is likely that the model's performance can be improved.
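Correlation is not the only way to quantify performance here. On the normalised scale we could also compute an error metric such as the root mean squared error; a minimal sketch:

```r
# RMSE between predictions and actual values, both on the [0, 1] scale
sqrt(mean((predicted_strength - concrete_test$strength)^2))
```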
Improving model performance
As the complexity of a network's topology grows, so does its capacity to learn. Let's see how much model performance increases if we use five hidden nodes instead of one:
```r
set.seed(12345)
concrete_model2 <- neuralnet(strength ~ cement + slag + ash + water +
                               superplastic + coarseagg + fineagg + age,
                             data = concrete_train, hidden = 5)
```

If we plot the network, we can see the substantial increase in the number of connections. But has this improved performance?

```r
plot(concrete_model2, rep = "best")
```

For this model the reported error has been reduced to 1.63, from 5.08 in the previous model. The number of training steps has also increased, to 86,849 from 4,822: the more complex the network, the more iterations are needed to find the optimal weights. We can apply the same steps as before to find the correlation between the actual and predicted values:
```r
model_results2 <- compute(concrete_model2, concrete_test[1:8])
predicted_strength_2 <- model_results2$net.result
cor(predicted_strength_2, concrete_test$strength)
##           [,1]
## [1,] 0.9244533
```
A correlation of around 0.92 is a substantial improvement over 0.80. Nevertheless, model performance can be improved further: for instance, we can add more hidden layers and alter the network's activation function. In doing so, we are building a simple deep neural network.

For this analysis we'll use a smooth approximation of the rectified linear unit (ReLU) as the activation function. It is known as softplus or SmoothReLU and can be defined as \(\log(1+e^x)\). We define the softplus function below:
```r
softplus <- function(x) { log(1 + exp(x)) }
```

To provide the activation function to the model, we specify it in the act.fct parameter. We will also add a second hidden layer of five nodes by supplying the integer vector c(5, 5) to the hidden parameter. This creates a network with two hidden layers of five nodes each, all of which use the softplus activation function.
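Before fitting, we can sanity-check our definition by comparing softplus with the ReLU it approximates (pmax(0, x) implements the ReLU); a quick sketch:

```r
x <- seq(-4, 4, by = 1)
# softplus is close to 0 for negative inputs and close to x for positive inputs
round(rbind(softplus = softplus(x), relu = pmax(0, x)), 3)
```

With the activation function in place, we can fit the deeper network: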
```r
set.seed(12345)
concrete_model3 <- neuralnet(strength ~ cement + slag + ash + water +
                               superplastic + coarseagg + fineagg + age,
                             data = concrete_train, hidden = c(5, 5),
                             act.fct = softplus)
```

As before, the network can be visualised:

```r
plot(concrete_model3, rep = "best")
```

As with the previous models, we can compute the correlation between the predicted and actual concrete strength:
```r
model_results3 <- compute(concrete_model3, concrete_test[1:8])
predicted_strength_3 <- model_results3$net.result
cor(predicted_strength_3, concrete_test$strength)
##           [,1]
## [1,] 0.9348395
```
The correlation between the predicted and actual strength for this model is 0.935, indicating that this is the best performing model so far.
Given that we normalised our data prior to modelling, the predictions are also on the normalised scale. Let's compare the concrete strength in our original dataset with the predicted values:
```r
strengths <- data.frame(
  actual = concrete$strength[774:1030],
  pred = predicted_strength_3
)
head(strengths, n = 10)
##     actual      pred
## 774  30.14 0.2860639
## 775  44.40 0.4777305
## 776  24.50 0.2840964
## 777  71.62 0.9446964
## 778  36.30 0.4361824
## 779  38.46 0.5072858
## 780  35.86 0.4844175
## 781  40.15 0.5061428
## 782  31.42 0.2482940
## 783  20.73 0.2887341
```
The choice of normalised or unnormalised data does not affect this performance statistic, because correlation is invariant to positive linear rescalings; the correlation of 0.935 remains the same as before:

```r
cor(strengths$pred, strengths$actual)
## [1] 0.9348395
```
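This invariance can be illustrated with a tiny sketch using arbitrary values:

```r
x <- 1:10
y <- c(2, 5, 3, 8, 7, 4, 9, 6, 10, 1)
# TRUE: scaling and shifting y does not change the correlation
all.equal(cor(x, y), cor(x, 2 * y + 5))
```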
Nevertheless, the choice of scale would affect other performance metrics, such as the absolute difference between the predicted and actual values. Taking this into account, let's create a function to unnormalise the predictions and convert them back to the original scale:
```r
unnormalise <- function(x) {
  # rescale x from the [0, 1] range back to the original strength scale
  return(x * (max(concrete$strength) - min(concrete$strength)) + min(concrete$strength))
}
```

With the predictions mapped back to the original scale, let's calculate the error values:
```r
strengths$pred_new <- unnormalise(strengths$pred)
strengths$error <- strengths$pred_new - strengths$actual
head(strengths, n = 10)
##     actual      pred pred_new      error
## 774  30.14 0.2860639 25.29235 -4.8476507
## 775  44.40 0.4777305 40.67743 -3.7225728
## 776  24.50 0.2840964 25.13442  0.6344180
## 777  71.62 0.9446964 78.16078  6.5407800
## 778  36.30 0.4361824 37.34236  1.0423612
## 779  38.46 0.5072858 43.04983  4.5898312
## 780  35.86 0.4844175 41.21419  5.3541927
## 781  40.15 0.5061428 42.95808  2.8080826
## 782  31.42 0.2482940 22.26056 -9.1594406
## 783  20.73 0.2887341 25.50669  4.7766862
```
The correlation remains the same:
```r
cor(strengths$pred_new, strengths$actual)
## [1] 0.9348395
```
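With the errors now on the original scale, we can also summarise them directly; for example, the mean absolute error (a minimal sketch):

```r
# average absolute deviation, in the original strength units
mean(abs(strengths$error))
```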
Session Info
```r
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] neuralnet_1.44.2 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
## [5] purrr_0.3.4 readr_2.0.1 tidyr_1.1.3 tibble_3.1.4
## [9] ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 lattice_0.20-44 lubridate_1.7.10 assertthat_0.2.1
## [5] digest_0.6.27 utf8_1.2.2 R6_2.5.1 cellranger_1.1.0
## [9] repr_1.1.3 backports_1.2.1 reprex_2.0.1 evaluate_0.14
## [13] httr_1.4.2 highr_0.9 pillar_1.6.2 rlang_0.4.11
## [17] readxl_1.3.1 rstudioapi_0.13 jquerylib_0.1.4 Matrix_1.3-4
## [21] rmarkdown_2.10 labeling_0.4.2 splines_4.1.0 munsell_0.5.0
## [25] broom_0.7.9 Deriv_4.1.3 compiler_4.1.0 modelr_0.1.8
## [29] xfun_0.25 pkgconfig_2.0.3 base64enc_0.1-3 mgcv_1.8-36
## [33] htmltools_0.5.2 tidyselect_1.1.1 bookdown_0.24 fansi_0.5.0
## [37] crayon_1.4.1 tzdb_0.1.2 dbplyr_2.1.1 withr_2.4.2
## [41] grid_4.1.0 nlme_3.1-152 jsonlite_1.7.2 gtable_0.3.0
## [45] lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1 scales_1.1.1
## [49] rmdformats_1.0.2 cli_3.0.1 stringi_1.7.4 farver_2.1.0
## [53] fs_1.5.0 skimr_2.1.3 xml2_1.3.2 bslib_0.3.0
## [57] ellipsis_0.3.2 generics_0.1.0 vctrs_0.3.8 tools_4.1.0
## [61] glue_1.4.2 hms_1.1.0 fastmap_1.1.0 yaml_2.2.1
## [65] colorspace_2.0-2 rvest_1.0.1 knitr_1.33 haven_2.4.3
## [69] sass_0.4.0
```