This project explores air quality data for the Leeds region of the United Kingdom. The dataset is retrieved with the openair package and contains information on pollutant species spanning 2009 to 2023. First, the project explores the dataset with an emphasis on data visualisation, to uncover pollution patterns. Second, the data is modelled with a view to forecasting future levels of nitrogen oxides (NOx). Finally, the findings are summarised and future research directions are recommended.
Let’s begin by loading the data. The openair package lets us retrieve air quality information for a range of years once we specify a monitoring site we are interested in. For this project we focus on Leeds air quality data, therefore we specify the following:
importAURN(site = "leed", year = 2009:2023) -> leeds_pollution # Import data for 2009 to 2023

The dataset contains a range of variables, such as the date, the pollutant species (nox, no2, etc.), and so on:
glimpse(leeds_pollution)

Rows: 125,928
Columns: 18
$ site <chr> "Leeds Centre", "Leeds Centre", "Leeds Centre", "Leeds Centre…
$ code <chr> "LEED", "LEED", "LEED", "LEED", "LEED", "LEED", "LEED", "LEED…
$ date <dttm> 2009-01-01 00:00:00, 2009-01-01 01:00:00, 2009-01-01 02:00:0…
$ co <dbl> 0.3, 0.3, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0…
$ nox <dbl> 71, 76, 36, 27, 29, 25, 25, 27, 32, 32, 31, 36, 42, 42, 46, 5…
$ no2 <dbl> 48, 48, 27, 21, 23, 19, 21, 23, 27, 25, 23, 25, 31, 31, 34, 4…
$ no <dbl> 15, 19, 6, 4, 4, 4, 3, 3, 4, 5, 5, 8, 8, 8, 8, 9, 9, 8, 13, 1…
$ o3 <dbl> 0, 0, 4, 14, 14, 14, 14, 14, 10, 12, 12, 12, 14, 14, 14, 10, …
$ so2 <dbl> 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 0, 0, 0, 0, 3, 0…
$ pm10 <dbl> 85, 89, 75, 50, 47, 50, 51, 48, 46, 47, 47, 50, 51, 52, 55, 5…
$ pm2.5 <dbl> 76, 79, 69, 50, 47, 46, 48, 47, 45, 43, 45, 45, 46, 44, 47, 4…
$ v10 <dbl> 11, 9, 12, 11, 12, 13, 12, 11, 11, 12, 12, 13, 13, 14, 13, 12…
$ v2.5 <dbl> 11, 10, 12, 13, 14, 13, 13, 13, 13, 11, 12, 12, 14, 14, 14, 1…
$ nv10 <dbl> 74, 80, 63, 39, 35, 37, 39, 37, 35, 35, 35, 37, 38, 38, 42, 4…
$ nv2.5 <dbl> 65, 69, 57, 37, 33, 33, 35, 34, 32, 32, 33, 33, 32, 30, 33, 3…
$ ws <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ wd <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ air_temp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
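The first rows suggest the meteorological variables (ws, wd, air_temp) contain missing values. Before plotting, it may be worth checking how complete each column is. A minimal sketch using dplyr and tidyr (both attached via the tidyverse):

# Proportion of missing values per column, highest first (sketch)
leeds_pollution %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "prop_missing") %>%
  arrange(desc(prop_missing))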
Let’s take a look at where most of the wind comes from according to the data collected. The wind rose plot demonstrates this pattern across our whole dataset, i.e. from 2009 to 2023:
windRose(leeds_pollution) 
The figure shows that the wind comes predominantly from the West, Northwest and Southwest. The mean wind speed across the sample is 3.7 m s⁻¹. It might be interesting to see how the pattern has evolved over time:
windRose(leeds_pollution %>%
           filter(year(date) >= 2018),
         type = "year", layout = c(4, 2)) # Only looking at 2018 onwards
The pattern appears fairly consistent for the years 2018 to 2023, with the mean wind speed ranging from 3.5 to 3.8 m s⁻¹. We can get an idea of how wind direction and pollutant species interact by making a pollution rose plot. Let’s focus on NOx, otherwise known as nitrogen oxides, a pollutant species that is measured by the Leeds city centre station:
pollutionRose(leeds_pollution, pollutant = "nox", alpha = 0.8)
The figure above demonstrates that the higher values of NOx come from the west to south sectors, where concentrations of 350 to 1092 occur most frequently by wind direction. In other words, a higher percentage of these high values comes from these directions. How do these values fluctuate over time? The figure below looks at the NOx values in 2022:
calendarPlot(leeds_pollution, pollutant = "nox", year = 2022)
It seems the highest NOx concentrations were on the 15th and 18th of January 2022. Some other months experienced elevated levels of NOx, although not to the same scale; these include some days in March and November. How do these high levels of NOx in January compare to the other pollutant species? Let’s consider CO, NO2 and O3:
timePlot(selectByDate(leeds_pollution, year = 2022, month = "jan"),
         pollutant = c("nox", "co", "no2", "o3"),
         y.relation = "free", smooth = TRUE)
We can see that the other pollutant species follow a similar pattern, reaching peaks between January 10 and January 20. This trend is least evident for O3, which fluctuates more steadily across the month.
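To put a number on this visual similarity, we could compute pairwise correlations between the pollutant series for January 2022. A minimal sketch reusing selectByDate() from openair together with base R’s cor():

# Pairwise correlations between pollutants in January 2022 (sketch; complete pairs only)
selectByDate(leeds_pollution, year = 2022, month = "jan") %>%
  select(nox, co, no2, o3) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)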
In this section we will attempt to predict the median monthly NOx levels using six different forecasting models. To do so, we first need to transform the data, computing a median level for each month of each year in the dataset; the figure below demonstrates what this trend looks like:
leeds_pollution %>%
  select(date, nox) %>%
  mutate(date = lubridate::as_date(date)) %>% # Drop the time component
  group_by(date) %>%
  summarise(nox = median(nox, na.rm = TRUE)) %>% # Daily median
  ungroup() %>%
  mutate(date = floor_date(date, unit = "month")) %>%
  group_by(date) %>%
  summarise(nox = median(nox, na.rm = TRUE)) %>% # Monthly median of the daily medians
  ungroup() -> time_series # Wrangle the data
time_series %>%
  plot_time_series(date, nox, .interactive = TRUE, .plotly_slider = TRUE, .title = "") # Plot the data

Figure 6. Median monthly NOx levels
At an overall level the median NOx concentration is trending downwards, which is perhaps a good thing. Given our wrangling we now have one observation per month; the implications that outliers/spikes have on these values will be discussed in the final section. We define our training and test splits for the modelling:
# Split the data for training:
set.seed(1234)
splits <- rsample::initial_time_split(time_series, prop = 0.9)

For this forecasting exercise six models will be trained: an auto-ARIMA, a boosted ARIMA, exponential smoothing (ETS), Prophet, linear regression, and a MARS model. We start by defining the ARIMA models:
# Model 1:
# Auto ARIMA:
arima <- arima_reg() %>%
  set_engine(engine = "auto_arima") %>%
  fit(nox ~ date, data = training(splits))

# Model 2:
# Boosted Auto ARIMA:
boosted_arima <- arima_boost(min_n = 2,
                             learn_rate = 0.0015) %>%
  set_engine(engine = "auto_arima_xgboost") %>%
  fit(nox ~ date + as.numeric(date) + factor(month(date, label = TRUE), ordered = FALSE),
      data = training(splits))

We now have model objects that can be called for some quick summaries:
arima

parsnip model object
Series: outcome
ARIMA(0,1,1)(2,1,1)[12]
Coefficients:
ma1 sar1 sar2 sma1
-0.7559 -0.0165 -0.0805 -0.8857
s.e. 0.0709 0.1327 0.1163 0.1887
sigma^2 = 45.75: log likelihood = -481.25
AIC=972.49 AICc=972.94 BIC=987.27
Next, the exponential smoothing model is defined:
# Exponential Smoothing:
model_fit_ets <- exp_smoothing() %>%
  set_engine(engine = "ets") %>%
  fit(nox ~ date, data = training(splits))

Next, the Prophet regression model:
# Prophet:
model_fit_prophet <- prophet_reg() %>%
  set_engine(engine = "prophet") %>%
  fit(nox ~ date, data = training(splits))

Next, the linear regression model:
# Linear regression
model_fit_lm <- linear_reg() %>%
  set_engine("lm") %>%
  fit(nox ~ as.numeric(date) + factor(month(date, label = TRUE), ordered = FALSE),
      data = training(splits))

Finally, we define the MARS model using a workflow:
# MARS:
model_spec_mars <- mars(mode = "regression") %>%
  set_engine("earth")

recipe_spec <- recipe(nox ~ date, data = training(splits)) %>%
  step_date(date, features = "month", ordinal = FALSE) %>%
  step_mutate(date_num = as.numeric(date)) %>%
  step_normalize(date_num) %>%
  step_rm(date)

wflw_fit_mars <- workflow() %>%
  add_recipe(recipe_spec) %>%
  add_model(model_spec_mars) %>%
  fit(training(splits))

It would be useful to bring these models together; this can be done using a model table:
model_tbl <- modeltime_table(
  arima,
  boosted_arima,
  model_fit_ets,
  model_fit_prophet,
  model_fit_lm,
  wflw_fit_mars) %>%
  mutate(.model_desc = case_when(.model_desc == "ARIMA(0,1,1)(2,1,1)[12]" ~ "ARIMA", # clean labels
                                 .model_desc == "ARIMA(0,1,1)(2,1,1)[12] W/ XGBOOST ERRORS" ~ "BOOSTED ARIMA",
                                 TRUE ~ .model_desc))

The model table contains the model ID, the model fit as a nested variable and the model description:

model_tbl

# Modeltime Table
# A tibble: 6 × 3
.model_id .model .model_desc
<int> <list> <chr>
1 1 <fit[+]> ARIMA
2 2 <fit[+]> BOOSTED ARIMA
3 3 <fit[+]> ETS(M,N,M)
4 4 <fit[+]> PROPHET
5 5 <fit[+]> LM
6 6 <workflow> EARTH
Next, we want to calibrate the models by generating predictions and residuals on the test set. The model table allows us to do this efficiently, with the use of the modeltime_calibrate() function:
model_tbl %>%
  modeltime_calibrate(new_data = testing(splits)) -> calibration_tbl

calibration_tbl

# Modeltime Table
# A tibble: 6 × 5
.model_id .model .model_desc .type .calibration_data
<int> <list> <chr> <chr> <list>
1 1 <fit[+]> ARIMA Test <tibble [18 × 4]>
2 2 <fit[+]> BOOSTED ARIMA Test <tibble [18 × 4]>
3 3 <fit[+]> ETS(M,N,M) Test <tibble [18 × 4]>
4 4 <fit[+]> PROPHET Test <tibble [18 × 4]>
5 5 <fit[+]> LM Test <tibble [18 × 4]>
6 6 <workflow> EARTH Test <tibble [18 × 4]>
The calibration table object now contains the predicted data nested in the .calibration_data variable. That variable holds the actual values and the predicted values, along with the residuals. These values can be explored by unnesting the .calibration_data column:
calibration_tbl %>%
  unnest(cols = c(.calibration_data))

# A tibble: 108 × 8
.model_id .model .model_desc .type date .actual .prediction
<int> <list> <chr> <chr> <date> <dbl> <dbl>
1 1 <fit[+]> ARIMA Test 2021-12-01 36.5 42.5
2 1 <fit[+]> ARIMA Test 2022-01-01 33.8 36.5
3 1 <fit[+]> ARIMA Test 2022-02-01 28.0 33.9
4 1 <fit[+]> ARIMA Test 2022-03-01 32.6 25.6
5 1 <fit[+]> ARIMA Test 2022-04-01 21.0 15.1
6 1 <fit[+]> ARIMA Test 2022-05-01 20.0 12.9
7 1 <fit[+]> ARIMA Test 2022-06-01 18.8 11.3
8 1 <fit[+]> ARIMA Test 2022-07-01 18.9 9.60
9 1 <fit[+]> ARIMA Test 2022-08-01 17.5 13.2
10 1 <fit[+]> ARIMA Test 2022-09-01 23.7 21.9
# ℹ 98 more rows
# ℹ 1 more variable: .residuals <dbl>
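Before calling the built-in accuracy table, we could reproduce a metric such as the MAE by hand from the unnested calibration data. A minimal sketch using the .residuals column shown above:

# Mean absolute error per model, computed from the calibration residuals (sketch)
calibration_tbl %>%
  unnest(cols = c(.calibration_data)) %>%
  group_by(.model_desc) %>%
  summarise(mae = mean(abs(.residuals), na.rm = TRUE)) %>%
  arrange(mae)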
We can forecast on the test set and compare the predicted values to the actual values:
calibration_tbl %>%
  modeltime_forecast(
    new_data = testing(splits),
    actual_data = time_series
  ) %>%
  plot_modeltime_forecast(
    .legend_max_width = 25, # For mobile screens
    .interactive = TRUE,
    .title = "")

Figure 7. Forecasted NOx levels vs. actual
Figure 7 plots the predicted NOx levels against the actual values. It seems that some of the predictions fall below zero. This is problematic, since concentrations cannot drop below zero. Let’s retrieve the model metrics:
calibration_tbl %>%
  modeltime_accuracy() %>%
  table_modeltime_accuracy(
    .interactive = FALSE
  )

Accuracy Table

| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
|---|---|---|---|---|---|---|---|---|
| 1 | ARIMA | Test | 5.04 | 22.49 | 0.99 | 25.80 | 5.92 | 0.81 |
| 2 | BOOSTED ARIMA | Test | 4.92 | 21.78 | 0.97 | 24.43 | 5.78 | 0.81 |
| 3 | ETS(M,N,M) | Test | 2.82 | 11.01 | 0.55 | 10.29 | 3.79 | 0.78 |
| 4 | PROPHET | Test | 6.30 | 28.47 | 1.24 | 36.64 | 7.27 | 0.81 |
| 5 | LM | Test | 5.30 | 19.81 | 1.04 | 17.50 | 6.92 | 0.80 |
| 6 | EARTH | Test | 9.86 | 44.21 | 1.94 | 68.29 | 11.27 | 0.82 |
According to the model metrics, the exponential smoothing model has the lowest mean absolute error. The R-squared values range from 0.78 to 0.82 across the models, so the fits are broadly comparable on that measure. Finally, we refit the models to the whole dataset and predict the next year’s values.
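The refit step itself is not shown above; it can be carried out with modeltime_refit(), which re-estimates each model in the calibration table on the full series. A minimal sketch:

refit_tbl <- calibration_tbl %>%
  modeltime_refit(data = time_series) # Refit every model to the complete dataset

With refit_tbl in hand, we can forecast a year ahead and plot the result: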
refit_tbl %>%
  modeltime_forecast(h = "1 year", actual_data = time_series) %>%
  plot_modeltime_forecast(
    .legend_max_width = 25, # For mobile screens
    .interactive = TRUE,
    .title = "")

Figure 8. Forecasted median NOx levels, June 2023 to May 2024
Zooming in to April 2023 onwards allows us to see the predicted NOx values for June 2023 to May 2024. Again, in terms of MAE, the ETS model performs best. Perhaps this should be the final candidate model chosen for production.
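If only the best model were to be carried forward, the calibration table could be filtered down before refitting. A minimal sketch that keeps whichever model has the lowest test-set MAE (here, ETS); best_id is a helper name introduced for illustration:

# Select the model with the lowest MAE and forecast with it alone (sketch)
best_id <- calibration_tbl %>%
  modeltime_accuracy() %>%
  slice_min(mae, n = 1) %>%
  pull(.model_id)

calibration_tbl %>%
  filter(.model_id == best_id) %>%
  modeltime_refit(data = time_series) %>%
  modeltime_forecast(h = "1 year", actual_data = time_series) %>%
  plot_modeltime_forecast(.interactive = TRUE, .title = "")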
This project explores air quality data for the Leeds city centre region, available from the openair package. The data was visualised using rose plots, and the temporal patterns of the pollutant levels were examined. Focusing on nitrogen oxides (NOx), six models were built to forecast the median monthly levels from June 2023 to May 2024. The best performing model according to the MAE metric was ETS, with a value of 2.82.
Future research may wish to change the forecasting window. For this project we considered the monthly level due to computational limitations, but there is no reason why one cannot forecast at the daily level. In addition, this project did not employ cross-validation techniques or additional seasonality variables; implementing these will likely yield better performance.
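As an illustration of the cross-validation point, the timetk package (already attached for plot_time_series()) provides time_series_cv() for building rolling-origin resamples. A minimal sketch, with the window sizes chosen purely for illustration:

# Rolling-origin resamples for time series cross-validation (sketch; window sizes are illustrative)
resamples_tscv <- time_series_cv(
  time_series,
  date_var   = date,
  initial    = "10 years",
  assess     = "12 months",
  skip       = "6 months",
  cumulative = TRUE
)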
sessionInfo()

R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] earth_5.3.2 plotmo_3.6.2 TeachingDemos_2.12 plotrix_3.8-2
[5] Formula_1.2-5 yardstick_1.2.0 workflowsets_1.0.1 workflows_1.1.3
[9] tune_1.1.1 rsample_1.1.1 recipes_1.0.6 parsnip_1.1.0
[13] modeldata_1.1.0 infer_1.0.4 dials_1.2.0 scales_1.2.1
[17] broom_1.0.4 tidymodels_1.0.0 timetk_2.8.3 modeltime_1.2.6
[21] skimr_2.1.5 openair_2.16-0 lubridate_1.9.2 forcats_1.0.0
[25] stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1 readr_2.1.4
[29] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.14 jsonlite_1.8.4
[4] magrittr_2.0.3 rmarkdown_2.21 vctrs_0.6.2
[7] base64enc_0.1-3 janitor_2.2.0 htmltools_0.5.5
[10] curl_5.0.0 xgboost_1.7.5.1 TTR_0.24.3
[13] sass_0.4.5 parallelly_1.35.0 StanHeaders_2.21.0-7
[16] htmlwidgets_1.6.2 plotly_4.10.1 zoo_1.8-12
[19] gt_0.9.0 lifecycle_1.0.3 iterators_1.0.14
[22] pkgconfig_2.0.3 Matrix_1.5-4 R6_2.5.1
[25] fastmap_1.1.1 snakecase_0.11.0 future_1.32.0
[28] digest_0.6.31 colorspace_2.1-0 furrr_0.3.1
[31] ps_1.7.5 crosstalk_1.2.0 labeling_0.4.2
[34] fansi_1.0.4 timechange_0.2.0 httr_1.4.5
[37] mgcv_1.8-42 compiler_4.3.0 withr_2.5.0
[40] clock_0.6.1 inline_0.3.19 backports_1.4.1
[43] tseries_0.10-53 pkgbuild_1.4.0 hexbin_1.28.3
[46] maps_3.4.1 MASS_7.3-58.4 lava_1.7.2.1
[49] loo_2.6.0 prophet_1.0 tools_4.3.0
[52] lmtest_0.9-40 quantmod_0.4.22 future.apply_1.10.0
[55] nnet_7.3-18 glue_1.6.2 quadprog_1.5-8
[58] callr_3.7.3 nlme_3.1-162 grid_4.3.0
[61] cluster_2.1.4 generics_0.1.3 gtable_0.3.3
[64] tzdb_0.3.0 class_7.3-21 data.table_1.14.8
[67] hms_1.1.3 xml2_1.3.3 utf8_1.2.3
[70] foreach_1.5.2 pillar_1.9.0 splines_4.3.0
[73] lhs_1.1.6 lattice_0.21-8 survival_3.5-5
[76] deldir_1.0-6 tidyselect_1.2.0 knitr_1.42
[79] gridExtra_2.3 urca_1.3-3 stats4_4.3.0
[82] forecast_8.21 xfun_0.39 hardhat_1.3.0
[85] matrixStats_0.63.0 timeDate_4022.108 rstan_2.21.8
[88] stringi_1.7.12 DiceDesign_1.9 lazyeval_0.2.2
[91] yaml_2.3.7 pacman_0.5.1 evaluate_0.20
[94] codetools_0.2-19 interp_1.1-4 cli_3.6.1
[97] RcppParallel_5.1.7 rpart_4.1.19 processx_3.8.1
[100] repr_1.1.6 munsell_0.5.0 Rcpp_1.0.10
[103] globals_0.16.2 mapproj_1.2.11 png_0.1-8
[106] parallel_4.3.0 ellipsis_0.3.2 fracdiff_1.5-2
[109] gower_1.0.1 assertthat_0.2.1 prettyunits_1.1.1
[112] latticeExtra_0.6-30 jpeg_0.1-10 GPfit_1.0-8
[115] listenv_0.9.0 viridisLite_0.4.1 ipred_0.9-14
[118] xts_0.13.1 prodlim_2023.03.31 crayon_1.5.2
[121] rlang_1.1.1