This project explores air quality data for the Leeds region of the United Kingdom. The dataset is retrieved with the openair package and contains information on pollutant species spanning 2009 to 2023. First, the project explores the dataset with an emphasis on data visualisation, to uncover pollution patterns. Second, the data is modelled with a view to forecasting future levels of nitrogen oxides (NOx). Finally, the findings are summarised and future research directions are recommended.
Let’s begin by loading the data. The openair package lets us retrieve air quality information for a range of years once we specify a monitoring site we are interested in. For this project we focus on Leeds air quality data, therefore we specify the following:
importAURN(site = "leed", year = 2009:2023) -> leeds_pollution # Import data for 2009 to 2023

The dataset contains a range of variables, such as the date, the pollutant species (nox, no2, etc.), and so on:
glimpse(leeds_pollution)

Rows: 125,928
Columns: 18
$ site <chr> "Leeds Centre", "Leeds Centre", "Leeds Centre", "Leeds Centre…
$ code <chr> "LEED", "LEED", "LEED", "LEED", "LEED", "LEED", "LEED", "LEED…
$ date <dttm> 2009-01-01 00:00:00, 2009-01-01 01:00:00, 2009-01-01 02:00:0…
$ co <dbl> 0.3, 0.3, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0…
$ nox <dbl> 71, 76, 36, 27, 29, 25, 25, 27, 32, 32, 31, 36, 42, 42, 46, 5…
$ no2 <dbl> 48, 48, 27, 21, 23, 19, 21, 23, 27, 25, 23, 25, 31, 31, 34, 4…
$ no <dbl> 15, 19, 6, 4, 4, 4, 3, 3, 4, 5, 5, 8, 8, 8, 8, 9, 9, 8, 13, 1…
$ o3 <dbl> 0, 0, 4, 14, 14, 14, 14, 14, 10, 12, 12, 12, 14, 14, 14, 10, …
$ so2 <dbl> 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 0, 0, 0, 0, 3, 0…
$ pm10 <dbl> 85, 89, 75, 50, 47, 50, 51, 48, 46, 47, 47, 50, 51, 52, 55, 5…
$ pm2.5 <dbl> 76, 79, 69, 50, 47, 46, 48, 47, 45, 43, 45, 45, 46, 44, 47, 4…
$ v10 <dbl> 11, 9, 12, 11, 12, 13, 12, 11, 11, 12, 12, 13, 13, 14, 13, 12…
$ v2.5 <dbl> 11, 10, 12, 13, 14, 13, 13, 13, 13, 11, 12, 12, 14, 14, 14, 1…
$ nv10 <dbl> 74, 80, 63, 39, 35, 37, 39, 37, 35, 35, 35, 37, 38, 38, 42, 4…
$ nv2.5 <dbl> 65, 69, 57, 37, 33, 33, 35, 34, 32, 32, 33, 33, 32, 30, 33, 3…
$ ws <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ wd <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ air_temp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
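The first rows suggest the meteorological variables (ws, wd, air_temp) contain missing values. Before plotting, it may be worth checking how complete each column is. A minimal sketch using dplyr and tidyr (both attached via the tidyverse):

# Proportion of missing values per column, highest first (sketch)
leeds_pollution %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "prop_missing") %>%
  arrange(desc(prop_missing))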
Let’s take a look at where most of the wind comes from according to the data collected. The wind rose plot demonstrates this pattern across our whole dataset, i.e. from 2009 to 2023:
windRose(leeds_pollution) 
The figure shows that the wind comes predominantly from the West, Northwest and Southwest. The mean wind speed across the sample is 3.7 m s⁻¹. It might be interesting to see how the pattern has evolved over time:
windRose(leeds_pollution %>%
           filter(year(date) >= 2018),
         type = "year", layout = c(4, 2)) # Only looking at 2018 onwards
The pattern appears fairly consistent for the years 2018 to 2023, with the mean wind speed ranging from 3.5 to 3.8 m s⁻¹. We can get an idea of how wind direction and pollutant species interact by making a pollution rose plot. Let’s focus on NOx, otherwise known as nitrogen oxides, a pollutant species that is measured by the Leeds city centre station:
pollutionRose(leeds_pollution, pollutant = "nox", alpha = 0.8)
The figure above demonstrates that the higher values of NOx come from the west to south sectors, where concentrations of 350 to 1092 occur most frequently by wind direction. In other words, a higher percentage of these high values comes from these directions. How do these values fluctuate over time? The figure below looks at the NOx values in 2022:
calendarPlot(leeds_pollution, pollutant = "nox", year = 2022)
It seems the highest NOx concentrations were on the 15th and 18th of January 2022. Some other months experienced elevated levels of NOx, although not to the same scale; these include some days in March and November. How do these high levels of NOx in January compare to the other pollutant species? Let’s consider CO, NO2 and O3:
timePlot(selectByDate(leeds_pollution, year = 2022, month = "jan"),
         pollutant = c("nox", "co", "no2", "o3"),
         y.relation = "free", smooth = TRUE)
We can see that the other pollutant species follow a similar pattern, reaching peaks between January 10 and January 20. This trend is least evident for O3, which fluctuates more steadily across the month.
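To put a number on this visual similarity, we could compute pairwise correlations between the pollutant series for January 2022. A minimal sketch reusing selectByDate() from openair together with base R’s cor():

# Pairwise correlations between pollutants in January 2022 (sketch; complete pairs only)
selectByDate(leeds_pollution, year = 2022, month = "jan") %>%
  select(nox, co, no2, o3) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)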
In this section we will attempt to predict the median monthly NOx levels using six different forecasting models. To do so, we first need to transform the data, computing a median level for each month of each year in the dataset; the figure below demonstrates what this trend looks like:
leeds_pollution %>%
  select(date, nox) %>%
  mutate(date = lubridate::as_date(date)) %>% # Drop the time component
  group_by(date) %>%
  summarise(nox = median(nox, na.rm = TRUE)) %>% # Daily median
  ungroup() %>%
  mutate(date = floor_date(date, unit = "month")) %>%
  group_by(date) %>%
  summarise(nox = median(nox, na.rm = TRUE)) %>% # Monthly median of the daily medians
  ungroup() -> time_series # Wrangle the data
time_series %>%
  plot_time_series(date, nox, .interactive = TRUE, .plotly_slider = TRUE, .title = "") # Plot the data

Figure 6. Median monthly NOx levels
At an overall level the median NOx concentration is trending downwards, which is perhaps a good thing. Given our wrangling we now have one observation per month; the implications that outliers/spikes have on these values will be discussed in the final section. We define our training and test splits for the modelling:
# Split the data for training:
set.seed(1234)
splits <- rsample::initial_time_split(time_series, prop = 0.9)

For this forecasting exercise six models will be trained: an auto-ARIMA, a boosted ARIMA, exponential smoothing (ETS), Prophet, linear regression, and a MARS model. We start by defining the ARIMA models:
# Model 1:
# Auto ARIMA:
arima <- arima_reg() %>%
  set_engine(engine = "auto_arima") %>%
  fit(nox ~ date, data = training(splits))

# Model 2:
# Boosted Auto ARIMA:
boosted_arima <- arima_boost(min_n = 2,
                             learn_rate = 0.0015) %>%
  set_engine(engine = "auto_arima_xgboost") %>%
  fit(nox ~ date + as.numeric(date) + factor(month(date, label = TRUE), ordered = FALSE),
      data = training(splits))

We now have model objects that can be called for some quick summaries:
arima

parsnip model object
Series: outcome
ARIMA(0,1,1)(2,1,1)[12]
Coefficients:
ma1 sar1 sar2 sma1
-0.7559 -0.0165 -0.0805 -0.8857
s.e. 0.0709 0.1327 0.1163 0.1887
sigma^2 = 45.75: log likelihood = -481.25
AIC=972.49 AICc=972.94 BIC=987.27
Next, the exponential smoothing model is defined:
# Exponential Smoothing:
model_fit_ets <- exp_smoothing() %>%
  set_engine(engine = "ets") %>%
  fit(nox ~ date, data = training(splits))

Next, the Prophet regression model:
# Prophet:
model_fit_prophet <- prophet_reg() %>%
  set_engine(engine = "prophet") %>%
  fit(nox ~ date, data = training(splits))

Next, the linear regression model:
# Linear regression
model_fit_lm <- linear_reg() %>%
  set_engine("lm") %>%
  fit(nox ~ as.numeric(date) + factor(month(date, label = TRUE), ordered = FALSE),
      data = training(splits))

Finally, we define the MARS model using a workflow:
# MARS:
model_spec_mars <- mars(mode = "regression") %>%
  set_engine("earth")

recipe_spec <- recipe(nox ~ date, data = training(splits)) %>%
  step_date(date, features = "month", ordinal = FALSE) %>%
  step_mutate(date_num = as.numeric(date)) %>%
  step_normalize(date_num) %>%
  step_rm(date)

wflw_fit_mars <- workflow() %>%
  add_recipe(recipe_spec) %>%
  add_model(model_spec_mars) %>%
  fit(training(splits))

It would be useful to bring these models together; this can be done using a model table:
model_tbl <- modeltime_table(
  arima,
  boosted_arima,
  model_fit_ets,
  model_fit_prophet,
  model_fit_lm,
  wflw_fit_mars) %>%
  mutate(.model_desc = case_when(.model_desc == "ARIMA(0,1,1)(2,1,1)[12]" ~ "ARIMA", # clean labels
                                 .model_desc == "ARIMA(0,1,1)(2,1,1)[12] W/ XGBOOST ERRORS" ~ "BOOSTED ARIMA",
                                 TRUE ~ .model_desc))

The model table contains the model ID, the model fit as a nested variable and the model description:

model_tbl

# Modeltime Table
# A tibble: 6 × 3
.model_id .model .model_desc
<int> <list> <chr>
1 1 <fit[+]> ARIMA
2 2 <fit[+]> BOOSTED ARIMA
3 3 <fit[+]> ETS(M,N,M)
4 4 <fit[+]> PROPHET
5 5 <fit[+]> LM
6 6 <workflow> EARTH
Next, we want to calibrate the models by generating predictions and residuals on the test set. The model table allows us to do this efficiently, with the use of the modeltime_calibrate() function:
model_tbl %>%
  modeltime_calibrate(new_data = testing(splits)) -> calibration_tbl

calibration_tbl

# Modeltime Table
# A tibble: 6 × 5
.model_id .model .model_desc .type .calibration_data
<int> <list> <chr> <chr> <list>
1 1 <fit[+]> ARIMA Test <tibble [18 × 4]>
2 2 <fit[+]> BOOSTED ARIMA Test <tibble [18 × 4]>
3 3 <fit[+]> ETS(M,N,M) Test <tibble [18 × 4]>
4 4 <fit[+]> PROPHET Test <tibble [18 × 4]>
5 5 <fit[+]> LM Test <tibble [18 × 4]>
6 6 <workflow> EARTH Test <tibble [18 × 4]>
The calibration table object now contains the predicted data nested in the .calibration_data variable. That variable holds the actual values and the predicted values, along with the residuals. These values can be explored by unnesting the .calibration_data column:
calibration_tbl %>%
  unnest(cols = c(.calibration_data))

# A tibble: 108 × 8
.model_id .model .model_desc .type date .actual .prediction
<int> <list> <chr> <chr> <date> <dbl> <dbl>
1 1 <fit[+]> ARIMA Test 2021-12-01 36.5 42.5
2 1 <fit[+]> ARIMA Test 2022-01-01 33.8 36.5
3 1 <fit[+]> ARIMA Test 2022-02-01 28.0 33.9
4 1 <fit[+]> ARIMA Test 2022-03-01 32.6 25.6
5 1 <fit[+]> ARIMA Test 2022-04-01 21.0 15.1
6 1 <fit[+]> ARIMA Test 2022-05-01 20.0 12.9
7 1 <fit[+]> ARIMA Test 2022-06-01 18.8 11.3
8 1 <fit[+]> ARIMA Test 2022-07-01 18.9 9.60
9 1 <fit[+]> ARIMA Test 2022-08-01 17.5 13.2
10 1 <fit[+]> ARIMA Test 2022-09-01 23.7 21.9
# ℹ 98 more rows
# ℹ 1 more variable: .residuals <dbl>
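Before calling the built-in accuracy table, we could reproduce a metric such as the MAE by hand from the unnested calibration data. A minimal sketch using the .residuals column shown above:

# Mean absolute error per model, computed from the calibration residuals (sketch)
calibration_tbl %>%
  unnest(cols = c(.calibration_data)) %>%
  group_by(.model_desc) %>%
  summarise(mae = mean(abs(.residuals), na.rm = TRUE)) %>%
  arrange(mae)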
We can forecast on the test set and compare the predicted values to the actual values:
calibration_tbl %>%
  modeltime_forecast(
    new_data = testing(splits),
    actual_data = time_series
  ) %>%
  plot_modeltime_forecast(
    .legend_max_width = 25, # For mobile screens
    .interactive = TRUE,
    .title = "")

Figure 7. Forecasted NOx levels vs. actual
Figure 7 plots the predicted NOx levels against the actual values. It seems that some of the predictions fall below zero. This is problematic, since concentrations cannot drop below zero. Let’s retrieve the model metrics:
calibration_tbl %>%
  modeltime_accuracy() %>%
  table_modeltime_accuracy(
    .interactive = FALSE
  )

Accuracy Table

| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
|---|---|---|---|---|---|---|---|---|
| 1 | ARIMA | Test | 5.04 | 22.49 | 0.99 | 25.80 | 5.92 | 0.81 |
| 2 | BOOSTED ARIMA | Test | 4.92 | 21.78 | 0.97 | 24.43 | 5.78 | 0.81 |
| 3 | ETS(M,N,M) | Test | 2.82 | 11.01 | 0.55 | 10.29 | 3.79 | 0.78 |
| 4 | PROPHET | Test | 6.30 | 28.47 | 1.24 | 36.64 | 7.27 | 0.81 |
| 5 | LM | Test | 5.30 | 19.81 | 1.04 | 17.50 | 6.92 | 0.80 |
| 6 | EARTH | Test | 9.86 | 44.21 | 1.94 | 68.29 | 11.27 | 0.82 |
According to the model metrics, the exponential smoothing model has the lowest mean absolute error. The R-squared values range from 0.78 to 0.82 across the models, so the fits are broadly comparable on that measure. Finally, we refit the models to the whole dataset and predict the next year’s values.
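The refit step itself is not shown above; it can be carried out with modeltime_refit(), which re-estimates each model in the calibration table on the full series. A minimal sketch:

refit_tbl <- calibration_tbl %>%
  modeltime_refit(data = time_series) # Refit every model to the complete dataset

With refit_tbl in hand, we can forecast a year ahead and plot the result: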
refit_tbl %>%
  modeltime_forecast(h = "1 year", actual_data = time_series) %>%
  plot_modeltime_forecast(
    .legend_max_width = 25, # For mobile screens
    .interactive = TRUE,
    .title = "")

Figure 8. Forecasted median NOx levels, June 2023 to May 2024
Zooming in to April 2023 onwards allows us to see the predicted NOx values for June 2023 to May 2024. Again, in terms of MAE, the ETS model performs best. Perhaps this should be the final candidate model chosen for production.
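If only the best model were to be carried forward, the calibration table could be filtered down before refitting. A minimal sketch that keeps whichever model has the lowest test-set MAE (here, ETS); best_id is a helper name introduced for illustration:

# Select the model with the lowest MAE and forecast with it alone (sketch)
best_id <- calibration_tbl %>%
  modeltime_accuracy() %>%
  slice_min(mae, n = 1) %>%
  pull(.model_id)

calibration_tbl %>%
  filter(.model_id == best_id) %>%
  modeltime_refit(data = time_series) %>%
  modeltime_forecast(h = "1 year", actual_data = time_series) %>%
  plot_modeltime_forecast(.interactive = TRUE, .title = "")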
This project explores air quality data for the Leeds city centre region, available from the openair package. The data was visualised using rose plots, and the temporal patterns of the pollutant levels were examined. Focusing on nitrogen oxides (NOx), six models were built to forecast the median monthly levels from June 2023 to May 2024. The best performing model according to the MAE metric was ETS, with a value of 2.82.
Future research may wish to change the forecasting window. For this project we considered the monthly level due to computational limitations, but there is no reason why one cannot forecast at the daily level. In addition, this project did not employ cross-validation techniques or additional seasonality variables; implementing these will likely yield better performance.
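As an illustration of the cross-validation point, the timetk package (already attached for plot_time_series()) provides time_series_cv() for building rolling-origin resamples. A minimal sketch, with the window sizes chosen purely for illustration:

# Rolling-origin resamples for time series cross-validation (sketch; window sizes are illustrative)
resamples_tscv <- time_series_cv(
  time_series,
  date_var   = date,
  initial    = "10 years",
  assess     = "12 months",
  skip       = "6 months",
  cumulative = TRUE
)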
sessionInfo()

R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] earth_5.3.2 plotmo_3.6.2 TeachingDemos_2.12 plotrix_3.8-2
[5] Formula_1.2-5 yardstick_1.2.0 workflowsets_1.0.1 workflows_1.1.3
[9] tune_1.1.1 rsample_1.1.1 recipes_1.0.6 parsnip_1.1.0
[13] modeldata_1.1.0 infer_1.0.4 dials_1.2.0 scales_1.2.1
[17] broom_1.0.4 tidymodels_1.0.0 timetk_2.8.3 modeltime_1.2.6
[21] skimr_2.1.5 openair_2.16-0 lubridate_1.9.2 forcats_1.0.0
[25] stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1 readr_2.1.4
[29] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.14 jsonlite_1.8.4
[4] magrittr_2.0.3 rmarkdown_2.21 vctrs_0.6.2
[7] base64enc_0.1-3 janitor_2.2.0 htmltools_0.5.5
[10] curl_5.0.0 xgboost_1.7.5.1 TTR_0.24.3
[13] sass_0.4.5 parallelly_1.35.0 StanHeaders_2.21.0-7
[16] htmlwidgets_1.6.2 plotly_4.10.1 zoo_1.8-12
[19] gt_0.9.0 lifecycle_1.0.3 iterators_1.0.14
[22] pkgconfig_2.0.3 Matrix_1.5-4 R6_2.5.1
[25] fastmap_1.1.1 snakecase_0.11.0 future_1.32.0
[28] digest_0.6.31 colorspace_2.1-0 furrr_0.3.1
[31] ps_1.7.5 crosstalk_1.2.0 labeling_0.4.2
[34] fansi_1.0.4 timechange_0.2.0 httr_1.4.5
[37] mgcv_1.8-42 compiler_4.3.0 withr_2.5.0
[40] clock_0.6.1 inline_0.3.19 backports_1.4.1
[43] tseries_0.10-53 pkgbuild_1.4.0 hexbin_1.28.3
[46] maps_3.4.1 MASS_7.3-58.4 lava_1.7.2.1
[49] loo_2.6.0 prophet_1.0 tools_4.3.0
[52] lmtest_0.9-40 quantmod_0.4.22 future.apply_1.10.0
[55] nnet_7.3-18 glue_1.6.2 quadprog_1.5-8
[58] callr_3.7.3 nlme_3.1-162 grid_4.3.0
[61] cluster_2.1.4 generics_0.1.3 gtable_0.3.3
[64] tzdb_0.3.0 class_7.3-21 data.table_1.14.8
[67] hms_1.1.3 xml2_1.3.3 utf8_1.2.3
[70] foreach_1.5.2 pillar_1.9.0 splines_4.3.0
[73] lhs_1.1.6 lattice_0.21-8 survival_3.5-5
[76] deldir_1.0-6 tidyselect_1.2.0 knitr_1.42
[79] gridExtra_2.3 urca_1.3-3 stats4_4.3.0
[82] forecast_8.21 xfun_0.39 hardhat_1.3.0
[85] matrixStats_0.63.0 timeDate_4022.108 rstan_2.21.8
[88] stringi_1.7.12 DiceDesign_1.9 lazyeval_0.2.2
[91] yaml_2.3.7 pacman_0.5.1 evaluate_0.20
[94] codetools_0.2-19 interp_1.1-4 cli_3.6.1
[97] RcppParallel_5.1.7 rpart_4.1.19 processx_3.8.1
[100] repr_1.1.6 munsell_0.5.0 Rcpp_1.0.10
[103] globals_0.16.2 mapproj_1.2.11 png_0.1-8
[106] parallel_4.3.0 ellipsis_0.3.2 fracdiff_1.5-2
[109] gower_1.0.1 assertthat_0.2.1 prettyunits_1.1.1
[112] latticeExtra_0.6-30 jpeg_0.1-10 GPfit_1.0-8
[115] listenv_0.9.0 viridisLite_0.4.1 ipred_0.9-14
[118] xts_0.13.1 prodlim_2023.03.31 crayon_1.5.2
[121] rlang_1.1.1