This project focuses on time series forecasting of food commodity prices in Ethiopia. The data was collected via an API from The Humanitarian Data Exchange, which aggregates humanitarian data. The Ethiopian commodity prices stored in The Humanitarian Data Exchange were in turn sourced from the Famine Early Warning Systems Network (FEWS NET), a food insecurity monitoring network created in 1985 by USAID and the US Department of State after famines in East and West Africa. The methodology of this project matters because it gives policy makers a tool they can use to make better decisions and reduce uncertainty surrounding food security in the region. First, the data is collected via the API and analysed to uncover patterns and trends in commodity prices; in this part, the data structure is explained and data quality discussed and visualised. Second, we model the data using the modeltime framework with a horizon of 18 weeks. Finally, the project is summarized and future directions are discussed.
Data Retrieval & Structure
The first thing we would like to do is retrieve the data, which can easily be done with the rhdx package. We query the Humanitarian Data Exchange for weekly commodity prices in Ethiopia. Let's take a look at the output:
# Using the rhdx API we can search for Ethiopia food prices:
list_of_ds <- search_datasets("Ethiopia Weekly FEWS NET Staple Food Price Data", rows = 10)
ds <- pluck(list_of_ds, 1)

# Extract the resources/data:
ethiopia_prices <- get_resource(ds, 1)
ethiopia_prices_df <- read_resource(ethiopia_prices, simplify_json = TRUE, download_folder = tempdir())

ethiopia_prices_df %>% head(5)
USAID Freeze
The data retrieval code above may not work as expected due to the freeze on USAID by the Trump administration (this Quarto document was rendered using a backup taken on January 25th).
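Should the API endpoint become unavailable, the analysis can be run from a local backup instead. A minimal sketch, where the file name, location, and CSV format are assumptions and not part of the original pipeline:

# Hypothetical fallback: read a locally saved backup of the HDX resource
# (the path below is illustrative, not the file actually used to render this document).
library(readr)
ethiopia_prices_df <- read_csv("data/ethiopia_prices_backup_2025-01-25.csv")
ethiopia_prices_df %>% head(5)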
The data has a few identifiers, most notably the market in which the commodity price was recorded, as well as some geographical markers (admin level 1, longitude and latitude). The granularity of the data is as follows:
# Create values for funnel:
funnel <- data.frame(
  gran = c("Admin 1 (region)", "Market", "Product (commodity)"),
  value = c(100, 50, 10)
)

# Visualise the funnel:
funnel |>
  e_charts() |>
  e_funnel(value, gran) |>
  e_title("Data Hierarchy")
Figure 1. Data Hierarchy
The admin_1 column denotes the region of Ethiopia, the market column the market in which the price was recorded, and the product column the commodity for which the price was recorded. The price is recorded in Ethiopian Birr (ETB) per kg. The table below provides a breakdown of all the region, market and product combinations:
# Create the base data frame:ethiopia_prices_df %>%select(admin_1, period_date, market, product, value) %>%mutate(value =as.numeric(value), year =as.numeric(year(period_date))) %>%filter( product %in%c('Wheat Grain','Wheat Flour','Beans (Haricot)','Rice (Milled)','Maize Grain (White)' )) %>%group_by(admin_1, market, product) %>%summarize(across(where(is.numeric), list)) %>%select(-year) -> sparkline_df# Define some color palettes: all_colours <-colorRampPalette(c("tomato", "skyblue", "yellow2"))(100)set.seed(600)choose_colours <-sample(all_colours, size =9)# Get a unique color by region (admin_1):sparkline_df %>%ungroup() %>%select(admin_1) %>%unique() %>%cbind(choose_colours)-> color_ad_lvlsparkline_df %>%left_join(color_ad_lvl, by="admin_1") -> sparkline_df# Create shared data frame:shared_data <- crosstalk::SharedData$new(sparkline_df)# Bring all the elements together:bscols(widths =c(12),filter_select("admin_1", "Admin Lvl.", shared_data, ~admin_1),bscols(widths =c(6, 6),filter_select("market", "Market", shared_data, ~market),filter_select("product", "Product", shared_data, ~product) ),reactable( shared_data,theme =clean(),compact =TRUE,defaultPageSize =5,width ='100%',showSortIcon =FALSE,columns =list(admin_1 =colDef(name ="Admin Lvl. 1",maxWidth =150,cell =pill_buttons(data = sparkline_df,color_ref ='choose_colours' ) ),choose_colours =colDef(show =FALSE),market =colDef(maxWidth =100, header ="Market"),product =colDef(maxWidth =100, header ="Commodity"),value =colDef(header ="ETB/kg",minWidth =400, cell =react_sparkline(data = shared_data,height =150,line_width =1.5,bandline ='innerquartiles',bandline_color ='forestgreen',bandline_opacity =0.6,highlight_points =highlight_points(min ='blue', max ='red'),margin = reactablefmtr::margin(t=15,r=2,b=15,l=2),tooltip_type =2 ) ) ) ) )
Figure 2. ETB/kg by admin level 1, market, and commodity
The table above demonstrates the broad range of data quality across the data set. Some admin level, market and commodity combinations have no recorded price; others have sporadic entries with periods of missing data (for example the Afar admin level). Missing data is common in real life, but how we handle those values has implications for forecasting later on. For now, let's consider the least complete series in the market with the most complete data (Addis Ababa, Merkato and Wheat Grain) alongside a series which is very incomplete (the Jinka market and Haricot Beans):
# Filter for Addis Ababa, Merkato & Wheat Grain only:
ethiopia_prices_df %>%
  filter(
    product %in% c('Wheat Grain', 'Wheat Flour', 'Beans (Haricot)', 'Rice (Milled)', 'Maize Grain (White)'),
    market == 'Addis Ababa, Merkato'
  ) %>%
  mutate(date = lubridate::as_date(period_date), value = as.numeric(value)) %>%
  filter(product == 'Wheat Grain') %>%
  select(date, value) -> to_impute_wheat

# Missing value plot
ggplot_na_distribution(to_impute_wheat$value) -> plot1

# We do the same for Jinka and Haricot Beans:
ethiopia_prices_df %>%
  filter(
    product == 'Beans (Haricot)',
    market == 'Jinka',
    admin_1 == 'South Ethiopia'
  ) %>%
  mutate(date = lubridate::as_date(period_date), value = as.numeric(value)) %>%
  select(date, value) -> to_impute_jinka

ggplot_na_distribution(to_impute_jinka$value) -> plot2

# Bring the plots together:
plot_grid(plot1, plot2)
As you can imagine, the impact of imputing the wheat price for the Addis Ababa market is lower than imputing the price of Haricot Beans in the Jinka market. For now we proceed with the first series, where we only have to impute one value. We can visualise the imputed price alongside all the other prices over time:
# Isolate the value to impute:
imp <- na_interpolation(to_impute_wheat)
imp %>% mutate(product = 'Wheat Grain') -> imp

# Create plot data frame and bind to imputed data:
ethiopia_prices_df %>%
  filter(
    product %in% c('Wheat Grain', 'Wheat Flour', 'Beans (Haricot)', 'Rice (Milled)', 'Maize Grain (White)'),
    market == 'Addis Ababa, Merkato'
  ) %>%
  mutate(date = lubridate::as_date(period_date), value = as.numeric(value)) -> plot_data

# Plot with Highcharter:
hchart(
  plot_data %>%
    select(date, value, product) %>%
    filter(product != 'Wheat Grain') %>%
    rbind(imp) %>%
    filter(date >= '2019-12-25'),
  "line", hcaes(x = date, y = value, group = product)
) %>%
  hc_legend(
    layout = "vertical", align = "left", verticalAlign = "top", floating = TRUE,
    x = 90, y = 30,
    title = list(text = "Commodity", style = list(fontWeight = "bold"))
  ) %>%
  hc_title(text = "Food Prices in Addis Ababa, Merkato", align = "center") %>%
  hc_yAxis(title = list(text = "Price (ETB)/Kg")) %>%
  hc_xAxis(title = list(text = ""))
Figure 4. Food commodity prices over time in Addis Ababa, Merkato.
Given the spatial identifiers in the data, we can also take a look at the spatial distribution of the data, which gives us additional insight into what data is missing for which regions:
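The interactive map itself is not reproduced in this section; below is a minimal sketch of how the market locations could be mapped with leaflet. The longitude and latitude column names are assumptions based on the identifiers described earlier, and this is an illustration rather than the exact map code used:

# Sketch: plot each distinct market location on an interactive map.
library(dplyr)
library(leaflet)

ethiopia_prices_df %>%
  mutate(longitude = as.numeric(longitude), latitude = as.numeric(latitude)) %>%
  distinct(admin_1, market, longitude, latitude) %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~longitude, lat = ~latitude,
    label = ~paste(market, admin_1, sep = " - "),
    radius = 4
  )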
Apart from the Beneshagul Gumu region, which is missing from the dataset entirely, missing data is concentrated in the south western region of the country and in the wheat flour commodity. A common practice when working with time series data is to decompose a series into its components: seasonal, trend, and remainder (or noise). These allow for a study of the time series' behaviour and have implications for the modelling process; for instance, the seasonal component will influence the seasonality parameter in an ETS model. Focusing on the commodity prices for Addis Ababa, Merkato, we decompose the time series for the different commodities:
# Create base plot data for seasonal decomposition:plot_data %>%select(date, value, product) %>%filter(product !='Wheat Grain') %>%rbind(imp) %>%filter(date >='2020-06-10') -> base_data# Unique products:products <-c("Beans (Haricot)" ,"Maize Grain (White)","Rice (Milled)","Wheat Flour","Wheat Grain")# Create decomposition for all products:create_all_decompositions <-function(data) { data %>%group_by(product) %>%group_modify(~{ ts_data <-ts(.x$value, frequency =52) decomp <-stl(log(ts_data), "periodic")data.frame(date = .x$date,data =as.numeric(log(ts_data)),seasonal =as.numeric(decomp$time.series[, "seasonal"]),trend =as.numeric(decomp$time.series[, "trend"]),remainder =as.numeric(decomp$time.series[, "remainder"]) ) })}decomp_data <-create_all_decompositions(base_data)# Set the series to start with Wheat Grain:decomp_data <- decomp_data %>%arrange(product =='Wheat Grain')# Create the shared data frame:shared_data <- SharedData$new(decomp_data, ~product)# Create the different sub plots:p1 <-plot_ly(shared_data, x =~date, y =~data, color =I("purple"), type ='scatter', mode ='lines',hoverongaps =FALSE,hovertemplate =paste0("<b>Log of Original Data</b><br>","Date: %{x|%Y-%m-%d}<br>","Value: %{y:.3f}","<extra></extra>" )) %>%layout(yaxis =list(title ="Log of Original Data"))p2 <-plot_ly(shared_data, x =~date, y =~seasonal, color =I("#90ED7D"), type ='scatter', mode ='lines',hoverongaps =FALSE,hovertemplate =paste0("<b>Seasonal Component</b><br>","Date: %{x|%Y-%m-%d}<br>","Value: %{y:.3f}","<extra></extra>" )) %>%layout(yaxis =list(title ="Seasonal Component"))p3 <-plot_ly(shared_data, x =~date, y =~trend, color =I("#7CB5EC"), type ='scatter', mode ='lines',hoverongaps =FALSE,hovertemplate =paste0("<b>Trend Component</b><br>","Date: %{x|%Y-%m-%d}<br>","Value: %{y:.3f}","<extra></extra>" )) %>%layout(yaxis =list(title ="Trend Component"))p4 <-plot_ly(shared_data, x =~date, y =~remainder, color =I("#F7A35C"), type ='scatter', mode ='lines',hoverongaps =FALSE,hovertemplate =paste0("<b>Remainder Component</b><br>","Date: %{x|%Y-%m-%d}<br>","Value: %{y:.3f}","<extra></extra>" )) %>%layout(yaxis =list(title ="Remainder Component"))# Bring the sub plots together:subplot_data <-subplot(p1, p2, p3, p4, nrows =4, shareX =TRUE) %>%layout(title =list(text ="Seasonal Decomposition of Time Series",y =0.98 ),showlegend =FALSE,autosize =TRUE,width =850,height =800, margin =list(l =50,r =50,t =60, b =20, pad =2 ),yaxis =list(title ="Log of Original Data"),yaxis2 =list(title ="Seasonal Component"),yaxis3 =list(title ="Trend Component"),yaxis4 =list(title ="Remainder Component"),hovermode ='closest',hoverdistance =10,spikedistance =-1,xaxis =list(showspikes =TRUE,spikemode ='across',spikesnap ='cursor',spikecolor ='#000000',spikethickness =1 ),hoverlabel =list(bgcolor ="white",bordercolor ="black",font =list(size =12) ) )# Create the filter:filter_picker <-filter_select(id ="product_filter",label ="Select Product",sharedData = shared_data,group =~product,multiple =FALSE)# Bring all the elements together:bscols(widths =c(12),div(style ="margin-bottom: 10px;", filter_picker), div(style ="height: 800px;", subplot_data) )
Figure 6. Seasonal Decomposition for commodities in Addis Ababa, Merkato.
Similarly, the ACF and PACF are useful tools when working with this data. The ACF measures how a time series is correlated with itself at different lags. In turn, the PACF measures the correlation between time series values at two time points while accounting for the observations at shorter lags; this helps identify direct relationships between values at different lags, stripping out the influence of intermediate observations. The graph below shows the ACF curve for the different commodities in Addis Ababa, Merkato; the series are first differenced:
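For reference, the sample autocorrelation at lag $k$ computed by stats::acf() is the standard estimator (stated here for clarity; it is not derived in the original text):

$$\hat{\rho}_k = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2}$$

where $y_t$ is the (differenced) price in week $t$, $\bar{y}$ its mean and $T$ the series length. The dashed bands in the plot are the usual approximate 95% limits at $\pm 1.96/\sqrt{T}$.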
# Define the data frame to apply ACF to:ethiopia_prices_df %>%filter( product %in%c('Wheat Grain','Wheat Flour','Beans (Haricot)','Rice (Milled)','Maize Grain (White)' ) , market =='Addis Ababa, Merkato') %>%mutate(date = lubridate::as_date(period_date),value =as.numeric(value)) %>%select(admin_1, period_date, market, product, value) %>%na.omit()->acf_df# Calculate ACF values for each product:products <-unique(acf_df$product)acf_list <-list()for(prod in products) {# Get time series data for this product ts_data <- acf_df %>%filter(product == prod) %>%arrange(period_date) %>%pull(value)# Calculate ACF on differentiated data acf_result <- stats::acf(diff(ts_data, differences =1), lag.max =50, plot =FALSE)# Store results in data frame: acf_list[[prod]] <-data.frame(lag =0:50,acf =as.numeric(acf_result$acf),ci_upper =1.96/sqrt(length(ts_data)),ci_lower =-1.96/sqrt(length(ts_data)),product = prod )}# Combine all ACF results:all_acf <-bind_rows(acf_list)# Define the palette for each product:product_colors <-c("Beans (Haricot)"="#E41A1C","Maize Grain (White)"="#377EB8","Rice (Milled)"="#4DAF4A","Wheat Flour"="#984EA3","Wheat Grain"="#FF7F00")# Create a list for the default series display:visibility_list <-list()for(p inunique(all_acf$product)) { visibility_list[[p]] <-if(p =="Wheat Grain") TRUEelseFALSE}# Create the plot:hchart(all_acf, "lollipop", hcaes(x = lag, y = acf, group = product)) %>%hc_title(text ="Autocorrelation Function by Product") %>%hc_xAxis(title =list(text ="Lag")) %>%hc_yAxis(title =list(text ="ACF"),min =-1,max =1,tickInterval =0.25,gridLineWidth =1,gridLineColor ="rgba(128, 128, 128, 0.1)",plotLines =list(# Zero line:list(value =0,color ="#666666",width =1,zIndex =4 ),# Upper CI:list(value =1.96/sqrt(nrow(acf_df)),color ="#666666",dashStyle ="shortdash",width =1,zIndex =4,label =list(text ="95% CI") ),# Lower CI:list(value =-1.96/sqrt(nrow(acf_df)),color ="#666666",dashStyle ="shortdash",width =1,zIndex =4 ) ) ) %>%hc_colors(c("#3182BD", "#000000", "#31A354", "#E6550D", "#756BB1")) %>%hc_plotOptions(series =list(marker =list(symbol ="circle", radius =4 ) ) ) %>%hc_chart(events =list(load =JS("function() { var series = this.series; series.forEach(function(s) { if (s.name !== 'Wheat Grain') { s.setVisible(false, false); } }); this.redraw(); }") ))%>%hc_tooltip(shared =TRUE,headerFormat ="<b>Lag: {point.x}</b><br/>",pointFormat ="<span style='color:{point.color}'>\u25CF</span> {series.name}: <b>{point.y:.3f}</b><br/>" )
Figure 7. ACF for commodities in Addis Ababa, Merkato.
We can also visualise the PACF curve for the same series:
# Calculate PACF values for each product:products <-unique(acf_df$product)pacf_list <-list()for(prod in products) {# Get time series data for this product: ts_data <- acf_df %>%filter(product == prod) %>%arrange(period_date) %>%pull(value)# Calculate PACF: pacf_result <- stats::pacf(ts_data, lag.max =50, plot =FALSE)# Store results in data frame: pacf_list[[prod]] <-data.frame(lag =1:50, # PACF starts from lag 1pacf =as.numeric(pacf_result$acf),ci_upper =1.96/sqrt(length(ts_data)),ci_lower =-1.96/sqrt(length(ts_data)),product = prod )}# Combine all PACF results:all_pacf <-bind_rows(pacf_list)# Create the plot:hchart(all_pacf, "lollipop", hcaes(x = lag, y = pacf, group = product)) %>%hc_title(text ="Partial Autocorrelation Function by Product") %>%hc_xAxis(title =list(text ="Lag")) %>%hc_yAxis(title =list(text ="ACF"),min =-1,max =1,tickInterval =0.25,gridLineWidth =1,gridLineColor ="rgba(128, 128, 128, 0.1)",plotLines =list(# Zero line:list(value =0,color ="#666666",width =1,zIndex =4 ),# Upper CI:list(value =1.96/sqrt(nrow(acf_df)),color ="#666666",dashStyle ="shortdash",width =1,zIndex =4,label =list(text ="95% CI") ),# Lower CI:list(value =-1.96/sqrt(nrow(acf_df)),color ="#666666",dashStyle ="shortdash",width =1,zIndex =4 ) ) ) %>%hc_colors(c("#3182BD", "#000000", "#31A354", "#E6550D", "#756BB1")) %>%hc_tooltip(shared =TRUE) %>%hc_plotOptions(series =list(marker =list(symbol ="circle",radius =4 ) ) ) %>%hc_chart(events =list(load =JS("function() { var series = this.series; series.forEach(function(s) { if (s.name !== 'Wheat Grain') { s.setVisible(false, false); } }); this.redraw(); }") ))%>%hc_tooltip(shared =TRUE,headerFormat ="<b>Lag: {point.x}</b><br/>",pointFormat ="<span style='color:{point.color}'>\u25CF</span> {series.name}: <b>{point.y:.3f}</b><br/>" )
Figure 8. PACF for commodities in Addis Ababa, Merkato.
Now that we have a better understanding of the data, we can proceed to model it. The focus of the time series forecasting will be the prices for the commodities in the Addis Ababa, Merkato market.
Modeling
Now that we have our data frame and the imputed value, we bring them together, filtering for dates after 2020-06-10 to ensure the same start date for all series. Apart from the Rice (Milled) series, all the others start from 2019-12-25 onward, so we filter to keep the series consistent and to avoid imputing backwards in time:
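As a quick sanity check of the claim about start dates, one could inspect the first and last recorded date per commodity; this small check is not part of the original pipeline:

# Check each commodity's first and last recorded date in the plot_data frame built earlier:
plot_data %>%
  group_by(product) %>%
  summarise(first_date = min(date), last_date = max(date)) %>%
  arrange(first_date)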
# Create a modeling data frame with imputed value:
plot_data %>%
  select(date, value, product) %>%
  filter(product != "Wheat Grain") %>%
  rbind(imp) %>%
  filter(date >= '2020-06-10') %>%
  mutate(product = ifelse(product == "Wheat Grain Imputed", "Wheat Grain", product)) %>%
  arrange(date) -> to_model

# Visualise the data to model:
to_model %>%
  group_by(product) %>%
  plot_time_series(date, value, .facet_ncol = 2, .interactive = FALSE) +
  theme_minimal()
Figure 9. Time series for commodities in Addis Ababa, Merkato.
With our data frame now consistently starting and ending on the same dates, we can create time series features by product and date, to be used as additional variables when modeling. We do this below:
# Create the base data:
to_model %>%
  group_by(product) %>%
  # Log-transform the value:
  mutate(value = log1p(value)) %>%
  group_by(product) %>%
  # Create future data frame with horizon of 18 weeks:
  future_frame(date, .length_out = 18, .bind_data = TRUE) %>%
  ungroup() %>%
  mutate(product = as_factor(product)) %>%
  group_by(product) %>%
  group_split() %>%
  # Create time series features, fourier terms and lags:
  map(.f = function(df) {
    df %>%
      arrange(date) %>%
      tk_augment_fourier(date, .periods = c(2, 4)) %>%
      tk_augment_lags(value, .lags = 52) %>%
      tk_augment_slidify(
        value_lag52,
        .f = ~ mean(.x, na.rm = TRUE),
        .period = c(2, 4, 4 * 2),
        .partial = TRUE,
        .align = "center"
      )
  }) %>%
  bind_rows() %>%
  rowid_to_column(var = "rowid") -> full_data

full_data
The data frame we just created contains both the historical data and the future dates we want to predict over. The difference is that the future dates have the computed lags but no price associated with them, so we filter the full_data data frame for rows where the value is not NA:
# Prepare the historical:
data_prepared_tbl <- full_data %>%
  filter(!is.na(value)) %>%
  drop_na()

data_prepared_tbl
Similarly, we also need the future values. For that we filter for rows where the value is NA and keep the result for later; we also handle some missing values in the computed lags:
# Create the future from the base:
future_table <- full_data %>%
  filter(is.na(value))

# Replace missing na lags with 0:
future_tbl <- future_table %>%
  mutate(across(.cols = contains("_lag"), .fns = function(x) ifelse(is.nan(x), NA, x))) %>%
  mutate(across(.cols = contains("_lag"), .fns = ~ replace_na(.x, 0)))
With the historical and future data frames defined, we can now build the training and test sets on which our models will be trained and evaluated:
# Create the splits, with an 18 week assess set:
splits <- data_prepared_tbl %>%
  time_series_split(date, assess = 18, cumulative = TRUE)

# Visualise the splits:
splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, value)
Figure 10. Time series splits with an 18 week test set.
Note that the visual above has overlapping series, but each series gets its own training and testing split. Let's take a look at the transformed values:
# Clean up data to reduce outliers, and missing values (less relevant for our case):
train_cleaned <- training(splits) %>%
  group_by(product) %>%
  mutate(value = ts_clean_vec(value)) %>%
  ungroup()

# Visualise the transformed values:
train_cleaned %>%
  group_by(product) %>%
  plot_time_series(
    date, value,
    .facet_ncol = 2,
    .smooth = FALSE,
    .interactive = FALSE
  ) +
  theme_minimal()
Figure 11. Transformed time series.
Now that we have our full data, we can build the modeling steps, starting with a recipe for feature engineering:
# Specify the recipe:
recipe_spec <- recipe(value ~ ., data = train_cleaned) %>%
  update_role(rowid, new_role = "indicator") %>%
  # Create time series signature features:
  step_timeseries_signature(date) %>%
  # Remove irrelevant features, due to date granularity:
  step_rm(matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(am.pm)")) %>%
  # Normalise some of the numeric ts features:
  step_normalize(date_index.num, date_year) %>%
  # Handling categorical variables:
  step_other(product) %>%
  step_dummy(all_nominal(), one_hot = TRUE)

# Check output of the data:
# recipe_spec %>%
#   prep() %>%
#   juice() %>%
#   glimpse()
Let's create a few models we can use to forecast our data. We bring in the Prophet family of models as a benchmark, even though they have several limitations:
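The code defining the individual workflows is not shown in this section. As a sketch, two of them (Prophet with regressors and XGBoost) could be specified with parsnip and workflows as below; the remaining workflows (prophet boost, random forest, neural net, MARS) follow the same pattern. The engines and default hyperparameters here are assumptions, not necessarily the exact specifications used:

# A minimal sketch of two of the six workflows, assuming default hyperparameters:
library(tidymodels)
library(modeltime)

# Prophet with regressors:
wflw_fit_prophet <- workflow() %>%
  add_model(prophet_reg() %>% set_engine("prophet")) %>%
  add_recipe(recipe_spec) %>%
  fit(train_cleaned)

# XGBoost (the date column is demoted so only the derived features are used):
wflw_fit_xgboost <- workflow() %>%
  add_model(boost_tree(mode = "regression") %>% set_engine("xgboost")) %>%
  add_recipe(recipe_spec %>% update_role(date, new_role = "indicator")) %>%
  fit(train_cleaned)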
With our workflows defined, we can now combine them into a modeltime table, which lets us handle them in a grouped manner more easily:
# Make a modeltime table from the models:
submodels_1_tbl <- modeltime_table(
  wflw_fit_prophet,
  wflw_fit_xgboost,
  wflw_fit_prophet_boost,
  wflw_fit_rf,
  wflw_fit_nnet,
  wflw_fit_mars
)

# Get the accuracy from the test set:
submodels_1_tbl %>%
  modeltime_accuracy(testing(splits)) %>%
  arrange(rmse)
# A tibble: 6 × 9
.model_id .model_desc .type mae mape mase smape rmse rsq
<int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 NNET Test 0.105 2.27 0.318 2.28 0.133 7.27e-1
2 1 PROPHET W/ REGRESSORS Test 0.110 2.33 0.333 2.34 0.146 6.09e-1
3 3 PROPHET W/ XGBOOST ERRO… Test 0.128 2.76 0.389 2.78 0.146 5.60e-1
4 4 RANGER Test 0.131 2.81 0.398 2.85 0.166 5.58e-1
5 2 XGBOOST Test 0.144 3.06 0.436 3.13 0.191 4.47e-1
6 6 EARTH Test 0.256 5.47 0.774 6.02 0.481 8.01e-4
It seems that the nnet and prophet models come out on top. However, this can be misleading since it is based on a single test split. Let's see how the models hold up under time series cross validation and, while we are at it, tune the xgboost and ranger models.
Towards Greater Robustness
The first thing we do is set up the folds for cross validation. This takes random resamples from the time series at different points in time. We will use 5-fold cross validation:
# Set up 5-fold cross validation:
set.seed(123)
resamples_kfold <- train_cleaned %>% vfold_cv(v = 5)

resamples_kfold %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, value, .facet_ncol = 2, .interactive = FALSE) +
  theme_minimal() +
  theme(legend.position = "bottom")
Figure 12. 5-fold CV.
The folds will come in handy for tuning the xgboost and random forest models. Let's specify the tuning specifications for both:
# Set up tuning specs for the xgboost:model_spec_xgboost_tune <-boost_tree(mode ="regression", mtry =tune(),trees =tune(),min_n =tune(),tree_depth =tune(),learn_rate =tune(),loss_reduction =tune()) %>%set_engine("xgboost")# Created the boosted workflow:wflw_spec_xgboost_tune <-workflow() %>%add_model(model_spec_xgboost_tune) %>%add_recipe(recipe_spec %>%update_role(date, new_role ="indicator"))# Tune the model:set.seed(123)tune_results_xgboost <- wflw_spec_xgboost_tune %>%tune_grid(resamples = resamples_kfold,param_info =parameters(wflw_spec_xgboost_tune) %>%update(learn_rate =learn_rate(range =c(0.001, 0.400), trans =NULL) ),grid =10,control =control_grid(verbose =TRUE, allow_par =TRUE) )# Retrieve the results and fit to the training data:tune_results_xgboost %>%show_best(metric="rmse")wflw_fit_xgboost_tuned <- wflw_spec_xgboost_tune %>%finalize_workflow(select_best(tune_results_xgboost, metric="rmse")) %>%fit(train_cleaned)# Set up the tuning specs for the random forest modelmodel_spec_rf_tune <-rand_forest(mode ="regression",mtry =tune(),trees =tune(),min_n =tune()) %>%set_engine("ranger")wflw_spec_rf_tune <-workflow() %>%add_model(model_spec_rf_tune) %>%add_recipe(recipe_spec %>%update_role(date, new_role ="indicator"))# Tune the model:set.seed(123)tune_results_rf <- wflw_spec_rf_tune %>%tune_grid(resamples = resamples_kfold,grid =5,control =control_grid(verbose =TRUE, allow_par =TRUE) )# Retrieve the results and fit to the training data:tune_results_rf %>%show_best(metric="rmse")wflw_fit_rf_tuned <- wflw_spec_rf_tune %>%finalize_workflow(select_best(tune_results_rf,metric="rmse")) %>%fit(train_cleaned)
Next, we combine the untuned models with the tuned models, calibrate on the test data, and arrange by the rmse metric:
# Combine all models into a modeltime table:
submodels_2_tbl <- modeltime_table(
  wflw_fit_xgboost_tuned,
  wflw_fit_rf_tuned
) %>%
  update_model_description(1, "XGBOOST - Tuned") %>%
  update_model_description(2, "RANGER - Tuned") %>%
  combine_modeltime_tables(submodels_1_tbl)

# Calibrate on test data:
calibration_tbl <- submodels_2_tbl %>%
  modeltime_calibrate(testing(splits))

# Arrange accuracy by ascending rmse:
calibration_tbl %>%
  modeltime_accuracy() %>%
  arrange(rmse)
It looks like the tuning didn't really improve the random forest model and only marginally improved the xgboost model. Let's try to get an even more robust measure of model performance over time. For this we use time series cross validation, which is in essence cross validation applied iteratively over time; we use five cumulative time-based slices:
# Time series CV:
resamples_tscv <- train_cleaned %>%
  time_series_cv(
    assess = 18,
    skip = 18,
    cumulative = TRUE,
    slice_limit = 5
  )

resamples_tscv %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, value, .interactive = FALSE) +
  theme_minimal() +
  theme(legend.position = "bottom")
Figure 13. Time series cross validation with five slices.
We fit the models to the time series cross validation folds and visualise the results:
# Fit to the time series cross validation folds:
model_tbl_tuned_resamples <- submodels_2_tbl %>%
  modeltime_fit_resamples(
    resamples = resamples_tscv,
    control = control_resamples(verbose = FALSE, allow_par = TRUE)
  )

# Extract the accuracies:
model_tbl_tuned_resamples %>%
  modeltime_resample_accuracy(
    metric_set = metric_set(rmse, rsq),
    summary_fns = list(mean = mean, sd = sd)
  ) %>%
  arrange(rmse_mean) %>%
  select(.model_desc, rmse_mean, rmse_sd) -> model_accuracies

# Visualise the results:
hchart(
  model_accuracies, "column",
  hcaes(x = .model_desc, y = rmse_mean)
) %>%
  hc_add_series(
    model_accuracies, "errorbar",
    hcaes(y = rmse_mean, x = .model_desc, low = rmse_mean - rmse_sd, high = rmse_mean + rmse_sd)
  ) %>%
  hc_tooltip(
    shared = TRUE,
    formatter = JS(
      "function() { var point = this.points[0]; var errorPoint = this.points[1]; return 'Mean rmse: ' + Highcharts.numberFormat(point.y, 3) + '<br/>' }"
    )
  ) %>%
  hc_yAxis(title = list(text = "Mean rmse")) %>%
  hc_xAxis(title = list(text = "")) %>%
  hc_title(text = "Mean rmse by model")
Figure 14. Time series cross validation model results.
Next, we take some of the best performing models and combine them into an ensemble, using the median of their predictions. The models are the tuned xgboost, the xgboost, and the prophet with xgboost errors models; the xgboost-based models should help adjust the predictions from the Prophet model:
# Specify which models to keep:
submodels_2_ids_to_keep <- c(1, 5, 4)

# Specify median as central tendency:
ensemble_fit <- submodels_2_tbl %>%
  filter(.model_id %in% submodels_2_ids_to_keep) %>%
  ensemble_average(type = "median")

# Put into modeltime table:
model_ensemble_tbl <- modeltime_table(ensemble_fit)
Let's take our prepared data from earlier and refit the ensemble model to it:
# Perform some cleaning on the value column:
data_prepared_tbl_cleaned <- data_prepared_tbl %>%
  group_by(product) %>%
  mutate(value = ts_clean_vec(value)) %>%
  ungroup()

# Refit on cleaned table:
model_ensemble_refit_tbl <- model_ensemble_tbl %>%
  modeltime_refit(data_prepared_tbl_cleaned)
Finally, now that we have fit the model to the data, we can calibrate on the testing data, forecast on the future_tbl data frame, and use conformal prediction intervals with the split method:
final_forecast <- model_ensemble_refit_tbl %>%
  # Calibrate on test:
  modeltime_calibrate(testing(splits), id = "product") %>%
  # Forecast into the future:
  modeltime_forecast(
    new_data = future_tbl,
    actual_data = data_prepared_tbl_cleaned,
    conf_method = "conformal_split",  # Split Conformal Method
    conf_by_id = TRUE,
    keep_data = TRUE
  ) %>%
  # Convert back to original scale:
  mutate(
    .value = expm1(.value),
    value = expm1(value),
    .conf_lo = expm1(.conf_lo),
    .conf_hi = expm1(.conf_hi)
  )
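For intuition, the split conformal method essentially widens each point forecast by a quantile of the absolute residuals observed on the calibration (test) set. The sketch below is an illustration of that idea only, with hypothetical vectors, not the internals of modeltime_forecast():

# Illustration of split-conformal intervals (calibration_residuals and point_forecast are hypothetical):
alpha <- 0.05
resid_q <- quantile(abs(calibration_residuals), probs = 1 - alpha)  # residual quantile from the calibration set
lower <- point_forecast - resid_q
upper <- point_forecast + resid_q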
Now that we have our final predictions at an 18 week horizon, we can visualise the results:
# Set the unique products:products <-unique(final_forecast$product)# Create a highcharter time series:highchart(type ="stock") %>% { hc <- .for(p in products) { product_data <- final_forecast %>%filter(product == p) color <-switch(p,"Beans (Haricot)"="#3366CC","Maize Grain (White)"="#109618", "Rice (Milled)"="#990099","Wheat Flour"="#FF9900","Wheat Grain"="#3B3B3B") hc <- hc %>%hc_add_series(data = product_data %>%mutate(low = .conf_lo,high = .conf_hi),type ="line",name = p,color = color,id =paste0("line-", p),visible =ifelse(p =="Wheat Grain", TRUE, FALSE),tooltip =list(pointFormat =paste0("<b>{series.name}</b>: {point.y:.2f}<br/>","Lower Confidence Interval: {point.low:.2f}<br/>","Upper Confidence Interval: {point.high:.2f}" ) ),hcaes(x = .index, y = .value)) %>%hc_add_series(data = product_data,type ="arearange",linkedTo =paste0("line-", p),color =hex_to_rgba(color, 0.2),visible =ifelse(p =="Wheat Grain", TRUE, FALSE),showInLegend =FALSE,fillOpacity =0.3,lineWidth =0,enableMouseTracking =FALSE,hcaes(x = .index, low = .conf_lo, high = .conf_hi)) } hc} %>%hc_legend(enabled =TRUE, layout ="vertical",align ="right",verticalAlign ="top",floating =TRUE,x =-500,y =20,backgroundColor ="rgba(255, 255, 255, 0.9)",borderColor ="#999999",borderWidth =1,title =list(text ="Commodity",style =list(fontWeight ="bold")))%>%hc_title(text ="Predicted Food Prices in Addis Ababa, Merkato",align ="center") %>%hc_yAxis(title =list(text ="Price (ETB)/Kg"),opposite =FALSE) %>%hc_xAxis(title =list(text =""), type ="datetime",dateTimeLabelFormats =list(month ="%b %Y",year ="%Y" )) %>%hc_navigator(enabled =TRUE) %>%hc_scrollbar(enabled =FALSE) %>%hc_rangeSelector(buttons =list(list(type ='month', count =1, text ='1m'),list(type ='month', count =3, text ='3m'),list(type ='month', count =6, text ='6m'),list(type ='ytd', text ='YTD'),list(type ='year', count =1, text ='1y'),list(type ='all', text ='All') ),selected =5,inputEnabled =TRUE,inputPosition =list(align ='right') ) %>%hc_plotOptions(series =list(marker =list(enabled =FALSE ) ))
Looking at feature importance for the tree-based models, the 8 and 4 week rolling averages of the 52 week lag appear to be the most important variables.
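The importance calculation itself is not shown above. One way to inspect gain and frequency for the fitted xgboost workflow is sketched below; this assumes the wflw_fit_xgboost object from earlier and is an illustration rather than the exact code used:

# Sketch: extract gain/frequency-based importance from the xgboost workflow.
library(workflows)
library(xgboost)

xgb_booster <- wflw_fit_xgboost %>%
  extract_fit_parsnip() %>%   # pull the parsnip model out of the workflow
  purrr::pluck("fit")         # the underlying xgb.Booster object

xgboost::xgb.importance(model = xgb_booster) %>% head(10)  # Gain, Cover, Frequency per feature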
Conclusion and Future Project Directions
Conclusion
This project applies data science methods to time series forecasting in the humanitarian space, namely food commodity prices in Ethiopia. The data was retrieved with the rhdx package, which pulls from the Humanitarian Data Exchange portal via its API. Once retrieved, the data was explored, uncovering data quality issues, time series patterns and geographical patterns. We also used typical time series techniques, seasonal decomposition and the ACF and PACF curves, to get more of a feel for the temporal elements of the data. Next, we honed in on the Addis Ababa, Merkato market and modeled each of the commodities in that market. A few general models were built and their performance tested. We then implemented time series cross validation and tuned an xgboost and a random forest model, which gave a more robust sense of each model's performance. Finally, we built an ensemble model from the best performing models to predict each commodity at an 18 week horizon, and briefly examined feature gain and frequency. All in all, this project is a good example of how to use data science forecasting methods in a humanitarian context.
Future Project Directions
Whilst this project produces reasonable forecasts, there is still room for improvement. One potential avenue for exploration would be to experiment with different feature engineering steps; this might include adding other time series features we did not consider, or adjusting the number of lags computed. Similarly, we did not consider any exogenous variables that would undoubtedly affect prices, such as droughts, wars, or other factors deemed to have an effect on price; incorporating them would require a significant data collection effort but should further improve accuracy. Another improvement would be to extend the forecasting to more region and market combinations, perhaps even internationally to include markets in other countries. Given the data quality issues in this data set, scaling up would likely surface even more quality issues and should not be taken lightly. Finally, we may wish to try different models, as the ones used in this project are by no means an exhaustive list; different model architectures may lead to further performance gains.