This project focuses on time series forecasting of food commodity prices in Ethiopia. The data was collected via an API from The Humanitarian Data Exchange, which aggregates humanitarian data. The Ethiopian commodity prices stored in The Humanitarian Data Exchange were in turn sourced from the Famine Early Warning Systems Network (FEWS NET), a food insecurity monitoring network created in 1985 by USAID and the US Department of State after famines in East and West Africa. The methodology of this project matters because it gives policy makers a tool they can use to make better decisions and reduce uncertainty surrounding food security in the region. First, the data is collected via the API and analysed to uncover patterns and trends in commodity prices; in this part, the data structure is explained and data quality discussed and visualised. Second, we model the data using the modeltime framework with a horizon of 18 weeks. Finally, the project is summarized and future directions are discussed.
Data Retrieval & Structure
The first thing we would like to do is retrieve the data, which can easily be done with the rhdx package. We query the Humanitarian Data Exchange for weekly commodity prices in Ethiopia. Let's take a look at the output:
# Using the rhdx API we can search for Ethiopia food prices:
list_of_ds <- search_datasets("Ethiopia Weekly FEWS NET Staple Food Price Data", rows = 10)
ds <- pluck(list_of_ds, 1)

# Extract the resources/data:
ethiopia_prices <- get_resource(ds, 1)
ethiopia_prices_df <- read_resource(ethiopia_prices, simplify_json = TRUE, download_folder = tempdir())

ethiopia_prices_df %>% head(5)
USAID Freeze
The data retrieval code above may not work as expected due to the freeze on USAID by the Trump administration (this Quarto document was rendered using a backup taken on January 25th).
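Should the API endpoint become unavailable, the analysis can be run from a local backup instead. A minimal sketch, where the file name, location, and CSV format are assumptions and not part of the original pipeline:

# Hypothetical fallback: read a locally saved backup of the HDX resource
# (the path below is illustrative, not the file actually used to render this document).
library(readr)
ethiopia_prices_df <- read_csv("data/ethiopia_prices_backup_2025-01-25.csv")
ethiopia_prices_df %>% head(5)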
The data has a few identifiers, most notably the market in which the commodity price was recorded, as well as some geographical markers (admin level 1, longitude and latitude). The granularity of the data is as follows:
# Create values for funnel:
funnel <- data.frame(
  gran = c("Admin 1 (region)", "Market", "Product (commodity)"),
  value = c(100, 50, 10)
)

# Visualise the funnel:
funnel |>
  e_charts() |>
  e_funnel(value, gran) |>
  e_title("Data Hierarchy")
Figure 1. Data Hierarchy
The admin_1 column denotes the region of Ethiopia, the market column the market in which the price was recorded, and the product column the commodity for which the price was recorded. The price is recorded in Ethiopian Birr (ETB) per kg. The table below provides a breakdown of all the region, market and product combinations:
# Create the base data frame:ethiopia_prices_df %>%select(admin_1, period_date, market, product, value) %>%mutate(value =as.numeric(value), year =as.numeric(year(period_date))) %>%filter( product %in%c('Wheat Grain','Wheat Flour','Beans (Haricot)','Rice (Milled)','Maize Grain (White)' )) %>%group_by(admin_1, market, product) %>%summarize(across(where(is.numeric), list)) %>%select(-year) -> sparkline_df# Define some color palettes: all_colours <-colorRampPalette(c("tomato", "skyblue", "yellow2"))(100)set.seed(600)choose_colours <-sample(all_colours, size =9)# Get a unique color by region (admin_1):sparkline_df %>%ungroup() %>%select(admin_1) %>%unique() %>%cbind(choose_colours)-> color_ad_lvlsparkline_df %>%left_join(color_ad_lvl, by="admin_1") -> sparkline_df# Create shared data frame:shared_data <- crosstalk::SharedData$new(sparkline_df)# Bring all the elements together:bscols(widths =c(12),filter_select("admin_1", "Admin Lvl.", shared_data, ~admin_1),bscols(widths =c(6, 6),filter_select("market", "Market", shared_data, ~market),filter_select("product", "Product", shared_data, ~product) ),reactable( shared_data,theme =clean(),compact =TRUE,defaultPageSize =5,width ='100%',showSortIcon =FALSE,columns =list(admin_1 =colDef(name ="Admin Lvl. 1",maxWidth =150,cell =pill_buttons(data = sparkline_df,color_ref ='choose_colours' ) ),choose_colours =colDef(show =FALSE),market =colDef(maxWidth =100, header ="Market"),product =colDef(maxWidth =100, header ="Commodity"),value =colDef(header ="ETB/kg",minWidth =400, cell =react_sparkline(data = shared_data,height =150,line_width =1.5,bandline ='innerquartiles',bandline_color ='forestgreen',bandline_opacity =0.6,highlight_points =highlight_points(min ='blue', max ='red'),margin = reactablefmtr::margin(t=15,r=2,b=15,l=2),tooltip_type =2 ) ) ) ) )
Figure 2. ETB/kg by admin level 1, market, and commodity
The table above demonstrates the broad range of data quality across the data set. Some admin level, market and commodity combinations have no recorded price; others have sporadic entries with periods of missing data (for example the Afar admin level). Missing data is common in real life, but how we handle those values has implications for forecasting later on. For now, let's consider the least complete series in the market with the most complete data (Addis Ababa, Merkato and Wheat Grain) alongside a series which is very incomplete (the Jinka market and Haricot Beans):
# Filter for Addis Ababa, Merkato & Wheat Grain only:
ethiopia_prices_df %>%
  filter(
    product %in% c('Wheat Grain', 'Wheat Flour', 'Beans (Haricot)', 'Rice (Milled)', 'Maize Grain (White)'),
    market == 'Addis Ababa, Merkato'
  ) %>%
  mutate(date = lubridate::as_date(period_date), value = as.numeric(value)) %>%
  filter(product == 'Wheat Grain') %>%
  select(date, value) -> to_impute_wheat

# Missing value plot
ggplot_na_distribution(to_impute_wheat$value) -> plot1

# We do the same for Jinka and Haricot Beans:
ethiopia_prices_df %>%
  filter(
    product == 'Beans (Haricot)',
    market == 'Jinka',
    admin_1 == 'South Ethiopia'
  ) %>%
  mutate(date = lubridate::as_date(period_date), value = as.numeric(value)) %>%
  select(date, value) -> to_impute_jinka

ggplot_na_distribution(to_impute_jinka$value) -> plot2

# Bring the plots together:
plot_grid(plot1, plot2)
As you can imagine, the impact of imputing the wheat price for the Addis Ababa market is lower than imputing the price of Haricot Beans in the Jinka market. For now we proceed with the first series, where we only have to impute one value. We can visualise the imputed price alongside all the other prices over time:
# Isolate the value to impute:
imp <- na_interpolation(to_impute_wheat)
imp %>% mutate(product = 'Wheat Grain') -> imp

# Create plot data frame and bind to imputed data:
ethiopia_prices_df %>%
  filter(
    product %in% c('Wheat Grain', 'Wheat Flour', 'Beans (Haricot)', 'Rice (Milled)', 'Maize Grain (White)'),
    market == 'Addis Ababa, Merkato'
  ) %>%
  mutate(date = lubridate::as_date(period_date), value = as.numeric(value)) -> plot_data

# Plot with Highcharter:
hchart(
  plot_data %>%
    select(date, value, product) %>%
    filter(product != 'Wheat Grain') %>%
    rbind(imp) %>%
    filter(date >= '2019-12-25'),
  "line", hcaes(x = date, y = value, group = product)
) %>%
  hc_legend(
    layout = "vertical", align = "left", verticalAlign = "top", floating = TRUE,
    x = 90, y = 30,
    title = list(text = "Commodity", style = list(fontWeight = "bold"))
  ) %>%
  hc_title(text = "Food Prices in Addis Ababa, Merkato", align = "center") %>%
  hc_yAxis(title = list(text = "Price (ETB)/Kg")) %>%
  hc_xAxis(title = list(text = ""))
Figure 4. Food commodity prices over time in Addis Ababa, Merkato.
Given the spatial identifiers in the data, we can also take a look at the spatial distribution of the data, which gives us additional insight into what data is missing for which regions:
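The interactive map itself is not reproduced in this section; below is a minimal sketch of how the market locations could be mapped with leaflet. The longitude and latitude column names are assumptions based on the identifiers described earlier, and this is an illustration rather than the exact map code used:

# Sketch: plot each distinct market location on an interactive map.
library(dplyr)
library(leaflet)

ethiopia_prices_df %>%
  mutate(longitude = as.numeric(longitude), latitude = as.numeric(latitude)) %>%
  distinct(admin_1, market, longitude, latitude) %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~longitude, lat = ~latitude,
    label = ~paste(market, admin_1, sep = " - "),
    radius = 4
  )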
Apart from the Beneshagul Gumu region, which is missing from the dataset entirely, missing data is concentrated in the south western region of the country and in the wheat flour commodity. A common practice when working with time series data is to decompose a series into its components: seasonal, trend, and remainder (or noise). These allow for a study of the time series' behaviour and have implications for the modelling process; for instance, the seasonal component will influence the seasonality parameter in an ETS model. Focusing on the commodity prices for Addis Ababa, Merkato, we decompose the time series for the different commodities:
# Create base plot data for seasonal decomposition:plot_data %>%select(date, value, product) %>%filter(product !='Wheat Grain') %>%rbind(imp) %>%filter(date >='2020-06-10') -> base_data# Unique products:products <-c("Beans (Haricot)" ,"Maize Grain (White)","Rice (Milled)","Wheat Flour","Wheat Grain")# Create decomposition for all products:create_all_decompositions <-function(data) { data %>%group_by(product) %>%group_modify(~{ ts_data <-ts(.x$value, frequency =52) decomp <-stl(log(ts_data), "periodic")data.frame(date = .x$date,data =as.numeric(log(ts_data)),seasonal =as.numeric(decomp$time.series[, "seasonal"]),trend =as.numeric(decomp$time.series[, "trend"]),remainder =as.numeric(decomp$time.series[, "remainder"]) ) })}decomp_data <-create_all_decompositions(base_data)# Set the series to start with Wheat Grain:decomp_data <- decomp_data %>%arrange(product =='Wheat Grain')# Create the shared data frame:shared_data <- SharedData$new(decomp_data, ~product)# Create the different sub plots:p1 <-plot_ly(shared_data, x =~date, y =~data, color =I("purple"), type ='scatter', mode ='lines',hoverongaps =FALSE,hovertemplate =paste0("<b>Log of Original Data</b><br>","Date: %{x|%Y-%m-%d}<br>","Value: %{y:.3f}","<extra></extra>" )) %>%layout(yaxis =list(title ="Log of Original Data"))p2 <-plot_ly(shared_data, x =~date, y =~seasonal, color =I("#90ED7D"), type ='scatter', mode ='lines',hoverongaps =FALSE,hovertemplate =paste0("<b>Seasonal Component</b><br>","Date: %{x|%Y-%m-%d}<br>","Value: %{y:.3f}","<extra></extra>" )) %>%layout(yaxis =list(title ="Seasonal Component"))p3 <-plot_ly(shared_data, x =~date, y =~trend, color =I("#7CB5EC"), type ='scatter', mode ='lines',hoverongaps =FALSE,hovertemplate =paste0("<b>Trend Component</b><br>","Date: %{x|%Y-%m-%d}<br>","Value: %{y:.3f}","<extra></extra>" )) %>%layout(yaxis =list(title ="Trend Component"))p4 <-plot_ly(shared_data, x =~date, y =~remainder, color =I("#F7A35C"), type ='scatter', mode ='lines',hoverongaps =FALSE,hovertemplate =paste0("<b>Remainder Component</b><br>","Date: %{x|%Y-%m-%d}<br>","Value: %{y:.3f}","<extra></extra>" )) %>%layout(yaxis =list(title ="Remainder Component"))# Bring the sub plots together:subplot_data <-subplot(p1, p2, p3, p4, nrows =4, shareX =TRUE) %>%layout(title =list(text ="Seasonal Decomposition of Time Series",y =0.98 ),showlegend =FALSE,autosize =TRUE,width =850,height =800, margin =list(l =50,r =50,t =60, b =20, pad =2 ),yaxis =list(title ="Log of Original Data"),yaxis2 =list(title ="Seasonal Component"),yaxis3 =list(title ="Trend Component"),yaxis4 =list(title ="Remainder Component"),hovermode ='closest',hoverdistance =10,spikedistance =-1,xaxis =list(showspikes =TRUE,spikemode ='across',spikesnap ='cursor',spikecolor ='#000000',spikethickness =1 ),hoverlabel =list(bgcolor ="white",bordercolor ="black",font =list(size =12) ) )# Create the filter:filter_picker <-filter_select(id ="product_filter",label ="Select Product",sharedData = shared_data,group =~product,multiple =FALSE)# Bring all the elements together:bscols(widths =c(12),div(style ="margin-bottom: 10px;", filter_picker), div(style ="height: 800px;", subplot_data) )
Figure 6. Seasonal Decomposition for commodities in Addis Ababa, Merkato.
Similarly, the ACF and PACF are useful tools when working with this data. The ACF measures how a time series is correlated with itself at different lags. In turn, the PACF measures the correlation between time series values at two time points while accounting for the observations at shorter lags; this helps identify direct relationships between values at different lags, stripping out the influence of intermediate observations. The graph below shows the ACF curve for the different commodities in Addis Ababa, Merkato; the series are first differenced:
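For reference, the sample autocorrelation at lag $k$ computed by stats::acf() is the standard estimator (stated here for clarity; it is not derived in the original text):

$$\hat{\rho}_k = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2}$$

where $y_t$ is the (differenced) price in week $t$, $\bar{y}$ its mean and $T$ the series length. The dashed bands in the plot are the usual approximate 95% limits at $\pm 1.96/\sqrt{T}$.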
# Define the data frame to apply ACF to:ethiopia_prices_df %>%filter( product %in%c('Wheat Grain','Wheat Flour','Beans (Haricot)','Rice (Milled)','Maize Grain (White)' ) , market =='Addis Ababa, Merkato') %>%mutate(date = lubridate::as_date(period_date),value =as.numeric(value)) %>%select(admin_1, period_date, market, product, value) %>%na.omit()->acf_df# Calculate ACF values for each product:products <-unique(acf_df$product)acf_list <-list()for(prod in products) {# Get time series data for this product ts_data <- acf_df %>%filter(product == prod) %>%arrange(period_date) %>%pull(value)# Calculate ACF on differentiated data acf_result <- stats::acf(diff(ts_data, differences =1), lag.max =50, plot =FALSE)# Store results in data frame: acf_list[[prod]] <-data.frame(lag =0:50,acf =as.numeric(acf_result$acf),ci_upper =1.96/sqrt(length(ts_data)),ci_lower =-1.96/sqrt(length(ts_data)),product = prod )}# Combine all ACF results:all_acf <-bind_rows(acf_list)# Define the palette for each product:product_colors <-c("Beans (Haricot)"="#E41A1C","Maize Grain (White)"="#377EB8","Rice (Milled)"="#4DAF4A","Wheat Flour"="#984EA3","Wheat Grain"="#FF7F00")# Create a list for the default series display:visibility_list <-list()for(p inunique(all_acf$product)) { visibility_list[[p]] <-if(p =="Wheat Grain") TRUEelseFALSE}# Create the plot:hchart(all_acf, "lollipop", hcaes(x = lag, y = acf, group = product)) %>%hc_title(text ="Autocorrelation Function by Product") %>%hc_xAxis(title =list(text ="Lag")) %>%hc_yAxis(title =list(text ="ACF"),min =-1,max =1,tickInterval =0.25,gridLineWidth =1,gridLineColor ="rgba(128, 128, 128, 0.1)",plotLines =list(# Zero line:list(value =0,color ="#666666",width =1,zIndex =4 ),# Upper CI:list(value =1.96/sqrt(nrow(acf_df)),color ="#666666",dashStyle ="shortdash",width =1,zIndex =4,label =list(text ="95% CI") ),# Lower CI:list(value =-1.96/sqrt(nrow(acf_df)),color ="#666666",dashStyle ="shortdash",width =1,zIndex =4 ) ) ) %>%hc_colors(c("#3182BD", "#000000", "#31A354", "#E6550D", "#756BB1")) %>%hc_plotOptions(series =list(marker =list(symbol ="circle", radius =4 ) ) ) %>%hc_chart(events =list(load =JS("function() { var series = this.series; series.forEach(function(s) { if (s.name !== 'Wheat Grain') { s.setVisible(false, false); } }); this.redraw(); }") ))%>%hc_tooltip(shared =TRUE,headerFormat ="<b>Lag: {point.x}</b><br/>",pointFormat ="<span style='color:{point.color}'>\u25CF</span> {series.name}: <b>{point.y:.3f}</b><br/>" )
Figure 7. ACF for commodities in Addis Ababa, Merkato.
We can also visualise the PACF curve for the same series:
# Calculate PACF values for each product:products <-unique(acf_df$product)pacf_list <-list()for(prod in products) {# Get time series data for this product: ts_data <- acf_df %>%filter(product == prod) %>%arrange(period_date) %>%pull(value)# Calculate PACF: pacf_result <- stats::pacf(ts_data, lag.max =50, plot =FALSE)# Store results in data frame: pacf_list[[prod]] <-data.frame(lag =1:50, # PACF starts from lag 1pacf =as.numeric(pacf_result$acf),ci_upper =1.96/sqrt(length(ts_data)),ci_lower =-1.96/sqrt(length(ts_data)),product = prod )}# Combine all PACF results:all_pacf <-bind_rows(pacf_list)# Create the plot:hchart(all_pacf, "lollipop", hcaes(x = lag, y = pacf, group = product)) %>%hc_title(text ="Partial Autocorrelation Function by Product") %>%hc_xAxis(title =list(text ="Lag")) %>%hc_yAxis(title =list(text ="ACF"),min =-1,max =1,tickInterval =0.25,gridLineWidth =1,gridLineColor ="rgba(128, 128, 128, 0.1)",plotLines =list(# Zero line:list(value =0,color ="#666666",width =1,zIndex =4 ),# Upper CI:list(value =1.96/sqrt(nrow(acf_df)),color ="#666666",dashStyle ="shortdash",width =1,zIndex =4,label =list(text ="95% CI") ),# Lower CI:list(value =-1.96/sqrt(nrow(acf_df)),color ="#666666",dashStyle ="shortdash",width =1,zIndex =4 ) ) ) %>%hc_colors(c("#3182BD", "#000000", "#31A354", "#E6550D", "#756BB1")) %>%hc_tooltip(shared =TRUE) %>%hc_plotOptions(series =list(marker =list(symbol ="circle",radius =4 ) ) ) %>%hc_chart(events =list(load =JS("function() { var series = this.series; series.forEach(function(s) { if (s.name !== 'Wheat Grain') { s.setVisible(false, false); } }); this.redraw(); }") ))%>%hc_tooltip(shared =TRUE,headerFormat ="<b>Lag: {point.x}</b><br/>",pointFormat ="<span style='color:{point.color}'>\u25CF</span> {series.name}: <b>{point.y:.3f}</b><br/>" )
Figure 8. PACF for commodities in Addis Ababa, Merkato.
Now that we have a better understanding of the data, we can proceed to model it. The focus of the time series forecasting will be the prices for the commodities in the Addis Ababa, Merkato market.
Modeling
Now that we have our data frame and the imputed value, we bring them together, filtering for dates after 2020-06-10 to ensure the same start date for all series. Apart from the Rice (Milled) series, all the others start from 2019-12-25 onward, so we filter to keep the series consistent and to avoid imputing backwards in time:
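As a quick sanity check of the claim about start dates, one could inspect the first and last recorded date per commodity; this small check is not part of the original pipeline:

# Check each commodity's first and last recorded date in the plot_data frame built earlier:
plot_data %>%
  group_by(product) %>%
  summarise(first_date = min(date), last_date = max(date)) %>%
  arrange(first_date)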
# Create a modeling data frame with imputed value:
plot_data %>%
  select(date, value, product) %>%
  filter(product != "Wheat Grain") %>%
  rbind(imp) %>%
  filter(date >= '2020-06-10') %>%
  mutate(product = ifelse(product == "Wheat Grain Imputed", "Wheat Grain", product)) %>%
  arrange(date) -> to_model

# Visualise the data to model:
to_model %>%
  group_by(product) %>%
  plot_time_series(date, value, .facet_ncol = 2, .interactive = FALSE) +
  theme_minimal()
Figure 9. Time series for commodities in Addis Ababa, Merkato.
With our data frame now consistently starting and ending on the same dates, we can create time series features by product and date, to be used as additional variables when modeling. We do this below:
# Create the base data:
to_model %>%
  group_by(product) %>%
  # Log-transform the value:
  mutate(value = log1p(value)) %>%
  group_by(product) %>%
  # Create future data frame with horizon of 18 weeks:
  future_frame(date, .length_out = 18, .bind_data = TRUE) %>%
  ungroup() %>%
  mutate(product = as_factor(product)) %>%
  group_by(product) %>%
  group_split() %>%
  # Create time series features, fourier terms and lags:
  map(.f = function(df) {
    df %>%
      arrange(date) %>%
      tk_augment_fourier(date, .periods = c(2, 4)) %>%
      tk_augment_lags(value, .lags = 52) %>%
      tk_augment_slidify(
        value_lag52,
        .f = ~ mean(.x, na.rm = TRUE),
        .period = c(2, 4, 4 * 2),
        .partial = TRUE,
        .align = "center"
      )
  }) %>%
  bind_rows() %>%
  rowid_to_column(var = "rowid") -> full_data

full_data
The data frame we just created contains both the historical data and the future dates we want to predict over. The difference is that the future dates have the computed lags but no price associated with them, so we filter the full_data data frame for rows where the value is not NA:
# Prepare the historical:
data_prepared_tbl <- full_data %>%
  filter(!is.na(value)) %>%
  drop_na()

data_prepared_tbl
Similarly, we also need the future values. For that we filter for rows where the value is NA and keep the result for later; we also handle some missing values in the computed lags:
# Create the future from the base:
future_table <- full_data %>%
  filter(is.na(value))

# Replace missing na lags with 0:
future_tbl <- future_table %>%
  mutate(across(.cols = contains("_lag"), .fns = function(x) ifelse(is.nan(x), NA, x))) %>%
  mutate(across(.cols = contains("_lag"), .fns = ~ replace_na(.x, 0)))
With the historical and future data frames defined, we can now build the training and test sets on which our models will be trained and evaluated:
# Create the splits, with an 18 week assess set:
splits <- data_prepared_tbl %>%
  time_series_split(date, assess = 18, cumulative = TRUE)

# Visualise the splits:
splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, value)
Figure 10. Time series splits with an 18 week test set.
Note that the visual above has overlapping series, but each series gets its own training and testing split. Let's take a look at the transformed values:
# Clean up data to reduce outliers, and missing values (less relevant for our case):
train_cleaned <- training(splits) %>%
  group_by(product) %>%
  mutate(value = ts_clean_vec(value)) %>%
  ungroup()

# Visualise the transformed values:
train_cleaned %>%
  group_by(product) %>%
  plot_time_series(
    date, value,
    .facet_ncol = 2,
    .smooth = FALSE,
    .interactive = FALSE
  ) +
  theme_minimal()
Figure 11. Transformed time series.
Now that we have our full data, we can build the modeling steps, starting with a recipe for feature engineering:
# Specify the recipe:
recipe_spec <- recipe(value ~ ., data = train_cleaned) %>%
  update_role(rowid, new_role = "indicator") %>%
  # Create time series signature features:
  step_timeseries_signature(date) %>%
  # Remove irrelevant features, due to date granularity:
  step_rm(matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(am.pm)")) %>%
  # Normalise some of the numeric ts features:
  step_normalize(date_index.num, date_year) %>%
  # Handling categorical variables:
  step_other(product) %>%
  step_dummy(all_nominal(), one_hot = TRUE)

# Check output of the data:
# recipe_spec %>%
#   prep() %>%
#   juice() %>%
#   glimpse()
Let's create a few models we can use to forecast our data. We bring in the Prophet family of models as a benchmark, even though they have several limitations:
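The code defining the individual workflows is not shown in this section. As a sketch, two of them (Prophet with regressors and XGBoost) could be specified with parsnip and workflows as below; the remaining workflows (prophet boost, random forest, neural net, MARS) follow the same pattern. The engines and default hyperparameters here are assumptions, not necessarily the exact specifications used:

# A minimal sketch of two of the six workflows, assuming default hyperparameters:
library(tidymodels)
library(modeltime)

# Prophet with regressors:
wflw_fit_prophet <- workflow() %>%
  add_model(prophet_reg() %>% set_engine("prophet")) %>%
  add_recipe(recipe_spec) %>%
  fit(train_cleaned)

# XGBoost (the date column is demoted so only the derived features are used):
wflw_fit_xgboost <- workflow() %>%
  add_model(boost_tree(mode = "regression") %>% set_engine("xgboost")) %>%
  add_recipe(recipe_spec %>% update_role(date, new_role = "indicator")) %>%
  fit(train_cleaned)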
With our workflows defined, we can now combine them into a modeltime table, which lets us handle them in a grouped manner more easily:
# Make a modeltime table from the models:
submodels_1_tbl <- modeltime_table(
  wflw_fit_prophet,
  wflw_fit_xgboost,
  wflw_fit_prophet_boost,
  wflw_fit_rf,
  wflw_fit_nnet,
  wflw_fit_mars
)

# Get the accuracy from the test set:
submodels_1_tbl %>%
  modeltime_accuracy(testing(splits)) %>%
  arrange(rmse)
# A tibble: 6 × 9
.model_id .model_desc .type mae mape mase smape rmse rsq
<int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 NNET Test 0.105 2.27 0.318 2.28 0.133 7.27e-1
2 1 PROPHET W/ REGRESSORS Test 0.110 2.33 0.333 2.34 0.146 6.09e-1
3 3 PROPHET W/ XGBOOST ERRO… Test 0.128 2.76 0.389 2.78 0.146 5.60e-1
4 4 RANGER Test 0.131 2.81 0.398 2.85 0.166 5.58e-1
5 2 XGBOOST Test 0.144 3.06 0.436 3.13 0.191 4.47e-1
6 6 EARTH Test 0.256 5.47 0.774 6.02 0.481 8.01e-4
It seems that the nnet and prophet models come out on top. However, this can be misleading since it is based on a single test split. Let's see how the models hold up under time series cross validation and, while we are at it, tune the xgboost and ranger models.
Towards Greater Robustness
The first thing we do is set up the folds for cross validation. This takes random resamples from the time series at different points in time. We will use 5-fold cross validation:
# Set up 5-fold cross validation:
set.seed(123)
resamples_kfold <- train_cleaned %>% vfold_cv(v = 5)

resamples_kfold %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, value, .facet_ncol = 2, .interactive = FALSE) +
  theme_minimal() +
  theme(legend.position = "bottom")
Figure 12. 5-fold CV.
The folds will come in handy for tuning the xgboost and random forest models. Let's specify the tuning specifications for both:
# Set up tuning specs for the xgboost:model_spec_xgboost_tune <-boost_tree(mode ="regression", mtry =tune(),trees =tune(),min_n =tune(),tree_depth =tune(),learn_rate =tune(),loss_reduction =tune()) %>%set_engine("xgboost")# Created the boosted workflow:wflw_spec_xgboost_tune <-workflow() %>%add_model(model_spec_xgboost_tune) %>%add_recipe(recipe_spec %>%update_role(date, new_role ="indicator"))# Tune the model:set.seed(123)tune_results_xgboost <- wflw_spec_xgboost_tune %>%tune_grid(resamples = resamples_kfold,param_info =parameters(wflw_spec_xgboost_tune) %>%update(learn_rate =learn_rate(range =c(0.001, 0.400), trans =NULL) ),grid =10,control =control_grid(verbose =TRUE, allow_par =TRUE) )# Retrieve the results and fit to the training data:tune_results_xgboost %>%show_best(metric="rmse")wflw_fit_xgboost_tuned <- wflw_spec_xgboost_tune %>%finalize_workflow(select_best(tune_results_xgboost, metric="rmse")) %>%fit(train_cleaned)# Set up the tuning specs for the random forest modelmodel_spec_rf_tune <-rand_forest(mode ="regression",mtry =tune(),trees =tune(),min_n =tune()) %>%set_engine("ranger")wflw_spec_rf_tune <-workflow() %>%add_model(model_spec_rf_tune) %>%add_recipe(recipe_spec %>%update_role(date, new_role ="indicator"))# Tune the model:set.seed(123)tune_results_rf <- wflw_spec_rf_tune %>%tune_grid(resamples = resamples_kfold,grid =5,control =control_grid(verbose =TRUE, allow_par =TRUE) )# Retrieve the results and fit to the training data:tune_results_rf %>%show_best(metric="rmse")wflw_fit_rf_tuned <- wflw_spec_rf_tune %>%finalize_workflow(select_best(tune_results_rf,metric="rmse")) %>%fit(train_cleaned)
Next, we combine the untuned models with the tuned models, calibrate on the test data, and arrange by the rmse metric:
# Combine all models into a modeltime table:
submodels_2_tbl <- modeltime_table(
  wflw_fit_xgboost_tuned,
  wflw_fit_rf_tuned
) %>%
  update_model_description(1, "XGBOOST - Tuned") %>%
  update_model_description(2, "RANGER - Tuned") %>%
  combine_modeltime_tables(submodels_1_tbl)

# Calibrate on test data:
calibration_tbl <- submodels_2_tbl %>%
  modeltime_calibrate(testing(splits))

# Arrange accuracy by ascending rmse:
calibration_tbl %>%
  modeltime_accuracy() %>%
  arrange(rmse)
It looks like the tuning didn't really improve the random forest model and only marginally improved the xgboost model. Let's try to get an even more robust measure of model performance over time. For this we use time series cross validation, which is in essence cross validation applied iteratively over time; we use five cumulative time-based slices:
# Time series CV:
resamples_tscv <- train_cleaned %>%
  time_series_cv(
    assess = 18,
    skip = 18,
    cumulative = TRUE,
    slice_limit = 5
  )

resamples_tscv %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, value, .interactive = FALSE) +
  theme_minimal() +
  theme(legend.position = "bottom")
Figure 13. Time series cross validation with five slices.
We fit the models to the time series cross validation folds and visualise the results:
# Fit to the time series cross validation folds:
model_tbl_tuned_resamples <- submodels_2_tbl %>%
  modeltime_fit_resamples(
    resamples = resamples_tscv,
    control = control_resamples(verbose = FALSE, allow_par = TRUE)
  )

# Extract the accuracies:
model_tbl_tuned_resamples %>%
  modeltime_resample_accuracy(
    metric_set = metric_set(rmse, rsq),
    summary_fns = list(mean = mean, sd = sd)
  ) %>%
  arrange(rmse_mean) %>%
  select(.model_desc, rmse_mean, rmse_sd) -> model_accuracies

# Visualise the results:
hchart(
  model_accuracies, "column",
  hcaes(x = .model_desc, y = rmse_mean)
) %>%
  hc_add_series(
    model_accuracies, "errorbar",
    hcaes(y = rmse_mean, x = .model_desc, low = rmse_mean - rmse_sd, high = rmse_mean + rmse_sd)
  ) %>%
  hc_tooltip(
    shared = TRUE,
    formatter = JS(
      "function() { var point = this.points[0]; var errorPoint = this.points[1]; return 'Mean rmse: ' + Highcharts.numberFormat(point.y, 3) + '<br/>' }"
    )
  ) %>%
  hc_yAxis(title = list(text = "Mean rmse")) %>%
  hc_xAxis(title = list(text = "")) %>%
  hc_title(text = "Mean rmse by model")
Figure 14. Time series cross validation model results.
Next, we take some of the best performing models and combine them into an ensemble, using the median of their predictions. The models are the tuned xgboost, the xgboost, and the prophet with xgboost errors models; the xgboost-based models should help adjust the predictions from the Prophet model:
# Specify which models to keep:
submodels_2_ids_to_keep <- c(1, 5, 4)

# Specify median as central tendency:
ensemble_fit <- submodels_2_tbl %>%
  filter(.model_id %in% submodels_2_ids_to_keep) %>%
  ensemble_average(type = "median")

# Put into modeltime table:
model_ensemble_tbl <- modeltime_table(ensemble_fit)
Let's take our prepared data from earlier and refit the ensemble model to it:
# Perform some cleaning on the value column:
data_prepared_tbl_cleaned <- data_prepared_tbl %>%
  group_by(product) %>%
  mutate(value = ts_clean_vec(value)) %>%
  ungroup()

# Refit on cleaned table:
model_ensemble_refit_tbl <- model_ensemble_tbl %>%
  modeltime_refit(data_prepared_tbl_cleaned)
Finally, now that we have fit the model to the data, we can calibrate on the testing data, forecast on the future_tbl data frame, and use conformal prediction intervals with the split method:
final_forecast <- model_ensemble_refit_tbl %>%
  # Calibrate on test:
  modeltime_calibrate(testing(splits), id = "product") %>%
  # Forecast into the future:
  modeltime_forecast(
    new_data = future_tbl,
    actual_data = data_prepared_tbl_cleaned,
    conf_method = "conformal_split",  # Split Conformal Method
    conf_by_id = TRUE,
    keep_data = TRUE
  ) %>%
  # Convert back to original scale:
  mutate(
    .value = expm1(.value),
    value = expm1(value),
    .conf_lo = expm1(.conf_lo),
    .conf_hi = expm1(.conf_hi)
  )
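For intuition, the split conformal method essentially widens each point forecast by a quantile of the absolute residuals observed on the calibration (test) set. The sketch below is an illustration of that idea only, with hypothetical vectors, not the internals of modeltime_forecast():

# Illustration of split-conformal intervals (calibration_residuals and point_forecast are hypothetical):
alpha <- 0.05
resid_q <- quantile(abs(calibration_residuals), probs = 1 - alpha)  # residual quantile from the calibration set
lower <- point_forecast - resid_q
upper <- point_forecast + resid_q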
Now that we have our final predictions at an 18 week horizon, we can visualise the results:
# Set the unique products:products <-unique(final_forecast$product)# Create a highcharter time series:highchart(type ="stock") %>% { hc <- .for(p in products) { product_data <- final_forecast %>%filter(product == p) color <-switch(p,"Beans (Haricot)"="#3366CC","Maize Grain (White)"="#109618", "Rice (Milled)"="#990099","Wheat Flour"="#FF9900","Wheat Grain"="#3B3B3B") hc <- hc %>%hc_add_series(data = product_data %>%mutate(low = .conf_lo,high = .conf_hi),type ="line",name = p,color = color,id =paste0("line-", p),visible =ifelse(p =="Wheat Grain", TRUE, FALSE),tooltip =list(pointFormat =paste0("<b>{series.name}</b>: {point.y:.2f}<br/>","Lower Confidence Interval: {point.low:.2f}<br/>","Upper Confidence Interval: {point.high:.2f}" ) ),hcaes(x = .index, y = .value)) %>%hc_add_series(data = product_data,type ="arearange",linkedTo =paste0("line-", p),color =hex_to_rgba(color, 0.2),visible =ifelse(p =="Wheat Grain", TRUE, FALSE),showInLegend =FALSE,fillOpacity =0.3,lineWidth =0,enableMouseTracking =FALSE,hcaes(x = .index, low = .conf_lo, high = .conf_hi)) } hc} %>%hc_legend(enabled =TRUE, layout ="vertical",align ="right",verticalAlign ="top",floating =TRUE,x =-500,y =20,backgroundColor ="rgba(255, 255, 255, 0.9)",borderColor ="#999999",borderWidth =1,title =list(text ="Commodity",style =list(fontWeight ="bold")))%>%hc_title(text ="Predicted Food Prices in Addis Ababa, Merkato",align ="center") %>%hc_yAxis(title =list(text ="Price (ETB)/Kg"),opposite =FALSE) %>%hc_xAxis(title =list(text =""), type ="datetime",dateTimeLabelFormats =list(month ="%b %Y",year ="%Y" )) %>%hc_navigator(enabled =TRUE) %>%hc_scrollbar(enabled =FALSE) %>%hc_rangeSelector(buttons =list(list(type ='month', count =1, text ='1m'),list(type ='month', count =3, text ='3m'),list(type ='month', count =6, text ='6m'),list(type ='ytd', text ='YTD'),list(type ='year', count =1, text ='1y'),list(type ='all', text ='All') ),selected =5,inputEnabled =TRUE,inputPosition =list(align ='right') ) %>%hc_plotOptions(series =list(marker =list(enabled =FALSE ) ))
Looking at feature importance for the tree-based models, the 8 and 4 week rolling averages of the 52 week lag appear to be the most important variables.
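The importance calculation itself is not shown above. One way to inspect gain and frequency for the fitted xgboost workflow is sketched below; this assumes the wflw_fit_xgboost object from earlier and is an illustration rather than the exact code used:

# Sketch: extract gain/frequency-based importance from the xgboost workflow.
library(workflows)
library(xgboost)

xgb_booster <- wflw_fit_xgboost %>%
  extract_fit_parsnip() %>%   # pull the parsnip model out of the workflow
  purrr::pluck("fit")         # the underlying xgb.Booster object

xgboost::xgb.importance(model = xgb_booster) %>% head(10)  # Gain, Cover, Frequency per feature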
Conclusion and Future Project Directions
Conclusion
This project applies data science methods to time series forecasting in the humanitarian space, namely food commodity prices in Ethiopia. The data was retrieved with the rhdx package, which pulls from the Humanitarian Data Exchange portal via its API. Once retrieved, the data was explored, uncovering data quality issues, time series patterns and geographical patterns. We also used typical time series techniques, seasonal decomposition and the ACF and PACF curves, to get more of a feel for the temporal elements of the data. Next, we honed in on the Addis Ababa, Merkato market and modeled each of the commodities in that market. A few general models were built and their performance tested. We then implemented time series cross validation and tuned an xgboost and a random forest model, which gave a more robust sense of each model's performance. Finally, we built an ensemble model from the best performing models to predict each commodity at an 18 week horizon, and briefly examined feature gain and frequency. All in all, this project is a good example of how to use data science forecasting methods in a humanitarian context.
Future Project Directions
Whilst this project produces reasonable forecasts, there is still room for improvement. One potential avenue for exploration would be to experiment with different feature engineering steps; this might include adding other time series features we did not consider, or adjusting the number of lags computed. Similarly, we did not consider any exogenous variables that would undoubtedly affect prices, such as droughts, wars, or other factors deemed to have an effect on price; incorporating them would require a significant data collection effort but should further improve accuracy. Another improvement would be to extend the forecasting to more region and market combinations, perhaps even internationally to include markets in other countries. Given the data quality issues in this data set, scaling up would likely surface even more quality issues and should not be taken lightly. Finally, we may wish to try different models, as the ones used in this project are by no means an exhaustive list; different model architectures may lead to further performance gains.