Predicting Median Property Value - Los Angeles County
Executive Summary
Accurately predicting property value has long been an area of interest to many parties, from real estate companies to government bodies. The scope of this project is property value in Los Angeles County. First the data is retrieved using the tidycensus API and briefly explored. Second, the log-transformed median property value is modeled using both linear regression and geographically weighted regression. Finally, the findings are summarised and future research directions are proposed, with an emphasis on spatial ML techniques. Most importantly, this project sheds light on some of the challenges that arise with spatial data and how to address them.
Data Collection and Exploration
Data Exploration
The data used for this project comes from the tidycensus package, which enables users to interact with US Census Bureau data via an API. Once you have requested an API key, you can retrieve the data by passing it to the key argument in get_acs().
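Alternatively, the key can be stored once per machine; a minimal sketch, assuming you have already requested a key from the Census Bureau (the key string below is a placeholder):
# store the Census API key in .Renviron so future sessions
# (and get_acs() calls) pick it up automatically
library(tidycensus)
census_api_key("YOUR_KEY_HERE", install = TRUE)  # placeholder key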
Show the code
# set county variable
la_counties <- c("Los Angeles")
# list of census variables to retrieve
variables <- c(
  median_value = "B25077_001",
  median_rooms = "B25018_001",
  median_income = "DP03_0062",
  total_population = "B01003_001",
  median_age = "B01002_001",
  pct_college = "DP02_0068P",
  pct_foreign_born = "DP02_0094P",
  pct_white = "DP05_0077P",
  pct_black = "DP05_0078P",
  pct_hispanic = "DP05_0070P",
  pct_asian = "DP05_0080P",
  median_year_built = "B25037_001",
  percent_ooh = "DP04_0046P"
)
# get ACS data and transform to a projected CRS in US feet
# (note: EPSG 2263 is the NY Long Island state plane; the matching
# Southern California zone would be EPSG 2229)
la_census_data <- get_acs(
  geography = "tract",
  variables = variables,
  state = "CA",
  county = la_counties,
  geometry = TRUE,
  output = "wide",
  year = 2020,
  key = "your_key") %>% # pass your API key here
  st_transform(2263)
Our data is now assigned to la_census_data; let's take a quick look at it:
Show the code
glimpse(la_census_data)
Rows: 2,498
Columns: 29
$ GEOID <chr> "06037101110", "06037101122", "06037101220", "06037…
$ NAME <chr> "Census Tract 1011.10, Los Angeles County, Californ…
$ median_valueE <dbl> 502200, 658800, 550800, 461600, 378400, 643200, 554…
$ median_valueM <dbl> 51606, 21673, 39781, 75638, 93996, 24681, 55738, 50…
$ median_roomsE <dbl> 4.6, 5.7, 4.1, 3.3, 3.7, 5.3, 5.2, 5.0, 5.5, 4.6, 5…
$ median_roomsM <dbl> 0.3, 0.5, 0.3, 0.6, 0.3, 0.3, 0.5, 0.3, 0.4, 0.4, 0…
$ total_populationE <dbl> 3923, 4119, 3775, 3787, 2717, 3741, 3246, 1920, 418…
$ total_populationM <dbl> 460, 858, 474, 651, 442, 478, 534, 495, 612, 336, 6…
$ median_ageE <dbl> 44.3, 52.2, 42.2, 40.5, 39.3, 52.6, 46.9, 37.7, 45.…
$ median_ageM <dbl> 2.6, 2.6, 5.0, 6.1, 3.0, 3.9, 5.7, 2.8, 7.3, 4.7, 4…
$ median_year_builtE <dbl> 1958, 1964, 1959, 1975, 1976, 1958, 1962, 1958, 197…
$ median_year_builtM <dbl> 3, 3, 5, 5, 9, 2, 5, 3, 2, 6, 5, 2, 3, 4, 3, 3, 13,…
$ median_incomeE <dbl> 74625, 93125, 55682, 46274, 30016, 87066, 66210, 59…
$ median_incomeM <dbl> 27799, 15209, 9264, 13226, 9714, 27204, 28061, 1590…
$ pct_collegeE <dbl> 27.8, 34.1, 26.2, 24.0, 17.1, 33.2, 30.9, 35.7, 41.…
$ pct_collegeM <dbl> 5.2, 9.8, 7.2, 9.8, 7.4, 7.1, 6.4, 8.1, 8.5, 7.1, 8…
$ pct_foreign_bornE <dbl> 34.8, 34.6, 44.4, 50.9, 55.7, 45.0, 34.6, 42.8, 37.…
$ pct_foreign_bornM <dbl> 6.4, 9.9, 8.7, 9.2, 9.6, 6.4, 7.5, 7.9, 7.6, 7.8, 5…
$ pct_whiteE <dbl> 61.0, 76.7, 43.9, 53.5, 54.0, 80.4, 69.5, 75.0, 68.…
$ pct_whiteM <dbl> 7.7, 8.1, 7.4, 10.4, 13.3, 6.9, 8.4, 9.7, 8.6, 5.6,…
$ pct_blackE <dbl> 0.4, 3.6, 2.5, 5.6, 0.1, 4.6, 0.4, 1.1, 2.0, 0.6, 1…
$ pct_blackM <dbl> 0.5, 3.5, 2.8, 4.5, 0.5, 5.5, 0.6, 1.7, 1.7, 0.9, 1…
$ pct_hispanicE <dbl> 3923, 4119, 3775, 3787, 2717, 3741, 3246, 1920, 418…
$ pct_hispanicM <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ pct_asianE <dbl> 10.3, 10.3, 10.3, 7.1, 2.4, 3.7, 9.1, 12.6, 11.5, 4…
$ pct_asianM <dbl> 3.9, 4.3, 6.0, 4.2, 5.1, 2.7, 4.7, 7.5, 5.1, 2.8, 5…
$ percent_oohE <dbl> 58.3, 74.9, 42.7, 20.4, 11.4, 86.8, 69.8, 57.9, 78.…
$ percent_oohM <dbl> 7.6, 10.4, 7.9, 9.2, 8.3, 5.6, 9.5, 14.6, 7.5, 11.9…
$ geometry <MULTIPOLYGON [US_survey_foot]> MULTIPOLYGON (((-11921121…
The dataset has 29 columns: an ID and name for each tract, estimate (E) and margin-of-error (M) pairs for each variable, and a spatial geometry column. Since we will be predicting the median home value, let's map it, focusing on its spatial distribution:
Show the code
# create plot
(la_median_value_hist_tm <- la_census_data[!st_is_empty(la_census_data),,drop=F] %>%
tm_shape() +
tm_polygons(col = "median_valueE",
palette = "plasma",
title = "2016-2020 ACS",
legend.hist = TRUE,
legend.format = scales::dollar_format()) +
tm_layout(main.title = "LA Median Home Value by Census Tract",
frame = FALSE,
legend.outside = TRUE,
bg.color = "grey100",
legend.hist.width = 5,
fontfamily = "Verdana"))
The map shows that the properties that are higher in value are predominantly located in the west of the county, in the Calabasas, Malibu and Topanga regions. The median value of some of the properties in these regions is $1.5M or more. Looking at the distribution of values as a whole, the majority of homes are under $1M, as shown by the dark blue and purple shaded regions. The distribution is right-skewed, with a long right tail made up of expensive homes. This might cause issues when we model the price; we therefore apply a log transformation to the median home value variable:
Show the code
# create plot log transformation
(la_median_value_log_hist_tm <- la_census_data[!st_is_empty(la_census_data),,drop=F] %>%
mutate(log_med_value = log(median_valueE)) %>%
tm_shape() +
tm_polygons(col = "log_med_value",
palette = "cividis",
title = "2016-2020 ACS (log)",
legend.hist = TRUE) +
tm_layout(main.title = "LA Median Home Value by Census Tract",
frame = FALSE,
legend.outside = TRUE,
bg.color = "grey100",
legend.hist.width = 5,
fontfamily = "Verdana"))
As the figure above demonstrates, the skewness of the distribution is minimised; however, there are still some values that skew the distribution, and we can consider these outliers.
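To put a number on this, we can compare a rough skewness measure before and after the transform; a quick sketch (the skew helper below is ad hoc, not from a package):
# ad hoc sample skewness: third central moment scaled by sd^3
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
vals <- la_census_data$median_valueE[!is.na(la_census_data$median_valueE)]
skew(vals)       # strongly positive for the raw values
skew(log(vals))  # much closer to zero after the log transform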
Data Preparation
Prior to modeling we need to clean the dataset: NA values are removed, along with the margin-of-error columns, which are no longer needed. We also calculate population density and the number of years since the median property in each tract was built:
Show the code
la_census_prepped <- la_census_data %>%
  mutate(pop_density = as.numeric(set_units(total_populationE / st_area(.),
                                            "1/km2")),
         median_building_age = 2020 - median_year_builtE) %>%
  select(-ends_with("M")) %>%
  # strip the trailing "E" from the estimate columns, leaving NAME intact
  rename_with(.fn = ~str_remove(.x, "E$"), .cols = -c(GEOID, NAME)) %>%
  drop_na()
Now our dataset is ready for modeling.
Modeling
Linear Regression
The first model to be fitted is a simple linear regression with the log-transformed median home value as the dependent variable:
Show the code
model_1 <- lm(formula = log(median_value) ~ median_rooms + median_income +
                pct_college + pct_foreign_born + pct_white + pct_black +
                pct_hispanic + pct_asian + percent_ooh + median_age +
                median_building_age + pop_density,
              data = la_census_prepped)
model_1 %>%
summary()
Call:
lm(formula = log(median_value) ~ median_rooms + median_income +
pct_college + pct_foreign_born + pct_white + pct_black +
pct_hispanic + pct_asian + percent_ooh + median_age + median_building_age +
pop_density, data = la_census_prepped)
Residuals:
Min 1Q Median 3Q Max
-2.05134 -0.11764 0.01222 0.13243 0.83967
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.225e+01 5.789e-02 211.644 < 2e-16 ***
median_rooms 3.456e-03 1.007e-02 0.343 0.7315
median_income 4.878e-06 3.356e-07 14.537 < 2e-16 ***
pct_college 1.319e-02 6.517e-04 20.234 < 2e-16 ***
pct_foreign_born 5.158e-03 6.376e-04 8.089 9.59e-16 ***
pct_white -1.578e-03 4.900e-04 -3.221 0.0013 **
pct_black -1.197e-03 5.481e-04 -2.184 0.0291 *
pct_hispanic -3.602e-06 3.895e-06 -0.925 0.3552
pct_asian -3.875e-03 5.231e-04 -7.407 1.80e-13 ***
percent_ooh -5.023e-03 4.700e-04 -10.686 < 2e-16 ***
median_age 1.187e-02 1.234e-03 9.622 < 2e-16 ***
median_building_age 8.043e-05 1.096e-05 7.336 3.03e-13 ***
pop_density -8.681e-06 2.021e-06 -4.294 1.82e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2425 on 2313 degrees of freedom
Multiple R-squared: 0.7043, Adjusted R-squared: 0.7028
F-statistic: 459.1 on 12 and 2313 DF, p-value: < 2.2e-16
The variables that do not have a statistically significant effect are median_rooms and pct_hispanic. Let’s filter for only the statistically significant variables and explore the scale of their effect:
Show the code
model_1 %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  arrange(estimate) %>%
  mutate(flag = ifelse(estimate > 0, "Positive", "Negative")) %>%
  filter(p.value <= 0.05) %>%
  hchart("column", hcaes(x = term, y = estimate, group = flag), pointWidth = 30) %>%
  hc_colors(c("indianred", "darkblue")) %>%
  hc_title(text = "Effect of Regression Terms")
The percent_ooh variable (% of housing units in the tract occupied by their owner) has the largest negative effect, with a coefficient of -0.005. In contrast, pct_college (% of the population aged 25+ with a four-year degree) has the largest positive coefficient, at 0.013. pop_density, median_income and median_building_age have marginal effects and are therefore barely visible; selecting the positive or negative tooltips makes them easier to see. For terms with a positive estimate, an increase in their value leads to an increase in the dependent variable, i.e. the log of the median home value; the opposite applies to terms with a negative estimate. Because linear regression is a parametric model, its assumptions have to be checked. One of these assumptions is that the errors are independent; however, the spatial relationships in the dataset will likely mean this is violated. In other words, spatial autocorrelation in the data will violate the model assumptions, implying that the performance of the model depends directly on location.
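Since the response is the log of median value, a coefficient b translates to roughly a (exp(b) - 1) x 100% change in median value per one-unit increase in the predictor. A sketch for pct_college:
# convert the log-scale coefficient into an approximate percentage effect
b <- coef(model_1)["pct_college"]
(exp(b) - 1) * 100  # roughly 1.3% higher median value per extra point of pct_college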
It would be helpful to understand the extent to which the values are correlated:
Show the code
la_estimates <- la_census_prepped %>%
  select(-GEOID, -NAME, -median_value, -median_year_built, -total_population) %>%
  st_drop_geometry()
# get correlations
correlations <- correlate(la_estimates, method = "pearson")
# plot correlations
network_plot(correlations,
             min_cor = .4)
As expected, many of the variables are strongly correlated with one another. pct_white (% of the population identifying as non-Hispanic white alone) and median_income are strongly positively correlated. In contrast, there is a negative correlation between pct_foreign_born and median_income, and pop_density and percent_ooh exhibit a strong negative correlation. The extent to which these correlations are problematic can be checked using Variance Inflation Factors, where a value greater than 5 is commonly taken to indicate problematic collinearity:
Show the code
regclass::VIF(model_1) %>%
  tidy() %>%
  mutate(flag = ifelse(x <= 5, "Blue", "Red")) %>%
  ggplot(aes(reorder(names, x), x, color = flag)) +
  geom_point(shape = 3, size = 3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        legend.position = "none") +
  xlab(label = "") +
  ylab(label = "VIF value") +
  scale_y_continuous(breaks = c(2, 4, 6, 8, 10), limits = c(0, 11)) +
  scale_color_manual(values = c("midnightblue", "red")) +
  labs(title = "Variance Inflation Factors by Variable")
The figure above shows that median_rooms, median_income, percent_ooh, pct_white and pct_college have values above 5, indicating problematic multicollinearity among these predictors. A separate concern is whether the residuals are spatially dependent, which would violate the independence assumption and make model performance depend on geographic location; this can be assessed with Moran's I test. In order to perform Moran's test, we need to create a spatial weights matrix, which is done in the following section.
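As a sanity check, a VIF can also be computed by hand: regress one predictor on the others and take 1 / (1 - R^2). A sketch for median_income (the result should roughly match the regclass output):
# auxiliary regression of median_income on the remaining predictors
aux <- lm(median_income ~ median_rooms + pct_college + pct_foreign_born +
            pct_white + pct_black + pct_hispanic + pct_asian + percent_ooh +
            median_age + median_building_age + pop_density,
          data = la_census_prepped)
1 / (1 - summary(aux)$r.squared)  # VIF for median_income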
Spatial Weights Matrix
In order to create the matrix, we first need to generate a neighbours list. The method used is contiguity-based neighbours, selected because the geometry features are polygons. Since we consider all polygons that share at least a vertex to be neighbours, we set the queen argument to TRUE:
Show the code
# Make sure zero policy is on:
set.ZeroPolicyOption(check = TRUE)
[1] FALSE
Show the code
la_nb <- poly2nb(la_census_prepped, row.names = NULL, snap=sqrt(.Machine$double.eps),
queen=TRUE, useC=TRUE, foundInBox=NULL)
la_nb
Neighbour list object:
Number of regions: 2326
Number of nonzero links: 14112
Percentage nonzero weights: 0.260837
Average number of links: 6.067068
2 regions with no links:
572 1939
The neighbours object shows that there are 2,326 census tracts in Los Angeles. Nonzero links make up about 0.26% of the full n x n weights matrix, the average number of links per tract is 6.07, and there are two regions with no links.
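The two isolated tracts can be identified directly from the neighbours list; a small sketch using spdep::card(), which returns the number of neighbours per region:
# regions whose neighbour count is zero
no_links <- which(spdep::card(la_nb) == 0)
la_census_prepped$NAME[no_links]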
Show the code
la_geo <- st_geometry(la_census_prepped)
la_geo_centroid <- st_centroid(la_geo)
plot(la_geo,
main = "",
sub = "Neigborhood matrix - LA",
reset = FALSE,
cex.main = 3,
cex.sub = 2)
plot(la_nb, la_geo_centroid,
add = TRUE,
col = 2,
lwd = 1.5)
The figure above demonstrates the connections between the census tracts, configured using the Queen method. With this we can now create the spatial weights matrix:
Show the code
wts <- spdep::nb2listw(la_nb, zero.policy = TRUE)
wts
Characteristics of weights list object:
Neighbour list object:
Number of regions: 2326
Number of nonzero links: 14112
Percentage nonzero weights: 0.260837
Average number of links: 6.067068
2 regions with no links:
572 1939
Weights style: W
Weights constants summary:
n nn S0 S1 S2
W 2324 5400976 2324 810.4336 9549.577
Now that we have the weights list object, Moran's I test can be performed:
Show the code
la_census_prepped$residuals <- residuals(model_1)
spdep::moran.test(la_census_prepped$residuals,wts)
Moran I test under randomisation
data: la_census_prepped$residuals
weights: wts n reduced by no-neighbour observations
Moran I statistic standard deviate = 25.842, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.3152516973 -0.0004304778 0.0001492325
The standard deviate of 25.842 is exceptionally large, providing strong evidence against the null hypothesis of spatial randomness. The Moran's I statistic (0.315) is well above its expectation (-0.0004), indicating that neighbouring areas are more similar to each other than would be expected under spatial randomness. Given these outputs, we can conclude that there is significant positive spatial autocorrelation in the residuals: observed values are more similar to their neighbours than one would expect by chance alone. We can see this by plotting the residuals against their spatial lag:
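As a robustness check, a Monte Carlo (permutation) version of the test could also be run, avoiding the analytical null distribution; a sketch:
# permutation-based Moran's I: the residuals are shuffled nsim times
# to build an empirical null distribution
spdep::moran.mc(la_census_prepped$residuals, wts,
                nsim = 999, zero.policy = TRUE)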
Show the code
la_census_prepped$lagged_residuals <- lag.listw(wts, la_census_prepped$residuals)
p1 <- la_census_prepped %>%
  ggplot(aes(x = residuals, y = lagged_residuals)) +
  theme_minimal() +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Lagged Residuals vs. Residuals")
plotly::ggplotly(p1)
The plot above demonstrates that the assumption of independent errors is violated: there is spatial autocorrelation across the census tracts. We can try to account for this using models from the spatial regression family.
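One member of that family is the spatial error model, which adds a spatially autocorrelated error term. A minimal sketch using the spatialreg package (not fitted in this project, and shown with a reduced predictor set for brevity):
library(spatialreg)
# spatial error model: the error term follows lambda * W * u + e
sem_mod <- errorsarlm(log(median_value) ~ median_income + pct_college +
                        percent_ooh + median_age,
                      data = la_census_prepped, listw = wts,
                      zero.policy = TRUE)
summary(sem_mod)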
Geographically Weighted Regression
Geographically weighted regression is a spatial analysis technique that allows the relationship between variables to vary across space. Whilst traditional regression estimates a single global model across all observations, the geographic variant takes spatial location into account, learning a distinct local model at each data point through location-dependent parameters. In doing so it seeks to resolve spatial heterogeneity. However, one drawback is that geographically weighted regression models can be more affected by local multicollinearity.
Key to generating local models is the concept of a kernel bandwidth: with the help of a kernel and a distance-decay function, a local model is developed for each census tract, as sketched below.
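The bi-square kernel used below can be written as a plain R function to make the decay explicit; a sketch (b is the bandwidth, d the distance to the regression point):
# bi-square kernel: smooth decay inside the bandwidth, zero outside it
bisquare <- function(d, b) ifelse(d < b, (1 - (d / b)^2)^2, 0)
bisquare(c(0, 0.5, 0.9, 1.2), b = 1)  # 1.0000 0.5625 0.0361 0.0000
Let's now model our data using a geographically weighted regression: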
Show the code
# Convert the data to a spatial object:
la_census_data_prep_sp <- la_census_prepped %>%
as_Spatial()
# Define formula:
formula <- "log(median_value) ~ median_rooms + median_income +
pct_college + pct_foreign_born + pct_white + pct_black + pct_hispanic +
  pct_asian + median_age + percent_ooh + median_building_age + pop_density"
Show the code
# Select bandwidth:
bw <- bw.gwr(
formula = formula,
data = la_census_data_prep_sp,
kernel = "bisquare",
adaptive = TRUE
)
bw
[1] 500
500 was the number of nearest neighbors selected by cross-validation. In other words, for each census tract, the nearest 500 tracts in the LA region will be used to estimate its local model, with weights calculated using the bi-square distance-decay function. Let's run the model with the optimal bandwidth:
Show the code
gw_mod <- gwr.basic(
formula = formula,
data = la_census_data_prep_sp,
bw = bw,
kernel = "bisquare",
adaptive = TRUE
)
gw_mod$GW.diagnostic$gw.R2
[1] 0.841199
The R^2 value of the geographically weighted regression is 0.84, up from 0.70 for the global regression. Let's visualise the output:
Show the code
# Add gw model to sp object
la_census_prepped$gwmodel <- gw_mod$SDF$Local_R2
tm_shape( la_census_prepped %>%
as_Spatial())+
tm_fill("gwmodel",palette = "-viridis",title = "Local R^2")+
tm_scale_bar(position = c(0.01, 0),
just = "left") +
tm_compass(position = c("right", "top"))+
tm_borders("black",alpha =.5)+
tmap_mode("view")
The plot above shows that the neighborhoods in the northern region of LA have higher R^2 values, as does the southwest. In contrast, some areas have lower values, as shown by the yellow and green shading; these are mostly concentrated in the Santa Clarita, San Fernando and Burbank areas. Let's iterate on the previous model using the spgwr implementation:
Show the code
bw2 <- gwr.sel(formula = formula,
               data = la_census_data_prep_sp,
               adapt = TRUE,
               gweight = gwr.bisquare,
               method = "cv",
               verbose = FALSE)
Show the code
# run model
spgwr_model <- gwr(formula = formula,
data = la_census_data_prep_sp,
adapt = bw2,
gweight = gwr.bisquare,
hatmatrix = TRUE)
la_census_prepped$spgwr <- spgwr_model$SDF$localR2
# Plot output:
la_prepped_sf <- sf::st_as_sf(la_census_data_prep_sp)
la_prepped_sf$spgwr <- spgwr_model$SDF$localR2
tm_shape(la_prepped_sf)+
tm_fill("spgwr",palette = "-viridis",title = "Local R^2")+
tm_scale_bar(position = c(0.01, 0),
just = "left") +
tm_compass(position = c("right", "top"))+
tm_borders("black")+
tmap_mode("view")
This model performs differently: census tracts with high local R^2 are less prevalent than in the previous model. In fact, most of the tracts with a local R^2 between 0.9 and 1.0 are concentrated in the Rancho Palos Verdes and Redondo Beach areas. Let's compare the models side by side:
Show the code
# pivot then plot
la_census_prepped %>%
st_as_sf() %>%
pivot_longer(cols = c("spgwr", "gwmodel")) %>%
ggplot(aes(fill = value)) +
geom_sf(color = NA) +
scale_fill_viridis_c() +
ggthemes::theme_map(base_size = 8) +
facet_wrap(~name) +
labs(title = "Local R2 of Geographically Weighted Models",
fill = "R^2") +
theme(plot.title.position = 'plot',
plot.title = element_text(hjust = 0.5,
vjust = 3,
size = 20))
The plot above shows that the GWmodel implementation achieves higher local R^2 than the spgwr one across most of the county, with both models performing best in the southwestern regions of LA.
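The visual impression can be confirmed numerically by averaging each local R^2 surface; a quick sketch:
# mean local R^2 for each implementation
la_census_prepped %>%
  st_drop_geometry() %>%
  summarise(mean_gwmodel = mean(gwmodel),
            mean_spgwr = mean(spgwr))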
Conclusion and Future Research Directions
This project explored spatial data, focusing on property value at the census tract level in LA. Spatial data presents some interesting challenges when modeling, particularly spatial autocorrelation. After briefly exploring the data, the log transformation of the property value was modeled: first a simple linear model was built, then two variants of geographically weighted regression were implemented. The GWmodel implementation performed better than the spgwr implementation, with an average local R^2 of 0.81 versus 0.71 for spgwr.
Future research efforts may wish to add temporal depth to the data, that is, retrieve data from prior years, allowing models to be trained, tested and validated over time. Furthermore, non-parametric spatial models could be used, such as a geographically weighted implementation of random forest; a sketch of that idea follows.
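As an illustration of that last direction, a heavily hedged sketch of a geographically weighted random forest using the SpatialML package (the grf() interface is assumed from its documentation and should be checked before use):
library(SpatialML)
# coordinates of tract centroids, used to localise each forest
coords <- st_coordinates(st_centroid(st_geometry(la_census_prepped)))
dat <- la_census_prepped %>%
  mutate(log_med_value = log(median_value)) %>%
  st_drop_geometry()
# adaptive bandwidth of 500 neighbours, mirroring the GWR models above
gwrf <- grf(log_med_value ~ median_income + pct_college + percent_ooh,
            dframe = dat, bw = 500, kernel = "adaptive", coords = coords)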