Using machine learning to study heterogeneity in the adoption of clean technologies in neighbourhoods

  • Author:
  • Year: 2024

Keywords: N.A.

The adoption of clean energy technologies in households is crucial in combating climate change. This energy
transition requires large investments from homeowners. One way to stimulate and support
homeowners in the transition is through community-led initiatives to collectively purchase clean home technologies.
Neighbourhood dynamics play an important role in the adoption patterns of households and neighbourhoods,
driven by factors like peer influence and socio-economic disparities. Understanding these dynamics is essential for effective policy implementation and targeted interventions. This thesis explores econometric and
machine learning techniques to find out why some neighbourhoods are more successful than others in activating
homeowners to adopt, and how this is related to heterogeneity (i.e. differences) between residents and
neighbourhoods. The research question of this thesis is therefore: Which econometric or machine learning
technique is the preferred method to gain insight into the heterogeneity within and between various demographic groups in the adoption of clean energy technologies in neighbourhoods? Alongside this research question,
four sub-questions are formulated, including: (1) How can econometric and machine learning techniques be utilized to
identify and explain the heterogeneity among various demographic groups in the adoption of clean energy
technologies in neighbourhoods? (2) What are the comparative strengths and weaknesses of econometric and
machine learning techniques in explaining heterogeneity among various demographic groups in the adoption
of clean energy technologies in neighbourhoods? (3) How does the predictive accuracy of econometric and
machine learning techniques compare in the context of clean energy technology adoption, as assessed by
evaluation metrics?
To answer the research question and sub-questions, first, a literature review was conducted. This review
resulted in a conceptual model and a list of relevant variables that influence energy use and the adoption
of clean technologies. These variables comprise dwelling, household and neighbourhood characteristics that may cause heterogeneity in adoption probability. They are expected to
affect the decision to apply an energy-efficient measure; the size of these effects is studied empirically.
Building and testing empirical models requires data. For this thesis, data from the foundation
Buurkracht was combined with postal-code-level data from Statistics Netherlands. Buurkracht
is a foundation that supports collective purchases in various Dutch communities. The data from Buurkracht
contains information about 74 thousand households across 82 communities. The data from Statistics Netherlands includes information on dwelling characteristics, gas use and electricity use.
Three techniques for explaining heterogeneity in adoption probability were compared: logistic regression,
random forest and causal forest. Logistic regression is user-friendly, requires little computational power and gives information on the direction and importance of the variables. However, the method
is prone to overfitting, relies on many assumptions about the data, and has difficulty capturing complex relationships. The random forest is more robust and less prone to overfitting, and also provides an
indication of the importance and direction of the effects of variables. Its results, however, are more difficult to interpret and it requires more computational power. The logic of the model cannot be extracted: it is a
black box model. The causal forest focuses on the effect of a treatment, defined here as the
presence of an initiator of the neighbourhood approach within 200 meters. This makes direct comparisons
with the other models challenging, since those models do not incorporate a similar treatment definition. The
inclusion of the treatment makes it possible to investigate the heterogeneity around this one specific variable. The
causal forest is more complex and requires more computational power. A Poisson regression was also used. Because
Poisson regression is intended for count variables and was applied here at a different level (community instead of household), it was not compared to the other techniques.
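As a reading aid, the three model families can be written in generic textbook form. The notation below is an assumption introduced here, not taken from the thesis: Y_i is the adoption indicator of household i, x_i its covariates, d_i the distance to the nearest initiator, W_i the treatment indicator, and N_c a community-level count (what exactly is counted is not specified in this summary). The thesis's actual specifications may differ.

```latex
% Logistic regression: adoption probability of household i
\[
  P(Y_i = 1 \mid x_i) = \frac{1}{1 + \exp\!\left(-(\beta_0 + x_i^{\top}\beta)\right)}
\]
% Causal forest: treatment indicator and conditional average treatment effect
\[
  W_i = \mathbf{1}\{d_i \le 200\ \text{m}\}, \qquad
  \tau(x) = \mathbb{E}\left[\, Y_i(1) - Y_i(0) \mid X_i = x \,\right]
\]
% Poisson regression at community level c
\[
  \mathbb{E}[N_c \mid z_c] = \exp\!\left(\gamma_0 + z_c^{\top}\gamma\right)
\]
```

Heterogeneity in the causal forest corresponds to variation of \tau(x) over x, whereas the logistic and Poisson models express heterogeneity through the covariate coefficients.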
Empirical application of the models unfortunately produces inconclusive results: the models do not
generate consistent results on the impact of variables on adoption and fail to predict adoption
probabilities well. The results of the logistic regression generally aligned best with intuitive expectations. The machine learning models, random forest and causal forest, predict non-adoption well
but fail to predict adoption. An investigation of possible reasons for the poor model performance identified the problem
of an imbalanced dataset: the data contains many households that do not adopt a technology
and very few households that do adopt a clean energy technology. A combination of over- and
undersampling was used to tackle the problem, but this did not improve performance. At
the same time, the parametric logistic and Poisson regressions suggest that there is a large unobserved
community-specific variance that complicates prediction.
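The idea of combining over- and undersampling can be sketched as follows. This is a minimal illustration using plain random resampling; the summary does not state which resampling methods the thesis used, so this should not be read as the author's procedure. The function name `rebalance`, the column name `adopted` and the default target share of 0.2 are hypothetical choices made for the example.

```python
import random


def _resize(group, size, rng):
    """Shrink a group by sampling without replacement, or grow it with replacement."""
    if size <= len(group):
        return rng.sample(group, size)                       # undersampling
    return group + rng.choices(group, k=size - len(group))   # oversampling


def rebalance(rows, label_key="adopted", target_share=0.2, seed=0):
    """Return a dataset of the same size in which adopters make up target_share.

    Assumption: each row is a dict with a 0/1 label under label_key.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    n_pos = round(target_share * len(rows))
    n_neg = len(rows) - n_pos
    resampled = _resize(positives, n_pos, rng) + _resize(negatives, n_neg, rng)
    rng.shuffle(resampled)
    return resampled


# Toy data (invented for illustration): 5 adopters among 100 households.
households = [{"adopted": 1}] * 5 + [{"adopted": 0}] * 95
balanced = rebalance(households)
print(sum(r["adopted"] for r in balanced), "adopters out of", len(balanced))
```

Random oversampling only duplicates existing adopters, so it adds no new information about them, which is one possible reason why rebalancing alone may not improve predictive performance.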
The predictive performance of the models has been compared using five evaluation metrics: accuracy, precision,
recall, specificity and F1 score. The F1 score is the harmonic mean of precision and recall and is useful in
the case of imbalanced data. The predictive performance is tested on both the original dataset and
a dataset constructed so that 20% of households apply a measure and 80% do not. The independent variables included in the models are: gas use, electricity use, distance to initiator, living area,
dwelling value and construction year. Recall is seen as the most important metric, since it measures the share of
actual adopters that the model correctly identifies. The causal forest performs best on the recall metric, but for all
models the recall score is very low. On the newly constructed dataset (20% measure) the random forest
performs the best on all evaluation metrics.
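The five metrics follow directly from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name, the toy labels and the convention of reporting an undefined ratio as 0.0 are assumptions made for illustration and are not taken from the thesis.

```python
def classification_metrics(y_true, y_pred):
    """Return the five metrics for binary labels (1 = adopts, 0 = does not adopt)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # adopters found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-adopters found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # adopters missed

    def ratio(num, den):
        # Convention (an assumption): an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        # Harmonic mean of precision and recall.
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy data (invented for illustration): 2 adopters among 10 households,
# and a model that predicts "no adoption" for everyone.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
print(classification_metrics(y_true, always_negative))
```

In this toy case the always-negative model reaches an accuracy of 0.8 and a specificity of 1.0 while its recall and F1 score are 0.0, which illustrates why accuracy alone is misleading on imbalanced data and why recall is emphasised.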
Unfortunately, this research did not identify a method that is preferred to gain insight into the heterogeneity
in the adoption of clean energy technologies in neighbourhoods. The causal forest technique shows potential,
but its performance, too, is far from what is desired. An important limitation is the data quality. Caveats of the data
identified during the research are: class imbalance in measure adoption, a lack
of household-specific data, a large unobserved community-specific variance, the inability to measure a community
treatment effect because no data is available on households that were not involved in a community campaign,
and missing data on pre-existing measures. Another notable element missing from the current data and
models is people’s norms and values, which play an important role in shaping their choices. Incorporating
such factors into the models could improve their predictive accuracy and the understanding of the decision to
apply an energy-efficient measure. It is recommended to replicate the study with better data and to explore other
methods, since the fields of machine learning and econometrics are much broader than the
part covered in this thesis.

Kolen_0815377_ABP_Arentze_MSc_thesis