Learning Real Estate Automated Valuation Models from Heterogeneous Data Sources – Summaries in Data Science & Finance Research

Learning Real Estate Automated Valuation Models from Heterogeneous Data Sources. Bergadano et al., arXiv, 2019.

The real estate market is often used as a testbed for new machine learning methods due to the vast diversity of the industry and the richness of the potential feature set. However, real estate valuations remain difficult to predict with consistent accuracy and the industry remains reliant on the services of professional appraisers. The team at the University of Turin, Italy, have taken a step toward automating this process with a high level of accuracy using some creative feature investigation.

In this paper we propose a Web data acquisition methodology, and a Machine Learning model, that can be used to automatically evaluate real estate properties. This method uses data from previous appraisal documents, from the advertised prices of similar properties found via Web crawling, and from open data describing the characteristics of a corresponding geographical area.

The paper uses data from properties in different regions of Italy, but the insights and methods derived have wide applicability to nearly any region on the globe with a few adjustments to the data and feature set.

This data is used to train AI and machine learning models to predict the valuation of properties for use in Automated Valuation Models (AVM). These models are useful to professional appraisers and appraisal companies as they reduce both costs and the time needed for their own valuation process. They also provide quick valuation estimates to stakeholders before a professional appraiser is engaged and the cost for their time is incurred.

Designing an AVM is a tricky process due to the inherent nature of the data used:

When designing an AVM, one must consider the fact that the appraisal of a real estate property is a very difficult task, due to the high heterogeneity in both structural and geographical data. Moreover, the price can be influenced by macroeconomic factors, that obviously change over time. In fact, the creation of a model able to predict real prices needs to deal with several problems caused by the complexity and the dynamics of the real estate market, and to the difficulty of obtaining reliable and objective data.

Current Understanding of AVM Development

The authors note that recent research on AVMs mostly concentrates on features that relate directly to the characteristics of the property, such as the number of rooms, floor size, and number of bathrooms. The paper refers to these common property features as hedonic features, a term from hedonic pricing, in which a property's value is explained by the utility (literally, the pleasure) that buyers derive from its individual characteristics.

The authors build on this prior research by applying machine learning methods with hedonic features as inputs. They also add features that capture geographical information about the area surrounding the property. A Point of Interest (PoI) is defined and mapped, then linked to the target property as an additional feature based on its distance and relevance.

The Data Used

For the purpose of this research we have used three different data sources: (1) a corpus of professional and validated property appraisal documents and corresponding data base, (2) geographical and open data obtained from heterogeneous public Web sources, and (3) advertised prices for comparable properties obtained via Web crawling.

The first dataset consists of 7988 property valuations in the city of Turin, Italy, performed by professional appraisers between 2011 and 2016. The information retrieved for each property is summarized below:

The authors then clean the data by keeping only properties with registered residential use. They exclude properties with a surface area greater than 250 square meters because they are quite rare, and they exclude houses with a garage for the same reason. This last exclusion is an example of a rule that would differ greatly depending on the region of the world where the model is applied, since garages are the norm in North America.

They then narrow the dataset down further by using only houses with a global valuation between 20,000€ and 700,000€, reducing the dataset to 3983 property valuations.
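The cleanup rules above amount to a simple row filter. The following is a minimal sketch, not the authors' code: the field names and the toy records are invented for illustration, and treating both valuation bounds as inclusive is an assumption.

```python
# Hypothetical record layout; the field names are assumptions for illustration.
def keep_property(record):
    """Return True if a record passes the cleanup rules described in the text."""
    return (
        record["use"] == "residential"          # registered residential use only
        and record["surface_sqm"] <= 250        # drop rare very large properties
        and not record["has_garage"]            # drop rare properties with a garage
        and 20_000 <= record["valuation_eur"] <= 700_000  # bounds assumed inclusive
    )

# Invented toy records, one per exclusion rule plus one that passes.
appraisals = [
    {"use": "residential", "surface_sqm": 85, "has_garage": False, "valuation_eur": 180_000},
    {"use": "commercial", "surface_sqm": 120, "has_garage": False, "valuation_eur": 250_000},
    {"use": "residential", "surface_sqm": 310, "has_garage": False, "valuation_eur": 650_000},
    {"use": "residential", "surface_sqm": 95, "has_garage": True, "valuation_eur": 210_000},
    {"use": "residential", "surface_sqm": 40, "has_garage": False, "valuation_eur": 15_000},
]

cleaned = [r for r in appraisals if keep_property(r)]
print(len(cleaned))
```

Keeping the rules in one predicate makes it easy to swap a rule, such as the garage exclusion, when applying the approach to another region.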

The second dataset is the geographical data taken from the Italian Real-Estate Observatory (OMI), the governing body that divides the surface of each municipality into different ‘OMI areas’ which have the same real estate characteristics and valuation schemes. The OMI publishes the names of each OMI area, as well as the upper and lower limit of the price ranges in each area. This is used to create an ‘OMI minimum and maximum price per square meter’ metric, which is used as a feature in the upcoming models along with the OMI names.

Nearby points of interest (PoIs), which correspond to the activities and resources present in each area, are also used as features. This feature is built by counting the PoIs in an area, with each PoI given a relevance score that is weighted by its distance from the property. The paper notes that this method follows a best practice in real estate valuation, where experts consider the nearest metro, bus stops, schools, entertainment, etc. The authors group the PoIs into 13 categories such as Arts, Business&Services, etc.
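This summary does not give the weighting function that links relevance to distance. The following is a minimal sketch of a distance-weighted PoI score per category, assuming a linear decay that reaches zero at a hypothetical 1 km radius. The field names, coordinates, and relevance values are invented for illustration.

```python
import math
from collections import defaultdict

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def poi_scores(prop, pois, radius_km=1.0):
    """Sum distance-weighted relevance per PoI category within radius_km.

    The linear decay (1 - d / radius_km) is an assumption, not the paper's formula.
    """
    scores = defaultdict(float)
    for poi in pois:
        d = haversine_km(prop["lat"], prop["lon"], poi["lat"], poi["lon"])
        if d <= radius_km:
            scores[poi["category"]] += poi["relevance"] * (1 - d / radius_km)
    return dict(scores)

# Invented example: a property in central Turin and three PoIs.
prop = {"lat": 45.0703, "lon": 7.6869}
pois = [
    {"category": "Arts", "lat": 45.0703, "lon": 7.6869, "relevance": 1.0},               # same spot
    {"category": "Business&Services", "lat": 45.0748, "lon": 7.6869, "relevance": 2.0},  # about 0.5 km north
    {"category": "Arts", "lat": 45.4642, "lon": 9.1900, "relevance": 5.0},               # Milan, out of range
]
scores = poi_scores(prop, pois)
print(scores)
```

Each of the 13 categories would contribute one column to the feature vector; categories with no nearby PoI simply get a score of zero.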

Lastly, a feature set of comparable prices is built using a web crawler that captures a limited number of advertised prices for properties near the target property. This data is then used to calculate an ‘average price per square meter’ metric of comparable neighbouring properties.
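The comparables metric can be sketched in a few lines. This version assumes the average is taken over per-listing price-to-surface ratios (rather than total price over total surface), which the summary does not specify. The field names and listings are invented.

```python
def avg_price_per_sqm(comparables):
    """Mean of price / surface over advertised comparables; None if there are none."""
    ratios = [
        c["price_eur"] / c["surface_sqm"]
        for c in comparables
        if c["surface_sqm"] > 0  # skip listings with missing or invalid surface
    ]
    return sum(ratios) / len(ratios) if ratios else None

# Invented listings near a target property.
listings = [
    {"price_eur": 200_000, "surface_sqm": 100},  # 2000 EUR per square meter
    {"price_eur": 150_000, "surface_sqm": 50},   # 3000 EUR per square meter
]
print(avg_price_per_sqm(listings))
```

Returning `None` when no comparables are found leaves the decision about how to handle missing values to the modelling step.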

Machine Learning Feature Selection

Three different sets of features are tested in their machine learning algorithms:

1) Hedonic feature set – Consists of the hedonic features (standard features such as number of rooms, square meters, number of bathrooms, etc.). This includes the initial appraisal dataset, points of interest, the distance from city center, and the OMI name.
2) OMI-centered features – This approach reduces the above hedonic feature set, including only the OMI minimum and maximum price per square meter metric mentioned above, and the surface area in square meters.
3) OMI-centered features with comparable prices – Uses the same features as set 2, with the added average price per square meter of comparable properties.
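The second and third feature sets are small enough to write out as plain vectors. The following is a minimal sketch, assuming a hypothetical lookup table from OMI area name to its (minimum, maximum) price per square meter. The area name "B1" and all numbers are invented for illustration.

```python
# Hypothetical OMI table: area name -> (min, max) price per square meter in EUR.
OMI_TABLE = {"B1": (1800.0, 2600.0)}

def omi_features(prop, omi_table, comparables_avg=None):
    """Build the OMI-centered feature vector.

    Without comparables_avg this is the second feature set;
    with it, the third (OMI-centered features with comparable prices).
    """
    omi_min, omi_max = omi_table[prop["omi_area"]]
    features = [omi_min, omi_max, prop["surface_sqm"]]
    if comparables_avg is not None:
        features.append(comparables_avg)  # extra column used only by the third set
    return features

prop = {"omi_area": "B1", "surface_sqm": 80.0}
set2 = omi_features(prop, OMI_TABLE)
set3 = omi_features(prop, OMI_TABLE, comparables_avg=2500.0)
print(set2)
print(set3)
```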

The authors highlight an issue inherent in the hedonic feature set: it contains many irrelevant features, which could lead to overfitting. They also note that the OMI minimum and maximum price per square meter outperforms simply using the OMI area name, since the name alone carries no geographical properties or correlations.

Model Performance Results

The paper now tests the three different feature sets described above using multiple machine learning algorithms.

The learning task is a supervised regression problem, where the target feature to be predicted is the property valuation. We have experimented, depending on the feature selection method, with up to six Machine Learning regression algorithms: K nearest neighbour (KNN), Random Forests, Extremely Randomized Trees (Extra Trees), Gradient Boosting, Adaboost, and Bagging.
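To make the regression setup concrete, here is a from-scratch sketch of the simplest of the six algorithms, K nearest neighbour. It is not the authors' implementation, and the summary does not say which library or hyperparameters they used. The single-feature training data (surface in square meters, valuation in thousands of euros) is invented.

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Predict by averaging the targets of the k nearest training rows (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

# Invented training data: surface in square meters -> valuation in thousands of euros.
train_X = [[50.0], [60.0], [100.0], [110.0], [200.0]]
train_y = [100.0, 120.0, 200.0, 220.0, 400.0]

prediction = knn_predict(train_X, train_y, [105.0], k=2)
print(prediction)
```

With more than one feature, the columns would normally be rescaled first so that no single feature dominates the distance.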

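The results are evaluated mainly by the formula below, written here in its standard form:

ME = sqrt( (1/n) · Σ (y_i - ŷ_i)² )

Here y_i is the appraised value of property i, ŷ_i is the model's prediction, and n is the number of properties evaluated. This is the Root Mean Squared Error (RMSE), which the paper refers to as Mean Error (ME). The Mean Squared Error (MSE) follows directly, since MSE = ME². ME, MSE, and R² are the metrics used for model evaluation.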
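These three metrics are straightforward to compute. The following is a minimal sketch using the standard definitions of RMSE and the coefficient of determination. The function name and the toy values (in thousands of euros) are made up for illustration.

```python
import math

def evaluate(y_true, y_pred):
    """Return ME (i.e. RMSE), MSE, and R² for paired true and predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    mse = ss_res / n
    me = math.sqrt(mse)  # the paper's "Mean Error" is the root of the MSE
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"ME": me, "MSE": mse, "R2": r2}

# Invented valuations in thousands of euros.
result = evaluate([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(result)
```

Because ME is in the same unit as the valuations, it is the easiest of the three to read as a typical error per property.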

The results for each feature set are:

ME is expressed in thousands of euros. Both ME and R² improve steadily from feature set 1 through feature set 3. The largest gain comes from the OMI-centered features with comparables, where the ME is cut in half relative to the hedonic feature set and R² is above 95%.

The improvement made by the inclusion of the OMI-centered features and the OMI-centered features with comparables is further illustrated with the next two charts.

A comparison of results for OMI-centered features, with and without comparables, can be seen in different detail in Figures 6 and 7. Each dot in these scatter plots represents a real estate property in the test set, with the projection on the x axis representing its expert-valuated price, and the projection on the y axis representing the error in our AVM prediction. It is immediately evident that the accuracy of the AVM with comparables (Fig. 7) is significantly better, making the corresponding Web Crawling activity worthwhile. This can also be observed on the concentration graph on the right of both figures.

Overall, we find the inclusion of the OMI-centered features with comparables to be helpful in the automatic prediction of property valuation. The OMI statistics can be substituted for statistics from the land zoning governing body of any region of the world, wherever the data is available.

In Sum

The authors sum up the contributions of this paper in three points:

1) We have developed a methodology for acquiring relevant real estate property features from the Web and open data, correlating them to other existing intrinsic features that are normally available in expert valuation documents and appraisal data bases;
2) We have shown that, using such features, it is possible to predict the value of some property with a limited error rate;
3) We have developed an AVM tool that uses this data acquisition and Machine Learning methodology, and may be a valid help for professional appraisers and for appraisal companies that need to validate expert documents or evaluate real estate portfolios.

The next steps noted include the use of larger datasets in different international locations, and the evaluation of the technical and legal aspects of crawling the web to retrieve comparable prices and other data.