Airbnb price prediction using machine learning and sentiment analysis

Airbnb Price Prediction Using Machine Learning and Sentiment Analysis, Kalehbasti et al., arXiv 2019

The real estate market is no stranger to applied machine learning models that try to predict future prices and trends from the countless available features. In this paper, the authors target Airbnb for their price prediction model and include an interesting and uncommon feature in the form of sentiment analysis. As most people are already familiar with how services like Airbnb work, it's easy to see how the reviews written by prior tenants may contain the most important information when deciding where to rent. The authors capture exactly this information as a new feature in a dataset of the New York City market.

This paper aims to develop a reliable price prediction model using machine learning, deep learning, and natural language processing techniques to aid both the property owners and the customers with price evaluation given minimal available information about the property.

Data Cleaning and Feature Selection

The authors utilize a dataset of over 50,000 entries, each with 96 features such as rental size, number of bathrooms, beds, etc. The initial data cleaning consists of removing features with many missing values, converting boolean features to binary values, and one-hot encoding the categorical features. The data is then split into train, validation, and test sets.
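
As a rough illustration of these cleaning steps, here is a minimal sketch in pandas. The column names, the "t"/"f" boolean encoding, and the missing-value threshold are assumptions for the example, not the paper's exact choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Airbnb listings table (column names are hypothetical)
listings = pd.DataFrame({
    "price": [120, 85, 300, 60],
    "instant_bookable": ["t", "f", "t", "f"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"],
    "square_feet": [None, None, None, 450],   # mostly missing, dropped below
})

# Remove features with many missing values (70% threshold is an assumption)
listings = listings.dropna(axis=1, thresh=int(0.7 * len(listings)))

# Convert boolean ("t"/"f") features to 0/1
listings["instant_bookable"] = (listings["instant_bookable"] == "t").astype(int)

# One-hot encode the categorical features
listings = pd.get_dummies(listings, columns=["room_type"])

# Train / validation / test split (split ratios are assumptions)
train, temp = train_test_split(listings, test_size=0.5, random_state=0)
val, test = train_test_split(temp, test_size=0.5, random_state=0)
```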

As stated above, the sentiment analysis of the property reviews is a novel feature to include. The performance of this feature depends heavily on the quality of the sentiment predictor; the paper uses the TextBlob sentiment analysis library rather than other Python NLP libraries such as NLTK. The premise of the feature is simple, and the paper does not go into great detail:

This method assigns a score between -1 (very negative sentiment) and 1 (very positive sentiment) to each analyzed text. For every listed property, each review was analyzed using this method and the scores were averaged across all the reviews associated with that listing. The final score for each listing was included as a new feature in the model.
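
Since the paper does not include code for this step, here is a minimal sketch of how the averaging could look with TextBlob; the function name and the neutral default for listings without reviews are assumptions.

```python
from textblob import TextBlob

def listing_sentiment(reviews):
    """Average TextBlob polarity (-1 to 1) across a listing's reviews."""
    if not reviews:
        return 0.0  # assumption: treat listings without reviews as neutral
    scores = [TextBlob(text).sentiment.polarity for text in reviews]
    return sum(scores) / len(scores)

# Example usage with made-up reviews
print(listing_sentiment(["Great location, spotless apartment!", "A bit noisy at night."]))
```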

Prior to feature selection, there were 764 elements available in the feature vector. The paper notes the high error variance that comes with feeding all of these features into a model, in addition to long computation times (no surprise there!). The paper walks through three feature selection methods to narrow down the most relevant features:

  1. Manual selection of features – This performs the worst in terms of R2.
  2. Lasso regularization – A linear regression model is trained on the training data with L1 regularization, which zeroes out the coefficients of insignificant features; the model with the best performance on the validation set is chosen (see the sketch after this list).
  3. Choosing the features with the lowest p-values from a linear regression model trained on the training data.
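
A minimal sketch of the second, lasso-based method is below. The alpha grid, the synthetic data standing in for the 764-element feature vectors, and the use of scikit-learn's Lasso are assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 764-feature Airbnb data
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 764)), rng.normal(size=500)
X_val, y_val = rng.normal(size=(200, 764)), rng.normal(size=200)

scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s = scaler.transform(X_train), scaler.transform(X_val)

# Fit lasso models over a grid of regularization strengths and keep the one
# with the best R^2 on the validation set
best_alpha, best_score = None, -np.inf
for alpha in [0.001, 0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_train_s, y_train)
    score = model.score(X_val_s, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

# L1 regularization zeroes out insignificant coefficients; keep the rest
final = Lasso(alpha=best_alpha, max_iter=10_000).fit(X_train_s, y_train)
selected = np.flatnonzero(final.coef_ != 0)
print(f"alpha={best_alpha}: {selected.size} features kept")
```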

The authors show that the best R2 score came from the second method, lasso regression. This narrows the feature set down to the most important features in terms of an accuracy versus processing-time trade-off, leaving 78 features. These features are then fed into several machine learning models.

The Models

The paper utilized five machine learning models and benchmarked each against the performance of a simple Linear Regression (a rough sketch of these setups follows the list):

  1. Ridge Regression – This is similar to the benchmark linear regression, but adds an L2 regularization term, which forces coefficients to be small but, unlike the L1 regularization of lasso regression above, does not zero them out. This is useful for preventing overfitting and dealing with collinearity in the data.
  2. K-means clustering with Ridge Regression – This is an interesting approach to capture non-linearity in the data, as noted in the paper. Using k-means clustering, the authors split the training datapoints into distinct clusters and run ridge regression on each cluster individually. The benefits of using clustering in the pre-processing stage are discussed by Shubhendu, Pardos, and Heffernan (a paper perhaps for a future review).
  3. Support Vector Regression (SVR) – SVR speaks for itself: a Support Vector Machine (SVM) used to predict a continuous variable. SVR may be particularly helpful in this example, as the Radial Basis Function (RBF) kernel used in the paper implicitly maps the non-linear data into a higher-dimensional feature space. The value of the regularization parameter C and how it was tuned are not discussed in detail.
  4. Neural Network – The paper utilizes a fully connected 3-layer neural network with 20 neurons in the first hidden layer, 5 neurons in the second, and 1 output neuron. Both hidden layers use the common ReLU activation function, and the output neuron uses a linear activation function.
  5. Gradient Boosting Tree Ensemble – The authors note that since the relationship between the feature vector and price is non-linear, they opted to utilize a regression tree. The gradient boosting method helps increase the model's performance by correcting prior predictions with another model fit to the negative gradient of the prior loss.
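
The sketch below shows roughly how these setups could be wired together with scikit-learn. The hyperparameter values, the substitution of MLPRegressor for the paper's network, and the synthetic data standing in for the 78 selected features are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.cluster import KMeans
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the 78 selected features and the price target
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2000, 78)), rng.normal(size=2000)
X_test = rng.normal(size=(500, 78))

baseline = LinearRegression().fit(X_train, y_train)                      # benchmark
ridge = Ridge(alpha=1.0).fit(X_train, y_train)                           # L2 regularization
svr = SVR(kernel="rbf", C=1.0).fit(X_train, y_train)                     # RBF feature mapping
gbt = GradientBoostingRegressor(n_estimators=200).fit(X_train, y_train)  # boosted regression trees
nn = MLPRegressor(hidden_layer_sizes=(20, 5), activation="relu",
                  max_iter=2000).fit(X_train, y_train)                   # 20-5-1 network, linear output

# K-means + ridge: cluster the training points, fit one ridge model per
# cluster, then route each test point to the model of its nearest cluster
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)
per_cluster = {c: Ridge(alpha=1.0).fit(X_train[kmeans.labels_ == c],
                                       y_train[kmeans.labels_ == c])
               for c in range(kmeans.n_clusters)}
test_clusters = kmeans.predict(X_test)
y_pred_kmeans = np.array([per_cluster[c].predict(x.reshape(1, -1))[0]
                          for c, x in zip(test_clusters, X_test)])
```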

Results 

The paper uses Mean Absolute Error (MAE), Mean Squared Error (MSE), and R2 to evaluate the performance of each model. For financial data, MAE works well because all differences between amounts are weighted equally (the difference between $10 and $0 counts exactly twice as much as the difference between $5 and $0), unlike MSE, which penalizes larger errors more heavily due to squaring.
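
A tiny numerical illustration of the difference (the dollar amounts are made up, not from the paper):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [100, 100]
y_pred_even = [110, 90]     # two $10 errors
y_pred_skewed = [120, 100]  # one $20 error, one perfect prediction

# MAE treats both cases identically (10.0 vs 10.0) ...
print(mean_absolute_error(y_true, y_pred_even), mean_absolute_error(y_true, y_pred_skewed))
# ... while MSE penalizes the single larger error more heavily (100.0 vs 200.0)
print(mean_squared_error(y_true, y_pred_even), mean_squared_error(y_true, y_pred_skewed))
```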

Each model's results for both training and test sets:

Even though the RBF kernel effectively expands the number of features in the feature space, SVR turned out to be the best performing model, with the lowest MAE and MSE and the highest R2 score on both the train and test sets. The RBF feature mapping was better able to model apartment prices, which have a non-linear relationship with the apartment features. Since regularization is built into the SVR optimization problem, parameter tuning ensured that the model was not overfitting.

The neural network (NN) and the K-means clustering model are noted to have suffered from an insufficient number of training examples, driving down model performance.

The Final Word 

The initial experimentation with the baseline linear regression model showed that the abundance of features leads to high variance and weak performance on the validation set compared to the training set. Lasso-based feature importance analysis reduced this variance, and using more advanced models such as SVR and neural networks resulted in higher R2 scores on both the validation and test sets. Among the models tested, Support Vector Regression (SVR) performed the best, producing an R2 score of 69% and an MSE of 0.147 (defined on ln(price)) on the test set.

The authors suggest building upon the work in their paper with future studies that include:

  1. Other feature selection methods
  2. Further experiments with neural networks using larger amounts of training data
  3. Increasing the number of training examples by leveraging other sites with similar platform models (e.g., VRBO)
  4. Investigating the overall effectiveness of using review sentiment as a feature