This blog post serves as an assignment project in partial fulfilment of the Udacity Data Scientist Nanodegree. The project involves analysis of Airbnb data using the CRISP-DM method, and we focus on the Airbnb data from the city of Amsterdam. The primary goal is to analyse the data and present the findings.
For this analysis, the CRISP-DM approach is used, and the following steps are performed as part of this methodology.
- Develop understanding of the business domain
- Develop understanding of the data
- Data modelling
- Evaluation of the results
- Deploying the changes
Note: We assume here that the most common prices reported in the current listings are already set to maximize profit. A model that learns this representation should therefore be able to predict the price of another listing reasonably well, based on the choice of region in Amsterdam. Further, while we model the data and build a prediction model in this study, we do not deploy the changes.
To summarise, we address the following points in this project.
- We analyse the Airbnb data to understand how listings and prices are distributed across the city of Amsterdam.
- We build regression models to predict prices of the listings based on a set of available features.
Develop understanding of the business domain
Amsterdam is a vibrant city that attracts a large number of tourists from across the world every year, across all ages and ethnicities. This has helped Airbnb's business in the city scale up in a short amount of time and has made the platform very popular there. In this project, we analyse the Airbnb data of this city to understand the distribution of listings as well as prices within the city.
The analysis is aimed at understanding how to choose an appropriate price for a new listing. It is assumed here that we have multiple properties available across the city, and we would like to list one of them on Airbnb. The chosen price should be neither too low nor too high. Too low a price means unnecessarily reduced profit when there is still scope to raise it. Too high a price would clearly lead to fewer customers choosing the listing for their stay. Further, customers who do choose it may find the price too high for the resources that they can use, and this could lead to poor reviews. Overall, it is important that the prices are optimally chosen.
Develop understanding of the data
The data used in this study was obtained from the Airbnb website here. We only use the information contained in the listings_summary.csv and listings_details.csv files. A snapshot of some of the headers from the first file is shown below. Additional headers include last_review, reviews_per_month, calculated_host_listings_count and availability_365.
Since our end goal is to identify the region in Amsterdam where our new listing would benefit most, we look at the distribution of the listings across Amsterdam, and this is reflected in the distribution plot below.
Among the 22 neighbourhoods listed on Airbnb, we see that most listings belong to the De Baarsjes - Oud-West neighbourhood. It is followed by De Pijp, the Centrum region, and the neighbourhoods enclosing it (Westerpark and Zuid), each with a large number of listings. Overall, most listings are clustered around the city center. This is expected, since most tourists prefer to stay close to the famous places they want to visit.
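The listing counts per neighbourhood can be sketched with pandas as follows. This is only an illustration under stated assumptions: the rows below are a made-up stand-in for listings_summary.csv, and the column name `neighbourhood` is assumed rather than taken from the snapshot above.

```python
import pandas as pd

# Toy stand-in for listings_summary.csv (hypothetical rows, not the real data).
# The column name "neighbourhood" is an assumption about the file's header.
listings = pd.DataFrame({
    "neighbourhood": [
        "De Baarsjes - Oud-West", "De Baarsjes - Oud-West", "De Baarsjes - Oud-West",
        "Centrum-West", "Centrum-West", "Westerpark",
    ],
    "price": [120, 95, 110, 180, 160, 100],
})

# Count listings per neighbourhood; value_counts sorts from most to least listed.
listing_counts = listings["neighbourhood"].value_counts()
print(listing_counts)
```

With the real file, the toy frame would be replaced by something like `pd.read_csv("listings_summary.csv")`, and `listing_counts.plot(kind="barh")` would give a simple distribution plot.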
We further study the distribution of prices for the listings from the 10 neighbourhoods with the most listings, and the distribution is shown below.
It is observed that the mean prices, as well as their deviations, are significantly higher in the regions closer to the city center of Amsterdam than in the neighbourhoods away from it. This aligns with the fact that the first three neighbourhoods have several popular tourist destinations. Similarly, several nice restaurants are centered around De Pijp, which could attract more listings in this region. The Amsterdam Zuid region has several corporate offices, which could also attract Airbnb listings. Overall, choosing one of these 4 neighbourhoods could be a good option for maximizing the return from a listing. For a more concrete analysis, data on daily reservations should also be studied; however, we keep this beyond the scope of this study.
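A minimal sketch of how the per-neighbourhood mean price and deviation could be computed is given below. The data is invented for illustration, the column names `neighbourhood` and `price` are assumptions, and `top_n` is set to 2 only because the toy frame is tiny (the analysis above uses 10).

```python
import pandas as pd

# Hypothetical toy data; real prices would come from listings_summary.csv.
listings = pd.DataFrame({
    "neighbourhood": ["Centrum-West"] * 3 + ["Westerpark"] * 2 + ["Noord-Oost"],
    "price": [200.0, 150.0, 250.0, 100.0, 120.0, 80.0],
})

top_n = 2  # the post uses the 10 neighbourhoods with the most listings

# Names of the neighbourhoods with the most listings.
top = listings["neighbourhood"].value_counts().head(top_n).index

# Mean, sample standard deviation and count of prices within each of them.
price_stats = (
    listings[listings["neighbourhood"].isin(top)]
    .groupby("neighbourhood")["price"]
    .agg(["mean", "std", "count"])
    .sort_values("mean", ascending=False)
)
print(price_stats)
```

A box plot per neighbourhood would show the same information visually; the summary table is simply the quickest way to compare means and spreads.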
Data modelling and evaluation of the results
Based on the information from previous listings, we next analyse whether we can predict the price of any future listing that we would like to add. For this purpose, we fit regression models on the various features available in listings_details.csv. We one-hot encode the categorical features, which leaves us with 957 features to fit our regression model.
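One-hot encoding of categorical columns can be sketched as follows. The frame below is a hypothetical miniature of listings_details.csv, and the column names (`property_type`, `room_type`, `accommodates`, `price`) are assumptions made for the example.

```python
import pandas as pd

# Hypothetical miniature of listings_details.csv.
details = pd.DataFrame({
    "property_type": ["Apartment", "House", "Apartment"],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
    "accommodates": [2, 4, 1],
    "price": [120.0, 200.0, 60.0],
})

# Separate the target, then expand each categorical column into one
# indicator column per category; numeric columns pass through unchanged.
y = details["price"]
X = pd.get_dummies(
    details.drop(columns="price"),
    columns=["property_type", "room_type"],
)
print(X.columns.tolist())
```

Each categorical column contributes one new column per distinct value, which is how a modest number of raw columns can grow into several hundred features.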
We first study the correlation of the various features with the output, and build a feature_importance vector. We show below the 50 most important features for our regression model.
In the plot above, the importance of the features in terms of affecting the output decreases from bottom to top. As expected, the type of the listing is most important in deciding its price. In particular, prices are most influenced when the listing type is a lighthouse or a private room. Of these 50 most important features, we use the top 25 for building the regression model.
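One way such a correlation-based ranking could be built is sketched below. We assume here that importance is the absolute Pearson correlation between each feature and the price; the data is synthetic and the feature names are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic features and price (assumption: price depends mostly on "accommodates").
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "accommodates": rng.integers(1, 7, size=n),
    "bedrooms": rng.integers(1, 4, size=n),
    "noise": rng.normal(size=n),
})
y = 30 * X["accommodates"] + 20 * X["bedrooms"] + rng.normal(scale=5, size=n)

# Absolute correlation of every feature with the target, strongest first.
feature_importance = X.corrwith(y).abs().sort_values(ascending=False)

# Keep the names of the k highest-ranked features (k=2 here; the post keeps 25 of 50).
top_features = feature_importance.head(2).index.tolist()
print(feature_importance)
```

Correlation only captures linear, one-feature-at-a-time relationships, so a ranking like this is a starting point rather than a final selection.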
Among the 25 features, we would like to select an optimal set of features. But how do we identify this optimal set? Instead of guessing, we can systematically test a range of different numbers of selected features and discover which results in the best performing model. We ran a grid search over the number of selected features, using mutual information statistics for selection and evaluating each modelling pipeline with repeated cross-validation. We visualised how the overall error reduces when one additional feature is included. It was observed that while the error plot becomes almost flat after around the 8th feature, each additional feature still seems to reduce the error further. Thus, in this case, we move ahead with all 25 features.
Before fitting a regression model, we fitted a dummy regressor as a baseline model, and the baseline error in predicted prices was approximately 52.1 dollars. Any useful regression model should achieve a lower error than this.
As first regression models, we used linear regression and ridge regression, and the errors in predicted prices for the two models are 40.6 and 40.3 dollars, respectively. Comparing with the baseline, we conclude that the two regression models provide a better and more intelligent estimate of the prices. However, it is of interest to explore whether these error values can be reduced further.
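The search over the number of selected features can be sketched with scikit-learn as below. This is a sketch under assumptions, not the exact pipeline from the notebooks: the data is synthetic, the downstream model is taken to be a plain LinearRegression, and the cross-validation settings and the MAE scoring are our own choices.

```python
from functools import partial

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline

# Synthetic regression problem standing in for the 25 candidate features.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=10.0, random_state=0
)

# Fix the seed of the mutual information estimator so the ranking is repeatable.
score_func = partial(mutual_info_regression, random_state=0)

# Selection happens inside the pipeline, so it is re-fit on every training fold.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=score_func)),
    ("model", LinearRegression()),  # assumed model, for illustration only
])

cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
search = GridSearchCV(
    pipeline,
    param_grid={"select__k": list(range(1, X.shape[1] + 1))},
    scoring="neg_mean_absolute_error",
    cv=cv,
)
search.fit(X, y)

# Mean cross-validated MAE for k = 1, 2, ..., 10 (sign flipped back to positive).
mae_per_k = -search.cv_results_["mean_test_score"]
best_k = search.best_params_["select__k"]
print(best_k, mae_per_k.round(2))
```

Plotting `mae_per_k` against k gives the kind of error curve described above, where one can see where the curve flattens.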
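A baseline of this kind can be sketched with scikit-learn's DummyRegressor. The data below is synthetic, and the "mean" strategy and the train/test split are assumptions; the error is measured as mean absolute error.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic features and prices (not the Airbnb data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 100 + 40 * X[:, 0] + rng.normal(scale=10, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The dummy regressor ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
print(f"baseline MAE: {baseline_mae:.2f}")
```

Because the baseline uses no feature information at all, any model that fails to beat it has not learned anything useful from the features.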
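The comparison against the baseline can be sketched as follows. Again the data is synthetic, and the ridge penalty `alpha=1.0` is simply scikit-learn's default, not a value taken from the experiments.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected features and the listing prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 100 + 40 * X[:, 0] - 15 * X[:, 1] + rng.normal(scale=10, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "baseline": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),  # assumed penalty strength
}

# Fit every model on the same split and record its test MAE.
mae = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, value in mae.items():
    print(f"{name}: {value:.2f}")
```

Evaluating all models on the same held-out split keeps the comparison fair; the absolute numbers here have no relation to the dollar figures reported above.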
We further analyse whether this error can be reduced through ensembling and grid search. For ensembling, we employ boosting and bagging techniques. In particular, we focus on AdaBoostRegressor, GradientBoostingRegressor, and BaggingRegressor as our estimators. AdaBoostRegressor is a meta-estimator that fits a regressor on the original dataset and then fits additional copies of the regressor on the same dataset, with the weights of the instances adjusted according to the error of the current prediction. GradientBoostingRegressor allows for optimization of arbitrary differentiable loss functions, where in each stage a regression tree is fit on the negative gradient of the loss. BaggingRegressor is an ensemble meta-estimator that fits base regressors, each on a random subset of the original dataset, and then aggregates their individual predictions into a final one. The benefit is that it reduces the variance of a black-box estimator, such as a decision tree, through randomization.
With the gradient boosting approach, the error in predicted prices was 40.6 dollars, which is no improvement over the scores reported above. Interestingly, bagging in the same experimental setting gave an error of 51.6 dollars, which is even higher and close to the baseline. Details related to these experiments can be found in the notebooks hosted here. With the Random Forest (RF) method, the error in prices reduced to 36 dollars, a significant reduction compared to the baseline.
While the final mean absolute error in prices of 36 dollars is still high, the results reported above indicate that the prices of Airbnb listings in Amsterdam can be predicted using regression models, and further research in this direction could be of business interest.
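The three ensemble estimators can be tried side by side as in the sketch below. The data is synthetic and the hyperparameters (`n_estimators` and the random seeds) are illustrative assumptions, not the settings used in the notebooks.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    GradientBoostingRegressor,
)
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (not the Airbnb listings).
X, y = make_regression(
    n_samples=300, n_features=5, n_informative=3, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "AdaBoostRegressor": AdaBoostRegressor(n_estimators=50, random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(
        n_estimators=100, random_state=0
    ),
    # Default base estimator is a decision tree trained on bootstrap samples.
    "BaggingRegressor": BaggingRegressor(n_estimators=20, random_state=0),
}

# Test MAE of always predicting the training mean, for reference.
baseline_mae = mean_absolute_error(y_test, np.full_like(y_test, y_train.mean()))

ensemble_mae = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    ensemble_mae[name] = mean_absolute_error(y_test, model.predict(X_test))

print(f"baseline: {baseline_mae:.2f}")
for name, value in ensemble_mae.items():
    print(f"{name}: {value:.2f}")
```

Each of these estimators could also be wrapped in GridSearchCV to tune settings such as `n_estimators`; that step is left out here to keep the sketch short.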