Predicting house prices with an Artificial Neural Network and Python

In the previous post, we used machine learning and regression to create a model to predict house prices, using data for houses sold between May 2014 and May 2015 in King County, Washington State, USA.

When evaluating that regression model, we obtained an R squared metric of 0.7078 and a root mean squared error (RMSE) of $187,009. In this post we will try to improve on those metrics by using an Artificial Neural Network.
An Artificial Neural Network (ANN) is a machine learning algorithm that loosely simulates the brain by modelling a web of interconnected nodes and the information transmitted between them, with each connection carrying a weight and each node a bias. Nodes are grouped into layers, and a different transformation takes place at each layer. The features from which we want to predict a result form the input layer of the ANN; the data then passes through various transformations in the hidden layers, before a result is produced at the output layer (usually one or a few nodes).
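To make the idea of weighted connections a little more concrete, the sketch below shows the calculation performed by a single layer of two nodes on three inputs, using made-up weights and biases. This is purely illustrative and is not part of the model we build later in the post.
import numpy as np

def relu(x):
    # ReLU activation: negative values become zero
    return np.maximum(0, x)

inputs = np.array([0.2, 0.7, 0.1])            # three input features
weights = np.array([[0.5, -0.3, 0.8],         # one row of weights per node
                    [0.1, 0.9, -0.4]])
biases = np.array([0.05, -0.1])               # one bias per node

# each node outputs activation(weighted sum of inputs + bias)
outputs = relu(weights @ inputs + biases)
print(outputs)  # one output value per node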
As in our machine learning example (see previous post link above), the Python packages that we will need for this exercise are pandas to work with the data, Matplotlib to provide some visual output and scikit-learn to handle the machine learning algorithms. However, we will also need to install an additional package called keras, to take care of the ANN model calculations and training.
The house price data we will be using covers houses sold between May 2014 and May 2015 in King County, Washington State, USA. The data was obtained from here.
The required packages can be downloaded and installed from source at the above links, or installed on the command line using Python’s pip package manager as follows, which should also install any dependencies (for example tensorflow):
pip install pandas
pip install matplotlib
pip install scikit-learn
pip install keras
Now that we have everything installed, let’s get started.
First, we need to import the modules that we will be using; each will be described in more detail as it is used below:
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt
from keras.layers import Dense
from keras.models import Sequential
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
Next, we create a pandas dataframe from the housing data CSV file, which is assumed to be in the same directory as the Python script:
data = pd.read_csv("kc_house_data.csv")
To see all the columns in the dataset, issue the command print(data.columns.values):
['id' 'date' 'price' 'bedrooms' 'bathrooms' 'sqft_living' 'sqft_lot'
 'floors' 'waterfront' 'view' 'condition' 'grade' 'sqft_above'
 'sqft_basement' 'yr_built' 'yr_renovated' 'zipcode' 'lat' 'long'
 'sqft_living15' 'sqft_lot15']
Some rows in our data contain strange values, for example a zero value in the number of bathrooms and bedrooms, or an abnormally high number of bedrooms. It may be worthwhile to drop these rows from the data.
# drop data with zero bedrooms and bathrooms and bedrooms outlier
data = data.drop(data[data.bedrooms == 0].index)
data = data.drop(data[data.bedrooms == 33].index)
data = data.drop(data[data.bathrooms == 0].index)
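If you want to check which values look suspicious, either before or after dropping them, a quick inspection such as the one below can help. This step is optional and is not part of the original workflow.
# optional: inspect bedroom and bathroom values for outliers
print(data['bedrooms'].value_counts().sort_index())
print(data['bathrooms'].describe())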
Additionally, the id and date columns are not required, and neither is the zipcode, as we already have latitude and longitude columns for the location data:
# drop columns that we won't be using
data = data.drop(['id', 'date', 'zipcode'], axis=1)
So that the calculations between nodes are not biased towards features with large value ranges, we need to preprocess the data by scaling the inputs to lie between 0 and 1. To do this, we use MinMaxScaler, which scales each feature by mapping its lowest value to 0 and its highest value to 1. We place the result in a new dataframe called data_scaled.
# scale data to values between 0 and 1
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data) # numpy array
# convert numpy array to dataframe with the original column names
data_scaled = pd.DataFrame(data_scaled, columns=data.columns.values)
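For reference, the transformation MinMaxScaler applies to each column is simply (x - min) / (max - min). The snippet below reproduces it by hand for the price column; this is purely illustrative, as the scaler has already done this for every column.
# manual min-max scaling of the price column, for illustration only
price_min = data['price'].min()
price_max = data['price'].max()
manual_scaled_price = (data['price'] - price_min) / (price_max - price_min)
# the resulting values lie between 0 and 1 and match data_scaled['price']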
The target, or dependent variable (price), needs to be separated from the independent variables in the dataset, giving us two arrays:
# dependent variable
y_scaled = data_scaled[['price']].values
# extract independent variables (all columns except price)
X_scaled = data_scaled.drop(['price'], axis=1).values
The data then needs to be split into testing and training sets. The training sample data (80% of the data) is used to train the model to identify relationships. The resultant trained model can then be evaluated by predicting target variables from the unseen test data (20% of the data).
# randomly split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size = 0.2)
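If you want to confirm the proportions of the split, you can print the shapes of the resulting arrays (optional):
# optional: check that roughly 80% of the rows went into the training set
print(X_train.shape, X_test.shape)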
Next, we create our ANN model, consisting of two hidden layers and an output layer, using the Sequential model from the keras package.
x_len = len(X_scaled[0])
model = Sequential([
    # hidden layer with 32 nodes, ReLU activation and x_len input features
    Dense(32, activation='relu', input_shape=(x_len,)),
    # hidden layer with 16 nodes and ReLU activation
    Dense(16, activation='relu'),
    # output layer with a single node and linear activation
    Dense(1, activation='linear'),
])
There is quite a lot going on in the above code. The first hidden layer has 32 nodes and uses the ReLU activation function. The input_shape parameter is the number of input features, which we obtain from the number of columns in X_scaled. The second hidden layer has 16 nodes; we don't have to repeat input_shape, as Keras infers it from the previous layer. The output layer has a single node and uses a linear activation function.
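A quick way to inspect the resulting architecture is model.summary(), which lists each layer and its number of trainable parameters. With the 17 input features left after dropping price, id, date and zipcode, the counts work out as shown in the comments:
# print a summary of the layers and their trainable parameters
model.summary()
# with 17 input features the parameter counts are:
#   Dense(32): 17 * 32 weights + 32 biases = 576
#   Dense(16): 32 * 16 weights + 16 biases = 528
#   Dense(1):  16 * 1 weights  +  1 bias   = 17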
We are now ready to build and train the model.
To build (compile) the model, we have to choose an optimizer and specify a loss function. The Adam optimizer has worked well for me, but you may want to experiment with others. For regression-type problems, the mean squared error is a common choice of loss function, as it penalises larger errors more heavily.
# build the model
model.compile(optimizer='adam',
	loss='mean_squared_error')
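If you do want to try a different optimizer, it is just a matter of swapping the optimizer argument; for example (a sketch, not the configuration used in this post):
# example: compile with RMSprop instead of Adam
model.compile(optimizer='rmsprop',
	loss='mean_squared_error')
# passing an optimizer object instead of a string also lets you tune
# settings such as the learning rate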
To train (fit) the model, we pass in the independent variable training data (X_train) and the dependent variable training data (y_train), as well as the batch size (the number of training examples processed per weight update) and the number of epochs (the number of complete passes over the training set). The data against which the training will be validated (here, the test data) is also passed into the function.
# train model
hist = model.fit(X_train, y_train,
	batch_size=32, epochs=50,
	validation_data=(X_test, y_test))
Now that the model has been trained, it’s time to make some predictions using the test data and compare these results to the expected values in order to evaluate the model. We use R squared and root mean squared error (RMSE) to test the effectiveness of our model:
# make some predictions using the model
y_pred = model.predict(X_test)
# evaluate model based on the predictions
print ('Model evaluation:')
print ('R squared:\t {}'.format(r2_score(y_test, y_pred)))
# convert the RMSE from the scaled range back to dollars by multiplying by the price range
rmse = sqrt(mean_squared_error(y_test, y_pred)) * (data['price'].max() - data['price'].min())
print ('RMSE:\t\t {}'.format(rmse))
The output of the above will vary each time you run the model. My output was as follows:
Model evaluation:
R squared:	 0.8766830328007977
RMSE:		 134796.12277840523
The R squared score tells us that approximately 87.67% of the variation in house prices is explained by the independent variables (inputs), which is quite a bit better than the score obtained in the previous blog post using machine learning regression. The RMSE of $134,796 is also a decent improvement, although still not ideal, given that the mean house price in the dataset is in the region of $540,000.
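Because price was min-max scaled along with everything else, the same price range can be used to convert individual predictions back into dollars. A minimal sketch, assuming the unscaled data dataframe from earlier is still available:
# convert scaled predictions back to dollar values:
# price = scaled_price * (max - min) + min
price_min = data['price'].min()
price_range = data['price'].max() - price_min
y_pred_dollars = y_pred * price_range + price_min
print(y_pred_dollars[:5].flatten())  # first few predicted prices in dollars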
We can visualise the history of the training process by plotting the training loss against the validation loss.
# visualise loss during training
plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Training loss', 'Validation loss'], loc='upper right')
plt.tick_params(labelsize=8)
plt.tight_layout()
plt.show()
The model can probably be further improved by experimenting with some of the design elements, such as the configuration of the hidden layers, or by adding regularisation to fine-tune the model.
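As an example of the kind of experimentation mentioned above, dropout layers or L2 weight penalties can be added to the hidden layers. A rough sketch, with arbitrary hyperparameter values, is shown below; this is not part of the code described in this post.
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.regularizers import l2

model = Sequential([
    # L2 penalty on the weights of the first hidden layer
    Dense(32, activation='relu', input_shape=(x_len,),
          kernel_regularizer=l2(0.001)),
    # randomly drop 20% of node outputs during training
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dense(1, activation='linear'),
])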
All of the above code can be found on my GitHub page.
