Predicting Sale Prices Of Houses In Ames
Predicting Sale Prices of Houses in Ames using scikit-learn.
In this tutorial I will discuss how you can go from a raw dataset to a predictive model. For this tutorial we will make use of the Ames Dataset and see whether we can predict house prices based on characteristics provided in the dataset.
The analysis in this tutorial is done in Python using the pandas, scikit-learn and matplotlib packages. We will start by exploring the raw data to see whether we can already spot some patterns, or whether some features should be discarded right away. Next, we will build a straightforward pipeline that transforms our dataset and fits a linear regression model. Finally, we will evaluate the performance of this model.
I have only limited experience with data analysis (I mainly use R), so this tutorial will be most informative for people like me: beginners!
Raw Data & Initial Analysis
First, we need to download the dataset provided in the link above (direct download link here).
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Import the Ames Housing data (put this in the same folder as your Python file)
house_data = pd.read_csv("AmesHousing.txt", sep="\t")
# I do not like spaces in the column names of a dataset so I first replace spaces
# by underscores
house_data.columns = house_data.columns.str.replace(' ', '_')
Next we want to get a feeling for what exactly we can find in the dataset, so let’s look at some relevant information.
# Get some feeling for the data
house_info = house_data.describe()
Since we want to build a model that predicts the house price given a set of features of this given house, let’s first check these prices by creating a histogram.
plt.hist(house_data.SalePrice, bins=50)
plt.xlabel("House price (in $)")
# Pandas also provides a built-in plotting env
# house_data.SalePrice.plot.hist()
The Year_Built feature is a candidate to give insight into the sale price of a house. Below we report the density of the year in which each house in the dataset was built.
We can check the relationship between the sale price and the year houses are built by creating a scatter plot.
house_data.plot.scatter(x='Year_Built', y='SalePrice')
This plot suggests that houses that were built more recently tend to have a higher sale price (on average). Note that this is probably due to the geographical area we are considering. In old cities, like Amsterdam, we can imagine that some of the old houses are monumental buildings and therefore have a higher sale price.
In the analysis above, we have looked into the relationship between the sale price and a numerical feature. A scatter plot is usually a good option to check whether there might be an interesting relationship; for categorical features, however, we need different techniques.
We have information on the type of sale for each house, e.g., some houses were sold with adjacent land, or were bought by a family member. A tool to look at the relationship between this (categorical) feature and the sale price is the histogram of the sale price for each type of sale condition.
house_data['SalePrice'].hist(by=house_data['Sale_Condition'], bins=30)
Another feature that is likely to have an influence on the sale price is in what neighbourhood the house is located. We could again create histograms of the sale price for each neighbourhood, but since there are more than 20 neighbourhoods, we can do something else. We take the mean of the sale price for each neighbourhood and present this in a bar plot.
# Check saleprice per neighbourhood
avg_price_neigh = house_data.groupby('Neighborhood').agg({'SalePrice': 'mean'}).reset_index()
avg_price_neigh.plot.bar(x='Neighborhood', y='SalePrice')
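To make the grouping step concrete, here is a minimal sketch on a made-up table. The neighbourhood names and prices are invented for illustration and are not taken from the Ames data. Sorting the means before plotting is my own addition; it tends to make the bar plot easier to read.

```python
import pandas as pd

# Toy data: invented values, only the column names mirror the tutorial
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B', 'B', 'C'],
    'SalePrice': [100000, 120000, 200000, 220000, 150000],
})

# Mean sale price per neighbourhood, sorted from cheapest to most expensive
avg_price = (
    toy.groupby('Neighborhood')['SalePrice']
    .mean()
    .sort_values()
    .reset_index()
)
print(avg_price)
# avg_price.plot.bar(x='Neighborhood', y='SalePrice') would draw the bar plot
```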
Feature Selection
The dataset provides more than 80 features, and not all features are equally informative for sale price. Selecting good features and transforming existing features is often a rather ad-hoc procedure, since every dataset is unique. However, there are a few standard procedures that usually help in creating a significantly better dataset.
One of these techniques is to delete outliers from your dataset. Whether or not this is necessary also depends on the type of model that is used, e.g., linear models are usually very sensitive to outliers.
Let’s look at a scatter plot of the living area and sale price.
# Check living space and sale price (there seem to be about 5 outliers...)
house_data.plot.scatter(x='Gr_Liv_Area', y='SalePrice')
# Keep only the houses with a living area below 4000 square feet
house_data = house_data.query('Gr_Liv_Area < 4000')
There seem to be 4-5 outliers with very high living areas. For this analysis we will remove these. However, if the goal of your model is to also predict well for such houses, it might be worthwhile to keep them.
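To illustrate why a linear model can be sensitive to such points, here is a small sketch on invented data (not the Ames data). The price is exactly 100 times the living area, and two very large but cheap houses are added. The helper `fitted_slope` is a hypothetical name used only for this example; it fits a straight line with `np.polyfit`.

```python
import numpy as np
import pandas as pd

# Toy data: price is exactly 100 * area, plus two large but cheap houses
area = np.arange(1000, 3000, 100)
toy = pd.DataFrame({
    'Gr_Liv_Area': np.append(area, [5000, 5500]),
    'SalePrice': np.append(100 * area, [150000, 180000]),
})

def fitted_slope(df):
    # Least-squares slope of SalePrice on Gr_Liv_Area
    slope, _intercept = np.polyfit(df['Gr_Liv_Area'], df['SalePrice'], deg=1)
    return slope

slope_all = fitted_slope(toy)
slope_clean = fitted_slope(toy.query('Gr_Liv_Area < 4000'))
print(slope_all, slope_clean)
```

Only two out of 22 points are unusual, yet they pull the fitted slope well below the value of 100 that describes all the other houses.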
Feature enhancement
There are many ways to enhance existing features. Here, I will show how categorical variables can be improved. To do this, I select all features that are of the type object and save the histogram for these features in a folder figures:
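# Check categorical variables and see if we need to delete them
house_data_cat = house_data.select_dtypes(include='object')
for col in house_data_cat.columns:
    # Bar chart of the category counts (the "histogram" of a categorical feature)
    house_data_cat[col].value_counts().plot.bar(title=col)
    plt.savefig('figures/' + col + '_hist.pdf')  # the figures folder must already exist
    plt.close()
house_data_cat['Bsmt_Qual'].value_counts().plot.bar(title='Basement Quality')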
A considerable number of the categorical features are ordinal, that is, the categories have a certain ordering. In this case, we have information about whether the feature has a value Poor, Fair, Average, Good or Excellent. However, the most extreme outcomes in particular are rather rare. As we will see later, this results in a large number of features (when we one-hot encode them) with only minimal extra predictive power. Therefore we combine certain groups.
ord_groups = {'Fa': 'Bad', 'Po': 'Bad', 'TA': 'Average', 'Gd': 'Good', 'Ex': 'Good'}
columns_ord = ['Bsmt_Cond', 'Bsmt_Qual', 'Exter_Cond', 'Exter_Qual', 'Fireplace_Qu',
               'Garage_Cond', 'Garage_Qual', 'Heating_QC', 'Kitchen_Qual', 'Pool_QC']
# Ordered categorical dtype shared by all quality columns
quality_dtype = pd.CategoricalDtype(categories=['Bad', 'Average', 'Good'], ordered=True)
for col in columns_ord:
    # Map the five original ratings onto three groups, then make the column ordinal
    house_data[col] = house_data[col].replace(ord_groups).astype(quality_dtype)
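The effect of this grouping can be checked on a small, invented series of ratings. This is only a sketch: it uses `Series.map` rather than `replace`, which is my own choice for the toy example, and the ratings are made up.

```python
import pandas as pd

# Toy quality ratings using the same codes as the tutorial's columns
ratings = pd.Series(['Po', 'Fa', 'TA', 'TA', 'Gd', 'Ex'])

ord_groups = {'Fa': 'Bad', 'Po': 'Bad', 'TA': 'Average', 'Gd': 'Good', 'Ex': 'Good'}
quality_dtype = pd.CategoricalDtype(categories=['Bad', 'Average', 'Good'], ordered=True)

# Collapse five ratings into three ordered groups
grouped = ratings.map(ord_groups).astype(quality_dtype)

print(grouped.value_counts(sort=False))
print(grouped.min(), grouped.max())   # the ordering makes min/max meaningful
print((grouped >= 'Average').sum())   # comparisons respect the ordering
```

Because the dtype is ordered, pandas knows that Bad < Average < Good, which is what distinguishes an ordinal feature from a purely nominal one.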
Missing Values
Most real-life datasets have missing values for a subset of their features. There are different ways to deal with these missing values. Usually, when a feature’s value is missing in most of the samples, it is better to just discard that feature. Let’s see if we have any of these features.
missing_vals = house_data.isnull().sum(axis = 0)
Let’s get rid of features that have many missing values.
house_data = house_data.drop(columns=['Alley', 'Fireplace_Qu', 'Pool_QC',
'Misc_Feature', 'Misc_Val'])
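Instead of listing the columns by hand, the selection can also be derived from the fraction of missing values. The sketch below uses invented data, and the cut-off of 50 percent is an assumption made for illustration, not a value used in this tutorial.

```python
import numpy as np
import pandas as pd

# Toy data with invented values; only the column names echo the tutorial
toy = pd.DataFrame({
    'Alley': [np.nan, np.nan, np.nan, 'Grvl'],
    'Lot_Frontage': [60.0, np.nan, 75.0, 80.0],
    'SalePrice': [100000, 120000, 150000, 200000],
})

# Fraction of missing values per column
missing_frac = toy.isnull().mean()

# Hypothetical cut-off: drop features missing in more than half of the rows
threshold = 0.5
to_drop = missing_frac[missing_frac > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)
print(missing_frac)
print(to_drop)
```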
For some features a missing value is not really missing, since it can indicate that the house simply does not have this feature. Fence and Pool_Area belong to this group. For these features we decide to create a binary variable indicating whether or not the house has this feature (and we do not care about the type or size of the feature).
# Make some variables useful
house_data['Fence'] = house_data['Fence'].notna()
house_data['Pool'] = house_data['Pool_Area'] > 0
Besides the categorical features that are rated from poor to excellent, there are also features with other kinds of categories. Again, it might be worthwhile to combine categories that occur only very infrequently in the dataset. In this case we combine categories that constitute less than 1 percent of the total samples.
house_data_obj = house_data.select_dtypes(include='object')
for col in house_data_obj.columns:
    series = house_data[col].value_counts()
    # Categories that make up less than 1 percent of the non-missing values
    mask = (series / series.sum() * 100).lt(1)
    house_data[col] = np.where(house_data[col].isin(series[mask].index), 'Other',
                               house_data[col])
    house_data[col] = house_data[col].astype('category')
Note that the threshold of 1 percent of the total sample is rather ad hoc. Imagine that you have a dataset consisting of millions of samples; then there is no need to get rid of these categories, since we still have sufficient information (unless you need balanced categories).
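Since the threshold is a judgment call, it can help to wrap the grouping in a small function with the threshold as a parameter. The following is a sketch under my own assumptions: `group_rare_categories`, `min_share` and `other_label` are hypothetical names, and the data is invented.

```python
import pandas as pd

def group_rare_categories(values, min_share=0.01, other_label='Other'):
    """Replace categories whose share of non-missing rows is below min_share."""
    shares = values.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Series.where keeps values where the condition holds and replaces the rest
    return values.where(~values.isin(rare), other_label)

# Toy data: 'Wood' makes up 1 of 20 rows (5 percent)
roof = pd.Series(['Shingle'] * 15 + ['Metal'] * 4 + ['Wood'])
grouped = group_rare_categories(roof, min_share=0.10)
unchanged = group_rare_categories(roof)  # default of 1 percent changes nothing here
print(grouped.value_counts())
```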
To see what effect our transformations and enhancements have had on our features, we can have a look at the basement quality feature again.
house_data['Bsmt_Qual'].value_counts().plot.bar(title='Basement Quality')
Data pipeline
Now that we have cleaned and transformed our dataset, we can see whether it is possible to create a decent model that can predict the sale prices of houses the model has not seen before. However, we first have to take care of a few steps before we can use our dataset in such a model.
We need to make sure that we do not evaluate the model on data that we used to train it. So we split our dataset into a training and a test part.
from sklearn.model_selection import train_test_split

X = house_data.drop(columns=['SalePrice'])
y = house_data['SalePrice']
# We need to split the set into a training and test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
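The split above is random, so the results will differ slightly from run to run. If you want a reproducible split, `train_test_split` accepts a `random_state` argument. A minimal sketch on invented data follows; the value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy feature table and target with invented values
X = pd.DataFrame({'Gr_Liv_Area': np.arange(10) * 100 + 1000})
y = pd.Series(np.arange(10) * 10000 + 100000, name='SalePrice')

# random_state fixes the shuffle; 42 is an arbitrary choice
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)
print(X_test.index.equals(split_b[1].index))  # same rows end up in the test set
```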
Even though we got rid of the features with many missing values, there are still features that have a few missing values. It is possible to use advanced techniques to impute these missing values; however, we will use a simple imputing technique, which replaces the missing values with the most frequent value for that feature.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
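To see what this imputer does, here is a small sketch on an invented table with one categorical and one numerical column. Note that, with default settings, the result is a NumPy array rather than a DataFrame, so the column names are lost.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing value per column (invented values)
toy = pd.DataFrame({
    'Garage_Type': ['Attchd', np.nan, 'Attchd', 'Detchd'],
    'Lot_Frontage': [60.0, 60.0, np.nan, 80.0],
})

imputer = SimpleImputer(strategy='most_frequent')
imputed = imputer.fit_transform(toy)

print(type(imputed))  # a NumPy array by default
print(imputed)        # missing entries replaced by each column's most frequent value
```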
Standardisation & Encoding
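Many machine learning models work best when the features in the dataset are on a similar scale. Therefore, we will standardise all numerical features in the dataset by subtracting the mean of that feature and dividing by its standard deviation, i.e.,
$$z = \frac{x - \mu}{\sigma},$$
where $x$ is the original value of the feature, and $\mu$ and $\sigma$ are the mean and standard deviation of that feature.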
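As a quick check of what standardisation does, the sketch below computes it by hand with NumPy and compares the result with scikit-learn's `StandardScaler`. The living areas are invented values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy living areas (invented values), one feature in one column
areas = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])

# By hand: subtract the mean, divide by the (population) standard deviation
manual = (areas - areas.mean(axis=0)) / areas.std(axis=0)

# With scikit-learn, which applies the same formula column by column
scaled = StandardScaler().fit_transform(areas)

print(manual.ravel())
print(scaled.ravel())
```

After the transformation the feature has mean 0 and standard deviation 1, whatever its original unit was.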
We also have categorical features which we cannot directly feed to the machine learning models. All features need to be represented by a numerical value. Both nominal and ordinal features can be transformed by a technique called One-hot encoding, where a categorical feature is replaced by a number of dummies that indicate one of these categories.
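from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Next we need to scale the numerical columns and encode the categorical
# variables. This can be done by splitting the dataset into two parts.
cat_cols = X_train.select_dtypes(include='category').columns
num_cols = X_train.select_dtypes(include='number').columns
# Boolean masks are used because the imputer returns an array without column names
cat_cols_mask = X_train.columns.isin(cat_cols)
num_cols_mask = X_train.columns.isin(num_cols)
# Note: boolean columns (Fence, Pool) match neither selection, so they are
# dropped by the ColumnTransformer (remainder='drop' is its default)
ct = ColumnTransformer(
    [('scale', StandardScaler(), num_cols_mask),
     ('encode', OneHotEncoder(), cat_cols_mask)])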
Note that ordinal values can also be transformed into a single numerical value, where each category is represented by a number that follows the ordering of the categories. The advantage of this procedure is that we require fewer features to represent this feature numerically; however, for linear models in particular it also imposes a linear relationship between that feature and the value we want to predict.
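The difference between the two encodings can be seen on a tiny, invented example. This sketch uses scikit-learn's `OrdinalEncoder`, which this tutorial does not use in its pipeline; passing the category order explicitly ensures that Bad maps to 0 and Good to 2.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy ordinal feature (invented values)
quality = pd.DataFrame({'Kitchen_Qual': ['Good', 'Bad', 'Average', 'Good']})
order = [['Bad', 'Average', 'Good']]

# One numerical column that follows the category order
ordinal = OrdinalEncoder(categories=order).fit_transform(quality)

# One dummy column per category
one_hot = OneHotEncoder(categories=order).fit_transform(quality).toarray()

print(ordinal.ravel())
print(one_hot)
```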
Machine learning model
Now that we have a dataset that only consists of numerical values, we can apply any machine learning model we want. scikit-learn provides many machine learning models with a common interface (except for the hyperparameters), so it is easy to try many different models. In our case I will only use a linear regression.
from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression()
Combine all steps of pipeline
Pipelines can be used to automatically perform all steps of the model estimation (including preprocessing steps). It also makes sure that the preprocessing is done correctly, e.g., the scaling for the test dataset is the one used for the training dataset (Reader: why is this absolutely necessary?). An overview of what the pipeline does is given in the figure below (copyright by Sebastian Raschka).
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imputer, ct, linear_reg)
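One way to convince yourself that a pipeline reuses the training-set scaling is to inspect the fitted scaler. The sketch below uses synthetic data and a shorter pipeline (scaler plus linear regression only), so it is an illustration and not the pipeline of this tutorial. `make_pipeline` names each step after its lowercased class name, hence `'standardscaler'`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one feature, price roughly 100 * area plus noise
rng = np.random.default_rng(0)
X = rng.normal(loc=1500, scale=400, size=(100, 1))
y = 100 * X[:, 0] + rng.normal(scale=5000, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

demo_pipe = make_pipeline(StandardScaler(), LinearRegression())
demo_pipe.fit(X_train, y_train)

# The scaler stored the mean of the training data only
scaler = demo_pipe.named_steps['standardscaler']
print(scaler.mean_, X_train.mean(axis=0))

# predict() applies that same stored scaling to the test data
y_pred = demo_pipe.predict(X_test)
```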
Training & Evaluating model
Training the model is as simple as
pipe = pipe.fit(X_train, y_train)
Usually, the best model, either within one class of models or across several classes, is determined by using cross-validation or by splitting the training dataset into a training and a validation part (if you have enough data). I will skip this step and go immediately to the evaluation of our model.
In the step above we estimated and trained our model based on the training set. To see how well it performs, we can look at what sale prices our model predicts for data it has not seen before.
y_pred = pipe.predict(X_test)
errors = y_pred - y_test
scikit-learn can provide a score for the model, but we will look at the error of the prediction instead. First, let’s plot the histogram of the errors.
plt.hist(errors, bins=30)
The histogram suggests that the errors are roughly normally distributed, which is in line with the usual assumption of linear regression. Furthermore, it seems that most predictions have an error of less than $20,000. To me this seems like a reasonable model.
In this tutorial we have seen how we can go from raw data to a predictive model that has a reasonable performance. However, there are many things we have not discussed yet. To name a few:
- Consider multiple models and select the best one using cross-validation.
- Use a more advanced missing value imputation technique, e.g., MICE.
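As a starting point for the first item, here is a minimal sketch of comparing two models with 5-fold cross-validation. It runs on synthetic data from `make_regression` rather than on the Ames data, and the choice of `Ridge` as a second candidate, the number of folds and the scoring metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the real training set
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Two candidate pipelines; Ridge is an arbitrary second model
candidates = {
    'linear': make_pipeline(StandardScaler(), LinearRegression()),
    'ridge': make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

mean_scores = {}
for name, model in candidates.items():
    # Negative mean absolute error: values closer to zero are better
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
    mean_scores[name] = scores.mean()

best = max(mean_scores, key=mean_scores.get)
print(mean_scores, best)
```

Because each candidate is a full pipeline, the scaling is refitted within every fold, so no information from the held-out fold leaks into the preprocessing.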