When it comes to predicting housing prices, many factors come into play. Attributes such as overall quality, which rates the material and finish of the house, and gross living area are clearly important contributors to home value, but they are far from the only ones. The goal of this project is to build a linear regression model on the Ames Housing Dataset, applying data analysis techniques such as exploratory data analysis, data cleaning, and feature selection to identify multiple independent variables that are moderately to highly correlated with the dependent variable. The resulting model aims to improve prediction accuracy and help both realtors and prospective homeowners price a house based on its available features.
The goal of this project is to build a linear regression model that predicts house prices using scikit-learn (sklearn), a widely used and powerful machine learning package in Python. The training data consists of the sale price and location of each property together with the various features that describe it. From the full dataset of 2,930 observations and 82 variables, 100 observations with all variables were selected as training data, and another 100 were chosen as test data used to check the accuracy of the model by comparing predicted values against actual values.
The work is organized as follows. Section 2.1 introduces the dataset before any data analysis techniques are applied. Section 2.2 explains the data exploration process; analyzing the distribution of the dependent variable SalePrice leads to the conclusion that the appropriate next step is to log-transform it. Section 2.3 explains how the data are prepared for modeling through cleaning and imputation. Section 2.4 describes how correlations guide the analysis, reshaping the data around features whose correlation strength falls within a target range. Section 3.1 covers the algorithm used in this project. Section 3.2 describes the steps to implement the algorithm with Python and sklearn and explains several approaches for improving the accuracy of the model. Section 3.3 presents the results obtained. Section 3.4 covers testing the model against test data to confirm its accuracy on new data. Finally, the conclusion is drawn in the last section.
The data source used for modeling in this project is the Ames Housing Dataset, a well-known dataset in machine learning and data analysis. It contains various features and attributes of residential homes in Ames, Iowa, USA, and is often used for regression tasks, particularly predicting housing prices.
Data collection is one of the first steps in building any model. After importing the training dataset with pandas.read_csv, preliminary information about the data can be retrieved. The dataset consists of 100 rows, each representing the sale of one property, and 82 columns, each describing a feature of that property. Null values are already visible in the first 5 rows.
# import the training data
import numpy as np
import pandas as pd
train = pd.read_csv('houseSmallData.csv') # get the training dataset
print(train.shape)
train.head()
(100, 82)
Unnamed: 0 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 82 columns
Using pandas.Series.describe and matplotlib.pyplot.hist to examine the basic statistics and distribution of the dependent variable SalePrice shows that the data are fairly right-skewed and unevenly distributed, with a few high-price outliers. Because the shape resembles a log-normal distribution, the data log-transformed with numpy.log are examined for comparison and turn out to be much closer to a normal distribution. The unbiased skew of each distribution can be computed with pandas.Series.skew; the closer the returned skewness is to 0, the more normal the distribution. The log-transformed sale prices have a skewness of -0.10, while the original distribution has a skewness of 1.18. This transformation does improve the regression model, as seen later in the model's performance against the test data. The sale price outliers in the original data are far less extreme after the log transformation and do not need to be removed from the analysis.
# analyze distribution of dependent variable using Series.describe(), Series.skew() and plt.hist()
import matplotlib.pyplot as plt
Y_distrib = train['SalePrice']
print(Y_distrib.describe())
fig, ax = plt.subplots(1,2)
# the original data is fairly skewed and unevenly distributed, much like a log-normal distribution
ax[0].hist(Y_distrib, label=f'skew {Y_distrib.skew():.2f}')
ax[0].set_title('Measured Values', size=10)
ax[0].legend()
# the log-transformed data looks close to a normal distribution
ax[1].hist(np.log(Y_distrib), label=f'skew {np.log(Y_distrib).skew():.2f}')
ax[1].set_title('Log-Transformed Values', size=10)
ax[1].legend()
fig.suptitle('Dependent Variable (SalePrice) Data Distribution')
fig.set_figwidth(8.5)
fig.tight_layout()
plt.show()
# conclude that a proper next step is to transform the data by taking the logarithm of the SalePrice values
# if the logarithm of the sale prices can be predicted, the actual sale prices can also be predicted by such transformation
Y_target = np.log(Y_distrib)
count 100.000000
mean 173820.660000
std 72236.552886
min 40000.000000
25% 129362.500000
50% 153750.000000
75% 207750.000000
max 438780.000000
Name: SalePrice, dtype: float64
Are there any missing values in the data? If so, the steps taken to clean them are described below.
In this step, columns with missing values are identified to determine which features may not add enough value to the analysis and which missing values need to be interpolated. After sorting the features by number of null values and cross-checking them against the features that are moderately or highly correlated with the dependent variable, no overlap was found, confirming that the features dominated by null values can be discarded in the cleaning process that follows.
Next, a chain of data cleaning operations is applied: DataFrame.select_dtypes(include=np.number) to keep only numeric columns, DataFrame.interpolate to fill NaN values by interpolation, and DataFrame.dropna to drop columns whose null values cannot be interpolated. The resulting dataset is more consistent, accurate, and better organized, and is ready for the correlation analysis in the next step. After cleaning, the data are reduced to 39 columns.
# detect missing values by sorting independent variables by number of nulls
nulls = train.isna().sum().sort_values(ascending=False)
nulls = nulls[nulls.values != 0]
nulls
PoolQC 100
Alley 94
MiscFeature 91
Fence 77
FireplaceQu 54
LotFrontage 14
GarageType 6
GarageYrBlt 6
GarageFinish 6
GarageQual 6
GarageCond 6
BsmtFinType1 3
BsmtQual 3
BsmtCond 3
BsmtExposure 3
BsmtFinType2 3
dtype: int64
# get the subset including columns of numeric data only, fill NaN values, and drop columns with uninterpolatable NaNs
data = train.select_dtypes(include=np.number).interpolate().dropna(axis=1)
print(sum(data.isna().sum() != 0)) # verify there is no missing data
print(data.shape) # get information about the number of remaining columns after data cleaning and repairing
0
(100, 39)
The Pearson correlation coefficient r ∈ [-1, 1] is the most common measure of linear correlation; it quantifies the strength and direction of the relationship between two variables and can be thought of as a normalized covariance. It is an important tool in the analysis for this project. The strength of correlation is classified by the absolute value of the coefficient: 0~0.2 is negligible, 0.2~0.4 weak, 0.4~0.6 moderate, 0.6~0.8 marked, and 0.8~1.0 high. pandas.DataFrame.corr can be used to compute the pairwise correlation of columns.
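To make the "normalized covariance" reading of Pearson's r concrete, the coefficient can be computed by hand and compared with the pandas result. The sketch below uses two small illustrative arrays rather than the project data.
# minimal sketch: Pearson's r as covariance normalized by the standard deviations (illustrative arrays)
import numpy as np
import pandas as pd
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_pandas = pd.DataFrame({'x': x, 'y': y}).corr().loc['x', 'y'] # pandas computes the same pairwise statistic
print(round(r_manual, 6), round(r_pandas, 6)) # the two values agree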
The acceptable range of coefficients is specified as no less than 0.4 in absolute value; that is, only independent features with moderate or higher correlation strength are retained for the subsequent linear regression analysis. Applying this criterion in code selects 15 features: OverallQual, GrLivArea, GarageArea, GarageCars, YearBuilt, TotalBsmtSF, GarageYrBlt, FullBath, MasVnrArea, TotRmsAbvGrd, 1stFlrSF, YearRemodAdd, LotArea, Fireplaces, and WoodDeckSF. The correlation strengths used to narrow down the selected features are shown in the bar plot below.
# calculate correlation factor
corr = data.corr() # return a DataFrame listing the correlation between every pair of columns
print(corr.shape) # understand the returned DataFrame is square in shape (col_num, col_num)
# strength of correlation is classified as, negligible 0~0.2, weak 0.2~0.4, moderate 0.4~0.6, marked 0.6~0.8, high 0.8~1.0
# limit variables with moderate strength of correlation and higher (r >= 0.4)
columns = corr['SalePrice'][corr.SalePrice.abs() >= 0.4].sort_values(ascending=False)
print(columns) # get information about columns that match criteria in a Series
cols = columns.index # get column labels in an Index
print(cols) # a total of 15 feature columns are selected, excluding the dependent variable column SalePrice
(39, 39)
SalePrice 1.000000
OverallQual 0.855061
GrLivArea 0.735129
GarageArea 0.688249
GarageCars 0.663441
YearBuilt 0.658636
TotalBsmtSF 0.616297
GarageYrBlt 0.589361
FullBath 0.579505
MasVnrArea 0.571836
TotRmsAbvGrd 0.553603
1stFlrSF 0.550912
YearRemodAdd 0.548330
LotArea 0.497124
Fireplaces 0.487907
WoodDeckSF 0.451241
Name: SalePrice, dtype: float64
Index(['SalePrice', 'OverallQual', 'GrLivArea', 'GarageArea', 'GarageCars',
'YearBuilt', 'TotalBsmtSF', 'GarageYrBlt', 'FullBath', 'MasVnrArea',
'TotRmsAbvGrd', '1stFlrSF', 'YearRemodAdd', 'LotArea', 'Fireplaces',
'WoodDeckSF'],
dtype='object')
plt.figure(figsize=(15,5))
plt.barh(cols[1:], columns[1:].values, label='r')
plt.gca().invert_yaxis()
plt.axis([0.0, 1.0, 15, -1])
plt.xticks([0.0,0.2,0.4,0.6,0.8,1.0],['0.0 Negligible','0.2 Weak','0.4 Moderate','0.6 Marked', '0.8 High','1.0'], ha='left')
plt.xlabel('Correlation Coefficient r')
plt.ylabel('Independent Features')
plt.title('Correlation Strength of Selected Features')
plt.legend()
plt.show()
In this section, the steps to implement the algorithm using Python and sklearn to build the model are presented, and several approaches for improving the accuracy of the model are explained: checking how the R-squared score varies as the number of modeling variables increases, testing how the sample_weight parameter affects the score, and comparing the model fit on the log-transformed dependent variable against the fit on the original values through visual inspection of the plots. The results obtained are then presented to explain the findings derived from the data analysis.
The algorithm selected for data analysis in this project is multiple linear regression, in which the value of the dependent variable is calculated from multiple independent variables. How well the dependent variable can be predicted depends on the strength of its relationship with those independent variables, which is the correlation covered above. Linear regression is a machine learning algorithm used to predict values within a continuous range rather than to classify them into categories.
Simple linear regression is the most basic form of linear regression. Its mathematical formulation is the traditional equation
\(y = mx + b\)
where x is the independent variable, y is the dependent variable, m is the slope (the weight), and b is the intercept (the bias). Multiple linear regression is a natural extension of simple linear regression; the main difference is that several independent variables are now used to make the prediction, and the equation becomes
\(y = m_{1}x_{1} + m_{2}x_{2} + ... + m_{n}x_{n} + b\)
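To read the multiple-regression equation concretely, a prediction is just a weighted sum of the feature values plus the intercept. The short sketch below evaluates the equation with made-up weights and feature values, purely for illustration; these numbers are not coefficients from the project's model.
# evaluate y = m1*x1 + m2*x2 + ... + mn*xn + b with made-up numbers (not model coefficients)
import numpy as np
m = np.array([3.5, -1.2, 0.8]) # hypothetical weights m1..m3
x = np.array([2.0, 4.0, 10.0]) # hypothetical feature values x1..x3
b = 5.0 # hypothetical intercept b
y = np.dot(m, x) + b # 3.5*2 - 1.2*4 + 0.8*10 + 5 = 15.2
print(y)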
Linear regression modeling can be implemented in Python with the tools provided by the sklearn library, which describes linear regression as finding the straight line that best fits a set of scattered data points; that line can then be used to forecast new data points. Because of its simplicity and usefulness, linear regression is a fundamental machine learning method. The sklearn.linear_model.LinearRegression class is used to instantiate the linear regression model. The key methods of this class used in this project are LinearRegression.fit(X, y, sample_weight=None), which fits the linear model and returns the fitted estimator; LinearRegression.predict(X), which predicts target values for the samples in X using the linear model; and LinearRegression.score(X, y, sample_weight=None), which returns the coefficient of determination (R-squared) of the prediction.
The improvement in model quality from the three approaches implemented in the analysis is measured by the R-squared value produced after the fitting and prediction steps. R-squared, the coefficient of determination, is a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variables in the regression model; in other words, R-squared shows how well the data fit the regression model (the goodness of fit). On the training data it ranges from 0 to 1, and the higher the value, the better the prediction accuracy of the model.
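To connect the score method with this definition, R-squared can also be computed directly as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below fits a LinearRegression on a few illustrative points (not the project data) and checks that the two calculations agree.
# minimal sketch: LinearRegression.score() versus a manual R^2 = 1 - SS_res/SS_tot (illustrative data)
import numpy as np
from sklearn.linear_model import LinearRegression
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
reg = LinearRegression().fit(X_demo, y_demo)
y_hat = reg.predict(X_demo)
ss_res = np.sum((y_demo - y_hat) ** 2) # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2) # total sum of squares
print(reg.score(X_demo, y_demo), 1 - ss_res / ss_tot) # the two values match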
The first analysis compares the results of three different sets of independent variables and keeps the best one; the sets use the top 5, 10, and 15 most strongly correlated variables, respectively.
The second analysis tests how the sample_weight parameter of LinearRegression.fit affects the result. The weights are estimated with compute_sample_weight, which balances samples by class for unbalanced datasets.
During the data exploration phase, it was concluded that taking the log of the dependent variable SalePrice could further improve the model's performance. The final analysis therefore tests this conclusion through visual inspection of the linear regression correlation plots.
# RESULT IMPROVEMENT ANALYSIS: compare the results of 3 different sets of independent variables and use the best one
# assign 3 selected sets of columns to X-variables and SalePrice column to Y-variable
Y = data['SalePrice']
X1 = data[cols[1:6]] # include highest 5 correlated variables
X2 = data[cols[1:11]] # include highest 10 correlated variables
X3 = data[cols[1:]] # include highest 15 correlated variables
# build linear regression model using sklearn library
from sklearn import linear_model
for item in [X1, X2, X3]:
    score_item = linear_model.LinearRegression().fit(item, Y).score(item, Y)
    print(f'R^2 = {score_item}')
# highest 5 correlated variables R^2 = 0.83
# highest 10 correlated variables R^2 = 0.85
# highest 15 correlated variables R^2 = 0.88
# conclude that including more correlated variables improves the model's prediction accuracy
X = X3 # confirm use of all 15 variables going forward
R^2 = 0.8309859964337732
R^2 = 0.8517354506353321
R^2 = 0.8824028808099044
# RESULT IMPROVEMENT ANALYSIS 2: test how enabling LinearRegression.fit()'s sample_weight parameter affects the result
from sklearn.utils.class_weight import compute_sample_weight
weight = compute_sample_weight(class_weight='balanced', y=Y)
score_sw = linear_model.LinearRegression().fit(X, Y, sample_weight=weight).score(X, Y)
score_sw
# conclude that enabling sample_weight argument barely improves the score in this case
0.8824028808099044
# RESULT IMPROVEMENT ANALYSIS 3: visual inspection of log-transformed data
# visual inspection of the model fit by original Y-variable data reveals some non-linearity in the data
model_untrans = linear_model.LinearRegression().fit(X,Y)
Y_pred_untrans = model_untrans.predict(X)
R2_untrans = model_untrans.score(X,Y)
# the model fit by log-transformation Y-variable data
model = linear_model.LinearRegression().fit(X,np.log(Y)) # fit the model with log-transformed Y-variable data
Y_pred = model.predict(X)
R2 = model.score(X,np.log(Y)) # R-squared score remains unchanged from using log-transformed Y-variable data
# plot predictions by the above 2 models to compare
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,2)
# plot the model fit by original Y-variable data
ax[0].scatter(Y, Y_pred_untrans, label='Values')
ax[0].plot([Y.min(), Y.max()], [Y.min(), Y.max()], label='Y=X Curve', c='red', lw=1.5)
ax[0].axis([Y_pred_untrans.min(), Y_pred_untrans.max(), Y_pred_untrans.min(), Y_pred_untrans.max()])
ax[0].set_title(f'Training Dataset ($R^2 = {R2_untrans:.2f})$')
ax[0].set_xlabel('Measured Prices')
ax[0].set_ylabel('Predicted Prices')
ax[0].legend()
# plot the model fit by log-transformation Y-variable data
ax[1].scatter(np.log(Y), Y_pred, label='Log-Transformed Values')
ax[1].plot([np.log(Y).min(), np.log(Y).max()], [np.log(Y).min(), np.log(Y).max()], label='Y=X Curve', c='red', lw=1.5)
ax[1].axis([Y_pred.min(), Y_pred.max(), Y_pred.min(), Y_pred.max()])
ax[1].set_title(f'Log-Transformed Training Dataset ($R^2 = {R2:.2f})$')
ax[1].set_xlabel('Measured Prices')
ax[1].set_ylabel('Predicted Prices')
ax[1].legend()
fig.set_figwidth(10)
fig.tight_layout()
plt.show()
# conclude the model fit by log-transformation Y-variable data does present better linearity from visual inspection
# confirm to use the model fit from log-transformation Y-variable data to continue
For the first analysis, the three sets using the top 5, 10, and 15 most strongly correlated variables gave R-squared values of 0.83, 0.85, and 0.88, respectively. It can be concluded that including more well-correlated features raises the prediction accuracy of the model, so all 15 variables were kept for the subsequent analyses.
As for the second analysis, the essentially unchanged R-squared value shows that enabling the sample_weight argument barely improves the score in this case.
Regarding the final analysis, some non-linearity is visible in the plot of the model fit on the original SalePrice data, even though the R-squared value is essentially the same as with the log-transformed SalePrice data. In contrast, the model fit on the log-transformed SalePrice data shows clearly better linearity on visual inspection, and its prediction errors follow a roughly normal distribution, as shown below. The model fit on the log-transformed data is therefore the one verified against the test data next: if the logarithm of the sale price can be predicted, the actual sale price can be recovered by inverting the transformation.
# plot histogram of errors
import matplotlib.pyplot as plt
plt.figure(figsize=(5,5))
plt.hist(np.log(Y) - Y_pred, label='Prediction Error')
plt.xlabel('Error Value')
plt.ylabel('Entry Count')
plt.title('Prediction Error Distribution')
plt.legend()
plt.show()
# verify that the distribution of errors is approximately normal
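Because the model predicts the logarithm of the sale price, a prediction in dollars is recovered by exponentiating the output, which is the back-transformation referred to above. The short sketch below assumes the model, X, and Y objects defined earlier in this section.
# recover dollar prices from log-scale predictions by inverting the log transform with np.exp
# (assumes `model`, `X`, and `Y` from the cells above)
log_price_pred = model.predict(X) # predictions on the log scale
dollar_price_pred = np.exp(log_price_pred) # back-transformed sale prices in dollars
print(dollar_price_pred[:5].round(0)) # first few predicted prices
print(Y.values[:5]) # corresponding measured prices for comparison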
Now that a prediction model is available, it is time to test it against the test data to confirm its accuracy on new data. The test dataset has the same number of rows and columns as the training dataset, but it is a different group of observations from the same source, the Ames Housing Dataset.
After log-transforming the SalePrice values of the test data in the same way as the training data, the R-squared score comes out to 0.84. This is slightly lower than the 0.88 of the trained model, but the scatter plot still shows good linearity.
# import the test data
test = pd.read_csv('jtest.csv')
print(test.shape)
test.head()
(100, 82)
Unnamed: 0 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100 | 101 | 20 | RL | NaN | 10603 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2010 | WD | Normal | 205000 |
1 | 101 | 102 | 60 | RL | 77.0 | 9206 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal | 178000 |
2 | 102 | 103 | 90 | RL | 64.0 | 7018 | Pave | NaN | Reg | Bnk | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2009 | WD | Alloca | 118964 |
3 | 103 | 104 | 20 | RL | 94.0 | 10402 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 198900 |
4 | 104 | 105 | 50 | RM | NaN | 7758 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2007 | WD | Normal | 169500 |
5 rows × 82 columns
# assign the set of 15 columns applied to the training model to x-variables and SalePrice column to y-variable
x = test[cols].drop('SalePrice', axis=1).interpolate() # repair missing data in test dataset
y = test['SalePrice']
x
OverallQual | GrLivArea | GarageArea | GarageCars | YearBuilt | TotalBsmtSF | GarageYrBlt | FullBath | MasVnrArea | TotRmsAbvGrd | 1stFlrSF | YearRemodAdd | LotArea | Fireplaces | WoodDeckSF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6 | 1610 | 480 | 2 | 1977 | 1610 | 1977.0 | 2 | 28.0 | 6 | 1610 | 2001 | 10603 | 2 | 168 |
1 | 6 | 1732 | 476 | 2 | 1985 | 741 | 1985.0 | 2 | 336.0 | 7 | 977 | 1985 | 9206 | 1 | 192 |
2 | 5 | 1535 | 410 | 2 | 1979 | 0 | 1979.0 | 2 | 0.0 | 8 | 1535 | 1979 | 7018 | 0 | 0 |
3 | 7 | 1226 | 740 | 3 | 2009 | 1226 | 2009.0 | 2 | 0.0 | 6 | 1226 | 2009 | 10402 | 0 | 0 |
4 | 7 | 1818 | 240 | 1 | 1931 | 1040 | 1951.0 | 1 | 600.0 | 7 | 1226 | 1950 | 7758 | 2 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 6 | 1456 | 440 | 2 | 1976 | 855 | 1976.0 | 2 | 0.0 | 7 | 855 | 1976 | 2280 | 1 | 87 |
96 | 7 | 1726 | 786 | 3 | 2007 | 1726 | 2007.0 | 2 | 205.0 | 8 | 1726 | 2007 | 9416 | 1 | 171 |
97 | 8 | 3112 | 795 | 2 | 1918 | 1360 | 1918.0 | 2 | 0.0 | 8 | 1360 | 1990 | 25419 | 1 | 0 |
98 | 6 | 2229 | 0 | 0 | 1912 | 755 | 1961.0 | 1 | 0.0 | 8 | 929 | 1950 | 5520 | 0 | 0 |
99 | 8 | 1713 | 856 | 3 | 2004 | 1713 | 2004.0 | 2 | 262.0 | 7 | 1713 | 2005 | 9591 | 1 | 0 |
100 rows × 15 columns
# test how good the model is with test dataset
# score the trained model against the test dataset, log-transforming its y-variable in the same way the training data was transformed
y_pred = model.predict(x)
r2 = model.score(x,np.log(y))
print(f'Test R^2 = {r2}')
# R-squared score of 0.84 is slightly lower than the training model's 0.88
Test R^2 = 0.8382396071005003
# plot the model against test dataset
plt.figure(figsize=(5,5))
plt.scatter(np.log(y), y_pred, label='Values (Log-Transformed)')
plt.plot([np.log(y).min(), np.log(y).max()], [np.log(y).min(), np.log(y).max()], label='Y=X Curve', c='red', lw=1.5)
plt.axis([y_pred.min(), y_pred.max(), y_pred.min(), y_pred.max()])
plt.title(f'Test Dataset ($R^2 = {r2:.2f}$)')
plt.xlabel('Measured Prices (Log-Transformed)')
plt.ylabel('Predicted Prices (Log-Transformed)')
plt.legend()
plt.show()
It can be confirmed that using the 15 most strongly correlated independent variables together with the log-transformed dependent variable produced the best model, and the hypothesis derived from the data exploration and model analysis, that given a prediction of the logarithm of the sale price, the actual sale price can be recovered by inverting the logarithmic transformation, holds. Using the linear regression model built in this project, home sale prices in Ames can be predicted with an R-squared of 0.84 on the test data, which can help potential buyers and sellers make better-informed decisions.