Madison Housing Prices Multiple Regression

By Julia Janicki

Introduction

I am investigating what are some potential factors that influence housing prices in the Madison area. I obtained information on 96 individual postings for house sales (95 after removal of an outlier) from the real estate website Trulia, and at the end I used the following parameters to perform a multiple regression to predict house prices: area of house (in square feet), number of bedrooms, and number of bathrooms, percent people commuting by car in neighborhood, number of restaurants in vicinity, and distance to capitol (in meters). The result of the model suggests that there is a relationship between the variables investigated and price given that the p-value is less than 2.2e-16, which is less than 0.05, which means that we reject H0, that there is no relationship between these variables and price. The R squared value is 0.7418, which means that the proportion of variability in price accounted for by the regression model is about 74%.

Figure 1. Madison Houses For Sale

Figure 1. Madison Houses For Sale

Body

—Data—

Originally, I wanted to scrape data from the real estate website Trulia for the Madison area, which has about 1300 records. But data scraping was more complicated than expected so I manually went over around 96 housing postings and recorded the following information for each house: price, area in square feet, number of bedrooms, number of bathrooms, year built, percent people commuting by car in neighborhood, number of restaurants in vicinity, number of grocery stores in vicinity, number of bars in vicinity, as well as address. I then geocoded the address using an R script to get the latitude and longitude of each house. Then in ArcGIS (Figure 1), I imported all the housing data with latitude / longitude for each house, as well as imported the point of the state capitol and the lakes in Dane county. I calculated two extra metrics for each house: distinace to capitol and distance to nearest lake by using the “Near” operator. So my final dataset includes prices for each house and the following metrics: area in square feet, number of bedrooms, number of bathrooms, year built, percent people commuting by car in neighborhood, number of restaurants in vicinity, number of grocery stores in vicinity, number of bars in vicinity, distance to capitol and distance to lakes. In the beginning I included only the variables area, number of bathrooms and number of bedrooms in my multiple regression model after having conducted simple linear regression on each variable vs price, but these three variables are related to each other (for example, a house with bigger area tends to have more bathrooms and bedrooms), so I decided to include other variables. There was also one outlier that was causing the sample to be skewed so I removed it. The final dataset used included 95 data points with the following variables being investigated: area of house (in square feet), number of bedrooms, and number of bathrooms, percent people commuting by car in neighborhood, number of restaurants in vicinity, and distance to capitol (in meters).

—Methods—

I conducted a multiple regression model in R and checked the assumptions. I plotted the residuals vs fitted values (Figure 2) to check if there are dependence or varying vertical spread, which we don’t have so it is okay. I plotted the normal probability plot of the residuals, which is linear and looks good (Figure 3). I plotted the residuals in the order in which the observations were made (Figure 4) and there is no relationship so it looks good. Finally I plotted the residuals vs area (Figure 5) and other independent variables to make sure there is no relationship. Then I obtained the summary of the model in R.

Remove outlier and load data

##    price bedrooms bathrooms sq_ft year restaurants grocery_stores
## 1 799900        3         2  2450 1917         106              4
## 2 225000        2         4  1722 1998           8              3
## 3 250000        3         2  1750 1999          11              0
## 4 985000        4         3  3552 1980           2              2
## 5 389900        3         3  1885 2015           4              0
## 6 259900        1         1   879 2007         281             13
##   nightlife commute_by_car distance_to_capitol distance_to_lakes
## 1        22             85           3150.2771         563.13011
## 2         1             76           7473.2295         882.25388
## 3         2             87           8926.6878        4509.71855
## 4         0             87           6130.2463          43.75013
## 5         2             87           8626.8025        4043.11112
## 6       124             44            514.5936         465.56617

Simple Linear Regression for Independent Variables

## [1] 0.5387543

## [1] 0.6508261

## [1] 0.7639259

## [1] 0.2066548

## [1] -0.2553181

## [1] -0.166137

Decide which independent variables to use

Apart from area, number of bathrooms, and number of bedrooms I also added the following factors: %commute by car, number of restaurants, distance to capitol (m). The model above included area, number of bedrooms and number of bathrooms, which are in fact all correlated. So I decided to includ other variables.

Multiple Regression Model

model2 = lm(price2 ~ area2+bathroom2+bedroom2+commute2+restaurant2+capitol2)
summary(model2)
## 
## Call:
## lm(formula = price2 ~ area2 + bathroom2 + bedroom2 + commute2 + 
##     restaurant2 + capitol2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -245108  -61585   -4626   57618  402189 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  88149.951 137739.963   0.640   0.5238    
## area2          166.228     21.349   7.786 1.06e-11 ***
## bathroom2    15157.415  14516.106   1.044   0.2992    
## bedroom2    -18438.402  12689.839  -1.453   0.1497    
## commute2       -41.717   1455.558  -0.029   0.9772    
## restaurant2    641.064    335.176   1.913   0.0589 .  
## capitol2       -10.628      4.368  -2.433   0.0169 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105700 on 91 degrees of freedom
## Multiple R-squared:  0.7418, Adjusted R-squared:  0.7248 
## F-statistic: 43.58 on 6 and 91 DF,  p-value: < 2.2e-16

Check assumptions for model

  1. plot of residuals vs fitted values
plot(x=model2$fitted.values, y = model2$residuals, main="Figure 2: residuals vs fitted values")

Dependence is bad. Varying vertical spread is bad.

–> our plot is good

  1. normal probability plot of the residuals
qqnorm(model2$residuals, main="Figure 3: QQ plot of residuals")

Non-linear is bad. –> our plot is good

  1. plot residuals in the order in which the observations were made
plot(price2,model2$residuals, main="Figure 4: residuals")

–> our plot is good

  1. Plot residuals vs each independent variable
plot(area2,model2$residuals, main="Figure 4: residuals vs area")

–> our plot is good

After removing the outlier the assumptions are much better

—Result—

The final regression equation is: 166.228area + 15157.415bathroom - 18438.402bedroom - 41.717commute + 641.064restaurant - 10.628capitol + 88149.951. The R squared is 0.7418. The F-statistic is 43.58 on 6 and 91 DF and the p-value < 2.2e-16. The p-values for the intercept, area, number of bathrooms, number of bedrooms, percent of people commuting by car, number of restaurants in vicinity, and distance to capitol are: 0.5238, 1.06e-11, 0.2992, 0.1497, 0.9772, 0.0589 and 0.0169, respectively (Table 1).

—Analysis—

The R-squared is 0.7418, which means that the proportion of variability in price accounted for by the regression model is 74%. The overall p-value is less than 2.2e-16, which is less than 0.05, which means that we reject H0, so there is a relationship between price and the listed independent variables. Finally, for individual variables, we reject H0 for area and distance to capitol since their p-values are less than 0.05, while for the others we do not reject H0 since their p-values are greater than 0.05. The p-value of number of restaurants is 0.0589, so even though we do not reject it it is quite close to 0.05. Area seems be the variable that affects price the most, followed by distance to capitol. It should be noted however that area is related to number of bedrooms and number of bathrooms and distance to capitol is loosely related to number of restaurants in vicinity, since there tends to be more restaurants towards the capitol, so this should be taken into consideration why some of the variables may have had a bigger p-value.

—Table 1— (Intercept) p-value = 0.5238 –> don’t reject H0
area p-value = 1.06e-11 –> reject H0
bathroom p-value = 0.2992 –> don’t reject H0
bedroom p-value = 0.1497 –> don’t reject H0
commute p-value = 0.9772 –> don’t reject H0
restaurant p-value = 0.0589 –> don’t reject H0 (but close!)
capitol p-value = 0.0169 –> reject H0

Conclusion

The results of the multiple regression model suggests that 74% of the variability in price is accounted for by the model, which includes the following factors: area, number of bathrooms, number of bedrooms, percent of people commuting by car, number of restaurants in vicinity, and distance to capitol. The p-value is very small so there is a strong relationship between prices and the factors that were investigated. In the future, it might be interesting to investigate how to deal with independent variables that may be related to each other,and it might be interesting to investigate if the independent variables have only a linear relationship with the dependent variable or if one or more independent variables potentially have a squared or other quadratic relationship with the dependent variable.