By Julia Janicki
I am investigating what are some potential factors that influence housing prices in the Madison area. I obtained information on 96 individual postings for house sales (95 after removal of an outlier) from the real estate website Trulia, and at the end I used the following parameters to perform a multiple regression to predict house prices: area of house (in square feet), number of bedrooms, and number of bathrooms, percent people commuting by car in neighborhood, number of restaurants in vicinity, and distance to capitol (in meters). The result of the model suggests that there is a relationship between the variables investigated and price given that the p-value is less than 2.2e-16, which is less than 0.05, which means that we reject H0, that there is no relationship between these variables and price. The R squared value is 0.7418, which means that the proportion of variability in price accounted for by the regression model is about 74%.
Figure 1. Madison Houses For Sale
Originally, I wanted to scrape data from the real estate website Trulia for the Madison area, which has about 1300 records. But data scraping was more complicated than expected so I manually went over around 96 housing postings and recorded the following information for each house: price, area in square feet, number of bedrooms, number of bathrooms, year built, percent people commuting by car in neighborhood, number of restaurants in vicinity, number of grocery stores in vicinity, number of bars in vicinity, as well as address. I then geocoded the address using an R script to get the latitude and longitude of each house. Then in ArcGIS (Figure 1), I imported all the housing data with latitude / longitude for each house, as well as imported the point of the state capitol and the lakes in Dane county. I calculated two extra metrics for each house: distinace to capitol and distance to nearest lake by using the “Near” operator. So my final dataset includes prices for each house and the following metrics: area in square feet, number of bedrooms, number of bathrooms, year built, percent people commuting by car in neighborhood, number of restaurants in vicinity, number of grocery stores in vicinity, number of bars in vicinity, distance to capitol and distance to lakes. In the beginning I included only the variables area, number of bathrooms and number of bedrooms in my multiple regression model after having conducted simple linear regression on each variable vs price, but these three variables are related to each other (for example, a house with bigger area tends to have more bathrooms and bedrooms), so I decided to include other variables. There was also one outlier that was causing the sample to be skewed so I removed it. The final dataset used included 95 data points with the following variables being investigated: area of house (in square feet), number of bedrooms, and number of bathrooms, percent people commuting by car in neighborhood, number of restaurants in vicinity, and distance to capitol (in meters).
I conducted a multiple regression model in R and checked the assumptions. I plotted the residuals vs fitted values (Figure 2) to check if there are dependence or varying vertical spread, which we don’t have so it is okay. I plotted the normal probability plot of the residuals, which is linear and looks good (Figure 3). I plotted the residuals in the order in which the observations were made (Figure 4) and there is no relationship so it looks good. Finally I plotted the residuals vs area (Figure 5) and other independent variables to make sure there is no relationship. Then I obtained the summary of the model in R.
## price bedrooms bathrooms sq_ft year restaurants grocery_stores
## 1 799900 3 2 2450 1917 106 4
## 2 225000 2 4 1722 1998 8 3
## 3 250000 3 2 1750 1999 11 0
## 4 985000 4 3 3552 1980 2 2
## 5 389900 3 3 1885 2015 4 0
## 6 259900 1 1 879 2007 281 13
## nightlife commute_by_car distance_to_capitol distance_to_lakes
## 1 22 85 3150.2771 563.13011
## 2 1 76 7473.2295 882.25388
## 3 2 87 8926.6878 4509.71855
## 4 0 87 6130.2463 43.75013
## 5 2 87 8626.8025 4043.11112
## 6 124 44 514.5936 465.56617
## [1] 0.5387543
## [1] 0.6508261
## [1] 0.7639259
## [1] 0.2066548
## [1] -0.2553181
## [1] -0.166137
Apart from area, number of bathrooms, and number of bedrooms I also added the following factors: %commute by car, number of restaurants, distance to capitol (m). The model above included area, number of bedrooms and number of bathrooms, which are in fact all correlated. So I decided to includ other variables.
model2 = lm(price2 ~ area2+bathroom2+bedroom2+commute2+restaurant2+capitol2)
summary(model2)
##
## Call:
## lm(formula = price2 ~ area2 + bathroom2 + bedroom2 + commute2 +
## restaurant2 + capitol2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -245108 -61585 -4626 57618 402189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88149.951 137739.963 0.640 0.5238
## area2 166.228 21.349 7.786 1.06e-11 ***
## bathroom2 15157.415 14516.106 1.044 0.2992
## bedroom2 -18438.402 12689.839 -1.453 0.1497
## commute2 -41.717 1455.558 -0.029 0.9772
## restaurant2 641.064 335.176 1.913 0.0589 .
## capitol2 -10.628 4.368 -2.433 0.0169 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 105700 on 91 degrees of freedom
## Multiple R-squared: 0.7418, Adjusted R-squared: 0.7248
## F-statistic: 43.58 on 6 and 91 DF, p-value: < 2.2e-16
plot(x=model2$fitted.values, y = model2$residuals, main="Figure 2: residuals vs fitted values")
Dependence is bad. Varying vertical spread is bad.
–> our plot is good
qqnorm(model2$residuals, main="Figure 3: QQ plot of residuals")
Non-linear is bad. –> our plot is good
plot(price2,model2$residuals, main="Figure 4: residuals")
–> our plot is good
plot(area2,model2$residuals, main="Figure 4: residuals vs area")
–> our plot is good
After removing the outlier the assumptions are much better
The final regression equation is: 166.228area + 15157.415bathroom - 18438.402bedroom - 41.717commute + 641.064restaurant - 10.628capitol + 88149.951. The R squared is 0.7418. The F-statistic is 43.58 on 6 and 91 DF and the p-value < 2.2e-16. The p-values for the intercept, area, number of bathrooms, number of bedrooms, percent of people commuting by car, number of restaurants in vicinity, and distance to capitol are: 0.5238, 1.06e-11, 0.2992, 0.1497, 0.9772, 0.0589 and 0.0169, respectively (Table 1).
The R-squared is 0.7418, which means that the proportion of variability in price accounted for by the regression model is 74%. The overall p-value is less than 2.2e-16, which is less than 0.05, which means that we reject H0, so there is a relationship between price and the listed independent variables. Finally, for individual variables, we reject H0 for area and distance to capitol since their p-values are less than 0.05, while for the others we do not reject H0 since their p-values are greater than 0.05. The p-value of number of restaurants is 0.0589, so even though we do not reject it it is quite close to 0.05. Area seems be the variable that affects price the most, followed by distance to capitol. It should be noted however that area is related to number of bedrooms and number of bathrooms and distance to capitol is loosely related to number of restaurants in vicinity, since there tends to be more restaurants towards the capitol, so this should be taken into consideration why some of the variables may have had a bigger p-value.
—Table 1— (Intercept) p-value = 0.5238 –> don’t reject H0
area p-value = 1.06e-11 –> reject H0
bathroom p-value = 0.2992 –> don’t reject H0
bedroom p-value = 0.1497 –> don’t reject H0
commute p-value = 0.9772 –> don’t reject H0
restaurant p-value = 0.0589 –> don’t reject H0 (but close!)
capitol p-value = 0.0169 –> reject H0
The results of the multiple regression model suggests that 74% of the variability in price is accounted for by the model, which includes the following factors: area, number of bathrooms, number of bedrooms, percent of people commuting by car, number of restaurants in vicinity, and distance to capitol. The p-value is very small so there is a strong relationship between prices and the factors that were investigated. In the future, it might be interesting to investigate how to deal with independent variables that may be related to each other,and it might be interesting to investigate if the independent variables have only a linear relationship with the dependent variable or if one or more independent variables potentially have a squared or other quadratic relationship with the dependent variable.