Linear Regression for Predicting Real Estate Price

Linear regression is one of the simplest and most common statistical methods used in machine learning, where it serves as a supervised learning technique for making predictions. It is used to predict a continuous (interval or ratio) value.

A linear regression model involves two or more variables: one dependent variable (written y) and one or more independent variables (written x, or x1 to xn). Independent variables are the variables that can affect the value of the dependent variable. The dependent variable is the one being predicted, and its value depends on the values of the independent variables. Consider the following example of linear regression.

The sample data lists motorcycle engine displacement (in cc) and the power each engine produces (in HP).

[Image: Engine to Power]

If displayed in a scatter plot, it will look like this.

[Image: Scatter Plot]

In the plot, engine size is the independent variable and power is the dependent variable. If the two variables are highly correlated, the points fall close to a straight line, and a fitted line passes through or near most of them. The example data show this pattern.
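As a rough sketch of this idea, the snippet below fits a straight line to a handful of engine sizes and power figures with NumPy. The numbers are made-up illustrative values, not the data from the images above, and the helper name predict_power is hypothetical.

```python
import numpy as np

# Hypothetical illustrative values (NOT the article's data):
# engine displacement in cc and power in HP.
engine_cc = np.array([110.0, 125.0, 150.0, 200.0, 250.0, 400.0])
power_hp = np.array([8.5, 10.0, 13.0, 18.5, 25.0, 42.0])

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(engine_cc, power_hp)[0, 1]

# Least-squares straight line: power = slope * cc + intercept.
slope, intercept = np.polyfit(engine_cc, power_hp, deg=1)


def predict_power(cc):
    """Predict power (HP) from displacement (cc) using the fitted line."""
    return slope * cc + intercept


print(f"correlation r = {r:.3f}")
print(f"fitted line: power = {slope:.4f} * cc + {intercept:.2f}")
print(f"predicted power at 300 cc: {predict_power(300.0):.1f} HP")
```

A correlation close to 1 means the points lie close to the fitted line, which is the situation described for the scatter plot.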

Linear regression can be extended by using more than one independent variable to make predictions. This method is called multiple linear regression. It is illustrated here using the real estate price dataset available on Kaggle.
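In symbols, using the y and x1 to xn notation from earlier, a multiple linear regression model is usually written as shown here. The coefficient symbols (beta) and the error term (epsilon) are notation introduced for this sketch, not taken from the original text.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon
```

Here \beta_0 is the intercept and each \beta_i describes how much y changes when x_i increases by one unit while the other variables are held fixed. With n = 1 this reduces to the straight line of the engine example.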

The predictions are made in Python. First, we need to import the libraries we will use.

import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

Next, prepare the data by reading the CSV file with the Pandas library.

# Read and inspect the data
real_estate_df = pd.read_csv("real_estate.csv")
real_estate_df.head()

From the dataset, we can see the existing independent variables as follows.

[Image: Independent variables]
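Before modeling, it can help to check how strongly each variable is related to the price, and how strongly the independent variables are related to each other. The sketch below does this with a correlation matrix on a small synthetic DataFrame. The feature column names and all values are assumptions made for illustration; only the target name "Y house price of unit area" comes from the code in this article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for real_estate_df. The feature names and the
# numbers are assumptions for illustration, not the Kaggle data.
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(0, 40, n)
dist = rng.uniform(50, 5000, n)
stores = rng.integers(0, 11, n)
price = 45 - 0.25 * age - 0.005 * dist + 1.2 * stores + rng.normal(0, 3, n)

target = "Y house price of unit area"
demo_df = pd.DataFrame({
    "X2 house age": age,
    "X3 distance to the nearest MRT station": dist,
    "X4 number of convenience stores": stores,
    target: price,
})

# Pairwise Pearson correlations between all numeric columns.
corr = demo_df.corr()

# Correlation of each independent variable with the price.
print(corr[target].drop(target).sort_values())

# Correlations among the independent variables themselves;
# values near +1 or -1 would indicate collinearity.
print(corr.drop(index=target, columns=target).round(2))
```

On the real data, the same matrix could be drawn with sns.heatmap(corr, annot=True), since seaborn is already imported above.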

For this example, we assume that every independent variable is correlated with the price, so all of them are used for prediction. The dataset is then divided into training data (80%) and test data (20%) as follows.

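# train_test_split uses NumPy's global random state when random_state is not set
np.random.seed(0)
df_train, df_test = train_test_split(real_estate_df, train_size=0.8, test_size=0.2)
# pop() removes the target column from df_train in place, leaving only the features
y_train = df_train.pop("Y house price of unit area")
X_train = df_train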
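The split above relies on a global seed and on pop() modifying the DataFrame in place. An alternative sketch, shown here on a small synthetic DataFrame with hypothetical feature names, separates the features and target first and passes random_state directly to train_test_split. This is one possible variation, not the method used in this article.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

target = "Y house price of unit area"

# Synthetic stand-in data; feature_a and feature_b are hypothetical names.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "feature_a": rng.normal(size=50),
    "feature_b": rng.normal(size=50),
    target: rng.normal(size=50),
})

# Separate the independent variables (X) from the dependent variable (y).
X = demo_df.drop(columns=[target])
y = demo_df[target]

# random_state makes the split reproducible without a global seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

print(X_train.shape, X_test.shape)
```

Because the original DataFrame is left unchanged, the same code can be rerun without the target column disappearing after the first call.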

Once the data has been divided into training data and test data, the next step is to create and fit a model using the LinearRegression class from scikit-learn.

lm = LinearRegression()
model = lm.fit(X_train, y_train)
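After fitting, the learned intercept and coefficients are available as the intercept_ and coef_ attributes of the model. The sketch below illustrates this on synthetic data generated from a known linear rule, so the fitted values can be compared with the true ones. The data and the names X_demo, y_demo, and demo_lm are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known rule: y = 3 + 2*x1 - 1.5*x2 + small noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(0, 0.1, 200)

demo_lm = LinearRegression().fit(X_demo, y_demo)

# The fitted values should be close to 3, 2 and -1.5.
print("intercept:", demo_lm.intercept_)
print("coefficients:", demo_lm.coef_)
```

On the real estate model, lm.coef_ holds one coefficient per column of X_train, in the same order as the columns.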

To make predictions, pass the test data to the model. First, the test data is separated into the independent variables and the dependent variable.

X_test = df_test
y_test = df_test.pop("Y house price of unit area")
predictions = model.predict(X_test)

To measure how well the model predicts, we compute the mean absolute error (MAE) and the coefficient of determination (R2) on the test data. The code prints both values.

mae = mean_absolute_error(y_test, predictions)
print(mae)
r2 = r2_score(y_test, predictions)
print(r2)

[Image: MAE (above), R2 (below)]
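To show what these metrics measure, the sketch below computes the mean absolute error and R2 by hand with NumPy and compares them with the scikit-learn functions. It also computes the mean squared error with mean_squared_error, which is a separate function in sklearn.metrics. The values in y_true and y_pred are hypothetical, not results from the real estate model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted prices (NOT results from the article).
y_true = np.array([30.0, 42.5, 25.0, 55.0, 38.0])
y_pred = np.array([32.0, 40.0, 27.5, 50.0, 39.0])

# Mean absolute error: average size of the prediction errors.
mae_manual = np.mean(np.abs(y_true - y_pred))

# Mean squared error: average of the squared errors.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MAE:", mae_manual, mean_absolute_error(y_true, y_pred))
print("MSE:", mse_manual, mean_squared_error(y_true, y_pred))
print("R2: ", r2_manual, r2_score(y_true, y_pred))
```

A lower MAE or MSE means smaller errors, while an R2 closer to 1 means the model explains more of the variation in the prices.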

