In the world of real estate, determining property prices involves numerous factors, from location and size to amenities and market trends. Simple linear regression, a foundational technique in machine learning, provides a practical way to predict housing prices based on key features like the number of rooms or square footage.
In this article, I delve into the process of applying simple linear regression to a housing dataset, from data preprocessing and feature selection to building a model that can offer valuable price insights. Whether you’re new to data science or seeking to deepen your understanding, this project serves as a hands-on exploration of how data-driven predictions can shape smarter real estate decisions.
First things first, you start by importing your libraries:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Read the data from the directory where you stored it
data = pd.read_csv('/kaggle/input/california-housing-prices/housing.csv')
data

# Check the column types and non-null counts to spot missing values
data.info()

# Drop rows with null values so every column has the same non-null count
data.dropna(inplace=True)
data.info()
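To make the null check concrete, here is a minimal sketch on a made-up three-row frame. The values are invented for illustration and are not taken from the housing data; `toy`, `null_counts` and `toy_clean` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; the numbers are made up for illustration.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Count missing values per column before deciding how to handle them.
null_counts = toy.isna().sum()

# Drop every row that contains at least one missing value.
toy_clean = toy.dropna()

print(null_counts)
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Counting with `isna().sum()` gives the same information as `info()`, just as one number per column, which can be easier to scan on wide datasets.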
# Separate the features (X) from the target (y), then split into training and test sets
from sklearn.model_selection import train_test_split

X = data.drop(['median_house_value'], axis=1)
y = data['median_house_value']
y

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
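One thing worth knowing about the split: without a fixed seed, every run produces a different partition, so scores will vary from run to run. The sketch below uses synthetic arrays as a stand-in for the housing data, and the `random_state=42` argument is my addition for illustration; it is not part of the code in this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target (100 rows, 3 columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# random_state is an assumption added for this sketch: fixing it makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# With test_size=0.2, 80% of the rows go to training and 20% to testing.
print(X_tr.shape, X_te.shape)
```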
# Rejoin the training features and target to examine how they correlate
train_data = X_train.join(y_train)
train_data

# Visualize the distribution of each numeric column
train_data.hist(figsize=(15, 8))
# One-hot encode the non-numeric columns so they can be included in the correlation analysis
train_data_encoded = pd.get_dummies(train_data, drop_first=True)
correlation_matrix = train_data_encoded.corr()
print(correlation_matrix)

# Plot the correlation matrix as an annotated heatmap
plt.figure(figsize=(15, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="inferno")
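A full correlation matrix can be hard to read, so it sometimes helps to pull out only the column for the target and sort it. The sketch below does that on synthetic data. The column names mirror the housing dataset, but the values and the relationship between them are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends strongly on income and not at all on "noise".
rng = np.random.default_rng(0)
income = rng.uniform(1, 10, size=200)
demo = pd.DataFrame({
    "median_income": income,
    "noise": rng.normal(size=200),
    "median_house_value": 40000 * income + rng.normal(scale=20000, size=200),
})

# Keep only the correlations with the target, drop the target's self-correlation,
# and sort from strongest positive to strongest negative.
target_corr = (
    demo.corr()["median_house_value"]
    .drop("median_house_value")
    .sort_values(ascending=False)
)
print(target_corr)
```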
Counts per ocean_proximity category (value_counts() output):
ocean_proximity
INLAND        5183
NEAR OCEAN    2108
NEAR BAY      1783
ISLAND           5
Name: count, dtype: int64
# Log-transform the right-skewed count columns (adding 1 avoids taking log(0))
train_data['total_rooms'] = np.log(train_data['total_rooms'] + 1)
train_data['total_bedrooms'] = np.log(train_data['total_bedrooms'] + 1)
train_data['population'] = np.log(train_data['population'] + 1)
train_data['households'] = np.log(train_data['households'] + 1)

# Plot the histograms again to see the effect of the transform
train_data.hist(figsize=(15, 8))
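To see why the log transform helps, the sketch below measures skewness before and after on synthetic right-skewed data. The data and the `skewness` helper are hypothetical and written only for this example. `np.log1p(x)` computes the same thing as `np.log(x + 1)`.

```python
import numpy as np

# Synthetic right-skewed data standing in for a column like total_rooms.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=7.0, sigma=1.0, size=5000)


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# np.log1p(x) is equivalent to np.log(x + 1).
logged = np.log1p(skewed)

print("skewness before:", skewness(skewed))
print("skewness after: ", skewness(logged))
```

A skewness near zero means the histogram is roughly symmetric, which is closer to the bell shape that linear models tend to handle well.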
0.5092972905670141
# Convert the ocean_proximity categories into binary columns using one-hot encoding
train_data.ocean_proximity.value_counts()

# Create one binary (0 or 1) column for each category
pd.get_dummies(train_data.ocean_proximity)
0.4447616558596853
# Join the binary columns to the training data, then drop the original ocean_proximity column
train_data = train_data.join(pd.get_dummies(train_data.ocean_proximity)).drop(['ocean_proximity'], axis=1)
train_data
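One caveat with one-hot encoding: if the same encoding is later applied to the test split, a rare category such as ISLAND may be missing there, and the test frame ends up with fewer columns than the training frame. The sketch below shows one way to guard against that, using tiny made-up frames. The `reindex` step is my suggestion and not something this article's code does.

```python
import pandas as pd

# Tiny made-up splits: the test split happens to contain no ISLAND rows.
train_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "ISLAND", "INLAND"]})
test_cat = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND"]})

# dtype=int yields 0/1 columns instead of booleans.
train_dummies = pd.get_dummies(train_cat["ocean_proximity"], dtype=int)
test_dummies = pd.get_dummies(test_cat["ocean_proximity"], dtype=int)

# Align the test columns to the training columns; categories absent from the test split become 0.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies.columns.tolist())
print(test_dummies)
```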
# Recheck the correlations now that the categories are encoded
plt.figure(figsize=(18, 8))
sns.heatmap(train_data.corr(), annot=True, cmap='twilight')
0.5384474921332503
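Once the features are prepared, fitting and scoring a linear regression could look roughly like the sketch below. It runs on synthetic data so that it is self-contained; the variable names and the `random_state` values are assumptions for illustration, and the score it prints has nothing to do with the scores shown in this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship: y = 3*x0 - 2*x1 + noise.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(500, 2))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# Fit on the training split only, then score on the held-out test split.
reg = LinearRegression()
reg.fit(X_tr, y_tr)

# score() returns R^2, the fraction of the target's variance the model explains.
r2 = reg.score(X_te, y_te)
print("coefficients:", reg.coef_)
print("R^2 on test data:", r2)
```

Any transformation applied to the training features, such as the log transform and the one-hot encoding, has to be applied to the test features in the same way before calling `score`.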
Training a model is not the easiest of processes, but you can keep improving on the results above by adding more parameters to the param_grid, such as min_feature, so that the score of your best estimator keeps improving.
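The grid search code itself is not shown in this article, so the following is only a sketch of what tuning with a param_grid can look like. It assumes a `RandomForestRegressor` and scikit-learn's `GridSearchCV`, runs on synthetic data, and uses `max_features` and `min_samples_split` as example parameters. All of these are my assumptions, not the original setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data with a mildly non-linear target.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 3))
y_syn = 3.0 * X_syn[:, 0] + X_syn[:, 1] ** 2 + rng.normal(scale=1.0, size=200)

# Each key is an estimator parameter; each list holds the values to try.
# The estimator and these parameter choices are assumptions for illustration.
param_grid = {
    "n_estimators": [10, 30],
    "max_features": [1, 2, 3],
    "min_samples_split": [2, 4],
}

# Try every combination with 3-fold cross-validation and keep the best one.
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2"
)
search.fit(X_syn, y_syn)

best_forest = search.best_estimator_
print("best parameters:", search.best_params_)
print("best cross-validated R^2:", search.best_score_)
```

Every extra parameter multiplies the number of combinations to fit, so a larger grid gives the search more options at the cost of longer training time.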
If you made it this far, please like and share your comments below; your opinion really matters. Thank you! ❤️