Regression
Over the course so far, we have established the connections between programming, mathematics, and statistical analysis that form the foundation of Machine Learning. In this chapter, the focus is on Regression in Machine Learning.
Regression-based Analysis of Data
Regression analysis is a collection of statistical techniques and procedures for estimating the relationship between a single dependent variable and one or more independent variables. Apart from modelling data, it is primarily used to assess the strength of the relationship between variables and to forecast their future values.
[Image: The different types of regression, single-variable and multivariate.]
Non-collinearity: The independent variables in a regression analysis should exhibit minimal correlation with one another. When independent variables are highly correlated, it becomes difficult to gauge their true individual relationships with the dependent variable. Low collinearity is therefore a standard assumption for regression analysis to work well.
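A quick way to screen for collinearity before fitting a regression is to inspect the pairwise correlations of the independent variables. A minimal sketch with made-up data:
# Check pairwise correlations between candidate independent variables
import pandas as pd
df = pd.DataFrame({'rainfall': [10, 20, 30, 40, 50],
                   'humidity': [30, 45, 55, 70, 80],
                   'wind': [5, 3, 6, 2, 4]})
# Absolute correlations close to 1 signal problematic collinearity
print(df.corr().abs())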
SINGLE LINEAR REGRESSION
Expression:
Y = a + bX + ε, where
Y – Dependent variable
X – Independent (explanatory) variable
a – Intercept
b – Slope
ε – Residual (error)
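As a small illustration, the intercept a and slope b can be estimated by least squares. A minimal sketch with made-up data points:
# Fit Y = a + bX by least squares on illustrative data
import numpy as np
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b, a = np.polyfit(X, Y, 1)  # degree-1 fit returns [slope, intercept]
print(a, b)  # Y is approximately a + b*X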
MULTIVARIATE REGRESSION
Expression:
Y = a + bX1 + cX2 + dX3 + ε, where
Y – Dependent variable
X1, X2, X3 – Independent (explanatory) variables
a – Intercept
b, c, d – Slopes
ε – Residual (error)
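The same idea extends to several explanatory variables. A minimal sketch with made-up data, using scikit-learn:
# Fit Y = a + b*X1 + c*X2 + d*X3 on illustrative data
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1, 2, 3], [2, 1, 4], [3, 3, 1], [4, 2, 2], [5, 4, 3]])  # columns X1, X2, X3
Y = np.array([14, 17, 16, 20, 26])
model = LinearRegression().fit(X, Y)
print(model.intercept_)  # a
print(model.coef_)       # b, c, d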
Ensemble Modelling
Ensemble methods are statistical techniques that combine multiple learning algorithms to obtain better predictive performance from the given input than any constituent algorithm alone. Unlike a traditional statistical ensemble, which is typically infinite, a machine learning ensemble consists of a concrete, finite set of alternative models.
To understand how Ensemble Modelling works, consider this scenario: if you were planning to purchase a car, would you choose the first car you learn about? Highly unlikely. You would research varied sources to understand how different cars stack up against each other and then choose the best one for your needs. Ensemble training in machine learning does the exact same thing for classification and regression models.
Supervised Learning and Ensemble Importance
Supervised machine learning algorithms search a predefined hypothesis space to learn the mapping from inputs to outputs and make predictions. An ensemble combines multiple hypotheses in search of even better predictions.
Ensemble Techniques
Let us look at the common ensemble techniques used in machine learning algorithms.
- Max Voting: This method is generally used for classification problems, where the predictions of multiple models are pitted against each other. Each individual model’s prediction counts as a “vote”, and the prediction made by the majority of the models is chosen as the final output.
- Averaging: As the name suggests, multiple predictions are made for the same problem using different mechanisms, and the average of these individual predictions is taken as the final value.
- Weighted Average: An extension of the averaging technique in which every individual model is assigned a weight reflecting its importance. For instance, a prediction from a professor experienced in research would be given more weight than one from someone new to the field.
Let us now look at the implementation of these Ensemble techniques in Python:
# Import the required scikit-learn modules (importing sklearn alone does not load its submodules)
from sklearn import tree, neighbors, linear_model
# For instance, consider training a dataset using 3 distinct mechanisms – Decision Trees, KNN and Logistic Regression
model_one = tree.DecisionTreeClassifier()
model_two = neighbors.KNeighborsClassifier()
model_three = linear_model.LogisticRegression()
# We fit the three models on the same training data (x_train and y_train are assumed to come from an earlier train/test split)
model_one.fit(x_train,y_train)
model_two.fit(x_train,y_train)
model_three.fit(x_train,y_train)
# Perform predictions on the same test data, separately for the three algorithms
prediction_one = model_one.predict(x_test)
prediction_two = model_two.predict(x_test)
prediction_three = model_three.predict(x_test)
# One way to perform max voting is to take the mode of the three models' predictions for every test sample
from statistics import mode
import numpy as np
final_pred = np.array([])
for i in range(len(x_test)):
    final_pred = np.append(final_pred, mode([prediction_one[i], prediction_two[i], prediction_three[i]]))
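# The same majority vote can be computed without an explicit loop; a sketch using pandas
# (note: on a three-way tie, DataFrame.mode falls back to the smallest value)
import pandas as pd
votes = pd.DataFrame({'m1': prediction_one, 'm2': prediction_two, 'm3': prediction_three})
final_pred = votes.mode(axis=1)[0].to_numpy()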
# scikit-learn also provides the VotingClassifier() method to perform Max Voting ensemble modelling
# sklearn.ensemble.VotingClassifier – a soft voting/majority rule classifier for unfitted estimators
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree
model_one = LogisticRegression(random_state=2)
model_two = tree.DecisionTreeClassifier(random_state=2)
# estimators: a list of (str, estimator) tuples – when the fit method of VotingClassifier() is called, clones of those estimators are stored in the class attribute self.estimators_
# voting: {'hard', 'soft'}, default='hard' – 'hard' uses predicted class labels for majority-rule voting, while 'soft' predicts the class label from the argmax of the sums of the predicted probabilities
model_final = VotingClassifier(estimators=[('lr', model_one), ('dt', model_two)], voting='hard')
model_final.fit(x_train,y_train)
model_final.score(x_test,y_test)
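# The same ensemble can be switched to soft voting, which averages predicted class probabilities
# (sketch only: both estimators here, logistic regression and decision trees, support predict_proba)
model_soft = VotingClassifier(estimators=[('lr', model_one), ('dt', model_two)], voting='soft')
model_soft.fit(x_train, y_train)
model_soft.score(x_test, y_test)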
# For performing Averaging in the Ensemble Model, we take the three predictions from separate models and calculate their average
final_prediction=(prediction_one + prediction_two + prediction_three) / 3
# By a similar approach, for weighted averaging we assign each model's result a weight (the weights should sum to 1) before combining
final_prediction=(prediction_one * 0.3 + prediction_two * 0.4 + prediction_three * 0.3)
Ensemble Modelling is useful in regression analysis because it combines the results of multiple models into a single, typically more reliable, prediction.
Ensemble Representation:
Ensemble techniques use a combination of multiple training algorithms and their respective predictions to form the best possible projection of the test data.
Regression in Machine Learning Algorithms
Before building a regressor in Python, let us understand the two main kinds of regressors.
- Single Regressor: Linear or single regression works with two variables and fits a straight line to the data. It is used to establish a concrete relationship between the two variables: the algorithm aims to draw the straight line that comes closest to the data points, calculating the slope and intercept that define the line while minimizing error.
Using single regression in real life: When predicting soil erosion, external factors such as temperature and animal traversal have only a minimal effect, whereas rainfall has a direct relationship with erosion. Predicting the level of soil erosion from rainfall alone is therefore an example of single regression.
- Multivariable Regressor: Multivariable or multivariate regression is the extension of single regression to use cases with more than one independent variable. The relationship can be linear or non-linear. Multiple regression rests on the assumption that a relationship exists between the dependent variable and the independent variables.
Using multiple regression in real life: In medicinal science, predicting the variance of blood pressure in patients is important for understanding diverse diseases. Predicting blood pressure is an example of multiple regression: take BP as the dependent variable for a sample population and regress it on all the independent variables that affect its outcome, such as height, weight, age, sex, and hours of exercise per week, as sketched below.
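A minimal sketch of the blood-pressure example with made-up numbers (all column names and values below are hypothetical):
# Regress systolic blood pressure on several independent variables
import pandas as pd
from sklearn.linear_model import LinearRegression
sample = pd.DataFrame({'height_cm': [160, 172, 181, 168, 175],
                       'weight_kg': [60, 80, 95, 72, 88],
                       'age': [34, 45, 52, 29, 61],
                       'exercise_hrs': [5, 2, 1, 6, 0],
                       'bp_systolic': [112, 126, 139, 110, 142]})
X = sample.drop(columns='bp_systolic')
y = sample['bp_systolic']
bp_model = LinearRegression().fit(X, y)
print(bp_model.coef_)  # one slope per independent variable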
Implementation of Regression in Python
To build a regressor in Python, we will use the scikit-learn library that is used for Machine Learning. Here are some important functions and methods we need to understand before moving onto building a regressor.
Single Linear Regression:
# We start with Single Linear Regression
# The aim of this model is to predict weights of various fishes in a market
# The dataset used in this case is Kaggle's Fish Market Dataset
# Link to data - (https://www.kaggle.com/burhanuddinraja/best-fish-from-the-data)
# Import the required Libraries to Perform Linear Regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import the CSV data into the workspace
input_data = pd.read_csv('Fish.csv')
# Exploring the dataset
input_data.head()
input_data.info()
input_data.describe()
>>>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Species 159 non-null object
1 Weight 159 non-null float64
2 Length1 159 non-null float64
3 Length2 159 non-null float64
4 Length3 159 non-null float64
5 Height 159 non-null float64
6 Width 159 non-null float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB
>>>
Weight Length1 Length2 Length3 Height Width
count 159.000000 159.000000 159.000000 159.000000 159.000000 159.000000
mean 398.326415 26.247170 28.415723 31.227044 8.970994 4.417486
std 357.978317 9.996441 10.716328 11.610246 4.286208 1.685804
min 0.000000 7.500000 8.400000 8.800000 1.728400 1.047600
25% 120.000000 19.050000 21.000000 23.150000 5.944800 3.385650
50% 273.000000 25.200000 27.300000 29.400000 7.786000 4.248500
75% 650.000000 32.700000 35.500000 39.650000 12.365900 5.584500
max 1650.000000 59.000000 63.400000 68.000000 18.957000 8.142000
input_data=input_data[['Species','Length1','Length2','Length3','Height','Width','Weight']]
input_data.head()
>>>
Species Length1 Length2 Length3 Height Width Weight
0 Bream 23.2 25.4 30.0 11.5200 4.0200 242.0
1 Bream 24.0 26.3 31.2 12.4800 4.3056 290.0
2 Bream 23.9 26.5 31.1 12.3778 4.6961 340.0
3 Bream 26.3 29.0 33.5 12.7300 4.4555 363.0
4 Bream 26.5 29.0 34.0 12.4440 5.1340 430.0
# Separate the feature columns (X) and the target column (Y) from the dataset
x_axis = input_data.iloc[:,:-1]
y_axis = input_data.iloc[:,-1]
# Import OneHotEncoder from sklearn to encode the categorical Species column numerically
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
col_trans = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough')
x_axis = np.array(col_trans.fit_transform(x_axis))
print(x_axis)
# Using train_test_split divide the data into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_axis, y_axis, test_size=0.33, random_state = 2)
# Import the linear Regression module and apply the model to the training data
from sklearn.linear_model import LinearRegression
lin_regressor = LinearRegression()
lin_regressor.fit(x_train, y_train)
>>> LinearRegression()
# Predict the values of y from the test data using the trained model
# Below are the predicted weights of the fish
y_pred = lin_regressor.predict(x_test)
y_pred
>>>
array([62.62, 53.41, 7.73, 61.65, 50.09, 65.8 , 52.7 , 97.17, 90.23,
67.31, 97.52, 82.12, 72.39, 32.28, 31.12, 35.78, 58.51, 76.83,
76.07, 58.52, 25.79, 7.95, 41.28, 26.68, 79.01, 59.79, 33.98,
31.34, 35.74, 50.51, 96.57, 68.57, 92.2 , 64.9 , 32.69, 42.17,
86.01, 69.66, 31.41, 41.04, 7.4 , 29.87, 48.52, 22.6 , 8.38,
74.9 , 79.01, 66.9 , 48.02, 82.43, 30.17, 46.04, 78.89])
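# Before plotting, quantify the fit with scikit-learn's standard regression metrics
from sklearn.metrics import mean_absolute_error, r2_score
print('R^2:', r2_score(y_test, y_pred))
print('MAE:', mean_absolute_error(y_test, y_pred))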
# Plot the test data and the fitted line against the first feature column
plt.scatter(x_test[:,0], y_test, color='blue')
plt.plot(x_test[:,0], y_pred, color='green', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
>>> [Plot: scatter of the test data with the fitted regression line]
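Since the first feature column of x_test is a one-hot indicator after encoding, a plot of predicted against actual weights can be easier to read. A minimal sketch:
# Alternative view: predicted weight versus actual weight
plt.scatter(y_test, y_pred, color='blue')
# Points on the reference line are perfect predictions
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='green', linewidth=2)
plt.xlabel('Actual weight')
plt.ylabel('Predicted weight')
plt.show()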
This chapter continues in part two.