Multivariable Regression
Regression performed with multiple independent variables is known as multivariable regression.
The image above shows a representation of a multivariable regression analysis used to predict house prices from the external factors that affect prices in an area, along with two constant coefficients.
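To make the idea concrete, the following is a minimal sketch of a multivariable linear regression in scikit-learn. The house features (area, bedrooms, age) and all numbers are invented for illustration and are not the dataset from the image.
# A minimal multivariable regression sketch (all data invented for illustration)
import numpy as np
from sklearn.linear_model import LinearRegression
# Hypothetical house features: [area in sq. ft., bedrooms, age in years]
X = np.array([[1400, 3, 10],
              [1600, 3, 5],
              [1700, 4, 12],
              [1875, 4, 2],
              [2350, 5, 8]])
y = np.array([245, 312, 279, 408, 450])   # hypothetical prices in thousands
model = LinearRegression()
model.fit(X, y)
print("Coefficients:", model.coef_)       # one coefficient per independent variable
print("Intercept:", model.intercept_)     # the constant term of the model
print("Predicted price:", model.predict([[2000, 4, 6]]))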
# The same dataset is now used to perform Multivariable Polynomial Regression
# Import the required Libraries to Perform Linear Regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import Linear Regression and Polynomial Features
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Load the fish-measurement dataset used earlier in the chapter (the file name here is assumed)
input_data = pd.read_csv("Fish.csv")
input_data = input_data[['Species','Length1','Length2','Length3','Height','Width','Weight']]
y_axis = input_data.Weight.values.reshape(-1,1)
x_axis = input_data.Width.values.reshape(-1,1)
plt.scatter(x_axis,y_axis)
plt.ylabel("weight of fish in Gram")
plt.xlabel("diagonal width in cm")
>>>
Text(0.5, 0, 'Fish Width (cm)')
# Fitting the Linear Regression Model
lin_reg = LinearRegression()
lin_reg.fit(x_axis,y_axis)
# Checking predictions from the model
y_head = lin_reg.predict(x_axis)
plt.plot(x_axis,y_head, color="red", label="linear")
plt.show()
print("Fish in the 750 Gm Weight ", lin_reg.predict([[750]]))
>>> Predicted weight for a width of 750: [[140753.16]]
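A width of 750 cm is not physically plausible, which explains the enormous predicted weight; as a sanity check, the model can be queried at a realistic width instead. The 5.5 cm value below is an illustrative assumption, and the printed weight depends on the fitted data.
# Querying the model at a realistic width (5.5 cm is an illustrative value)
print("Predicted weight (g) for a 5.5 cm wide fish:", lin_reg.predict([[5.5]]))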
# Preprocessing: expanding the width feature into polynomial terms
# A second-degree polynomial maps each width x to the terms [1, x, x^2]
poly_reg = PolynomialFeatures(degree = 2)
x_poly = poly_reg.fit_transform(x_axis)
lin_reg_two = LinearRegression()
lin_reg_two.fit(x_poly,y_axis)
y_head_two = lin_reg_two.predict(x_poly)
plt.plot(x_axis,y_head_two,color="blue",label="Polynomial Distribution")
plt.legend()
plt.show()
# The plot shows the polynomial curve fitted over the width values
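One caveat: plt.plot connects points in the order they appear in x_axis, so an unsorted feature column can render the curve as a jagged scribble. A minimal sketch of a fix, reusing the arrays defined above, sorts by width before plotting; the coefficient printout is likewise just an inspection aid.
# Sort by width so the fitted curve is drawn smoothly (reuses x_axis and y_head_two from above)
order = x_axis.flatten().argsort()
plt.plot(x_axis[order], y_head_two[order], color="blue", label="Polynomial Distribution")
plt.legend()
plt.show()
# Inspecting the fitted second-degree polynomial: weight ~ b0 + b1*x + b2*x^2
print("Intercept:", lin_reg_two.intercept_)
print("Coefficients:", lin_reg_two.coef_)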
# Printing the predicted values against the actual values in the training dataset
for i in range(len(y_axis)):
    print(y_axis[i], y_head_two[i])
>>>
[242.] [266.4]
[290.] [316.99]
[340.] [391.83]
[363.] [344.95]
[430.] [483.53]
[450.] [439.24]
[500.] [515.6]
[390.] [390.61]
[450.] [421.84]
[500.] [445.98]
[475.] [477.03]
[500.] [415.84]
[500.] [328.51]
[340.] [470.22]
[600.] [491.61]
[600.] [585.38]
[700.] [517.15]
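Rather than printing row by row, the actual and predicted weights can also be collected side by side in a DataFrame; this is a small convenience sketch that reuses the pandas import and arrays from above.
# Collecting actual vs. predicted weights in one table for easier inspection
comparison = pd.DataFrame({"actual_weight": y_axis.flatten(),
                           "predicted_weight": y_head_two.flatten()})
print(comparison.head(10))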
# Score measurement is discussed in the next section
Performance of Regression Algorithms
Now we can take a look at the typical evaluation metrics for regression algorithms.
- scikit-learn Metrics: The sklearn.metrics module offers numerous functions for measuring the accuracy of trained models across clustering, regression, and classification. scikit-learn exposes three categories of evaluation APIs: the Estimator score method, where each estimator defines a default evaluation criterion and returns a score for its predictions; the scoring parameter, used by tools such as cross-validation to quantify how close the model comes to the expected answers; and metric functions, which implement assessments for specific purposes.
- Mean Absolute Error: The MAE is the average of the absolute differences between predicted and observed values across all observations. Lower MAE values indicate a better fit of the training model (see the computation sketch after this list).
- Mean Squared Error: The MSE is the average of the squared differences between the estimated values and the actual values. There is generally no ideal MSE value, although the closer it is to 0, the better the model is considered; an MSE of 0 corresponds to a perfect model.
- Explained Variance Score: Measures the proportion of the total dispersion (variance) in the data that is captured by the predictions. The complementary part that completes the total variation is called the unexplained or residual variance. A higher explained-variance score indicates a stronger association.
- R2 Score (Coefficient of Determination): The R-squared score is the fraction of the variance in the dependent variable that is predictable from the independent variables. A high R-squared value (close to 1) generally represents a good model, while a low R-squared value denotes a model that does not fit the data well.
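To ground these definitions, the following sketch computes each metric by hand with NumPy on a small invented pair of arrays; the numbers are illustrative only.
# Manual computation of the metrics above (illustrative values only)
import numpy as np
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])
mae = np.mean(np.abs(y_true - y_pred))              # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)               # Mean Squared Error
ev = 1 - np.var(y_true - y_pred) / np.var(y_true)   # Explained Variance Score
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # R2 Score
print(mae, mse, ev, r2)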
Score Evaluation for Single Regressor:
# Checking the results of the prediction model
# y_test and y_pred here come from the single-variable linear regressor trained earlier
import sklearn.metrics as sm
print("Measuring the Performance of this Linear regressor:")
print("The Mean absolute error is", round(sm.mean_absolute_error(y_test, y_pred), 2))
print("The Mean squared error is", round(sm.mean_squared_error(y_test, y_pred), 2))
print("The Median absolute error is", round(sm.median_absolute_error(y_test, y_pred), 2))
print("The Explain variance score is", round(sm.explained_variance_score(y_test, y_pred), 2))
print("The R2 score is", round(sm.r2_score(y_test, y_pred), 2))
>>> Measuring the Performance of this Linear regressor:
>>> The Mean absolute error is 2.43
>>> The Mean squared error is 9.47
>>> The Median absolute error is 2.22
>>> The Explained variance score is 0.99
>>> The R2 score is 0.98
Score Evaluation for Multivariable Regressor:
print("Measuring the Performance of this Linear regressor:")
print("The Mean absolute error is", round(sm.mean_absolute_error(y_axis, y_head_two), 2))
print("The Mean squared error is", round(sm.mean_squared_error(y_axis, y_head_two), 2))
print("The Median absolute error is", round(sm.median_absolute_error(y_axis, y_head_two), 2))
print("The Explain variance score is", round(sm.explained_variance_score(y_axis, y_head_two), 2))
print("The R2 score is", round(sm.r2_score(y_axis, y_head_two), 2))
>>> Measuring the Performance of this Polynomial Regressor:
>>> The Mean absolute error is 91.77
>>> The Mean squared error is 21967.53
>>> The Median absolute error is 57.4
>>> The Explained variance score is 0.83
>>> The R2 score is 0.83
Choosing between Regression and Classification for ML Problems
Now that we know the functionalities and uses of classification, clustering, and regression, it is time to understand which machine learning algorithm should be used for which problem. Regression models are used to predict a continuous variable, for instance the sales made on a day or the forecast temperature of a city. These values are continuous and generally depend on a set of independent variables. Regression fits a function relating the dependent variable to the independent variables: a straight line in the linear case, a curve in the polynomial case.
Discrete variable prediction is handled by classification algorithms. Classification algorithms like Decision Trees are used to segregate data into known groups; they are thus a form of supervised machine learning. Decision Trees are also called "Eager Learners" because they build a model from the labeled training data up front, before any new observations arrive, rather than deferring the work to prediction time.
Clustering algorithms, by contrast, do much the same grouping that classification does, but they do it on unlabeled data: groups are discovered from the similarities and characteristics of the data points themselves rather than from predefined labels. Because clustering is an unsupervised learning mechanism, it cannot be used to directly predict results; its typical use cases lie in finding similarities between given data points and grouping them accordingly. For a compact contrast of all three problem types, see the sketch below.
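The sketch fits a regressor, a classifier, and a clusterer on tiny invented arrays; all data and labels are made up purely to contrast the three problem types.
# One example per problem type (all data below is made up for illustration)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
# Regression: predict a continuous target
y_continuous = np.array([1.9, 4.1, 6.2, 7.9, 10.1, 12.0])
print(LinearRegression().fit(X, y_continuous).predict([[7.0]]))
# Classification: predict a discrete label (supervised, labels known)
y_labels = np.array([0, 0, 0, 1, 1, 1])
print(DecisionTreeClassifier().fit(X, y_labels).predict([[7.0]]))
# Clustering: group the points without any labels (unsupervised)
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))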
Coming Up Next
With regression, we have now covered three algorithm categories that are primarily used for performing predictions and grouping. Although these algorithms have distinct use cases, it is always considered good practice to experiment with different algorithms on the data to find the best fit for any given dataset. In the next chapter we will concentrate on analyzing time-based sequential data and performing predictions on time series:
- Understanding Time Series and Sequential Data
- Forecasting Time Series
- Studying trends and seasonality in data
- Implementing Time Series Forecast models in Python