Learning Linear Regression & Machine Learning by Taking Baby Steps

Arun Ojha
Artificial Intelligence in Plain English
6 min read · Feb 10, 2021


This article is for babies wanting to understand the power of Machine Learning, so I will not be explaining the complex math (not really!!) behind Linear Regression. As long as you follow along with the code shown below in RStudio (the concept is similar if you prefer Python), I assure you that after reading this article you will be able to have a meaningful conversation about ML. A hands-on approach is actually the simplest way to learn ML. Plus, if you are already looking up this article, I would imagine you have some basic familiarity with linear regression (if not, click here). I know… I know… it's only good for forecasting and finding cause-and-effect type stuff. However, down the road this very basic concept can be used in many different ways. Linear regression was chosen to explain ML here precisely because of its simplicity: it is normally the first concept ML enthusiasts try to understand before stepping into the broader universe of complex algorithms. Mastering this basic concept builds the knowledge base required to implement many other sophisticated classification and neural network algorithms in the future.

Now straight to business: like I said, I'm not going to explain what ML or linear regression is. Instead, I will give you a simple hands-on walkthrough of how to use linear regression to build an ML model, use it for prediction, and interpret the results by analyzing the model. To get started, follow the simple steps outlined below; I will try to explain each step in a very simple way. We will be performing the following tasks:

  • Load the data
  • Split the data (TRAIN and TEST)
  • Create Machine Learning Linear Model
  • Check p-values and F-statistics of the model
  • Predict using TEST data
  • Verify prediction

1. Load the Data

Let's use the iris data set; it is built into R, so for the ease of this experiment we will use it:

df <- iris  # Load iris into the df data frame
head(df)    # Check the first few rows of the data

The step below is optional; however, I will be performing it, since it helps me give a clear explanation of what we are trying to do in this experiment (when you read to the end, it will be clearer why I removed the 5th column). Plus, did you notice that the 5th column, called “Species”, is non-numeric? Linear regression models are only good for predicting numeric values and are not used for classification purposes. Therefore, I will drop the 5th column from our dataset. Like I said, though, this step is completely optional, because you could simply leave the 5th column out of the formula when creating the linear model.

df <- df[,-5]  # Drop the 5th column (Species)
head(df)
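If you'd rather not rely on knowing that Species sits in column 5, here is a small optional sketch (my addition, not part of the original steps) that drops non-numeric columns by type instead of by position:

```r
# Inspect each column's class to confirm which ones are non-numeric
sapply(iris, class)

# Keep only the numeric columns, whatever position they happen to be in
df <- iris[, sapply(iris, is.numeric)]
head(df)
```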

2. Split the Data

Now the second step is to split the data into training data and test data. The training data will be used to train the model, and the test data will later be used to make predictions from the model created during this process. We will split the data in random order, keeping 70% of the data to train the model; the remaining 30% will be used to test it. After that, we will compare the output to the original values. (I know… I know… let's get back to the action!!)

split_data <- sort(sample(nrow(df), nrow(df) * 0.7))
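One caveat: sample() draws different rows every time you run it, so your numbers will differ slightly from mine. If you want a reproducible split, you can seed the random number generator first (an optional addition on my part, not in the original steps):

```r
# Seeding the RNG makes the random 70/30 split repeatable
set.seed(42)  # 42 is an arbitrary choice; any fixed number works
split_data <- sort(sample(nrow(iris), nrow(iris) * 0.7))
length(split_data)  # 105 indices, i.e. 70% of the 150 rows in iris
```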

3. Training and Test Data

Now that we have decided how to split the data, let's go ahead and create the training and test datasets:

train <- df[split_data,]
test <- df[-split_data,]
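A quick sanity check (my own addition) confirms the split worked as intended, with the two pieces adding back up to the full data set:

```r
df <- iris[, -5]
split_data <- sort(sample(nrow(df), nrow(df) * 0.7))
train <- df[split_data, ]
test  <- df[-split_data, ]

nrow(train)                # 105 rows (70%)
nrow(test)                 # 45 rows (30%)
nrow(train) + nrow(test)   # 150, the size of the original data
```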

4. Create Machine Learning Linear Model

This is the fancy part of the linear regression machine learning model. For this experiment, we will try to predict Petal Width (the dependent Y variable) using Sepal Length, Sepal Width, and Petal Length as the independent X variables. The second code line gets the summary of the model we just created.

# Creating a machine learning model
model <- lm(Petal.Width ~ Sepal.Length + Sepal.Width +
            Petal.Length, data = train)
# Summarizing the model
summary(model)
Summary of the model

Creating the model is the easy part; interpreting its results is a different thing.

The first thing you should check after creating any linear model is the p-value of the F-statistic (see 1 in the image). This value should be less than 0.05; it determines whether the model we just created is statistically significant. Models with a p-value above 0.05 are considered statistically insignificant.

The second thing to look for is the R-squared value (see 2 in the image). It tells you how much of the variability in the Y variable (in our case, Petal Width) is explained by the X variables. In our case it explains ~95%.

The third thing to check is the statistical significance of the coefficient of each variable (see 3 in the image). As in the first case, we want the p-value to be less than 0.05; if it is, the coefficient is statistically significant, and we can explain what its value means in the real world. If the p-value is larger than 0.05, the coefficient is considered statistically insignificant and should not be interpreted.

The fourth step is to interpret the coefficient values (see 4 in the image). For Sepal Length, the coefficient is -0.10184. This means that for every unit increase in Sepal Length, Petal Width (our dependent variable) decreases by 0.10184 (notice they have an inverse relationship). Similarly, looking at the coefficient of Sepal Width, we can say that for every unit increase in Sepal Width, Petal Width increases by 0.11909 (a direct relationship). To conclude: the sign of a coefficient determines the direction of the change, while its magnitude tells you how much the dependent variable changes per unit change in the independent variable. And so on and so forth…
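If you prefer to read these quantities programmatically rather than off the printed summary, R exposes them directly. Shown here on the full iris data for a self-contained sketch, so the exact values will differ slightly from the train-split summary in the image:

```r
df <- iris[, -5]
model <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
            data = df)

coef(model)               # the intercept plus one coefficient per predictor
summary(model)$r.squared  # the R-squared value discussed above
confint(model)            # 95% confidence intervals for each coefficient
```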

5. Predict Using Test Data

Now that we have created the model using the training data, it's time to check how good the model is at prediction. To do so, we will use the test dataset we created back in step 3. If you look at the code, what we are doing is running the model we just created against the test data, then storing the predicted values of Petal Width in a variable called Pred_Petal.Width.

Pred_Petal.Width <- predict(model, test)

Wow!! We are almost done. Now let's combine the predicted values with the original test dataset, so that in the final step we can compare them and check how good a job our model did. Since all the values in our dataset are rounded to one decimal place, we will round the predicted values the same way.

Pred_Petal.Width <- round(Pred_Petal.Width, 1)  # Round predicted values
compare <- cbind(test, Pred_Petal.Width)
head(compare)
Comparing model predicted value against actual value

As you can see from the table above (1 = actual value, 2 = predicted value), our model did a pretty good job of predicting Petal Width.
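Eyeballing the table is fine for a first pass, but you can also score the model numerically. A common choice (my addition, not in the original article) is root mean squared error alongside the correlation between predicted and actual values. The set.seed() call is there only to make the sketch reproducible:

```r
df <- iris[, -5]
set.seed(42)
split_data <- sort(sample(nrow(df), nrow(df) * 0.7))
train <- df[split_data, ]
test  <- df[-split_data, ]

model <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
            data = train)
pred <- predict(model, test)

# Root mean squared error: the typical size of a prediction error
sqrt(mean((test$Petal.Width - pred)^2))
# Correlation between predicted and actual values (closer to 1 is better)
cor(test$Petal.Width, pred)
```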

Conclusion

Hurray!!! You just created your first machine learning model using linear regression and interpreted it statistically. You also verified the model by comparing the predicted values with the actual test data. I hope this encourages you to dive into the details of other ML algorithms in the future. Good luck!
