OK, so we have fitted a model to our data. But how do we check how well it actually fits?

You might wonder: didn't we already do that with the residual plots?

Well, residual plots only tell us whether a linear model was an appropriate choice for our data. They don't tell us how accurately the model represents the data. There is a specific statistic for that.

Let's discuss it.

R-squared: You might have seen this value come up when we added a trendline in Excel. It also appears in the summary of every linear model we created in R with the lm() function. It tells you what fraction of the variation in the dependent variable is explained by the model. For instance, a value close to 1 indicates that the model explains almost all of that variation, while a value close to 0 tells you that the model doesn't represent the data very well.
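To make this concrete, here is a small sketch of where the value lives in R and how it can be reproduced by hand as 1 - RSS/TSS (residual sum of squares over total sum of squares). It uses R's built-in mtcars data and the variables mpg and wt purely as stand-ins; they are assumptions for illustration, not the data from our assignments.

```r
# Illustrative data only: mtcars ships with R and is NOT the data used in this tutorial.
fit <- lm(mpg ~ wt, data = mtcars)

# R-squared as reported in the model summary ("Multiple R-squared")
r2_reported <- summary(fit)$r.squared

# The same value computed by hand:
# RSS = variation left unexplained by the model
# TSS = total variation of the dependent variable around its mean
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - rss / tss

print(c(reported = r2_reported, manual = r2_manual))
```

The two numbers should agree, which shows that R-squared is just the share of the total variation that the model accounts for.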


When we were looking for the best model with two independent variables in the last Assignment, we had a lot of combinations to choose from. For instance, we could have selected "x" and "b", or "x" and "c". So why did we choose only "a" and "c"? Because that model had the highest R-squared value. We didn't have to calculate R-squared one by one for each combination. The regsubsets() function did that for us and returned the variables for which R-squared was highest.
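To see what this saves us, here is a rough sketch of the manual version: fit every two-variable model and compare the R-squared values ourselves. It uses only base R, again with the built-in mtcars data, and the candidate variables (wt, hp, disp, qsec) are assumptions chosen for illustration rather than the variables from the Assignment.

```r
# Illustrative data only: mtcars and these candidate variables are assumptions.
candidates <- c("wt", "hp", "disp", "qsec")

# Every possible pair of candidate variables
pairs <- combn(candidates, 2, simplify = FALSE)

# Fit one linear model per pair and collect its R-squared
r2 <- sapply(pairs, function(vars) {
  f <- reformulate(vars, response = "mpg")   # e.g. mpg ~ wt + hp
  summary(lm(f, data = mtcars))$r.squared
})
names(r2) <- sapply(pairs, paste, collapse = " + ")

# Rank the pairs and pick the one with the highest R-squared
print(sort(r2, decreasing = TRUE))
best <- names(r2)[which.max(r2)]
print(best)
```

With four candidates there are only six pairs, but the number of combinations grows quickly as candidates are added, which is why letting a function do the search is convenient.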