A simple linear regression model that describes the relationship between two variables x and y can be expressed by the following equation.
y = α + βx + ε
If we choose the parameters α and β in the simple linear regression model so as to minimize the sum of squares of the error term ε, we obtain the so-called estimated simple regression equation, where a and b are the least-squares estimates of α and β. It allows us to compute fitted values of y based on values of x.
ŷ = a + bx
Suppose we have a data set of heights and weights, stored in a data frame named a:
> a = data.frame(wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210), ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75))
Correlation and Covariance
Let's check the correlation and covariance of the two variables:
> cor(a$ht, a$wt)
[1] 0.9368622
> cov(a$ht, a$wt)
[1] 145.5556
Let's draw a scatter plot of weight against height.
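A minimal sketch of the scatter plot, using base R graphics. The axis labels and title are illustrative choices, not taken from the original figure.

```r
# Height/weight data from the example, stored in a data frame named `a`
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)

# Predictor (height) on the x-axis, criterion (weight) on the y-axis
plot(a$ht, a$wt,
     xlab = "Height", ylab = "Weight",
     main = "Weight vs. height", pch = 19)
```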
Applying the Regression
In simple linear regression, we predict scores on one variable from the scores on a second variable. The variable we are predicting is called the criterion (dependent) variable; here it is wt. The variable we base our predictions on is called the predictor (independent) variable; here it is ht. When there is only one predictor variable, the prediction method is called simple regression. In R, we fit the model with the lm() function. In the formula, the criterion (dependent) variable goes on the left side of ~ and the predictor (independent) variable goes on the right.
> fit = lm(a$wt ~ a$ht)
> fit

Call:
lm(formula = a$wt ~ a$ht)

Coefficients:
(Intercept)         a$ht
   -316.862        6.968
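As a sanity check, the coefficients reported by lm() can be reproduced from the closed-form least-squares formulas: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept follows from the means. The sketch below assumes the data frame is named a and uses the formula wt ~ ht with data = a, which gives the same fit as a$wt ~ a$ht but with shorter coefficient names. The names b_hat and a_hat are illustrative.

```r
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)

# Closed-form least-squares estimates
b_hat <- cov(a$ht, a$wt) / var(a$ht)      # slope
a_hat <- mean(a$wt) - b_hat * mean(a$ht)  # intercept

# Same model fitted with lm(), for comparison
fit <- lm(wt ~ ht, data = a)

c(intercept = a_hat, slope = b_hat)
coef(fit)
```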
Let's try to understand the summary of the linear model, section by section.
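The report discussed in the following sections comes from calling summary() on the fitted model. A minimal sketch, assuming the same data frame a as before; the object name s is illustrative.

```r
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)
fit <- lm(wt ~ ht, data = a)

s <- summary(fit)
s         # prints residual quartiles, coefficient table, R-squared, F-statistic

# The coefficient table on its own, as a numeric matrix
coef(s)
```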
Residuals
The residuals are the differences between the actual values of the variable you're predicting and the values predicted by your regression: y – ŷ. For most regressions you want the residuals to look roughly normally distributed when plotted. Residuals that are centered on 0 and approximately normal indicate that when we miss, we miss both above and below the actual value, and that large misses are less likely than small ones.
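The residuals can be pulled out of the fitted model and inspected directly. This is a small sketch under the same assumptions as before (data frame a, formula wt ~ ht); res is an illustrative name.

```r
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)
fit <- lm(wt ~ ht, data = a)

# Residual = actual value minus fitted value
res <- residuals(fit)
all.equal(unname(res), a$wt - unname(fitted(fit)))

# Min, quartiles, median and max, as in the top block of summary(fit)
summary(res)

# With an intercept in the model, least-squares residuals sum to
# (numerically) zero
sum(res)
```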
Significance Stars
The stars are shorthand for significance levels, with the number of asterisks determined by the computed p-value: *** for high significance and * for low significance. In this case, *** indicates that it's unlikely that no relationship exists between weight and height.
Estimated Coefficient
The estimated coefficient is the value of the intercept or slope calculated by the regression. Here the regression equation is weight = -316.8617 + 6.9681 * height. Usually we test whether the slope is zero, because if it is, the model is not much use.
Standard Error of the Coefficient Estimate
A measure of the variability in the estimate of the coefficient. Lower is better, but this number is relative to the value of the coefficient. As a rule of thumb, you'd like this value to be at least an order of magnitude smaller than the coefficient estimate.
In our example, the standard error of the ht coefficient is 0.92, which is about 7.6 times smaller than the coefficient estimate of 6.97, so it falls a little short of that rule of thumb.
t-value of the Coefficient Estimate
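A score that measures whether the coefficient for this variable is meaningful for the model. It is the coefficient estimate divided by its standard error, so values far from zero indicate an estimate that is large relative to its uncertainty. For ht this is 6.968 / 0.92, or about 7.6.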
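The standard error and t-value can be read from the coefficient table and checked by hand. A minimal sketch, assuming the data frame a and the formula wt ~ ht; tab, se and t_val are illustrative names.

```r
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)
fit <- lm(wt ~ ht, data = a)
tab <- coef(summary(fit))

# Standard error of the slope, as reported by summary()
se <- tab["ht", "Std. Error"]

# t-value = estimate / standard error
t_val <- tab["ht", "Estimate"] / se

c(std_error = se, t_value = t_val, reported_t = tab["ht", "t value"])
```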
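Variable p-value
The probability of seeing a t-value at least this extreme if the true coefficient were zero, that is, if the variable had no relationship with the response. You want this number to be as small as possible. If the number is really small, R will display it in scientific notation.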
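The p-value in the table is a two-sided tail probability of the t distribution with the residual degrees of freedom. The sketch below recomputes it from the t-value, under the same assumptions as the earlier examples; p_val is an illustrative name.

```r
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)
fit <- lm(wt ~ ht, data = a)
tab <- coef(summary(fit))

t_val <- tab["ht", "t value"]
df    <- df.residual(fit)          # residual degrees of freedom

# Two-sided p-value: probability of |t| at least this large under a zero slope
p_val <- 2 * pt(-abs(t_val), df)

c(computed = p_val, reported = tab["ht", "Pr(>|t|)"])
```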
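Significance Legend
The more punctuation there is next to your variables, the smaller the p-value and the better.
Blank = bad, dots = pretty good, stars = good, more stars = very good. R's legend reads: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1.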
Residual Std Error / Degrees of Freedom
The Residual Std Error is essentially the standard deviation of your residuals, computed with the residual degrees of freedom in the denominator. You'd like this number to be consistent with the quartiles of the residuals reported at the top of the summary. For normally distributed residuals, the 1st and 3rd quartiles should fall at roughly -0.67 and +0.67 times the residual standard error.
The Degrees of Freedom is the difference between the number of observations in your sample and the number of coefficients estimated by your model (the intercept counts as one). Here that is 10 - 2 = 8.
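Both quantities can be recomputed from the residuals. A small sketch, again assuming the data frame a and the formula wt ~ ht; rse and df are illustrative names.

```r
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)
fit <- lm(wt ~ ht, data = a)
res <- residuals(fit)

# Degrees of freedom: observations minus estimated coefficients
df <- nrow(a) - length(coef(fit))

# Residual standard error: root of residual sum of squares over df
rse <- sqrt(sum(res^2) / df)
c(rse = rse, reported = summary(fit)$sigma, df = df)

# Observed residual quartiles vs. what a normal distribution would suggest
quantile(res, c(0.25, 0.75))
qnorm(c(0.25, 0.75)) * rse
```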
R-squared
A metric for evaluating the goodness of fit of your model. Higher is better, with 1 being the best. It corresponds to the proportion of the variability in what you're predicting that is explained by the model. In this instance, about 88% of the variation in weight is explained by height (R-squared is roughly 0.878, the square of the correlation computed earlier).
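R-squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares, and in simple regression it equals the squared correlation. A minimal sketch under the same assumptions as above; rss, tss and r2 are illustrative names.

```r
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)
fit <- lm(wt ~ ht, data = a)

rss <- sum(residuals(fit)^2)           # variation left unexplained
tss <- sum((a$wt - mean(a$wt))^2)      # total variation in weight
r2  <- 1 - rss / tss

c(by_hand = r2,
  reported = summary(fit)$r.squared,
  cor_squared = cor(a$ht, a$wt)^2)
```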
F-statistic & resulting p-value
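Performs an F-test on the model. This takes our model (in our case with only one predictor) and compares it to a model with fewer parameters, here the intercept-only model. In theory the model with more parameters should fit better; the F-test checks whether the improvement is larger than chance alone would explain.
The DF, or degrees of freedom, reported with the F-statistic come in a pair. The first pertains to how many predictor variables are in the model: in our case there is one, so there is one degree of freedom. The second is the residual degrees of freedom, which is 8 here.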
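The comparison against the intercept-only model can be made explicit with anova(). This is a sketch, assuming the data frame a; fit0 is an illustrative name for the smaller model.

```r
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)
fit  <- lm(wt ~ ht, data = a)   # model with height as predictor
fit0 <- lm(wt ~ 1,  data = a)   # intercept-only model

# F-test comparing the two nested models
cmp <- anova(fit0, fit)
cmp

# The same F-statistic and its degrees of freedom from summary()
f <- summary(fit)$fstatistic
f
```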
Let's continue with our regression analysis.
Plot the Regression Model
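One way to draw the fitted line is to redraw the scatter plot and overlay the model with abline(). The sketch below assumes the data frame a; the labels are illustrative.

```r
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)
fit <- lm(wt ~ ht, data = a)

# Scatter plot of the data with the fitted regression line on top
plot(a$ht, a$wt, xlab = "Height", ylab = "Weight", pch = 19)
abline(fit)   # uses the intercept and slope stored in the model
```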
Regression through the origin
> fit2 = lm(a$wt ~ a$ht + 0)
Let's add this model to our scatter plot.
> abline(fit2, lty = "dotted")
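With no intercept, the least-squares slope has its own closed form: the sum of the products x*y divided by the sum of the squared x values. The sketch below checks the through-origin fit against that formula, assuming the data frame a and the equivalent formula wt ~ ht + 0; b0 is an illustrative name.

```r
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)

# Regression through the origin: "+ 0" removes the intercept
fit2 <- lm(wt ~ ht + 0, data = a)

# Closed-form slope for a no-intercept model
b0 <- sum(a$ht * a$wt) / sum(a$ht^2)

c(by_hand = b0, reported = unname(coef(fit2)))
```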
Let's try to predict the weight when the height is 100.
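A minimal sketch of the prediction with predict(). It assumes the model is fitted as wt ~ ht with data = a rather than a$wt ~ a$ht, because predict() matches the columns of newdata to the variable names in the formula, and a formula written with a$ht does not pick up a new ht value. Note that a height of 100 lies well outside the observed range of 61 to 75, so this is an extrapolation.

```r
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)
fit <- lm(wt ~ ht, data = a)

# New observation to predict for; the column name must match the formula
new_obs <- data.frame(ht = 100)

pred <- predict(fit, newdata = new_obs)
pred   # same as intercept + slope * 100
```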
Setting the interval argument of predict() specifies the computation of confidence or prediction (tolerance) intervals at the specified level, sometimes referred to as narrow vs. wide intervals.
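The two kinds of interval can be compared side by side. This sketch assumes the same model as before (wt ~ ht with data = a) and a 95% level; conf_int and pred_int are illustrative names.

```r
a <- data.frame(
  wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
  ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75)
)
fit <- lm(wt ~ ht, data = a)
new_obs <- data.frame(ht = 100)

# Confidence interval: uncertainty about the mean weight at this height
conf_int <- predict(fit, newdata = new_obs,
                    interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about a single new weight at this height
pred_int <- predict(fit, newdata = new_obs,
                    interval = "prediction", level = 0.95)

conf_int   # columns: fit, lwr, upr
pred_int   # same fit, wider bounds
```

The prediction interval is the wider of the two because it accounts for the scatter of individual observations around the line in addition to the uncertainty in the line itself.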