
Linear Regression – Relation and Prediction


A simple linear regression model that describes the relationship between two variables x and y can be expressed by the following equation.

y = α + βx + ε

If we choose the parameters α and β in the simple linear regression model so as to minimize the sum of squares of the error term ε, we get the so-called estimated simple regression equation. It allows us to compute fitted values of y based on values of x.

y = a + bx

Suppose we have a data set of heights and weights:

a = data.frame(wt = c(105, 120, 120, 160, 120, 145, 175, 160, 185, 210),
               ht = c(61, 62, 63, 65, 65, 68, 69, 70, 72, 75))

Correlation and Covariance

Let’s check the correlation and covariance.

> cor(a$ht, a$wt)
[1] 0.9368622
> cov(a$ht, a$wt)
[1] 145.5556
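
As a quick sanity check (a sketch; b0 and b1 are just illustrative names), the least-squares slope is the covariance divided by the variance of the predictor, and the intercept then follows from the means:

> b1 = cov(a$ht, a$wt) / var(a$ht)   # slope = cov(x, y) / var(x)
> b0 = mean(a$wt) - b1 * mean(a$ht)  # intercept = mean(y) - slope * mean(x)
> c(intercept = b0, slope = b1)

These should match the coefficients reported by lm() below.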

Let’s draw a scatter plot of the data.

> plot(a$ht, a$wt)

[Scatter plot of wt against ht]

Applying the Regression

In simple linear regression, we predict scores on one variable from the scores on a second variable. The variable we are predicting is called the criterion variable and is referred to here as wt. The variable we are basing our predictions on is called the predictor variable and is referred to here as ht. When there is only one predictor variable, the prediction method is called simple regression. We use the lm command for regression. The criterion (dependent) variable goes on the left side of ~ and the predictor (independent) variable goes on the right side.


> fit = lm(a$wt ~ a$ht)

> fit
Call:
lm(formula = a$wt ~ a$ht)

Coefficients:
(Intercept)        a$ht
-316.862          6.968

> summary(fit)

[Output of summary(fit)]

Let’s try to understand the summary of the linear model.

Residuals

[Residuals section of the summary output]

The residuals are the differences between the actual values of the variable you’re predicting and the values predicted by your regression, i.e. y − ŷ. For most regressions you want your residuals to look like a normal distribution when plotted. Normally distributed residuals indicate that the mean of the difference between our predictions and the actual values is close to 0 (good), that when we miss we are missing both above and below the actual value, and that the likelihood of a miss being far from the actual value gets smaller as the distance from the actual value gets larger.
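
One quick way to eyeball this (a sketch; the plots chosen are just one option) is to pull the residuals out of the fitted model:

> resid(fit)                              # the ten residuals, y minus y-hat
> hist(resid(fit))                        # rough look at the shape of the distribution
> qqnorm(resid(fit)); qqline(resid(fit))  # points close to the line suggest approximate normality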

Significance Stars

[Significance stars in the summary output]

The stars are shorthand for significance levels, with the number of asterisks determined by the p-value computed: *** for high significance and * for low significance. In this case, *** indicates that it is very unlikely that no relationship exists between weight and height.

Estimated Coefficient

[Coefficient estimates in the summary output]

The estimated coefficients are the intercept and slope calculated by the regression. The regression equation is weight = -316.8617 + 6.9681 * height. Usually we test whether the slope is zero, because if it is then the model is not of much use.
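
To illustrate (a sketch; the height of 70 is just an example value, not part of the original post), the fitted equation can be applied straight from the stored coefficients:

> coef(fit)                           # intercept and slope
> coef(fit)[1] + coef(fit)[2] * 70    # predicted weight for a height of 70 (about 171 here)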

Standard Error of the Coefficient Estimate

[Standard errors of the coefficient estimates in the summary output]

A measure of the variability in the estimate of the coefficient. Lower is better, but this number is relative to the value of the coefficient. As a rule of thumb, you’d like this value to be at least an order of magnitude less than the coefficient estimate.

In our example, the std error of the predictor variable ht is 0.92, which is roughly 7 times smaller than its coefficient estimate of 6.968.

t-value of the Coefficient Estimate

[t values in the summary output]

A score that measures whether the coefficient for this variable is meaningful for the model; it is the coefficient estimate divided by its standard error.
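
The full coefficient table can be pulled out of the summary object; a small sketch showing that the t value is just the estimate divided by its standard error (cs is an illustrative name):

> cs = summary(fit)$coefficients          # Estimate, Std. Error, t value, Pr(>|t|)
> cs[, "Estimate"] / cs[, "Std. Error"]   # reproduces the t value column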

Variable p-value

[p-values in the summary output]

Loosely, the probability that the variable is NOT relevant: more precisely, the probability of seeing an estimate at least this extreme if the true coefficient were zero. You want this number to be as small as possible. If the number is really small, R will display it in scientific notation.

Significance Legend

[Significance legend in the summary output]

The more punctuation there is next to your variables, the better.

Blank=bad, Dots=pretty good, Stars=good, More Stars=very good

Residual Std Error / Degrees of Freedom

[Residual standard error and degrees of freedom in the summary output]

The residual std error is just the standard deviation of your residuals. You’d like it to be consistent with the residual quantiles shown at the top of the summary; for roughly normal residuals, the 1st and 3rd quartiles should sit about 0.67 standard errors below and above zero.

The degrees of freedom is the difference between the number of observations in your training sample and the number of parameters estimated by your model (the intercept counts as a parameter). In our example that is 10 − 2 = 8.
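
Both numbers can be reproduced from the fitted object; a small sketch:

> df.residual(fit)                            # 10 observations - 2 parameters = 8
> sqrt(sum(resid(fit)^2) / df.residual(fit))  # residual standard error
> summary(fit)$sigma                          # the same value as reported in the summary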

R-Squared

[R-squared values in the summary output]

A metric for evaluating the goodness of fit of your model. Higher is better, with 1 being the best. It corresponds to the proportion of the variability in what you’re predicting that is explained by the model. In this instance, roughly 88% of the variability in weight is explained by height.
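
With a single predictor, R-squared is just the square of the correlation we computed earlier; a quick check:

> cor(a$ht, a$wt)^2        # 0.9368622^2, about 0.878
> summary(fit)$r.squared   # the R-squared reported in the summary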

F-statistic & resulting p-value

[F-statistic in the summary output]

Performs an F-test on the model. This compares our model (which has just one predictor) against a model with fewer parameters, here an intercept-only model. A model with more parameters will always fit at least as well, so the F-test asks whether the improvement in fit is larger than would be expected by chance.

The DF, or degrees of freedom, pertains to how many variables are in the model. In our case there is one variable so there is one degree of freedom.
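
The same comparison can be made explicitly by fitting an intercept-only model and comparing it to our model with anova(); a sketch (fit0 is just an illustrative name):

> fit0 = lm(a$wt ~ 1)   # intercept-only model: predicts the mean weight for everyone
> anova(fit0, fit)      # reproduces the F-statistic and p-value from summary(fit)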

Let’s continue with our regression analysis.

Plot the Regression Model


> abline(fit)

[Scatter plot with the fitted regression line added]

Regression through the origin

Adding + 0 (equivalently - 1) to the formula removes the intercept, forcing the fitted line to pass through the origin.

> fit2 = lm(a$wt ~ a$ht + 0)
> summary(fit2)

[Output of summary(fit2)]

Let’s add this model to our scatter plot.


> abline(fit2, lty = "dotted")

[Scatter plot with the original fitted line and the dotted through-the-origin line]

 Predictions

Let’s try to predict the weight when height is 100.

[Prediction output]

The interval argument specifies computation of confidence or prediction (tolerance) intervals at the specified level; the confidence interval (for the mean response) is the narrower of the two, and the prediction interval (for an individual observation) is the wider one.
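
A minimal sketch of that prediction step, assuming we refit the model with a data argument so that predict() can accept new data (fit3 and new_ht are just illustrative names):

> fit3 = lm(wt ~ ht, data = a)    # same model, written so predict() can use newdata
> new_ht = data.frame(ht = 100)
> predict(fit3, newdata = new_ht, interval = "confidence")   # narrower interval for the mean weight
> predict(fit3, newdata = new_ht, interval = "prediction")   # wider interval for an individual weight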

 