Additional Topics on Multiple Regression

 

These notes cover several additional considerations that are important for understanding multiple regression analysis.

 

 

1. Expected Value of Elements of the F-Ratio

Setting: comparing the bivariate (two-predictor) model to the null model.

When we compute the F-statistic, it is simply a sample estimate. It is useful to examine the theoretical components that enter into this F-statistic. To do this we examine the expectations of the MSR and MSE, that is, the population mean values of these two sample statistics, denoted E[MSR] and E[MSE] respectively.

 

$MSR = (N-1)[b_1^2 s_1^2 + b_2^2 s_2^2 + 2 b_1 b_2 s_{12}]/2$

$E[MSR] = \sigma_e^2 + (N-1)[b_1^2 s_1^2 + b_2^2 s_2^2 + 2 b_1 b_2 s_{12}]/2$

$= \sigma_e^2 + \sigma_p^2$ = error variance + systematic variance

 

$MSE = SSE/(N-3)$   (N − 3 degrees of freedom, because three parameters are estimated: the intercept and two slopes)

$E[MSE] = \sigma_e^2$

 

$F^* = MSR/MSE$

$E[MSR]/E[MSE] = (\sigma_e^2 + \sigma_p^2)/\sigma_e^2$

 

As can be seen from the above ratio, the numerator measures signal plus noise, and the denominator measures noise alone. Thus the ratio is expected to be greater than one when a systematic effect is present.
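To make the pieces concrete, here is a minimal numerical sketch (our own illustration, assuming numpy and a small simulated data set) that computes MSR, MSE, and F* for a two-predictor model:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x1 = rng.normal(size=N)
x2 = 0.5 * x1 + rng.normal(size=N)                   # correlated predictors
y = 1.0 + 0.8 * x1 + 0.3 * x2 + rng.normal(size=N)   # systematic effect present

X = np.column_stack([np.ones(N), x1, x2])            # design matrix with intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)            # least-squares estimates
yhat = X @ b

SSR = np.sum((yhat - y.mean()) ** 2)                 # regression sum of squares
SSE = np.sum((y - yhat) ** 2)                        # error sum of squares
MSR = SSR / 2                                        # 2 predictors -> 2 numerator df
MSE = SSE / (N - 3)                                  # N - 3 df (intercept + two slopes)
print(MSR / MSE)                                     # F*; well above 1 when signal is present
```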

 

2. Assumptions for the F-test

a) Observations are independent

Test by examining autocorrelations of residuals.

b) Y is normally distributed (at each fixed value of the predictors)

Test by examining a normal probability plot of the residuals.

c) Variance of Y is constant across different fixed values of $X_1$ and $X_2$ (homogeneity of variance).

Test by correlating the rank order of the predicted values with the rank order of the residual magnitudes. All three checks are sketched in the code below.
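A minimal sketch of these three diagnostics, assuming numpy/scipy; the residuals here come from a small simulated fit, and the lag-1 autocorrelation stands in for a fuller autocorrelation check:

```python
import numpy as np
from scipy import stats

# Fit a small two-predictor model so we have residuals to diagnose
# (simulated data; in practice use your own y, yhat, and residuals).
rng = np.random.default_rng(1)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
y = X @ np.array([1.0, 0.8, 0.3]) + rng.normal(size=N)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b
resid = y - yhat

# (a) Independence: lag-1 autocorrelation of residuals (should be near 0).
r_lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# (b) Normality: correlation between ordered residuals and normal quantiles,
#     the correlation behind a normal probability plot (should be near 1).
quantiles, ordered = stats.probplot(resid, dist="norm", fit=False)
r_normal = np.corrcoef(quantiles, ordered)[0, 1]

# (c) Homogeneous variance: rank correlation of predicted values with
#     residual magnitudes (should be near 0).
rho, pval = stats.spearmanr(yhat, np.abs(resid))

print(r_lag1, r_normal, rho)
```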

 

3. Additional Considerations

a) Check for outliers before you begin. These can greatly distort the estimates and inflate the MSE.

b) Nonlinearity. Plot the residuals as a function of each predictor to see if there are any systematic nonlinear relations between a predictor and the criterion.

c) Omission of an important predictor. If an important predictor is omitted, this will of course inflate the MSE. But it may also bias or distort the regression coefficients of the included predictors if they are correlated with the omitted predictor, as the simulation below illustrates.
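Point (c) is easy to see in a small simulation (our own sketch, assuming numpy): when X2 is omitted and X1 is correlated with X2, the coefficient estimated for X1 absorbs part of X2's effect.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000
x1 = rng.normal(size=N)
x2 = 0.7 * x1 + rng.normal(size=N)                 # x2 correlated with x1
y = 1.0 + 0.5 * x1 + 0.9 * x2 + rng.normal(size=N)

full = np.column_stack([np.ones(N), x1, x2])
b_full, *_ = np.linalg.lstsq(full, y, rcond=None)

reduced = np.column_stack([np.ones(N), x1])        # x2 omitted
b_reduced, *_ = np.linalg.lstsq(reduced, y, rcond=None)

print(b_full[1])     # about 0.5: the correct coefficient for x1
print(b_reduced[1])  # about 0.5 + 0.9 * 0.7 = 1.13: biased by the omitted x2
```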

 

4. Other Methods for Model Comparison

 

a)  Adjusted R-Square

Use for comparing models with different numbers of parameters.

Define $p_i$ = total number of parameters of model i (including the intercept)

$SSE_i$ = sum of squared errors from model i

Standard R-square for a model:

$R^2 = 1 - SSE_i/TSS$

 

Adjusted R-square for a model:

$s_y^2 = TSS/(N-1)$

$s_e^2 = SSE_i/(N-p_i)$   for model i

$R_a^2 = 1 - s_e^2/s_y^2 = 1 - (SSE_i/TSS)(N-1)/(N-p_i)$

 

This is interpreted as the $R^2$ expected if you use the sample estimates to predict the entire population.
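A minimal sketch of both formulas (assuming numpy; y and yhat are the observed and predicted criterion values, and p counts all parameters including the intercept):

```python
import numpy as np

def r_squared(y, yhat):
    """Standard R-square: 1 - SSE/TSS."""
    sse = np.sum((y - yhat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - sse / tss

def adjusted_r_squared(y, yhat, p):
    """Adjusted R-square: 1 - (SSE/TSS) * (N - 1)/(N - p)."""
    N = len(y)
    sse = np.sum((y - yhat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - (sse / tss) * (N - 1) / (N - p)
```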

 

 

b) Cross-Validation Method for Comparing Models

N = total sample size

Randomly divide the total sample into two subsamples:

$n_1$ observations in the first subsample and

$n_2$ observations in the second subsample, with

$n_1 + n_2 = N$

Stage 1: Estimate the coefficients of each model from the first subsample of $n_1$ observations.

Stage 2: Use those same coefficients from Stage 1 to make predictions in the second subsample. Compare the models' R-squares computed from these second-stage predictions.

The whole process can be repeated many times with different random splits, producing a sampling distribution of R-squares.
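One split of this procedure, as a sketch (assuming numpy; X is a design matrix with an intercept column and y the criterion, both hypothetical here):

```python
import numpy as np

def cross_val_r2(X, y, rng):
    """Fit on a random half (stage 1), score R-square on the other half (stage 2)."""
    N = len(y)
    idx = rng.permutation(N)
    i1, i2 = idx[: N // 2], idx[N // 2 :]               # n1 + n2 = N
    b, *_ = np.linalg.lstsq(X[i1], y[i1], rcond=None)   # stage 1: estimate coefficients
    yhat2 = X[i2] @ b                                   # stage 2: predict the holdout half
    sse = np.sum((y[i2] - yhat2) ** 2)
    tss = np.sum((y[i2] - np.mean(y[i2])) ** 2)
    return 1.0 - sse / tss

# Repeating over many random splits gives a sampling distribution of R-squares:
# rng = np.random.default_rng(3)
# r2_dist = [cross_val_r2(X, y, rng) for _ in range(1000)]
```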

 

c) Bayesian Methods for Model Comparison:

Principle: Choose the model that maximizes Pr[Model | data].

Asymptotic Approximation for large sample size N:

$Pr[\text{Model } i \mid \text{data}] = m_i / \sum_j m_j$

$m_i = \exp[\ln(m_i)]$

$\ln(m_i) = \ln(L_i) - (p_i/2)\ln(N)$   (ln is the natural log function)

$L_i$ = likelihood of model i = $Pr[Y_1 \mid \text{model } i] \times Pr[Y_2 \mid \text{model } i] \times \cdots \times Pr[Y_N \mid \text{model } i]$

$\ln(L_i) = \ln(Pr[Y_1 \mid \text{model } i]) + \ln(Pr[Y_2 \mid \text{model } i]) + \cdots + \ln(Pr[Y_N \mid \text{model } i])$

$p_i$ = number of parameters in model i

 

The Schwarz Bayesian Information Criterion is

$BIC_i = -2\ln(L_i) + p_i\ln(N) = G_i^2 + p_i\ln(N)$

Choose the model with the lowest BIC.

$G_i^2 = -2\ln(L_i)$ is a chi-square distributed measure of lack of fit.

Note that $BIC_i = -2\ln(m_i)$, and so $m_i = \exp(-BIC_i/2)$.

 

For the GLM (and assuming normality, homogeneous variance, and independent observations) this becomes

$\ln(L_i) = -(N/2)\ln(SSE_i/N) - (N/2)[\ln(2\pi) + 1]$

$BIC_i = N\ln(SSE_i/N) + p_i\ln(N) + N[\ln(2\pi) + 1]$

Note: the term $N[\ln(2\pi) + 1]$ is common to all models and cancels when any two models are compared, so it is simply dropped:

$BIC_i = N\ln(SSE_i/N) + p_i\ln(N)$

Use this to compute $Pr[\text{Model } i \mid \text{data}]$ as shown above.
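A sketch of the comparison under these GLM assumptions (assuming numpy; the SSE values and parameter counts below are hypothetical illustrations):

```python
import numpy as np

def bic(sse, p, N):
    """BIC_i = N*ln(SSE_i/N) + p_i*ln(N), with the shared N*[ln(2*pi) + 1] term dropped."""
    return N * np.log(sse / N) + p * np.log(N)

def model_probs(bics):
    """Pr[Model i | data] = m_i / sum_j m_j, where m_i = exp(-BIC_i / 2)."""
    b = np.asarray(bics, dtype=float)
    m = np.exp(-(b - b.min()) / 2.0)   # shift by min(BIC) for numerical stability
    return m / m.sum()

# Hypothetical two-model comparison with N = 100 observations:
N = 100
bic_R = bic(sse=520.0, p=2, N=N)       # restricted model, 2 parameters
bic_C = bic(sse=470.0, p=4, N=N)       # complex model, 4 parameters
print(model_probs([bic_C, bic_R]))     # lower BIC -> higher posterior probability
```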

 

When comparing only two models, model C (complex) vs. model R (restricted), we can use

Evidence for model C over model R:

$\Delta BIC = 2[\ln(m_C) - \ln(m_R)] = BIC_R - BIC_C$

 

Probability of model C compared to model R:

$Pr[\text{model C} \mid \text{data}] = 1 / \{1 + \exp(-\Delta BIC/2)\}$
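Continuing the hypothetical example above (reusing bic_R, bic_C, and numpy from the previous sketch), the two-model formula gives the same answer as normalizing the m's directly:

```python
delta_bic = bic_R - bic_C                     # evidence for model C over model R
p_C = 1.0 / (1.0 + np.exp(-delta_bic / 2.0))
print(p_C)                                    # equals model_probs([bic_C, bic_R])[0]
```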

 

See Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44, 92–107.