Additional Topics on Multiple Regression

 

These notes cover several additional considerations that are important for understanding multiple regression analysis.

 

 

1. Expected Value of Elements of the F-Ratio

Setting: comparing the bivariate (two-predictor) model to the null model.

When we compute the F-statistic, it is simply a sample estimate. It is useful to examine the theoretical components that enter into this F-statistic. To do this we examine the expectations of the MSR and MSE, that is, the population mean values of these two sample statistics, denoted E[MSR] and E[MSE] respectively.

 

$MSR = (N-1)[b_1^2 s_1^2 + b_2^2 s_2^2 + 2 b_1 b_2 s_{12}]/2$

$E[MSR] = \sigma_e^2 + (N-1)[b_1^2 s_1^2 + b_2^2 s_2^2 + 2 b_1 b_2 s_{12}]/2$

$= \sigma_e^2 + \sigma_p^2$ = error variance + systematic variance

 

$MSE = SSE/(N-3)$   (N − 3 degrees of freedom, because three parameters are estimated: the intercept and two slopes)

$E[MSE] = \sigma_e^2$

 

$F^* = MSR/MSE$

$E[MSR]/E[MSE] = (\sigma_e^2 + \sigma_p^2)/\sigma_e^2$

 

As can be seen from the above ratio, the numerator measures signal plus noise, and the denominator measures noise alone. Thus the ratio is expected to be greater than one when a systematic effect is present.
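To make the pieces concrete, here is a minimal numerical sketch (our own illustration, assuming numpy and a small simulated data set) that computes MSR, MSE, and F* for a two-predictor model:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x1 = rng.normal(size=N)
x2 = 0.5 * x1 + rng.normal(size=N)                   # correlated predictors
y = 1.0 + 0.8 * x1 + 0.3 * x2 + rng.normal(size=N)   # systematic effect present

X = np.column_stack([np.ones(N), x1, x2])            # design matrix with intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)            # least-squares estimates
yhat = X @ b

SSR = np.sum((yhat - y.mean()) ** 2)                 # regression sum of squares
SSE = np.sum((y - yhat) ** 2)                        # error sum of squares
MSR = SSR / 2                                        # 2 predictors -> 2 numerator df
MSE = SSE / (N - 3)                                  # N - 3 df (intercept + two slopes)
print(MSR / MSE)                                     # F*; well above 1 when signal is present
```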

 

2. Assumptions for the F-test

a) Observations are independent

Test by examining autocorrelations of residuals.

b) Y is normally distributed (at each fixed value of the predictors)

Test by examining a normal probability plot of the residuals.

c) Variance of Y is constant across different fixed values of $X_1$ and $X_2$ (homogeneity of variance).

Test by correlating the rank order of the predicted values with the rank order of the residual magnitudes. All three checks are sketched in the code below.
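A minimal sketch of these three diagnostics, assuming numpy/scipy; the residuals here come from a small simulated fit, and the lag-1 autocorrelation stands in for a fuller autocorrelation check:

```python
import numpy as np
from scipy import stats

# Fit a small two-predictor model so we have residuals to diagnose
# (simulated data; in practice use your own y, yhat, and residuals).
rng = np.random.default_rng(1)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
y = X @ np.array([1.0, 0.8, 0.3]) + rng.normal(size=N)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b
resid = y - yhat

# (a) Independence: lag-1 autocorrelation of residuals (should be near 0).
r_lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# (b) Normality: correlation between ordered residuals and normal quantiles,
#     the correlation behind a normal probability plot (should be near 1).
quantiles, ordered = stats.probplot(resid, dist="norm", fit=False)
r_normal = np.corrcoef(quantiles, ordered)[0, 1]

# (c) Homogeneous variance: rank correlation of predicted values with
#     residual magnitudes (should be near 0).
rho, pval = stats.spearmanr(yhat, np.abs(resid))

print(r_lag1, r_normal, rho)
```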

 

3. Additional Considerations

a) Check for outliers before you begin. These can greatly distort the estimates and inflate the MSE.

b) Nonlinearity. Plot the residuals as a function of each predictor to see if there are any systematic nonlinear relations between a predictor and the criterion.

c) Omission of an important predictor. If an important predictor is omitted, this will of course inflate the MSE. But it may also bias or distort the regression coefficients of the included predictors if they are correlated with the omitted predictor, as the simulation below illustrates.
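Point (c) is easy to see in a small simulation (our own sketch, assuming numpy): when X2 is omitted and X1 is correlated with X2, the coefficient estimated for X1 absorbs part of X2's effect.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000
x1 = rng.normal(size=N)
x2 = 0.7 * x1 + rng.normal(size=N)                 # x2 correlated with x1
y = 1.0 + 0.5 * x1 + 0.9 * x2 + rng.normal(size=N)

full = np.column_stack([np.ones(N), x1, x2])
b_full, *_ = np.linalg.lstsq(full, y, rcond=None)

reduced = np.column_stack([np.ones(N), x1])        # x2 omitted
b_reduced, *_ = np.linalg.lstsq(reduced, y, rcond=None)

print(b_full[1])     # about 0.5: the correct coefficient for x1
print(b_reduced[1])  # about 0.5 + 0.9 * 0.7 = 1.13: biased by the omitted x2
```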

 

4. Other Methods for Model Comparison

 

a)  Adjusted R-Square

Use for comparing models with different numbers of parameters.

Define $p_i$ = total number of parameters of model i (including the intercept)

$SSE_i$ = sum of squared errors from model i

Standard R-square for a model:

$R^2 = 1 - SSE_i/TSS$

 

Adjusted R-square for a model:

$s_y^2 = TSS/(N-1)$

$s_e^2 = SSE_i/(N-p_i)$   for model i

$R_a^2 = 1 - s_e^2/s_y^2 = 1 - (SSE_i/TSS)(N-1)/(N-p_i)$

 

This is interpreted as the $R^2$ expected if you use the sample estimates to predict the entire population.
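A minimal sketch of both formulas (assuming numpy; y and yhat are the observed and predicted criterion values, and p counts all parameters including the intercept):

```python
import numpy as np

def r_squared(y, yhat):
    """Standard R-square: 1 - SSE/TSS."""
    sse = np.sum((y - yhat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - sse / tss

def adjusted_r_squared(y, yhat, p):
    """Adjusted R-square: 1 - (SSE/TSS) * (N - 1)/(N - p)."""
    N = len(y)
    sse = np.sum((y - yhat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - (sse / tss) * (N - 1) / (N - p)
```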

 

 

b) Cross-Validation Method for Comparing Models

N = total sample size

Randomly divide the total sample into two subsamples:

$n_1$ observations in the first subsample and

$n_2$ observations in the second subsample, with

$n_1 + n_2 = N$

Stage 1: Estimate the coefficients of each model from the first subsample of $n_1$ observations.

Stage 2: Use those same coefficients from Stage 1 to make predictions in the second subsample. Compare the models' R-squares computed from these second-stage predictions.

The whole process can be repeated many times with different random splits, producing a sampling distribution of R-squares.
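One split of this procedure, as a sketch (assuming numpy; X is a design matrix with an intercept column and y the criterion, both hypothetical here):

```python
import numpy as np

def cross_val_r2(X, y, rng):
    """Fit on a random half (stage 1), score R-square on the other half (stage 2)."""
    N = len(y)
    idx = rng.permutation(N)
    i1, i2 = idx[: N // 2], idx[N // 2 :]               # n1 + n2 = N
    b, *_ = np.linalg.lstsq(X[i1], y[i1], rcond=None)   # stage 1: estimate coefficients
    yhat2 = X[i2] @ b                                   # stage 2: predict the holdout half
    sse = np.sum((y[i2] - yhat2) ** 2)
    tss = np.sum((y[i2] - np.mean(y[i2])) ** 2)
    return 1.0 - sse / tss

# Repeating over many random splits gives a sampling distribution of R-squares:
# rng = np.random.default_rng(3)
# r2_dist = [cross_val_r2(X, y, rng) for _ in range(1000)]
```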

 

c) Bayesian Methods for Model Comparison:

Principle: Choose the model that maximizes Pr[Model | data].

Asymptotic Approximation for large sample size N:

$Pr[\text{Model } i \mid \text{data}] = m_i / \sum_j m_j$

$m_i = \exp[\ln(m_i)]$

$\ln(m_i) = \ln(L_i) - (p_i/2)\ln(N)$   (ln is the natural log function)

$L_i$ = likelihood of model i = $Pr[Y_1 \mid \text{model } i] \times Pr[Y_2 \mid \text{model } i] \times \cdots \times Pr[Y_N \mid \text{model } i]$

$\ln(L_i) = \ln(Pr[Y_1 \mid \text{model } i]) + \ln(Pr[Y_2 \mid \text{model } i]) + \cdots + \ln(Pr[Y_N \mid \text{model } i])$

$p_i$ = number of parameters in model i

 

The Schwarz Bayesian Information Criterion is

$BIC_i = -2\ln(L_i) + p_i\ln(N) = G_i^2 + p_i\ln(N)$

Choose the model with the lowest BIC.

$G_i^2 = -2\ln(L_i)$ is a chi-square distributed measure of lack of fit.

Note that $BIC_i = -2\ln(m_i)$, and so $m_i = \exp(-BIC_i/2)$.

 

For the GLM (and assuming normality, homogeneous variance, and independent observations) this becomes

$\ln(L_i) = -(N/2)\ln(SSE_i/N) - (N/2)[\ln(2\pi) + 1]$

$BIC_i = N\ln(SSE_i/N) + p_i\ln(N) + N[\ln(2\pi) + 1]$

Note: the term $N[\ln(2\pi) + 1]$ is common to all models and cancels when any two models are compared, so it is simply dropped:

$BIC_i = N\ln(SSE_i/N) + p_i\ln(N)$

Use this to compute $Pr[\text{Model } i \mid \text{data}]$ as shown above.
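A sketch of the comparison under these GLM assumptions (assuming numpy; the SSE values and parameter counts below are hypothetical illustrations):

```python
import numpy as np

def bic(sse, p, N):
    """BIC_i = N*ln(SSE_i/N) + p_i*ln(N), with the shared N*[ln(2*pi) + 1] term dropped."""
    return N * np.log(sse / N) + p * np.log(N)

def model_probs(bics):
    """Pr[Model i | data] = m_i / sum_j m_j, where m_i = exp(-BIC_i / 2)."""
    b = np.asarray(bics, dtype=float)
    m = np.exp(-(b - b.min()) / 2.0)   # shift by min(BIC) for numerical stability
    return m / m.sum()

# Hypothetical two-model comparison with N = 100 observations:
N = 100
bic_R = bic(sse=520.0, p=2, N=N)       # restricted model, 2 parameters
bic_C = bic(sse=470.0, p=4, N=N)       # complex model, 4 parameters
print(model_probs([bic_C, bic_R]))     # lower BIC -> higher posterior probability
```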

 

When comparing only two models, model C (complex) vs. model R (restricted), we can use

Evidence for model C over model R:

$\Delta BIC = 2[\ln(m_C) - \ln(m_R)] = BIC_R - BIC_C$

 

Probability of model C compared to model R:

$Pr[\text{model C} \mid \text{data}] = 1 / \{1 + \exp(-\Delta BIC/2)\}$
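Continuing the hypothetical example above (reusing bic_R, bic_C, and numpy from the previous sketch), the two-model formula gives the same answer as normalizing the m's directly:

```python
delta_bic = bic_R - bic_C                     # evidence for model C over model R
p_C = 1.0 / (1.0 + np.exp(-delta_bic / 2.0))
print(p_C)                                    # equals model_probs([bic_C, bic_R])[0]
```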

 

See Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44, 92–107.