Additional Topics on Multiple Regression
This set of notes covers several additional considerations important for understanding multiple regression analysis.
1. Expected Value of Elements of the F-Ratio, Comparing the Bivariate Model to the Null Model
When we compute the F-statistic, it is simply a sample estimate. It is useful to examine the theoretical components that enter into this F-statistic. To do this we examine the expectations of the MSR and MSE, that is, the population mean values of these two sample statistics, denoted E[MSR] and E[MSE] respectively.
MSR = (N-1)[b₁²s₁² + b₂²s₂² + 2b₁b₂s₁₂]/2
E[MSR] = σe² + (N-1)[β₁²s₁² + β₂²s₂² + 2β₁β₂s₁₂]/2
= σe² + σp²
= error variance + systematic variance
MSE = SSE/(N-3)
E[MSE] = σe²
F* = MSR/MSE
E[MSR]/E[MSE] = (σe² + σp²)/σe²
As can be seen from the above ratio, the numerator measures signal plus
noise, and the denominator measures noise alone. Thus the ratio is expected to
be greater than one when a systematic effect is present.
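To see this concretely, here is a minimal Python simulation (assuming only numpy; the sample size, coefficient values, and seed are arbitrary choices for illustration) of the two-predictor model, showing that F* fluctuates around a value well above 1 when the slopes are nonzero:

    import numpy as np

    # Fixed design: intercept plus two predictors; nonzero slopes
    # supply the systematic variance.
    rng = np.random.default_rng(0)
    N = 100
    X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
    beta = np.array([0.0, 0.5, 0.3])   # b0, b1, b2 (illustrative values)

    F_stats = []
    for _ in range(2000):
        y = X @ beta + rng.normal(size=N)            # error variance = 1
        b, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares fit
        resid = y - X @ b
        sse = resid @ resid                          # full-model SSE
        tss = np.sum((y - y.mean()) ** 2)            # null-model SSE
        msr = (tss - sse) / 2                        # MSR = SSR/2
        mse = sse / (N - 3)                          # MSE = SSE/(N-3)
        F_stats.append(msr / mse)

    print(np.mean(F_stats))   # well above 1: signal plus noise over noise alone

Setting b1 = b2 = 0 instead makes the average F* fall to about 1, matching the expectation ratio above.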
2. Assumptions for the F-test
a) Observations are independent.
Test by examining the autocorrelations of the residuals.
b) Y is normally distributed.
Test by examining a normal probability plot of the residuals.
c) The variance of Y is constant across different fixed values of X1 and X2 (homogeneity of variance).
Test by correlating the rank order of the predicted values with the rank order of the residual magnitudes. (All three checks are sketched in code below.)
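These checks can be run numerically as well as graphically. A minimal sketch (assuming numpy and scipy; the function name is my own, and fitted and resid stand for the predicted values and residuals from a fitted regression):

    import numpy as np
    from scipy import stats

    def ftest_diagnostics(fitted, resid):
        """Rough numeric versions of the three assumption checks."""
        fitted, resid = np.asarray(fitted), np.asarray(resid)

        # a) Independence: lag-1 autocorrelation of the residuals
        #    (should be near zero for independent observations).
        lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

        # b) Normality: correlation of sorted residuals with normal quantiles,
        #    a numeric stand-in for eyeballing a normal probability plot
        #    (should be near one).
        n = len(resid)
        q = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
        normality_r = np.corrcoef(np.sort(resid), q)[0, 1]

        # c) Homogeneity of variance: Spearman (rank) correlation between
        #    predicted values and residual magnitudes (should be near zero).
        spread_rho = stats.spearmanr(fitted, np.abs(resid))[0]

        return lag1, normality_r, spread_rho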
3. Additional Considerations
a) Check for outliers before you begin. These can greatly distort the estimates and inflate the MSE.
b) Nonlinearity. Plot the residuals as a function of each predictor to see if there are any systematic nonlinear relations between a predictor and the criterion.
c) Omission of an important predictor. If an important predictor is omitted, this will of course inflate the MSE. But it may also bias or distort the regression coefficients of the other predictors if they are correlated with the omitted predictor (a simulation of this bias is sketched below).
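Point (c) is easy to demonstrate by simulation. A minimal sketch (assuming numpy; the coefficients and the correlation between predictors are made-up values for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    N = 10_000
    x1 = rng.normal(size=N)
    x2 = 0.7 * x1 + rng.normal(scale=0.7, size=N)    # X2 correlated with X1
    y = 1.0 + 0.5 * x1 + 0.8 * x2 + rng.normal(size=N)

    full = np.column_stack([np.ones(N), x1, x2])
    reduced = np.column_stack([np.ones(N), x1])      # X2 omitted

    b_full, *_ = np.linalg.lstsq(full, y, rcond=None)
    b_reduced, *_ = np.linalg.lstsq(reduced, y, rcond=None)

    print(b_full[1])      # ~0.5: close to the true b1 when X2 is included
    print(b_reduced[1])   # ~1.06 (= 0.5 + 0.8*0.7): biased when X2 is omitted

The bias equals the omitted coefficient times the regression of X2 on X1, so it vanishes when the predictors are uncorrelated.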
4. Other methods for model comparison
a) Adjusted R-Square
Use this for comparing models with different numbers of parameters.
Define pᵢ = total number of parameters of model i (including the intercept)
SSEᵢ = sum of squared errors from model i
Standard R-square for model i:
R² = 1 - (SSEᵢ/TSS)
Adjusted R-square for model i:
sy² = TSS/(N-1)
se² = SSEᵢ/(N-pᵢ) for model i
Ra² = 1 - (se²/sy²) = 1 - (SSEᵢ/TSS)×[(N-1)/(N-pᵢ)]
This is interpreted as the R² expected if you used the sample estimates to predict the entire population.
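As a computational note, a minimal Python helper (the function name is my own) implementing this formula:

    import numpy as np

    def adjusted_r_square(y, y_hat, n_params):
        """Ra^2 = 1 - (SSE/TSS) * (N-1)/(N-p), with p counting the intercept."""
        y, y_hat = np.asarray(y), np.asarray(y_hat)
        N = len(y)
        sse = np.sum((y - y_hat) ** 2)      # sum of squared errors
        tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
        return 1.0 - (sse / tss) * (N - 1) / (N - n_params)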
b) Cross-Validation Method for Comparing Models
N = total sample size.
Randomly divide the total sample into two subsamples: n₁ observations in the first subsample and n₂ observations in the second subsample, with n₁ + n₂ = N.
Stage 1: Estimate the coefficients of each model from the first subsample using the n₁ observations.
Stage 2: Use the same coefficients from Stage 1 to make predictions in the second subsample. Compare the models' R-squares computed from these second-stage predictions.
The whole process can be repeated many times with different random splits, producing a sampling distribution of R-squares.
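A minimal sketch of one such split (assuming numpy; X is a model's design matrix including the intercept column):

    import numpy as np

    def cross_validated_r2(X, y, n1, rng):
        """Fit on a random subsample of size n1, score R² on the rest."""
        idx = rng.permutation(len(y))
        train, test = idx[:n1], idx[n1:]
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # Stage 1
        pred = X[test] @ b                                       # Stage 2
        sse = np.sum((y[test] - pred) ** 2)
        tss = np.sum((y[test] - y[test].mean()) ** 2)
        return 1.0 - sse / tss

    # Repeating over many random splits gives a sampling distribution of
    # cross-validated R² for each model:
    # rng = np.random.default_rng(0)
    # r2s = [cross_validated_r2(X, y, n1, rng) for _ in range(1000)]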
c) Bayesian Methods for Model Comparison
Principle: choose the model that maximizes Pr[Model | data].
Asymptotic approximation for large sample size N:
Pr[Model i | data] = mᵢ / Σⱼ mⱼ
mᵢ = exp[ln(mᵢ)]
ln(mᵢ) = ln(Lᵢ) - (pᵢ/2)×ln(N)
(ln is the natural log function)
Lᵢ = likelihood of model i = Pr[Y1|model i] × Pr[Y2|model i] × ··· × Pr[YN|model i]
ln(Lᵢ) = ln(Pr[Y1|model i]) + ln(Pr[Y2|model i]) + … + ln(Pr[YN|model i])
pᵢ = number of parameters in model i
The Schwarz Bayesian Information Criterion is
BICᵢ = -2×ln(Lᵢ) + pᵢ×ln(N) = Gᵢ² + pᵢ×ln(N)
Choose the model with the lowest BIC.
Gᵢ² = -2×ln(Lᵢ) is a chi-square distributed statistical measure of lack of fit.
Note that BICᵢ = -2×ln(mᵢ), and so mᵢ = exp(-BICᵢ/2).
For the GLM (and assuming normality, homogeneous variance, and independent observations) this becomes
ln(Lᵢ) = -(N/2)×ln(SSEᵢ/N) - (N/2)×[ln(2π) + 1]
BICᵢ = N×ln(SSEᵢ/N) + pᵢ×ln(N) + N×[ln(2π) + 1]
Note: the term N×[ln(2π) + 1] is common to all models, so it cancels out when we compare any two models and can simply be ignored:
BICᵢ = N×ln(SSEᵢ/N) + pᵢ×ln(N)
Use this BICᵢ to compute Pr[Model i | data] as shown above.
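These two formulas are straightforward to compute. A minimal sketch (assuming numpy; sse and n_params would come from fitting each candidate model to the same data):

    import numpy as np

    def bic(sse, N, n_params):
        """BICi = N*ln(SSEi/N) + pi*ln(N), dropping the shared constant."""
        return N * np.log(sse / N) + n_params * np.log(N)

    def model_probabilities(bics):
        """Pr[model i | data] ~= mi / sum_j mj, with mi = exp(-BICi/2)."""
        bics = np.asarray(bics, dtype=float)
        log_m = -bics / 2.0
        log_m -= log_m.max()   # shift for numerical stability; cancels in the ratio
        m = np.exp(log_m)
        return m / m.sum()

    # e.g., for a restricted and a complex model fit to N observations:
    # model_probabilities([bic(sse_R, N, p_R), bic(sse_C, N, p_C)])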
When comparing only two models, model C (complex) vs. model R (restricted), we can use the evidence for model C over model R:
ΔBIC = 2×[ln(mC) - ln(mR)] = BICR - BICC
Probability of model 1 compared to model 2:
Pr[model 1 | data] = 1/(1 + exp[(BIC₁ - BIC₂)/2])
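For example (numbers made up for illustration): if BIC₁ = 100 and BIC₂ = 106, then Pr[model 1 | data] = 1/(1 + exp[(100 - 106)/2]) = 1/(1 + e⁻³) ≈ 0.95, so model 1 is strongly favored despite a BIC difference of only 6.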
See Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44, 92-107.