Categorical Data Analysis

 

First consider once again the familiar GLM model for a 3 x 5 factorial design. Factor A has 3 levels and factor B has 5 levels. yi is the dependent variable for subject i (i-th row).  We need 2 effect codes for factor A and 4 for factor B, and then 2 x 4 = 8 for the interaction effect in the GLM:

 yi’= b0 + [a1Ai1 + a2Ai1] +  [b1Bi1 + … + b4Bi4] + [c1Ai1ΧBi1 + … + c8Ai2ΧBi4] 

Using GLM, we assumed that the dependent variable was a normally distributed variable . Now we assume that it takes on a small number of categories. 

1. Logit model:

To start , suppose that the observed dependent variable, denoted R, takes on only two values coded (1 = correct, yes, present, or true , vs.  0 = incorrect, no, absent, or false).

Let’s call this binary valued dependent variable Ri.  We assume that Ri is a probabilistic function of a latent continuous variable yi .

The logit model assumes that  

          yi = b0 + [a1Ai1 + a2Ai1] +  [b1Bi1 + … + b4Bi4] + [c1Ai1ΧBi1 + … + c8Ai2ΧBi4] + ei

or

yi’ = b0 + [a1Ai1 + a2Ai1] +  [b1Bi1 + … + b4Bi4] + [c1Ai1ΧBi1 + … + c8Ai2ΧBi4] 

          yi = yi’+ ei

          Pr[ Ri = 1 ] = Pr[ yi > 0 ]  = Pr[ei  > - yi’ ]  = Pr[ -ei  <  yi’ ]  = Pr[ ei  <  yi’ ]

(assuming symmetry of the error distribution in the last step)

If we assume that  ei has a logistic cumulative distribution (very closely approximates a normal

cumulative but is simpler to use) then

Pr[ Ri = 0 ] =  exp(0)/[ exp(0) + exp(yi’)]

Pr[ Ri = 1 ] =  exp(yi’)/[ exp(0) + exp(yi’)]

Note that

          Li =  (Pr[ Ri = 1 ]/ Pr[ Ri = 0 ]) = exp(yi’)

          Ln(Li) = yi’ = b0 + [a1Ai1 + a2Ai1] +  [b1Bi1 + … + b4Bi4] + [c1Ai1ΧBi1 + … + c8Ai2ΧBi4]

Thus the logit score is described by a linear anova model. This is an example of the Generalized Linear Model. The logit transform, mapping Pr[ Ri = 1 ] into yi, is called the link function.

Why not simply use GLM on the logit score Ln(Li)?  Some people do. However,

Problem 1:  Ri  is binary and not normal  (violating our assumptions for f-tests)

Problem 2:  var(Ri) = pjk (1-pjk )    non homogeneous  (again violating assumptions for f-tests)

Problem 3:  If we use the formula derived for GLM for the least squares criterion, then the parameter estimates are not minimum variance.

          Solution: We can use either weighted least squares or maximum likelihood to estimate the model parameters.

2. Multiple Response Logit Model:

Very often we have more than two categories for the response measure. In this case we use the multiple response logit model.

A drug reaction (pos, neg, neutral) is measured at five dosage levels for three age groups of rats. This is a 3 x 5 between groups design with a categorical dependent variable (drug reaction). Reactions from N = 150 rats were obtained with n = 10 subjects per group.

Group

Age

Dose

Pos (j=1)

Neg (j=2)

Neutral (j=3)

1

1

1

0

2

8

2

1

2

2

3

5

3

1

3

1

6

3

4

1

4

4

1

5

5

1

5

6

0

4

6

2

1

1

3

6

7

2

2

3

3

4

8

2

3

3

2

5

9

2

4

7

1

2

10

2

5

8

0

2

11

3

1

0

8

2

12

3

2

1

5

4

13

3

3

3

3

4

14

3

4

9

1

0

15

3

5

9

1

0

 

          pkj =  nkj / nk.   observed relative frequency of response category j given group k

          pkj = true population probability  of response category j given condition k

          nk. = n = 10  number of observations in group k

 

Logistic Response model for J = 3 categories

j=1 negative, j=2 positive, j = 3 neutral

ykj’ is predicted latent strength variable for resp j cond k

yk3’ = 0   (last category j=3, fixed equal to 0)

pkj = exp(ykj’) / [ exp(yk1’) + exp(yk2’) + exp(yk3’) ]

ykj’ = ln[exp(ykj’)/ exp(0)] = ln[exp(ykj’)/ exp(ykJ’)]

 =  ln[ pkj / pk3 ] = ln(pkj) - ln(pk3)

 

      

Multinomial Model J categories

ykj’ is predicted latent strength variable for resp j cond k

ykJ’ = 0 (last response score fixed equal to 0)

pkj = exp(ykj) / [  ε i = 1,J exp(yki) ]

ykj = ln[ pkj / pkJ ] = ln(pkj) - ln(pkJ)

 

 

 

 pj = true probability of category j response

 pj = (nj/n) sample est of pj based on n obs

(this is conditioned on some given population k , but I temporarily suppress the subscript k for convenience)

var(pj) = pj (1-pj ) / n

cov(pi , pj ) = - pip j / n

Pk’ =[ p1 , … , pJ ] = sample probabilities for group k

Vk = Cov(Pk ,Pk) = [diag(Pk)-PkPk'] / n    J x J matrix

(d /dpkj )ln pkj = 1/pkj

Yk = A ln(Pk) (a (J-1) x 1 vector, last response omitted)

A is a (J-1)xJ contrast matrix

A =  [ 1 0 0 … 0 -1

          0 1 0 … 0 -1

         …

         0 0 …     1 -1]

e.g., for three categories

    A =

      [ 1 0 -1

      0 1 -1 ]

Wk = Cov(Yk , Yk ) = [A diag(Pk)-1]Vk [diag(Pk)-1 A']

Y' = [Y1' , ... , Y15' ] (scores for all 15 groups)

W = Cov(Y,Y) = diag[ W1 , ... , W15 ]

 

 

Linear Model of Transformed Probabilities:

Y ’ = F[P ] = Xb , P = population probability vector

Y = F[P]  = sample estimates

X is a design matrix used to code main and interaction effects: main effect dose, main effect age, dose X age interaction

Weighted Least Squares Estimate

           B = (X'W-1X)-1(X'W-1Y)

Chi Square Test of H0 : Cb = 0

c 2 = (CB)'[ C (X'W-1X)-1C' ] -1(CB)

df = no. of elements in CB.

 

 

Alternatively we can use maximum likelihood and log likelihood ratio tests.

Estimate b  using P = F-1[Y’] = F-1[Xb]  by maximizing

 

          Ln(L) = ε ln(Kj) + Sjk njkΧln(pjk)

 

        ε ln(Kj) is a constant that can be ignored

 

Ln(Lc) := the log likelihood of a complex model,  

Ln(Lr) := the log likelihood of a restricted model

 

         Log likelihood ratio = Ln[(Lr / Lc)] = Ln(Lr) – Ln(Lc)

 

G2 =  -2Χ[Ln(Lr) – Ln(Lc)] 

is a chi square statistic lack of fit statistic with

df = (s-r)

 

 

   

    

 

3. Ordinal Responses

Categorical analysis throws away all information. Usually we can assume an ordinal response scale. The above example with positive neutral and negative could be treated as an ordinal scale.

 

Here is another example.

Confidence ratings were obtained in a signal detection task. Each subject rated confidence that a signal was present on each trial (very low, low, med, high, very high). Six levels of signal intensity were manipulated. 300 trials were obtained, with 50 trials observed at each level of intensity.  Here we will make the assumption that the repeated responses from each person are statistically independent (very likely an incorrect assumption).

 

Intensity

VLow =1

Low =2

Med=3 (j)

High=4

VHigh= 5

0

75

20

4

1

0

1

68

18

4

9

1

2  (k)

50

35

15

3

2

3

35

13

12

30

10

4

22

18

20

30

10

5

11

9

20

35

25

k is row index, j is column index

Compute the cumulative probability of responding at or below each response level j for each intensity k

est of Ckj = (nk1 + nk2 + … + nk,j)/nk.  

(using intensity 2 (in row k = 3) and response columns up to j =  4)

e.g. C34 = (50+35+15+3)/100 

Theory:

qj = threshold for confidence level j   

          if subjective experience falls below q1  then choose confidence level 1

if subjective experience falls inside the interval [q1 , q2 ] then choose confidence level 2

if subjective experience falls inside the interval [q2 , q3 ] then choose confidence level 3

if subjective experience falls inside the interval [q3 , q4 ] then choose confidence level 4

if subjective experience falls above q4 then choose confidence level 5

(b0 + ak) = General linear model for mean latent strength of the signal for condition k

e.g.    k = 1 (signal intensity = 0, noise condition) a1 = 0 and mean = b0

k = 3 (signal intensity = 2)  mean  = (b0 + a3) 

(b0 + ak) + ekj (signal plus noise)

ykj’ = qj  - (b0 + ak)  and ykj = ykj’ + ekj   (latent strength variable)

Cjk = Pr[ Respond at or below Confidence level j | stim intensity k]

 = Pr[ Resp = 1 or Resp = 2 … or Resp = j | stim intensity k]

= Pr[(b0 + ak) + ekj < qj ] = Pr[ekj < qj - (b0 + ak)] = Pr[ekj < ykj’]

Using Logistic cumulative distribution function:          

          Ckj = exp(ykj’)/[1 + exp(ykj’)]

          [Ckj / (1 – Ckj)] =  exp(ykj’)

          Ln[Ckj / (1 – Ckj)] = ykj’ = qj – (b0 + ak) 

 

Alternatively we could derive the Probability of responding higher than category j

(j+1,..., J = max category level)

1-Ckj = Pr[ Respond above Confidence level j | stim intensity k]

= Pr[  Resp = j+1 or Resp = j+2 … or Resp = J | stim intensity k]

= Pr[(b0 + ak) + ekj > qj ] = Pr[ (b0 + ak) - qj > -ekj] = Pr[ -ykj’ > -ekj ] = Pr[-ekj < -ykj’]  

(assuming symmetry)

= Pr[ekj < -ykj’ ] = Pr[ekj  < (b0 + ak) - qj ]

           = exp(-ykj’)/[1 + exp(-ykj’)] = 1/[1 + exp(ykj’)]

 

Chcck:   1-Ckj = 1 - { exp(ykj’)/[1 + exp(ykj’)]  }

                   =  {[1 + exp(ykj’)] / [1 + exp(ykj’)] } – { exp(ykj’) / [1 + exp(ykj’)] }

                   = { [1 + exp(ykj’) – exp(ykj’) } / [1 + exp(ykj’)]

                   = 1/[1 + exp(ykj’)].

 

Use maximum likelihood to estimate the parameters

   { qj , (b0 + ak) } ,

 Use G2 = -2{ln(LR) – ln(LC)} to compare restricted and complex (nested models)

 

Example 3: Repeated Measures.

 A drug reaction (pos, neg) is measured at five equally space time points for three age groups of rats. This is a 3 (between) x 5 (within) repeated measures design with a binary dependent variable (drug reaction). N = number of subjects in the study.

The 2 responses at 5 time points produces 25 = 32 cells for each group.

Three groups of 32 cells produces a total of 32 x 3 = 96 proportions.

Pj = a 32 column vector where each cell represents the population proportion of subjects falling into one of the 32 cells for group j ( j = 1,2,3)

P is a 96 element column vector of all the proportions. SAS Program

 

 

Homework

Proofs