Categorical Data Analysis

First consider once again the familiar general linear model (GLM) for a 3 x 5 factorial design. Factor A has 3 levels and factor B has 5 levels, and y_i is the dependent variable for subject i (the i-th row of the data). We need 2 effect codes for factor A, 4 for factor B, and 2 x 4 = 8 for the interaction effect in the GLM:

y_i’ = b₀ + [a₁A_i1 + a₂A_i2] + [b₁B_i1 + … + b₄B_i4] + [c₁A_i1×B_i1 + … + c₈A_i2×B_i4]
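The coding behind this equation can be sketched in a few lines. This is only an illustration: it assumes the usual effect-coding convention in which the last level of each factor is coded -1 on every column, and the function names `effect_codes` and `design_row` are made up for this sketch.

```python
def effect_codes(level, n_levels):
    """Effect codes for a 1-based level; the last level is coded -1 on every column (assumed convention)."""
    if level == n_levels:
        return [-1] * (n_levels - 1)
    return [1 if level == c else 0 for c in range(1, n_levels)]


def design_row(a_level, b_level, a_levels=3, b_levels=5):
    """One row of the design matrix: intercept, 2 A codes, 4 B codes, 8 A x B products."""
    a = effect_codes(a_level, a_levels)
    b = effect_codes(b_level, b_levels)
    interaction = [ai * bj for ai in a for bj in b]
    return [1] + a + b + interaction


# Design rows for all 15 cells of the 3 x 5 design
rows = [design_row(a, b) for a in range(1, 4) for b in range(1, 6)]
print(len(rows[0]))  # 1 + 2 + 4 + 8 = 15 columns
```

With this coding every column except the intercept sums to zero over the 15 cells, which is what makes b₀ the grand mean of the cell predictions.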

Using GLM, we assumed that the dependent variable was normally distributed. Now we assume instead that it takes on only a small number of categories.

1. Logit model:

To start, suppose that the observed dependent variable, denoted R_i for subject i, takes on only two values, coded 1 (correct, yes, present, or true) vs. 0 (incorrect, no, absent, or false).

We assume that this binary-valued dependent variable R_i is a probabilistic function of a latent continuous variable y_i.

The logit model assumes that

y_i = b₀ + [a₁A_i1 + a₂A_i2] + [b₁B_i1 + … + b₄B_i4] + [c₁A_i1×B_i1 + … + c₈A_i2×B_i4] + e_i

y_i’ = b₀ + [a₁A_i1 + a₂A_i2] + [b₁B_i1 + … + b₄B_i4] + [c₁A_i1×B_i1 + … + c₈A_i2×B_i4]

y_i = y_i’+ e_i

Pr[ R_i = 1 ] = Pr[ y_i > 0 ] = Pr[e_i > - y_i’ ] = Pr[ -e_i < y_i’ ] = Pr[ e_i < y_i’ ]

(assuming symmetry of the error distribution in the last step)

If we assume that e_i has a logistic cumulative distribution (which very closely approximates a normal cumulative distribution but is simpler to use), then

Pr[ R_i = 0 ] = exp(0)/[ exp(0) + exp(y_i’)]

Pr[ R_i = 1 ] = exp(y_i’)/[ exp(0) + exp(y_i’)]

Note that

L_i = (Pr[ R_i = 1 ]/ Pr[ R_i = 0 ]) = exp(y_i’)

Ln(L_i) = y_i’ = b₀ + [a₁A_i1 + a₂A_i2] + [b₁B_i1 + … + b₄B_i4] + [c₁A_i1×B_i1 + … + c₈A_i2×B_i4]
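A small numerical check of these formulas may help. The sketch below computes the two response probabilities from a predicted score and confirms that the log odds give the score back. The value 0.8 is an arbitrary illustrative score, not an estimate from any data, and `logit_probs` is a name chosen for this sketch.

```python
import math


def logit_probs(y_pred):
    """Return (Pr[R = 0], Pr[R = 1]) for a predicted latent score y_pred."""
    denom = math.exp(0) + math.exp(y_pred)
    return math.exp(0) / denom, math.exp(y_pred) / denom


p0, p1 = logit_probs(0.8)      # 0.8 is an arbitrary illustrative score
odds = p1 / p0                 # L_i = exp(y_i')
log_odds = math.log(odds)      # Ln(L_i) recovers y_i'
print(p0 + p1, log_odds)
```

A score of 0 gives both responses probability 0.5, and each unit increase in the score multiplies the odds by e.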

Thus the logit score is described by a linear ANOVA model. This is an example of the Generalized Linear Model. The logit transform, mapping Pr[ R_i = 1 ] into y_i’, is called the link function.

Why not simply use GLM on the logit score Ln(L_i)? Some people do. However,

Problem 1: R_i is binary and not normal (violating our assumptions for F-tests).

Problem 2: var(R_i) = p_jk (1 - p_jk) depends on the cell probability p_jk, so the variances are not homogeneous (again violating assumptions for F-tests).

Problem 3: If we use the least squares formula derived for GLM, then the parameter estimates are not minimum variance.

Solution: We can use either weighted least squares or maximum likelihood to estimate the model parameters.
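To make the maximum likelihood option concrete, here is a minimal sketch of Newton-Raphson estimation for the smallest possible logit model: one two-level factor with a single effect code x = +1 or -1. The counts (7 of 10 and 3 of 10 responses of 1) are hypothetical, and `fit_logit` is a name invented for this sketch, not a routine from any package.

```python
import math


def fit_logit(groups, iters=25):
    """Fit y' = b0 + a1*x by Newton-Raphson; groups is a list of (x, successes, n)."""
    b0, a1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, s, n in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + a1 * x)))
            r = s - n * p            # observed minus expected count (score)
            w = n * p * (1.0 - p)    # binomial variance, the information weight
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: inverse information matrix times the score vector
        b0 += (h11 * g0 - h01 * g1) / det
        a1 += (h00 * g1 - h01 * g0) / det
    return b0, a1


# Hypothetical data: (effect code, number of R = 1 responses, group size)
b0, a1 = fit_logit([(1, 7, 10), (-1, 3, 10)])
print(b0, a1)
```

Because this model has as many parameters as groups, the fitted probabilities reproduce the observed proportions, so b0 converges to 0 and a1 to ln(7/3), about 0.847. Note that the weights n p (1 - p) are exactly the non-homogeneous variances of Problem 2.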

2. Multiple Response Logit Model:

Very often we have more than two categories for the response measure. In this case we use the multiple response logit model.

A drug reaction (pos, neg, neutral) is measured at five dosage levels for three age groups of rats. This is a 3 x 5 between groups design with a categorical dependent variable (drug reaction). Reactions from N = 150 rats were obtained with n = 10 subjects per group.

Group	Age	Dose	Pos (j=1)	Neg (j=2)	Neutral (j=3)
1	1	1	0	2	8
2	1	2	2	3	5
3	1	3	1	6	3
4	1	4	4	1	5
5	1	5	6	0	4
6	2	1	1	3	6
7	2	2	3	3	4
8	2	3	3	2	5
9	2	4	7	1	2
10	2	5	8	0	2
11	3	1	0	8	2
12	3	2	1	5	4
13	3	3	3	3	4
14	3	4	9	1	0
15	3	5	9	1	0

p̂_kj = n_kj / n_k. = observed relative frequency of response category j given group k

p_kj = true population probability of response category j given condition k

n_k. = n = 10 = number of observations in group k

Logistic Response model for J = 3 categories

j = 1 positive, j = 2 negative, j = 3 neutral (as in the table above)

y_kj’ is the predicted latent strength variable for response j in condition k

y_k3’ = 0 (last category j = 3, fixed equal to 0)

p_kj = exp(y_kj’) / [ exp(y_k1’) + exp(y_k2’) + exp(y_k3’) ]

y_kj’ = ln[exp(y_kj’) / exp(0)] = ln[exp(y_kj’) / exp(y_k3’)]

= ln[p_kj / p_k3] = ln(p_kj) - ln(p_k3)

Multinomial Model J categories

y_kj’ is the predicted latent strength variable for response j in condition k

y_kJ’ = 0 (last response score fixed equal to 0)

p_kj = exp(y_kj’) / [ Σ_{i = 1,…,J} exp(y_ki’) ]

y_kj’ = ln[p_kj / p_kJ] = ln(p_kj) - ln(p_kJ)
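The two directions of this mapping can be checked numerically. The sketch below uses the observed counts for group 2 of the table (2 positive, 3 negative, 5 neutral); the function names are chosen for this illustration only.

```python
import math


def multinomial_probs(y):
    """Probabilities from scores y_k1', ..., y_k,J-1'; the last score is fixed at 0."""
    scores = list(y) + [0.0]
    denom = sum(math.exp(s) for s in scores)
    return [math.exp(s) / denom for s in scores]


def logits_from_probs(p):
    """Scores y_kj' = ln(p_kj / p_kJ) for j = 1, ..., J-1."""
    return [math.log(pj / p[-1]) for pj in p[:-1]]


counts = [2, 3, 5]                      # group 2: pos, neg, neutral
n = sum(counts)
p_hat = [c / n for c in counts]         # observed relative frequencies
y_hat = logits_from_probs(p_hat)        # [ln(0.2/0.5), ln(0.3/0.5)]
p_back = multinomial_probs(y_hat)       # maps back to the same proportions
print(y_hat, p_back)
```

Groups with a zero count (for example group 1) would need special handling, since ln(0) is undefined; this sketch sidesteps that by using a group without zeros.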

p_j = true probability of a category j response

p̂_j = n_j / n = sample estimate of p_j based on n observations

(this is conditioned on some given population k, but I temporarily suppress the subscript k for convenience)

var(p̂_j) = p_j (1 - p_j) / n

cov(p̂_i, p̂_j) = -p_i p_j / n for i ≠ j

P_k’ = [p̂₁, …, p̂_J] = sample probabilities for group k

V_k = Cov(P_k, P_k) = [diag(P_k) - P_k P_k’] / n (a J x J matrix)
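As a quick illustration, the covariance matrix can be built directly from the variance and covariance formulas. The proportions below are those of group 2 in the table (0.2, 0.3, 0.5 with n = 10); `multinomial_cov` is a name made up for this sketch.

```python
def multinomial_cov(p, n):
    """V = [diag(p) - p p'] / n for a vector of category proportions p."""
    J = len(p)
    return [
        [(p[i] * (1 - p[i]) if i == j else -p[i] * p[j]) / n for j in range(J)]
        for i in range(J)
    ]


V = multinomial_cov([0.2, 0.3, 0.5], 10)
for row in V:
    print(row)
```

Each row of V sums to zero because the J proportions are constrained to sum to one, so V is singular. That is one reason the last response is dropped when the logits are formed.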

(d / dp_kj) ln(p_kj) = 1/p_kj (the derivative used to carry V_k over to the log scale)

Y_k = A ln(P_k) (a (J-1) x 1 vector, last response omitted)

A is a (J-1) x J contrast matrix

A = [ 1 0 0 … 0 -1
      0 1 0 … 0 -1
      …
      0 0 0 … 1 -1 ]

e.g., for three categories

A = [ 1 0 -1
      0 1 -1 ]
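The contrast matrix and the resulting logit vector are easy to build for any J. The following sketch does so for the three-category case, again using the group 2 proportions (0.2, 0.3, 0.5) from the table; `contrast_matrix` is a name introduced here for illustration.

```python
import math


def contrast_matrix(J):
    """(J-1) x J matrix: identity in the first J-1 columns, -1 in the last column."""
    return [
        [1 if c == r else (-1 if c == J - 1 else 0) for c in range(J)]
        for r in range(J - 1)
    ]


A = contrast_matrix(3)                       # [[1, 0, -1], [0, 1, -1]]
p = [0.2, 0.3, 0.5]                          # group 2 proportions
log_p = [math.log(pj) for pj in p]
Y = [sum(a * lp for a, lp in zip(row, log_p)) for row in A]   # Y_k = A ln(P_k)
print(A, Y)
```

Each element of Y is ln(p_j) - ln(p_J), the log odds of category j against the last category.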

W_k = Cov(Y_k , Y_k ) = [A diag(P_k)^-1]V_k [diag(P_k)^-1 A']
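This sandwich formula can be evaluated with a few small matrix helpers. The sketch below is pure Python and uses the group 2 proportions (0.2, 0.3, 0.5) with n = 10; the helper names are invented for this illustration.

```python
def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def logit_cov(p, n):
    """W = [A diag(p)^-1] V [diag(p)^-1 A'] for proportions p and group size n."""
    J = len(p)
    A = [[1 if c == r else (-1 if c == J - 1 else 0) for c in range(J)]
         for r in range(J - 1)]
    V = [[(p[i] * (1 - p[i]) if i == j else -p[i] * p[j]) / n for j in range(J)]
         for i in range(J)]
    D_inv = [[1.0 / p[i] if i == j else 0.0 for j in range(J)] for i in range(J)]
    H = matmul(A, D_inv)                       # derivative of A ln(p) with respect to p
    return matmul(matmul(H, V), transpose(H))


W = logit_cov([0.2, 0.3, 0.5], 10)
for row in W:
    print(row)
```

Working the algebra through gives the simple closed form W_ij = (δ_ij / p_i + 1 / p_J) / n, where δ_ij is 1 when i = j and 0 otherwise. For this group that is 0.7 and 0.5333 on the diagonal and 0.2 off the diagonal, which the code reproduces.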

Y' = [Y₁', ... , Y₁₅' ] (scores for all 15 groups)

W = Cov(Y,Y) = diag[ W₁ , ... , W₁₅ ]

Linear Model of Transformed Probabilities:

Y’ = F[P] = Xb, where P is the population probability vector

Y = F[P] evaluated at the sample proportions, giving the sample estimates

X is a design matrix used to code main and interaction effects: main effect dose, main effect age, dose X age interaction

Weighted Least Squares Estimate

B = (X'W^-1X)^-1(X'W^-1Y)

Chi-Square Test of H₀: Cb = 0

χ² = (CB)'[C (X'W^-1X)^-1 C']^-1 (CB)

df = number of elements in CB (the number of rows of C).
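A deliberately small worked sketch of the weighted least squares estimate and the chi-square test follows. To keep the matrices 2 x 2, it makes two simplifying assumptions that are not part of the analysis above: the response is collapsed to positive vs. not positive (so J = 2 and each W_k is the scalar 1 / [n p (1 - p)]), and only the five dose groups of age group 2 are used, with a model containing an intercept and a linear dose code.

```python
import math

pos = [1, 3, 3, 7, 8]          # positive reactions, age group 2, doses 1 to 5
n = 10                         # rats per group
dose = [-2, -1, 0, 1, 2]       # centered linear dose code (an assumed coding)

Y = [math.log(s / (n - s)) for s in pos]                      # sample logits
Wdiag = [1.0 / (n * (s / n) * (1 - s / n)) for s in pos]      # var of each logit

# Elements of X'W^-1X and X'W^-1Y for X = [1, dose]
s00 = sum(1 / w for w in Wdiag)
s01 = sum(d / w for d, w in zip(dose, Wdiag))
s11 = sum(d * d / w for d, w in zip(dose, Wdiag))
t0 = sum(y / w for y, w in zip(Y, Wdiag))
t1 = sum(d * y / w for d, y, w in zip(dose, Y, Wdiag))

det = s00 * s11 - s01 * s01
cov_B = [[s11 / det, -s01 / det], [-s01 / det, s00 / det]]    # (X'W^-1X)^-1
B = [cov_B[0][0] * t0 + cov_B[0][1] * t1,
     cov_B[1][0] * t0 + cov_B[1][1] * t1]                     # WLS estimate

# Test H0: slope = 0, that is C = [0, 1], so df = 1
chi2 = B[1] ** 2 / cov_B[1][1]
print(B, chi2)
```

With C = [0, 1] the quadratic form reduces to the squared slope divided by its estimated variance. The full 3 x 5 analysis works the same way, with 2 x 2 blocks W_k on the diagonal of W.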

Alternatively we can use maximum likelihood and log likelihood ratio tests.

Estimate b using P = F^-1[Y’] = F^-1[Xb] by maximizing

Ln(L) = Σ_k ln(K_k) + Σ_k Σ_j n_kj × ln(p_kj)

where K_k is the multinomial coefficient for group k; Σ_k ln(K_k) is a constant that does not depend on b and can be ignored

Ln(L_c) := the log likelihood of a complex model,

Ln(L_r) := the log likelihood of a restricted model

Log likelihood ratio = Ln(L_r / L_c) = Ln(L_r) – Ln(L_c)

G² = -2 × [Ln(L_r) – Ln(L_c)]

is a chi-square lack-of-fit statistic with

df = (s - r),

where s and r are the numbers of free parameters in the complex and restricted models, respectively.
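As an illustration of the likelihood ratio statistic, the sketch below uses the drug reaction table and compares two models whose maximum likelihood estimates have closed forms, so no iterative fitting is needed. The complex model is the saturated one (a separate probability vector per group, 15 x 2 = 30 free parameters). The restricted model gives every group the same three probabilities (2 free parameters), that is, no age, dose, or interaction effects. The choice of these two models is an assumption of this example.

```python
import math

# Counts (pos, neg, neutral) for the 15 groups of the table
table = [
    (0, 2, 8), (2, 3, 5), (1, 6, 3), (4, 1, 5), (6, 0, 4),
    (1, 3, 6), (3, 3, 4), (3, 2, 5), (7, 1, 2), (8, 0, 2),
    (0, 8, 2), (1, 5, 4), (3, 3, 4), (9, 1, 0), (9, 1, 0),
]


def log_lik(table, probs):
    """Log likelihood without the constant term; cells with n = 0 contribute 0."""
    return sum(
        n * math.log(p)
        for row, prow in zip(table, probs)
        for n, p in zip(row, prow)
        if n > 0
    )


# Complex (saturated) model: ML estimates are the observed proportions
saturated = [[c / sum(row) for c in row] for row in table]

# Restricted model: one common probability vector, estimated by pooling groups
N = sum(sum(row) for row in table)
pooled = [sum(row[j] for row in table) / N for j in range(3)]
restricted = [pooled] * len(table)

g2 = -2 * (log_lik(table, restricted) - log_lik(table, saturated))
df = len(table) * 2 - 2        # 30 - 2 = 28
print(g2, df)
```

Intermediate models, such as main effects only, have no closed-form estimates and require iterative maximization, but their G² is formed from the two log likelihoods in exactly the same way.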

3. Ordinal Responses

Categorical analysis throws away all information about the ordering of the response categories. Usually we can assume an ordinal response scale. The above example with positive, neutral, and negative reactions could be treated as an ordinal scale.

Here is another example.

Confidence ratings were obtained in a signal detection task. Each subject rated confidence that a signal was present on each trial (very low, low, med, high, very high). Six levels of signal intensity were manipulated, with about 100 trials observed at each level; the row totals n_k. of the table below give the exact counts. Here we assume that the repeated responses from each person are statistically independent (very likely an incorrect assumption).

Intensity	VLow=1	Low=2	Med=3	High=4	VHigh=5
0	75	20	4	1	0
1	68	18	4	9	1
2	50	35	15	3	2
3	35	13	12	30	10
4	22	18	20	30	10
5	11	9	20	35	25

k is row index, j is column index

Compute the cumulative probability of responding at or below each response level j for each intensity k

est of C_kj = (n_k1 + n_k2 + … + n_kj) / n_k.

e.g., using intensity 2 (row k = 3) and response columns up to j = 4:

C₃₄ = (50+35+15+3) / n₃. = 103/105
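The cumulative proportions for the whole table can be computed with a running sum. The sketch below follows the formula for the estimate of C_kj, dividing by each row's own total n_k.; the function name `cumulative_props` is chosen for this example.

```python
# Confidence rating counts by signal intensity (rows of the table)
counts = {
    0: [75, 20, 4, 1, 0],
    1: [68, 18, 4, 9, 1],
    2: [50, 35, 15, 3, 2],
    3: [35, 13, 12, 30, 10],
    4: [22, 18, 20, 30, 10],
    5: [11, 9, 20, 35, 25],
}


def cumulative_props(row):
    """Estimated C_kj for j = 1..J: running sum of the counts over the row total."""
    total = sum(row)
    running = 0
    out = []
    for c in row:
        running += c
        out.append(running / total)
    return out


C = {k: cumulative_props(row) for k, row in counts.items()}
for k in sorted(C):
    print(k, [round(c, 3) for c in C[k]])
```

For intensity 0 this gives 0.75, 0.95, 0.99, 1.0, 1.0. The last entry of every row is 1 by construction, so only the first J - 1 = 4 cumulative proportions carry information.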

Theory:

q_j = threshold for confidence level j

if subjective experience falls below q₁ then choose confidence level 1

if subjective experience falls inside the interval [q₁, q₂] then choose confidence level 2

if subjective experience falls inside the interval [q₂, q₃] then choose confidence level 3

if subjective experience falls inside the interval [q₃, q₄] then choose confidence level 4

if subjective experience falls above q₄ then choose confidence level 5

(b₀ + a_k) = General linear model for mean latent strength of the signal for condition k

e.g. k = 1 (signal intensity = 0, noise condition) a₁ = 0 and mean = b₀

k = 3 (signal intensity = 2) mean = (b₀ + a₃)

(b₀ + a_k) + e_kj (signal plus noise)

y_kj’ = q_j - (b₀ + a_k) and y_kj = y_kj’ + e_kj (latent strength variable)

C_kj = Pr[ Respond at or below Confidence level j | stim intensity k]

= Pr[ Resp = 1 or Resp = 2 … or Resp = j | stim intensity k]

= Pr[(b₀ + a_k) + e_kj < q_j ] = Pr[e_kj < q_j - (b₀ + a_k)] = Pr[e_kj < y_kj’]

Using Logistic cumulative distribution function:

C_kj = exp(y_kj’)/[1 + exp(y_kj’)]

[C_kj / (1 – C_kj)] = exp(y_kj’)

Ln[C_kj / (1 – C_kj)] = y_kj’ = q_j – (b₀ + a_k)
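To see how the thresholds and the condition mean generate response probabilities, consider the following sketch. The threshold values and the mean are hypothetical numbers picked only for illustration, and the function names are not from any package. The probability of each individual confidence level is obtained by differencing adjacent cumulative probabilities.

```python
import math


def cumulative_probs(thresholds, mean):
    """C_kj = exp(y') / [1 + exp(y')] with y' = q_j - mean, for each threshold q_j."""
    return [1.0 / (1.0 + math.exp(-(q - mean))) for q in thresholds]


def category_probs(thresholds, mean):
    """Pr[Resp = j] for j = 1..J, from differences of cumulative probabilities."""
    C = cumulative_probs(thresholds, mean) + [1.0]   # C_kJ = 1 for the top level
    return [C[0]] + [C[j] - C[j - 1] for j in range(1, len(C))]


q = [-1.0, 0.0, 1.0, 2.0]       # hypothetical thresholds q1 < q2 < q3 < q4
mean = 0.5                      # hypothetical value of (b0 + a_k)

C = cumulative_probs(q, mean)
probs = category_probs(q, mean)
print(C, probs)
```

Raising the mean (b₀ + a_k) lowers every cumulative probability C_kj, which shifts probability toward the higher confidence levels, as one would expect for a stronger signal.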

Alternatively we could derive the Probability of responding higher than category j

(j+1,..., J = max category level)

1-C_kj = Pr[ Respond above Confidence level j | stim intensity k]

= Pr[ Resp = j+1 or Resp = j+2 … or Resp = J | stim intensity k]

= Pr[(b₀ + a_k) + e_kj > q_j ] = Pr[ (b₀ + a_k) - q_j > -e_kj] = Pr[ -y_kj’ > -e_kj ] = Pr[-e_kj < -y_kj’]

(assuming symmetry)

= Pr[e_kj < -y_kj’ ] = Pr[e_kj < (b₀ + a_k) - q_j ]

= exp(-y_kj’)/[1 + exp(-y_kj’)] = 1/[1 + exp(y_kj’)]

Check: 1 - C_kj = 1 - { exp(y_kj’) / [1 + exp(y_kj’)] }

= { [1 + exp(y_kj’)] / [1 + exp(y_kj’)] } – { exp(y_kj’) / [1 + exp(y_kj’)] }

= { [1 + exp(y_kj’)] – exp(y_kj’) } / [1 + exp(y_kj’)]

= 1 / [1 + exp(y_kj’)].

Use maximum likelihood to estimate the parameters

{ q_j, (b₀ + a_k) }.

Use G² = -2{ln(L_R) – ln(L_C)} to compare restricted and complex (nested) models.

Example 3: Repeated Measures.

A drug reaction (pos, neg) is measured at five equally spaced time points for three age groups of rats. This is a 3 (between) x 5 (within) repeated measures design with a binary dependent variable (drug reaction). N = number of subjects in the study.

The 2 responses at 5 time points produce 2⁵ = 32 cells for each group.

Three groups of 32 cells produce a total of 32 x 3 = 96 proportions.

P_j = a 32-element column vector where each cell represents the population proportion of subjects falling into one of the 32 cells for group j (j = 1, 2, 3)

P is a 96-element column vector of all the proportions.
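The cell structure can be enumerated directly, which may help when setting up the 96-element vector. In this sketch each response pattern is a tuple of five 0/1 values, one per time point (1 = pos, 0 = neg); the ordering of the cells is an arbitrary choice made here.

```python
from itertools import product

# All response patterns over 5 time points (1 = pos, 0 = neg)
patterns = list(product((0, 1), repeat=5))

# One cell per (age group, response pattern) pair
cells = [(group, pattern) for group in (1, 2, 3) for pattern in patterns]

print(len(patterns), len(cells))   # 32 patterns per group, 96 cells in total
print(cells[0], cells[-1])
```

Each subject falls into exactly one of the 32 patterns for its group, so the 32 proportions within a group sum to one.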
