MATH 563 — Mathematical Statistics

Linear Models
Casella & Berger Ch. 11
Test 2 on 4/7 on
Optimality of Estimators,
Hypothesis Testing, &
Ordinary Least Squares Through Inference

Fred J. Hickernell

April 3, 2026

Linear Models

Now we move to the situation where we have a

  • Response/Output/Dependent Variable \(Y\)

  • Explained/Predicted by a vector of Predictors/Explanatory Variables/Inputs/Independent Variables \(\vx\), where the \(x_j\) may be functions of other variables or each other

Types of linear models

Comparison of inference for linear models and population parameters

Model

  • Linear models: \(Y = \vx^\top \vbeta + \varepsilon\),   \(\Ex(\varepsilon) = 0\), \(\var(\varepsilon) = \sigma^2\), so \(\Ex(Y) = \vx^\top \vbeta =: \mu(\vx)\)
  • Population parameters: \(Y \sim \Bern(p)\), \(Y \sim \Exp(\lambda)\), \(Y \sim \Norm(\mu,\sigma^2)\), …

Parameters

  • Linear models: \(\vbeta \in \reals^d\)
  • Population parameters: \(\mu\), \(\sigma^2\), \(p\), \(\lambda\)

Fixed

  • Linear models: \(\vx\)

Random

  • Linear models: \(Y\), \(\varepsilon\)
  • Population parameters: \(Y\)

Data

  • Linear models: \((\vx_1,Y_1),\dots,(\vx_n,Y_n)\)
  • Population parameters: \(Y_1,\dots,Y_n\)

Model with data

  • Linear models: \(Y_i = \vx_i^\top \vbeta + \varepsilon_i\), i.e., \(\underbrace{\vY}_{n \times 1} = \underbrace{\mX}_{n \times d}\, \underbrace{\vbeta}_{d \times 1} + \underbrace{\vveps}_{n \times 1}\)
  • Population parameters: \(Y_i \sim \Bern(p)\), \(Y_i \sim \Exp(\lambda)\), \(Y_i \sim \Norm(\mu,\sigma^2)\), …

Not observable

  • Linear models: \(\varepsilon_1,\dots,\varepsilon_n\)

Point estimators

  • Linear models: \(\hvbeta = (\mX^\top \mX)^{-1} \mX^\top \vY\), \(\hvY = \mX \hvbeta\)
  • Population parameters: \(\barY\), \(S^2\), \(\hlambda_{\MLE}\), \(\hp_{\MLE}\), …

Confidence/prediction regions

  • Linear models: for \(\vbeta\), \(\mu(\vx)\), \(Y \mid \vx\)
  • Population parameters: for \(\mu\), \(\sigma^2\), \(p\), \(\lambda\), \(Y\)

Least Squares Linear Regression

(CB §11.3; WMS §11.3–11.4, 11.10–11.11)

  • Model: \(Y = \underbrace{\vx^\top \vbeta}_{\mu(\vx)} + \varepsilon \quad\) with \(\Ex(\varepsilon)=0\), \(\var(\varepsilon)=\sigma^2\)

    • \(x_j\) may be a variable or a function of variables
  • Model with data: \(\displaystyle \overbrace{\begin{pmatrix}Y_1 \\ Y_2 \\ \vdots \\ \vdots \\ Y_n\end{pmatrix}}^{\vY} = \overbrace{\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}}^{\mX}\, \overbrace{\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_d \end{pmatrix}}^{\vbeta} + \overbrace{\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \vdots \\ \varepsilon_n \end{pmatrix}}^{\vveps}\)

    • \(\mX\) is fixed and has full column rank (must have \(n \ge d\))

Estimation of \(\vbeta\) by least squares

\[ \argmin_{\vb \in \reals^d} \lVert \vY - \mX \vb \rVert^2 = (\mX^\top \mX)^{-1} \mX^\top \vY =: \hvbeta \]

Proof

Let \(\hvbeta\) be defined as above, and let \(\vb\) be any other vector in \(\reals^d\): \[\begin{align*} \lVert \vY - \mX \vb \rVert^2 & = \lVert (\vY - \mX \hvbeta) + \mX (\hvbeta - \vb) \rVert^2 \\ & = \lVert \vY - \mX \hvbeta \rVert^2 + 2 (\hvbeta - \vb)^\top \class{alert}{\mX^\top (\vY - \mX \hvbeta)} + \lVert \mX (\hvbeta - \vb) \rVert^2 \\ & = \lVert \vY - \mX \hvbeta \rVert^2 + 2 (\hvbeta - \vb)^\top \class{alert}{\mX^\top (\vY - \mX (\mX^\top \mX)^{-1} \mX^\top \vY)} + \lVert \mX (\hvbeta - \vb) \rVert^2 \\ & = \lVert \vY - \mX \hvbeta \rVert^2 + 2 (\hvbeta - \vb)^\top \class{alert}{[\mX^\top \vY - \mX^\top \mX (\mX^\top \mX)^{-1} \mX^\top \vY]} + \lVert \mX (\hvbeta - \vb) \rVert^2 \\ & = \lVert \vY - \mX \hvbeta \rVert^2 + \lVert \mX (\hvbeta - \vb) \rVert^2 \ge \lVert \vY - \mX \hvbeta \rVert^2 \end{align*}\]
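As a numerical sanity check, the closed form \(\hvbeta = (\mX^\top \mX)^{-1} \mX^\top \vY\) can be compared against a library least-squares solver. This is a sketch on simulated data; the design, coefficients, and seed below are our own illustrative choices, not the slides' example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: design, coefficients, and seed are arbitrary choices
n, d = 30, 2
x = rng.uniform(0, 3, size=n)
X = np.column_stack([np.ones(n), x**2])     # columns: intercept, x^2
Y = X @ np.array([0.10, 0.90]) + rng.normal(0, 1, size=n)

# Normal equations: solve (X^T X) b = X^T Y for b = beta_hat
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# np.linalg.lstsq minimizes ||Y - Xb||^2 directly; it is better conditioned
# because it never forms X^T X
beta_hat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Both computations agree to machine precision when \(\mX\) is well conditioned; in practice the QR-based solver is preferred.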

Furthermore

  • \(\hmu(\vx) := \vx^\top \hvbeta\) is an estimator of \(\mu(\vx) := \Ex(Y \mid \vx) = \vx^\top \vbeta\)

  • \(\hvY : = \mX \hvbeta = \underbrace{\mX (\mX^\top \mX)^{-1} \mX^\top}_{\href{https://en.wikipedia.org/wiki/Projection_matrix}{\text{projection matrix }} \mP} \vY\) is an estimator of \(\Ex(\vY) = \mX \vbeta\)

  • \(\mP\) is the projection matrix (also called the hat matrix) onto \[ \cc(\mX):= \text{ the column space of } \mX = \text{ the set of all linear combinations of the columns of } \mX = \{ \mX \vb : \vb \in \reals^d \} \] This means that

    • \(\mP^2 = \mP\) (idempotent), and so \(\mP\) has eigenvalues of \(0\) and \(1\) only
    • \(\mP^\top = \mP\) (symmetric)
    • \(\mP \mX = \mX\) (projection onto \(\cc(\mX)\))
    • \((\mI - \mP) \mX = \mzero\) (projection onto the orthogonal complement of \(\cc(\mX)\) )
  • \(\hvveps : = \vY - \hvY = (\mI - \mP) \vY\) is the vector of residuals,

    • Orthogonal to all columns of \(\mX\), i.e., orthogonal to \(\cc(\mX)\)
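The projection-matrix properties listed above are easy to verify numerically. A minimal numpy sketch, using an arbitrary full-column-rank design of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 12, 3
X = rng.normal(size=(n, d))                  # generic full-column-rank design
P = X @ np.linalg.solve(X.T @ X, X.T)        # projection matrix onto cc(X)

Y = rng.normal(size=n)
resid = (np.eye(n) - P) @ Y                  # residual vector (I - P) Y
eigs = np.sort(np.linalg.eigvalsh(P))        # eigenvalues of the symmetric P
```

The checks below confirm idempotence, symmetry, \(\mP\mX = \mX\), orthogonality of the residuals to \(\cc(\mX)\), and that the eigenvalues are \(n-d\) zeros and \(d\) ones.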

Geometric interpretation of least squares regression

Unbiasedness of estimators

Recall that \(\vY = \mX \vbeta + \vveps\) with \(\Ex(\vveps)=\vzero\), \(\cov(\vveps)=\sigma^2 \mI\), where \(\mX\) is fixed and of full column rank

Unbiasedness of \(\hvbeta\) and \(\hmu(\vx)\)

\[\begin{align*} \hvbeta & := (\mX^\top \mX)^{-1} \mX^\top \vY = (\mX^\top \mX)^{-1} \mX^\top \left(\mX \vbeta + \vveps\right) = \vbeta + (\mX^\top \mX)^{-1} \mX^\top \vveps \\ \implies \Ex(\hvbeta) & = \vbeta \qquad \text{so $\hvbeta$ is \class{alert}{unbiased}} \\[12pt] \hmu(\vx) & := \vx^\top \hvbeta = \vx^\top \vbeta + \vx^\top (\mX^\top \mX)^{-1} \mX^\top \vveps \\ \implies \Ex(\hmu(\vx)) &= \vx^\top \vbeta = \mu(\vx) \qquad \text{so $\hmu(\vx)$ is \class{alert}{unbiased}} \\[12pt] \hvveps &: = \underbrace{\vY}_{\text{data}} - \underbrace{\hvY}_{\text{fit}} = (\mI-\mP)\vY, \qquad \mP := \mX(\mX^\top \mX)^{-1}\mX^\top \\ & = (\mI-\mP)(\mX \vbeta + \vveps) = (\mI-\mP)\vveps \qquad \text{since $(\mI-\mP)\mX = \mzero$} \\ \implies \Ex(\hvveps) & = \vzero \end{align*}\]

Residual sum of squares and unbiased estimation of \(\sigma^2\)

Since \(\hvveps = (\mI-\mP)\vveps\), it follows that \[\begin{equation*} \Ex\bigl(\lVert \hvveps \rVert^2\bigr) = \Ex\bigl(\vveps^\top (\mI-\mP)\vveps\bigr) = \sigma^2 \trace(\mI-\mP) = \sigma^2 (n - \trace(\mP)) \end{equation*}\]

Since \(\mP\) is a projection matrix onto \(\cc(\mX)\), it has eigenvalues of \(0\) and \(1\) only, so \[ \trace(\mP)= \text{ sum of the eigenvalues of } \mP = \rank(\mP)=\rank(\mX)=d \]

Therefore \(\Ex\bigl(\lVert \hvveps \rVert^2\bigr) = \sigma^2 (n-d)\) and so \[\begin{multline*} S^2 := \frac 1{n-d} \sum_{i=1}^{n} \lvert Y_i - \vx_i^\top \hvbeta \rvert^2 = \frac{\lVert \hvveps \rVert^2}{n-d} = \frac{\lVert (\mI-\mP)\vY \rVert^2 }{n-d} = \frac{\lVert (\mI-\mP)\vveps \rVert^2 }{n-d} \\ \text{is an \class{alert}{unbiased} estimator of $\sigma^2$} \end{multline*}\]
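The unbiasedness of \(S^2\) can be illustrated by Monte Carlo: averaging \(S^2\) over many simulated data sets should recover \(\sigma^2\). A sketch with an arbitrary design, coefficients, and seed of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 25, 2, 1.5
x = np.linspace(0, 3, n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 2.0])

# Simulate many data sets at once: each column of Y is one replication
reps = 20000
Y = X @ beta[:, None] + rng.normal(0, sigma, size=(n, reps))
B = np.linalg.solve(X.T @ X, X.T @ Y)        # beta_hat for every replication
rss = np.sum((Y - X @ B) ** 2, axis=0)       # residual sum of squares
s2 = rss / (n - d)                           # S^2 for every replication
```

The Monte Carlo mean of `s2` lands close to \(\sigma^2 = 2.25\), consistent with \(\Ex(S^2)=\sigma^2\).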

Inference for regression parameters and predictions assuming \(\vveps \sim \Norm(\vzero,\sigma^2 \mI)\)

(CB §11.3; WMS §11.4–11.7, 11.12–11.13)

So far, we have only used the assumptions that

  • \(\mX\) is fixed and of full column rank
  • \(\Ex(\vveps)=\vzero\) and \(\cov(\vveps)=\sigma^2 \mI\)

Going forward, we assume that \(\vveps \sim \Norm(\vzero,\sigma^2 \mI)\). This allows us to

  • Make probabilistic statements about the estimators \(\hvbeta\) and \(\hmu(\vx)\)
  • Construct confidence intervals and bands for \(\vbeta\), \(\mu(\vx)\), and \(\sigma^2\)
  • Construct prediction intervals and bands for \(Y \mid \vx\)
  • Perform hypothesis tests about \(\vbeta\) and \(\mu(\vx)\)
  • Assess model fit and compare models

Confidence regions for \(\vbeta\) and \(\mu(\vx)\) assuming \(\vveps \sim \Norm(\vzero,\sigma^2 \mI)\)

Since a linear combination of normal random variables is normal, the following data-based quantities are all normally distributed:

  • \(\vY = \mX \vbeta + \vveps \sim \Norm(\mX\vbeta,\sigma^2 \mI)\)

  • \(\hvbeta = (\mX^\top \mX)^{-1} \mX^\top \vY = (\mX^\top\mX)^{-1}\mX^\top (\mX \vbeta + \vveps) = \vbeta + (\mX^\top\mX)^{-1}\mX^\top\vveps \sim \Norm\!\left(\vbeta,\sigma^2 (\mX^\top \mX)^{-1}\right)\)

  • \(\hmu(\vx) = \vx^\top \hvbeta = \mu(\vx) + \vx^\top (\mX^\top\mX)^{-1}\mX^\top\vveps \sim \Norm\!\left(\mu(\vx),\sigma^2 \vx^\top (\mX^\top \mX)^{-1} \vx\right)\)

  • \(\hvveps = (\mI-\mP)\vveps \sim \Norm(\vzero,\sigma^2 (\mI-\mP))\)

Confidence ellipsoid for \(\vbeta\) assuming \(\vveps \sim \Norm(\vzero,\sigma^2 \mI)\)

  • From the normal distribution of \(\hvbeta\), it follows that \[\frac{ (\hvbeta-\vbeta)^\top (\mX^\top\mX) (\hvbeta-\vbeta) }{\sigma^2} \sim \chi^2_d \]

  • Also, \(\hvbeta = \vbeta + (\mX^\top \mX)^{-1}\mX^\top \vveps = \vbeta + (\mX^\top \mX)^{-1}\mX^\top \mP \vveps\) since \(\mX^\top \mP = (\mP\mX)^\top = \mX^\top\), so \(\hvbeta\) depends only on \(\mP\vveps\)

  • Moreover, \(S^2 = \lVert(\mI - \mP)\vveps\rVert^2/(n-d)\), so \(S^2\) depends only on \((\mI-\mP)\vveps\)

  • Since \(\hvbeta\) and \(S^2\) depend on orthogonal projections of \(\vveps\) onto \(\cc(\mX)\) and \(\cc(\mX)^\perp\), respectively, and \(\vveps\) is Normal, \(\hvbeta\) and \(S^2\) are independent.

  • Hence, replacing \(\sigma^2\) with \(S^2\) gives \[ \frac{ (\hvbeta-\vbeta)^\top(\mX^\top\mX)(\hvbeta-\vbeta) }{dS^2} \sim F_{d,n-d} \]

Therefore

\[ (\hvbeta-\vbeta)^\top (\mX^\top\mX) (\hvbeta-\vbeta) \le d S^2 F_{d,n-d,\alpha}, \quad \text{where } \Prob(F_{d,n-d} > F_{d,n-d,\alpha}) = \alpha \]

defines a \(\class{alert}{(1-\alpha)\text{ confidence ellipsoid for }\vbeta}\) centered at \(\hvbeta\)
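The coverage claim can be checked by simulation: under the normal model, the ellipsoid should contain the true \(\vbeta\) in roughly a \(1-\alpha\) fraction of repeated samples. A sketch with an arbitrary design and seed of our own choosing:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, d, sigma, alpha = 20, 2, 1.0, 0.05
x = np.linspace(0, 3, n)
X = np.column_stack([np.ones(n), x])
beta = np.array([0.5, -1.0])
XtX = X.T @ X
Fcrit = stats.f.ppf(1 - alpha, d, n - d)     # F_{d,n-d,alpha} (upper-alpha point)

reps = 5000
covered = 0
for _ in range(reps):
    Y = X @ beta + rng.normal(0, sigma, size=n)
    bh = np.linalg.solve(XtX, X.T @ Y)
    s2 = np.sum((Y - X @ bh) ** 2) / (n - d)
    # True beta is covered iff the quadratic form is below the threshold
    covered += (bh - beta) @ XtX @ (bh - beta) <= d * s2 * Fcrit
coverage = covered / reps
```

Since the \(F\) pivot is exact under normality, the empirical coverage matches \(1-\alpha\) up to Monte Carlo error.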

Individual confidence intervals for the \(\beta_j\) assuming \(\vveps \sim \Norm(\vzero,\sigma^2 \mI)\)

\(\exstar\)Show that the variance of \(\hbeta_j\) is \(\sigma^2\bigl[(\mX^\top\mX)^{-1}\bigr]_{jj}\), where \(\bigl[(\mX^\top\mX)^{-1}\bigr]_{jj}\) denotes the \(j\)th diagonal entry of \((\mX^\top\mX)^{-1}\)

Since \(\hvbeta \sim \Norm\!\left( \vbeta,\; \sigma^2(\mX^\top\mX)^{-1} \right)\), it follows that for each \(j=1,\dots,d\),

\[ \hbeta_j \sim \Norm\!\left( \beta_j,\; \sigma^2\bigl[(\mX^\top\mX)^{-1}\bigr]_{jj} \right) \]

where \(\bigl[(\mX^\top\mX)^{-1}\bigr]_{jj}\) denotes the \(j\)th diagonal entry of \((\mX^\top\mX)^{-1}\). Replacing \(\sigma^2\) by \(S^2\) gives

\[\begin{gather*} \frac{ \hbeta_j-\beta_j }{ S\sqrt{\bigl[(\mX^\top\mX)^{-1}\bigr]_{jj}} } \sim t_{n-d} \\ \hbeta_j \pm t_{n-d,\alpha/2} \; S\sqrt{\bigl[(\mX^\top\mX)^{-1}\bigr]_{jj}} \; \text{ is a }\class{alert}{(1-\alpha)\text{ confidence interval for }\beta_j} \text{ for each } j=1,\dots,d \end{gather*}\]
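These intervals are straightforward to compute with numpy and scipy. A sketch on simulated data (design, coefficients, and seed are our own illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, d, alpha = 30, 2, 0.05
x = rng.uniform(0, 3, size=n)
X = np.column_stack([np.ones(n), x])
Y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
S2 = np.sum((Y - X @ beta_hat) ** 2) / (n - d)
se = np.sqrt(S2 * np.diag(XtX_inv))          # S * sqrt([(X^T X)^{-1}]_{jj})
tcrit = stats.t.ppf(1 - alpha / 2, n - d)    # t_{n-d, alpha/2}
ci = np.column_stack([beta_hat - tcrit * se, beta_hat + tcrit * se])
```

Each row of `ci` is the individual \((1-\alpha)\) interval for one coefficient.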

Joint vs individual confidence statements

\[\begin{gather*} (\hvbeta-\vbeta)^\top (\mX^\top\mX) (\hvbeta-\vbeta) \le dS^2F_{d,n-d,\alpha} \quad \text{is a joint confidence region \class{alert}{for the whole vector $\vbeta$}} \\ \hbeta_j \pm t_{n-d,\alpha/2} \; S\sqrt{\bigl[(\mX^\top\mX)^{-1}\bigr]_{jj}}, \; j=1,\dots,d, \quad \text{are \class{alert}{individual confidence intervals for each $\beta_j$}} \end{gather*}\]

  • Joint confidence for \(\vbeta\) is not the same as simultaneous confidence for all \(\beta_j\)

  • The confidence ellipsoid for \(\vbeta\) and the individual confidence interval for \(\beta_1\) both include \(\beta_1 = 0\), so one might question whether there should be a constant term in the model

Caution about many intervals

If we form separate \((1-\alpha)\) confidence intervals for each \(\beta_j\), the probability that all of them are simultaneously correct is generally less than \(1-\alpha\)

So:

  • individual confidence intervals control each \(\beta_j\) separately
  • the confidence ellipsoid controls \(\vbeta\) jointly

Confidence interval for \(\mu(\vx)\) assuming \(\vveps \sim \Norm(\vzero,\sigma^2 \mI)\)

Recall \(\mu(\vx)=\vx^\top\vbeta\), \(\hmu(\vx)= \mu(\vx) + \vx^\top (\mX^\top\mX)^{-1}\mX^\top\vveps \sim \Norm\!\left(\mu(\vx),\sigma^2 \vx^\top (\mX^\top \mX)^{-1} \vx\right)\)

Replacing \(\sigma^2\) by \(S^2\) gives

\[ \frac{ \hmu(\vx)-\mu(\vx) }{ S \sqrt{\vx^\top(\mX^\top\mX)^{-1}\vx} } \sim t_{n-d} \]

Therefore

\[ \hmu(\vx) \pm t_{n-d,\alpha/2} \; S \sqrt{\vx^\top(\mX^\top\mX)^{-1}\vx} \]

is a \((1-\alpha)\) confidence interval for \(\mu(\vx)\)

Prediction interval for \(Y\mid \vx\)

A new observation at \(\vx\), \(Y_{\text{new}} = \vx^\top\vbeta+\varepsilon_{\text{new}}\) (i.e., a draw from \(Y \mid \vx\)), depends both on

  • the value of \(\vx\) and
  • the new error \(\varepsilon_{\text{new}}\sim\Norm(0,\sigma^2)\) that is independent of the data used to fit the model

Predicting \(Y_{\text{new}}\) by \(\hmu(\vx)\) has uncertainty coming both from the uncertainty in the model parameters and the new error term:

\[\begin{align*} Y_{\text{new}}-\hmu(\vx) & = \bigl(\vx^\top\vbeta+\varepsilon_{\text{new}}\bigr) - \bigl(\vx^\top\vbeta + \vx^\top (\mX^\top\mX)^{-1}\mX^\top\vveps \bigr) \\ & = \varepsilon_{\text{new}} - \vx^\top (\mX^\top\mX)^{-1}\mX^\top\vveps\\ & \sim \Norm\!\left(0,\; \sigma^2\left( 1+\vx^\top(\mX^\top\mX)^{-1}\vx \right)\right) \qquad \text{since $\varepsilon_{\text{new}}$ is independent of $\vveps$} \end{align*}\]

Replacing \(\sigma^2\) by \(S^2\) gives a \(t\)-distribution for the standardized prediction error. A \((1-\alpha)\) prediction interval for \(Y_{\text{new}}\) is then given by

\[ Y_{\text{new}} \in \hmu(\vx) \pm t_{n-d,\alpha/2} S \sqrt{ 1+\vx^\top(\mX^\top\mX)^{-1}\vx } \]

Mean vs prediction intervals

Confidence interval for mean response, \(\mu(\vx)\):

\[ \hmu(\vx) \pm t_{n-d,\alpha/2} S \sqrt{\vx^\top(\mX^\top\mX)^{-1}\vx} \]

Prediction interval for a new observation, \(Y \mid \vx\):

\[ \hmu(\vx) \pm t_{n-d,\alpha/2} S \sqrt{ 1+\vx^\top(\mX^\top\mX)^{-1}\vx } \]

  • Prediction intervals are wider than confidence intervals

  • These pointwise intervals quantify uncertainty at each fixed \(\vx\), not simultaneously for all \(\vx\) at once
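Both intervals above follow the same recipe and differ only by the \(1+\) inside the square root. A sketch on simulated data (design, coefficients, the new point \(\vx\), and seed are our own illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, d, alpha = 30, 2, 0.05
xv = rng.uniform(0, 3, size=n)
X = np.column_stack([np.ones(n), xv])
Y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
S = np.sqrt(np.sum((Y - X @ beta_hat) ** 2) / (n - d))
tcrit = stats.t.ppf(1 - alpha / 2, n - d)

x_new = np.array([1.0, 1.5])                 # predictor vector (intercept, x)
mu_hat = x_new @ beta_hat
q = x_new @ XtX_inv @ x_new                  # x^T (X^T X)^{-1} x
ci = (mu_hat - tcrit * S * np.sqrt(q), mu_hat + tcrit * S * np.sqrt(q))
pi = (mu_hat - tcrit * S * np.sqrt(1 + q), mu_hat + tcrit * S * np.sqrt(1 + q))
```

Since \(\sqrt{1+q} > \sqrt{q}\), the prediction interval always strictly contains the confidence interval.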

Simultaneous confidence and prediction bands

  • Pointwise intervals control uncertainty at one fixed \(\vx\); “\(\forall \vx\)” is outside the probability statement

  • A band controls uncertainty simultaneously for all \(\vx\); “\(\forall \vx\)” is inside the probability statement

\[\begin{multline*} \bigl|\hmu(\vx)-\mu(\vx)\bigr| = \bigl|\vx^\top(\hvbeta-\vbeta)\bigr| = \bigl|[(\mX^\top\mX)^{-1/2}\vx]^\top [(\mX^\top\mX)^{1/2} (\hvbeta-\vbeta)]\bigr| \\ \le \sqrt{\vx^\top(\mX^\top\mX)^{-1}\vx} \sqrt{(\hvbeta-\vbeta)^\top(\mX^\top\mX)(\hvbeta-\vbeta)} \quad \text{for all }\vx \qquad \text{by Cauchy--Schwarz} \end{multline*}\]

So the confidence ellipsoid, \((\hvbeta-\vbeta)^\top(\mX^\top\mX)(\hvbeta-\vbeta) \le dS^2F_{d,n-d,\alpha}\), which holds with probability \(1-\alpha\), implies the following \((1-\alpha)\) confidence band:

\[ \mu(\vx) \in \hmu(\vx) \pm \sqrt{dF_{d,n-d,\alpha}}\; S\sqrt{\vx^\top(\mX^\top\mX)^{-1}\vx} \qquad \text{for all }\vx \]

Similarly, a \((1-\alpha)\) prediction band is

\[ Y_{\text{new}} \in \hmu(\vx) \pm \sqrt{dF_{d,n-d,\alpha}}\; S\sqrt{1+\vx^\top(\mX^\top\mX)^{-1}\vx} \qquad \text{for all }\vx \]

  • Pointwise intervals use the \(t\) multiplier because they control one fixed \(\vx\)

  • Simultaneous bands use the larger \(\sqrt{dF_{d,n-d,\alpha}}\) multiplier because they must work for all \(\vx\) at once

Simultaneous bands are wider than pointwise intervals because they must cover all \(\vx\) values at once, not just one fixed \(\vx\)
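The two multipliers can be compared directly with scipy (the \(n\), \(d\), \(\alpha\) values below are arbitrary illustrative choices):

```python
import numpy as np
from scipy import stats

n, d, alpha = 30, 2, 0.05
t_mult = stats.t.ppf(1 - alpha / 2, n - d)                 # pointwise multiplier
band_mult = np.sqrt(d * stats.f.ppf(1 - alpha, d, n - d))  # simultaneous multiplier

# With a single parameter (d = 1) the multipliers coincide: t^2 = F_{1, n-1}
t1 = stats.t.ppf(1 - alpha / 2, n - 1)
f1 = stats.f.ppf(1 - alpha, 1, n - 1)
```

For \(d \ge 2\) the band multiplier exceeds the \(t\) multiplier; for \(d=1\) they agree because \(t_{\nu}^2 \sim F_{1,\nu}\).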

Computing regression quantities via QR decomposition

Let \(\mX \in \mathbb{R}^{n \times d}\) have full column rank, and write its QR decomposition as

\[ \underbrace{\mX}_{n \times d} = \underbrace{\mQ}_{n \times d}\, \underbrace{\mR}_{d \times d} \]
  • \(\mQ \in \mathbb{R}^{n \times d}\) has orthonormal columns (\(\mQ^\top \mQ = \mI_d\))
  • \(\mR \in \mathbb{R}^{d \times d}\) is upper triangular

Then it follows that

  • Least squares estimate: \(\hvbeta = \mR^{-1} (\mQ^\top \vY)\)

  • Projection matrix: \(\mP = \mQ \mQ^\top\)

  • Fitted values \(\hvY = \mQ (\mQ^\top \vY)\)

  • Residuals: \(\hvveps = \vY - \hvY\)

  • Covariance of \(\hvbeta\): \(\cov(\hvbeta) = \sigma^2 \mR^{-1} \mR^{-\top}\)

    • Variance of \(\hbeta_j\): \(\sigma^2 \bigl[(\mR^{-1} \mR^{-\top})\bigr]_{jj}\), where \(\bigl[(\mR^{-1} \mR^{-\top})\bigr]_{jj}\) denotes the \(j\)th diagonal entry of \(\mR^{-1} \mR^{-\top}\)
  • Variance of \(\hmu(\vx)\): \(\var(\hmu(\vx)) = \sigma^2 \lVert (\mR^{-\top} \vx)\rVert^2\)
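A quick numpy check that the QR-based formulas reproduce the normal-equation quantities (the design, response, and seed below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 30, 3
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

Q, R = np.linalg.qr(X)                       # thin QR: Q is n x d, R is d x d
beta_qr = np.linalg.solve(R, Q.T @ Y)        # solve R beta = Q^T Y
Y_fit = Q @ (Q.T @ Y)                        # fitted values, since P = Q Q^T
leverages = np.sum(Q**2, axis=1)             # h_ii = ||Q_{i,.}||^2

# Reference computation via the normal equations
beta_ne = np.linalg.solve(X.T @ X, X.T @ Y)
P = X @ np.linalg.solve(X.T @ X, X.T)
```

The QR route avoids forming \(\mX^\top \mX\), which squares the condition number, so it is the numerically preferred way to compute these quantities.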

Regression diagnostics

(CB §11.3; WMS §11.14)

Up to this point, we have assumed that the chosen linear model is appropriate.

 

In practice, we should also check whether

  • the model assumptions are reasonable
  • the chosen set of predictors is adequate
  • our assumptions about the errors are justified

The main questions are:

  • Is the mean response approximately linear in the chosen predictors, which themselves may be nonlinear transformations of underlying variables?
  • Does the variability of the errors stay roughly constant with respect to \(\vx\) and \(\mu(\vx):=\Ex(Y \mid \vx)\)?
  • Are the errors approximately normally distributed?
  • Are there unusual or influential observations that strongly affect the fit?

Residuals versus fitted values

  • Fitted values \(\hvY = \mX \hvbeta = \mP \vY\) are the predicted mean responses

  • Residuals \(\hvveps = \vY - \hvY = (\mI - \mP)\vY\) are the differences between observed and fitted values

Under the normal linear model, the fitted values and residuals are independent; more generally, they are uncorrelated. A good first diagnostic is a plot of residuals versus fitted values.

If the linear model is appropriate, then the residual plot should show

  • points scattered around \(0\)
  • no strong curve or pattern
  • roughly constant vertical spread across the range of fitted values

A residual plot should look like random noise around zero, not like a pattern.

Standardized residuals

Because the variability of \(\hveps_i\) depends on the leverage of observation \(i\), it is useful to scale residuals.

A standardized residual is

\[ r_i = \frac{\hveps_i}{S\sqrt{1-h_{ii}}}, \]

where \(h_{ii}\) is the \(i\)th diagonal entry of the projection (hat) matrix \(\mP\) (written \(h_{ii}\) rather than \(p_{ii}\) because \(p\) is commonly reserved for probabilities)

Large values of \(|r_i|\) indicate observations whose responses are unusual relative to the fitted model.

As a rough rule of thumb:

  • \(|r_i| > 2\): worth attention
  • \(|r_i| > 3\): potentially serious
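Standardized residuals and leverages can be computed together from the thin QR factorization. A sketch on simulated data (design, coefficients, and seed are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 30, 2
x = rng.uniform(0, 3, size=n)
X = np.column_stack([np.ones(n), x])
Y = X @ np.array([0.1, 0.9]) + rng.normal(0, 1, size=n)

Q, _ = np.linalg.qr(X)
h = np.sum(Q**2, axis=1)                     # leverages h_ii
resid = Y - Q @ (Q.T @ Y)                    # raw residuals (I - P) Y
S = np.sqrt(np.sum(resid**2) / (n - d))
r = resid / (S * np.sqrt(1 - h))             # standardized residuals
flagged = np.flatnonzero(np.abs(r) > 2)      # observations worth attention
```

Note that the leverages sum to \(\trace(\mP) = d\), a useful sanity check.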

Normal Q–Q plot

Let \(r_1,\dots,r_n\) be standardized residuals, and let \(r_{(1)} \le \cdots \le r_{(n)}\) denote the sorted values.

 

Define plotting positions and corresponding normal quantiles: \[ p_i = \frac{i - 1/2}{n}, \qquad i=1,\dots,n \qquad z_i = \Phi^{-1}(p_i), \qquad i=1,\dots,n \]

Plot the points \((z_i, r_{(i)})\), \(i=1,\dots,n\)

 

  • If the residuals are Normal, the points lie approximately on \(y = x\)
  • Systematic deviations, such as curvature, indicate non-normality
  • This matters most for small samples
  • For large samples, moderate nonnormality is often less serious for estimation, though it may still affect inference

Why not a P–P plot?

A P–P plot would instead plot \[ \bigl(p_i, \Phi(r_{(i)})\bigr) \]

  • This checks whether the \(\Phi(r_i)\) are approximately \(\Unif[0,1]\)
  • However, it compresses the tails and is less sensitive to deviations there

 

👉 Q–Q plots are preferred because they better reveal tail departures from Normality

Projection (hat) matrix and leverage

The fitted values satisfy

\[ \hvY = \mP \vY, \qquad \mP := \mX(\mX^\top \mX)^{-1}\mX^\top \]

The matrix \(\mP\) is sometimes called the hat matrix because it puts a “hat” on \(\vY\)

Its diagonal entries \(h_{ii} = (\mP)_{ii} = \|\vQ_{i,\cdot}\|^2\), where \(\mX = \mQ \mR\) is the QR decomposition of \(\mX\), are called the leverages

  • A large value of \(h_{ii}\) means that observation \(i\) has an unusual predictor vector \(\vx_i\)
  • Such observations can have a large effect on the fitted regression
  • High leverage by itself is not necessarily bad, but it does mean that the observation deserves attention

Our example

Here the diagnostics look pretty good:

  • The residuals show no strong pattern and roughly constant spread across fitted values and \(x\)

  • The normal Q-Q plot shows only mild departures from normality

  • The leverage plot shows that no observations have unusually high leverage

When things go wrong

Heteroscedasticity

Suppose the mean model is still correct,

\[ Y_i = \beta_1 + \beta_2 x_i^2 + \varepsilon_i, \]

but the error variance is no longer constant:

\[ \varepsilon_i \sim \Norm(0,\sigma_i^2), \qquad \sigma_i = 0.25 + 0.35 x_i \]

So:

  • \(\Ex(Y_i \mid x_i) = \beta_1 + \beta_2 x_i^2\) is still correct
  • but \(\var(Y_i \mid x_i)\) increases with \(x_i\)

This violates the constant-variance assumption

  • A funnel shape in the residual plot suggests that the variance is changing with the mean or with \(x\)

  • Possible responses include transforming the response, modeling the variance, or using weighted least squares

What might help?

  • transform the response (for example, \(\log Y\) when appropriate)
  • use weighted least squares if the variance changes systematically
  • reconsider the mean model, since a bad mean model can also create patterns
  • for inference, use heteroscedasticity-robust standard errors

A high-leverage point

Suppose the mean model is still

\[ Y_i = \beta_1 + \beta_2 x_i^2 + \varepsilon_i, \qquad \varepsilon_i \sim \Norm(0,\sigma^2), \]

but now one observation has a predictor value far from the rest

For example, most of the data may have \(0 \le x_i \le 3\), but one point might have \(x_i = 4.5\)

Then:

  • the point may not have a large residual
  • but it can still have high leverage
  • and it can strongly affect the fitted regression

A point can have a large effect on the fitted model because its predictor value is unusual, even if its residual is not especially large.

What might help?

  • check whether the unusual predictor value is a data-entry error
  • determine whether the point belongs to the same population as the others
  • compare fits with and without the point
  • collect more data in that region of predictor space if possible

Summary of common diagnostic plots

Common diagnostic plots include:

  • Standardized residuals versus fitted values checks linearity and constant variance

  • Standardized residuals versus a predictor can reveal patterns that are easier to see when there is one main predictor

  • Normal Q-Q plot checks approximate normality of residuals

  • Leverage plot checks which observations have unusual predictor values

What to do if diagnostics raise concerns?

If diagnostics reveal problems, one might

  • transform the response or some predictors
  • add additional predictors or interaction terms
  • use weighted least squares if variance is not constant
  • check for data-entry errors or unusual observations
  • reconsider whether a different model is more appropriate

Diagnostics do not automatically tell us the right fix, but they do tell us where the model may be going wrong

Model selection

(CB §11.4; WMS §11.15)

  • Adding more predictors almost always improves the fit to the observed data
    • Reduces residual sum of squares (RSS)
  • However, a more complicated model is not better if the improvement is insignificant

Model selection aims to balance:

  • goodness of fit
  • model simplicity
  • interpretability
  • predictive performance

Comparing nested models

Suppose we compare two nested models:

  • reduced model with \(d_0\) parameters and residual sum of squares \(\RSS_0\)
  • full model with \(d_1 > d_0\) parameters and residual sum of squares \(\RSS_1\)

Since the full model has more predictors, we always have

\[ \RSS_1 \le \RSS_0 \]

The question is whether the reduction in RSS is large enough to justify the extra parameters.

F-test for nested models

To test whether the additional predictors in the full model are needed, use

\[ F = \frac{(\RSS_0-\RSS_1)/(d_1-d_0)}{\RSS_1/(n-d_1)} \]

Under the null hypothesis \(H_0: \beta_{d_0+1} = \cdots = \beta_{d_1} = 0\), and assuming the model assumptions hold, \(\RSS_0-\RSS_1\) and \(\RSS_1\) are independent scaled chi-square random variables, so

\[ F \sim F_{d_1-d_0,\;n-d_1} \]

  • \(F > F_{d_1-d_0,\;n-d_1, \alpha}\) is compelling evidence to reject \(H_0\) and conclude that

    extra predictors improve the model

  • Small \(F\) suggests the simpler model is sufficient
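The nested-model \(F\)-test takes only a few lines with numpy and scipy. A sketch using the same model structure as the running example, but with our own simulated data and seed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 30
x = rng.uniform(0, 3, size=n)
Y = 0.10 + 0.90 * x**2 + rng.normal(0, 1, size=n)

def rss(X, Y):
    """Residual sum of squares of the least-squares fit of Y on X."""
    bh, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.sum((Y - X @ bh) ** 2)

X_full = np.column_stack([np.ones(n), x**2])   # d1 = 2: intercept + x^2
X_red = (x**2)[:, None]                        # d0 = 1: x^2 only
rss1, rss0 = rss(X_full, Y), rss(X_red, Y)

d0, d1 = 1, 2
F = ((rss0 - rss1) / (d1 - d0)) / (rss1 / (n - d1))
p_value = stats.f.sf(F, d1 - d0, n - d1)       # P(F_{d1-d0, n-d1} > F)
```

Since the reduced design's columns lie in the span of the full design's columns, \(\RSS_1 \le \RSS_0\) automatically, and the \(p\)-value comes from the upper tail of \(F_{d_1-d_0,\,n-d_1}\).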

Using \(t\)-tests in model selection

A natural way to simplify a model is to examine individual coefficients:

  • If a confidence interval for \(\beta_j\) includes 0, the predictor may be unnecessary
  • If the interval is clearly away from 0, the predictor appears important

 

This provides a useful first pass for identifying weak predictors

 

However, caution is needed:

  • Predictors may be correlated, so their effects can be shared
  • A variable may be insignificant alone but important together with others
  • Decisions based only on individual tests can be unstable

Suggested strategy

  • \(t\)-tests identify candidates for removal, but do not imply automatic removal
  • Compare models using an \(F\)-test (or another criterion)

 

Model selection should be based on the model as a whole, not just individual coefficients

Coefficient of determination \(R^2\)

To measure goodness of fit, we decompose the variability in the response:

\[\begin{gather*} \underbrace{\sum_{i=1}^n (Y_i - \barY)^2}_{\SST = \text{total sum of squares}} = \underbrace{\sum_{i=1}^n (\hY_i - \barY)^2}_{\SSR = \text{regression sum of squares}} + \underbrace{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}_{\RSS = \text{residual sum of squares}} \\ R^2 = \frac{\SSR}{\SST} = 1 - \frac{\RSS}{\SST} \in [0,1] \end{gather*}\]

  • Interpreted as the proportion of variability in \(Y\) explained by the model

  • Larger \(R^2\) indicates better fit, but it has a major drawback:

    • Never decreases when more predictors are added

Adjusted \(R^2\)

Adjusted \(R^2\) corrects for this by penalizing model complexity:

\[ R^2_{\text{adj}} = 1-\frac{\RSS/(n-d)}{\SST/(n-1)} \]

  • Larger values are better
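Both fit measures follow directly from the sums of squares. A numpy sketch on simulated data (design, coefficients, and seed are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 30, 2
x = rng.uniform(0, 3, size=n)
X = np.column_stack([np.ones(n), x**2])
Y = 0.10 + 0.90 * x**2 + rng.normal(0, 1, size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_fit = X @ beta_hat
RSS = np.sum((Y - Y_fit) ** 2)               # residual sum of squares
SST = np.sum((Y - Y.mean()) ** 2)            # total sum of squares
SSR = np.sum((Y_fit - Y.mean()) ** 2)        # regression sum of squares

R2 = 1 - RSS / SST
R2_adj = 1 - (RSS / (n - d)) / (SST / (n - 1))
```

Because the model contains an intercept, \(\SST = \SSR + \RSS\) holds exactly, and \(R^2_{\text{adj}} \le R^2\) always.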

Decomposition of sums of squares

Assume the model includes an intercept, so \(\vone\in\cc(\mX)\) and recall that \(\hvY=\mP\vY \in \cc(\mX)\) and \(\vY-\hvY=(\mI-\mP)\vY\in\cc(\mX)^\perp\)

Then we can decompose the total variability in \(\vY\) as \[\begin{align*} \SST & = \sum_{i=1}^n (Y_i - \barY)^2 = \lVert \vY-\barY\vone\rVert^2 \\ & = \lVert (\hvY-\barY\vone) + (\vY-\hvY) \rVert^2 \\ & = \lVert \underbrace{\hvY-\barY\vone}_{\in \cc(\mX)}\rVert^2 + \lVert \underbrace{\vY-\hvY}_{\in \cc(\mX)^\perp} \rVert^2 + 2 \underbrace{(\hvY-\barY\vone)^\top(\vY-\hvY)}_{\text{dot product of orthogonal vectors}} \\ & = \sum_{i=1}^n (\hY_i - \barY)^2 + \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 + 0 \\ & = \SSR + \RSS \end{align*}\]

Other metrics and methods for model selection

  • Akaike information criterion (AIC) rewards good fit but penalizes complexity
  • Bayesian information criterion (BIC) penalizes complexity more strongly than AIC
  • Cross-validation (CV) separates the data into a training set and a testing set to estimate predictive performance on new data
  • Forward selection, backward elimination, and stepwise regression are automated procedures for adding or removing predictors based on criteria like AIC or BIC

Practical principles for model selection

When comparing regression models:

  • Prefer simpler models if they explain the data nearly as well
  • Use subject-matter knowledge, not just automatic procedures
  • Check diagnostics again after selecting a model
  • Remember that the “best” model for explanation may differ from the “best” model for prediction

A good regression model should fit well, make scientific sense, and generalize to new data.

Our example revisited

\[ y = 0.10 + 0.90 x^2 + \varepsilon, \qquad \varepsilon \sim \Norm(0, 1.00) \]

The \(p\)-value is not small, so there is no compelling evidence that the intercept is needed

Coefficient Estimates (Full Model)

Term        Estimate   Std. Error   t value   p value   CI Lower   CI Upper
Intercept    -0.2341       0.3021   -0.7749    0.4449    -0.8529     0.3847
x^2           0.9830       0.0732   13.4323    0.0000     0.8331     1.1329

The confidence interval for the intercept includes 0, suggesting it may not be necessary, while the interval for the \(x^2\) term is clearly away from 0, indicating it is important

Model Comparison

Model                    Parameters (d)   RSS
Reduced (no intercept)                1   35.512100
Full                                  2   34.766600

The RSS for the full model is not much smaller than the RSS for the reduced model, suggesting that the intercept may not be needed

F-test for Removing Intercept

Quantity Value
F statistic 0.600400
Numerator df (d_full − d_reduced) 1.000000
Denominator df (n − d_full) 28.000000
p-value 0.444900

Goodness of Fit

Metric Value
R^2 (full model) 0.865700
Adjusted R^2 (full model) 0.860900
Uncentered R^2 (reduced model, no intercept) 0.927300

For models without an intercept, we use the uncentered \(R^2\):

\[ R^2_{\text{unc}} = 1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^n Y_i^2} \]

  • Measures fit relative to 0 (not \(\barY\))
  • Not directly comparable to the usual \(R^2\)

Diagnostic plots for the reduced model

Confidence/prediction intervals/bands for the reduced model

  • Because the reduced model has only one regression parameter, the “simultaneous” and “pointwise” intervals are the same in this case

  • The reduced model fits the data well, even though it is not of the same form as the true model

Reduced model caveat

  • Removing the intercept forces \(\mu(0)=0\)

  • Near \(x=0\), the confidence band misses the true mean (it shrinks like \(x^2\))

    • Widening the band does not help
  • This is model misspecification, not a band issue

  • Confidence bands reflect uncertainty given the model, not model correctness

  • From the observed data, there is no clear evidence that an intercept is needed

F-tests: A unifying framework

From Normal to \(\chi^2\) to \(F\)

  • If \(Z_1,\dots,Z_\nu \overset{\text{iid}}{\sim} \mathcal{N}(0,1)\), then \[ \sum_{i=1}^{\nu} Z_i^2 \sim \chi^2_\nu, \text{ a chi-square distribution with $\nu$ degrees of freedom} \] Note that \(\Ex(\chi^2_\nu) = \nu\) and \(\var(\chi^2_\nu) = 2\nu\)

  • If \(U \sim \chi^2_\nu\) and \(V \sim \chi^2_\lambda\) are independent, then \[ \frac{U/\nu}{V/\lambda} \sim F_{\nu,\lambda}, \text{ an $F$-distribution with $\nu$ and $\lambda$ degrees of freedom} \] Note that \(\Ex(F_{\nu,\lambda}) = \frac{\lambda}{\lambda - 2}\) for \(\lambda > 2\) and \(\var(F_{\nu,\lambda}) = \frac{2\lambda^2(\nu + \lambda - 2)}{\nu(\lambda - 2)^2(\lambda - 4)}\) for \(\lambda > 4\)

Comparing variances

  • For two independent samples: \[ \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1-1,\;n_2-1} \]

  • To test \(H_0:\ \sigma_1^2 = \sigma_2^2\), use \[ F = \frac{S_1^2}{S_2^2} \]

  • Put the larger sample variance in the numerator

    • ensures \(F \ge 1\)
    • simplifies one-sided rejection

Regression and ANOVA

  • In nested linear models where \(H_0\) assumes the reduced model is correct: \[\begin{align*} F & = \frac{\text{extra explained variation (gain in fit) per added parameter}} {\text{residual variation (noise) per degree of freedom}} \\ &= \frac{(\text{RSS}_0 - \text{RSS}_1)/(d_1 - d_0)} {\text{RSS}_1/(n - d_1)} \sim F_{d_1 - d_0,\, n - d_1} \end{align*}\]

  • Large \(F \implies\) reduced model is inadequate, (compelling evidence to reject \(H_0\))

  • Both numerator and denominator estimate \(\sigma^2\)

  • Under \(H_0\), the extra parameters explain only noise, so the ratio is near 1

  • A large ratio indicates signal beyond noise

Inverting degrees of freedom

  • If \(F \sim F_{\nu,\lambda}\), then \[ \frac{1}{F} \sim F_{\lambda,\nu} \]

  • Therefore, critical values satisfy \[ F_{\nu,\lambda,\alpha} = \frac{1}{F_{\lambda,\nu,\,1-\alpha}} \]

  • Useful when tables only give one ordering
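The reciprocal identity for critical values is easy to confirm with scipy (the degrees of freedom and \(\alpha\) below are arbitrary illustrative choices; recall \(F_{\nu,\lambda,\alpha}\) denotes the upper-\(\alpha\) point):

```python
from scipy import stats

nu, lam, alpha = 3, 20, 0.05
# Upper-alpha critical value F_{nu,lam,alpha}: P(F_{nu,lam} > value) = alpha
lhs = stats.f.ppf(1 - alpha, nu, lam)
# 1 / F_{lam,nu,1-alpha}: the upper-(1-alpha) point has lower-tail area alpha
rhs = 1.0 / stats.f.ppf(alpha, lam, nu)
```

The two values agree to numerical precision, matching \(F_{\nu,\lambda,\alpha} = 1/F_{\lambda,\nu,1-\alpha}\).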

Big picture

  • \(F\)-tests compare two independent estimates of variance

  • Unify

    • variance comparison
    • ANOVA
    • regression model selection
  • Interpretation depends on context:

    • variance comparison: are two noise levels equal?
    • model selection: \(\displaystyle\frac{\text{extra variation}}{\text{noise}}\)

From Regression to One-Way Analysis of Variance (ANOVA) and Back

(CB §11.2; WMS Ch. 13)

  • So far, we modeled relationships with quantitative predictors

  • Now we consider a categorical predictor (apples and oranges, not numbers)

Do different groups have different means?

Set-up

  • \(Y_{ij}\), \(i=1,\ldots,k,\; j=1,\ldots,n_i\), denotes IID observation \(j\) from group \(i\)

  • Total sample size is \(n = \sum_{i=1}^k n_i\)

  • Compare the group means, \(\mu_i = \Ex(Y_{ij})\), \(i=1,\ldots,k\)

Model for one-way ANOVA

Model

\[ Y_{ij} = \mu_i + \varepsilon_{ij}, \qquad \Ex(Y_{ij}) = \mu_i, \quad \varepsilon_{ij} \sim \Norm(0,\sigma^2), \quad \text{independent} \]

Hypotheses

\[ H_0: \mu_1 = \mu_2 = \cdots = \mu_k, \qquad H_a: \text{not all } \mu_i \text{ are equal} \]

Sample means

\[ \text{Group: } \barY_i = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}, \, i=1, \ldots, k, \qquad \text{Overall: } \barY = \frac{1}{n} \sum_{i=1}^k \sum_{j=1}^{n_i} Y_{ij} \]

Variability

\[\begin{align*} \SST &= \SSB + \SSW \qquad \text{$\exstar$ verify this}\\ \text{Sum of Squares Total ($\SST$)} & = \sum_{i=1}^k \sum_{j=1}^{n_i} (Y_{ij} - \barY)^2 \\ & \qquad \qquad \text{measures how spread out all observations are} \\ \text{Sum of Squares Between ($\SSB$)} & = \sum_{i=1}^k n_i (\barY_i - \barY)^2 \\ & \qquad \qquad \text{measures how spread out the group means are} \\ \text{Sum of Squares Within ($\SSW$)} & = \sum_{i=1}^k \sum_{j=1}^{n_i} (Y_{ij} - \barY_i)^2 \\ & \qquad \qquad \text{measures how spread out the observations are within each group} \end{align*}\]

Compare signal (between) vs noise (within)

Mean square and \(F\)-statistic

\[\begin{gather*} \MSB= \frac{\SSB}{\underbrace{k-1}_{\text{degrees of freedom between}}} \qquad \text{and}\qquad \MSW = \frac{\SSW}{\underbrace{n-k}_{\text{degrees of freedom within}}} \text{ (estimator of $\sigma^2$)} \\ F = \frac{\text{MSB}}{\text{MSW}} \sim F_{k-1,\,n-k} \qquad \text{under }H_0: \mu_1 = \mu_2 = \cdots = \mu_k \end{gather*}\]

Interpretation of the F-test

  • \(F > F_{k-1, n-k, \alpha}\) gives evidence at level \(\alpha\) to reject \(H_0\) and conclude that not all group means are equal

  • Small \(F\) means there is insufficient evidence of differences among the group means

ANOVA detects differences, not which groups differ

ANOVA table

\[ \begin{array}{lccc} \text{Source} & \text{Sum of Squares} & \text{df} & \text{Mean Square} \\[6 pt] \hline \text{Between groups} & \SSB & k-1 & \MSB \\ \text{Within groups} & \SSW & n-k & \MSW \\[10 pt] \hline \text{Total} & \SST & n-1 & \end{array} \]

 

Big picture

  • One-way ANOVA compares multiple means

  • Uses variance decomposition to compare variability between groups to variability within groups

  • Leads to a single global test

Example: one-way ANOVA data

Here are three groups of data:

| Group 1 | Group 2 | Group 3 |
|---|---|---|
| 8.2 | 8.0 | 10.1 |
| 7.9 | 8.3 | 9.8 |
| 8.4 | 7.8 | 10.4 |
| 8.1 | 8.1 | 10.0 |
|  | 8.2 |  |

| Group | \(n\) | Mean | Variance |
|---|---|---|---|
| 1 | 4 | 8.150 | 0.0433 |
| 2 | 5 | 8.080 | 0.0370 |
| 3 | 4 | 10.075 | 0.0625 |

| Source | Sum of Squares | df | Mean Square (MS) |
|---|---|---|---|
| Between groups (B) | 10.6914 | 2 | 5.3457 |
| Within groups (W) | 0.4655 | 10 | 0.0465 |
| Total | 11.1569 | 12 |  |

 

\[\begin{gather*} H_0:\mu_1=\mu_2=\mu_3 \text{ vs} \\ H_a:\text{not all } \mu_i \text{ are equal} \end{gather*}\]

| Quantity | Value |
|---|---|
| \(F = \MSB/\MSW\) | 114.8381 |
| df between | 2 |
| df within | 10 |
| \(p\)-value | 0.000000 |
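These numbers can be reproduced with `scipy.stats.f_oneway`, assuming `scipy` is available (the data are read off the table above):

```python
# One-way ANOVA for the three-group example above.
from scipy import stats

g1 = [8.2, 7.9, 8.4, 8.1]
g2 = [8.0, 8.3, 7.8, 8.1, 8.2]
g3 = [10.1, 9.8, 10.4, 10.0]

F, p = stats.f_oneway(g1, g2, g3)
print(round(F, 4), p)  # F ≈ 114.84 on (2, 10) df; p is essentially 0
```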

ANOVA as a regression model

We can rewrite the ANOVA model using indicator functions.

\[ Y_{ij} = \mu_i + \varepsilon_{ij} = \sum_{\ell=1}^k \mu_\ell \, \indic(i = \ell) + \varepsilon_{ij} \]

  • \(\indic(i=\ell)=1\) if observation \((i,j)\) is in group \(\ell\), and \(0\) otherwise

  • For each observation, exactly one indicator is \(1\), so \[ \sum_{\ell=1}^k \indic(i=\ell) = 1 \]

ANOVA is a regression model with categorical predictors

In vector/matrix form

\[\begin{equation*} \vY = \mX \vmu + \vveps \qquad \text{where } \mX = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 0 & \cdots & 1 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & 0 & 0 & \cdots & 0 & 0 \end{pmatrix} \end{equation*}\]

  • Each row of \(\mX\) has exactly one \(1\) and the rest \(0\)

No intercept needed

Because

\[ \sum_{\ell=1}^k \indic(i=\ell) = 1, \]

including an intercept would make the columns of \(\mX\) linearly dependent

We include all group indicators and omit the intercept so that \(\mX\) has full column rank
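A sketch of this design matrix for the three-group example above, assuming `numpy` (group sizes 4, 5, 4 as in the data):

```python
# Indicator design matrix for one-way ANOVA: full column rank without an
# intercept; adding an intercept column makes the columns linearly dependent.
import numpy as np

y = np.array([8.2, 7.9, 8.4, 8.1,          # group 1
              8.0, 8.3, 7.8, 8.1, 8.2,     # group 2
              10.1, 9.8, 10.4, 10.0])      # group 3
groups = np.repeat([0, 1, 2], [4, 5, 4])   # group label of each observation
X = (groups[:, None] == np.arange(3)).astype(float)  # one indicator column per group

assert np.linalg.matrix_rank(X) == 3  # full column rank
# with an intercept column prepended, the rank does not increase
assert np.linalg.matrix_rank(np.column_stack([np.ones(len(y)), X])) == 3

mu_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(mu_hat)  # least-squares coefficients are the group means [8.15, 8.08, 10.075]
```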

Connection to least squares

  • The \(\mu_\ell\) are the regression coefficients and correspond to the group means \(\Ex(Y_{ij} \mid i=\ell)\)

  • \(\mX^\top \mX = \begin{pmatrix} n_1 & 0 & \cdots & 0 \\ 0 & n_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & n_k \end{pmatrix}\) so \(\hvmu = (\mX^\top \mX)^{-1} \mX^\top \vY = \begin{pmatrix} \hmu_1 \\ \hmu_2 \\ \vdots \\ \hmu_k \end{pmatrix}\) with \(\hmu_\ell = \displaystyle \frac{1}{n_\ell} \sum_{j=1}^{n_\ell} Y_{\ell j} =: \barY_\ell\)

  • \(\displaystyle \underbrace{\sum_{i=1}^k \sum_{j=1}^{n_i} (Y_{ij} - \barY)^2}_{\SST} = \underbrace{\sum_{i=1}^k n_i (\barY_i - \barY)^2}_{\SSB} + \underbrace{\sum_{i=1}^k \sum_{j=1}^{n_i} (Y_{ij} - \barY_i)^2}_{\SSW}\)

  • The one-way ANOVA test statistic, \(\displaystyle F = \frac{\MSB}{\MSW} = \frac{\SSB/(k-1)}{\SSW/(n-k)}\),

    is exactly the regression F-test comparing

    • the full model (different group means) to the
    • reduced model (all means equal)

ANOVA is just linear regression with indicator variables

  • Same least squares framework
  • Same projection geometry
  • Same F-test machinery

What regression adds

The global ANOVA test examines \(H_0:\mu_1=\mu_2=\cdots=\mu_k\) vs \(H_a:\) not all means are equal, but it does not tell us which means differ

Regression can test whether some means are equal, not just whether all means are equal

Examples of hypotheses we can test

  • Are groups 1 and 2 the same? \(H_0:\mu_1=\mu_2\)

  • Can we pool groups 1 and 2, and also pool groups 3 and 4? \(H_0:\mu_1=\mu_2,\ \mu_3=\mu_4\)

These can all be written in the form \(H_0:\mA \vmu = \vb\)

Testing a linear hypothesis

Consider the model \(\vY = \mX \vmu + \vveps\), \(\vveps \sim \Norm(\vzero, \sigma^2 \mI)\)

and the hypothesis \(H_0:\mA \vmu = \vb\), where \(\mA\) is an \(r \times k\) matrix of full row rank

F-statistic

The test statistic under \(H_0\) is

\[ F = \frac{(\RSS_{\mA\vmu = \vb} - \RSS_{\text{full}})/r}{\RSS_{\text{full}}/(n-k)} = \frac{ (\mA \hvmu - \vb)^\top \bigl[\mA(\mX^\top \mX)^{-1}\mA^\top\bigr]^{-1} (\mA \hvmu - \vb)/r }{ S^2 } \sim F_{r,n-k} \quad \text{under } H_0. \]

  • \(\mA \hvmu - \vb\) measures how far we are from satisfying the constraints

  • \(\mA(\mX^\top \mX)^{-1}\mA^\top\) gives the covariance of \(\mA \hvmu\) up to a factor of \(\sigma^2\)

  • \(S^2 = \RSS/(n-k)\) estimates \(\sigma^2\)

We compare signal (violation of constraints) to noise
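A sketch checking, on simulated data, that the quadratic-form expression for \(F\) agrees with the RSS-difference expression (assumes `numpy`; the data and the constraint are illustrative):

```python
# Linear-hypothesis F-statistic: quadratic form vs. RSS-difference version.
import numpy as np

rng = np.random.default_rng(1)
k, n = 3, 30
groups = np.repeat(np.arange(k), 10)
X = (groups[:, None] == np.arange(k)).astype(float)
y = X @ np.array([1.0, 1.2, 2.0]) + rng.normal(0, 0.5, n)

A = np.array([[1.0, -1.0, 0.0]])  # H0: mu_1 = mu_2
b = np.array([0.0])
r = A.shape[0]

def rss(M, y):
    beta = np.linalg.lstsq(M, y, rcond=None)[0]
    e = y - M @ beta
    return e @ e

mu_hat = np.linalg.lstsq(X, y, rcond=None)[0]
S2 = rss(X, y) / (n - k)

d = A @ mu_hat - b
F_quad = (d @ np.linalg.solve(A @ np.linalg.inv(X.T @ X) @ A.T, d) / r) / S2

X0 = np.column_stack([X[:, 0] + X[:, 1], X[:, 2]])  # reduced design: pool groups 1, 2
F_rss = ((rss(X0, y) - rss(X, y)) / r) / S2

assert np.isclose(F_quad, F_rss)
```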

Example: Testing Whether Two Means Are Equal

We test whether groups 1 and 2 have the same mean

\[ H_0:\mu_1=\mu_2 \qquad \Longleftrightarrow \qquad H_0: \begin{pmatrix} 1 & -1 & 0 \end{pmatrix} \vmu = 0 \]

  • Full model: \(Y_{ij} = \mu_i + \varepsilon_{ij}\)

  • Reduced model under \(H_0:\mu_1=\mu_2\): \(\displaystyle Y_{ij} = \begin{cases} \mu_{12} + \varepsilon_{ij}, & i=1,2 \\ \mu_3 + \varepsilon_{ij}, & i=3 \end{cases}\)

  • \(\displaystyle F = \frac{(\RSS_{\mu_1=\mu_2} - \RSS_{\text{full}})/r}{\RSS_{\text{full}}/(n-k)}\)

| Quantity | Value |
|---|---|
| \(F\) | 0.2339 |
| df (constraint) | 1 |
| df (within) | 10 |
| \(p\)-value | 0.6390 |

 

  • The reduced model forces groups 1 and 2 to have the same mean

  • If this constraint greatly increases the residual sum of squares, then \(H_0\) is not plausible

Reject \(H_0\) if enforcing \(\mu_1=\mu_2\) causes a large increase in error
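The table above can be reproduced via the RSS comparison, assuming `numpy` and `scipy` (the data are from the earlier three-group example):

```python
# F-test of H0: mu_1 = mu_2 for the three-group example, via RSS comparison.
import numpy as np
from scipy import stats

y = np.array([8.2, 7.9, 8.4, 8.1,          # group 1
              8.0, 8.3, 7.8, 8.1, 8.2,     # group 2
              10.1, 9.8, 10.4, 10.0])      # group 3
groups = np.repeat([0, 1, 2], [4, 5, 4])
X_full = (groups[:, None] == np.arange(3)).astype(float)
X_red = np.column_stack([X_full[:, 0] + X_full[:, 1], X_full[:, 2]])  # pool 1 and 2

def rss(M, y):
    beta = np.linalg.lstsq(M, y, rcond=None)[0]
    e = y - M @ beta
    return e @ e

n, k, r = len(y), 3, 1
F = ((rss(X_red, y) - rss(X_full, y)) / r) / (rss(X_full, y) / (n - k))
p = stats.f.sf(F, r, n - k)
print(round(F, 4), round(p, 4))  # F ≈ 0.2339, p ≈ 0.639
```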

Big Picture

ANOVA is a special case of regression, and regression allows targeted comparisons between groups

  • ANOVA tests whether all means are equal
  • Regression lets us test specific relationships such as \(\mu_1=\mu_2\)
  • The same F-test machinery applies
  • This flexibility becomes even more valuable in more complicated models