From decision theory to P-values

The following is a comparison of economic rationality and Null Hypothesis Significance Testing from the perspective of the former. The starting point is maximization of expected utility, which is a standard criterion for economic rationality. Then the focus is on the ingredients in economic rationality that must be thrown away to end up with NHST.

1. Decision theory and economic rationality

Consider a simple situation where there are two mutually exclusive hypotheses that we call h_1 and h_0. It is not known for certain which hypothesis is true; all we can do is assign probabilities P(h_1) and P(h_0). Assume for simplicity that no other hypotheses are in contention, i.e. P(h_1 \lor h_0) = 1 and P(\lnot h_1 \land \lnot h_0) = 0. Connected to the hypotheses are two mutually exclusive actions a_1 and a_0. To quantify the reward or cost associated with the actions, we introduce a utility function. Minimally, the utility depends on which hypothesis is true and which action is performed. The expected utility given that h_i is true and the action a_j is performed is denoted

\displaystyle u(h_i,a_j) = E(U|h_i,a_j).

The standard criterion in decision theory and economics is to choose the action that maximizes the expected utility. In the present simple setting this means comparing the expected utilities

\displaystyle E(U|a_1) = P(h_1) u(h_1,a_1) + P(h_0) u(h_0, a_1),
\displaystyle E(U|a_0) = P(h_1) u(h_1,a_0) + P(h_0) u(h_0, a_0).

If E(U|a_1) > E(U|a_0), then a_1 is the economically rational action. Conversely, if E(U|a_0) > E(U|a_1), then a_0 is the economically rational action.
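
As a numerical sketch of this comparison, with made-up priors and utilities (all numbers below are hypothetical):

```python
# Hypothetical priors and utilities for the two hypotheses and two actions.
P = {"h1": 0.3, "h0": 0.7}

# u[(hypothesis, action)]: utility when that hypothesis is true and that action is taken.
u = {("h1", "a1"): 10, ("h0", "a1"): -5,
     ("h1", "a0"): -2, ("h0", "a0"): 1}

def expected_utility(action):
    # E(U|a) = sum over hypotheses of P(h) * u(h, a)
    return sum(P[h] * u[(h, action)] for h in P)

best = max(["a1", "a0"], key=expected_utility)
print(expected_utility("a1"), expected_utility("a0"), best)
```

With these numbers a_0 wins even though a_1 pays off more when h_1 is true, because h_0 is considered more probable.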

One could stop the exposition of decision theory here, on the note that decision theorists take into account both probabilities and utilities when choosing how to act. However, to better connect to standard statistics, it is useful to also consider the situation after new data becomes available. Once new data x is available, a good Bayesian updates the probabilities. The update from the prior probabilities P(h_i) to the posterior probabilities P(h_i|x) is done using Bayes' theorem:

\displaystyle P(h_i|x) = \frac{P(x|h_i) P(h_i)}{P(x)}

Next, assume for simplicity that the new data does not affect the utilities. In symbols, u(h_i \land x, a) = u(h_i,a). To choose between the actions a_1 and a_0 one should now compare the expected utilities
\displaystyle E(U|x,a_j) = P(h_1|x) u(h_1,a_j) + P(h_0|x) u(h_0, a_j)
\displaystyle = \frac{P(x|h_1) P(h_1)}{P(x)} u(h_1,a_j) + \frac{P(x|h_0) P(h_0)}{P(x)} u(h_0, a_j)
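
The update and the subsequent decision can be sketched numerically; the likelihood values P(x|h_i) below are, like the priors and utilities, hypothetical:

```python
# Hypothetical priors, likelihoods P(x|h_i) for the observed data, and utilities.
prior = {"h1": 0.3, "h0": 0.7}
likelihood = {"h1": 0.8, "h0": 0.2}
u = {("h1", "a1"): 10, ("h0", "a1"): -5,
     ("h1", "a0"): -2, ("h0", "a0"): 1}

# Bayes' theorem: P(h_i|x) = P(x|h_i) P(h_i) / P(x).
evidence = sum(likelihood[h] * prior[h] for h in prior)  # P(x)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

def expected_utility(action):
    # E(U|x,a) uses the posterior probabilities instead of the priors.
    return sum(posterior[h] * u[(h, action)] for h in posterior)

best = max(["a1", "a0"], key=expected_utility)
```

Here the data shifts enough posterior weight onto h_1 that a_1 becomes the economically rational choice.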

1.1 Reduction to three parameters

Utilities lack scale in the sense that two utility functions that differ only by a constant shift are equivalent. Moreover, two utilities that differ by a positive constant factor are also equivalent. This can be exploited to reduce the number of parameters that need to be considered when choosing between the actions.

Checking whether E(U|x,a_1) > E(U|x,a_0) is equivalent to checking whether the difference

G(U|x) = E(U|x,a_1) - E(U|x,a_0)

is positive or negative. In this difference any constant shift of utilities cancels out, which may be emphasized by introducing two relative utilities

v(h_1) = u(h_1,a_1) - u(h_1,a_0)
v(h_0) = u(h_0,a_0) - u(h_0, a_1).

In terms of these relative utilities,

\displaystyle G(U|x) = \frac{P(x|h_1) P(h_1)}{P(x)} v(h_1) - \frac{P(x|h_0) P(h_0)}{P(x)} v(h_0).

Next, the lack of multiplicative scale can be exploited. Because a_1 is supposed to be an action suitable when h_1 is true and a_0 is supposed to be an action suitable when h_0 is true, we may assume that v(h_1) and v(h_0) are positive quantities. Introducing the utility fractions

\displaystyle w(h_i) = \frac{v(h_i)}{v(h_1) + v(h_0)},

we may note that 0 \leq w(h_i) \leq 1 and w(h_1) + w(h_0) = 1. Hence,

\displaystyle G'(U|x)  = \frac{G(U|x)}{v(h_1) + v(h_0)} = \frac{P(x|h_1) P(h_1)}{P(x)} w(h_1) - \frac{P(x|h_0) P(h_0)}{P(x)} w(h_0).

As a final touch, since we only care about whether G'(U|x) and G(U|x) are positive or negative, and not about their numerical values, we may multiply by P(x) and divide by \sqrt{P(x|h_1) P(x|h_0)} to obtain

\displaystyle g(U|x) = \frac{P(x)}{\sqrt{P(x|h_1) P(x|h_0)}} G'(U|x) = \sqrt{\frac{P(x|h_1)}{P(x|h_0)}} P(h_1) w(h_1) - \sqrt{\frac{P(x|h_0)}{P(x|h_1)}} P(h_0) w(h_0)
\displaystyle = \sqrt{K(x)} P(h_1) w(h_1) - \frac{1}{\sqrt{K(x)}} P(h_0) w(h_0).

In the final expression above, the Bayes factor or likelihood ratio has been introduced:

\displaystyle K(x) = \frac{P(x|h_1)}{P(x|h_0)}.

An economically rational agent can thus choose an action by checking whether g(U|x) is positive or negative. And to check that, it is enough to know the prior probability P(h_1), the utility fraction w(h_1), and the likelihood ratio K(x). A form of the decision criterion that emphasizes the likelihood ratio is

\displaystyle K(x) > \frac{P(h_0) w(h_0)}{P(h_1) w(h_1)},

where the right-hand side can be computed before the data x is available.
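
A small sketch of this criterion, reusing hypothetical utilities and priors (the right-hand side indeed needs no data):

```python
# Hypothetical priors and utilities.
P_h1, P_h0 = 0.3, 0.7
v_h1 = 10 - (-2)  # u(h1,a1) - u(h1,a0)
v_h0 = 1 - (-5)   # u(h0,a0) - u(h0,a1)
w_h1 = v_h1 / (v_h1 + v_h0)
w_h0 = v_h0 / (v_h1 + v_h0)

threshold = (P_h0 * w_h0) / (P_h1 * w_h1)  # computable before the data arrives

# Once the data x is in, only the likelihood ratio K(x) is needed.
K = 0.8 / 0.2  # hypothetical P(x|h1) / P(x|h0)
choose_a1 = K > threshold
```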

[Figure: The rescaled expected utility g(U|x) illustrated for three different values of K(x). The dashed line separates the positive region, where a_1 is preferred, from the negative region, where a_0 is preferred. Note the symmetry between K(x) = 3 and K(x) = 1/3.]

2. The Neyman-Pearson school

To go from decision theory and economic rationality to Neyman–Pearson’s framework is almost as easy as these two steps:

  • Step 1: Forget about prior knowledge and the probabilities P(h_1) and P(h_0).
  • Step 2: Relegate the utilities to a background role, but do not completely forget them.

The philosophy behind Step 1 is that prior probabilities of hypotheses are too subjective to have a place in statistical reasoning. On the other hand, the likelihood P(x|h_i) is considered more objective, at least when the hypothesis h_i is a simple statistical model without any unknown parameters.

In the section on decision theory, it was not necessary to assume much about the actions. In Neyman–Pearson’s framework, the action a_i is often taken to mean “accept h_i“. Accepting h_1 when h_0 is true is a Type I error, and is associated with some cost. Accepting h_0 when h_1 is true is a Type II error, which is also associated with a cost. Hence, the assumptions v(h_1) \geq 0 and v(h_0) \geq 0 from the previous section are still appropriate.

In the Neyman–Pearson framework, the ideal decision criterion is a likelihood ratio test. In such a test, one chooses a threshold \kappa. If K(x) > \kappa, then the action a_1 (“accept h_1”) is taken. If K(x) \leq \kappa, then the action a_0 (“accept h_0”) is taken. If the threshold happens to equal

\displaystyle \frac{P(h_0) w(h_0)}{P(h_1) w(h_1)}

then this procedure is equivalent to the economically rational criterion. However, in Neyman–Pearson’s framework one does not aim to be economically rational. Instead, one aims to find a good trade-off between Type I and Type II errors. The probability of a Type I error is denoted

\alpha = P("accept\ h_1"|h_0) = P(K(x) > \kappa | h_0)

and is called the significance level. The probability of a Type II error is written

\beta = P("accept\ h_0"|h_1) = P(K(x) \leq \kappa | h_1)

and 1-\beta is called the power of the test. Ideally, both \alpha and \beta should be vanishingly small, but this is typically not possible to achieve. The threshold \kappa is supposed to be chosen with the costs of Type I and Type II errors in mind, so as to yield a good trade-off between \alpha and \beta.
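
To make the trade-off concrete, here is a sketch for a hypothetical Gaussian example, where the test statistic is N(0,1) under h_0 and N(1,1) under h_1 (for a unit-variance Gaussian shift the likelihood ratio is monotone in the statistic, so thresholding the statistic itself is a likelihood ratio test):

```python
from statistics import NormalDist

# Hypothetical sampling distributions of the statistic under each hypothesis.
h0 = NormalDist(mu=0.0, sigma=1.0)
h1 = NormalDist(mu=1.0, sigma=1.0)

def alpha(tau):
    return 1.0 - h0.cdf(tau)  # P(t > tau | h0), Type I error probability

def beta(tau):
    return h1.cdf(tau)        # P(t <= tau | h1), Type II error probability

for tau in (0.5, 1.0, 1.645):
    print(f"tau={tau}: alpha={alpha(tau):.3f}, beta={beta(tau):.3f}")
```

Raising \tau lowers \alpha but raises \beta; which \tau is a good compromise depends on the relative costs of the two errors.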

2.1 Filter data

There is an optional third step:

  • Step 3: Replace the likelihood ratio by a suboptimal statistic.

A statistic is a quantity that summarizes the data x in some way. In the Neyman–Pearson framework, the likelihood ratio K(x) is the ideal statistic in the sense that no other statistic can yield a lower \beta for a given \alpha. However, it is permissible to choose a different, suboptimal test statistic. Denoting that statistic by t(x), the probabilities of Type I and II errors are then

\alpha = P(t(x) > \tau | h_0),
\beta = P(t(x) \leq \tau | h_1).

The test statistic should mimic the feature of the likelihood ratio that large values are more compatible with h_1 and small values are more compatible with h_0. The threshold value \tau should be chosen with the relative cost of Type I and II errors in mind, so as to yield a good compromise between low \alpha and low \beta.

The choice of statistic t and threshold \tau should be made before the actual data x is available. When it becomes available, one compares t(x) to the threshold. If t(x) > \tau, then one chooses the action a_1. If t(x) \leq \tau, then one chooses the action a_0.

3. Null Hypothesis Significance Testing

Step 3 was optional above. In Null Hypothesis Significance Testing (NHST), Step 3 is mandatory. Additionally:

  • Step 4: Forget all about the costs of Type I and II errors.
  • Step 5: Forget about the likelihood P(x|h_1).

In NHST, prior knowledge and utilities are disregarded and decisions are made based on only the likelihood P(x|h_0). Often h_1 is a relatively vague purely verbal research hypothesis, whereas h_0 is a simplistic and sharp statistical model. As in the Neyman–Pearson school, one relies on a test statistic t(x) and a significance level \alpha chosen before the data becomes available. The test statistic should be chosen so that large values are intuitively more compatible with h_1 and small values are intuitively more compatible with h_0. The typical choice is \alpha = 0.05, not because it is sensible but because that’s what everyone else chooses. Having fixed an \alpha, the threshold \tau is chosen to satisfy

\alpha = P(t(x) > \tau | h_0).
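
For example, if the test statistic happens to be standard normal under h_0 (an assumption made here purely for illustration), the threshold follows directly by inverting the CDF:

```python
from statistics import NormalDist

alpha = 0.05
# Threshold tau chosen so that P(t > tau | h0) = alpha.
tau = NormalDist().inv_cdf(1 - alpha)
print(tau)  # roughly 1.645 for alpha = 0.05
```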

Deviating slightly from the notation in previous sections, let’s denote the actually observed data by x_{\text{obs}} and the random variable representing the data by x. Once the data is observed, one then checks whether t(x_{\text{obs}}) > \tau. If so, one accepts h_1. If not, one accepts h_0.

Checking whether t(x_{\text{obs}}) > \tau is equivalent to the more familiar test involving the P-value:

"\text{P-value}" = P(t(x) > t(x_{\text{obs}}) | h_0) < \alpha.

Since h_1 and h_0 are mutually exclusive, “accepting h_1” implies rejecting h_0. Likewise, “accepting h_0” implies rejecting h_1. However, many practitioners of NHST prefer other actions:

  • Step 6: Let the actions be “reject h_0“, if the P-value is less than \alpha, and “not reject h_0“, otherwise.

“Not reject h_0” is perhaps better expressed as “remain undecided about the two hypotheses”. This means that h_1 is perpetually in limbo, never accepted or rejected. Moreover, it means that any action that would be advisable if h_1 is sufficiently well confirmed will never be taken.

3.1 Why rely on NHST?

NHST is flawed enough that one may wonder why it is so popular. Many articles linked in previous posts on this blog have commented on this. I also found two recent articles devoted to this question.
