The following is a comparison of economic rationality and Null Hypothesis Significance Testing from the perspective of the former. The starting point is maximization of expected utility, which is a standard criterion for economic rationality. Then the focus is on the ingredients in economic rationality that must be thrown away to end up with NHST.
1. Decision theory and economic rationality
Consider a simple situation where there are two mutually exclusive hypotheses that we call $H_0$ and $H_1$. It is not known for certain which hypothesis is true; all we can do is assign probabilities $P(H_0)$ and $P(H_1)$. Assume for simplicity that no other hypotheses are in contention, i.e. $P(H_0) + P(H_1) = 1$. Connected to the hypotheses are two mutually exclusive actions $a_0$ and $a_1$. To quantify the reward or cost associated with the actions, we introduce a utility function. Minimally, the utility depends on which hypothesis is true and which action is performed. The expected utility given that $H_j$ is true and the action $a_i$ is performed is denoted

$$U(a_i \mid H_j).$$
The standard criterion in decision theory and economics is to choose the action that maximizes the expected utility. In the present simple setting this means comparing the expected utilities

$$E[U \mid a_0] = U(a_0 \mid H_0)\,P(H_0) + U(a_0 \mid H_1)\,P(H_1),$$
$$E[U \mid a_1] = U(a_1 \mid H_0)\,P(H_0) + U(a_1 \mid H_1)\,P(H_1).$$

If $E[U \mid a_0] > E[U \mid a_1]$, then $a_0$ is the economically rational action. Conversely, if $E[U \mid a_0] < E[U \mid a_1]$, then $a_1$ is the economically rational action.
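As a minimal sketch of this criterion (the utility numbers and priors below are made up for illustration, not taken from the text):

```python
def rational_action(p_h0, utilities):
    """Return the index (0 or 1) of the action maximizing expected utility.

    utilities[i][j] is U(a_i | H_j); p_h0 is P(H_0), so P(H_1) = 1 - p_h0."""
    p = (p_h0, 1.0 - p_h0)
    expected = [sum(utilities[i][j] * p[j] for j in range(2)) for i in range(2)]
    return max(range(2), key=lambda i: expected[i])

# Hypothetical utilities: a_1 is costly if H_0 is true, rewarding if H_1 is true.
U = [[0.0, 0.0],    # a_0: status quo
     [-10.0, 5.0]]  # a_1: act on H_1
print(rational_action(p_h0=0.8, utilities=U))  # 0: the prior is too low to act
print(rational_action(p_h0=0.2, utilities=U))  # 1
```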
One could stop the exposition of decision theory here, on the note that decision theorists take into account both probabilities and utilities when choosing how to act. However, to better connect to standard statistics, it is useful to also consider the situation after new data becomes available. Once new data $D$ is available, a good Bayesian updates the probabilities. The update from the prior probabilities $P(H_i)$ to the posterior probabilities $P(H_i \mid D)$ is done using Bayes' theorem:

$$P(H_i \mid D) = \frac{P(D \mid H_i)\,P(H_i)}{P(D \mid H_0)\,P(H_0) + P(D \mid H_1)\,P(H_1)}.$$
Next, assume for simplicity that the new data does not affect the utilities. In symbols, $U(a_i \mid H_j, D) = U(a_i \mid H_j)$. To choose between the actions $a_0$ and $a_1$ one should now compare the expected utilities

$$E[U \mid a_0, D] = U(a_0 \mid H_0)\,P(H_0 \mid D) + U(a_0 \mid H_1)\,P(H_1 \mid D),$$
$$E[U \mid a_1, D] = U(a_1 \mid H_0)\,P(H_0 \mid D) + U(a_1 \mid H_1)\,P(H_1 \mid D).$$
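The Bayesian update for two exhaustive hypotheses can be sketched in a few lines (the likelihood values in the usage example are hypothetical):

```python
def posterior_h1(p_h1, lik_h0, lik_h1):
    """Bayes' theorem for two exhaustive hypotheses:
    P(H_1 | D) = P(D | H_1) P(H_1) / (P(D | H_0) P(H_0) + P(D | H_1) P(H_1))."""
    p_h0 = 1.0 - p_h1
    return lik_h1 * p_h1 / (lik_h0 * p_h0 + lik_h1 * p_h1)

# Hypothetical likelihoods: the data is four times as probable under H_1.
print(posterior_h1(p_h1=0.2, lik_h0=0.05, lik_h1=0.20))  # 0.5
```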
1.1 Reduction to three parameters
Utilities lack scale in the sense that two utility functions that differ by a constant shift are equivalent. Moreover, two utilities that differ by a positive constant factor are also equivalent. This can be exploited to reduce the number of parameters that need to be considered when choosing between the actions.
Checking whether $E[U \mid a_1, D] > E[U \mid a_0, D]$ is equivalent to checking whether the difference

$$\Delta = E[U \mid a_1, D] - E[U \mid a_0, D]$$

is positive or negative. In this difference any constant shift of the utilities cancels out, which may be emphasized by introducing two relative utilities

$$\Delta U_0 = U(a_0 \mid H_0) - U(a_1 \mid H_0), \qquad \Delta U_1 = U(a_1 \mid H_1) - U(a_0 \mid H_1).$$

In terms of these relative utilities,

$$\Delta = \Delta U_1\,P(H_1 \mid D) - \Delta U_0\,P(H_0 \mid D).$$
Next, the lack of multiplicative scale can be exploited. Because $a_0$ is supposed to be an action suitable when $H_0$ is true and $a_1$ is supposed to be an action suitable when $H_1$ is true, we may assume that $\Delta U_0$ and $\Delta U_1$ are positive quantities. Introducing the normalized relative utilities

$$u_0 = \frac{\Delta U_0}{\Delta U_0 + \Delta U_1}, \qquad u_1 = \frac{\Delta U_1}{\Delta U_0 + \Delta U_1},$$

we may note that $u_0 > 0$, $u_1 > 0$, and $u_0 + u_1 = 1$. Hence,

$$\Delta' = \frac{\Delta}{\Delta U_0 + \Delta U_1} = u_1\,P(H_1 \mid D) - u_0\,P(H_0 \mid D)$$

has the same sign as $\Delta$.
As a final touch, since we only care about whether $\Delta$ and $\Delta'$ are positive or negative, and not their numerical values, we may multiply by $P(D)$ and divide by $u_0\,P(D \mid H_0)\,P(H_0)$ to obtain

$$\Delta'' = \frac{u_1}{u_0}\,\frac{P(H_1)}{P(H_0)}\,\lambda - 1.$$

In the final expression above, the Bayes factor or likelihood ratio $\lambda$ has been introduced:

$$\lambda = \frac{P(D \mid H_1)}{P(D \mid H_0)}.$$
An economically rational agent can thus choose the action by checking whether $\Delta''$ is positive or negative. And to check that, it is enough to know the prior probability $P(H_1)$, the utility fraction $u_1$, and the likelihood ratio $\lambda$. A form of the decision criterion that emphasizes the likelihood ratio is

$$\lambda > \frac{u_0}{u_1}\,\frac{P(H_0)}{P(H_1)},$$

where the right-hand side can be computed before the data is available.
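The reduced decision rule can be sketched as a function of just these three parameters: the likelihood ratio, the prior probability of the second hypothesis, and the utility fraction (the numbers in the usage example are made up):

```python
def rational_action_from_lr(lam, p_h1, u1):
    """Economically rational choice from the three reduced parameters.

    lam:  likelihood ratio P(D | H_1) / P(D | H_0)
    p_h1: prior probability P(H_1); P(H_0) = 1 - p_h1
    u1:   utility fraction; u0 = 1 - u1

    Returns 1 (take a_1) iff lam exceeds the pre-data threshold
    u0 * P(H_0) / (u1 * P(H_1))."""
    threshold = (1 - u1) * (1 - p_h1) / (u1 * p_h1)
    return 1 if lam > threshold else 0

# Hypothetical numbers: P(H_1) = 0.2 and u1 = 1/3, so the threshold is 8.
print(rational_action_from_lr(lam=10.0, p_h1=0.2, u1=1/3))  # 1
print(rational_action_from_lr(lam=5.0, p_h1=0.2, u1=1/3))   # 0
```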
2. The Neyman-Pearson school
To go from decision theory and economic rationality to Neyman–Pearson’s framework is almost as easy as these two steps:
- Step 1: Forget about prior knowledge and the probabilities $P(H_0)$ and $P(H_1)$.
- Step 2: Relegate the utilities to a background role, but do not completely forget them.
The philosophy behind Step 1 is that prior probabilities of hypotheses are too subjective to have a place in statistical reasoning. On the other hand, the likelihood is considered more objective, at least when the hypothesis is a simple statistical model without any unknown parameters.
In the section on decision theory, it was not necessary to assume much about the actions. In Neyman–Pearson’s framework, the action $a_i$ is often taken to mean “accept $H_i$”. Accepting $H_1$ when $H_0$ is true is a Type I error, and is associated with some cost. Accepting $H_0$ when $H_1$ is true is a Type II error, which is also associated with a cost. Hence, the assumptions $\Delta U_0 > 0$ and $\Delta U_1 > 0$ from the previous section are still appropriate.
In the Neyman–Pearson framework, the ideal decision criterion is a likelihood ratio test. In such a test, one chooses a threshold $\lambda_c$. If $\lambda > \lambda_c$, then the action $a_1$ (“accept $H_1$”) is taken. If $\lambda < \lambda_c$, then the action $a_0$ (“accept $H_0$”) is taken. If the threshold happens to equal

$$\lambda_c = \frac{u_0\,P(H_0)}{u_1\,P(H_1)},$$

then this procedure is equivalent to the economically rational criterion. However, in Neyman–Pearson’s framework one does not aim to be economically rational. Instead, one aims to find a good trade-off between Type I and Type II errors. The probability of a Type I error is denoted

$$\alpha = P(\lambda > \lambda_c \mid H_0)$$

and is called the significance level. The probability of a Type II error is written

$$\beta = P(\lambda < \lambda_c \mid H_1),$$

and $1 - \beta$ is called the power of the test. Ideally, both $\alpha$ and $\beta$ should be vanishingly small, but this is typically not possible to achieve. The threshold $\lambda_c$ is supposed to be chosen with the costs of Type I and Type II errors in mind, so as to yield a good trade-off between $\alpha$ and $\beta$.
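To make the trade-off concrete, here is a sketch for an assumed toy setting (a single Gaussian observation; my example, not the author's). With $H_0{:}\ x \sim N(0,1)$ and $H_1{:}\ x \sim N(1,1)$, the likelihood ratio is increasing in $x$, so thresholding $\lambda$ at $\lambda_c$ is the same as thresholding $x$ at some $x_c$:

```python
from statistics import NormalDist

# Assumed toy models for a single observation x.
h0, h1 = NormalDist(0, 1), NormalDist(1, 1)

def alpha_beta(x_c):
    alpha = 1 - h0.cdf(x_c)  # P(accept H_1 | H_0): Type I error probability
    beta = h1.cdf(x_c)       # P(accept H_0 | H_1): Type II error probability
    return alpha, beta

# Moving the threshold trades one error for the other.
for x_c in (0.5, 1.0, 1.64):
    a, b = alpha_beta(x_c)
    print(f"x_c={x_c}: alpha={a:.3f}, beta={b:.3f}")
```

At $x_c = 0.5$ the two error probabilities are equal by symmetry; raising the threshold lowers $\alpha$ at the expense of $\beta$.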
2.1 Filter data
There is an optional third step:
- Step 3: Replace the likelihood ratio $\lambda$ by a suboptimal statistic.
A statistic is a quantity that summarizes the data in some way. In the Neyman–Pearson framework, the likelihood ratio is the ideal statistic in the sense that no other statistic can yield a lower $\beta$ for a given $\alpha$. However, though suboptimal, it is permissible to choose a different test statistic. Denoting that statistic by $t$ and its threshold by $t_c$, the probabilities of Type I and II errors are then

$$\alpha = P(t > t_c \mid H_0), \qquad \beta = P(t < t_c \mid H_1).$$

The test statistic should mimic the feature of the likelihood ratio that large values are more compatible with $H_1$ and small values are more compatible with $H_0$. The threshold value $t_c$ should be chosen with the relative cost of Type I and II errors in mind, so as to yield a good compromise between a low $\alpha$ and a low $\beta$.

The choice of statistic and threshold should be made before the actual data is available. When the data becomes available, one compares $t$ to the threshold. If $t > t_c$, then one chooses the action $a_1$. If $t < t_c$, then one chooses the action $a_0$.
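The recipe — fix a statistic and threshold in advance, then assess the two error probabilities — can be sketched with a small Monte Carlo simulation. The models, sample size, and threshold below are all assumptions made up for the example:

```python
import random

random.seed(0)

# Hypothetical setting: n = 5 observations, H_0: N(0, 1), H_1: N(1, 1).
# The sample mean serves as test statistic t (for this pair it happens to be
# monotone in the likelihood ratio, but any statistic could be plugged in).
# Statistic and threshold t_c are fixed before any data is seen.
n, t_c, trials = 5, 0.6, 20000

def sample_mean(mu):
    return sum(random.gauss(mu, 1) for _ in range(n)) / n

# Monte Carlo estimates of the two error probabilities.
alpha_hat = sum(sample_mean(0) > t_c for _ in range(trials)) / trials  # ~ P(t > t_c | H_0)
beta_hat = sum(sample_mean(1) < t_c for _ in range(trials)) / trials   # ~ P(t < t_c | H_1)
print(alpha_hat, beta_hat)  # roughly 0.09 and 0.19 in this setting
```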
3. Null Hypothesis Significance Testing
Step 3 was optional above. In Null Hypothesis Significance Testing (NHST), Step 3 is mandatory. Additionally:
- Step 4: Forget all about the costs of Type I and II errors.
- Step 5: Forget about the likelihood $P(D \mid H_1)$.
In NHST, prior knowledge and utilities are disregarded and decisions are made based only on the likelihood $P(D \mid H_0)$. Often $H_1$ is a relatively vague, purely verbal research hypothesis, whereas $H_0$ is a simplistic and sharp statistical model. As in the Neyman–Pearson school, one relies on a test statistic $t$ and a significance level $\alpha$ chosen before the data becomes available. The test statistic should be chosen so that large values are intuitively more compatible with $H_1$ and small values are intuitively more compatible with $H_0$. The typical choice is $\alpha = 0.05$, not because it is sensible but because that’s what everyone else chooses. Having fixed an $\alpha$, the threshold $t_c$ is chosen to satisfy

$$P(t > t_c \mid H_0) = \alpha.$$
Deviating slightly from the notation in previous sections, let’s denote the actually observed data by $d$ and the random variable representing the data by $D$. Once the data is observed, one then checks whether $t(d) > t_c$. If so, one accepts $H_1$. If not, one accepts $H_0$.

Checking whether $t(d) > t_c$ is equivalent to the more familiar test involving the P-value:

$$P(t(D) \ge t(d) \mid H_0) < \alpha.$$
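The equivalence between comparing the statistic with its threshold and comparing the P-value with the significance level can be checked numerically in a toy one-sided z-test, where the statistic is standard normal under the null (an assumed example, not the author's):

```python
from statistics import NormalDist

# Toy one-sided z-test: under H_0 the statistic t(D) ~ N(0, 1).
h0 = NormalDist(0, 1)
alpha = 0.05
t_c = h0.inv_cdf(1 - alpha)  # threshold satisfying P(t > t_c | H_0) = alpha

def p_value(t_obs):
    return 1 - h0.cdf(t_obs)  # P(t(D) >= t(d) | H_0)

# The two checks always agree for a continuous statistic.
for t_obs in (1.2, 2.3):
    print(t_obs > t_c, p_value(t_obs) < alpha)
```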
Since $H_0$ and $H_1$ are mutually exclusive, “accepting $H_0$” implies rejecting $H_1$. Likewise, “accepting $H_1$” implies rejecting $H_0$. However, many practitioners of NHST prefer other actions:

- Step 6: Let the actions be “reject $H_0$” if the P-value is less than $\alpha$, and “not reject $H_0$” otherwise.
“Not reject $H_0$” is perhaps better expressed as “remain undecided about the two hypotheses”. This means that $H_1$ is perpetually in limbo, never accepted or rejected. Moreover, it means that any action that would be advisable if $H_1$ is sufficiently well confirmed will never be taken.
3.1 Why rely on NHST?
NHST is flawed enough that one may wonder why it is so popular. Many articles linked in previous posts on this blog have commented on this. I also found two recent articles devoted to this question:
- J. Stunt et al. Why we habitually engage in null-hypothesis significance testing: A qualitative study. PLOS ONE (2021), https://doi.org/10.1371/journal.pone.0258330
- G. Gigerenzer. Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science (2018), https://doi.org/10.1177/2515245918771329