The following is a comparison of economic rationality and Null Hypothesis Significance Testing from the perspective of the former. The starting point is maximization of expected utility, which is a standard criterion for economic rationality. Then the focus is on the ingredients in economic rationality that must be thrown away to end up with NHST.
1. Decision theory and economic rationality
Consider a simple situation where there are two mutually exclusive hypotheses that we call $H_0$ and $H_1$. It is not known for certain which hypothesis is true; all we can do is assign probabilities $P(H_0)$ and $P(H_1)$. Assume for simplicity that no other hypotheses are in contention, i.e. $P(H_0) + P(H_1) = 1$. Connected to the hypotheses are two mutually exclusive actions $a_0$ and $a_1$. To quantify the reward or cost associated with the actions, we introduce a utility function. Minimally, the utility depends on which hypothesis is true and which action is performed. The expected utility given that $H_j$ is true and the action $a_i$ is performed is denoted

$$U(a_i \mid H_j).$$
The standard criterion in decision theory and economics is to choose the action that maximizes the expected utility. In the present simple setting this means comparing the expected utilities

$$E[U \mid a_0] = U(a_0 \mid H_0)\,P(H_0) + U(a_0 \mid H_1)\,P(H_1),$$
$$E[U \mid a_1] = U(a_1 \mid H_0)\,P(H_0) + U(a_1 \mid H_1)\,P(H_1).$$

If $E[U \mid a_0] > E[U \mid a_1]$, then $a_0$ is the economically rational action. Conversely, if $E[U \mid a_0] < E[U \mid a_1]$, then $a_1$ is the economically rational action.
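As a minimal sketch of this criterion (the utility numbers and priors below are made up for illustration, not taken from the text):

```python
def rational_action(p_h0, utilities):
    """Return the index (0 or 1) of the action maximizing expected utility.

    utilities[i][j] is U(a_i | H_j); p_h0 is P(H_0), so P(H_1) = 1 - p_h0."""
    p = (p_h0, 1.0 - p_h0)
    expected = [sum(utilities[i][j] * p[j] for j in range(2)) for i in range(2)]
    return max(range(2), key=lambda i: expected[i])

# Hypothetical utilities: a_1 is costly if H_0 is true, rewarding if H_1 is true.
U = [[0.0, 0.0],    # a_0: status quo
     [-10.0, 5.0]]  # a_1: act on H_1
print(rational_action(p_h0=0.8, utilities=U))  # 0: the prior is too low to act
print(rational_action(p_h0=0.2, utilities=U))  # 1
```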
One could stop the exposition of decision theory here, on the note that decision theorists take into account both probabilities and utilities when choosing how to act. However, to better connect to standard statistics, it is useful to also consider the situation after new data becomes available. Once new data $D$ is available, a good Bayesian updates the probabilities. The update from the prior probabilities $P(H_i)$ to the posterior probabilities $P(H_i \mid D)$ is done using Bayes' theorem:

$$P(H_i \mid D) = \frac{P(D \mid H_i)\,P(H_i)}{P(D \mid H_0)\,P(H_0) + P(D \mid H_1)\,P(H_1)}.$$
Next, assume for simplicity that the new data does not affect the utilities. In symbols, $U(a_i \mid H_j, D) = U(a_i \mid H_j)$. To choose between the actions $a_0$ and $a_1$ one should now compare the expected utilities

$$E[U \mid a_0, D] = U(a_0 \mid H_0)\,P(H_0 \mid D) + U(a_0 \mid H_1)\,P(H_1 \mid D),$$
$$E[U \mid a_1, D] = U(a_1 \mid H_0)\,P(H_0 \mid D) + U(a_1 \mid H_1)\,P(H_1 \mid D).$$
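The Bayesian update for two exhaustive hypotheses can be sketched in a few lines (the likelihood values in the usage example are hypothetical):

```python
def posterior_h1(p_h1, lik_h0, lik_h1):
    """Bayes' theorem for two exhaustive hypotheses:
    P(H_1 | D) = P(D | H_1) P(H_1) / (P(D | H_0) P(H_0) + P(D | H_1) P(H_1))."""
    p_h0 = 1.0 - p_h1
    return lik_h1 * p_h1 / (lik_h0 * p_h0 + lik_h1 * p_h1)

# Hypothetical likelihoods: the data is four times as probable under H_1.
print(posterior_h1(p_h1=0.2, lik_h0=0.05, lik_h1=0.20))  # 0.5
```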
1.1 Reduction to three parameters
Utilities lack scale in the sense that two utility functions that differ by a constant shift are equivalent. Moreover, two utilities that differ by a positive constant factor are also equivalent. This can be exploited to reduce the number of parameters that need to be considered when choosing between the actions.
Checking whether $E[U \mid a_1, D] > E[U \mid a_0, D]$ is equivalent to checking whether the difference

$$\Delta = E[U \mid a_1, D] - E[U \mid a_0, D]$$

is positive or negative. In this difference any constant shift of the utilities cancels out, which may be emphasized by introducing two relative utilities

$$\Delta U_0 = U(a_0 \mid H_0) - U(a_1 \mid H_0), \qquad \Delta U_1 = U(a_1 \mid H_1) - U(a_0 \mid H_1).$$

In terms of these relative utilities,

$$\Delta = \Delta U_1\,P(H_1 \mid D) - \Delta U_0\,P(H_0 \mid D).$$
Next, the lack of multiplicative scale can be exploited. Because $a_0$ is supposed to be an action suitable when $H_0$ is true and $a_1$ is supposed to be an action suitable when $H_1$ is true, we may assume that $\Delta U_0$ and $\Delta U_1$ are positive quantities. Introducing the normalized relative utilities

$$u_0 = \frac{\Delta U_0}{\Delta U_0 + \Delta U_1}, \qquad u_1 = \frac{\Delta U_1}{\Delta U_0 + \Delta U_1},$$

we may note that $u_0 > 0$, $u_1 > 0$, and $u_0 + u_1 = 1$. Hence,

$$\Delta' = \frac{\Delta}{\Delta U_0 + \Delta U_1} = u_1\,P(H_1 \mid D) - u_0\,P(H_0 \mid D)$$

has the same sign as $\Delta$.
As a final touch, since we only care about whether $\Delta$ and $\Delta'$ are positive or negative, and not their numerical values, we may multiply by $P(D)$ and divide by $u_0\,P(D \mid H_0)\,P(H_0)$ to obtain

$$\Delta'' = \frac{u_1}{u_0}\,\frac{P(H_1)}{P(H_0)}\,\lambda - 1.$$

In the final expression above, the Bayes factor or likelihood ratio $\lambda$ has been introduced:

$$\lambda = \frac{P(D \mid H_1)}{P(D \mid H_0)}.$$
An economically rational agent can thus choose the action by checking whether $\Delta''$ is positive or negative. And to check that, it is enough to know the prior probability $P(H_1)$, the utility fraction $u_1$, and the likelihood ratio $\lambda$. A form of the decision criterion that emphasizes the likelihood ratio is

$$\lambda > \frac{u_0}{u_1}\,\frac{P(H_0)}{P(H_1)},$$

where the right-hand side can be computed before the data is available.
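The reduced decision rule can be sketched as a function of just these three parameters: the likelihood ratio, the prior probability of the second hypothesis, and the utility fraction (the numbers in the usage example are made up):

```python
def rational_action_from_lr(lam, p_h1, u1):
    """Economically rational choice from the three reduced parameters.

    lam:  likelihood ratio P(D | H_1) / P(D | H_0)
    p_h1: prior probability P(H_1); P(H_0) = 1 - p_h1
    u1:   utility fraction; u0 = 1 - u1

    Returns 1 (take a_1) iff lam exceeds the pre-data threshold
    u0 * P(H_0) / (u1 * P(H_1))."""
    threshold = (1 - u1) * (1 - p_h1) / (u1 * p_h1)
    return 1 if lam > threshold else 0

# Hypothetical numbers: P(H_1) = 0.2 and u1 = 1/3, so the threshold is 8.
print(rational_action_from_lr(lam=10.0, p_h1=0.2, u1=1/3))  # 1
print(rational_action_from_lr(lam=5.0, p_h1=0.2, u1=1/3))   # 0
```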
2. The Neyman-Pearson school
To go from decision theory and economic rationality to Neyman–Pearson’s framework is almost as easy as these two steps:
- Step 1: Forget about prior knowledge and the probabilities $P(H_0)$ and $P(H_1)$.
- Step 2: Relegate the utilities to a background role, but do not completely forget them.
The philosophy behind Step 1 is that prior probabilities of hypotheses are too subjective to have a place in statistical reasoning. On the other hand, the likelihood is considered more objective, at least when the hypothesis is a simple statistical model without any unknown parameters.
In the section on decision theory, it was not necessary to assume much about the actions. In Neyman–Pearson’s framework, the action $a_i$ is often taken to mean “accept $H_i$”. Accepting $H_1$ when $H_0$ is true is a Type I error, and is associated with some cost. Accepting $H_0$ when $H_1$ is true is a Type II error, which is also associated with a cost. Hence, the assumptions $\Delta U_0 > 0$ and $\Delta U_1 > 0$ from the previous section are still appropriate.
In the Neyman–Pearson framework, the ideal decision criterion is a likelihood ratio test. In such a test, one chooses a threshold $\lambda_c$. If $\lambda > \lambda_c$, then the action $a_1$ (“accept $H_1$”) is taken. If $\lambda < \lambda_c$, then the action $a_0$ (“accept $H_0$”) is taken. If the threshold happens to equal

$$\lambda_c = \frac{u_0\,P(H_0)}{u_1\,P(H_1)},$$

then this procedure is equivalent to the economically rational criterion. However, in Neyman–Pearson’s framework one does not aim to be economically rational. Instead, one aims to find a good trade-off between Type I and Type II errors. The probability of a Type I error is denoted

$$\alpha = P(\lambda > \lambda_c \mid H_0)$$

and is called the significance level. The probability of a Type II error is written

$$\beta = P(\lambda < \lambda_c \mid H_1),$$

and $1 - \beta$ is called the power of the test. Ideally, both $\alpha$ and $\beta$ should be vanishingly small, but this is typically not possible to achieve. The threshold $\lambda_c$ is supposed to be chosen with the costs of Type I and Type II errors in mind, so as to yield a good trade-off between $\alpha$ and $\beta$.
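To make the trade-off concrete, here is a sketch for an assumed toy setting (a single Gaussian observation; my example, not the author's). With $H_0{:}\ x \sim N(0,1)$ and $H_1{:}\ x \sim N(1,1)$, the likelihood ratio is increasing in $x$, so thresholding $\lambda$ at $\lambda_c$ is the same as thresholding $x$ at some $x_c$:

```python
from statistics import NormalDist

# Assumed toy models for a single observation x.
h0, h1 = NormalDist(0, 1), NormalDist(1, 1)

def alpha_beta(x_c):
    alpha = 1 - h0.cdf(x_c)  # P(accept H_1 | H_0): Type I error probability
    beta = h1.cdf(x_c)       # P(accept H_0 | H_1): Type II error probability
    return alpha, beta

# Moving the threshold trades one error for the other.
for x_c in (0.5, 1.0, 1.64):
    a, b = alpha_beta(x_c)
    print(f"x_c={x_c}: alpha={a:.3f}, beta={b:.3f}")
```

At $x_c = 0.5$ the two error probabilities are equal by symmetry; raising the threshold lowers $\alpha$ at the expense of $\beta$.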
2.1 Filter data
There is an optional third step:
- Step 3: Replace the likelihood ratio $\lambda$ by a suboptimal statistic.
A statistic is a quantity that summarizes the data in some way. In the Neyman–Pearson framework, the likelihood ratio is the ideal statistic in the sense that no other statistic can yield a lower $\beta$ for a given $\alpha$. However, though suboptimal, it is permissible to choose a different test statistic. Denoting that statistic by $t$ and its threshold by $t_c$, the probabilities of Type I and II errors are then

$$\alpha = P(t > t_c \mid H_0), \qquad \beta = P(t < t_c \mid H_1).$$

The test statistic should mimic the feature of the likelihood ratio that large values are more compatible with $H_1$ and small values are more compatible with $H_0$. The threshold value $t_c$ should be chosen with the relative cost of Type I and II errors in mind, so as to yield a good compromise between a low $\alpha$ and a low $\beta$.

The choice of statistic and threshold should be made before the actual data is available. When the data becomes available, one compares $t$ to the threshold. If $t > t_c$, then one chooses the action $a_1$. If $t < t_c$, then one chooses the action $a_0$.
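The recipe — fix a statistic and threshold in advance, then assess the two error probabilities — can be sketched with a small Monte Carlo simulation. The models, sample size, and threshold below are all assumptions made up for the example:

```python
import random

random.seed(0)

# Hypothetical setting: n = 5 observations, H_0: N(0, 1), H_1: N(1, 1).
# The sample mean serves as test statistic t (for this pair it happens to be
# monotone in the likelihood ratio, but any statistic could be plugged in).
# Statistic and threshold t_c are fixed before any data is seen.
n, t_c, trials = 5, 0.6, 20000

def sample_mean(mu):
    return sum(random.gauss(mu, 1) for _ in range(n)) / n

# Monte Carlo estimates of the two error probabilities.
alpha_hat = sum(sample_mean(0) > t_c for _ in range(trials)) / trials  # ~ P(t > t_c | H_0)
beta_hat = sum(sample_mean(1) < t_c for _ in range(trials)) / trials   # ~ P(t < t_c | H_1)
print(alpha_hat, beta_hat)  # roughly 0.09 and 0.19 in this setting
```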
3. Null Hypothesis Significance Testing
Step 3 was optional above. In Null Hypothesis Significance Testing (NHST), Step 3 is mandatory. Additionally:
- Step 4: Forget all about the costs of Type I and II errors.
- Step 5: Forget about the likelihood $P(D \mid H_1)$.
In NHST, prior knowledge and utilities are disregarded and decisions are made based only on the likelihood $P(D \mid H_0)$. Often $H_1$ is a relatively vague, purely verbal research hypothesis, whereas $H_0$ is a simplistic and sharp statistical model. As in the Neyman–Pearson school, one relies on a test statistic $t$ and a significance level $\alpha$ chosen before the data becomes available. The test statistic should be chosen so that large values are intuitively more compatible with $H_1$ and small values are intuitively more compatible with $H_0$. The typical choice is $\alpha = 0.05$, not because it is sensible but because that’s what everyone else chooses. Having fixed an $\alpha$, the threshold $t_c$ is chosen to satisfy

$$P(t > t_c \mid H_0) = \alpha.$$
Deviating slightly from the notation in previous sections, let’s denote the actually observed data by $d$ and the random variable representing the data by $D$. Once the data is observed, one then checks whether $t(d) > t_c$. If so, one accepts $H_1$. If not, one accepts $H_0$.

Checking whether $t(d) > t_c$ is equivalent to the more familiar test involving the P-value:

$$P(t(D) \ge t(d) \mid H_0) < \alpha.$$
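The equivalence between comparing the statistic with its threshold and comparing the P-value with the significance level can be checked numerically in a toy one-sided z-test, where the statistic is standard normal under the null (an assumed example, not the author's):

```python
from statistics import NormalDist

# Toy one-sided z-test: under H_0 the statistic t(D) ~ N(0, 1).
h0 = NormalDist(0, 1)
alpha = 0.05
t_c = h0.inv_cdf(1 - alpha)  # threshold satisfying P(t > t_c | H_0) = alpha

def p_value(t_obs):
    return 1 - h0.cdf(t_obs)  # P(t(D) >= t(d) | H_0)

# The two checks always agree for a continuous statistic.
for t_obs in (1.2, 2.3):
    print(t_obs > t_c, p_value(t_obs) < alpha)
```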
Since $H_0$ and $H_1$ are mutually exclusive, “accepting $H_0$” implies rejecting $H_1$. Likewise, “accepting $H_1$” implies rejecting $H_0$. However, many practitioners of NHST prefer other actions:

- Step 6: Let the actions be “reject $H_0$” if the P-value is less than $\alpha$, and “not reject $H_0$” otherwise.
“Not reject $H_0$” is perhaps better expressed as “remain undecided about the two hypotheses”. This means that $H_1$ is perpetually in limbo, never accepted or rejected. Moreover, it means that any action that would be advisable if $H_1$ is sufficiently well confirmed will never be taken.
3.1 Why rely on NHST?
NHST is flawed enough that one may wonder why it is so popular. Many articles linked in previous posts on this blog have commented on this. I also found two recent articles devoted to this question:
- J. Stunt et al. Why we habitually engage in null-hypothesis significance testing: A qualitative study. PLOS ONE (2021), https://doi.org/10.1371/journal.pone.0258330
- G. Gigerenzer. Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science (2018), https://doi.org/10.1177/2515245918771329