5 Hypothesis Testing

By Zhihang Ruan

5.1 Introduction

Both in our daily lives and in scientific research, we come across many claims. We may formulate our own hypotheses based on our knowledge, available information, or existing theory. These hypotheses can be descriptive, e.g., we may hypothesize that a certain percentage of U.S. voters support the policy of universal basic income. Or the hypothesis can be causal, e.g., we may believe that education leads to stronger support for gender equality. The measures (for example, the mean or standard deviation) used to describe a population distribution are called population parameters. If we have access to everyone in the population we are interested in, then we can easily tell whether our hypothesis about a population parameter is true or false (e.g., if we know every voter’s opinion on the policy of universal basic income, then we can prove/disprove our hypothesis concerning the support rate for the policy). But in many cases, we do not have access to the whole population and cannot firmly prove or disprove our hypotheses. For example, it may cost too much to ask every U.S. voter about their opinions on specific policies. In these cases, statistical theory and methods provide us with effective ways to test a hypothesis, or more accurately, to assess whether the observed data are consistent with a claim of interest concerning the population. In this chapter, we will go through the idea of hypothesis testing in statistics and how it is applied in political science.

5.2 Background

There are different understandings of hypothesis testing. In this chapter, we will follow the Neyman-Pearson paradigm (Rice 2007, 331), which casts hypothesis testing as a decision problem. Within this paradigm, we first state a null hypothesis and an alternative hypothesis concerning the population. A null hypothesis is a claim or hypothesis we plan to test, or more specifically, something we decide whether or not to reject. It can be descriptive (e.g., the support rate for the president among all U.S. voters is 50 percent) or causal (e.g., education leads to stronger support for gender equality among all human beings). An alternative hypothesis, also called the research hypothesis, is the opposite of the null hypothesis: it is what we believe to be true if we reject the null hypothesis. Then, with what we observe in a random sample from the population, we decide whether or not to reject the null hypothesis concerning the population. This approach does not enable us to state the exact probability that the null or alternative hypothesis is true.9 To do that, we need more information and perhaps another paradigm (e.g., a so-called prior probability within the Bayesian paradigm), which we will not go into in this chapter. But even though the approach we discuss in this chapter does not directly tell us how likely a hypothesis is to be true or false, it is very useful in scientific studies as well as in daily life, as you will see in this chapter.

As mentioned in the introduction of this chapter, the classic idea of hypothesis testing concerns a sample and a population. In the previous chapter, we learned what the terms population, random sample, and random sampling mean. The techniques we discuss in this chapter mostly assume a random sample. Below, we will quickly review the ideas of random sampling and random samples and explain how random sampling enables us to make inferences about the population from what we observe in the sample.

5.3 Samples and Sampling

As mentioned at the beginning of this chapter, in many cases we do not have access to all the units of the population we are interested in. For example, if we are interested in the support rate for the president, it would be perfect if we knew the opinion of every single person (i.e., every unit of the population) in the U.S. However, it is almost impossible to get access to everyone’s opinion. In many cases, we can only get access to a small group of individuals, which we call a sample from the population. When the sample is randomly chosen from the population (i.e., everyone in the population has an equal chance of being selected, or at least a specific chance known to the researchers before the sample is drawn), then we may learn about the population from what we observe in the random sample. More specifically, statistical theory enables us to make inferences about the population from the random sample. In the next part, I will explain how we can make inferences from a random sample to the population and test a hypothesis concerning the population with a random sample.

5.3.1 Magic of the Central Limit Theorem

Let’s say we roll a fair die. We know the probability of getting 1 is 1/6. In other words, the probability that the number we get in one trial equals 1 is 1/6. Then, if we roll the same die twice, we get two numbers. We can calculate the mean of the two numbers. What is the probability that the mean equals 1? Is the probability still 1/6? No, because if the mean is 1, we have to get 1 twice, the probability of which would be 1/36 (which equals 1/6 times 1/6). Very likely, the mean we get is larger than 1. Similarly, if we roll the die three times, the mean of the three numbers we get would probably be larger than 1. If we roll the die many times (e.g., 1,000 times), it is almost impossible for the mean to be 1 or even close to 1 (since that would require getting 1 in all or most of the trials). Then what would the mean be? The mean would not be an extreme number like 1 or 6. Instead, it would be very close to the expected value of a single roll, which is 3.5, the average of all possible outcomes. Among the 1,000 trials, the number of 1s we get would be close to the number of 2s, the number of 3s, etc. If we take the average of all numbers we get in the 1,000 trials, we would get a number very close to 3.5, which equals (1+2+3+4+5+6)/6.

This is what we call the weak law of large numbers: the sample average converges in probability towards the expected value (the population average); in other words, the average of the sample gets close to the population average when the sample size is large (e.g., when we roll the die 1,000 times).
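The law of large numbers is easy to see in a quick simulation. The sketch below (plain Python; the seed and the trial counts are arbitrary choices for illustration) rolls a fair die repeatedly and shows the sample mean drifting toward 3.5 as the number of rolls grows.

```python
import random
import statistics

random.seed(42)  # fix the seed so the simulation is reproducible

def sample_mean_of_rolls(n_rolls):
    """Roll a fair six-sided die n_rolls times and return the sample mean."""
    rolls = [random.randint(1, 6) for _ in range(n_rolls)]
    return statistics.mean(rolls)

# As the number of rolls grows, the sample mean approaches the
# expected value of a single roll, 3.5.
for n in (2, 10, 100, 1000, 100000):
    print(n, round(sample_mean_of_rolls(n), 3))
```

With only 2 or 10 rolls the mean bounces around; with 100,000 rolls it sits within a few hundredths of 3.5.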

One step further from the law of large numbers, we can rely on something called the central limit theorem to make inferences. The central limit theorem suggests that the mean of a sufficiently large number of independent draws from any distribution (with finite variance) will be approximately normally distributed. A normal distribution is a bell-shaped probability distribution. From the example above, we already know the mean of a large number of draws is very close to the expected value of the population. But in most cases, the average of the draws will not exactly equal the expected value of the population (which is 3.5 in the example of rolling a fair die). The central limit theorem enables us to calculate/quantify the probability that the sample average falls into intervals around the expected value of the population. As long as the expected value and variance of a normal distribution are known, we can calculate the probability that we get a sample mean within a specified interval. For example, with some calculation based on the central limit theorem (which we will not go into in detail here), we know that if we roll a fair die 1,000 times, the chance that the mean of the 1,000 numbers we get falls between 3.394 and 3.606 is roughly 0.95 (or 95 percent).
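The 3.394–3.606 interval quoted above can be reproduced directly from the formula: take the mean and variance of a single fair-die roll, divide the variance by the number of rolls to get the variance of the sample mean, and go 1.96 standard errors to either side of the mean. A minimal sketch:

```python
import math

# Population mean and variance for one roll of a fair die.
faces = [1, 2, 3, 4, 5, 6]
mu = sum(faces) / 6                          # 3.5
var = sum((x - mu) ** 2 for x in faces) / 6  # 35/12, about 2.917

n = 1000
se = math.sqrt(var / n)  # standard error of the mean of 1,000 rolls

# 95 percent of sample means fall within 1.96 standard errors of mu.
lower, upper = mu - 1.96 * se, mu + 1.96 * se
print(round(lower, 3), round(upper, 3))  # 3.394 3.606
```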

What if, after rolling the die 1,000 times, the average of the 1,000 numbers we get is much smaller than 3.394 or much larger than 3.606? Then we may want to check whether there is some problem with the rolling process, or whether the die is fair. Similarly, if we hypothesize that the support rate for the president is 50 percent, but after interviewing 1,000 people randomly drawn from the population we find that the support rate is much lower than 50 percent, then we may doubt whether the support rate is really 50 percent. This reasoning makes sense when the sample is drawn randomly from the population. But if the sample is not drawn randomly (e.g., all the people in the sample are drawn from a gathering of a specific party), then the result does not tell us much about the support rate in the population. This is like a magician who uses tricks and gets 1 every time they roll a fair die: we cannot learn anything about the die from the mean the magician gets.

These examples show how the central limit theorem works and how it makes hypothesis testing possible. In the next part, I will explain more specifically how we can estimate the population average/expected value from what we observe in the sample, as well as how to test a hypothesis.

5.4 Estimates and Certainty

Based on the central limit theorem, we can make inferences about the population with the data we observe. One way to estimate a population parameter is the point estimate, a sample statistic used to estimate the exact value of a population parameter. We may consider the point estimate our best guess at the population parameter based on what we observe in the sample. For example, if we learn that the mean of a sample obtained by simple random sampling is 3.5, then we may say that the point estimate of the population mean is 3.5.

But in most cases, the point estimate does not exactly equal the true value of the population parameter (e.g., the population mean can be 3.5001, 3.4986, or some other number when the sample mean is 3.5). Another way to estimate the population parameter is interval estimation. With the information we learn from the sample, we can calculate an interval that may include the population average. The central limit theorem enables us to quantify how confident we are that the interval will include the population average. The interval is called a confidence interval, which defines a range of values within which the population parameter is estimated to fall. If we want to estimate a confidence interval for the population mean, we need the sample mean, the estimated population variance, and the sample size. A 95 percent confidence interval for the population mean equals \(\bar{X}\pm 1.96 * (S_{\bar{X}})\), where \(S_{\bar{X}}\) is the estimated standard error of the sampling distribution of the sample mean. It equals the standard deviation (the square root of the variance) of the population divided by the square root of the sample size.10 We can see from the formula that the width of the interval decreases when the population variance is small or the sample size is large. This makes sense intuitively: when there is little variation in the population, or when we have a large sample, the sample mean tends to be close to the population mean, and thus our estimate is more precise.
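As a sketch, here is how such an interval could be computed from a sample in Python. The data below are invented for illustration, and note that with a sample this small one would usually use a \(t\) critical value rather than 1.96; the structure of the calculation is the point.

```python
import math
import statistics

# Hypothetical sample data (made up for illustration).
sample = [3, 4, 2, 5, 4, 3, 6, 1, 4, 3, 5, 2, 4, 3, 4, 5, 3, 2, 6, 4]

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation (n - 1 denominator)
se = s / math.sqrt(n)          # estimated standard error of the sample mean

# 95 percent confidence interval: x_bar +/- 1.96 * se
ci = (x_bar - 1.96 * se, x_bar + 1.96 * se)
print(x_bar, ci)
```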

In short, we can estimate a confidence interval for the population mean based on the sample we get. Similarly, if we have a hypothesis about the population average, then we can calculate an interval into which the sample mean should fall, and quantify how confident we are that the sample average will fall into this interval.

It is intuitive that if we widen our estimated interval, we become more confident that the interval will include the population mean. The trade-off is that our estimate becomes less precise. The likelihood, expressed as a percentage or a probability, that a specified interval will contain the population parameter is called the confidence level. For example, if we learn from a random sample (with a sample size of 1,000) that the support rate for the president is 52 percent, then a 95 percent confidence interval for the support rate in the population is roughly 48.9 to 55.1 percent, and a 99 percent confidence interval is roughly 47.9 to 56.1 percent. As we can see, the confidence interval becomes wider (in other words, our estimate becomes less precise) if we want to be more confident that the population mean is within the interval we estimate (i.e., if we use a higher confidence level). More specifically, a 99 percent confidence interval for the population mean equals \(\bar{X}\pm 2.58 * (S_{\bar{X}})\).11 This interval is wider than the 95 percent confidence interval, \(\bar{X}\pm 1.96 * (S_{\bar{X}})\), and the 90 percent confidence interval, \(\bar{X}\pm 1.64 * (S_{\bar{X}})\).
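The widening of the interval with the confidence level is easy to check by recomputing it from the formula. The sketch below uses the standard error of a sample proportion, \(\sqrt{\hat{p}(1-\hat{p})/n}\), for the 52 percent example:

```python
import math

p_hat = 0.52   # sample support rate for the president
n = 1000
se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of a sample proportion

# Larger z multipliers (higher confidence levels) produce wider intervals.
intervals = {}
for level, z in [("90%", 1.64), ("95%", 1.96), ("99%", 2.58)]:
    intervals[level] = (p_hat - z * se, p_hat + z * se)
    print(level, round(intervals[level][0], 3), round(intervals[level][1], 3))
```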

5.5 Steps of Hypothesis Testing

Hypothesis testing becomes more straightforward once we understand the central limit theorem and confidence intervals. As mentioned earlier, if we have a hypothesis about the population mean, then we can calculate a confidence interval into which the sample average should fall. But if the sample average is very different from the population average we hypothesize, or in other words, falls outside the confidence interval at a specific confidence level, then we may reject the null hypothesis with a specific level of confidence. For example, if we hypothesize that a die is fair, then the expected value (or the population mean) we get from rolling the die once is 3.5. However, if we roll the die many times (e.g., 1,000 times), and the mean of all the numbers we get is 2.003, then we can be very confident in saying that the die is not fair (i.e., we will reject the null hypothesis that the die is fair).
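We can make the die example concrete with a back-of-the-envelope \(z\) statistic: under the null hypothesis of a fair die, the mean of 1,000 rolls is approximately normal with mean 3.5 and standard error \(\sqrt{(35/12)/1000}\). A sketch in Python, using the 2.003 sample mean from the example above:

```python
import math

# Null hypothesis: the die is fair, so the population mean is 3.5.
mu_0 = 3.5
var_0 = 35 / 12        # variance of a single fair-die roll
n = 1000
observed_mean = 2.003  # the sample mean from the text's example

se = math.sqrt(var_0 / n)          # standard error under the null
z = (observed_mean - mu_0) / se    # standardized distance from the null
print(round(z, 1))  # about -27.7: far beyond +/-1.96, so reject the null
```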

More specifically, there are four steps of hypothesis testing. First, we need a statement about a population parameter that can be evaluated with a test statistic. The parameter can be the population mean (e.g., the average number of basketball games Americans go to), a proportion (e.g., the support rate for the president among all U.S. voters), or some other characteristic of the population, like the variance of heights among all first-grade children. Any statement concerning the population implies a null hypothesis and an alternative/research hypothesis. The research hypothesis is the hypothesis we are putting forward to test, which reflects the substantive hypothesis. It is also called the ‘alternative hypothesis’, but some prefer ‘research’ to convey that this hypothesis comes from an understanding of the subject area and is often derived from theory. The research/alternative hypothesis is in contrast to the null hypothesis, the ‘default’ one that we wish to challenge. For example, suppose most people believe that on average individuals in the U.S. go to more than 1 basketball game annually, while we hypothesize that on average Americans go to fewer than 1 basketball game every year. Then we can set our hypothesis as the research hypothesis and the common belief as the null hypothesis.

Second, we collect a random sample, calculate the statistic from the sample, and compare the statistic with the null hypothesis and the alternative hypothesis. What kind of statistic we calculate depends on the kind of hypothesis we have and the statistical methods we use in hypothesis testing. For example, if we are interested in the population mean, then we need to calculate the mean and standard error of the sample.

Third, we decide whether to reject or fail to reject the null hypothesis. If the statistic we observe differs significantly from what we hypothesize, then we reject the null hypothesis. Otherwise, we fail to reject it. As stated earlier, in most cases what we get from the sample differs somewhat from what we state in the null hypothesis. But we reject the null hypothesis only when what we observe in the sample is really weird, i.e., significantly different from what the null hypothesis implies. What counts as weird depends on the rule we set beforehand, as well as common practice in the field. In social science, we usually adopt a fairly strict standard for rejecting the null hypothesis: in many cases, only when the sample mean falls outside the 95 percent, 99 percent, or 99.9 percent confidence interval do we reject the null hypothesis. This means that we would expect to get a result as ‘weird’ as ours less than 5 percent of the time if the null hypothesis were true. Since the probability is so low (e.g., 0.05), we reject the null hypothesis.

Finally, we interpret the result with care. We tend to be conservative and default to not rejecting the null hypothesis. Thus, failing to reject the null hypothesis does not mean the hypothesis is true; it just means that we do not have enough evidence to reject it. Similarly, rejecting the null hypothesis does not mean we have proven it false; it only suggests that we have strong evidence against it, provided that all the assumptions of the sampling process and the statistical methods we use are met (e.g., the sample is a random sample from the population).

5.6 Types of Hypothesis Testing

In our lives, we may have different types of claims or hypotheses. It can be a hypothesis about the mean of the population (e.g., the support rate for the president, the average income, etc.) or the variance of the population (e.g., the variance of people’s income). Or it can be a hypothesis concerning the difference between two groups, or a hypothesis about the correlation between two variables. Statisticians have developed different tests for different types of hypotheses. In this section, we will introduce some basic methods of hypothesis testing.

5.6.1 Single Mean Hypothesis Testing

Single-mean hypothesis testing concerns the mean of the population we care about. In many cases, we are interested in the population average. For example, in an election, we may want to know the support rate for a specific candidate, which is important for the development of campaign strategy. We may hypothesize that the support rate for the candidate is a specific number, and we can test the hypothesis with a random sample from the population. If the support rate for the candidate in the sample is very different from the rate stated in our hypothesis, we may reject the hypothesis. If the rate we get from the sample is not very different from the number stated in the hypothesis, we fail to reject it.

Here is how a single-mean hypothesis test works. As we have discussed, the central limit theorem suggests that the mean of a random sample with a sufficiently large sample size is normally distributed. The normal distribution of the sample mean is an example of a sampling distribution, which is a theoretical distribution of all possible sample values of the statistic in which we are interested. For example, when we have a sample (with a sample size of 1,000), we can calculate the sample mean. If we did the sampling many times (e.g., 1 million times), we would get 1 million samples and 1 million sample means (each sample still has 1,000 cases). From the central limit theorem, we know that the 1 million sample means follow a normal distribution. This distribution is the sampling distribution of the sample mean, for samples with a sample size of 1,000.

If we get a simple random sample (explained in the previous chapter), the expected value of the sampling distribution of the mean equals the population mean, and the variance of the sampling distribution is determined by the population variance and the size of the sample. When there is less variation among the population, or we have a larger sample, the variance of the sampling distribution is smaller, which means the sample mean is expected to be closer to the population mean.

Since the sampling distribution of the mean is a normal distribution, we can calculate the probability that the sample mean falls into a specific range given that the null hypothesis is true. If the sample mean we get is very different from the hypothesized population mean, we may suspect there is some problem with the null hypothesis and reject it. Statisticians know a great deal about the normal distribution: if we randomly draw a number from a normal distribution, we have roughly a 95 percent chance of getting a number within two (or more accurately, 1.96) standard deviations (the square root of the variance) of the expected value of the distribution. Since the sampling distribution of the sample mean is a normal distribution, the chance that the sample mean we observe lies more than two standard deviations of that distribution away from its expected value is roughly 5 percent. Thus, if we observe a difference between the sample mean and the hypothesized population mean that is larger than twice the standard deviation of the sampling distribution, we may reject the null hypothesis at the 0.05 significance level (i.e., the 95 percent confidence level). It would be weird (less than a 5 percent chance) to get a sample mean as extreme as the one we have if the null hypothesis were true, so we decide to reject the null hypothesis. We can also set a stricter standard (e.g., a significance level of 0.01 or 0.001) and reject the null only when the difference between the sample mean and the hypothesized population mean is more extreme.
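The whole procedure can be wrapped in a small helper. The sketch below (plain Python; the sample of 0/1 responses is invented for illustration) computes the \(z\) statistic for a hypothesized mean and applies the 1.96 cutoff:

```python
import math
import statistics

def single_mean_z_test(sample, mu_0):
    """Return the z statistic and a reject/fail-to-reject decision at the
    0.05 significance level, using the normal approximation."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error
    z = (x_bar - mu_0) / se
    return z, abs(z) > 1.96

# Hypothetical poll: 430 of 1,000 respondents support the candidate.
# Null hypothesis: the population support rate is 50 percent.
sample = [1] * 430 + [0] * 570
z, reject = single_mean_z_test(sample, 0.50)
print(round(z, 2), reject)
```

Here the sample mean (0.43) is several standard errors below 0.50, so the null hypothesis is rejected.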

5.6.2 Difference of Means Hypothesis Testing

Sometimes we are interested not in the mean of a single group but in the difference in means between two groups. Testing the difference of means is especially useful when we aim to make causal inferences with an experiment. It can also be useful when we compare two groups without aiming to make causal inferences. For example, in an election, especially one under a majoritarian system, we may be interested in whether one candidate has a higher support rate than another. In this case, we are dealing with a hypothesis concerning the difference of means. The hypothesis may take the form \(A>B\), \(A<B\), \(A=B\), or \(A-B=c\). If our research hypothesis is \(A>B\), the null hypothesis would be \(A\leq B\). Then we test the hypothesis with what we observe in the random sample. For example, if the null hypothesis is that Candidate A’s support rate is at least as high as Candidate B’s, and we get a random sample in which Candidate A’s support rate is much lower than Candidate B’s, then we may reject the null hypothesis.

Similar to the single-mean test, testing a difference-of-means hypothesis requires the standard deviation of the sampling distribution. We observe the difference in means between the two samples (groups), and then compare the difference with the standard deviation of the sampling distribution of that difference. If the difference is much larger than (e.g., more than two times) the standard deviation, then we may reject the null hypothesis that there is no difference between the two groups and suggest that there is a statistically significant difference between them.
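A minimal sketch of this comparison, using the normal approximation with unequal variances (the treatment/control outcomes below are invented for illustration, not real data):

```python
import math
import statistics

def diff_of_means_z(sample_a, sample_b):
    """z statistic for the difference of two sample means
    (normal approximation, unequal variances)."""
    se = math.sqrt(statistics.variance(sample_a) / len(sample_a)
                   + statistics.variance(sample_b) / len(sample_b))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / se

# Hypothetical experiment: 1 = approves, 0 = does not approve.
treatment = [1] * 320 + [0] * 180  # 64 percent approve
control = [1] * 250 + [0] * 250    # 50 percent approve

z = diff_of_means_z(treatment, control)
print(round(z, 2), abs(z) > 1.96)
```

The 14-point gap is more than four standard errors wide, so the null hypothesis of no difference would be rejected.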

5.6.3 Regression Coefficients Hypothesis Testing

In other cases, we are interested not only in describing the population but also in analyzing correlations between different variables in the population. We may want to test whether two characteristics or variables in the population are correlated with each other. To test such correlations, we may put the variables into a regression model, which we will discuss further in later chapters on regression. Here we briefly explain how testing regression coefficients works.

A bivariate regression model takes the following form: \[Y=\beta_{0} + \beta_{1} X + \epsilon\] where \(\epsilon\) is an error term. If there is no correlation between a variable \(X\) and another variable \(Y\), then changes in \(X\) will not be associated with changes in \(Y\). Thus, \(\beta_{1}\) in the regression model should be 0, which implies that the value of \(Y\) does not change with the value of \(X\). When we do the hypothesis testing, the null hypothesis is that the coefficient is 0. Then we fit the regression model to the data we get from a random sample. The model provides us an estimate of the coefficient. Then we run statistical tests (e.g., a \(t\) test, which compares the estimate with its standard error) to see whether the estimated coefficient differs significantly from 0. If it does, we may reject the null hypothesis and suggest that there may be some correlation between \(X\) and \(Y\).

5.6.4 Conclusions you can draw based on the type of test

Based on the type of test we conduct, we may draw certain types of conclusions. For example, with the single-mean test, we may reject the null hypothesis that the mean is a specific number or within a specific interval. With the difference-of-means test, we may reject the null hypothesis that there is no difference between two groups. Based on the test of a regression coefficient, we may reject the null hypothesis that there is no correlation between two variables. But as stated above, in many cases we may fail to reject the null hypothesis. This does not suggest the null hypothesis is true, but only that we do not have strong enough evidence to reject it.

5.7 Applications

Single-mean hypothesis testing is very straightforward and one of the basic tools in social science research. Once we have a random sample and compute the sample mean and sample variance, we can easily estimate a confidence interval for the population mean, e.g., public opinion on a specific policy. Then we can compare the null hypothesis with the sample mean or the confidence interval, and decide whether or not to reject the null hypothesis. The main challenge in such descriptive work is not statistical theory or method per se, but the sampling process. As we emphasized earlier in this chapter, to make inferences about the population from a sample, we first need a random sample from the population; otherwise it is like trying to make inferences from a magician’s tricks. But it is extremely difficult to get a truly random sample in real life. Many factors, like non-response, lack of access to specific groups, and financial and time constraints, make it unlikely that we get a perfect random sample from the population. Researchers have tried different techniques to get a representative and random sample. One way to test whether a sampling method is reliable is to compare the findings obtained with the new technique against census data or other authoritative data. Ansolabehere and Schaffner (2014) compare three sampling techniques (over the Internet, by telephone with live interviews, and by mail) with other data sources. Comparing the confidence interval estimated from the sample with a validating source gives us some indication of whether the sampling process provides a good enough (though not perfect) sample.

Testing hypotheses concerning differences of means and regression coefficients is even more widely used in political science. In most studies in political science nowadays, researchers care about correlations or causal relations between different variables. Different methods, like regression and experiments, have been developed to explore the relations between different variables in the world, e.g., democracy and economic growth (Boix and Stokes 2003), social networks and welfare provision (Tsai 2007), media framing strategy and public opinion (Bonilla and Mo 2018), etc. In these works, which aim to explore relations between variables, we often have a null hypothesis that there is no correlation between two variables, and researchers aim to find strong evidence to reject the null hypothesis.

More specifically, in an experiment, the null hypothesis is often that there is no difference between the treatment group and the control group. If we find a statistically significant difference in means between the treatment and control groups, we may reject the null hypothesis and suggest that there is some difference between the two groups. And since the two groups differ only in whether they received the treatment, researchers may suggest that the treatment is the cause of the difference between the two groups. Here is an example of an experiment. As some may know, general support for aid to foreign countries is low among U.S. citizens. This is a descriptive finding. But what explains the low support? Some researchers (Scotto et al. 2017) suggest that one reason is that people in the United States and other developed countries tend to overestimate the percentage of the government budget spent on overseas aid. To test this research hypothesis, they designed an experiment in the United States and Great Britain, in which one group of people (the control group) was told the amount of dollars/pounds spent on foreign aid each year, and the other group (the treatment group) was told the amount of money as well as the percentage of the government budget spent on overseas aid. They then asked the two groups about their opinions on foreign aid and tested the difference in means between the two groups. They found that the group informed of the percentage as well as the amount of overseas aid was less likely to think that the government had spent "too much" on foreign aid. The difference is statistically significant at the 99 percent confidence level, which enables them to reject the null hypothesis that there is no difference between the two groups and to argue that overestimating the percentage of the budget spent on aid is one cause of the low support for foreign aid.

In many cases, we cannot randomly assign people to different groups and change the treatment they get. Other techniques, like regression discontinuity designs (RDD), may be used to test whether there are differences between groups that were similar before the treatment. For example, some researchers are interested in whether advantaged individuals may come to see the world through the lens of the poor after engagement with disadvantaged populations (Mo and Conn 2018). To study this, they surveyed top college graduates who were accepted into the Teach For America program and those who were not. The former group had selection scores just above the threshold and the latter group had scores falling just short of it. Since the two groups differed only slightly in their scores, it may be reasonable to assume that they were similar to each other, and we can then see whether the experience in the program changed how the students view the world.

When we use regressions based on observational data instead of experiments, the idea of hypothesis testing is similar. Researchers often have a null hypothesis that the coefficient for a specific variable \(X\) is 0, which implies no correlation between the explanatory variable \(X\) and the outcome variable \(Y\). If from the sample we find that the estimated coefficient differs significantly from 0, then we may decide to reject the null hypothesis and suggest that there is some correlation between \(X\) and \(Y\). Whether the correlation implies a causal relation requires a closer look at the research design, and is not something hypothesis testing itself can tell us. For example, one study explores the correlation between anti-Muslim hostility and support for ISIS in Western Europe: on Twitter, ISIS followers located in constituencies with high vote shares for far-right parties are more likely to express support for ISIS. But the correlation does not necessarily mean that anti-Muslim hostility causes the support, and thus the researcher looks more closely at the tweets before and after major events related to ISIS to show that the support is indeed linked to anti-Muslim hostility (Mitts 2019). Another example is from the field of American politics: a researcher tests whether people whose family members are arrested or incarcerated become mobilized to vote (White 2019).

5.8 “Is it weird?”

The idea of hypothesis testing can be formulated as an "Is it weird?" question. We start from a hypothesis concerning the population, observe the data from a sample, and then ask ourselves, as someone with training in statistical methods: "Is it weird that we get a sample like this if the null hypothesis is true?" If it is weird (i.e., statistically unlikely), then we reject the null hypothesis. Otherwise, we decide not to reject the null hypothesis, though that does not mean we prove or accept it.

5.9 Broader significance/use in political science

The Neyman-Pearson paradigm of hypothesis testing may seem a bit obscure if we have not gone through the ideas behind it. Students without a firm understanding of the underlying statistical theory may make mistakes when interpreting the results of hypothesis testing. In recent years, there have been heated discussions about whether we should continue using this paradigm and its associated jargon, e.g., the \(p\) value, statistical significance, and so on (Ziliak and McCloskey 2008; Amrhein, Greenland, and McShane 2019). One concern with this paradigm is whether we should set a threshold value (e.g., the confidence level of 95 percent) for rejecting the null hypothesis and declaring a statistically significant correlation once the threshold is met, since this may mislead someone without much training in statistical methods into thinking that we are more than 95 percent confident that the alternative hypothesis is true.12 Another concern is that the paradigm of hypothesis testing may not tell us much about substantive relationships. When the sample size is very large, it may be very easy to reject the null hypothesis and claim that one variable has a statistically significant correlation with another, even though the effect or correlation is trivial.13 Besides, the paradigm may create the problem of publication bias: researchers and journal editors may tend to report findings that show statistically significant correlations, but not findings that fail to show them. This may bias our understanding of the world.

Beyond these concerns, it is not very clear how the Neyman-Pearson paradigm of hypothesis testing works for studies that do not involve random sampling. For example, when we have a sample that is not randomly drawn from the population, we cannot use it to test a hypothesis concerning the population. And if we have access to information on every unit of the population (e.g., if the unit of interest is the country, then in many cases we have access to the whole population once we collect the relevant information for all countries in the world), it is less clear what hypothesis testing means and what the method introduced above tells us about the population.

Other paradigms of hypothesis testing, like the Bayesian approach, may provide more intuitive ways to understand and explain hypothesis testing and quantitative results to new learners and the general public. But these paradigms are not necessarily incompatible with the paradigm introduced in this chapter. The main point is that when we use this approach to hypothesis testing, we should be clear about what each step and each result mean, and about what we can and cannot say with our findings.

5.10 Conclusion

Hypothesis testing is a basic tool in contemporary political science, especially in quantitative research. In the following chapters, we will introduce specific methods for exploring the relationships between different variables in our society. Hypothesis testing is the basic idea behind most of these methods, and understanding how it works will make it easier to understand experiments, large-N analysis, and other quantitative methods.

5.11 Application Questions

  1. Before an election, a political analyst argues that the support rate for a candidate is above 60 percent. With a sample from all voters (assuming the sample is a random one), researchers find that the 95 percent confidence interval of the support rate for the candidate is between 56.2 percent and 58.9 percent. Does this provide strong evidence that the analyst is wrong? Why or why not?

  2. In an experiment, 80 students are randomly divided into two groups. The first group of students is asked to read a news article on the negative effects of climate change on peasants in developing countries, and the other group is asked to read an article on a new electronic device. Then both groups of students are asked about their opinions on the role of the United States in fighting climate change. Researchers find that, compared to the second group, the first group shows slightly higher support for the U.S. government taking more responsibility in fighting climate change, but the difference is not statistically significant at the 95 percent level. Does this mean that reading the news article on climate change has no effect on students’ opinions about U.S. responsibility in fighting climate change? Why or why not?

  3. A student is interested in the average number of courses Northwestern undergrads took last quarter. In total, there were 8,231 Northwestern undergrads last quarter. With a random sample of 196 NU undergrads, she learned that on average, a student took 4.0 courses last quarter. From the sample, she estimated that the population variance is 1.21. Can you calculate a 95 percent confidence interval for the average number of courses Northwestern undergrads took last quarter?

5.12 Key Terms

  • Central Limit Theorem

  • confidence interval

  • mean

  • null hypothesis

  • population

  • population parameter

  • point estimate

  • quantitative data

  • random sample

  • regression coefficient

  • research hypothesis

  • sample

  • standard deviation

  • standard error

  • statistically significant difference

5.13 Answers to Application Questions

  1. Yes. This provides strong evidence that the analyst is wrong. The confidence interval suggests that we are 95 percent confident that the support rate among the population is no higher than 58.9 percent and no lower than 56.2 percent. Since the analyst’s prediction (higher than 60 percent) lies well beyond the confidence interval we calculated from the random sample, we can be quite confident the prediction is wrong. But this conclusion rests on assumptions: that the sample is a random one, that respondents in the survey reveal their true preference for the candidate, and so on. If these assumptions are not met, the sample does not tell us anything about the population, and we cannot tell whether the analyst is right or wrong.

  2. Finding no statistically significant difference between the two groups means we fail to reject the null hypothesis, which is that there is no difference between the two groups. However, this does not tell us that the null hypothesis is true. We can only say that, based on this one study, we do not find enough evidence to show a difference between the two groups; we cannot say the difference is exactly 0.

  3. A 95 percent confidence interval is \(\bar{X}\pm 1.96 * (S_{\bar{X}})\). The sample mean is 4.0. The estimated standard error of the sampling distribution equals the square root of the estimated population variance divided by the square root of the sample size, which is \(\sqrt{1.21}/\sqrt{196}=1.1/14\approx 0.0786\). Thus the 95 percent confidence interval is \(\bar{X}\pm 1.96 * (S_{\bar{X}}) = 4.0\pm 1.96* 0.0786= [3.846, 4.154]\).
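The arithmetic in the third answer can be checked with a few lines of Python:

```python
import math

# Numbers from Application Question 3
mean = 4.0        # sample mean number of courses
variance = 1.21   # estimated population variance
n = 196           # sample size

# Standard error = sqrt(variance) / sqrt(n) = 1.1 / 14
se = math.sqrt(variance) / math.sqrt(n)

# 95 percent confidence interval: mean +/- 1.96 standard errors
low, high = mean - 1.96 * se, mean + 1.96 * se
print(round(se, 4), round(low, 3), round(high, 3))  # 0.0786 3.846 4.154
```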

References

Amrhein, Valentin, Sander Greenland, and Blake McShane. 2019. “Scientists Rise up Against Statistical Significance.” Nature 567: 305–7.

Ansolabehere, Stephen, and Brian F Schaffner. 2014. “Does Survey Mode Still Matter? Findings from a 2010 Multi-Mode Comparison.” Political Analysis 22 (3): 285–303.

Boix, Carles, and Susan C Stokes. 2003. “Endogenous Democratization.” World Politics 55 (4): 517–49.

Bonilla, Tabitha, and Cecilia Hyunjung Mo. 2018. “Bridging the Partisan Divide on Immigration Policy Attitudes Through a Bipartisan Issue Area: The Case of Human Trafficking.” Journal of Experimental Political Science 5 (2): 107–20. https://doi.org/10.1017/XPS.2018.3.

Mitts, Tamar. 2019. “From Isolation to Radicalization: Anti-Muslim Hostility and Support for Isis in the West.” American Political Science Review 113 (1): 173–94. https://doi.org/10.1017/S0003055418000618.

Mo, Cecilia Hyunjung, and Katharine M. Conn. 2018. “When Do the Advantaged See the Disadvantages of Others? A Quasi-Experimental Study of National Service.” American Political Science Review 112 (4): 721–41. https://doi.org/10.1017/S0003055418000412.

Rice, John. 2007. Mathematical Statistics and Data Analysis. Belmont, CA: Thompson/Brooks/Cole.

Scotto, Thomas J., Jason Reifler, David Hudson, and Jennifer vanHeerde-Hudson. 2017. “We Spend How Much? Misperceptions, Innumeracy, and Support for the Foreign Aid in the United States and Great Britain.” Journal of Experimental Political Science 4 (2): 119–28. https://doi.org/10.1017/XPS.2017.6.

Tsai, Lily L. 2007. “Solidary Groups, Informal Accountability, and Local Public Goods Provision in Rural China.” American Political Science Review 101 (2): 355–72. https://doi.org/10.1017/S0003055407070153.

White, Ariel. 2019. “Family Matters? Voting Behavior in Households with Criminal Justice Contact.” American Political Science Review 113 (2): 607–13. https://doi.org/10.1017/S0003055418000862.

Ziliak, Steve, and Deirdre Nansen McCloskey. 2008. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press.


  1. This may be a bit confusing, but you may think of it this way. Say we hypothesize that the average height of all Northwestern undergrads is 5.7 feet. If we do the hypothesis testing as we will learn in this chapter, we will not reject the null hypothesis unless we get a random sample whose average height is much higher or much lower than 5.7 feet. In many cases, we will not reject the hypothesis. However, how likely is it that the hypothesis is true, even if we do not reject it? Almost 0, because the exact average height can be any number slightly different from 5.7 feet, e.g., 5.700001 or 5.697382. As a result, the hypothesis is almost always wrong, but we do not always reject it. Thus, whether we reject the hypothesis or not does not tell us whether it is true or false, nor does it tell us the probability that it is true.↩︎

  2. We have 1.96 in the formula because statisticians tell us that if we randomly draw a number from a normal distribution, we have a 95 percent chance of getting a number no more than 1.96 standard errors above or below the mean of the distribution.↩︎

  3. We have 2.58 in the formula because if we randomly draw a number from a normal distribution, we have a 99 percent chance of getting a number no more than 2.58 standard errors above or below the mean of the distribution.↩︎

  4. As I have tried to explain, the level of significance is not the probability that the research hypothesis is true.↩︎

  5. For example, the finding that a 1-million-dollar investment in one student’s education increases her annual income by 100 dollars after graduation may be statistically significant, but the effect is far too small to indicate any substantive relationship.↩︎
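The 1.96 and 2.58 multipliers discussed in the footnotes above can be verified numerically using the standard normal cumulative distribution function, which Python’s `math.erf` makes easy to compute:

```python
import math

def coverage(z):
    """Probability that a standard normal draw falls within z standard
    deviations of the mean, via the normal CDF Phi(z) = (1 + erf(z/sqrt(2)))/2."""
    phi = (1 + math.erf(z / math.sqrt(2))) / 2
    return 2 * phi - 1

print(round(coverage(1.96), 4))  # approximately 0.95
print(round(coverage(2.58), 4))  # approximately 0.99
```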