Let's explain the above table. If you had to judge the probability of next year being hotter than this year on the basis of only one event (N = 1), say this year being hotter than last year, you would say there is a 50/50 chance, so the probability would be 0.5. What is the chance of seeing five years in a row in which each succeeding year is hotter than the previous one? Fairly low: as the table shows, it works out to about 3 percent (\(0.5^5 \approx 0.03\)). In other words, there is a 97% chance that such a run would not occur if the years behaved randomly rather than systematically. How about ten years in a row? That probability is very low, about 0.1%, or roughly a one in a thousand chance (\(0.5^{10} \approx 0.001\)). But that is only under the assumption of randomness. If such runs do occur anyway, something systematic is most likely going on, and you would most likely be right to conclude that the next year is likely to be hotter than this year as well.
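If you would like to check these figures yourself, the following minimal sketch in R simply reproduces the arithmetic behind the table, under the assumption that each year is independently hotter or cooler than the last with probability 0.5.

\begin{verbatim}
# Probability of a run of N consecutive "hotter" years when each year is
# independently hotter or cooler with probability 0.5 (the random scenario)
p_run <- function(n) 0.5^n
p_run(5)    # about 0.031, i.e. roughly 3%
p_run(10)   # about 0.00098, i.e. roughly 1 in 1,000
\end{verbatim}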
This brings us to the concepts of sample size and sampling. You saw that if you were to base your decision on a single sample (such as just one year of data), your prediction would be much weaker than if you based it on many trials or many years of experience. This is the issue of estimation. If you want to estimate an effect, or say you want to study the prevalence of a condition (how common a disease is in society), then you need a large enough sample of data. But how much is too much and how much is too little? This is why you conduct sample size estimation. There are several calculators that you can use online, and you can use sample size procedures in statistical programming environments such as Stata or R. You can use the free, open-source, web-based sample size calculator at OpenEpi\cite{menu} here:
http://www.openepi.com/Menu/OE_Menu.htm
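For readers who prefer to compute a sample size directly in R rather than through the web calculator, the sketch below uses the standard formula for estimating a single prevalence, \(n = z^2 p (1 - p) / d^2\); the anticipated prevalence and margin of error shown are illustrative assumptions only, not recommendations.

\begin{verbatim}
# Sample size for estimating a prevalence with a given margin of error,
# assuming simple random sampling; the inputs below are illustrative.
p <- 0.5            # anticipated prevalence (0.5 is the most conservative guess)
d <- 0.05           # desired margin of error (plus or minus 5 percentage points)
z <- qnorm(0.975)   # about 1.96 for 95% confidence
n <- ceiling(z^2 * p * (1 - p) / d^2)
n                   # about 385 participants, before allowing for non-response
\end{verbatim}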
Sampling refers to the process of selecting a representative set of data that resembles, as closely as possible, the original population. You run all your estimations on the sample and then "extrapolate" your conclusions as to what may happen in the population. When you do sampling, you will need to keep in mind that the characteristics of your participants should be distributed as closely as possible to those of the population. Different investigators, depending on the specific research question, adopt different strategies to select the samples for their studies. Some investigators use simple random sampling; others divide their population into groups they call clusters, select clusters at random, and then select participants from within those clusters or blocks. Other researchers use weights to sample their participants. Say you are studying within New Zealand and you know that in some areas, such as the South Island, the Maori population will be less well represented than in the North Island. You may want to weight your sampling so that participants from the Maori population are given greater weights than those from other ethnicities.
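To make the distinction between these strategies concrete, here is a small, hedged sketch in base R of simple random sampling and of weighted sampling; the population data frame, the proportion of Maori participants, and the weights are all hypothetical and chosen only for illustration.

\begin{verbatim}
# A hypothetical sampling frame of 10,000 people with an ethnicity marker
set.seed(2024)
population <- data.frame(
  id        = 1:10000,
  ethnicity = sample(c("Maori", "Other"), 10000,
                     replace = TRUE, prob = c(0.15, 0.85))
)

# Simple random sampling: every person has the same chance of selection
srs <- population[sample(nrow(population), 500), ]

# Weighted sampling: give Maori participants a higher chance of selection
w <- ifelse(population$ethnicity == "Maori", 3, 1)
weighted_sample <- population[sample(nrow(population), 500, prob = w), ]

table(srs$ethnicity)
table(weighted_sample$ethnicity)   # Maori will be over-represented by design
\end{verbatim}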
When you read research, pay close attention to how the individuals in the study were sampled. If the authors mention that they used simple random sampling, how did they actually do the sampling? From where did they select their participants? You can be reasonably assured that they used random selection if, by reading their research methods, you learn that they used a random numbers table to select their participants. Otherwise, you may need to think about the ways in which their sample selection could have been biased.
We are getting into the issue of sampling bias. When you read research reports, also pay attention to where the authors or researchers obtained their participants. If you wanted to study people's attitudes towards consumption of fruits and vegetables, and you only sampled individuals from affluent neighbourhoods and from the malls where people buy their fruits and vegetables, you might end up excluding people from poorer neighbourhoods who may not consume much fruit and vegetables, or who cannot afford to purchase them; your study would be biased simply because you selected a biased set of people to answer your survey. These problems point to the fact that your sample is not representative enough. In occupational health studies, researchers frequently conduct cross-sectional surveys on factory shop floors. If they survey in this way, and only once, they may end up oversampling workers who were healthy enough to attend the workplace that day and miss the sicker workers. This bias is referred to as the "healthy worker effect".
We wrote about the probability of occurrence of phenomena and about finding rare events. But how rare is rare? Is there a borderline? What would be ways of thinking about some of these things? For variables such as height or weight that approximately follow a normal (Gaussian) distribution, you can use the distribution to map out the difference between what is expected and what lies outside normal expectation. For example, let's say you are interested in finding out the average height of year 7 students in the schools of Canterbury, and for this purpose you want to conduct a survey of a sample of 100 school children from all over Christchurch. As you do so, you assume:
  1. The school children in Christchurch are a representative sample of all school children in Canterbury for that age group
  2. With a sample of 100 school children, you will be able to estimate the average height of children for all of Canterbury
  3. The heights of the school children follow a normal distribution.
All of these are fair approximations, and this is the good news: your measurement of the height of 100 school children in Christchurch will approximate the population average of all school children in Canterbury, but you also know that it will be off by a margin. This is the point of the central limit theorem, which states that the sample mean (m) estimates the population mean \(\mu\), and that the precision of that estimate is given by the standard error of the mean, \(SE = sd/\sqrt{N}\), where sd is the sample standard deviation (which in turn estimates the population standard deviation \(\sigma\)) and N is the sample size. Let's say that after conducting the survey you find that the average height was 150 cm and the standard deviation was 50 cm; then your standard error of the mean would be \(50/\sqrt{100} = 5\). How do we interpret this? The true population average would lie somewhere between \(150 - 1.96 \times 5 = 140.2\) cm and \(150 + 1.96 \times 5 = 159.8\) cm, with the best estimate being 150 cm. The magic number of 1.96 comes from the standard normal (z) distribution: it is the z score that leaves 2.5% of the distribution in each tail, so that 95% of the distribution lies within 1.96 standard errors of the mean. In other words, estimates more extreme than this band would be expected less than 5% of the time if the assumed distribution were correct. This idea, that values falling outside the central 95% of a distribution are "outliers" or statistically significantly different from the rest, is at the root of the convention of treating a p-value of 0.05 or lower as significant. We will review this concept again in the following section on reasoning by abduction, but it is well worth keeping in mind. The interval of 140.2 to 159.8 cm also tells us that if your survey were repeated 100 times, then in about 95 out of those 100 surveys the estimated average height of the children would be some figure between 140.2 and 159.8 cm, with the most likely figure around 150 cm. This concept is referred to as the 95% confidence interval, and we will review it in the section on statistical inference.
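As a quick sanity check of the arithmetic above, here is a minimal sketch in R using the same illustrative summary figures (mean 150 cm, standard deviation 50 cm, N = 100); it reproduces the 95% confidence interval for the mean described in the text.

\begin{verbatim}
m <- 150                      # sample mean height in cm
s <- 50                       # sample standard deviation in cm
n <- 100                      # sample size
se <- s / sqrt(n)             # standard error of the mean: 5 cm
ci <- m + c(-1, 1) * qnorm(0.975) * se
ci                            # approximately 140.2 to 159.8 cm
\end{verbatim}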
For now, here is what reasoning by inductive logic tells us. We can still use the framework of the standard form, where the premises are facts or carefully made, valid and reliable observations, and the conclusion takes the form of a generalisable statement. However, there are three things that we need to keep in mind:
  1. First, the generalisable statement, together with the probability we attach to it on the basis of the observations, must make sense. If it does, and if the observations genuinely license that kind of statement, then the probability of our generalised conclusion is high. An argument whose generalised conclusion is made highly probable by its premises is deemed a strong argument.
  2. Second, the value we assign to that probability estimate is called the "inductive force" or "inductive strength" of the argument. If both of these conditions are met (that is, the argument is both strong AND has high inductive force), then such an argument is referred to as a cogent argument.
  3. Third, even though past experience may guide us in framing our generalisable statement (albeit with a certain probability estimate), the mere fact that we experienced some events in the past by no means implies that the future will continue to be similar. We can encounter exceptions, events, or observations that do not fit the generalised pattern, and this is the basis for refuting the generalisation. David Hume described this as the "problem of induction" (see https://plato.stanford.edu/entries/induction-problem/)\cite{philosophy}. Nassim Nicholas Taleb (2007) expanded on Hume's observation and statistically modelled unusual or exceptional events that do not match expectations based on past observations\cite{taleb2007black}. Such events are referred to as black swan events, after the lore that no black-coloured swans were known in Europe and that black-coloured swans were therefore thought impossible, until a Dutch expedition in the Southern Hemisphere identified them in the late seventeenth century. An example of a black swan event in the health sciences: in 2004, an outbreak of Vibrio parahaemolyticus gastroenteritis occurred in people who ate oysters harvested from Alaskan waters. The event qualified as a black swan event in environmental health sciences because it was considered unthinkable at the time that such an outbreak could arise from shellfish grown in cold Alaskan waters; it was later identified that warming of the coastal waters had allowed the organism to flourish\cite{mclaughlin2005outbreak}.
To recapitulate, reasoning based on inductive methods starts with observations of events and phenomena and then generalises from these observed events, either in the form of predictions or general statements of truth. These generalised statements take the form of probabilistic estimates of what is possible based on the observations. The extent to which the generalisation is genuinely supported by the observations determines the strength of the argument; the actual probability attached to the generalised conclusion is referred to as its inductive force; and an argument that is both strong and has high inductive force is referred to as a cogent argument. Cogent arguments in this sense are similar to sound arguments in the deductive reasoning we reviewed earlier. That said, one observation to the contrary is enough to refute a general pattern, which is essentially based on conjecture. We will review this in detail in the third aspect of reasoning: reasoning by abduction.

Reasoning by abductive logic: explanation, theory, and hypotheses

Thus far, we have discussed that when we read and review research reports, we pay attention to the individual arguments and attempt to evaluate their logical consistency and force. We test arguments for their logical validity by constructing a standard form where we list the premises that contribute to the final conclusion (one argument, one final conclusion) and evaluate them. We also use reasoning by inductive logic to list premises in the form of systematic observations of phenomena, and from these observations we arrive at a set of generalisations; these generalised statements are then expressed in the form of probabilistic arguments. While these approaches allow for evaluation of individual arguments, they do not account for the reason behind the patterns; they cannot answer the questions, "Why do we see the pattern that we see? Why or how do such events occur?"
The answer to these questions comes from a set of principles referred to as "abductive reasoning". In contrast with deductive reasoning, which deals with the structure and validity of arguments, and inductive reasoning, which puts a probabilistic perspective of a generalised truth on the patterns of occurrence of phenomena, abductive reasoning answers the question of what explains the occurrences. Abductive reasoning is based on three related concepts: explanation, theory building, and testing of hypotheses.
Explanation refers to a low-level, detailed account of the phenomena that are observed. This follows directly from the probabilistic arguments in inductive reasoning. A higher-level abstraction of the explanation is a theory, which places a general concept over the explanations. A theory is a good theory if it can explain EVERY OBSERVATION by invoking some general principles; for a theory to be successful, it MUST account for every observation related to an explanation. The theory must then be tested for its robustness by searching for two things: observations or counter-examples that refute it, and rival theories that could explain the same observations more simply.
(Note: in health care settings, the principle of Occam's razor states that, of two rival theories, the one with the fewest assumptions and parameters is the simpler theory and is therefore to be preferred; in contrast, a rival principle often invoked in medicine for diagnosis is Hickam's dictum, which states that many clinical presentations have more than one cause, as with Saint's triad (hiatus hernia, gall bladder stones, diverticulosis), a recognised co-occurrence of conditions; see the Wikipedia entry at https://en.wikipedia.org/wiki/Hickam%27s_dictum.)
Essentially, you start with observations, then frame a theory that takes into account ALL the facts, and then search for other facts or situations that refute the theory. The way to do this is to use the theory to predict specific scenarios, or to set up hypotheses that are rivals of each other. The hypothesis that follows directly from the theory is referred to as the alternative hypothesis, and the hypothesis that negates the alternative hypothesis and preserves the status quo is referred to as the null hypothesis. After stating the null and the alternative hypotheses, the researcher collects data and tests whether the null hypothesis can be rejected so that the alternative hypothesis holds. The researcher does this by estimating the probability that observations like theirs (or more extreme) would arise if the null hypothesis were true. If that probability is low, the null hypothesis is rejected in favour of the alternative hypothesis. Convention states that the probability value at which one can reject the null hypothesis is 5% or lower. This probability estimate is referred to as the p-value and is quoted in research reports. Besides the p-value, the researcher also estimates the boundaries of the effect estimate using a 95% confidence interval. The 95% confidence interval gives the range within which the effect estimate is expected to lie: if the study were conducted 100 times over, in about 95 of those 100 iterations the effect estimate would fall within such an interval. If the confidence interval includes the null value, it is reasonable not to reject the null hypothesis. Let's take a look at an example.
Example: in the late 1980s, physicians observed patients from specific areas of the West Bengal state of India with skin lesions characteristic of exposure to inorganic arsenic. At the time, the physicians developed a theory that the exposure to inorganic arsenic was through consumption of drinking water. Investigators set up a case-control study with people with and without skin lesions and assessed arsenic concentrations in their drinking water. They found that those with skin lesions had higher concentrations of inorganic arsenic in their drinking water\cite{haque2003arsenic}.
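To illustrate the mechanics described above, here is a hedged sketch of how case-control data of this kind might be analysed in R. The counts in the 2 x 2 table are entirely hypothetical and are NOT the figures from the cited study; the point is only to show how the p-value is compared against the 5% convention and how the 95% confidence interval for the odds ratio is checked against the null value of 1.

\begin{verbatim}
# Hypothetical case-control counts: exposure is arsenic in drinking water
# above versus below some cut-off concentration
tab <- matrix(c(60, 40,    # cases (skin lesions): high vs low arsenic
                25, 75),   # controls (no lesions): high vs low arsenic
              nrow = 2, byrow = TRUE,
              dimnames = list(group    = c("cases", "controls"),
                              exposure = c("high arsenic", "low arsenic")))

fisher.test(tab)
# The output gives a p-value (compare against 0.05) and a 95% confidence
# interval for the odds ratio (check whether it includes the null value of 1)
\end{verbatim}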

Correlation, causation, inference, p-values, confidence interval

Let's recapitulate the story so far. Consider the following scenario.
P1: We have observed from geological sources that arsenic can be dissolved in groundwater
P2: If humans are exposed to inorganic arsenic, then they have a high chance of developing characteristic skin lesions
P3: Some humans were exposed to inorganic arsenic through occupational sources
P4: They developed skin lesions and other problems.
P5: Some people are exposed to arsenic in their drinking water
Implicit Assumption: Inorganic arsenic acts in the same way whether the exposure occurs through inhalation or through ingestion (eating or drinking)
Intermediate conclusion: People who are exposed to arsenic in drinking water will have skin lesions
P6: Some people have developed skin lesions
C: While not absolute, there is a reasonable probability that these people were exposed to inorganic arsenic through their drinking water
Up to this point, you can see an interplay of inductive and deductive logic leading to arguments about exposure to inorganic arsenic and the appearance of skin lesions.
Now we get into abductive reasoning and ask:
Why do some people exposed to arsenic in drinking water develop skin disease, while others do not?
With a question like this, several lines of explanation and theory are possible. Here are some (not an exhaustive list):
Explanation 1: There is a threshold limit above which arsenic will cause skin disease and below which arsenic exposure will not lead to skin disease (call it the dose-response theory)
Explanation 2: Arsenic may act differently in children and adults, so while adults may get the skin lesions, children may not (the arsenic age theory)
Explanation 3: Arsenic may take a long time to act, so for some people the lesions have shown up and for others they have not shown up yet (the arsenic time theory)
Explanation 4: Some people may metabolise or remove arsenic faster than others, and therefore in these people arsenic may not cause skin lesions (the arsenic metabolism theory)
Explanation 5: Some people may metabolise arsenic faster, and it is the metabolic by-products, rather than inorganic arsenic as such, that can lead to disease (the arsenic metabolic by-product theory)
... and so on.
A few things to note here:
Once you have the theories, you should start putting together hypotheses. Think carefully about the hypotheses that you put forward. The first hypothesis, termed the alternative hypothesis, will conform to the conditions of your theory and will be stated as such. Let us start with the first explanation, the "threshold" or dose-response theory, and see how we can build a pair of hypotheses from it by way of an illustration.
So you note that this theory tells us that the reason some people develop skin lesions while others do not, even though all of them are exposed to drinking water that contains arsenic (albeit in varying concentrations, and people differ in their daily water intake), is that arsenic may act on the basis of a threshold value. This implies that if someone were to consume arsenic at low concentrations, that person would not develop skin lesions. This lends support to a dose-response study of arsenic exposure: people with a lower dose of arsenic in the body will not develop skin disease, while people with a higher dose will. Remember also that a theory is only as good as our attempts to refute it. In other words, you cannot "prove" a theory; you can only fail to disprove it. The way to disprove a theory is either to find a counter-example or to propose a hypothesis that would render it false. Karl Popper, the Austrian philosopher of science who spent some time at the University of Canterbury, referred to this as "falsification" (see his essay at http://stephenjaygould.org/ctrl/popper_falsification.html); it is also described in Popper's framework of conjectures and refutations\cite{popper2014conjectures}. With that idea in place, let's put the alternative and null hypotheses together:
Alternative hypothesis (H1): People exposed to higher levels of inorganic arsenic in drinking water will have a higher likelihood of skin lesions, or alternatively, compared with people without skin lesions, those people who have skin lesions will have higher levels of inorganic arsenic in their drinking water
Null hypothesis (H0): People with and without skin lesions will have similar levels of arsenic in their drinking water, or alternatively, the risk of skin lesions will remain the same for those with high levels of arsenic in drinking water and those with relatively lower levels of arsenic in drinking water.
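Once data are collected, hypotheses like these can be examined with a simple two-group comparison. The sketch below uses simulated (entirely hypothetical) arsenic concentrations for people with and without skin lesions, so none of these numbers come from a real study; it only illustrates how the resulting p-value would be compared against the pre-specified 5% threshold.

\begin{verbatim}
set.seed(42)
# Hypothetical arsenic concentrations in drinking water (micrograms per litre)
with_lesions    <- rlnorm(80, meanlog = log(120), sdlog = 0.6)
without_lesions <- rlnorm(80, meanlog = log(60),  sdlog = 0.6)

# Compare the two groups; a small p-value (below 0.05 by convention) would
# lead us to reject the null hypothesis of no difference
wilcox.test(with_lesions, without_lesions)
t.test(log(with_lesions), log(without_lesions))
\end{verbatim}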
This is it. Note the sequence of how it all works together. You make careful observations or studies of phenomena. You use abductive reasoning to explain the phenomena, and you set up theories (well, explanations first and then theories). One theory, one set of hypotheses. Your hypotheses are tied to the theories you have about the phenomena you want to explain. You may be tempted to put up several hypotheses to test, but remember that your hypotheses must be derived from the theory you want to test. Rival theories will have rival hypotheses, or you can use the same pair of hypotheses to test rival theories. Let us stick to the simpler notion that we will have one pair of hypotheses for one theory.
Now, based on the hypotheses you derive from your theory, you will set out to collect data. Your data collection may be based on surveys that you conduct yourself (administered in many different ways), on experimental studies, or on other means, for instance compiling existing studies to conduct a meta-analysis. After you have collected the data, you will examine whether your findings can be explained on the basis of your null hypothesis.
But how will you know whether your null hypothesis is to be rejected, or whether you fail to reject it? You set this criterion up before the study begins. Consider the following table: