Hypothesis testing and the scientific method

The null and the alternative hypotheses

Procedures and interpretation of hypothesis tests

The concept, meaning and significance of the p-value

Errors of statistical testing


*Key Words and Terms:*

Error, type 1 (alpha error)

Error, type 2 (beta error)

False negative

False positive

Hypothesis testing, formal

Hypothesis testing, informal

Hypothesis, alternative

Hypothesis, null

Non-rejection region

Null value

Region, critical

Region, of non-rejection

Region, of rejection

Significance, clinical

Significance, practical

Significance, statistical

Statistical power

Statistical tests

Test statistic

Test, of significance

Test, one-tail

Test, two-tail

True negative

True positive

AGENDA

1.0 EPIDEMIOLOGIC METHODOLOGY:

A. Epidemiological Research

B. Hypotheses:

C. Sources of Epidemiological Data

D. Empiricism, Induction, Refutation, and Bayesianism

E. Balance of Strengths and Weaknesses:

2.0 HYPOTHESES AND THE SCIENTIFIC METHOD

A. The Scientific Method

B. Formulation Of Hypotheses

C. Informal Hypothesis Testing

D. Formal Testing Of Hypotheses

E. Generation Of New Hypotheses

3.0 NULL HYPOTHESIS (H_{0}) & ALTERNATIVE HYPOTHESIS (H_{A}):

A. Types Of Hypotheses:

B. Conclusions About Hypotheses:

C. False Positive And False Negative

D. Summary Of Testing Errors

4.0 HYPOTHESIS TESTING USING TESTS OF SIGNIFICANCE

A. Parameters Used In Hypothesis Testing

B. Procedure Of Hypothesis Testing

C. Statistical Significance:

D. Interpretation Of P-Values:

5.0 HYPOTHESIS TESTING USING CONFIDENCE INTERVALS

A. Two Approaches To Hypothesis Testing

B. The Concept Of Null Value

C. Procedure Of Hypothesis Testing

D. Interpretation

6.0 CONCLUSIONS and INTERPRETATIONS

A. Implications Of Statistically Significant

B. Implications Of Not Statistically Significant

C. Statistical And Practical Significance

D. 1-Tail And 2-Tail Tests:

E. Errors Of Testing

EPIDEMIOLOGIC METHODOLOGY:

An epidemiologic investigation proceeds through
identifying and describing a problem, using the scientific method to formulate and test hypotheses, and interpreting findings.

Epidemiological information comes from existing data or from studies (observational or experimental). Existing data come from the census, medical facilities, government and the private sector, health surveys, and vital statistics.

Experimental studies, whether natural or true experiments, involve deliberate human action or intervention whose outcome is then observed. They have the advantage of controlled conditions but raise the ethical problems of experimenting on humans.

Observational studies allow nature to take its course and simply record the occurrence of disease, describing the what, where, when, and why of a disease. There are 4 types of observational studies: ecologic, cross-sectional, case-control, and cohort (follow-up) studies. Their advantages are low cost and fewer ethical issues. They suffer from 3 disadvantages: disease aetiology is not studied directly because the investigator does not manipulate the exposures, unavailability of information, and confounding.

Epidemiological methodology, following the scientific method, is empirical, inductive, and refutative. Epidemiology relies on and respects only empirical findings. Empiricism refers to reliance on physical proof. Induction is building a theory on several individual observations. Refutation is holding a supposition as provisionally false until it is proved otherwise. Epidemiological investigation is not as deterministic as laboratory investigation, but it is cheaper and easier.

HYPOTHESES AND THE SCIENTIFIC METHOD

The scientific method consists of hypothesis formulation, experimentation to test the hypothesis, and drawing conclusions. Hypotheses are statements of prior belief. They are modified by the results of experiments to give rise to new hypotheses. The new ones in turn become the basis for new experiments.

There are two traditions of formal hypothesis testing: significance testing and Neyman-Pearson testing. Significance testing depends on the use of a single p-value to reach a decision. The Neyman-Pearson approach uses the confidence interval, conventionally selected as the 95% CI. The two approaches are related because a significance level of α corresponds to a 100(1 − α)% confidence level in the Neyman-Pearson approach.

NULL HYPOTHESIS (H_{0}) & ALTERNATIVE HYPOTHESIS (H_{A}):

The null hypothesis, H_{0}, states that there is no difference between the two comparison groups and that the apparent difference seen is due to sampling error. The alternative hypothesis, H_{A}, disagrees with the null hypothesis. H_{0} and H_{A} are complementary and exhaustive; between them they cover all the possibilities.

A hypothesis cannot be proved; you only give an objective measure of probability of its truth.

We can use concepts of conditional probability to define errors of statistical testing.

Type 1 error = α error = probability of rejecting a true H_{0} (false positive)
= Pr (rejecting H_{0} | H_{0} is true).

Type 2 error = β error = probability of not rejecting a false H_{0} (false negative)
= Pr (not rejecting H_{0} | H_{0} is false).

The confidence level (1 − α) = true negative = Pr (not rejecting H_{0} | H_{0} is true).

Power (1 − β) = true positive = Pr (rejecting H_{0} | H_{0} is false).

Whereas α relates to the error of significance, β relates to the error of acceptance.
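These conditional probabilities can be estimated empirically. A minimal Python sketch (with simulated, hypothetical data — not from any study in these notes) that repeats a two-sample t-test many times to estimate α and β:

```python
# Estimate alpha and beta by simulation: repeat a two-sample t-test and
# count rejections of a true H0 (alpha) and non-rejections of a false H0 (beta).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n, alpha = 2000, 30, 0.05

# H0 true: both groups drawn from the same distribution.
false_pos = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
    for _ in range(n_sims)
)

# H0 false: the second group's true mean is shifted by 0.5 SD.
false_neg = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0.5, 1, n)).pvalue >= alpha
    for _ in range(n_sims)
)

print(f"estimated alpha: {false_pos / n_sims:.3f}")  # should be near 0.05
print(f"estimated beta:  {false_neg / n_sims:.3f}")
```

The estimated α hovers near the pre-set 0.05, while β depends on the assumed effect size and sample size, as discussed under determinants of type 2 error.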

TABULAR SUMMARY OF TESTING ERRORS

True situation | Result of testing   | Decision | Type of error
H_{0} is true  | Do not reject H_{0} | Correct  | None
H_{0} is true  | Reject H_{0}        | Wrong    | Type 1
H_{0} is false | Do not reject H_{0} | Wrong    | Type 2
H_{0} is false | Reject H_{0}        | Correct  | None

The above table can be set out in a different way as follows:

DECISION MADE       | TRUE SITUATION: H_{0} is true | TRUE SITUATION: H_{0} is false
Do not reject H_{0} | Correct decision (1 − α)      | Type 2 error (β)
Reject H_{0}        | Type 1 error (α)              | Correct decision (1 − β)

HYPOTHESIS TESTING USING TESTS OF SIGNIFICANCE

Parameters of significance testing are the significance level, critical region, p-value, type 1 error, type 2 error, and power.

The critical or rejection region is the far end of the distribution; 1 − α is the non-rejection region. α, the pre-set level of significance, usually 0.05, is the probability that the test statistic falls in the rejection region, i.e. the probability of wrongly rejecting H_{0} 5% of the time, a ratio of 1:20.

The p-value, the observed significance level, is the probability, computed under the null hypothesis, of observations as extreme as or more extreme than those actually obtained. In a commonsense way, the p-value can be described as the probability of rejecting a true hypothesis by mistake.

Hypothesis testing starts by stating H_{0} and H_{A}, assuming a level of significance (usually 0.05), and selecting a test statistic which, when applied to the data, will yield a p-value. Test statistics based on the approximate Gaussian distribution, such as z, t, F, and χ² (chi-square), are employed. Exact methods based on the binomial distribution are used for small samples.

The decision rules are: if p < 0.05, H_{0} is rejected (the test is statistically significant); if p > 0.05, H_{0} is not rejected (the test is not statistically significant).
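The decision rule can be sketched in Python. The two groups below are made-up illustrative data, and `scipy.stats.ttest_ind` stands in for whichever test statistic fits the design:

```python
# Two-sample t-test: compute the test statistic, obtain the p-value,
# and apply the p < 0.05 decision rule.
from scipy import stats

group_a = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.3, 5.9]  # hypothetical data
group_b = [6.8, 7.1, 6.5, 7.4, 6.9, 7.2, 6.6, 7.0]  # hypothetical data

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: the difference is statistically significant")
else:
    print("Do not reject H0: the difference is not statistically significant")
```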

HYPOTHESIS TESTING USING CONFIDENCE INTERVALS

The 95% confidence interval is more informative than the p-value approach because it indicates precision.

Under H_{0} the null value is defined as 0 (when the difference between comparison groups=0) or as 1.0 (when
the ratio between comparison groups=1).

The 95% CIs can be computed using approximate Gaussian or exact binomial methods.

The decision rules are: if the CI contains the null value, H_{0} is not rejected; if the CI does not contain the null value, H_{0} is rejected.
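The CI decision rule can be sketched with an approximate Gaussian interval for a difference in means (illustrative data; 1.96 is the 2-tail 5% critical value of the standard normal):

```python
# Approximate Gaussian 95% CI for a difference in means, then check
# whether the interval contains the null value 0.
import numpy as np

a = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.3, 5.9])  # hypothetical data
b = np.array([6.8, 7.1, 6.5, 7.4, 6.9, 7.2, 6.6, 7.0])  # hypothetical data

diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
lower, upper = diff - 1.96 * se, diff + 1.96 * se  # 95% CI

print(f"95% CI for the difference: ({lower:.2f}, {upper:.2f})")
if lower <= 0 <= upper:
    print("CI contains the null value 0: do not reject H0")
else:
    print("CI excludes the null value 0: reject H0")
```

Unlike the bare p-value, the interval's width also shows the precision of the estimate.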

CONCLUSIONS and INTERPRETATIONS

IMPLICATIONS OF STATISTICALLY SIGNIFICANT

H0 is false

H0 is rejected

Observations are not compatible with H0

Observations are not due to sampling variation

Observations are real/true biological phenomenon

IMPLICATIONS OF NOT STATISTICALLY SIGNIFICANT

H0 is not shown to be false (we do not say it is true)

H0 is not rejected

Observations are compatible with H0

Observations are due to sampling variation or random errors of measurement.

Observations are artefactual or apparent, not real biological phenomena

A statistically significant result may have no clinical/practical significance/importance. This may be due to other factors being involved but not studied, or to measurements that are not valid.

A clinically important difference may not reach statistical significance due to small sample size or measurements that are not discriminating enough.

Hypothesis testing may be 1-sided or 2-sided. The 1-sided test considers extreme values in one tail only and is rarely used. The 2-sided test considers extreme values in both tails (2 tails); it is the more popular, more conservative test, and looks for any change in the parameter whatever its direction.

PRACTICAL ASSIGNMENT (survival data set)

1. Test the null hypothesis that there is no difference in
mean survival time between the 2 treatment groups

2. Test the hypothesis that there is no difference in the proportion
of males and females between the two treatment groups

HYPOTHESES AND THE SCIENTIFIC METHOD

A. THE SCIENTIFIC METHOD

The scientific method is currently the most powerful method available in empirical investigations. It proceeds in stages
starting with formulation of a study hypothesis. An experiment is then designed based on the hypothesis. The data from the
experimentation is used to draw objective conclusions about the hypothesis.

B. FORMULATION OF HYPOTHESES

In accordance with the scientific method a null hypothesis is formulated usually in the form that there is no difference
between 2 groups being compared. Correct formulation of the null hypothesis is necessary for study design and study interpretation.
A series of studies can be interpreted to yield a general explanatory law or hypothesis. Such generalizations are based on
results of a series of valid studies.

C. INFORMAL HYPOTHESIS TESTING

The use of hypotheses and the scientific method is sometimes informal. For example when a patient walks into the doctor's
office, the doctor will form a hypothesis based on preliminary observations. This will become the working hypothesis used
to guide further clinical examination and investigations. The hypothesis may be changed or updated in view of new information
that may be collected.

D. FORMAL TESTING OF HYPOTHESES

There are two traditions of formal hypothesis testing: significance testing and the Neyman-Pearson hypothesis testing.

Significance testing depends on the use of a single p-value to reach a decision. Significance testing has been criticized on various grounds. It does not incorporate any measure of the magnitude of association. It cannot assess the precision of the measurement. It historically developed in agriculture and industry, which required simple choices, and does not fit the epidemiological paradigm, being more suited to industry and agriculture where problems are less complicated. Significance testing involves putting the hypothesis in a mathematical formulation and computing the probability of the observed data under that hypothesis. The bulk of inferential statistics is concerned with the formulation and testing of hypotheses. In formal testing the data are used to generate a test statistic, which in turn yields a probability value. That value is compared to a pre-set probability of significance to make a decision about the null hypothesis.

The Neyman-Pearson approach avoids the criticisms of significance testing stated above. It does not give a conclusion
based on a single probability of the hypothesis being true. It provides a confidence range of the probability of the hypothesis
being true. It is therefore more informative and makes more use of the data provided. The Neyman-Pearson approach uses the
confidence interval, conventionally selected as the 95% confidence interval. It must be noted, however, that the two are related to one another: if significance testing uses a significance level of α, the corresponding confidence level is 100(1 − α)%.

E. GENERATION OF NEW HYPOTHESES

Hypotheses are statements of prior belief. They are modified by results of experiments to give rise to new hypotheses.
The new ones then in turn become the basis for new experiments. This process is repeated continuously enabling scientific
knowledge and understanding to grow. In this process no facts or knowledge can remain static for long. Changes are continually
taking place.

2.4.2 NULL HYPOTHESIS (H_{0}) & ALTERNATIVE HYPOTHESIS (H_{A}):

A. TYPES OF HYPOTHESES:

A hypothesis is a statement of belief in something. Unlike other types of beliefs, scientific beliefs are subject to experimental verification. Two hypotheses are always stated for proper scientific investigation: the null and the alternative hypotheses. The null hypothesis, H_{0}, states that there is no difference between the two comparison groups and that the apparent difference seen is due to sampling error. The alternative (research) hypothesis, H_{A}, disagrees with the null hypothesis and states that there is a real difference not explained by sampling error. H_{0} and H_{A} are complementary and exhaustive in that between them they cover all the possibilities. H_{A} could be vague. When H_{0} is not rejected, we cannot accept it; we only fail to reject it.

B. CONCLUSIONS ABOUT HYPOTHESES:

The aim of hypothesis testing is to make a conclusion about H_{0}. The conclusion is in the form of rejecting or not rejecting the hypothesis. If H_{0} is rejected, H_{A} becomes the new working hypothesis. A hypothesis cannot be proved; we can only give an objective measure of the probability of its truth.

C. FALSE POSITIVE and FALSE NEGATIVE

Intersections of the distributions of H_{0} and H_{A}: the observed data can be plotted on 2 normal curves, one under the assumptions of the null hypothesis, the other under the assumptions of the alternative hypothesis. The 2 curves will naturally intersect. This intersection gives rise to two concepts that are basic in hypothesis testing: the probability of a false positive and the probability of a false negative. The probability of a false positive is the part of the H_{0} curve that extends into the rejection region, toward the H_{A} curve; it is also referred to as the type 1 or alpha error. The probability of a false negative is the part of the H_{A} curve that extends into the non-rejection region, toward the H_{0} curve; it is also referred to as the type 2 or beta error.
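For Gaussian sampling distributions these two overlap areas can be computed directly as tail areas around the critical value. A sketch, assuming an illustrative true effect of 2.8 standard errors:

```python
# Center the H0 curve at 0 and the HA curve at the true standardized effect;
# alpha is the area of the H0 curve in the rejection region, beta the area
# of the HA curve in the non-rejection region.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)   # 1.96 for a 2-tail test
effect = 2.8                       # assumed true effect, in SE units

beta = norm.cdf(z_crit - effect) - norm.cdf(-z_crit - effect)
power = 1 - beta

print(f"critical value: {z_crit:.2f}")
print(f"beta (false negative): {beta:.3f}")
print(f"power (true positive): {power:.3f}")
```

With this assumed effect the power comes out near the conventional 80% often used in study-size planning.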

D. SUMMARY OF TESTING ERRORS

True situation | Result of testing   | Decision | Type of error
H_{0} is true  | Do not reject H_{0} | Correct  | None
H_{0} is true  | Reject H_{0}        | Wrong    | Type 1
H_{0} is false | Do not reject H_{0} | Wrong    | Type 2
H_{0} is false | Reject H_{0}        | Correct  | None

The above table can be set out in a different way as follows:

DECISION MADE       | TRUE SITUATION: H_{0} is true | TRUE SITUATION: H_{0} is false
Do not reject H_{0} | Correct decision (1 − α)      | Type 2 error (β)
Reject H_{0}        | Type 1 error (α)              | Correct decision (1 − β)

We can use concepts of conditional probability to define the parameters explained above as follows. Type 1 error = Pr (rejecting H_{0} | H_{0} is true). Type 2 error = Pr (not rejecting H_{0} | H_{0} is false). The confidence level (1 − α) = true negative = Pr (not rejecting H_{0} | H_{0} is true). Power (1 − β) = true positive = Pr (rejecting H_{0} | H_{0} is false). Whereas alpha relates to the error of significance, beta relates to the error of acceptance.

2.4.3 HYPOTHESIS TESTING USING TESTS OF SIGNIFICANCE

A. PARAMETERS USED IN HYPOTHESIS TESTING

Six parameters or concepts are used in hypothesis or significance testing: the critical region, significance level, p-value, type 1 error, type 2 error, and power. The critical region is the far end of the distribution. We may talk of a 1-sided critical region or a 2-sided critical region. The critical region is also called the rejection region, denoted by alpha. The non-rejection region is denoted by 1 − alpha. Alpha is the probability that a test statistic falls in the critical or rejection region. Alpha, the level of significance usually set at 0.05, is the probability of wrongly rejecting H_{0} 5% of the time, a ratio of 1:20. The p-value can be defined or described in various ways. The p-value is a measure of the compatibility of the observed data with the null hypothesis. The p-value is the observed significance level. The p-value is the probability, under H_{0}, of results as extreme as or more extreme than those observed. Equivalently, the p-value is the probability of observing the test statistic or a more extreme value, i.e. the area of the tail beyond the value of the test statistic. The p-value can be 1-tail or 2-tail (upper tail and lower tail). The upper-tail p-value is the probability that the test statistic is higher than the observed value. The lower-tail p-value is the probability that the test statistic is less than the observed value. Type 1 error, also called alpha error, is the probability of a false positive; stated in other words, it is the probability of rejecting a true null hypothesis, or of incorrect rejection of the null hypothesis. Type 2 or beta error is the probability of a false negative; stated differently, it is the probability of failing to reject a false null hypothesis.
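The tail-area definitions of the p-value can be sketched for an assumed observed z statistic:

```python
# Upper-tail, lower-tail, and 2-tail p-values as areas under the
# standard Gaussian curve relative to the observed statistic.
from scipy.stats import norm

z_obs = 2.1  # hypothetical observed test statistic

p_upper = norm.sf(z_obs)              # Pr(Z > z_obs), upper tail
p_lower = norm.cdf(z_obs)             # Pr(Z < z_obs), lower tail
p_two_tail = 2 * norm.sf(abs(z_obs))  # area in both tails

print(f"upper-tail p: {p_upper:.4f}")
print(f"lower-tail p: {p_lower:.4f}")
print(f"2-tail p:     {p_two_tail:.4f}")
```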

B. PROCEDURE OF HYPOTHESIS TESTING

The procedure starts by stating H_{0} and H_{A}. Then a level of significance is assumed; this is traditionally taken to be 0.05, or 1 in 20. Setting this level of significance is the same as saying that I am taking a 1-in-20 risk of being wrong. The next step is selecting a test statistic which, when applied to the data, will yield a p-value. If approximate methods are used, test statistics based on the Gaussian distribution are employed, such as the z, t, F, and chi-square statistics. These are computed from the data and the corresponding p-value is looked up in the appropriate tables. For small samples, for which the Gaussian approximation is not valid, exact methods of computing the p-value are used. These methods, based on the binomial distribution, yield the p-value directly from the data. The following decision rules are used in making conclusions about the null hypothesis: if the p-value is less than 0.05, H_{0} is rejected; if the p-value is greater than 0.05, H_{0} is not rejected.
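The whole procedure can be sketched for a hypothetical 2×2 table, for example males and females counted in two treatment groups (the counts below are made up for illustration):

```python
# H0: the proportion of males is the same in both groups.
# Compute a chi-square statistic from the table and apply p < 0.05.
from scipy.stats import chi2_contingency

#                 group 1  group 2
table = [[30, 10],   # males
         [20, 40]]   # females

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Do not reject H0")
```

Note that `chi2_contingency` applies Yates' continuity correction by default for a 2×2 table.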

C. STATISTICAL SIGNIFICANCE:

The results of hypothesis testing may reveal either a statistically significant difference or a statistically non-significant
difference. H0 is rejected for statistically significant results.

H0 is not rejected for statistically non-significant results.

D. INTERPRETATION OF P-VALUES:

The following is a guideline on the interpretation of p-values. P value <0.01 indicates strong evidence against
H0. P values 0.01 - 0.05 indicate moderate evidence against H0. P values 0.05 - 0.10 are suggestive evidence against H0. P
values > 0.1 provide little or no evidence against H0.

2.4.4 HYPOTHESIS TESTING USING CONFIDENCE INTERVALS

A. TWO APPROACHES TO HYPOTHESIS TESTING

Two approaches to hypothesis testing are available: the p-value and the 95% confidence interval. The 95% confidence interval is used less often than the p-value, although many investigators are of the opinion that it is more informative. The 95% confidence interval is more informative than the p-value approach: it gives information about precision, whereas the p-value approach only indicates whether there is a significant difference or not.

B. THE CONCEPT OF NULL VALUE

Under the null hypothesis of no real difference between summary statistics of 2 samples that are compared, the difference
between the sample statistics is zero and their ratio is 1.0. Thus zero and 1.0 are called the null values. Hypothesis testing
is a form of proof by contradiction.

C. PROCEDURE OF HYPOTHESIS TESTING

The 95% confidence interval of a parameter consists of all values of the parameter that would not be rejected at the α level of significance. The procedure of testing starts with stating H_{0} and H_{A}. Under the null assumptions the null value is defined as 0 (when the difference between comparison groups = 0) or as 1.0 (when the ratio between comparison groups = 1). At the start we assume the level of significance, usually 0.05. The lower and upper 95% confidence limits can be computed in 2 ways: (a) using approximate methods based on the Gaussian distribution, involving test statistics such as z, t, and chi-square; using the test statistic and applying a special formula, the lower and upper confidence limits can be determined; (b) using exact methods based on the binomial distribution when the sample size is small. Exact methods require the use of powerful computers and appropriate statistical software. The decision rule is that if the interval contains the null value, H_{0} is not rejected; when the interval does not contain the null value, H_{0} is rejected.

D. INTERPRETATION

If the 95% CI does not contain the null value, we reject the null hypothesis and conclude that there is statistical significance. In other words, we are 95% confident that the null value is not within the interval. Our chance of error is 5%, i.e. there is a 5% chance that we have made a mistake by concluding that the null value is not in the interval. If, on the other hand, the 95% CI contains the null value, we do not reject the null hypothesis and conclude that there is no statistical significance.

2.4.5 CONCLUSIONS and INTERPRETATIONS

A. IMPLICATIONS OF STATISTICALLY SIGNIFICANT

H0 is false

H0 is rejected

Observations are not compatible with H0

Observations are not due to sampling variation

Observations are real/true biological phenomenon

B. IMPLICATIONS OF NOT STATISTICALLY SIGNIFICANT

H0 is not shown to be false (we do not say it is true)

H0 is not rejected

Observations are compatible with H0

Observations are due to sampling variation or random errors of measurement.

Observations are artefactual or apparent, not real biological phenomena

C. STATISTICAL AND PRACTICAL SIGNIFICANCE

A statistically significant result may have no clinical/practical significance/importance. This may be due to (a) other factors being involved but not studied here, or (b) measurements that are not valid. A clinically important difference may not reach statistical significance for 2 main reasons: (a) small sample size, and (b) measurements that are not discriminating enough.

D. 1-TAIL AND 2-TAIL TESTS:

The test may be 2-tail or 1-tail, and a 1-tail test may be right-tail or left-tail. The decision on which test to use depends on the intention. The 1-sided test considers extreme values in one tail only. Under the 1-tail test the hypotheses are: (a) upper tail: H_{0}: μ = μ_{0} vs H_{A}: μ > μ_{0}; (b) lower tail: H_{0}: μ = μ_{0} vs H_{A}: μ < μ_{0}. The 1-tail test is rarely used; it is reserved for situations in which the direction of the difference is known. The 2-sided test considers extreme values in both tails (2 tails). It is the more conservative test. The 2-tail test looks for any change in the parameter whatever its direction.
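The 1-tail/2-tail distinction can be sketched with a 1-sample t-test on made-up data; SciPy's `alternative` parameter selects the tail:

```python
# Compare the 2-tail p-value with the upper-tail p-value for the same
# data; in the anticipated direction the 1-tail p-value is half as large.
from scipy import stats

sample = [5.3, 5.9, 6.1, 5.7, 6.4, 5.8, 6.0, 6.2]  # hypothetical data
mu0 = 5.5  # hypothesized mean under H0

_, p_two = stats.ttest_1samp(sample, mu0, alternative='two-sided')
_, p_upper = stats.ttest_1samp(sample, mu0, alternative='greater')

print(f"2-tail p:     {p_two:.4f}")
print(f"upper-tail p: {p_upper:.4f}")
```

This is why the 2-tail test is the more conservative choice: for the same data it demands stronger evidence before rejecting H_{0}.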

E. ERRORS OF TESTING

CLASSIFICATION

Type 1 = rejecting a true H0 (false positive)

Type 2 = not rejecting a false H0 (false negative)

Alpha = Pr (type 1 error) = Pr (rejecting H0 when H0 is true)

Beta = Pr (type 2 error) = Pr (not rejecting H0 when H0 is false).

DETERMINANTS OF TYPE 2 ERRORS

Type 2 error is determined by the discrepancy between the true and hypothesized values, the sample size, the variability (standard deviation), the significance level, and the tail of the test. The bigger the discrepancy between the true and hypothesized values, the lower the probability of type 2 error. The larger the sample size, the lower the probability of type 2 error. The lower the variability, the lower the probability of type 2 error. The lower the level of significance (type 1 error), the higher the probability of type 2 error. A 1-tail test in the anticipated direction has a lower probability of type 2 error (greater power) than the corresponding 2-tail test.
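These determinants can be sketched analytically for a 2-tail z-test (a simplified stand-in for the t-test; the effect delta, sigma, and sample sizes below are illustrative assumptions):

```python
# Approximate type 2 error of a 2-tail z-test of H0: mu = mu0, showing
# how beta falls as the sample size grows for a fixed true effect.
from scipy.stats import norm

def beta_error(delta, sigma, n, alpha=0.05):
    """Beta for a 2-tail z-test given true effect delta, SD sigma, size n."""
    z_crit = norm.ppf(1 - alpha / 2)
    shift = delta * n ** 0.5 / sigma  # standardized true effect
    return norm.cdf(z_crit - shift) - norm.cdf(-z_crit - shift)

for n in (10, 30, 100):
    b = beta_error(delta=0.5, sigma=1.0, n=n)
    print(f"n = {n:3d}: beta = {b:.3f}, power = {1 - b:.3f}")
```

Increasing delta, lowering sigma, or raising alpha in the same formula shows the other determinants listed above.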