By Professor Omar Hasan Kasule Sr.



Epidemiology is the study of the distribution and determinants of both disease and injury. Two triads are involved in epidemiology: (a) the agent, host, and environment triad and (b) the time, place, and person triad.


The primary goals of epidemiology are prevention, control, and, in rare instances, eradication of disease and injury.


Epidemiology started as a study of epidemics and extended to cover infectious disease and later non-infectious diseases. It has now become a methodological discipline that is used to study disease and non-disease phenomena.


Qualitative epidemiology deals with qualitative descriptions. Quantitative epidemiology deals with numerical descriptions. Observational epidemiology is based on observation of human phenomena. Experimental epidemiology involves assessment of the effects of intervention against a disease phenomenon. Theoretical epidemiology deals with mathematical and methodological issues. Descriptive epidemiology describes the patterns of disease occurrence in terms of place, time and person. Analytic epidemiology seeks to discover the underlying causes of diseases.


Public-health epidemiology deals with preventive medicine. Clinical epidemiology deals with diagnosis, management, and prognosis of disease. Hospital epidemiology deals with nosocomial infections and other aspects of hospital operations that can be studied using epidemiological methodology. Drug or pharmaco-epidemiology studies phenomena of adverse reactions and side-effects of drugs. Genetic epidemiology studies the patterns of inheritance of disease from parents and how genetic and environmental factors interact in the final pathway of disease causation. Molecular epidemiology deals with phenomena at the molecular level. Occupational epidemiology studies diseases due to exposure to hazardous material or working conditions in the workplace. Environmental epidemiology studies the impact of air, water, and soil pollution on health.


The supporting disciplines of epidemiology are clinical sciences, demographical sciences, data and information sciences, behavioral sciences, and environmental sciences.



Epidemiology is used in clinical medicine, public health, and actuarial sciences. The major activities of an epidemiologist are: study design including selection of the study sample, data collection, data analysis, data interpretation, and initiation of action programs to prevent disease and promote health. Professional practice and careers in epidemiology are in government (Ministry of Health), universities, hospitals, research institutes, and the private sector (drug manufacturers).




Famous epidemiologists contributed to the early growth of the discipline. Hippocrates made the first recorded epidemiological observations by describing the relation of disease to climate and geography. John Snow (1813-1858) recognized the importance of field epidemiology in his study of cholera in London and its relation to water pollution. William Budd (1811-1880) described the spread of typhoid due to ingestion of infected material from patients. William Farr realized that cycles of epidemics could be described mathematically. Major Greenwood (1880-1949) was chief of epidemiology and vital statistics at the London School of Hygiene and Tropical Medicine and worked on models of epidemics.



An epidemiologic investigation proceeds through identifying and describing a problem, using the scientific method to formulate and test hypotheses, and interpreting findings. Epidemiological information is sourced from existing data or studies (observational or experimental). Existing data is from the census, medical facilities, government, the private sector, health surveys, and vital statistics.


Experimental studies, natural or true experiments, involve deliberate human action or intervention whose outcome is then observed. They have the advantage of controlled conditions but have ethical problems of experimenting on humans.


Observational studies allow nature to take its course and just record the occurrences of disease and describe the what, where, when, and why of a disease. There are 3 types of observational studies: cross-sectional, case control, and cohort (follow-up) studies. Their advantages are low cost and fewer ethical issues. They suffer from 3 disadvantages: disease aetiology is not studied directly because the investigator does not manipulate the exposures, unavailability of information, and confounding.


Epidemiological methodology, following the scientific method, is empirical, inductive, and refutative. Epidemiology relies on and respects only empirical findings. Empiricism refers to reliance on physical proof. Induction is building a theory on several individual observations. Refutation is basically refusal of a supposition until it is proved otherwise. Epidemiological investigation is not as deterministic as laboratory investigation but is cheap and easy.



Five stages can be identified in the evolution of epidemiological knowledge: the ancient period up to 1500, the post renaissance period 1500-1750, the sanitary period 1750-1870, the infectious disease period 1870-1945, and the modern epidemiology period starting in 1945 (also considered the chronic disease period).


In the ancient period, inter-personal disease transmission, connection between diseases and the environment, quarantine and isolation were known. In 400 BC Hippocrates suggested the relation between disease on one side and lifestyle and environmental factors on the other side.


The post renaissance period witnessed rapid growth of knowledge of pathology, and of the transmission as well as control of disease. In the 1620s Bacon and others developed inductive logic that provided a philosophical basis for epidemiology. Girolamo Fracastoro (1478-1553) suggested that disease spread by direct contact and by small living particles. In 1683 Van Leeuwenhoek saw microorganisms under the microscope. In 1662 Captain John Graunt analyzed births and deaths and described disease in populations quantitatively with significant epidemiological observations and determinations. In 1747 James Lind discovered the prevention of scurvy by conducting one of the first experimental trials on humans. In 1798 Edward Jenner discovered vaccination. Ramazzini wrote on occupational health in 1700. Percival Pott (1713-1788) associated scrotal cancer with chimney soot.


In the sanitary period concern was about environmental correlates of disease; quarantine and isolation were used for disease control.


During the infectious disease period, the microbial basis of disease became firmly established when Louis Pasteur (1822-1895) and Robert Koch (1843-1900) developed the germ theory through experimentation. Dr Robert Koch, the father of bacteriology, identified the causative organisms of anthrax (1876), tuberculosis (1882), and cholera (1883). He developed Koch’s postulates, which were criteria for determining an infectious etiology of disease. In 1847 Ignaz Philipp Semmelweis suggested hand-washing to avoid obstetric infection. John Snow described the association between cholera and contaminated water by forming and testing a series of hypotheses, thus being a pioneer of analytic epidemiology. William Budd in 1857-73 concluded that typhoid was contagious. In 1839 William Farr started the discipline of vital statistics as a system of regular collection and interpretation of data and set up a system for routine summaries of causes of death. Joseph Lister introduced antiseptic surgery in 1865. Manson-Bahr, Bruce-Chwatt and others studied the transmission of mosquito-borne infections, malaria and yellow fever.


Towards the end of the infectious disease period, there were developments in knowledge of non-infectious disease and statistical methodology. Non-infectious diseases (nutritional, occupational, psychiatric, and environmental) were identified and studied. In 1905 beriberi was found to be associated with eating milled rice. In 1920 Joseph Goldberger published a descriptive field study relating pellagra to diets high in cereal and canned foods and free of fresh animal products. Elmer McCollum, a Professor at Johns Hopkins since 1918, discovered vitamin-deficiency diseases. Statistical theory and practice developed rapidly towards the close of the 19th century to keep up with developments in basic research and public health, all of which required statistical analysis.


The period of modern epidemiology starting in 1945 is the chronic disease epoch. By 1945 there was convergence of the non-mechanistic concepts of disease (environmental, social, and behavioral basis of disease) and the mechanistic concepts of disease (molecular, biological, agent-host interaction). Health was defined in a broad sense as physical, mental, psychological, and spiritual well-being. Scientists recognized the multi-causal nature of disease (genetic, psycho-social, physiological, and metabolic). The period witnessed a demographic transition (ageing populations) as well as an epidemiologic transition (change from communicable to non-communicable diseases). It also witnessed major studies that helped redefine the direction of epidemiology and public health. In 1949 the Framingham Heart Study was begun as the first cohort study of the causative factors of cardiovascular disease. In 1950 Doll and Hill, Levin et al., Schreck et al., and Wynder and Graham published the first case control studies of smoking and lung cancer. In 1954 the field trials of the Salk polio vaccine were the largest formal human experiment. In 1971-1972 the North Karelia Project and the Stanford Three Community studies were launched as the first community-based cardiovascular disease prevention programs. Further methodological developments were witnessed in this period. In 1960 MacMahon published the first epidemiology textbook with systematic treatment of study design. In 1959 Mantel and Haenszel developed statistical procedures for case control studies. In the 1970s logistic regression and log-linear regression were developed as new multivariate analytic methods. From the 1970s to the present there have been new developments in computer hardware and software. Since the 1990s molecular techniques have been applied to the study of large populations.



A study involving humans must get approval from a recognized body. For approval the study must fulfil certain criteria. It must be scientifically valid. It is unethical to waste resources (time and money) on a study that will give invalid conclusions. In 1992 the Council for International Organizations of Medical Sciences published ‘Guidelines for Ethical Review of Epidemiological Studies’. Among ethical considerations are: individual vs. community rights, benefits vs. risks, informed consent, privacy and confidentiality, and conflict of interest.


Study interpretation and communication of findings to the public pose problems. Risk reports that are not yet confirmed are picked up by the media and create unnecessary public concern. Study findings affect policy. Epidemiologists must know how to communicate risk to the public. It is an ethical obligation to report research findings to subjects so that they may take measures to lessen risk. Epidemiological evidence is different from legal evidence. Epidemiological evidence may not be accepted in a court of law because it has few certainties; it is concerned with populations whereas legal evidence pertains to individuals.




The size of the sample depends on the hypothesis, the budget, the study duration, and the precision required. If the sample is too small the study will lack sufficient power to answer the study question. A sample bigger than necessary is a waste of resources. Power is the ability to detect a difference and is determined by the significance level, the magnitude of the difference, and the sample size. The bigger the sample size the more powerful the study. Beyond an optimal sample size, the increase in power does not justify the costs of a larger sample. There are procedures, formulas, and computer programs for determining sample sizes for different study designs.
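
As an illustration of one such formula, the sketch below estimates the number of subjects per group needed to compare two proportions. It assumes a two-sided test, equal group sizes, and the usual normal approximation; the function name and the example proportions are hypothetical and are not taken from the text.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Subjects per group to detect p1 vs p2 (two-sided test, equal groups).

    Uses the normal-approximation formula; this is an illustrative
    sketch, not a substitute for a dedicated sample-size program.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


# Hypothetical example: disease risk of 10% in one group and 20% in the other.
print(sample_size_two_proportions(0.10, 0.20))
# A smaller difference or a higher power demands a larger sample.
print(sample_size_two_proportions(0.10, 0.15))
print(sample_size_two_proportions(0.10, 0.20, power=0.90))
```

The three inputs mirror the determinants of power named above: the significance level, the magnitude of the difference, and the sample size.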



Secondary data is from decennial censuses, vital statistics, routinely collected data, epidemiological studies, and special health surveys. Census data is reliable. It is wide in scope covering demographic, social, economic, and health information. The census describes population composition by sex, race/ethnicity, residence, marriage, and socio-economic indicators. Vital events are births, deaths, marriages and divorces, and some disease conditions. Routinely collected data are cheap but may be unavailable or incomplete. They are obtained from medical facilities, life and health insurance companies, institutions (like prisons, army, and schools), disease registries, and administrative records. Observational epidemiological studies are of 3 types: cross-sectional, case-control, and follow-up/cohort studies. Special surveys cover a larger population than epidemiological studies and may be health, nutritional, or socio-demographic surveys.



Questionnaire design involves content, wording of questions, format and layout. The reliability and validity of the questionnaire as well as practical logistics should be tested during the pilot study. Informed consent and confidentiality must be respected. A protocol sets out data collection procedures. Questionnaire administration by face-to-face interview is the best but is expensive. Questionnaire administration by telephone is cheaper. Questionnaire administration by mail is very cheap but has a lower response rate. The computer-administered questionnaire is associated with more honest responses.



Data can be obtained by clinical examination, standardized psychological/psychiatric evaluation, measurement of environmental or occupational exposure, assay of biological specimens (endobiotic or xenobiotic), and laboratory experiments. Pharmacological experiments involve bioassay, quantal dose-effect curves, dose-response curves, and studies of drug elimination. Physiology experiments involve measurements of parameters of the various body systems. Microbiology experiments involve bacterial counts, immunoassays, and serological assays. Biochemical experiments involve measurements of concentrations of various substances. Statistical and graphical techniques are used to display and summarize this data.


Self-coding or pre-coded questionnaires are preferable. Data is input as text, multiple choice, numeric, date and time, and yes/no responses. Data in the computer can be checked manually against the original questionnaire. Interactive data entry enables detection and correction of logical and entry errors immediately.


Data editing is the process of correcting data collection and data entry errors. The data is 'cleaned' using logical, statistical, range, and consistency checks. All values should be at the same level of precision (number of decimal places) to make computations consistent and decrease rounding-off errors. The kappa statistic is used to measure inter-rater agreement. Data editing identifies and corrects errors such as invalid or inconsistent values. Data is validated and its consistency is tested. The main data problems are missing data, coding and entry errors, inconsistencies, irregular patterns, digit preference, outliers, rounding-off / significant figures, questions with multiple valid responses, and record duplication. Data transformation is the process of creating new derived variables preliminary to analysis and includes mathematical operations such as division, multiplication, addition, or subtraction, and mathematical transformations such as logarithmic, trigonometric, power, and z-transformations.
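
The kappa statistic mentioned above can be sketched as follows. This is a minimal illustration of Cohen's kappa for two raters, which is one common form of the statistic and is assumed here because the text does not specify a variant; the function name and the example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters beyond chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's category frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical data: two clinicians classify 50 records as 'yes' or 'no'.
a = ["yes"] * 25 + ["no"] * 25
b = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15
print(round(cohen_kappa(a, b), 2))  # about 0.4
```

A kappa of 1 indicates perfect agreement and a kappa of 0 indicates agreement no better than chance.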


Data analysis consists of data summarization, estimation and interpretation. Simple manual inspection of the data is needed before statistical procedures. Preliminary examination consists of looking at tables and graphics. Descriptive statistics are used to detect errors, ascertain the normality of the data, and know the size of cells. Missing values may be imputed or incomplete observations may be eliminated. Tests for association, effect, or trend involve construction and testing of hypotheses. The tests for association are the t, chi-square, linear correlation, and logistic regression tests or coefficients. The common effect measures are the odds ratio, risk ratio, and rate difference. Measures of trend can discover relationships that are not picked up by association and effect measures. The probability, likelihood, and regression models are used in analysis. Analytic procedures and computer programs vary for continuous and discrete data, for person-time and count data, for simple and stratified analysis, for univariate, bivariate and multivariate analysis, and for polychotomous outcome variables. Procedures are different for large samples and small samples.




The cross-sectional study, also called the prevalence study or naturalistic sampling, aims to determine the prevalence of risk factors and the prevalence of disease at a point in time (calendar time or an event like birth or death). Disease and exposure are ascertained simultaneously. A cross-sectional study can be descriptive or analytic or both. It may be done once or may be repeated. Individual-based studies collect information on individuals. Group-based (ecologic) studies collect aggregate information about groups of individuals. Cross-sectional studies are used in community diagnosis, preliminary study of disease etiology, assessment of health status, disease surveillance, public health planning, and program evaluation.



The case-control study is popular because of its low cost, rapid results, and flexibility. It uses a small number of subjects. The source population for cases and controls must be the same. Cases are sourced from clinical records, hospital discharge records, disease registries, data from surveillance programs, employment records, and death certificates. Cases are either all cases of a disease or a sample thereof. Only incident cases (new cases) are selected. Controls must be from the same population base as the cases and must be like cases in everything except having the disease being studied. Information comparability between the case series and the control series must be assured. Hospital, community, neighborhood, friend, dead, and relative controls are used. There is little gain in efficiency beyond a 1:2 case control ratio unless control data is obtained at no cost. Confounding can be prevented or controlled by stratification and matching. Exposure information is obtained from interviews, hospital records, pharmacy records, vital records, disease registries, employment records, environmental data, genetic determinants, biomarkers, physical measurements, and laboratory measurements.
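
The effect measure usually reported from a case-control study is the odds ratio. The sketch below computes it from a 2x2 table together with an approximate confidence interval based on the standard error of the log odds ratio (the Woolf method, an assumption chosen here for simplicity). The function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio and approximate confidence interval for a 2x2 table.

    a = exposed cases,    b = unexposed cases,
    c = exposed controls, d = unexposed controls (all must be > 0).
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) by the Woolf (logit) method.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical study: 100 cases (40 exposed) and 100 controls (20 exposed).
or_, lo, hi = odds_ratio_ci(40, 60, 20, 80)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

An interval that excludes 1 suggests an association between exposure and disease, subject to the bias and confounding issues discussed elsewhere in the text.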



A follow-up study (also called cohort study, incidence study, prospective study, or longitudinal study) compares disease in exposed to disease in non-exposed groups after a period of follow-up. It can be prospective (forward), retrospective (backward), or ambispective (both forward and backward) follow-up. In a nested case control design, a case control study is carried out within a larger follow-up study. The follow-up cohorts may be closed (fixed cohort) or open (dynamic cohort). Analysis of fixed cohorts is based on cumulative incidence (CI) and that of open cohorts on the incidence rate (IR). The study population is divided into the exposed and unexposed populations. A sample is taken from the exposed and another sample is taken from the unexposed. Both the exposed and unexposed samples are followed for appearance of disease. The ascertainment of the outcome event must be standardized with clear criteria. Follow-up can be achieved by letter, telephone, and surveillance of death certificates and hospitals. Care must be taken to make sure that surveillance, follow-up, and ascertainment for the 2 groups are the same.
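
The two measures of disease frequency used in follow-up studies can be sketched as below: cumulative incidence for a fixed cohort and incidence rate (cases per unit of person-time) for an open cohort. The function names and all counts are hypothetical and serve only to show the arithmetic.

```python
def cumulative_incidence(new_cases, at_risk_at_start):
    """Proportion of a fixed cohort developing disease during follow-up."""
    return new_cases / at_risk_at_start


def incidence_rate(new_cases, person_time):
    """New cases per unit of person-time in an open (dynamic) cohort."""
    return new_cases / person_time


# Fixed cohort (hypothetical): 1000 exposed and 1000 unexposed followed equally.
ci_exposed = cumulative_incidence(30, 1000)
ci_unexposed = cumulative_incidence(10, 1000)
risk_ratio = ci_exposed / ci_unexposed
print(f"risk ratio = {risk_ratio:.2f}")

# Open cohort (hypothetical): subjects contribute unequal follow-up time.
ir_exposed = incidence_rate(30, 4500)    # 30 cases in 4500 person-years
ir_unexposed = incidence_rate(10, 5000)  # 10 cases in 5000 person-years
rate_ratio = ir_exposed / ir_unexposed
rate_difference = ir_exposed - ir_unexposed
print(f"rate ratio = {rate_ratio:.2f}")
print(f"rate difference = {rate_difference * 1000:.2f} per 1000 person-years")
```

Person-time is the appropriate denominator for open cohorts because subjects enter and leave at different times and so are at risk for different durations.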



A community intervention study targets the whole community and not individuals. There are basically 2 different study designs. In a single community design, disease incidence is measured before and after intervention. In a 2-community design, one community receives an intervention whereas another one serves as the control. Allocation of a community to either the intervention or the control group is by randomization. The intervention and the assessment of the outcome may involve the whole community or a sample of the community. Outcome measures may be individual level measures or community level measures.



The aim of randomization in controlled clinical trials is to make sure that there is no selection bias and that the two series are as alike as possible by randomly balancing confounding factors. Patients with a disease are allocated randomly to 2 groups. One group receives the drug being tested. The other group, also called the comparison group, receives a placebo or another drug being compared. Equal allocation in randomization is the most efficient design. Methods of randomization include alternate cases and sealed serially numbered envelopes. Stratified randomization is akin to block design of experimental studies. Randomization is not successful with small samples and does not always ensure correct conclusions.
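
One way to obtain equal allocation in practice is permuted-block randomization, sketched below. This technique is not named in the text and is used here only as an illustration; the function name, arm labels, and block size are hypothetical. In a stratified design the same procedure would be run separately within each stratum.

```python
import random


def block_randomize(n_patients, block_size=4, arms=("treatment", "control"), seed=None):
    """Allocate patients to arms in shuffled blocks so the arms stay balanced."""
    if block_size % len(arms) != 0:
        raise ValueError("block size must be a multiple of the number of arms")
    rng = random.Random(seed)  # a seed makes the list reproducible for auditing
    allocation = []
    while len(allocation) < n_patients:
        # Each block holds every arm equally often, in random order.
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)
        allocation.extend(block)
    return allocation[:n_patients]


# Hypothetical trial of 20 patients.
schedule = block_randomize(20, block_size=4, seed=2005)
print(schedule.count("treatment"), schedule.count("control"))
```

The resulting list could be used to prepare the sealed, serially numbered envelopes mentioned above.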




Data analysis affects practical decisions and should therefore be taken seriously. Simple manual inspection of the data can help identify outliers, identify commonsense relationships, and alert the investigator to errors in computer analysis. Two procedures are employed in analytic epidemiology: tests for association and measures of effect. The test for association is done first. The assessment of the effect measures is done after finding an association. Effect measures are useless in situations in which tests for association are negative. The common tests for association are: the t-test, the F test, chi-square, the linear correlation coefficient, and the linear regression coefficient. The effect measures commonly employed are: odds ratio, risk ratio, and rate difference. Measures of trend can discover relationships that are too small to be picked up by association and effect measures.
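
As an illustration of a test for association, the sketch below computes the Pearson chi-square statistic for a 2x2 table of exposure by disease, without continuity correction (an assumption made to keep the example short). The p-value shortcut through the complementary error function holds only for 1 degree of freedom. The function name and the counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom, no continuity correction).

    a = exposed with disease,   b = exposed without disease,
    c = unexposed with disease, d = unexposed without disease.
    """
    n = a + b + c + d
    row_exposed, row_unexposed = a + b, c + d
    col_diseased, col_healthy = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (
        row_exposed * row_unexposed * col_diseased * col_healthy
    )
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical data: 30 of 100 exposed and 10 of 100 unexposed are diseased.
chi2, p = chi_square_2x2(30, 70, 10, 90)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

Only if this test indicates an association would the analysis proceed to an effect measure such as the risk ratio, following the order described above.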



An epidemiological study should be considered as a sort of measurement with parameters for validity, precision, and reliability. Validity is a measure of accuracy. Precision measures variation in the estimate. Systematic errors lead to bias and therefore invalid parameter estimates. Random errors lead to imprecise parameter estimates.


Internal validity is concerned with the results of each individual study. Internal validity is impaired by study bias. External validity is generalizability of results. Traditionally results are generalized if the sample is representative of the population. In practice generalizability is achieved by looking at results of several studies each of which is individually internally valid. It is therefore not the objective of each individual study to be generalizable because that would require assembling a representative sample.


Precision is a measure of the lack of random error. An effect measure with a narrow confidence interval is said to be precise. An effect measure with a wide confidence interval is imprecise. Precision is increased in three ways: increasing the study size, increasing study efficiency, and taking care in the measurement of variables to decrease mistakes.
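
The link between study size and precision can be shown with a small sketch. It uses the simple normal-approximation (Wald) confidence interval for a proportion, which is an assumption made for illustration; the function name and the figures are hypothetical.

```python
import math
from statistics import NormalDist


def proportion_ci(cases, n, level=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = cases / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# The same prevalence (20%) estimated from studies of increasing size.
for cases, n in [(20, 100), (80, 400), (320, 1600)]:
    low, high = proportion_ci(cases, n)
    print(f"n = {n:5d}: {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

Under this approximation the interval width shrinks with the square root of the study size, so quadrupling the sample halves the width.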



Meta analysis refers to methods used to combine data from more than one study to produce a quantitative summary statistic. Meta analysis enables computation of an effect estimate for a larger number of study subjects thus enabling picking up statistical significance that would be missed if analysis were based on small individual studies. Meta analysis also enables study of variation across several population subgroups since it involves several individual studies carried out in various countries and populations. Criteria must be set for what articles to include or exclude. Information is abstracted from the articles on a standardized data abstract form with standard outcome, exposure, confounder, or effect modifying variables.
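
A minimal sketch of how a quantitative summary statistic might be computed is given below. It assumes a fixed-effect model with inverse-variance weighting of log odds ratios, which is only one of several pooling methods and is not specified in the text; the function name and the study values are hypothetical.

```python
import math


def pooled_odds_ratio(studies):
    """Fixed-effect inverse-variance pooling of odds ratios.

    studies: list of (odds_ratio, standard_error_of_log_odds_ratio) pairs.
    Returns the pooled odds ratio and the standard error of its logarithm.
    """
    # More precise studies (smaller standard error) receive larger weights.
    weights = [1 / se ** 2 for _, se in studies]
    log_ors = [math.log(odds_ratio) for odds_ratio, _ in studies]
    pooled_log = sum(w * x for w, x in zip(weights, log_ors)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return math.exp(pooled_log), pooled_se


# Three hypothetical small studies, each too imprecise to be conclusive alone.
studies = [(1.8, 0.45), (1.5, 0.40), (2.1, 0.50)]
pooled, se = pooled_odds_ratio(studies)
lower = math.exp(math.log(pooled) - 1.96 * se)
upper = math.exp(math.log(pooled) + 1.96 * se)
print(f"pooled OR = {pooled:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The pooled standard error is smaller than that of any single study, which is how combining studies can reveal statistical significance that individual small studies miss.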




Misclassification is inaccurate assignment of exposure or disease status. Misclassification bias is classified as information bias or detection bias. Information bias is systematic incorrect measurement of response due to questionnaire defects, observer errors, respondent errors, instrument errors, diagnostic errors, and exposure mis-specification. Detection bias arises when disease or exposure is sought more vigorously in one comparison group than in the other.



Selection bias arises when subjects included in the study differ in a systematic way from those not included. It is due to biological factors, disease ascertainment procedures, or data collection procedures. Selection bias during data collection is represented by non-response bias and follow-up bias.



Confounding is mixing up of effects. Confounding bias arises when the disease-exposure relationship is disturbed by an extraneous factor called the confounding variable. The confounding variable is not actually involved in the exposure-disease relationship. It is, however, predictive of disease and is unequally distributed between exposure groups. Being related both to the disease and the risk factor, the confounding variable could lead to a spurious apparent relation between disease and exposure if it is a factor in the selection of subjects into the study.
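
The effect of confounding, and its control by stratification, can be illustrated with the Mantel-Haenszel summary odds ratio. In the sketch below the counts are hypothetical and were chosen so that exposure has no effect within either stratum of the confounder, yet the crude (unstratified) odds ratio suggests an association. The function names are also hypothetical.

```python
def crude_odds_ratio(strata):
    """Odds ratio after collapsing all strata into a single 2x2 table."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    d = sum(s[3] for s in strata)
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel summary odds ratio across strata of a confounder.

    Each stratum is (a, b, c, d) with a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical case-control data split by presence of a confounder.
strata = [
    (80, 20, 40, 10),  # confounder present: stratum odds ratio = 1
    (10, 40, 20, 80),  # confounder absent:  stratum odds ratio = 1
]
print(f"crude OR = {crude_odds_ratio(strata):.2f}")
print(f"Mantel-Haenszel OR = {mantel_haenszel_odds_ratio(strata):.2f}")
```

The gap between the crude and the stratified estimates is the signature of confounding: the apparent association disappears once the extraneous factor is held constant.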

Professor Omar Hasan Kasule, Sr, 14th October 2005