Lecture at the FIFTH ADVANCED ASIAN COURSE IN TROPICAL EPIDEMIOLOGY, INSTITUTE FOR MEDICAL RESEARCH, KUALA LUMPUR, 18-29 AUGUST 2003
By Professor Dr Omar Hasan Kasule, Sr. MB ChB (MUK), MPH, DrPH (Harvard)
Deputy Dean for Research, Faculty of Medicine, UIA, PO Box 141, Kuantan, Pahang, MALAYSIA
Tel 609 513 2797 Fax 609 513 3615 E-M omarkasule@yahoo.com




A. Causal Triangle

B. Risk

C. Cause

D. Disease Cause Associations

E. Criteria Of Causality



A. Definition

B. Classification Of Exposures

C. Measurement Of Exposures



A. Definition Of Determinants

B. Biological Determinants

C. Behavioral Determinants

D. Environmental Determinants

E. Social Determinants



A. Analytic Epidemiology

B. Hypothesis Testing

C. Preliminaries to Data Analysis

D. Procedures Used:



A. Tests Of Association On Means

B. Tests Of Association On Proportions: Single 2 x 2 Contingency Table

C. Tests Of Association On Proportions: Single 2 x K Contingency Table

D. Tests Of Association On Proportions: Stratified 2 x 2 Contingency Tables

E. Properties Of The Chi-Square Statistic



A. Comparison Of Proportions In Contingency Table

B. Measures Of Excessive Risk

C. Regression Effect Estimates

D. Properties Of The Odds Ratio

E. Interaction and Effect Modification



A. Validity

B. Internal Validity

C. External Validity

D. Precision



A. Mis-classification bias

B. Selection bias

C. Confounding bias

D. Mis-specification bias

E. Survey error and sampling bias



The causal triangle






Disease risk is a probability

Risk factor = known empirically to be involved in disease causation

Risk indicator = likely to be cause but are not yet confirmed



Data on causes from animal or human experiments/observations

Causes: causative or preventive

Sufficient cause = constellation of RFs that triggers disease

Necessary cause = RF always part of the sufficient cause

Causes: weak and strong

Multi-causality of most diseases

Synergy = cooperative interaction in disease causation

Antagonism = causes acting against one another


The causal chain or causal pathway


Initiated by the main risk factor

Final stages are due to promoters


Disease and Putative Risk Factor

Statistical or non-statistical association

Statistical association may be causal or non-causal

1 disease with 2 or more RFs

1 disease with 2 or more different independent causes


Criteria of causality

Essential criteria



Time sequence

Biological plausibility


Back-up criteria

Dose-effect relationship



Evidence from intervention

Experimental evidence.



Exposure (personal attribute or environmental agent)

Physiological effect

Cause disease

Protect from disease


Description of exposures

Defined by subjective or objective data

Current or past exposures

Dichotomous (exposed vs unexposed)

Ranking by importance

Quantitatively or qualitatively


Instruments for measuring exposures


Personal interviews

Biochemical analyses of biological material

Physical and chemical analysis of the environment


Dimensions of exposure measurement

Nature of the exposure




Errors of exposure measurement

Differential errors bias the odds ratio

Non-differential errors attenuate effect



Biological determinants

Demographic: age and gender


Behavioral determinants




Environmental determinants


Physical agents: heat, cold, and radiation


Social determinants

Socio-economic status


Race & ethnicity

Medical care.




Data analysis affects practical decisions

Construction of hypotheses

Testing hypotheses

2-sided test covers p1>p2 and p2>p1

1-sided test covers either p1>p2 or p2>p1 but not both


Simple manual inspection of the data

Identifying outliers

Assessing the normality of data

Identifying commonsense relationships

Alert the investigator to errors in computer analysis


Data models for continuous data

Straight line regression

Non-linear regression



Data models for categorical data

Maximum likelihood

Logistic models


Two procedures of analytic epidemiology

Measures of association



linear correlation coefficient

linear regression coefficient


Measures of effect

Odds Ratio

Risk Ratio

Rate difference

Logistic regression coefficient


Trend analysis

Relationships missed by association measures

Relations missed by effect measures.



Validity = measure of accuracy

External validity = generalizability

Internal validity = results of individual study not biased



Measures variation in the estimate

Lack of random error = little variation in the estimate

Narrow CI = precise

Wide CI = imprecise


Reliability is reproducibility


Sources of bias

Misclassification bias

Selection bias

Confounding bias

Sampling bias


Types of bias

Negative bias = parameter estimate is below the true parameter

Positive bias = the parameter estimate is above the true parameter



Random (non-differential) errors lead to imprecise parameter estimates

Systematic (differential) errors lead to bias



The concept of the causal triangle (environment, host, and disease) has been used for many years to simplify epidemiological reasoning. Disease risk is a probability. A risk factor is known empirically to be involved in disease causation. Risk indicators are likely to be causes but are not yet confirmed. Data on causes can be obtained from animal or human experiments and observations. Causes may be causative or preventive. A risk factor is described as sufficient when its mere presence will trigger the disease concerned. In practice a sufficient cause refers to a constellation of 2 or more risk factors, since most diseases are multi-causal. One disease normally has more than 1 sufficient cause. Some risk factors are present in all sufficient causes of the disease; these are referred to as necessary causes. Causes may be weak or strong. Causes may interact either cooperatively in disease causation (synergy) or against one another (antagonism). The causal chain or causal pathway is multi-stage: it is initiated by the main risk factor, and its final stages are due to promoters. Association of disease with a putative risk factor may be statistical or non-statistical. Statistical association can be causal or non-causal. One disease may have 2 or more co-factors. One disease may have 2 quite different independent causes. One cause may lead to 2 different diseases. The criteria of causality are either essential criteria or back-up criteria. The essential causal criteria are four: specificity, strength, time sequence, and biological plausibility. The back-up causal criteria are five: dose-effect relationship, repetition, consistency, evidence from intervention, and experimental evidence.



An exposure is defined as a substance, phenomenon, or event that has a physiological effect and can cause or protect from disease. Exposures may be personal attributes or environmental agents, defined by subjective or objective data, and may be current or past. Exposures can be dichotomous (exposed vs unexposed), ranked according to importance, or stratified. Categorization may be based on statistical distributions, for example BMI. Exposures may be measured quantitatively or qualitatively. The following instruments are used to measure exposures: questionnaires, personal interviews, biochemical analyses of biological material, and physical and chemical analysis of the environment. Measurement of an exposure involves three dimensions: the nature of the exposure, the dose, and time. Differential errors in exposure measurement result in a biased odds ratio; the bias remains even if the sample size is increased. Non-differential errors make the odds ratio tend to the null value (attenuation of effect). Non-differential error lowers study power and requires a larger sample size to detect a given difference. Measurement errors can be reduced by multiple assessments of the exposure, such as repeat assessments of cholesterol. The effect measure can be adjusted to account for the effect of the error. The best approach is to use high-quality control measures at the stage of data collection to minimize errors.
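The attenuating effect of non-differential misclassification can be illustrated with a minimal sketch. All counts, sensitivity, and specificity values below are hypothetical; applying the same measurement error to cases and controls pulls the odds ratio toward the null value of 1.

```python
# Illustrative sketch (hypothetical numbers): non-differential exposure
# misclassification attenuates the odds ratio toward the null value of 1.

def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a=exposed cases, b=unexposed cases,
    c=exposed controls, d=unexposed controls."""
    return (a * d) / (b * c)

def misclassify(exposed, unexposed, sensitivity, specificity):
    """Apply the same (non-differential) exposure measurement error to a
    group: some exposed are recorded as unexposed and vice versa."""
    observed_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    observed_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return observed_exposed, observed_unexposed

# True table: 60/40 exposed/unexposed cases, 30/70 exposed/unexposed controls
true_or = odds_ratio(60, 40, 30, 70)          # = 3.5

# Identical error in cases and controls (sensitivity 0.8, specificity 0.9)
a, b = misclassify(60, 40, 0.8, 0.9)
c, d = misclassify(30, 70, 0.8, 0.9)
observed_or = odds_ratio(a, b, c, d)

print(round(true_or, 2), round(observed_or, 2))   # observed OR < true OR
```

Increasing the sample size would not remove this attenuation; only better measurement (or a correction using known sensitivity and specificity) would.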



Biological determinants are demographic or genetic. The age and gender structure of a population have an impact on mortality and morbidity. Pre-disposition to many diseases is inherited. Some diseases are known to be genetically caused while the genetic basis of others is being unravelled. Behavioral determinants are lifestyle and nutrition. Environmental determinants are infections and physical agents such as heat, cold, and radiation. Social determinants are socio-economic status, occupation, race, ethnicity, and medical care.



Data analysis affects practical decisions. It involves construction of hypotheses and testing them. The 2-sided test covers p1>p2 and p2>p1. The 1-sided test covers either p1>p2 or p2>p1 but not both. The 2-sided test is preferentially used because it is more conservative. Simple manual inspection of the data can help identify outliers, assess the normality of the data, identify commonsense relationships, and alert the investigator to errors in computer analysis. Data models for continuous data can be straight-line regression, non-linear regression, or trends. Data models for categorical data are maximum likelihood and logistic models. Two procedures are employed in analytic epidemiology. The test for association is done first; the assessment of effect measures is done after finding an association. Effect measures are useless in situations in which tests for association are negative. The common tests for association are the t-test, the chi-square test, the linear correlation coefficient, and the linear regression coefficient. The effect measures commonly employed are the Odds Ratio, the Risk Ratio, and the Rate Difference. Measures of trend can discover relationships that are not picked up by association and effect measures.
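The relationship between 1-sided and 2-sided tests can be sketched with a z test comparing two proportions. The counts below are invented for illustration; the point is that the 2-sided p-value is twice the 1-sided value (for a positive z), which is why it is the more conservative choice.

```python
# Hedged sketch with made-up counts: z test for two proportions, showing
# the 2-sided p-value is double the 1-sided p-value (more conservative).
import math

def z_test_two_proportions(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def normal_sf(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z = z_test_two_proportions(30, 100, 18, 100)       # hypothetical data
p_one_sided = normal_sf(z)                          # tests only p1 > p2
p_two_sided = 2 * normal_sf(abs(z))                 # tests p1 != p2

print(round(z, 3), round(p_one_sided, 4), round(p_two_sided, 4))
```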



The tests below are used for continuous measurement data. The t-test is used for two sample means. Analysis of variance, ANOVA (F test), is used for more than 2 sample means. Multivariate analysis of variance, MANOVA, is used when there is more than one outcome variable. Linear regression is used in conjunction with the t-test for data that require modeling. Dummy variables in the regression model can be used to control for confounding factors like age and sex. The chi-square test is used to test association of 2 or more proportions in contingency tables. The exact test is used to test proportions for small sample sizes. The Mantel-Haenszel chi-square statistic is used to test for association in stratified 2 x 2 tables. The chi-square statistic is valid if at least 80% of cells have expected counts of 5 or more and all cells have expected counts of at least 1.0. If the observations are not independent of one another, as in paired or matched studies, the McNemar chi-square test is used instead of the usual Pearson chi-square test. The chi-square test relies on a large-sample approximation and works best when that approximation to the Gaussian distribution holds.
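The Pearson chi-square computation on a single 2x2 table can be sketched directly from its definition. The counts are hypothetical; the code also checks the expected-count rule of thumb described above before trusting the statistic.

```python
# Minimal sketch: Pearson chi-square statistic for a single 2x2 table
# (hypothetical counts), with a check on the expected-count rule of thumb.

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (1 df) and expected counts for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [a, b, c, d]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, expected

chi2, expected = chi_square_2x2(20, 80, 10, 90)
print(round(chi2, 3), [round(e, 1) for e in expected])

# All expected counts are at least 5, so the approximation is acceptable;
# chi2 exceeds the 0.05 critical value of 3.841 for 1 df.
```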



The Mantel-Haenszel Odds Ratio is used for 2 proportions in single or stratified 2 x 2 contingency tables. Logistic regression can be used as an alternative to the MH procedure. For paired proportions, a special form of the MH OR and a special form of logistic regression, called conditional logistic regression, are used. Excessive disease risk is measured by the Attributable Risk, the Attributable Risk Proportion, and the Population Attributable Risk. Variation of an effect measure by levels of a third variable is called effect modification by epidemiologists and interaction by statisticians. Synergism or antagonism occurs when the interaction between two causative factors leads to an effect greater or smaller than what is expected on the basis of additivity. Interaction can be conceptualized at 4 levels: statistical (additive and multiplicative), biologic, public health, and decision-making. The chi-square test for heterogeneity can be used to test for effect modification/interaction.
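The Mantel-Haenszel summary odds ratio has a simple closed form that can be sketched directly. The stratum counts below are invented (e.g. two age strata); the estimator pools the stratum-specific 2x2 tables into one summary OR.

```python
# Sketch with invented stratum counts: the Mantel-Haenszel summary odds
# ratio pools 2x2 tables across strata of a third variable.

def mantel_haenszel_or(strata):
    """strata: list of (a, b, c, d) tables with a=exposed cases,
    b=unexposed cases, c=exposed controls, d=unexposed controls."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

strata = [(10, 20, 15, 55),    # hypothetical stratum 1 (e.g. younger)
          (40, 25, 30, 45)]    # hypothetical stratum 2 (e.g. older)

print(round(mantel_haenszel_or(strata), 2))
```

The summary falls between the two stratum-specific ORs; a chi-square test for heterogeneity would check whether pooling is appropriate at all.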



An epidemiological study should be considered a sort of measurement with parameters for validity and precision. Validity is a measure of accuracy. Validity can be classified as internal validity and external validity. External validity is also called generalizability. Precision measures variation in the estimate. Reliability is reproducibility. Bias is defined technically as the situation in which the expectation of the parameter estimate does not equal the true parameter. The following types of bias are explained in the next unit: misclassification bias, selection bias, and confounding bias. Bias may move the effect parameter away from the null value or toward the null value. In negative bias the parameter estimate is below the true parameter. In positive bias the parameter estimate is above the true parameter. A study is not valid if it is biased. Systematic errors lead to bias and therefore to invalid parameter estimates. Random errors lead to imprecise parameter estimates. Internal validity is concerned with the results of each individual study; it is impaired by study bias. External validity is generalizability of results. Traditionally, results are generalized if the sample is representative of the population. In practice, generalizability is achieved by looking at the results of several studies, each of which is individually internally valid. It is therefore not the objective of each individual study to be generalizable, because that would require assembling a representative sample. Precision is a measure of lack of random error. An effect measure with a narrow confidence interval is said to be precise. An effect measure with a wide confidence interval is imprecise. Precision is increased in three ways: increasing the study size, increasing study efficiency, and taking care in the measurement of variables to decrease mistakes.
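The link between study size and precision can be sketched with the approximate 95% confidence interval for a proportion. The prevalence of 0.2 is hypothetical; quadrupling the sample size twice (100 to 1600) halves the interval width twice.

```python
# Illustrative sketch: precision improves (the confidence interval
# narrows) as study size increases, for a fixed observed proportion.
import math

def ci_width_95(p, n):
    """Width of the approximate 95% CI for a proportion
    (normal approximation: p +/- 1.96 * standard error)."""
    se = math.sqrt(p * (1 - p) / n)
    return 2 * 1.96 * se

small = ci_width_95(0.2, 100)     # small study: wide CI, imprecise
large = ci_width_95(0.2, 1600)    # 16x larger study: CI 4x narrower

print(round(small, 3), round(large, 3))
```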



Misclassification is inaccurate assignment of exposure or disease status. Random (non-differential) misclassification biases the effect measure towards the null, underestimating the effect, but does not introduce systematic error away from the null. Non-random (differential) misclassification is a systematic error that biases the effect measure away from the null, exaggerating or underestimating it. A positive association may become negative and a negative association may become positive. Misclassification bias is classified as information bias, detection bias, and protopathic bias. Information bias is systematically incorrect measurement of response due to questionnaire defects, observer errors, respondent errors, instrument errors, diagnostic errors, and exposure mis-specification. Detection bias arises when disease or exposure is sought more vigorously in some groups than in others. Protopathic bias arises when early signs of disease cause a change in behaviour with regard to the risk factor. Misclassification bias can be prevented by using double-blind techniques to decrease observer and respondent bias. Treatment of misclassification bias employs the probabilistic approach and measurement of inter-rater variation.
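The probabilistic approach mentioned above can be sketched as a simple back-correction: if the sensitivity and specificity of exposure assessment are known, the observed exposed count can be inverted to recover the true count. All numbers below are hypothetical.

```python
# Hedged sketch of the probabilistic (back-correction) approach to
# misclassification: invert observed = Se*E + (1-Sp)*(n-E) for E.

def correct_exposed(observed_exposed, n, sensitivity, specificity):
    """Recover the true exposed count E from the observed count,
    given known sensitivity (Se) and specificity (Sp)."""
    return (observed_exposed - (1 - specificity) * n) / (
        sensitivity + specificity - 1)

# Truth (hypothetical): 60 of 100 truly exposed. With Se=0.8, Sp=0.9
# we would observe 60*0.8 + 40*0.1 = 52 classified as exposed.
estimated = correct_exposed(52, 100, 0.8, 0.9)
print(round(estimated, 1))   # recovers 60.0
```

In practice Se and Sp are themselves estimated (e.g. from a validation sub-study), so the corrected estimate carries their uncertainty as well.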


Selection bias arises when subjects included in the study differ in a systematic way from those not selected. Selection bias due to biological factors includes the Neyman fallacy and susceptibility bias. The Neyman fallacy arises when the risk factor is related to prognosis (survival), thus biasing prevalence studies. Susceptibility bias arises when susceptibility to disease is indirectly related to the risk factor. Selection bias due to disease ascertainment procedures includes publicity, exposure, diagnostic, detection, referral, self-selection, and Berkson biases. Self-selection bias includes the healthy worker effect, which arises because sick people are not employed or are dismissed. The Berkson fallacy arises due to differential admission of some cases to hospital in proportions such that studies based on the hospital give a wrong picture of disease-exposure relations in the community. Selection bias during data collection is represented by non-response bias and follow-up bias. Prevention: study design should avoid the causes of selection bias mentioned above. Treatment: there are no easy methods of adjustment for the effect of selection bias once it has occurred.


Confounding is the mixing up of effects. Confounding bias arises when the disease-exposure relationship is disturbed by an extraneous factor called the confounding variable. The confounding variable is not actually involved in the exposure-disease relationship; it is, however, predictive of disease but unequally distributed between exposure groups. Being related both to the disease and to the risk factor, the confounding variable can lead to a spurious apparent relation between disease and exposure. A confounder must fulfil the following criteria: relation to both disease and exposure without being part of the causal pathway, being a true risk factor for the disease, being associated with the exposure in the source population, and not being affected by either the disease or the exposure. Prevention of confounding at the design stage, by eliminating the effect of the confounding factor, can be achieved using 4 strategies: pair-matching, stratification, randomisation, and restriction. Care must be taken to deal only with true confounders; adjusting for non-confounders reduces the precision of the study. Non-multivariate treatment of confounding employs standardization and stratified Mantel-Haenszel analysis. Multivariate treatment of confounding employs multivariate adjustment procedures: multiple linear regression, the linear discriminant function, and multiple logistic regression.
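A constructed example can show how a confounder manufactures a spurious crude association that stratification removes. The counts below are deliberately contrived: within each stratum of the (unnamed, hypothetical) confounder, exposure and disease are unrelated, yet the crude table shows a strong association.

```python
# Sketch with constructed counts: a variable associated with both exposure
# and disease produces a spurious crude odds ratio; stratifying removes it.

def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a=exposed cases, b=unexposed cases,
    c=exposed controls, d=unexposed controls."""
    return (a * d) / (b * c)

# Two strata of the confounder; within each, OR = 1 (no true effect)
stratum1 = (162, 18, 18, 2)
stratum2 = (2, 18, 18, 162)

# Collapsing the strata mixes the effects and inflates the crude OR
crude = odds_ratio(162 + 2, 18 + 18, 18 + 18, 2 + 162)

print(round(odds_ratio(*stratum1), 2),   # 1.0 within stratum
      round(odds_ratio(*stratum2), 2),   # 1.0 within stratum
      round(crude, 2))                   # spuriously elevated (~20.75)
```

Stratified Mantel-Haenszel analysis, standardization, or multivariate adjustment would all report the stratum-level answer (no effect) rather than the crude one.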


Mis-specification bias arises when a wrong statistical model is used. For example, using parametric methods on data that do not meet parametric assumptions biases the findings.


Survey error and sampling bias: Total survey error is the sum of the sampling error and three non-sampling errors (measurement error, non-response error, and coverage error). Sampling errors are easier to estimate than non-sampling errors. Sampling error decreases with increasing sample size. Non-sampling errors may be systematic, like non-coverage of the whole sample, or non-systematic. Systematic errors cause severe bias. Sampling bias, positive or negative, arises when results from the sample are consistently wrong (biased) away from the true population parameter. The sources of bias are: an incomplete or inappropriate sampling frame, use of a wrong sampling unit, non-response bias, measurement bias, and coverage bias. Sensitivity analysis can be carried out for the major types of bias, often using simulations.
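The claim that sampling error shrinks with sample size can be checked empirically with a small simulation. The population prevalence of 0.3 and the sample sizes are invented; the empirical spread of sample estimates roughly halves when the sample size quadruples.

```python
# Hedged simulation sketch: the spread of sample prevalence estimates
# around the true population value shrinks as the sample size grows.
import random

random.seed(42)

def sampling_error_sd(n, reps=1000, prevalence=0.3):
    """Empirical standard deviation of sample prevalence estimates
    from repeated samples of size n (true prevalence is hypothetical)."""
    estimates = [sum(random.random() < prevalence for _ in range(n)) / n
                 for _ in range(reps)]
    mean = sum(estimates) / reps
    return (sum((e - mean) ** 2 for e in estimates) / reps) ** 0.5

sd_small = sampling_error_sd(100)   # theory: ~sqrt(0.3*0.7/100) = 0.046
sd_large = sampling_error_sd(400)   # theory: ~sqrt(0.3*0.7/400) = 0.023

print(round(sd_small, 3), round(sd_large, 3))
```

Note that this shrinkage applies only to sampling error; the non-sampling errors listed above do not disappear with a larger sample.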
