By Professor Omar Hasan Kasule Sr.



Sampling starts by defining a sampling frame (a list of the individuals eligible to be sampled). Specific methods are then used to select the sample from the defined population. The sampling units are the people or objects to be sampled. A sampling frame can be looked at as the enumeration of the population by sampling units.



This is the simplest type of random sampling. In simple random sampling, any sample of size n has an equal chance of being selected from the population, so all units in the population are at equal risk of being selected into the sample. Simple random sampling eliminates personal bias. This is because, unlike the situation in convenience or quota sampling, the researcher has no way of pre-determining that a particular member of the population will be included in the sample. Much of statistics is concerned with the estimation of the magnitude of the sampling error. It is possible to compute the sampling error of the mean, proportions, and variance if the underlying sampling was simple random. The magnitude of the sampling error gives a measure of the precision of the parameter estimates. Knowledge of this precision is necessary to interpret inferential findings.
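As an illustration, simple random sampling from an enumerated frame can be sketched in a few lines of Python. The frame of 1,000 numbered individuals, the sample size of 50, and the function name simple_random_sample are assumptions made for this example only.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(frame, n)

# Hypothetical sampling frame: 1,000 numbered individuals.
frame = list(range(1, 1001))
sample = simple_random_sample(frame, 50, seed=1)
print(len(sample), "units selected, all distinct:", len(set(sample)) == 50)
```

Because selection is left entirely to the random number generator, the researcher cannot decide in advance that a particular unit will be included.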



In this type of sampling the whole population is divided into groups called strata. It forces the investigator to select some elements from each of the strata, thus achieving some sort of balance for the whole sample. A pre-determined proportion or fraction of each stratum is randomly selected into the sample. Selection is carried out separately in each stratum using random selection. The sampling fraction may be the same for all strata or may vary from stratum to stratum. Varying the sampling fractions enables deliberate over-sampling or under-sampling of some strata. Another way of stating this is to give each stratum a weighting. The inclusion probabilities are then different for the different strata. Those to be over-sampled have a higher weighting than those to be under-sampled.
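A minimal sketch of stratified sampling with unequal sampling fractions is given below. The two strata ("urban" and "rural"), their sizes, and the fractions of 5% and 20% are hypothetical values chosen to show deliberate over-sampling of the smaller stratum.

```python
import random

def stratified_sample(strata, fractions, seed=None):
    """Select a random sample separately within each stratum.

    strata:    dict mapping stratum name -> list of units
    fractions: dict mapping stratum name -> sampling fraction (0 to 1)
    """
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        k = round(len(units) * fractions[name])  # number to draw from this stratum
        sample[name] = rng.sample(units, k)
    return sample

# Hypothetical population: 800 urban and 200 rural residents.
strata = {
    "urban": [f"U{i}" for i in range(800)],
    "rural": [f"R{i}" for i in range(200)],
}
# The rural stratum is deliberately over-sampled (20% versus 5%).
fractions = {"urban": 0.05, "rural": 0.20}
sample = stratified_sample(strata, fractions, seed=1)
print({name: len(units) for name, units in sample.items()})
```

With these assumed fractions each stratum contributes 40 units, although the rural stratum is only one fifth of the population.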



This type of sampling is used when there is an ordered list, i.e. the population is arranged in some definite and known order. The decision can then be made to include in the sample every nth unit, where n may be any number. The first unit is selected at random and selection then proceeds according to the pre-defined pattern. Systematic sampling is less efficient and less accurate than simple random sampling if the sampling interval coincides with the pattern of natural variation in the population. It is more efficient than simple random sampling if the sampling interval does not have the same periodicity as the population. This type of sampling will be invalid if there is a natural order in the population that repeats exactly every n elements, where n is the sampling interval. Systematic sampling has the advantage that it is quick and easy to use. The disadvantage of systematic sampling is that it requires assembling a complete sampling frame.
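The procedure of a random start followed by every nth unit can be sketched as follows. The ordered frame of 100 units and the interval of 10 are assumptions made for illustration.

```python
import random

def systematic_sample(frame, interval, seed=None):
    """Pick a random start within the first interval, then take every interval-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random position among the first `interval` units
    return frame[start::interval]

# Hypothetical ordered list of 100 units, sampling every 10th.
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, seed=1)
print(sample)
```

If the list had a cycle of exactly 10 units, every selected unit would fall at the same point of the cycle, which is the periodicity problem described above.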



This is a random sample selected in 2 or more stages. The sample selected at the second stage is a subsample of that selected at the first stage. An example of 5-stage sampling may involve the following administrative units in descending order: city, neighborhood, block, household, individual. Multi-stage selection is also done, for example, when a random sample is selected from each of the 2 gender categories, male and female, and random samples are then selected from each age category of each gender category. If a sample of households is selected, the households are the primary sampling units (PSU). Household members selected randomly from each household are the secondary sampling units (SSU). The resulting multi-stage sample has the advantage of being balanced with respect to gender, age, or household characteristics. Multi-stage sampling produces less efficient estimates of population parameters than simple random sampling. It saves time and money and is thus cheaper than simple random sampling. It does not require enumeration of the whole sampling frame before the start of the sampling process. It is especially convenient when the complete sampling frame is not known. It has the great advantage of ensuring balanced representation of groups, which may not occur with simple random sampling. It is possible to have a sampling scheme that combines stratified with 2-stage sampling.
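A two-stage version of this scheme, with households as primary sampling units and household members as secondary sampling units, might be sketched as below. The 20 households of 4 members each, and the choice of 5 households and 2 members per household, are hypothetical.

```python
import random

def two_stage_sample(households, n_households, n_members, seed=None):
    """Stage 1: sample households (PSU). Stage 2: sample members (SSU) within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(households), n_households)
    return {
        h: rng.sample(households[h], min(n_members, len(households[h])))
        for h in chosen
    }

# Hypothetical frame: 20 households with 4 members each.
households = {
    f"H{i}": [f"H{i}-P{j}" for j in range(1, 5)] for i in range(1, 21)
}
sample = two_stage_sample(households, n_households=5, n_members=2, seed=1)
print(sample)
```

Only the members of the 5 selected households need to be listed, which is why a complete frame of individuals is not required in advance.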



Convenience or casual sampling is purely subjective. Selection follows the whims of the investigator, with no particular concern for objectivity or representativeness.


A quota sample is intended to be representative in the sense that it is deliberately chosen to have the characteristics of the population. The number to be selected from each category is fixed in advance. Each interviewer is given instructions about certain characteristics such as age, sex, and socio-economic status (SES) and is asked to select fixed numbers for each category corresponding to the category's proportion in the population. This method is systematic at the level of the investigator but very subjective at the level of the interviewer. Bias is likely in quota sampling. The method cannot ensure that the sample is representative of the population. It is also not possible to think of all the relevant categories and their classifications in advance to enable correct categorization and determination of the proportions to be selected from each category. It is too expensive to carry out a preliminary study for the sole purpose of determining the categories.


Cluster sampling is easy and cheap but less precise. Instead of using individuals as sampling units, groups of individuals (clusters) are used. The clusters may be natural or artificial. For example, instead of sampling individuals, households may be sampled. Clusters are normally selected as natural sub-groupings of the population. A random sample of clusters is selected and all elements of each selected cluster are included in the study sample. Cluster sampling can be viewed as a form of simple random sampling of clusters rather than of individual sampling units. Cluster sampling can also be looked at as a form of 2-stage sampling in which all elements of the groups drawn in the first stage are included in the study sample. Cluster sampling proceeds by selecting geographical units like districts or zip codes. Then a house is selected at random in each unit. A cluster of given size is then formed around the index house. Sophisticated methods for this selection have been developed. For example, the researcher may walk in a straight line in a pre-determined direction while counting until a pre-determined number of houses is counted. These houses together with the index house will then constitute the cluster. Similar clusters are formed in the other zip codes and members of the households are interviewed as study subjects. Cluster sampling has several advantages. There is no need to have a complete sampling frame for the whole population. Cluster sampling is easy, quick, and cheap. Clusters can be selected from the more accessible areas. Cluster sampling has some disadvantages. It is non-random. It is less precise than the simple random sample because units selected within each cluster are similar to one another. Thus a cluster sample produces more similarity than there is in the actual population.
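The basic form of cluster sampling, in which clusters are drawn at random and every element of a selected cluster is included, can be sketched as follows. The 10 blocks of 6 houses each are hypothetical, and the sketch does not attempt the walking method of forming clusters around an index house.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters; all units in a selected cluster are studied."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {c: list(clusters[c]) for c in chosen}

# Hypothetical population: 10 blocks, each containing 6 houses.
clusters = {
    f"block{i}": [f"block{i}-house{j}" for j in range(6)] for i in range(10)
}
chosen = cluster_sample(clusters, n_clusters=3, seed=1)
subjects = [unit for units in chosen.values() for unit in units]
print(len(chosen), "clusters,", len(subjects), "houses")
```

The sampling unit here is the block, not the house, so only a list of blocks is needed as the sampling frame.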


Randomization is used in experimental studies, e.g. clinical trials. Randomization is an alternative to random sampling. In randomization you start with one group and randomly divide it into two or more groups that are then compared.
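Random division of one group into comparison groups can be illustrated with the short sketch below. The 20 numbered subjects and the two arms are assumptions; real trials often use more elaborate schemes such as block randomization, which are not shown here.

```python
import random

def randomize(subjects, n_groups=2, seed=None):
    """Shuffle the subjects, then deal them out into n_groups groups in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

# Hypothetical trial with 20 enrolled subjects and 2 arms.
subjects = list(range(1, 21))
groups = randomize(subjects, n_groups=2, seed=1)
print("arm A:", sorted(groups[0]))
print("arm B:", sorted(groups[1]))
```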


Epidemiological samples involve random sampling of human populations. There are basically three types of sampling schemes: cross-sectional, case control, and follow-up or cohort.



Samples are selected so that they can be used to collect data to answer specific questions. The size of the sample needed depends on the nature of the question or hypothesis being tested. The following are considerations in the determination of the sample size: the budget available for the study, the time within which results are needed, minimization of sampling error, and achieving pre-specified parameters of precision. The most important consideration is the precision of the estimates. There are therefore procedures and formulas for computing sample sizes. If the sample size is too small the study will not have sufficient power to answer the question under consideration accurately. If the sample size is bigger than necessary there will be a waste of resources as information is collected from more persons than are needed. At the conceptual level, sample selection is a tool to study the heterogeneity of the population. If a population is perfectly homogeneous, then a sample of 1 person, however selected, will be sufficient to study that population. If a population has several perfectly homogeneous subgroups then selection of one element from each group will provide a sample that sufficiently describes the population. Similarly a sample of one group with all its elements will be sufficient to represent the population. The following presentation of sample size formulas suffers from the defect that some of the terms used will only be defined and discussed in later units. This section can therefore be deferred until those units have been covered. There are special computer programs such as EPI-INFO that can be used to compute sample sizes.
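As one illustration of such a formula (not necessarily the one intended here), a widely taught expression for the sample size needed to estimate a single proportion from a simple random sample is n = z^2 p(1 - p) / d^2, where p is the anticipated proportion, d the desired margin of error, and z the standard normal value for the chosen confidence level. The function name and the example values p = 0.5 and d = 0.05 are assumptions.

```python
from math import ceil
from statistics import NormalDist

def sample_size_proportion(p, d, confidence=0.95):
    """Sample size to estimate a proportion p within +/- d (simple random sampling)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return ceil(z**2 * p * (1 - p) / d**2)

# Assumed inputs: anticipated prevalence 50%, margin of error 5 percentage points.
n = sample_size_proportion(p=0.5, d=0.05)
print(n)  # 385
```

Halving the margin of error roughly quadruples the required sample size, which shows how precision drives cost.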


The incidence rate (IR) is a basic measure of disease occurrence. Incidence rate (IR) = # newly reported cases of a disease in a year / mid-year population.



Prevalence is a static concept that is a measure of state. It is a still picture of the disease situation at a given point in time. Whereas incidence relates to events, prevalence relates to disease states at a point in time. The prevalence number is the number of cases of disease existing at the particular point in time. The prevalence proportion = # cases of illness at a particular time / # of individuals in the population at the same time. Prevalence proportion is also called prevalence rate or point prevalence. Prevalence is measured in cross-sectional studies. Only one observation at one point in time is needed in the determination of prevalence.
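The two definitions, incidence rate as new cases in a year over the mid-year population and prevalence proportion as existing cases over the population at the same time, can be computed as in the sketch below. All counts are made-up numbers for illustration.

```python
def incidence_rate(new_cases, mid_year_population):
    """Newly reported cases in a year divided by the mid-year population."""
    return new_cases / mid_year_population

def prevalence_proportion(existing_cases, population):
    """Cases existing at a point in time divided by the population at that time."""
    return existing_cases / population

# Hypothetical community of 100,000 people.
ir = incidence_rate(new_cases=50, mid_year_population=100_000)
pp = prevalence_proportion(existing_cases=1_200, population=100_000)
print(f"incidence rate: {ir * 100_000:.0f} per 100,000 per year")
print(f"prevalence proportion: {pp * 100:.1f}%")
```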



Measures of excess disease occurrence, also called measures of effect, are based on measures of association. Excess disease risk is measured as an absolute effect (Rate Difference or Risk Difference) or a relative effect (Relative Risk, Rate Ratio, Risk Ratio, Prevalence Ratio, Cumulative Incidence Ratio, Incidence Density Ratio, Odds Ratio, and Standardized Mortality Ratio).



Timing: Most countries hold a decennial census, i.e. one every 10 years.


Estimates: Intercensal estimates are made every year in the period between 2 censuses. These estimates are based on the data of the previous census. The following additional information is also used: death rate, immigration, emigration, and vital statistics of births and deaths.
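One simple way of combining these inputs is the demographic balancing equation: the previous census count plus births and immigrants, minus deaths and emigrants. The sketch below assumes that form; statistical offices may use more refined methods, and all figures are hypothetical.

```python
def intercensal_estimate(census_count, births, deaths, immigrants, emigrants):
    """Update the last census count with the events recorded since that census."""
    return census_count + births - deaths + immigrants - emigrants

# Hypothetical figures for the period since the last census.
estimate = intercensal_estimate(
    census_count=1_000_000,
    births=20_000,
    deaths=8_000,
    immigrants=3_000,
    emigrants=5_000,
)
print(estimate)  # 1010000
```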


Reliability: Governments allocate a lot of resources to ensure that census information is reliable. Despite this some mistakes still occur. Some households or individuals are missed. Incomplete or inaccurate census forms may be submitted. Some persons may be counted twice, once in their place of usual residence and once in their residence on census day. Sampling techniques are used to compute the level of reliability of the census results.


Scope: The census covers demographic, social, economic, and health information. Each government department has its own data needs and the census organization has to strike a balance among the competing needs of several stake-holders; otherwise the census will be unwieldy, collecting too much information in an attempt to satisfy everybody. The information collected changes from census to census depending on the needs. Some information items are kept unchanged so that trends can be assessed over decades.


Sources of errors: (a) Counting: Normally the total count is not very far from accurate. Some subjects are counted twice whereas others are not counted at all. There is a tendency for these errors to balance out. (b) Age: Age is often under-estimated. The correct approach is to record age at last birthday. (c) Occupation: Occupational information is notoriously incomplete and inaccurate due to faulty recall, especially for those who changed jobs frequently.


Description of the population: Population composition is described by sex, race/ethnic group, place of birth, urban/rural distribution, marital conditions, socio-economic indicators (literacy, home ownership, occupation). 



Definition: Vital events are births, deaths, marriages and divorces, and some disease conditions. Collection of vital statistics was initially motivated by the administrative need of keeping a record of vital events that are of legal importance. It was only later, with the growth of the public health discipline, that the use of this data was understood. Even now a lot of the available data is not fully analyzed or utilized to understand public health phenomena.


Coverage: Most countries have legislation requiring mandatory reporting of vital events. However the effectiveness and efficiency of the registration vary. The items of information reportable vary by country and even within the same country by jurisdiction. The coverage and reliability of vital event reporting varies among countries and within jurisdictions of the same country. Established Market Economies have generally good coverage. Poor developing countries do not have the resources, manpower and finance, to maintain reliable systems of vital data collection. They have no strong enforcement mechanisms to ensure full registration. Data processing and report generation are also a problem.


Errors in vital statistics: Vital data may be inaccurate; the usual causes of inaccuracy are misclassification and incomplete information. Reporting of births may not be complete where non-institutional deliveries are common. Uncomplicated home deliveries may not be reported. In cases of extra-marital births, the parents may prefer not to report the information. Deaths at home may not be reported. Institutional deaths may not be reported in the jurisdiction of usual residence. People may die away from their place of usual residence. Reporting of marriages and divorces has some problems. There are many registrars, who may be civilian government marriage offices or religious authorities. The data from these various sources may not be centralized. Many marriages and divorces are informal and are never registered anywhere. Although reporting of specific morbidity data is mandatory, many physicians, especially in the private sector, are reluctant to report, resulting in incomplete information.


Uses of vital statistics: Data on vital events is used for legal purposes, population estimates, and  health planning. The following are legal purposes fulfilled by vital events registration: establishing citizenship, payment of social welfare benefits, property or inheritance rights, establishment of paternity and legal financial support for offspring. The population distribution data is used for the following purposes: marketing, planning infra-structural developments (roads, schools, water, sewage, shops, recreational facilities), military planning, planning social security for workers and their dependents. The health planning functions are: planning number of hospital beds, planning of other health facilities, planning for health manpower, planning health insurance, emergency preparedness, and health budget allocations.



Records have the advantages of being cheap, requiring a shorter study time, accuracy, and a high response rate. Their disadvantages are unavailability, not covering the period of interest, and being incomplete. Diaries can be used to collect information about diet, sex, and exercise. They have the disadvantage that they require skill and commitment from the study subject. The following institutions routinely collect data about their clientele:


Medical facilities: Hospitals, health centers, and other health facilities have limited coverage because they collect data on a small segment of the population that comes to them. The following types of data are available: diseases and their treatment, deaths and their causes, health expenditure, and the demographic character of the catchment area.


Life and health Insurance companies: Unlike health care organizations, insurance companies collect background data bearing on the risk indicators of various disease conditions. They also record health events like surgery because of their impact on premiums.


Institutions: The following institutions collect routine information from their members: military, police, prisons, schools, and factories. Their coverage is limited only to their members. They have the advantage of pre-screening their recruits to make sure they are healthy. They thus have baseline and followup data. They have their own medical facilities where records of all inmates are kept. Their record keeping is efficient because they have strict measures to prevent mis-use of health services and absconding from duty on the basis of illness.  


Disease registries: Starting with cancer, the number of specific disease registries has grown phenomenally. There are registries for congenital anomalies, genetic anomalies, blood dyscrasias etc. Another tendency is the emergence of support groups or support networks that enable people suffering from the same disease to stay in touch. Pharmacies also maintain individual records of prescriptions. Some pharmacy net-works share data over a wide territory. Thus a lot of useful information is available in many data-bases.


Government: administrative records. Administrative records usually relate to public financial assistance or disability. They may not be medically accurate because those who make them are not medically trained. Data collection instruments are not designed with public health in mind with the result that a lot of health-related data is unusable.


Churches: In Europe, churches used to collect and record vital events. This was easy because churches performed marriages, baptisms, and burials thus covering the vital events of the human life cycle. These days with many people becoming non-practicing Christians these records are no longer complete or representative.



Epidemiological studies are undertaken for a specific purpose. They are of limited coverage. They are based on small samples not necessarily representative of the whole population.



Special surveys are studies with larger coverage of the population than epidemiological studies. Many are based on national samples. Health surveys cover symptoms, signs, health-related behavior, treatment, and expenditure. Nutritional surveys cover dietary intake (quality and quantity) and anthropometric and biochemical measures of nutritional status. Socio-demographic surveys cover age, gender, dependency, contraceptive practice, family structure, employment status etc. Information on disease status is obtained from interviews, physical examinations, records, and pathology logs. Information on exposures is obtained from records, direct environmental measurements, and interviews. It is best to obtain exposure information from more than 1 source. Information on confounders is obtained from records and from interviews.







A decision must be made on what items to include in the questionnaire. This is guided by the hypothesis under study and knowledge of potential confounding factors. Questions must be worded properly to make sure they are easy to understand, not biased, not threatening, not loaded, and free of assumptions. Double negatives should be avoided and each question should have only one concept. The order of the questions must be logical, moving from the superficial to the more detailed. Embarrassing questions should be kept towards the end because they may spoil the whole interview. Closed questions are preferred to open questions. The content of a question may be one of the following: knowledge, attitude, belief, experience, behavior, and attributes. Questions should not be too long. The total number of questions must be appropriate. The format and layout of the questionnaire are important.


The reliability and validity of the questionnaire should be tested during the pilot study.

Before administering a questionnaire the investigator should be aware of some ethical issues. Informed consent must be obtained. In the course of the interview the investigator may get information that requires taking life-saving measures. Taking these measures will however compromise the confidentiality. Such a situation may arise in case of an interviewee who informs the interviewer that he is planning to commit suicide later that day. Such information may have to be conveyed immediately to the authorities concerned.


The following are common problems in questionnaires: ambiguous questions, questions that are not self-explanatory, two questions in one, use of unfamiliar words, asking for events that are difficult to remember, insufficient number of response categories, overlapping categories, questions that are too long, questions that have too many ideas, questions that require too much detail, leading questions, improper use of rating scales.



Data collection processes must be clearly defined in a written protocol, which is the operational document of the study. Data collection is usually by questionnaire. The protocol should include the initial version of the questionnaire. This can be updated and improved after the pilot study. If a paper questionnaire is used, data transfer into electronic form will be necessary. The need for this could be obviated by direct on-line entry of data. The objectives of the data collection must be defined clearly. Operational decisions and planning depend on the definition of objectives. It is wrong to collect more data than is necessary to satisfy the objectives. It is also wrong to collect data just in case it may turn out to be useful. The study population is identified. The method of sampling and the size of the sample are determined. Staff to be used must be trained. The training should go beyond telling them what they will do. They must have sufficient understanding of the study that they can detect serious mistakes and deviations. A pilot study to test methods and procedures should be carried out. However well a study is planned, things could go wrong once field work starts. A pilot study helps detect and correct such pitfalls. A quality control program must be part of the protocol from the beginning. Proxy or surrogate respondents must be used when the subject is handicapped or is not available. The next of kin is usually selected for this. Sometimes the subject and the proxy may disagree. In some case control studies, dead controls are selected for dead cases and proxies are interviewed for both series.



In a face-to-face interview, the interviewer reads out questions to the interviewee and completes the questionnaire. The interview may be structured or unstructured. The interviewer should make sure that circumstances of the interview are optimal in terms of place and time. The interviewers should be selected carefully and adequately trained. They should be given an interviewer’s manual to guide them. It is important that interviewers are continuously monitored. This method of data collection has the following advantages: (a) The interviewer can establish the identity of the respondent. With a mailed questionnaire the answers may come from a person other than the intended respondent. (b) There are fewer item non-responses because of the presence of the interviewer, who will encourage and may coax the respondent to answer all items. (c) The interviewer can clarify items that the respondent does not understand or is likely to misunderstand. (d) There is flexibility in the sequence of the items. (e) Open-ended questions are possible. (f) Items irrelevant to the particular interviewee can be dropped, thus saving time. Face-to-face questionnaire administration also has disadvantages: (a) It costs more in terms of time and money. The interviewer has to travel, search for, and spend time with the respondent. (b) A prior appointment is needed to ensure that the respondent will be available at the place and time of the proposed interview. (c) Personal chemistry may not work well. The interviewee may resent the interviewer on the basis of gender, ethnicity, or any other personal and behavioral characteristic. (d) The presence of the interviewer may influence interviewee responses in a subtle way. The interviewee may try to give responses that he thinks are acceptable to the interviewer on the basis of the interviewer's gender, race, SES, and suggestive questioning.
Common errors in face-to-face interviews are omitting a question, too much or too little probing, failure to record information, and cheating by the interviewer.



Questionnaire administration by telephone has the following advantages: (a) There are considerable savings in time and money. It is possible to conduct a nation-wide survey sitting in one office. (b) There are fewer item non-responses because of the personal contact involved. (c) Skip patterns can be followed to save time. (d) Difficult questions can be explained. (e) Interviewer bias is less than in face-to-face interviews. The disadvantages of questionnaire administration by telephone are: (a) Selection bias may operate when the study sample includes only those who have telephones and whose telephone numbers are listed. The problem of unlisted numbers can be overcome by use of random digit dialing. (b) Selection bias may arise due to the day and time of day that the telephone call is placed. Office workers will be missed in early morning calls. Workers on night shifts will be missed in evening calls. (c) It is not possible to be sure whether the person at the other end of the line is the actual intended respondent. Telephone interviews can be improved by use of computers. Computer-assisted telephone interviewing can make the process quicker because the interviewer is prompted by the computer. The computer will work out the skip patterns and will alert the interviewer to responses that are inappropriate or contradictory.

Telephone interviews must also be supervised for optimal results. The supervisor should listen in as the interview is conducted.



In the method of questionnaire administration by mail, a questionnaire is mailed to the respondent's address. The respondent completes and returns the questionnaire in a pre-addressed and stamped envelope. Questionnaire administration by mail has 2 main advantages: (a) It is the cheapest method of data collection. (b) There is no bias due to interviewer involvement. The disadvantages are: (a) low overall response, (b) higher item non-response, and (c) delays in returning the questionnaire. The following measures are undertaken to increase response to mailed questionnaires: (a) sending the questionnaire with a personalized cover letter, (b) promising a token of appreciation for return of the questionnaire, (c) making the questionnaire anonymous by not including any information on the returned questionnaire that can be used to identify a particular individual, (d) providing a self-addressed and stamped envelope for the response, (e) using pre-coded questionnaires so that all the respondent has to do is select responses, and (f) following up by letter those who delay in returning the questionnaires.




Data analysis consists of data editing, data summarization, estimation and interpretation. Simple manual inspection of the data is needed before applying the tests above. Indiscriminate application of  the tests to data leads to wrong or misleading conclusions. Acquiring familiarity with the data by simple manual inspection can help identify outliers, assess the normality of data distribution, and identify commonsense relationships among variables that could alert the investigator to errors in computer analysis.



A decision must also be made whether a 2-tail or 1-tail test is being used. The 2-sided test covers the joint testing of two inequalities between proportions, p1 > p2 and p2 > p1. The 1-sided test covers the testing of only one inequality, p1 > p2 or p2 > p1. The 2-sided test is preferentially used because it is more conservative.


The null hypothesis for a 2-sided test states that there is no association between the exposure and the disease outcome which also implies an odds ratio of unity, OR=1. The alternative hypothesis for a 2-sided test states that there is association between the exposure and the disease outcome; the association may be positive with OR>1.0 or negative with OR <1.0.


The null hypothesis for a 1-sided test states that there is either a negative or no association between the exposure and disease; OR=1.0 or OR<1.0. The alternative hypothesis for a 1-sided test states that there is a positive association between the exposure and disease outcome; OR>1.0.
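The difference between the 1-sided and 2-sided versions can be seen numerically with a two-proportion z-test. The choice of this particular test, the pooled standard error, and the counts (30 of 100 exposed versus 20 of 100 unexposed with disease) are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion for the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def p_values(z):
    """Return (1-sided p for H1: p1 > p2, 2-sided p for H1: p1 != p2)."""
    nd = NormalDist()
    one_sided = 1 - nd.cdf(z)
    two_sided = 2 * (1 - nd.cdf(abs(z)))
    return one_sided, two_sided

# Hypothetical data: 30/100 diseased among exposed, 20/100 among unexposed.
z = two_proportion_z(30, 100, 20, 100)
one_sided, two_sided = p_values(z)
print(f"z = {z:.3f}, 1-sided p = {one_sided:.3f}, 2-sided p = {two_sided:.3f}")
```

For a positive z the 2-sided p-value is twice the 1-sided one, which is the sense in which the 2-sided test is the more conservative choice.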



Two procedures are employed in analytic epidemiology. The test for association is done first. The assessment of the effect measures is done after finding an association. Effect measures are useless in situations in which tests for association are negative. The tests for association commonly employed are the t-test, chi-square, the linear correlation coefficient, and the linear regression coefficient. The effect measures commonly employed are the Odds Ratio, Risk Ratio, and Rate Difference. Measures of trend can discover relationships that are not picked up by association and effect measures.



The synonyms for the cross-sectional study are prevalence study and naturalistic sampling. Its objective is determination of the prevalence of a risk factor and the prevalence of disease at a point in time. The point in time may be calendar time or may be a significant event such as birth or death. Disease and exposure are ascertained simultaneously. Examples are the study of the relation between birth weight and socio-economic status (SES) or the relation between infant mortality rate (IMR) and gross domestic product (GDP). Usually cross-sectional studies are non-directional because they do not involve a time dimension. Sometimes a cross-sectional study can have the character of a time-directional study if some information is collected.



Cross-sectional studies can be descriptive (information on one variable) or analytic (information on 2 variables) or both. Usually studies are both. The study may be done once or may be repeated. Cross sectional studies can also be classified as individual-based or as group based. In individual-based studies, information is collected about individuals. In group-based studies aggregate information is collected about groups of individuals. Group-based studies are alternatively called ecologic studies. An example is the study of the relation between GDP and IMR in which information is based on the country as a unit and not an individual.



Cross-sectional studies are used in community diagnosis, preliminary study of disease, assessment of health status, surveillance, and program evaluation. Cross-sectional studies can be used to identify community syndromes such as the combination of hypertension (HT), coronary heart disease (CHD), and diabetes mellitus (DM) found in rich communities. They can also be used to identify groups with special needs. Cross-sectional studies are employed in preliminary study of disease determinants as risk factors, risk indicators, or risk markers. Their use as etiologic studies, however, is limited by lack of a temporal sequence between cause and disease outcome. Cross-sectional studies can be used to study incidence and etiology for some diseases that have a clear onset that the subject can recall accurately. If this is not possible, incidence can be determined by a repeat cross-sectional study provided care is taken to minimize left censoring (death or movement of diseased people in the interval of interest). Epidemiological surveillance can be carried out by repeated cross-sectional studies to see changes in certain disease states. Cross-sectional studies can be used in the evaluation of community health care as well as the evaluation of health intervention programs.



A cross-sectional study may include the whole population or may use a sample. The study may be based on sampling with individuals as sampling units or may use special groups, households, or neighborhoods. Ecological studies like the relationship between GDP and IMR use countries as sampling units. The design of a cross-sectional study is shown in the table (Figure #). The study sample is divided into 4 groups: exposed cases (a), unexposed cases (b), exposed noncases (c), and unexposed noncases (d). The total sample size, n = a + b + c + d, is the only quantity fixed before data collection. None of the marginal totals is fixed.



One of the following sampling methods can be used: simple random sampling, cluster sampling, systematic sampling, and multi-stage sampling.
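Two of the methods listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame is taken to be a simple list, the function names `simple_random_sample` and `systematic_sample` are hypothetical, and the systematic version assumes the sample size does not exceed the frame size.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units so that every sample of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def systematic_sample(frame, n, seed=None):
    """Take every k-th unit after a random start in the first interval."""
    rng = random.Random(seed)
    k = len(frame) // n          # sampling interval (assumes n <= len(frame))
    start = rng.randrange(k)     # random start between 0 and k - 1
    return [frame[start + i * k] for i in range(n)]

# Hypothetical sampling frame of 100 numbered individuals
frame = list(range(1, 101))
srs = simple_random_sample(frame, 10, seed=1)
sys_sample = systematic_sample(frame, 10, seed=1)
```

Cluster and multi-stage sampling could be built from the same pieces by first sampling groups (for example households) and then sampling individuals within the selected groups.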



Cases to be included in the study can be found cumulatively by using clinical examinations, interviews, or clinical records.



Clinical examination, questionnaires, personal interviews, and review of clinical records are some of the methods used to collect data.



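The study design can be represented as follows:

                 Exposed    Unexposed    Total
Cases               a           b          m1
Noncases            c           d          m0
Total               n1          n0         n

where n1 = a + c, n0 = b + d, m1 = a + b, m0 = c + d, and n = a + b + c + d.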


















The following descriptive statistics can be computed from a cross-sectional study: mean, standard deviation, median, percentiles, quartiles, ratios, and proportions. The 2 most important proportions are the prevalence of the risk factor and the prevalence of the disease. The prevalence of the risk factor is computed as n1/n. The prevalence of the disease is computed as m1/n. Prevalence can be defined with the total population as denominator or with the total number of individuals studied as the denominator. The following analytic statistics can be computed from cross-sectional studies: correlation coefficient, regression coefficient, odds ratio, and rate difference. Let p1 = a/n1 be the prevalence of disease among the exposed and p0 = b/n0 the prevalence among the unexposed. The prevalence difference is computed as p1 - p0 = a/n1 - b/n0. The prevalence ratio is computed as p1/p0 = (a/n1) / (b/n0). The prevalence odds ratio is computed as POR = {p1/(1 - p1)} / {p0/(1 - p0)}.
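The prevalence measures above can be worked through with a small Python sketch. The function name `cross_sectional_measures` and the counts used are hypothetical and serve only as an illustration. The cells are assumed to be a = exposed cases, b = unexposed cases, c = exposed noncases, and d = unexposed noncases, and the odds ratio is written in the standard odds form, odds = p/(1 - p).

```python
def cross_sectional_measures(a, b, c, d):
    """Prevalence measures from the four cells of a cross-sectional study.

    a = exposed cases, b = unexposed cases,
    c = exposed noncases, d = unexposed noncases.
    Assumes no cell count makes a denominator zero.
    """
    n1 = a + c            # all exposed
    n0 = b + d            # all unexposed
    m1 = a + b            # all cases
    n = n1 + n0           # total sample size
    p1 = a / n1           # prevalence of disease among the exposed
    p0 = b / n0           # prevalence of disease among the unexposed
    return {
        "exposure_prevalence": n1 / n,
        "disease_prevalence": m1 / n,
        "prevalence_difference": p1 - p0,
        "prevalence_ratio": p1 / p0,
        "prevalence_odds_ratio": (p1 / (1 - p1)) / (p0 / (1 - p0)),
    }

# Hypothetical counts, for illustration only
result = cross_sectional_measures(a=40, b=20, c=60, d=80)
```

With these counts p1 = 0.4 and p0 = 0.2, so the prevalence ratio is 2 while the prevalence odds ratio equals ad/bc = 3200/1200, which is larger. The two measures diverge when the disease is common.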




A community intervention study is designed to test whether a certain public health intervention such as health education or water fluoridation has an effect on a given outcome measure. Two or more similar communities are randomly allocated to receive different interventions and the outcome is then measured. The intervention is carried out at a community-wide level. Random allocation ensures comparability. The population in which the intervention is undertaken is called the intervention population. The reference population serves as controls. In a community-based intervention study, the random allocation is based on the community and not the individual. This is considered a quasi-experimental design with less statistical power than allocation based on individuals. For best results the study should be restricted to certain age groups. Sometimes it is not feasible to assess the outcome in all members of the intervention and reference populations. In such cases a sample survey of both populations may have to be done before and after the intervention.



A trial in a community may involve allocating some individuals to the intervention group and others to the reference group. Thus people in the same community or even family may belong to different groups with regard to the intervention being studied. Testing the efficacy of a new vaccine involves recruitment of healthy volunteers into the experimental study. They are randomized to 2 groups. One group receives the vaccine while the other does not. The two groups are then observed over a suitable time for development of the disease being prevented. Vaccine efficacy is computed as 1 - (incidence rate of disease in the vaccinated / incidence rate of disease in the unvaccinated).
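The ratio of the two incidence rates is conventionally converted into a vaccine efficacy by subtracting it from 1, so that a vaccine that prevents all disease scores 1 and a vaccine with no effect scores 0. A minimal sketch, assuming incidence rates are measured as cases per unit of person-time; the function name `vaccine_efficacy` and the trial counts are hypothetical:

```python
def vaccine_efficacy(cases_vax, persontime_vax, cases_unvax, persontime_unvax):
    """Vaccine efficacy = 1 - (incidence rate vaccinated / incidence rate unvaccinated)."""
    rate_vax = cases_vax / persontime_vax         # incidence rate, vaccinated
    rate_unvax = cases_unvax / persontime_unvax   # incidence rate, unvaccinated
    rate_ratio = rate_vax / rate_unvax
    return 1 - rate_ratio

# Hypothetical trial: 10 cases per 1000 person-years among the vaccinated,
# 50 cases per 1000 person-years among the unvaccinated
ve = vaccine_efficacy(10, 1000, 50, 1000)
```

In this made-up example the rate ratio is 0.2, so the efficacy is about 0.8, that is, roughly 80% of the cases expected without vaccination were prevented.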



The strength of the community intervention study is that it can evaluate a public health intervention in natural field circumstances. It however suffers from 2 main weaknesses. Selection bias is likely to occur when allocation is by community. People in the control community may receive the intervention under study on their own, because the tight control that is possible in laboratory or animal experiments is not possible with humans.




When determining sample size we have to consider the incidence of the outcome. If it is low, the study will have to use a large sample and will have to run for a longer duration, which leads to higher attrition.
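The effect of a low incidence on sample size can be illustrated with the common normal-approximation formula for comparing two proportions. This formula is not given in the text and is used here only as an assumption for illustration. The function name `sample_size_per_group` and the incidence values are hypothetical, and the calculation assumes allocation of individuals. When whole communities are allocated, the result would need to be inflated to allow for clustering.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p0, p1, alpha=0.05, power=0.80):
    """Approximate number of subjects per group needed to detect a change
    in incidence from p0 (reference) to p1 (intervention), two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the test
    z_beta = NormalDist().inv_cdf(power)            # value for the desired power
    variance = p0 * (1 - p0) + p1 * (1 - p1)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p0) ** 2)

# Hypothetical scenarios: the intervention halves the incidence in both
n_common = sample_size_per_group(p0=0.10, p1=0.05)   # fairly common outcome
n_rare = sample_size_per_group(p0=0.01, p1=0.005)    # rare outcome
```

Although the relative reduction is the same in both scenarios, the rare outcome requires roughly ten times as many subjects per group, which is why studies of low-incidence outcomes tend to be large and long.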



For best results the intervention must be something new that is not common in the study area. Decisions must be made at the start of the study about the intensity and frequency of the intervention. The intensity of the intervention must be considered when evaluating the results of the study. The controls may on their own take up the intervention being studied without the knowledge of the investigator.



The period of follow-up must be selected carefully. If it is too short, not enough outcome data may be collected. If it is too long, a high rate of attrition will result. If more than one end-point is used, a different follow-up period should be allocated to each end-point. Drop-outs must be accounted for fully. Pre- and post-intervention surveys in the experimental and comparison areas must be strictly comparable. Data collection can be by self-coding questionnaires or by measurements carried out by nurses. Procedures must be identical for both areas.



The end-point must be determined at the beginning. Quantitative criteria are best for end-point assessment. Interviews using questionnaires, physical and biochemical parameters, morbidity, and mortality may be used as end-points. Use of morbidity and mortality as end-points is not the best option because of existence of many competing causes of mortality and morbidity. Examination for and recording of the end-point must be blind. The assessment of the end-point may be based on longitudinal change or by repeated cross-sectional surveys.



Interpretation of the results may be complicated by secular trends. It is therefore recommended that the study duration be as short as is reasonable. Negative findings could be due to a number of reasons. The intervention effort may not be intense enough. The putative risk factor intervened against may not be causal. The intervention may be at the wrong age or in the wrong season. Very short interventions may not register an impact. An inadequate sample size may also be a cause of negative findings. End-point assessment may be biased: heightened interest in the intervention area may lead investigators to look harder for the end-point, so that more diagnoses are made there.

Omar Hasan Kasule, Sr. September 2001