
ISLAMIC MEDICAL EDUCATION RESOURCES-03

0109-REVIEW OF BIOSTATISTICS

By Professor Omar Hasan Kasule Sr.

DATA MANAGEMENT

A. DEFINITION OF TERMINOLOGY

A field/attribute/variable/variate is the characteristic measured for each member, e.g. name and weight. A value/element is the actual measurement or count, like 5 cm or 10 kg. A record/observation is a collection of all variables belonging to one individual. A file is a collection of records. A database is a collection of files. A data dictionary is an explanation or index of the data.

 

B. DATA CODING

Self-coding or pre-coded questionnaires are preferable to those requiring coding after data collection. Errors and inconsistencies could be introduced into the data during manual coding. A good pre-coded questionnaire can be produced after piloting the study.

 

C. DATA ENTRY

Both random and non-random errors can occur in data entry. The following methods can be used to detect such errors: (a) double entry, in which 2 data entry clerks enter the same data and the computer checks the items on which they differ; (b) manual checking of the data entered in the computer against the original questionnaires; (c) interactive data entry, in which the data entry program is programmed to detect entries with unacceptable values or entries that are logically inconsistent.

 

D. DATA PROCESSING

Data editing is the process of correcting data collection and data entry errors. The data is 'cleaned' using logical and statistical checks. Range checks detect entries whose values are outside what is expected; for example, a child height of 5 meters is clearly wrong. Consistency checks identify errors such as recording the presence of an enlarged prostate in a female. Among the functions of data editing is to make sure that all values are at the same level of precision (number of decimal places). This makes computations consistent and decreases rounding-off errors.
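
As a quick illustration, range and consistency checks of this kind can be scripted. The following is a minimal sketch in Python using the pandas library; the data and the column names (height_cm, sex, enlarged_prostate) are hypothetical.

    import pandas as pd

    df = pd.DataFrame({
        "id": [1, 2, 3],
        "height_cm": [110, 500, 95],                # 500 cm is outside the plausible range
        "sex": ["F", "M", "F"],
        "enlarged_prostate": [False, False, True],  # inconsistent for a female
    })

    # Range check: flag heights outside a plausible interval for children
    range_errors = df[(df["height_cm"] < 40) | (df["height_cm"] > 200)]

    # Consistency check: an enlarged prostate recorded for a female
    consistency_errors = df[(df["sex"] == "F") & df["enlarged_prostate"]]

    print(range_errors)
    print(consistency_errors)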

 

Validation checks can be carried out using logical checks and statistical checks; the latter involve plotting the data on a graph and visually detecting outlying values.

 

Data transformation is the process of creating new derived variables preliminary to analysis. The transformations may be simple using ordinary arithmetical operators or more complex using mathematical transformations. New variables may be generated by using the following arithmetical operations: (a) carrying out mathematical operations on the old variables such as division or multiplication (b) combining 2 or more variables to generate a new one by addition, subtraction, multiplication or division.  New variables can also be generated by using mathematical transformations of variables: logarithmic, trigonometric, power, and z-transformations.
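
The sketch below shows, in Python with numpy, how such derived variables might be generated; the variable names and values are illustrative only.

    import numpy as np

    weight = np.array([12.0, 15.5, 9.8, 20.1])   # kg
    height = np.array([0.90, 1.05, 0.80, 1.15])  # m

    bmi = weight / height**2                     # new variable combining two old ones
    log_weight = np.log(weight)                  # logarithmic transformation
    z_weight = (weight - weight.mean()) / weight.std(ddof=1)   # z-transformation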

 

E. DATA PROBLEMS

Missing data can arise from data collection when no response was recorded at all or from data entry when the value was dropped accidentally. Missing data due to data entry errors is easy to correct. It is more difficult to go back and collect the missing data from the respondents and analysis may have to proceed with some data missing. It is better to have a code for missing data than to leave the field blank.

 

Coding and entry errors are common. The person entering data may make random mistakes in typing letters or digits. Systematic mistakes arise when there is an error in coding. Random mistakes are more difficult to detect. Systematic mistakes are easier to detect and correct.

 

Inconsistencies in data must be checked for. The values of some variables may not be consistent with values of other variables due to errors in one or several variables. For example there is inconsistency if a 4-year old child is reported to have 2 children.

 

Irregular patterns in data may indicate errors. The decision on whether irregular patterns exist is based on previous knowledge or familiarity with the type of information being studied.

 

Digit preference: The terminal digits of measured values are expected to be distributed randomly. If the data show a predominance of one digit, especially in the last digit, a strong suspicion of systematic error arises. The person measuring may have a tendency to estimate the last digit as a certain fixed value, which creates the digit preference.

 

Outliers are values of the variable that lie outside the normal or expected range. Usually outliers are wrong values due to wrong measurements or wrong data recording. It should be noted however that not all outliers are wrong. There are true phenomena in nature that lie outside the range of the ordinary.

 

Rounding-off / significant figures errors: All measurements and recording of measured values involve rounding off to the nearest significant figure or decimal place. It is recommended to record values during measurement without any rounding off and to round off only after making computations. This is because rounding off at intermediate stages introduces small rounding-off errors that accumulate and introduce serious error into the final computed result.
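
A small numerical illustration in Python of why rounding should be left to the final step (the figures are arbitrary):

    values = [1.004] * 50

    total_rounded_early = sum(round(v, 2) for v in values)   # rounds each value first
    total_rounded_late = round(sum(values), 2)               # rounds only the final sum

    print(total_rounded_early)   # 50.0  (cumulative rounding error)
    print(total_rounded_late)    # 50.2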

 

Questions with multiple valid responses: Some variables may have more than one valid response, a situation that introduces errors in data analysis because the analyst may not know which response is meaningful. This problem arises from poor questionnaire design and can be avoided by piloting the questionnaire.

 

Record duplication arises when data on one person is entered as two or more separate records. Analysis of data with duplicate records leads to misleading results since one member of the sample contributes more than once. Record duplication can be identified easily by sorting records by an identification number so that duplicate records lie next to one another.
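
A minimal sketch in Python (pandas) of detecting duplicate records by an identification number; the column names and values are hypothetical.

    import pandas as pd

    df = pd.DataFrame({"patient_id": [101, 102, 102, 103],
                       "weight_kg": [60, 72, 72, 55]})

    df = df.sort_values("patient_id")
    duplicates = df[df.duplicated(subset="patient_id", keep=False)]   # all repeated records
    print(duplicates)

    df_unique = df.drop_duplicates(subset="patient_id", keep="first")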


DATA SUMMARY and PRESENTATION AS DIAGRAMS: TABLES and GRAPHICS

A. DATA GROUPING

The objective of grouping is to summarize data for presentation (parsimony) while preserving a complete picture. A suitable number of classes is 10-20. The following are desirable characteristics of classes: mutually exclusive, intervals equal in width, and intervals continuous throughout the distribution. Grouping error is defined as information loss due to grouping. Grouped data gives less detail than ungrouped data. The bigger the class interval, the bigger the grouping error.
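
A short sketch in Python (numpy) of grouping continuous data into equal-width classes; the data and class limits are illustrative.

    import numpy as np

    heights = np.array([142, 151, 149, 163, 170, 158, 144, 167, 173, 155])
    bins = np.arange(140, 181, 5)                 # equal-width, continuous class intervals
    freq, edges = np.histogram(heights, bins=bins)

    for lo, hi, f in zip(edges[:-1], edges[1:], freq):
        print(f"{lo}-{hi}: {f}")                  # class interval and its frequency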

 

B. DATA TABULATION

Tabulation has the objective of presenting and summarizing a lot of data, in logical groupings and for 2 or more variables, for visual inspection. A table can show the following summaries about data: cell frequency or cell number, cell number as a percentage of the overall total, cell number as a row percentage, cell number as a column percentage, cumulative frequency, cumulative frequency %, relative (proportional) frequency, and relative frequency %. Ideal tables are simple, easy to read, and correctly scaled. The layout of the table should make it easy to read and understand the numerical information. The table must be able to stand on its own, i.e. be understandable without reference to the text. The table must have a title/heading that indicates its contents. Labeling must be complete and accurate: title, rows & columns, marginal & grand totals, and units of measurement. The field labels are in the margins of the table while the numerical data is in the cells in the body of the table. Footnotes may be used to explain the table. A contingency table can be presented in several configurations. The commonest is the 2 x 2 contingency table. Other configurations are the 2 x k table and the r x c table.
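
A minimal sketch in Python (pandas) of a 2 x 2 contingency table with marginal totals and row percentages; the variables and values are hypothetical.

    import pandas as pd

    df = pd.DataFrame({"exposed":  ["yes", "yes", "no", "no", "yes", "no"],
                       "diseased": ["yes", "no", "no", "yes", "yes", "no"]})

    counts = pd.crosstab(df["exposed"], df["diseased"], margins=True)       # cell counts and totals
    row_pct = pd.crosstab(df["exposed"], df["diseased"], normalize="index") * 100   # row percentages

    print(counts)
    print(row_pct.round(1))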

 

C. DIAGRAMS

ONE-WAY BAR DIAGRAMS

A bar diagram uses ‘bars’ to indicate frequency. The bars may be vertical in column charts or horizontal in row charts. Both forms of bar diagram may be constructed in three dimensions. There are 2 types of bar diagram: the bar chart and the histogram. In the bar chart there are spaces between the bars. In the histogram the bars lie side by side. The bar chart is best for discrete, nominal, or ordinal data. The histogram is best for continuous data. Bar charts and histograms are generated from the frequency tables discussed above. The area of the bar represents frequency, i.e. the area of the bar is proportional to the frequency. If the class intervals are equal, the height of the bar represents frequency. If the class intervals are unequal, the height of the bar does not represent frequency; frequency can only be computed from the area of the bar.
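
As an illustration, the sketch below draws a bar chart for discrete data and a histogram for continuous data in Python with matplotlib; the data are invented for the example.

    import matplotlib.pyplot as plt

    blood_groups = ["A", "B", "AB", "O"]
    counts = [45, 20, 8, 60]
    heights_cm = [142, 151, 149, 163, 170, 158, 144, 167, 173, 155]

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.bar(blood_groups, counts)                   # bar chart: spaces between bars
    ax2.hist(heights_cm, bins=range(140, 185, 5))   # histogram: bars side by side
    plt.show()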

 

PIE CHART

The pie chart / pie diagram shows relative frequency % converted into angles of a circle (called sector angles). The area of each sector is proportional to the frequency.

MAPS

Different variables or values of one variable can be indicated by use of different shading, cross-hatching, dotting, and colors.

 

LINE GRAPHS/FREQUENCY POLYGON

A frequency polygon is the plot of the frequency against the mid-point of the class interval. The points are joined by straight lines to produce a frequency polygon. If smoothed, a frequency curve is produced. The line graph can be used to show the following: the frequency polygon, the cumulative frequency curve, the cumulative frequency % curve (also called the ogive), and moving averages. The line graph has two axes: the abscissa or x-axis is horizontal and the ordinate or y-axis is vertical. It is possible to plot several variables on the same graph. The graph can be smoothed manually or by computer. Time series and changes over time are shown easily by the line graph. Trends, cyclic and non-cyclic, are easy to represent. Line graphs are sometimes superior to other methods of indicating trend such as moving averages and linear regression. The frequency polygon has the following advantages: (a) it shows trends better (b) it is best for continuous data (c) a comparative frequency polygon can show more than 1 distribution on the same graph. The cumulative frequency curve has the additional advantage that it can be used to show and compare different sets of data because their respective medians, quartiles, and percentiles can be read off directly from the curve.

 

SCATTER DIAGRAM / SCATTER-GRAM

The scatter diagram is also called the x-y scatter.

 

PICTOGRAM

Pictures of the variable being measured are used instead of bars. The magnitude can be shown either by the size of the pictures or by the number of pictures.

 

D. SHAPES OF DISTRIBUTIONS

A unimodal distribution has one peak and is the most common in biology. A bimodal distribution has 2 peaks that are not necessarily of the same height. A perfectly symmetrical curve is bell-shaped and is centered on the mean. Skew to the right (+ve skew) is more common than skew to the left.

 

E. MISLEADING DIAGRAMS

Poor labeling of scales is deceiving. Distortion of scales can make a fluctuating series appear stable or smooth over a dip. Making the scale wider or narrower produces different graphical impressions: the wider scale gives the impression of less change and is less likely to suggest a relationship. Using a logarithmic scale may give a different impression from using a linear scale, although some types of data are best represented on the logarithmic scale. A double vertical scale can be used to show spurious association. Narrow and wide scales give different impressions of the same data. Omitting the zero/origin makes interpretation difficult; if plotting from zero is not possible for reasons of space, a broken line should be used to show the discontinuity of the scale.


 

DISCRETE DATA SUMMARY: RATES, RATIOS, and PROPORTIONS

A.  RATES

Rates are events in a given population over a defined time period. A rate has the following components: a numerator, a denominator, and a time period. The numerator of a rate is included in its denominator. The general formula of a rate is the total number of cases of disease or characteristic in a given time period divided by the total number of persons at risk (those with disease + those without disease). In symbols this is written as a / {(a + b)t}, where a = number of new cases, b = number without disease, and t = time of observation. Incidence is a type of rate. It describes a moving and dynamic picture of disease, e.g. the infant mortality rate (IMR), the crude birth rate (CBR), and the incidence rate (IR). Crude rates are unweighted and are misleading. Comparison of crude rates in 2 populations is not possible; no valid inference based on crude rates is possible because of confounding. The following types of specific rates are commonly used: age-specific, sex-specific, place-specific, race-specific, and cause-specific rates. Adjusted/standardized rates involve adjustment for age, sex, or any other factor to remove confounding and allow comparison across populations.
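
A small numerical sketch in Python of the rate formula a / {(a + b)t}; the figures are illustrative only.

    a = 30        # new cases observed
    b = 9970      # persons who remained free of disease
    t = 2         # years of observation

    rate = a / ((a + b) * t)        # 0.0015 per person-year
    rate_per_1000 = rate * 1000     # 1.5 per 1000 person-years
    print(rate, rate_per_1000)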

 

Standardization is a statistical technique that involves adjustment of a rate or a proportion for one or two confounding factors. There are several approaches to standardization: direct standardization, indirect standardization, life expectancies (which are age-adjusted summaries of current mortality rates), and regression techniques. Both direct and indirect standardization involve the same principles but use different weights.
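
A minimal sketch in Python of direct standardization, in which the age-specific rates of the study population are applied to a standard population; all figures and age bands are hypothetical.

    age_specific_rates = [0.001, 0.005, 0.020]       # study population rates, by age band
    standard_population = [40000, 35000, 25000]      # standard population, same age bands

    expected_events = sum(r * n for r, n in zip(age_specific_rates, standard_population))
    directly_standardized_rate = expected_events / sum(standard_population)
    print(directly_standardized_rate)                # a weighted average of the age-specific rates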

 

B. RATIOS

The general formula for a ratio is the number of cases of a disease divided by the number without disease, a/b. Examples of ratios are: the proportional mortality ratio, the maternal mortality ratio, and the fetal death ratio. The proportional mortality ratio is the number of deaths in a year due to a specific disease divided by the total number of deaths in that year. This ratio is useful in occupational studies because it provides information on the relative importance of a specific cause of death. The maternal mortality ratio is the total number of maternal deaths divided by the total live births. The fetal death ratio is the total number of fetal deaths divided by the total live births.

 

C. PROPORTIONS

Proportions are used for enumeration. A proportion is the number of events expressed as a fraction of the total population at risk. It has only 2 components: the numerator and the denominator. The numerator is included in the denominator. The time period is not defined but is implicitly assumed. The general formula for a proportion is a/(a+b). Examples of proportions are: the prevalence proportion, the neonatal mortality proportion, and the perinatal mortality proportion. Prevalence describes a still/stationary picture of disease. Like rates, proportions can be crude, specific, and standardized.


CONTINUOUS DATA SUMMARY 1: MEASURES OF CENTRAL TENDENCY

A. COMMON MEASURES

Three averages are commonly used: the mean, the mode, and the median. The arithmetic mean is considered the most useful measure of central tendency in data analysis. The median is gaining popularity; it is the basis of some non-parametric tests as will be discussed later. The mode has very little public health importance.

 

B. ARITHMETIC MEAN

The arithmetic mean is defined as the sum of the observations' values divided by the total number of observations.  The arithmetic mean reflects the impact of all observations.

 

C. MODE

The mode for a set of values is defined as the commonest, most frequent, or most popular observation. The mode is rarely mentioned in scientific literature. The mode is used when the interest is in identifying the most prominent observation. The mode is easy to compute or determine.

 

D. MEDIAN

The median is defined as the middle observation in a ranked series such that 50% of the observations are above it and 50% are below it. The median is intuitively easy to understand as the value of the observation that is in the middle of the series. The median is best used for the following types of data: (a) erratically spaced data (b) data that has extreme values, since the median is less affected by extreme values. In this respect the median is superior to the arithmetic mean. The median for a set of values can be determined even if the value of the extreme observation is not known exactly; this is another respect in which the median is superior to the arithmetic mean, since the arithmetic mean cannot be computed for open-ended distributions. The median is used mainly for description and not analysis; generally no further mathematical manipulations are possible using the median. However, statistical methods based on the median have been developed recently and are gaining popularity.
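
A short sketch in Python of the three common averages, showing how the median resists an extreme value; the data are illustrative.

    import statistics

    ages = [23, 25, 25, 27, 30, 31, 90]    # 90 is an extreme value

    print(statistics.mean(ages))     # about 35.9, pulled upward by the extreme value
    print(statistics.median(ages))   # 27, unaffected by the extreme value
    print(statistics.mode(ages))     # 25, the most frequent observation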


CONTINUOUS DATA SUMMARY 2: MEASURES OF DISPERSION/VARIATION

A.  MEASURES OF VARIATION

Measures of variation can be classified as absolute or relative. The absolute measures include: range, inter-quartile range, mean deviation, variance, standard deviation, quantiles. The relative measures include: coefficient of variation (CV), standardized score (z-score). They can also be classified depending on how they are derived: based on the mean, based on quantiles, and others.

 

B. VARIANCE

The variance is defined as the sum of the squared deviations of each observation from the mean divided by the sample size, n. The divisor, n, is replaced by n - 1 for small samples to adjust for sampling error; in this formula n - 1 represents the degrees of freedom.

 

C. STANDARD DEVIATION

The standard deviation is the square root of the variance described above. The standard deviation is the most frequently used measure of variation. A distinction must be made between the standard deviation, which describes the variability of individual observations, and the standard error of the mean, which describes the variability of the sample mean (SD/√n).
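
A minimal sketch in Python (numpy) of the sample variance, the standard deviation, and the standard error of the mean; the data are illustrative.

    import numpy as np

    x = np.array([4.0, 7.0, 6.0, 9.0, 5.0])

    variance = np.var(x, ddof=1)     # sum of squared deviations / (n - 1)
    sd = np.sqrt(variance)           # the same as np.std(x, ddof=1)
    sem = sd / np.sqrt(len(x))       # standard error of the mean

    print(variance, sd, sem)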

 

D. QUARTILES

If a set of observations is arranged in order of magnitude and is divided into 4 equal parts, each of those parts is called a quartile. The first and third quartiles are usually referred to in statistical work. The inter-quartile range is defined as the difference between the third quartile, Q3, and the first quartile, Q1.

 

E. PERCENTILES

Percentiles are also called centile scores. Centiles are a form of cumulative frequency and can be read off a cumulative frequency curve. If a set of observations is arrayed in order of magnitude and is then divided into 100 equal intervals, each interval is called a percentile. The centile point is the score. Each centile point is given a rank, e.g. 10th, 25th, etc. In pharmacological experiments the LD50 is the 50th percentile. Percentile scores have the advantage of being direct and very intelligible.
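
A short sketch in Python (numpy) of quartiles, the inter-quartile range, and other percentiles; the data are illustrative.

    import numpy as np

    x = np.array([12, 15, 11, 19, 22, 14, 17, 25, 13, 18])

    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1                            # inter-quartile range, Q3 - Q1
    p10, p90 = np.percentile(x, [10, 90])    # 10th and 90th percentiles
    print(q1, median, q3, iqr, p10, p90)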

 

F. THE RANGE

The range is defined in two ways: (a) by stating the minimum and maximum values (b) by stating the difference between the maximum and the minimum values. For a complete definition both (a) and (b) should be used. The range is a simple measure with intuitive appeal. It is easy to compute. It is useful for preliminary or rough work. The range depends entirely on the extreme values since it is based on only 2 observations. A big sample is likely to have a distorted, wider range because extreme values are more likely. The range is more sensitive to sampling fluctuations than the standard deviation. The range suffers from the additional disadvantage that no further mathematical manipulations are possible.


PARAMETRIC DISCRETE DATA ANALYSIS USING PROPORTIONS

 

A. PRELIMINARY CONSIDERATIONS

NORMAL DISTRIBUTION OF THE DATA

The first step is to ascertain whether the data distribution follows an approximate Gaussian distribution. The approximate methods are most valid when the data is Gaussian.

 

EQUAL VARIANCES

It is possible to compute variances for proportions using the binomial theorem. The variances of proportions in the compared samples must be approximately equal for the statistical tests to be valid.

 

ADEQUACY OF THE SAMPLE SIZE

For approximate methods to be valid, the sample size must be adequate. There are special statistical procedures for ascertaining sample size.

 

B. DATA LAY-OUT

The data for approximate methods is laid out in the form of contingency tables: 2 x 2, 2 x k, and m x n. Visual inspection is recommended before application of statistical tests.

 

C. STATING THE HYPOTHESES

The null hypothesis and the alternative hypotheses must be stated clearly. The following formulations are acceptable. In inference on 2 sample proportions using the z or chi-square test, H0: sample proportion #1 - sample proportion #2 = 0 and HA: sample proportion #1 > or < sample proportion #2. In inference on 3 or more sample proportions using the chi-square test, H0: sample proportion #1 = sample proportion #2 = sample proportion #3 = … = sample proportion #n.

 

D. FIXING THE TESTING PARAMETERS

Testing parameters are fixed. For the p-value approach, the 5% or 0.05 level of significance is customarily used. There is nothing preventing using any other level like 2.5% or 10%.

 

TESTING FOR TWO BINOMIAL PROPORTIONS IN A 2 x 2 TABLE USING THE CHI-SQUARE

The chi-square is computed from the data using the appropriate formulas. The chi-square for paired data is called the McNemar chi-square. The formula of the Pearson chi-square for independent samples is χ² = Σ {(O – E)² / E}, where O is the observed and E the expected cell count; for a 2 x 2 table this statistic follows the chi-square distribution with 1 degree of freedom. The chi-square statistic computed as explained above is referred to the appropriate table to look up the p-value under the appropriate degrees of freedom. The general formula for degrees of freedom is (rows - 1) x (columns - 1). The following decision rules are then used: if the p-value is larger than the level of significance of 0.05, the null hypothesis is not rejected; if the p-value is smaller than the level of significance of 0.05, the null hypothesis is rejected. There is a simple computational formula for the chi-square in 2 x 2 tables.
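
A minimal sketch in Python (scipy) of the chi-square test on a 2 x 2 table; the cell counts are invented for the example.

    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[30, 70],     # exposed:   diseased, not diseased
                      [15, 85]])    # unexposed: diseased, not diseased

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(chi2, p_value, dof)       # dof = (2 - 1) x (2 - 1) = 1
    # Reject H0 at the 0.05 level of significance if p_value < 0.05.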


PARAMETRIC CONTINUOUS DATA ANALYSIS USING MEANS

 

A. THE T TEST STATISTIC

The Student t-test is the most commonly used test statistic for inference on continuous numerical data. The t-test must fulfil the following conditions of validity: (a) the samples compared must be normally distributed (b) the variances of the samples compared must be approximately equal. The t-test is used uniformly for sample sizes below 60. It is also used for sample sizes above this if the population standard deviation is not known.
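
A short sketch in Python (scipy) of the two-sample t-test; the data are illustrative.

    from scipy import stats

    group_a = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3]
    group_b = [5.9, 6.1, 5.7, 6.3, 5.8, 6.0]

    t_stat, p_value = stats.ttest_ind(group_a, group_b)   # assumes approximately equal variances
    print(t_stat, p_value)
    # stats.ttest_rel is used for paired samples and
    # stats.ttest_1samp for one sample against a population mean.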

 

B.  THE F-STATISTIC

The F-test is a generalized test used in inference on 3 or more sample means. The procedures of the F-statistic are also generally called analysis of variance, ANOVA. ANOVA studies how the mean varies by group.
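
A minimal sketch in Python (scipy) of one-way ANOVA for comparing 3 sample means; the data are illustrative.

    from scipy import stats

    group1 = [5.1, 4.8, 5.6, 5.0]
    group2 = [5.9, 6.1, 5.7, 6.3]
    group3 = [5.4, 5.2, 5.8, 5.5]

    f_stat, p_value = stats.f_oneway(group1, group2, group3)
    print(f_stat, p_value)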

 

C. ASCERTAINING THE NORMAL DISTRIBUTION OF THE DATA

The first step is to ascertain whether the data distribution follows an approximate Gaussian distribution.

 

D. ASCERTAINING THE EQUALITY OF VARIANCES

The tests above require that the samples being compared have approximately equal variances. The tests do not perform optimally if the magnitudes of the variances vary wildly. Usually informal methods of testing equality of variances are carried out. It is not absolutely necessary that the variances be exactly equal. What is required is that they have the same order of magnitude or be approximately equal.

 

E. STATING THE TEST HYPOTHESES

The null hypothesis and the alternative hypotheses must be stated clearly. The following formulations are acceptable. Inference on 1 sample mean using the z or t-test: H0: sample mean - population mean = 0; HA: sample mean - population mean > or < 0. Inference on 2 sample means using the z or t-test: H0: sample mean #1 - sample mean #2 = 0; HA: sample mean #1 - sample mean #2 > or < 0. Inference on 3 or more sample means using the F-test: H0: sample mean #1 = sample mean #2 = sample mean #3 = … = sample mean #n.

 

F. STATING THE TEST PARAMETERS

For the confidence interval approach the 95% bounds are used customarily. There is nothing to prevent 90% or 99% intervals from being used. For the p-value approach, the 5% or 0.05 level of significance is customarily used. There is nothing preventing using any other level like 2.5% or 10%.


 

NON-PARAMETRIC ANALYSIS OF CONTINUOUS DATA USING MEDIANS

A. DEFINITION AND NATURE

Non-parametric methods are used for data that is not normally distributed.

 

B. ADVANTAGES

These methods are simple: they are easy to understand and employ. They do not need complicated mathematical operations, thus leading to rapid computation. They make few assumptions about the distribution of the data. They can be used for non-Gaussian data. They can also be used for data whose distribution is not known because there is no need for normality assumptions.

 

C. DISADVANTAGES:

Non-parametric methods can be used efficiently for small data sets. With data sets that have many observations, the methods cannot be applied with ease. These methods are also not easy to use with complicated experimental designs. Non-parametric methods are less efficient than parametric methods for normally-distributed data. Hypothesis testing with non-parametric methods is less specific than hypothesis testing with parametric methods.

 

D. CHOICE BETWEEN PARAMETRIC AND NON-PARAMETRIC

Non-parametric methods should never be used where parametric methods are possible. Non-parametric methods should therefore be used only if the test for normality is negative. Non-parametric methods are also used in situations in which the distribution of the parent population is not known.

 

E. CORRESPONDENCE OF PARAMETRIC & NON PARAMETRIC

Situation                              Parametric test                    Non-parametric test
1 sample                               z-test, t-test                     Sign test
2 independent sample means             t-test                             Rank sum test
2 paired sample means                  t-test                             Signed rank test
3 or more independent sample means     ANOVA (1-way)                      Kruskal-Wallis test
Multiple comparisons of means          ANOVA (2-way)                      Friedman test
Correlation                            Pearson                            Spearman
Comparing survival curves              Proportional hazards regression    Log rank test

 

Virtually every parametric test has an equivalent non-parametric one as shown in the table above. Note that the Mann-Whitney test gives results equivalent to those of the rank sum test. The Kendall rank correlation coefficient gives results comparable to those of the Spearman coefficient. The signed rank and rank sum tests are based on the median.

 


CORRELATION

A.  CORRELATION AS PRELIMINARY DATA ANALYSIS

Data analysis usually begins with preliminary exploration using linear correlation. A correlation matrix is used to explore for pairs of variables likely to be associated. Then more sophisticated methods are applied to define the relationships further.

 

B. EXPLORATION OF BIVARIATE RELATIONSHIP

Correlation describes the relation between 2 variables about the same person or object with no prior evidence of inter-dependence. Both variables are random. Correlation indicates only association. The association is not necessarily causative. Correlation measures the strength of bivariate relationship. It measures linear relation and not variability.

 

C. OBJECTIVES OF CORRELATION ANALYSIS

Correlation analysis has the following objectives: (a) describe the relation between x and y (b) predict y if x is known and vice versa (c) study trends (d) study the effect of a third factor like age on the relation between x and y.

 

D. THE SCATTERGRAM

The first step in correlation analysis is to inspect a plot of the data. This gives a visual impression of the data layout and identifies outliers.

 

E. CORRELATION COEFFICIENT

Pearson’s coefficient of correlation (product moment correlation), r, is the commonest statistic for linear correlation. It has a complicated formula but can be computed easily by modern computers. It is essentially a measure of the scatter of the data. The correlation coefficient cannot be interpreted correctly without looking at the scatter-gram. The correlation coefficient is not interpretable for small samples. The size of r may not matter: a small r may be significant while a big one may not be. The significance of r also depends on what is being measured; for some variables small values of r may be significant. In general Colton recommends the following interpretation of r: values of 0.25 - 0.50 indicate a fair degree of association, values of 0.50 - 0.75 indicate moderate to good relation, and values above 0.75 indicate good to excellent relation. A value of r = 0 indicates either no correlation or that the two variables are related in a non-linear way. A high correlation coefficient usually arises when the dots hug the regression line closely (p. 165 Minium). Very high correlation coefficients are suspect and need to be checked carefully. In perfect positive correlation r = 1; in perfect negative correlation r = -1; in cases of no correlation r = 0 and the scatter-plot is circular. The value of the coefficient does not change when the units in which x and y are measured are changed; this holds whether the units of one or of both variables are changed.
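
A short sketch in Python (scipy and matplotlib) of computing r alongside the scatter-gram that should always accompany it; the data are illustrative.

    import matplotlib.pyplot as plt
    from scipy import stats

    x = [1.2, 2.0, 2.8, 3.5, 4.1, 5.0, 5.9]
    y = [2.1, 2.9, 3.8, 4.2, 5.1, 5.8, 6.9]

    r, p_value = stats.pearsonr(x, y)          # Pearson product moment correlation
    rho, p_spearman = stats.spearmanr(x, y)    # rank-based (Spearman) alternative
    print(r, rho)

    plt.scatter(x, y)    # inspect the scatter-gram before interpreting r
    plt.show()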

 

 


REGRESSION

A. INDEPENDENT and DEPENDENT VARIABLES

Both correlation and regression address the relation between 2 variables. The scatter-gram is basic to both. In correlation both x and y are random. In regression x is the independent (predictor) variable whereas y is the dependent variable, being determined by x. The outcome in regression is expressed as the mean of y for a given value of x. The independent variable can be continuous or categorical. The dependent variable can be continuous or binary.

 

B. REGRESSION EQUATION

The mathematical model of simple linear regression is shown in the regression equation/regression function/regression line: y = a + bx, where y is the dependent/response variable, a is the intercept, b is the slope/regression coefficient, and x is the independent/predictor variable. Both a and b are in a strict sense regression coefficients but the term is usually reserved for b only.
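
A minimal sketch in Python (scipy) of fitting the simple linear regression line y = a + bx; the data are illustrative.

    from scipy import stats

    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]     # independent/predictor variable
    y = [2.1, 4.3, 5.9, 8.2, 9.8, 12.1]    # dependent/response variable

    result = stats.linregress(x, y)
    a, b = result.intercept, result.slope
    print(f"y = {a:.2f} + {b:.2f}x, r = {result.rvalue:.3f}, p = {result.pvalue:.4f}")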

 

C. ASSUMPTIONS

The following assumptions are made for the validity of the regression model: (a) linearity of the relation between x and y (b) normal distribution of the y variable for any given value of x (c) constant variance or homoscedasticity, i.e. the variance of y is the same for all values of x (d) the deviations of the y values from the mean are independent for each value of x. The assumption of normality states that for any value of x, the distribution of y is normal. The assumption of homoscedasticity states that the variances of y at various levels of x are approximately equal; stated another way, the variance of y is approximately constant for various values of x. This variance consists of both measurement error and biological variation. The regression model is not valid in a situation of heteroscedasticity (variable variance). Heteroscedasticity can be detected by plotting the residuals against the fitted values of the regression line. The Bartlett and Levene median tests are more sophisticated methods of testing for homoscedasticity. Heteroscedasticity can be removed by rescaling y as y^λ, where λ = 1, ½, 0, or -1 (λ = 0 being conventionally taken as the logarithmic transformation).

 

MULTIPLE LINEAR REGRESSION

Multivariate analysis determines the relative contribution of different causes to a single event. It also enables assessment of one variable while holding the rest of the variables constant. The regression equation is y = a + b1x1 + b2x2 + … + bkxk, where y is the dependent/response variable, a is the intercept, and b1, b2, …, bk are the slopes/regression coefficients. Three procedures are used for fitting the multiple regression line: step-up, step-down, and step-wise.
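
A minimal sketch in Python (numpy) of fitting a multiple linear regression by ordinary least squares with two predictors; the variables and values are hypothetical.

    import numpy as np

    x1 = np.array([25, 30, 35, 40, 45, 50], dtype=float)         # e.g. age
    x2 = np.array([60, 65, 72, 70, 80, 85], dtype=float)         # e.g. weight
    y = np.array([118, 121, 127, 126, 134, 140], dtype=float)    # e.g. systolic blood pressure

    X = np.column_stack([np.ones_like(x1), x1, x2])   # intercept column plus the predictors
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b1, b2 = coeffs
    print(a, b1, b2)    # y = a + b1*x1 + b2*x2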

 

LOGISTIC REGRESSION

Logistic regression is very useful in epidemiological analysis because the outcome variable is dichotomous (e.g. diseased versus not diseased).
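
A minimal sketch in Python (scikit-learn) of logistic regression for a dichotomous outcome; the data are invented, and note that scikit-learn applies regularization by default, so the exponentiated coefficient is only approximately an odds ratio.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[25], [32], [41], [47], [55], [63], [70], [78]])   # e.g. age
    y = np.array([0, 0, 0, 1, 0, 1, 1, 1])                           # 1 = diseased, 0 = not diseased

    model = LogisticRegression().fit(X, y)
    odds_ratio_per_year = np.exp(model.coef_[0][0])     # exponentiated coefficient
    print(odds_ratio_per_year)
    print(model.predict_proba([[50]])[0, 1])            # predicted probability of disease at age 50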

Omar Hasan Kasule, Sr September 2001