**A. DEFINITION OF TERMINOLOGY**

A **field/attribute/variable/variate** is the characteristic measured for
each member, e.g., name and weight. A **value/element** is the actual measurement or
count, such as 5 cm or 10 kg. A **record/observation**
is a collection of all variables belonging to one individual. A **file** is a collection of records. A **database** is a collection of files.
A **data dictionary** is an explanation or index of the data.


**B. DATA CODING**

**Self-coding** or **pre-coded** questionnaires are preferable to those requiring coding after data collection,
because manual coding can introduce errors and inconsistencies into the data. A good pre-coded questionnaire can be produced
after piloting the study.

**C. DATA ENTRY**

Both random and non-random errors can occur in data entry. The following methods can be used to detect such errors:
(a) **double entry**, in which two data entry clerks enter the same data and the computer checks the items
on which they differ; (b) the data entered in the computer can be **checked manually** against the original questionnaire;
(c) **interactive data entry**, in which the data entry program is written to detect entries with unacceptable
values or entries that are logically inconsistent.


**D. DATA PROCESSING**

**Data editing** is the process of correcting data collection and data
entry errors. The data is 'cleaned' using logical and statistical checks. Range checks detect entries whose values
are outside what is expected; for example, a child height of 5 meters is clearly wrong. Consistency checks identify
errors such as recording the presence of an enlarged prostate in a female. Among the functions of data editing is to make sure
that all values are at the same level of precision (number of decimal places). This makes computations consistent and decreases
rounding-off errors.
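
The range and consistency checks described above can be sketched in code; the field names and cut-off values below are illustrative assumptions, not taken from the text:

```python
# Sketch of range and consistency checks; field names and limits are
# illustrative assumptions, not from the text.
records = [
    {"id": 1, "sex": "F", "height_cm": 162, "enlarged_prostate": False},
    {"id": 2, "sex": "M", "height_cm": 500, "enlarged_prostate": False},
    {"id": 3, "sex": "F", "height_cm": 158, "enlarged_prostate": True},
]

def range_check(rec):
    """Flag values outside the expected range (eg a height of 5 meters)."""
    errors = []
    if not 40 <= rec["height_cm"] <= 230:
        errors.append("height out of range")
    return errors

def consistency_check(rec):
    """Flag logically inconsistent combinations of variables."""
    errors = []
    if rec["sex"] == "F" and rec["enlarged_prostate"]:
        errors.append("enlarged prostate recorded for a female")
    return errors

flagged = {r["id"]: range_check(r) + consistency_check(r) for r in records}
flagged = {rid: errs for rid, errs in flagged.items() if errs}
print(flagged)  # records 2 and 3 are flagged
```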

**Validation checks** can be carried out using two methods: (a) logical checks and (b) statistical checks, which involve
plotting the data on a graph and visually detecting outlying values.

**Data transformation** is the process of creating new derived variables
preliminary to analysis. The transformations may be simple, using ordinary arithmetical operators, or more complex, using mathematical
transformations. New variables may be generated by (a) carrying out arithmetical
operations on the old variables, such as division or multiplication, or (b) combining 2 or more variables to generate a new one
by addition, subtraction, multiplication, or division. New variables can also
be generated by using mathematical transformations of variables: logarithmic, trigonometric, power, and z-transformations.
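
As a sketch of these transformations (the field names and values are made up for illustration), a new variable can be derived by combining two variables and by a logarithmic transformation:

```python
import math

# Sketch of deriving new variables; field names and values are made up.
subjects = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 50.0, "height_m": 1.60},
]

for s in subjects:
    # Combining two variables by division: body mass index
    s["bmi"] = s["weight_kg"] / s["height_m"] ** 2
    # Mathematical transformation: natural logarithm of weight
    s["log_weight"] = math.log(s["weight_kg"])

print(round(subjects[0]["bmi"], 1))  # 22.9
```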


**E. DATA PROBLEMS**

**Missing data** can arise from data collection when no response was recorded at all or from data entry when
the value was dropped accidentally. Missing data due to data entry errors is easy to correct. It is more difficult to go back
and collect the missing data from the respondents and analysis may have to proceed with some data missing. It is better to
have a code for missing data than to leave the field blank.

**Coding and entry errors** are
common. The person entering data may make random mistakes in typing letters or digits. Systematic mistakes arise when
there is an error in coding. Random mistakes are more difficult to detect; systematic mistakes are easier to detect and correct.

**Inconsistencies** in data must be checked for. The values of some variables may not be consistent with the values
of other variables due to errors in one or several variables. For example, there is an inconsistency if a 4-year-old child is
reported to have 2 children.

**Irregular patterns** in data
may indicate errors. The decision on whether irregular patterns exist is based on previous knowledge or familiarity with the
type of information being studied.

**Digit preference:** The final digits of measured values are expected to be
random. If the data shows a predominance of one digit, especially the last digit, a strong suspicion of systematic error arises.
The person measuring may have a tendency to estimate the last digit to a certain fixed value, which creates the digit preference.

**Outliers** are values of the
variable that lie outside the normal or expected range. Usually outliers are wrong values due to wrong measurements or wrong
data recording. It should be noted however that not all outliers are wrong. There are true phenomena in nature that lie outside
the range of the ordinary.

**Rounding-off / significant
figures errors:** All measurements and recording of measured values involve rounding
off to the nearest significant figure or decimal place. It is recommended to record values during measurement
without any rounding off and to round off only after making computations. This is because rounding off at intermediary stages
introduces small rounding-off errors that accumulate and introduce serious error in the final computed result.

**Questions with multiple
valid responses:** Some variables may have more than one valid response, a situation
that introduces errors in data analysis because the analyst may not know which response is meaningful. This problem arises
from poor questionnaire design and can be avoided by piloting the questionnaire.

**Record duplication** arises when data on one person is entered as two or more separate records. Analysis
of data with duplicate records leads to misleading results, since one member of the sample contributes more than once. Record
duplication can be identified easily by sorting records by an identification number so that repeated records lie
next to one another.
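
This sort-and-compare approach can be sketched as follows (the records are made up):

```python
# Sketch: sort records by an identification number so that duplicate
# records lie next to one another; the records are made up.
records = [
    {"id": 102, "name": "A"},
    {"id": 101, "name": "B"},
    {"id": 102, "name": "A"},   # duplicate entry for id 102
]

records.sort(key=lambda r: r["id"])
duplicates = [records[i]["id"]
              for i in range(1, len(records))
              if records[i]["id"] == records[i - 1]["id"]]
print(duplicates)  # [102]
```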

**DATA SUMMARY and PRESENTATION AS DIAGRAMS: TABLES and GRAPHICS**

A. DATA GROUPING

The objective of grouping is to summarize data for presentation (parsimony) while preserving a complete picture.
A suitable number of classes is 10-20. The following are desirable characteristics of classes: mutually exclusive, intervals
equal in width, and intervals continuous throughout the distribution. Grouping error is defined as information loss due to
grouping. Grouped data gives less detail than ungrouped data. The bigger the class interval, the bigger the grouping error.
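
A minimal sketch of grouping into equal-width, mutually exclusive classes (the data and class limits are illustrative assumptions):

```python
# Sketch: tally heights into equal-width, mutually exclusive classes
# (illustrative values; class limits are assumptions).
heights = [150, 152, 155, 158, 160, 161, 163, 166, 170, 174, 178, 181]
low, width, k = 150, 10, 4                 # lowest limit, class width, classes

freq = {}
for h in heights:
    i = min((h - low) // width, k - 1)     # class index for this value
    lo = low + i * width
    label = f"{lo}-{lo + width - 1}"       # eg '150-159'
    freq[label] = freq.get(label, 0) + 1

print(freq)  # {'150-159': 4, '160-169': 4, '170-179': 3, '180-189': 1}
```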

B. DATA TABULATION

Tabulation has the objective of presenting and summarizing a lot of data in logical groupings and for 2 or more
variables for visual inspection. A table can show the following summaries about data: cell frequency or cell number, cell
number as a percentage of the overall total, cell number as a row percentage, cell number as a column percentage, cumulative
frequency, cumulative frequency%, relative (proportional) frequency, and relative frequency %. Ideal tables are simple, easy
to read, and correctly scaled. The layout of the table should make it easy to read and understand the numerical information.
The table must be able to stand on its own, ie understandable without reference to the text. The table must have a title/heading
that indicates its contents. Labeling must be complete and accurate: title, rows & columns, marginal & grand
totals, and units of measurement. The field labels are in the margins of the table while the numerical data is in the cells
that are in the body of the table. Footnotes may be used to explain the table. A contingency table can be presented in several
configurations. The commonest is the 2 x 2 contingency table. Other configurations are the 2 x k table and the r x c table.
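
The cell summaries listed above can be sketched for one cell of a 2 x 2 table (the counts are made up):

```python
# Sketch of the summaries a table can show, for one cell of a 2 x 2
# contingency table (made-up counts).
table = [[20, 30],   # row 1
         [10, 40]]   # row 2

grand = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

cell = table[0][0]
pct_overall = 100 * cell / grand          # cell as % of the grand total
pct_row = 100 * cell / row_totals[0]      # cell as a row %
pct_col = 100 * cell / col_totals[0]      # cell as a column %
print(pct_overall, pct_row, pct_col)
```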

C. DIAGRAMS

ONE-WAY BAR DIAGRAMS

A bar diagram uses ‘bars’ to indicate frequency. The bars
may be vertical in column charts or horizontal in row charts. Both forms of bar diagrams may be constructed in three dimensions.
There are 2 types of bar diagram: the bar chart and the histogram. In the bar chart there are spaces between the bars.
In the histogram the bars are lying side by side. The bar chart is best for discrete, nominal or ordinal data. The histogram
is best for continuous data. Bar charts and histograms are generated from frequency tables discussed above. The area of the
bar represents frequency ie the area of the bar is proportional to the frequency. If the class intervals are equal, the height
of the bar represents frequency. If the class intervals are unequal the height of the bar does not represent frequency. Frequency
can only be computed from the area of the bar.

PIE CHART

The pie chart / pie diagram shows relative frequency %
converted into angles of a circle (called sector angles). The area of each sector is proportional to the frequency.

MAPS

Different variables or values of one variable can be indicated by use of different shading, cross-hatching, dotting,
and colors.

LINE GRAPHS/FREQUENCY POLYGON

A frequency polygon is the plot of the frequency against the mid-point of the class interval. The points are joined
by straight lines to produce a frequency polygon. If smoothed, a frequency curve is produced. The line graph can be used to
show the following: frequency polygon, cumulative frequency curve, cumulative frequency % (also called the ogive), and moving
averages. The line graph has two axes: the abscissa or x-axis is horizontal. The ordinate or y-axis is vertical. It is possible
to plot several variables on the same graph. The graph can be smoothed manually or by computer. Time series and changes
over time are shown easily by the line graph. Trends, cyclic and non-cyclic are easy to represent. Line graphs are sometimes
superior to other methods of indicating trend such as moving averages and linear regression. The frequency polygon has the
following advantages: (a) It shows trends better (b) It is best for continuous data (c) A comparative frequency polygon can
show more than 1 distribution on the same page. The cumulative frequency curve has the additional advantage of being used
to show and compare different sets of data because their respective medians, quartiles, percentiles, can be read off directly
from the curve.

SCATTER DIAGRAM / SCATTER-GRAM

The scatter diagram is also called the x-y scatter.

PICTOGRAM

Pictures of the variable being measured are used instead of bars. The magnitude can be shown either by the size
of the picture or by the number of pictures.

D. SHAPES OF DISTRIBUTIONS

Unimodal has one peak and is most common in biology. Bi-modal has 2 peaks that are not necessarily of the same
height. A perfectly symmetrical curve is bell-shaped and is centered on the mean. Skew to the right (+ve skew) is more common
than skew to the left.

**E. MISLEADING DIAGRAMS**

Poor labeling of scales is deceiving. Distortion of scales can make fluctuating data appear stable or smooth over
a dip. Making the scale wider or narrower produces different graphical impressions: the wider scale gives the impression of
less change and is less likely to suggest a relationship. Using a logarithmic scale may give a different impression from using
a linear scale, although some types of data are best represented on a logarithmic scale. A double vertical scale can
be used to show spurious association. Narrow and wide scales give different impressions
of the same data. Omitting the zero/origin makes interpretation difficult. If plotting from zero is not possible for reasons of
space, a broken line should be used to show the discontinuity of the scale.


**DISCRETE DATA SUMMARY: RATES, RATIOS, and PROPORTIONS**

A. RATES

Rates are events in a given population over a defined time period. A rate has 3 components: a numerator, a denominator,
and time. The numerator of a rate is included in its denominator. The general formula of a rate is the total number with the disease
or characteristic in a given time period divided by the total number of persons at risk (those with the disease + those without
the disease). In symbols this is written as a / ((a+b)t), where a = number of new cases, b = number without disease, and t = time of
observation. Incidence is a type of rate. It describes a moving, dynamic picture of disease, eg IMR, CBR, and IR. Crude
rates are unweighted and can be misleading. Comparison of crude rates in 2 populations is not possible; no valid inference based
on crude rates is possible because of confounding. The following types of specific rates are commonly used: age-specific,
sex-specific, place-specific, race-specific, and cause-specific rates. Adjusted/standardized rates involve adjustment for age, sex,
or any other factor to remove confounding and allow comparison across populations.


**Standardization** is a statistical technique that involves adjustment of a rate or a proportion for one or more confounding
factors. There are several approaches to standardization: direct standardization, indirect standardization, life expectancies that
are age-adjusted summaries of current mortality rates, and regression techniques. Both direct and indirect standardization
involve the same principles but use different weights.
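
A minimal sketch of direct standardization, using made-up age-specific rates and a made-up standard population:

```python
# Sketch of direct age standardization: apply each population's age-specific
# rates to the age structure of a common standard population (made-up figures).
standard_pop = [1000, 2000, 3000]     # standard population by age group
rates_a = [0.001, 0.005, 0.020]       # age-specific rates, population A
rates_b = [0.002, 0.004, 0.015]       # age-specific rates, population B

def direct_standardized_rate(rates, std_pop):
    # Expected events in the standard population, divided by its size
    expected = sum(r * n for r, n in zip(rates, std_pop))
    return expected / sum(std_pop)

rate_a = direct_standardized_rate(rates_a, standard_pop)
rate_b = direct_standardized_rate(rates_b, standard_pop)
print(round(rate_a, 5), round(rate_b, 5))
```

Because both standardized rates use the same weights (the standard population), they can be compared directly, which is not valid for crude rates.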

B. RATIOS

The general formula for a ratio is the number of cases of a disease divided by the number without the disease, a/b. Examples
of ratios are the proportional mortality ratio, the maternal mortality ratio, and the fetal death ratio. The proportional
mortality ratio is the number of deaths in a year due to a specific disease divided by the total number of deaths in that
year. This ratio is useful in occupational studies because it provides information on the relative importance of a specific
cause of death. The maternal mortality ratio is the total number of maternal deaths divided by the total live births. The
fetal death ratio is the total number of fetal deaths divided by the total live births.

C. PROPORTIONS

Proportions are used for enumeration. A proportion is the number of events expressed as a fraction of the total
population at risk. It has only 2 components: the numerator and the denominator. The numerator is included in the denominator.
The time period is not defined but is implicitly assumed. The general formula for a proportion is a/(a+b). Examples of proportions
are the prevalence proportion, the neonatal mortality proportion, and the perinatal mortality proportion. Prevalence describes a
still/stationary picture of disease. Like rates, proportions can be crude, specific, and standardized.
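
The three formulas can be contrasted with a small sketch (the counts and time period are made up):

```python
# Sketch contrasting rate a/((a+b)t), ratio a/b, and proportion a/(a+b);
# the counts and time period are made up.
a = 30       # new cases
b = 970      # persons without the disease
t = 2.0      # years of observation

rate = a / ((a + b) * t)     # events per person-year
ratio = a / b                # cases per non-case
proportion = a / (a + b)     # cases as a fraction of all at risk

print(rate, ratio, proportion)
```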

**CONTINUOUS DATA SUMMARY 1: MEASURES OF CENTRAL TENDENCY**

**A. COMMON MEASURES**

Three averages are commonly used: the mean, the mode, and the median. The arithmetic mean is considered the most
useful measure of central tendency in data analysis. The median is gaining popularity; it is the basis of some non-parametric
tests, as will be discussed later. The mode has very little public health importance.


**B. ARITHMETIC MEAN**

The arithmetic mean is defined as the sum of the observations' values divided by the total number of observations. The arithmetic mean reflects the impact of all observations.

C. MODE

The mode for a set of values is defined as the commonest, most frequent, or most popular observation. The mode
is rarely mentioned in scientific literature. The mode is used when the interest is in identifying the most prominent observation.
The mode is easy to compute or determine.

D. MEDIAN

The median is defined as the middle observation in a ranked series such that 50% are above it and 50% are below it. The
median is intuitively easy to understand as the value of the observation that is in the middle of the series. The median is
best used for the following types of data: (a) erratically spaced data and (b) data that has extreme values, since the median is
less affected by extreme values. In this respect the median is superior to the arithmetic mean. The median for a set of values
can be determined even if the value of the extreme observation is not known exactly. This is another respect in which the median
is superior to the arithmetic mean: the arithmetic mean cannot be computed for open-ended distributions. The median is used
mainly for description rather than analysis; generally no further mathematical manipulations are possible using the median. However,
statistical methods based on the median have been developed recently and are gaining popularity.
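
The three common averages can be computed with Python's standard statistics module; the series below is made up, with one extreme value to show how the mean is pulled while the median is not:

```python
import statistics

# Sketch: the three common averages on a made-up series containing one
# extreme value (40).
values = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(values)      # pulled upward by the extreme value
median = statistics.median(values)  # middle of the ranked series
mode = statistics.mode(values)      # the most frequent observation

print(mean, median, mode)
```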

**CONTINUOUS DATA SUMMARY 2: MEASURES OF DISPERSION/VARIATION**

**A. MEASURES OF VARIATION**

Measures of variation can be classified as absolute or relative. The absolute measures include: range, inter-quartile
range, mean deviation, variance, standard deviation, quantiles. The relative measures include: coefficient of variation (CV),
standardized score (z-score). They can also be classified depending on how they are derived: based on the mean, based on quantiles,
and others.


**B. VARIANCE**

The variance is defined as the sum of the squared deviations of each observation from the mean divided by the sample
size, n. The divisor n is corrected to n-1 for small samples to correct for sampling error; in this formula n-1 represents
the degrees of freedom.
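
A sketch of the two divisors on illustrative data; the results match the standard library's pvariance and variance functions:

```python
import statistics

# Sketch: divisor n (population variance) versus n - 1 (sample variance,
# the degrees-of-freedom correction); the data is illustrative.
x = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(x) / len(x)

var_n = sum((v - mean) ** 2 for v in x) / len(x)         # divisor n
var_n1 = sum((v - mean) ** 2 for v in x) / (len(x) - 1)  # divisor n - 1

assert var_n == statistics.pvariance(x)   # same as the library functions
assert var_n1 == statistics.variance(x)
print(var_n, var_n1)  # 4.0 4.571428571428571
```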


**C. STANDARD DEVIATION **

The standard deviation is the square root of the variance described above. The standard deviation is the most frequently
used measure of variation. A distinction must be made between the standard deviation (which describes the variability of individual
observations) and the standard error of the mean (which describes the variability of the sample mean and equals the standard
deviation divided by the square root of the sample size).


**D. QUARTILES**

If a set of observations is arranged in order of magnitude and is divided into 4 equal parts, each of those
parts is called a quartile. The first and third quartiles are the ones usually referred
to in statistical work. The inter-quartile range is defined as the difference between the third quartile, Q3, and the first quartile,
Q1.


**E. PERCENTILES**

Percentiles are also called centile scores. Centiles are a form of cumulative frequency and can be read off a cumulative
frequency curve. If a set of observations is arrayed in order of magnitude and is then divided into 100 equal intervals, each
interval is called a percentile. The centile point is the score. Each centile point is given a rank, eg 10th, 25th,
etc. In pharmacological experiments LD50 is the 50th percentile. Percentile scores have the advantage
of being direct and very intelligible.
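
A sketch of reading centile points with the standard statistics module (the data and the choice of the 'inclusive' interpolation method are illustrative assumptions):

```python
import statistics

# Sketch: percentile cut points from ordered data; quantiles with n=100
# returns the 1st through 99th centile points.
data = list(range(1, 101))                # 1..100, already arrayed in order

centiles = statistics.quantiles(data, n=100, method="inclusive")
p25, p50, p75 = centiles[24], centiles[49], centiles[74]

print(p25, p50, p75)  # 25.75 50.5 75.25
assert p50 == statistics.median(data)     # the 50th percentile is the median
```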

F. THE RANGE

The range is defined in two ways: (a) by stating the minimum and maximum values or (b) by stating the difference between
the maximum and the minimum values. For a complete definition both (a) and (b) should be used. The range is a simple measure
with intuitive appeal. It is easy to compute. It is useful for preliminary or rough work. The range depends entirely on the
extreme values, since it is based on only 2 observations. A big sample is likely to have a wider range because extreme values
are more likely to occur. The range is more sensitive to sampling fluctuations than the standard deviation. The range suffers
from the additional disadvantage that no further mathematical manipulations are possible.

PARAMETRIC DISCRETE DATA ANALYSIS USING PROPORTIONS

A. PRELIMINARY CONSIDERATIONS

NORMAL DISTRIBUTION OF THE DATA

The first step is to ascertain whether the data distribution follows an approximate Gaussian distribution. The
approximate methods are most valid when the data is Gaussian.

EQUAL VARIANCES

It is possible to compute variances for proportions using the binomial theorem. The variances of proportions in
the compared samples must be approximately equal for the statistical tests to be valid.

ADEQUACY OF THE SAMPLE SIZE

For approximate methods to be valid, the sample size must be adequate. There are special statistical procedures
for ascertaining sample size.


**B. DATA LAY-OUT**

The data for approximate methods is laid out in the form of contingency tables: 2 x 2, 2 x k, or m x n. Visual inspection
is recommended before application of statistical tests.


**C. STATING THE HYPOTHESES**

The null hypothesis and the alternative hypotheses must be stated clearly. The following formulations are acceptable.
In inference on 2 sample proportions using the z or chi-square test: H0: sample proportion #1 - sample proportion #2
= 0; HA: sample proportion #1 > or < sample proportion #2. In
inference on 3 or more sample proportions using the chi-square test: H0: sample proportion #1 = sample proportion #2 = sample proportion #3 = ... = sample proportion #n.


**D. FIXING THE TESTING PARAMETERS**

Testing parameters are fixed. For the p-value approach, the 5% or 0.05 level of significance is customarily used.
There is nothing preventing using any other level like 2.5% or 10%.


**TESTING FOR TWO BINOMIAL PROPORTIONS IN A 2 x 2 TABLE USING χ²**

The chi-square is computed from the data using appropriate formulas. The chi-square for paired data is called the
McNemar chi-square. The formula of the Pearson chi-square for independent samples is Σ (Observed - Expected)² / Expected,
which follows a χ² distribution with 1 degree of freedom for a 2 x 2 table. The chi-square statistic
computed as explained above is referred to the appropriate table to look up the p-value under the appropriate degrees of freedom.
The general formula for degrees of freedom is (rows - 1) x (columns - 1). The following decision rules are then used: if
the p-value is larger than the level of significance of 0.05, the null hypothesis is not rejected; if the p-value is smaller
than the level of significance of 0.05, the null hypothesis is rejected. There is a simple computational formula for the chi-square
in 2 x 2 tables.
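
The shortcut formula for the 2 x 2 table can be sketched and checked against the general formula (the cell counts are made up):

```python
import math

# Sketch of the simple computational formula for the Pearson chi-square in a
# 2 x 2 table: chi2 = n(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)); made-up counts.
a, b = 30, 20     # row 1 of the 2 x 2 table
c, d = 15, 35     # row 2
n = a + b + c + d

chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# The same value from the general formula sum((Observed - Expected)^2 / Expected)
observed = [a, b, c, d]
expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
            (c + d) * (a + c) / n, (c + d) * (b + d) / n]
chi2_general = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

assert math.isclose(chi2, chi2_general)
print(round(chi2, 3))  # 9.091
```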

**PARAMETRIC CONTINUOUS DATA ANALYSIS USING MEANS**

A. THE T TEST STATISTIC

The Student t-test is the most commonly used test statistic for inference
on continuous numerical data. The t-test must fulfil the following conditions of validity:
(a) the samples compared must be normally distributed and (b) the variances of the samples compared must be approximately equal.
The t-test is used uniformly for sample sizes below 60. It is also used for sample sizes above this if the population standard
deviation is not known.
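
A sketch of the pooled (equal-variance) two-sample t statistic computed from its definition; the sample values are made up:

```python
import math
import statistics

# Sketch of the pooled two-sample t statistic (equal-variance form);
# the sample values are made up.
x = [5.1, 4.9, 5.6, 5.2, 5.0, 5.4]
y = [4.6, 4.8, 4.5, 5.0, 4.7, 4.4]

nx, ny = len(x), len(y)
sp2 = ((nx - 1) * statistics.variance(x) +
       (ny - 1) * statistics.variance(y)) / (nx + ny - 2)   # pooled variance
t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))
df = nx + ny - 2                                            # degrees of freedom

print(round(t, 2), df)  # 3.86 10
```

The statistic is then referred to the t table with the stated degrees of freedom to obtain the p-value.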

B. THE F-STATISTIC

The F-test is a generalized test used in inference on 3 or more sample means. The procedures of the F-statistic
are also generally called analysis of variance, ANOVA. ANOVA studies how the mean varies by group.


**C. ASCERTAINING THE NORMAL DISTRIBUTION OF THE DATA **

The first step is to ascertain whether the data distribution follows an approximate Gaussian distribution.


**D. ASCERTAINING THE EQUALITY OF VARIANCES**

The tests above require that the samples being compared have approximately equal variances. The tests do not perform
optimally if the magnitudes of the variances vary wildly. Usually informal methods of testing equality of variances are carried
out. It is not absolutely necessary that the variances be exactly equal. What is required is that they have the same order
of magnitude or be approximately equal.


**E. STATING THE TEST HYPOTHESES**

The null hypothesis and the alternative hypotheses must be stated clearly. The following formulations are acceptable.
Inference on 1 sample mean using the z or t-test: H0: sample mean - population mean = 0; HA: sample mean
- population mean > or < 0. Inference on 2 sample means using the z or t-test: H0: sample mean #1 - sample mean
#2 = 0; HA: sample mean #1 - sample mean #2 > or < 0. Inference on 3 or more sample means using the F-test:
H0: sample mean #1 = sample mean #2 = sample mean #3 = ... = sample mean #n.


**F. STATING THE TEST PARAMETERS**

For the confidence interval approach the 95% bounds are used customarily. There is nothing to prevent 90% or 99%
intervals from being used. For the p-value approach, the 5% or 0.05 level of significance is customarily used. There is nothing
preventing using any other level like 2.5% or 10%.


NON-PARAMETRIC ANALYSIS OF CONTINUOUS DATA USING MEDIANS

**A. DEFINITION AND NATURE**

Non-parametric methods are used for data that is not normally distributed.


**B. ADVANTAGES**

These methods are simple: easy to understand and employ. They do not need complicated mathematical
operations, leading to rapid computation. They make few assumptions about the distribution of the data, so they can be used for
non-Gaussian data. They can also be used for data whose distribution is not known, because there is no need for normality assumptions.


**C. DISADVANTAGES**

Non-parametric methods can be used efficiently for small data sets; with data sets that have many observations,
the methods cannot be applied with ease. These methods are also not easy to use with complicated experimental designs. Non-parametric
methods are less efficient than parametric methods for normally-distributed data. Hypothesis testing with non-parametric methods is
less specific than hypothesis testing with parametric methods.


**D. CHOICE BETWEEN PARAMETRIC AND NON-PARAMETRIC**

Non-parametric methods should never be used where parametric methods are possible. Non-parametric methods should therefore
be used only if the test for normality is negative. Non-parametric methods are also used in situations in which the distribution
of the parent population is not known.


**E. CORRESPONDENCE OF PARAMETRIC & NON PARAMETRIC**

| Situation | Parametric test | Non-parametric test |
| --- | --- | --- |
| 1 sample | z-test, t-test | Sign test |
| 2 independent sample means | t-test | Rank Sum test |
| 2 paired sample means | t-test | Signed Rank test |
| 3 or more independent sample means | ANOVA (1-way) | Kruskal-Wallis |
| Multiple comparisons of means | ANOVA (2-way) | Friedman |
| Correlation | Pearson | Spearman |
| Comparing survival curves | Proportional hazards regression | Log rank test |

Virtually every parametric test has an equivalent non-parametric one, as shown in the table above. Note that the
Mann-Whitney test gives results equivalent to those of the rank sum test. The Kendall rank correlation coefficient gives results
comparable to those of the Spearman coefficient. The signed rank and rank sum
tests are based on the median.
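
As a sketch of one of these non-parametric procedures, the one-sample sign test reduces to an exact binomial test on the signs of the differences from a hypothesized median (the values are made up):

```python
from math import comb

# Sketch of the one-sample sign test (non-parametric counterpart of the
# one-sample t-test); the data and hypothesized median are made up.
hypothesized_median = 10
values = [12, 15, 9, 13, 14, 11, 16, 8, 13, 12]

diffs = [v - hypothesized_median for v in values if v != hypothesized_median]
n = len(diffs)
k = sum(d > 0 for d in diffs)             # number of positive signs

# Two-sided exact binomial p-value with p = 1/2 under the null hypothesis
tail = min(k, n - k)
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n)
print(k, n, round(p_value, 3))  # 8 10 0.109
```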

CORRELATION

**A. CORRELATION AS PRELIMINARY DATA ANALYSIS**

Data analysis usually begins with preliminary exploration using linear correlation. A correlation matrix is used
to explore for pairs of variables likely to be associated. Then more sophisticated methods are applied to define the relationships
further.


**B. EXPLORATION OF BIVARIATE RELATIONSHIP**

Correlation describes the relation between 2 variables about the same person or object with no prior evidence of
inter-dependence. Both variables are random. Correlation indicates only association. The association is not necessarily causative.
Correlation measures the strength of bivariate relationship. It measures linear relation and not variability.


**C. OBJECTIVES OF CORRELATION ANALYSIS**

Correlation analysis has the following objectives: (a) describe the relation between x and y (b) predict y if x
is known and vice versa (c) study trends (d) study the effect of a third factor like age on the relation between x and y.


**D. THE SCATTERGRAM**

The first step in correlation analysis is to inspect a plot of the data. This gives a visual impression of
the data layout and identifies outliers.


**E. CORRELATION COEFFICIENT**

Pearson’s coefficient of correlation (product-moment correlation), r, is the commonest statistic for linear
correlation. It has a complicated formula but can be computed easily by modern computers. It is essentially a measure of the
scatter of the data. The correlation coefficient cannot be interpreted correctly without looking at the scatter-gram, and it
is not interpretable for small samples. The size of r may not matter: a small r may be significant
while a big one may not be. The significance of r also depends on what is being measured; for some variables small values
of r may be significant. In general Colton recommends the following interpretation
of r: values 0.25 - 0.50 indicate a fair degree of association; values of 0.50 - 0.75 indicate moderate to good relation;
values above 0.75 indicate good to excellent relation. A value of r = 0 indicates either no correlation or that the two variables
are related in a non-linear way; in cases of no correlation with r = 0, the scatter-plot is circular. A high correlation
coefficient usually arises when the dots hug the regression line closely
(p. 165 Minium). Very high correlation coefficients are suspect and need to be checked carefully. In perfect positive correlation,
r = 1; in perfect negative correlation, r = -1. The value of the coefficient does not change when the units in which x and y
are measured are changed.
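
Pearson's r can be sketched directly from its definition (the data is illustrative):

```python
import math

# Sketch of Pearson's product-moment correlation coefficient computed
# directly from its definition; the data is illustrative.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))   # sum of cross-products
sx = math.sqrt(sum((a - mx) ** 2 for a in x))
sy = math.sqrt(sum((b - my) ** 2 for b in y))
r = sxy / (sx * sy)

print(round(r, 3))  # 0.775
```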

REGRESSION

**A. INDEPENDENT and DEPENDENT VARIABLES**

Both correlation and regression address the relation between 2 variables, and the scatter-gram is basic to both. In
correlation both x and y are random. In regression x is independent whereas y is dependent, being determined by
x. The outcome variable in regression is measured as a mean. The independent variable can be continuous or categorical. The
dependent variable can be continuous or, in logistic regression, binary.


**B. REGRESSION EQUATION**

The mathematical model of simple linear regression is shown in the regression equation/regression function/regression
line y = a + bx, where y is the dependent/response variable, a is the intercept, b is the slope/regression coefficient, and
x is the independent/predictor variable. Both a and b are in a strict sense regression coefficients, but the term is usually
reserved for b only.
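
Fitting the line y = a + bx by least squares can be sketched as follows (the data is illustrative):

```python
# Sketch: fitting the simple linear regression line y = a + bx by least
# squares; the data is illustrative.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))     # slope / regression coefficient
a = my - b * mx                             # intercept: line passes through (mx, my)

print(round(a, 2), round(b, 2))  # 0.15 1.95
y_at_6 = a + b * 6                          # using the line to predict y for x = 6
```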


**C. ASSUMPTIONS**

The following assumptions are made for the validity of the regression model: (a) linearity of the relation between
x and y; (b) normal distribution of the y variable for any given value of x; (c) constant variance or homoscedasticity, ie the variance
of y is the same for all values of x; (d) the deviations of the y variables from the mean are independent for each value of x. The assumption
of normality states that for any value of x, the distribution of y is normal. The assumption of homoscedasticity states that
the variances of y at various levels of x are approximately equal; stated another way, the variance of y is approximately
constant for various values of x. This variance consists of both measurement error and biological variation. The regression
model is not valid in a situation of heteroscedasticity (variable variance). Heteroscedasticity can be detected by plotting residuals
against the fitted values of the regression line. The Bartlett and Levene median tests are more sophisticated methods of testing
for homoscedasticity. Heteroscedasticity can be removed by rescaling y as y^λ where λ = 1, ½, 0 (log transform), or -1.

MULTIPLE LINEAR REGRESSION

Multivariate analysis determines the relative contribution of different causes to a single event. It also enables
assessment of one variable while holding the rest of the variables constant. The regression equation is y = a + b1x1 + b2x2 + ... + bnxn,
where y is the dependent/response variable, a is the intercept, and b1, b2, ..., bn are the slopes/regression coefficients. Three procedures
are used for fitting the multiple regression line: step-up, step-down, and **step-wise**.


**LOGISTIC REGRESSION**

Logistic regression is very useful in epidemiological analysis because it models a dichotomous outcome
variable.
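
A sketch of the logistic model itself: the probability of the dichotomous outcome is p = 1 / (1 + e^-(a + bx)); the intercept and coefficient below are hypothetical, not fitted from data:

```python
import math

# Sketch of the logistic model: p = 1 / (1 + e^-(a + bx)) for a dichotomous
# outcome; the coefficients are hypothetical, not fitted.
a, b = -4.0, 0.1       # hypothetical intercept and regression coefficient

def probability(x):
    """Probability of the outcome at a given value of the predictor x."""
    return 1 / (1 + math.exp(-(a + b * x)))

p = probability(50)    # linear predictor a + bx = 1.0 here
odds = p / (1 - p)     # the model is linear in the log of these odds
print(round(p, 3), round(odds, 3))  # 0.731 2.718
```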