Understanding variables and their properties is essential to understanding statistical analysis. A constant has only one
unvarying value under all circumstances, for example π and c = speed of light. A random variable can be qualitative
(descriptive, with no intrinsic numerical value) or quantitative (with intrinsic numerical value). Qualitative variables can
be nominal (no specific order of magnitude), ordinal (specific order), or ranked. A random quantitative variable results when
numerical values are assigned to the results of measurement or counting. It is called a discrete random variable if the
assignment is based on counting and a continuous random variable if the assignment is based on measurement. A continuous
random variable can be expressed in fractions and decimals; a discrete random variable can only be expressed in whole
numbers. The choice of statistical technique depends on the type of variable. Many mistakes in data analysis arise from not
knowing the difference between discrete and continuous variables and consequently applying the wrong statistical technique.

**PRELIMINARIES OF DATA ANALYSIS**

Simple manual inspection of the data is needed before applying sophisticated statistical tests. Indiscriminate
application of the tests to data leads to wrong or misleading conclusions. Acquiring familiarity with the data by simple manual
inspection can help identify outliers, assess the normality of data distribution, and identify commonsense relationships among
variables that could alert the investigator to errors in computer analysis.
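
A first look at the data of the kind described above can be sketched with the Python standard library. This is only an illustration: the function name `inspect`, the 1.5 × IQR outlier rule, and the blood pressure readings are assumptions made for the example, not part of the original text.

```python
import statistics

def inspect(values):
    """Return basic summary figures and IQR-rule outliers for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier fences
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),
        "outliers": [v for v in values if v < low or v > high],
    }

# Hypothetical systolic blood pressure readings; 410 looks like a data-entry error.
readings = [118, 122, 125, 130, 121, 119, 127, 410, 124, 120]
summary = inspect(readings)
print(summary)
```

A mean far above the median, as in this made-up sample, hints at a skewed distribution or an erroneous value that should be checked before any formal test is run.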

Data analysis is essentially the construction and testing of hypotheses. Two procedures are employed in statistical
analysis. The test for association is done first. Effect measures are assessed only after an association has been found;
they are useless when the tests for association are negative. The tests for association commonly employed are the t test,
the chi-square test, the linear correlation coefficient, and the linear regression coefficient. The effect measures
commonly employed are the odds ratio, the risk ratio, and the rate difference. Measures of trend can discover relationships
that are not picked up by association and effect measures.
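
As an illustration of the effect measures named above, the sketch below computes them from a 2×2 table of counts. The cell layout (a, b, c, d), the function name, and the counts are assumptions for the example. Because only counts are given, the difference computed is a risk difference; a rate difference in the strict sense would need person-time denominators.

```python
def effect_measures(a, b, c, d):
    """Effect measures from a 2x2 table of counts.

    a = exposed with outcome,   b = exposed without outcome,
    c = unexposed with outcome, d = unexposed without outcome.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "risk_ratio": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
        "risk_difference": risk_exposed - risk_unexposed,
    }

# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed develop the outcome.
measures = effect_measures(30, 70, 10, 90)
print(measures)
```

With these made-up counts the risk ratio is about 3 and the odds ratio is somewhat larger, which is the usual pattern when the outcome is not rare.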

**TYPES OF ANALYSIS**

**Univariate analysis** is testing a hypothesis about one mean or one proportion. The t test is used to test hypotheses
about a single sample mean. The chi-square test is used to test hypotheses about a single sample proportion. Univariate
testing answers the question whether the given mean or proportion is significantly different from a hypothesized value, often zero.
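
The one-sample t test mentioned above can be sketched as follows. The data, the hypothesized mean `mu0`, and the function name are hypothetical. The sketch stops at the test statistic and its degrees of freedom, since the standard library has no t distribution for computing a p-value.

```python
import math
import statistics

def one_sample_t(values, mu0):
    """Return the t statistic and degrees of freedom for H0: population mean == mu0."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (mean - mu0) / standard_error, n - 1

# Hypothetical measurements tested against a hypothesized mean of 5.0.
t_stat, df = one_sample_t([5.1, 4.9, 5.6, 5.8, 5.2, 5.4], mu0=5.0)
print(t_stat, df)
```

The statistic would then be compared with the t distribution on n − 1 degrees of freedom, using tables or a statistics package.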

**Bivariate analysis** is testing the hypothesis whether two means or two proportions are significantly different from
one another. The choice of the statistical test for association in bivariate analysis is made according to Table #1.

**Multivariate analysis** in its commonest form is essentially bivariate analysis with adjustment for extraneous
variables that confuse (or confound) the bivariate relation. The choice of the statistical test of association for
multivariate analysis is made according to Table #2.

**STATISTICAL MODELS IN DATA ANALYSIS**

Observations or raw data have to be fitted to a specific statistical model. Once the model is fitted, it can be used for
prediction. There are basically three types of models: probability models, likelihood models, and regression models. The
**probability model** has deterministic and stochastic components. Probability models commonly used in statistical analysis
are the binomial and the normal distributions. The **likelihood model** derives the maximum likelihood estimator from the
data. The maximum likelihood estimate, MLE, is the most likely value of the parameter given the data and is derived
iteratively. The **regression model** may be a Poisson regression model or a binomial logistic regression model. The model
allows modeling of the interaction among confounders and of the interaction between the exposure and the confounders. It
can be used to explore additive and synergistic relations.
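
To illustrate how a maximum likelihood estimate can be reached by repeated refinement, the sketch below applies Newton-Raphson steps to the binomial log-likelihood. The choice of Newton-Raphson, the starting value, the tolerance, and the counts are all assumptions for the example. For the binomial the answer is known in closed form (successes divided by trials), which makes the result easy to check.

```python
def binomial_mle(successes, trials, start=0.2, tol=1e-12, max_iter=50):
    """Newton-Raphson search for the p that maximizes the binomial log-likelihood."""
    p = start
    failures = trials - successes
    for _ in range(max_iter):
        score = successes / p - failures / (1 - p)             # first derivative
        curvature = -successes / p**2 - failures / (1 - p)**2  # second derivative
        step = score / curvature
        p -= step
        if abs(step) < tol:
            break
    return p

# Hypothetical data: 7 successes in 20 trials.
p_hat = binomial_mle(7, 20)
print(p_hat)
```

Models without a closed-form solution, such as logistic regression, are typically fitted by the same kind of repeated updating, with vectors and matrices in place of single numbers.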

Multivariate models solve two problems that arose when stratified analysis was used. Stratified analysis breaks down when
data are sparse, with very low numbers in some strata, and it would be very cumbersome if used for more than 3 variables.
There are three main types of multivariate models: the linear model, the logistic model, and the proportional hazards model.
The linear model is $E(Y) = \beta_0 + \sum_{i=1} \beta_i x_i$. The binary logistic model is of the form
$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{i=1} \beta_i x_i$, so that the odds are
$\frac{p}{1-p} = e^{\beta_0 + \sum_{i=1} \beta_i x_i}$. The proportional hazards regression relates the hazard at a given
time to risk factors such that $y_i = \ln\{h_i(t) / h_0(t)\} = \beta_1 x_{1i} + \beta_2 x_{2i} + \dots$. The
coefficients of proportional hazards regression are interpreted like coefficients of logistic regression.
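
A small sketch may help show how logistic coefficients are read. In the logistic model the log odds are a linear combination of the covariates, so exponentiating a coefficient gives an odds ratio per unit change in that covariate. In the proportional hazards model the same exponentiation gives a hazard ratio. The intercept, coefficient values, and covariates below are invented for illustration.

```python
import math

def logistic_probability(intercept, coefs, xs):
    """Probability implied by a logistic model for one set of covariate values."""
    log_odds = intercept + sum(b * x for b, x in zip(coefs, xs))
    return 1.0 / (1.0 + math.exp(-log_odds))

def ratio_from_coefficient(coef):
    """Odds ratio (logistic) or hazard ratio (proportional hazards) per unit of x."""
    return math.exp(coef)

# Hypothetical fitted model: intercept -2.0, exposure coefficient 0.7, age coefficient 0.03.
p = logistic_probability(-2.0, [0.7, 0.03], [1, 40])
exposure_odds_ratio = ratio_from_coefficient(0.7)
print(p, exposure_odds_ratio)
```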

**TABLE #1:**

**CHOICE OF STATISTICAL TECHNIQUE FOR BIVARIATE ANALYSIS** [i]

| First variable | Second variable | Test |
|---|---|---|
| Continuous | Dichotomous, unpaired | 2-sample t test |
| Continuous | Dichotomous, paired | Paired t test (1-sample t test after taking differences for each pair) |
| Continuous | Nominal (>= 3 groups) | 1-way ANOVA |
| Continuous | Continuous | Linear correlation (Pearson) or linear regression |
| Ordinal | Dichotomous, unpaired | Mann-Whitney U test or chi-square test for linear trend |
| Ordinal | Dichotomous, paired | Wilcoxon test |
| Ordinal | Ordinal | Spearman correlation or Kendall correlation |
| Ordinal | Continuous | Categorize the continuous variable and use Spearman correlation, Kendall correlation, or the chi-square test |
| Dichotomous | Dichotomous, unpaired | Chi-square test or Fisher exact probability test |
| Dichotomous | Dichotomous, paired | McNemar chi-square test |
| Dichotomous | Nominal | Chi-square test |
| Nominal | Nominal | Chi-square test |
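
One way to make Table #1 usable in an analysis script is to encode it as a lookup keyed by the two variable types. The sketch below is only that: the dictionary name, the key spelling, and the function are hypothetical, and the entries are abbreviated from the table.

```python
# Rows transcribed from Table #1; keys are (first variable, second variable).
BIVARIATE_TESTS = {
    ("continuous", "dichotomous, unpaired"): "2-sample t test",
    ("continuous", "dichotomous, paired"): "Paired t test",
    ("continuous", "nominal"): "1-way ANOVA",
    ("continuous", "continuous"): "Linear correlation (Pearson) or linear regression",
    ("ordinal", "dichotomous, unpaired"): "Mann-Whitney U test or chi-square test for linear trend",
    ("ordinal", "dichotomous, paired"): "Wilcoxon test",
    ("ordinal", "ordinal"): "Spearman correlation or Kendall correlation",
    ("ordinal", "continuous"): "Categorize the continuous variable, then Spearman, Kendall, or chi-square",
    ("dichotomous", "dichotomous, unpaired"): "Chi-square test or Fisher exact probability test",
    ("dichotomous", "dichotomous, paired"): "McNemar chi-square test",
    ("dichotomous", "nominal"): "Chi-square test",
    ("nominal", "nominal"): "Chi-square test",
}

def choose_bivariate_test(first, second):
    """Return the test listed for a pair of variable types, or None if not listed."""
    key = (first.strip().lower(), second.strip().lower())
    return BIVARIATE_TESTS.get(key)

print(choose_bivariate_test("Continuous", "Dichotomous, paired"))
```

Returning `None` for combinations that are not in the table keeps the lookup from suggesting a test the table does not support.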

**TABLE #2:**

**CHOICE OF STATISTICAL TECHNIQUE FOR MULTIVARIATE ANALYSIS** [i]

| Dependent variable | Independent variables | Test |
|---|---|---|
| Continuous | All categorical | ANOVA (analysis of variance) |
| Continuous | Mixture of categorical and continuous | ANCOVA (analysis of covariance) |
| Continuous | All continuous | Multiple linear regression |
| Dichotomous | All categorical | Multiple logistic regression or log-linear analysis |
| Dichotomous | Mixture of categorical and continuous | Logistic regression |
| Time-dependent dichotomous | Mixture of categorical and continuous | Cox's proportional hazards model |
| Dichotomous | All continuous | Logistic regression or discriminant function analysis |
| Nominal | All categorical | Log-linear analysis |
| Nominal | Mixture of categorical and continuous | Group the continuous variables and perform log-linear analysis |
| Nominal | All continuous | Discriminant function analysis, or categorize the continuous variables and perform log-linear analysis |

NB: Categorical includes nominal, ordinal, and dichotomous variables.

[i] Jekel et al., Epidemiology, Biostatistics, and Preventive Medicine, WB Saunders, page 175.