Forensic Medicine

Monday, August 31, 2015

Biostatistics

·         Type I Error: Concluding that a difference exists when it does not.
Type II Error: Concluding that a difference does not exist when it does.

·         Degrees of freedom in the chi-square test = (c-1) x (r-1), where 'c' is the number of columns and 'r' is the number of rows (e.g., a 2 x 3 table has (2-1) x (3-1) = 2 degrees of freedom).
For a single sample (e.g., a one-sample t-test), df = n-1.
For two samples (e.g., a two-sample t-test), df = n1+n2-2.

·         Bayes' theorem = the predictive accuracy of any test that is less than a perfect diagnostic test is influenced by
  1. the pretest likelihood (prevalence) of disease
  2. the criteria used to define a positive test result (see the sketch below)
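A minimal sketch of this in Python; the sensitivity, specificity, and prevalence figures are made up for illustration. The same test yields very different positive predictive values at different pretest likelihoods of disease.

```python
# Bayes' theorem applied to an imperfect diagnostic test: the positive
# predictive value depends on the pretest probability (prevalence).

def ppv(sensitivity, specificity, prevalence):
    """P(disease | positive test) by Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical test: 90% sensitive, 95% specific.
print(ppv(0.90, 0.95, 0.01))  # ~0.15 in a low-prevalence screening population
print(ppv(0.90, 0.95, 0.30))  # ~0.89 in a high-prevalence referral clinic
```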

·         Kappa (K) measures concordance between test results and gold standard
Analogous to Pearson correlation coefficient (r) for continuous data!

·         t-test checks difference between the means of 2 groups. Mr. T is mean.
·         ANOVA checks difference between the means of 3 or more groups. ANOVA = ANalysis Of VAriance of 3 or more variables.
·         Chi Square Test checks difference between 2 or more percentages or proportions of categorical outcomes (not mean values). Chi Square Test = compare percentages (%) or proportions.
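As a rough sketch of how these three choices map onto code (simulated data; scipy assumed available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b, c = (rng.normal(loc, 1.0, size=30) for loc in (0.0, 0.5, 1.0))

# Means of 2 groups -> t-test
print(stats.ttest_ind(a, b))

# Means of 3 or more groups -> one-way ANOVA
print(stats.f_oneway(a, b, c))

# Proportions of categorical outcomes -> chi-square test
table = np.array([[20, 10],   # e.g., exposed: outcome present / absent
                  [12, 18]])  # unexposed: outcome present / absent
print(stats.chi2_contingency(table))
```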

·         Nominal and ordinal data are best described and represented in bar charts and pie graphs. Discrete and continuous data are best displayed as histograms (which appear like a bar chart, but with the bars touching each other), frequency polygons, scatter plots, box plots, and line graphs.

·         The measures of central tendency: the mean, the median, and the mode.
The median is a more stable indicator of central tendency because it is less affected by outliers and skewed data.
·         Four commonly used measures of dispersion are the range, the interquartile range, the variance, and the standard deviation.

·         When the distribution of data is not symmetric and cannot be described by a Gaussian distribution, it is still possible to generalize regarding data distribution or dispersion. In this situation, Chebyshev's inequality can be used as a conservative counterpart to the empirical rule given below. It states that at least [1 - (1/k²)] of the data lie within k standard deviations of the mean (e.g., at least 75% within 2 standard deviations). It can be very useful for application to skewed data sets.
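A quick numerical check of the bound on a right-skewed sample; the exponential data are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100_000)  # right-skewed, non-Gaussian

for k in (2, 3):
    within = np.mean(np.abs(x - x.mean()) <= k * x.std())
    bound = 1 - 1 / k**2
    print(f"k={k}: observed {within:.3f} >= guaranteed {bound:.3f}")
```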

·         Kurtosis refers to the appearance of the peak of a curve, as well as to its tails, relative to a normal distribution. Data distributions with high kurtosis generally exhibit high and steep peaks near the mean, with wider tails; data with low kurtosis exhibit broader and flatter peaks than the normal distribution. A Gaussian distributed curve has zero skew and zero excess kurtosis. In a kurtotic distribution, the variance of the data remains unchanged.

·         THE EMPIRICAL RULE (for normally distributed data):
  1. Roughly 68% of the data fall within 1 standard deviation of the mean.
  2. Roughly 95% of the data fall within 2 standard deviations of the mean.
  3. Roughly 99.7% of the data fall within 3 standard deviations of the mean.

·         HYPOTHESIS TESTING:
  1. Alpha (α): probability of making a type I error
  2. Type I error: null rejection error (i.e., the null should not have been rejected)
  3. Beta (β): probability of making a type II error
  4. Type II error: null acceptance error (i.e., the null should have been rejected but was not)
  5. Power = 1 - β (the probability of detecting a true difference; see the sketch below)
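These definitions can be checked by simulation. A sketch with repeated two-sample t-tests (all numbers illustrative): under a true null the rejection rate approximates alpha; under a real difference it approximates power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n, trials = 0.05, 30, 5_000

def rejection_rate(mean_shift):
    """Fraction of t-tests that reject the null at level alpha."""
    hits = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(mean_shift, 1.0, n)
        hits += stats.ttest_ind(a, b).pvalue < alpha
    return hits / trials

print(rejection_rate(0.0))  # ~0.05: type I error rate (alpha)
print(rejection_rate(0.8))  # ~0.86: power (1 - beta) for this effect size and n
```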

·         The ROC is a graph that represents the relationship between sensitivity and specificity, with the "true positive rate" (i.e., sensitivity) appearing on the y-axis, and the "false positive rate" (1-specificity) appearing on the x-axis. This type of graph can help investigators assess the utility of a diagnostic test, for it can help determine the appropriate cutoff point for a screening test. In general, the point on the curve that lies closest to the upper left-hand corner of the graph is taken as the point at which both sensitivity and specificity are maximized.
A useless ROC curve is the diagonal line of identity (slope of 1, area under the curve of 0.5). The more the curve bends toward the upper left of the graph, the better the test is said to perform.
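A sketch of building an ROC curve and picking the cutoff nearest the upper left corner, using simulated biomarker scores and scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(3)
y = np.r_[np.zeros(200), np.ones(200)]      # 0 = healthy, 1 = diseased
scores = np.r_[rng.normal(0.0, 1.0, 200),   # hypothetical biomarker:
               rng.normal(1.5, 1.0, 200)]   # diseased patients score higher

fpr, tpr, thresholds = roc_curve(y, scores)
print("AUC:", roc_auc_score(y, scores))     # 0.5 = useless diagonal line

# Cutoff closest to the upper left corner (FPR 0, TPR 1):
best = np.argmin(fpr**2 + (1 - tpr)**2)
print("cutoff:", thresholds[best], "sens:", tpr[best], "1-spec:", fpr[best])
```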

·         Linear regression is a type of analysis used to predict the value of a continuous dependent outcome variable from its relationship with one or more independent variables; it is used to predict how changes in one (in the case of simple linear regression) or many (in the case of multiple linear regression) independent variables affect the value of the dependent outcome of interest, conventionally represented as y.
Logistic regression is a variation of linear regression used to describe the relationship between two or more variables when the dependent outcome variable is dichotomous and the independent variables are of any type.
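A side-by-side sketch with scikit-learn (simulated data): linear regression for a continuous outcome, logistic regression for a dichotomous one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))                    # independent variables

# Continuous dependent outcome y -> linear regression
y_cont = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, 100)
lin = LinearRegression().fit(X, y_cont)
print(lin.coef_, lin.predict(X[:1]))             # predicted value of y

# Dichotomous dependent outcome (0/1) -> logistic regression
y_bin = (X[:, 0] + rng.normal(0, 0.5, 100) > 0).astype(int)
log = LogisticRegression().fit(X, y_bin)
print(log.predict_proba(X[:1]))                  # predicted probability
```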

·         Survival analysis aims to determine probabilities of "survival" for individuals from a designated starting time to a later point; this interval is called the survival time. The endpoint under study is referred to as a failure. Failure does not always signify death, but may also define outcomes such as the development of a particular disease or a disease relapse.
Survival analysis requires an approach that is different from logistic and linear regression for two reasons: (1) the data lack a normal distribution (i.e., the distribution of survival data tends to be skewed to the right) and (2) data censoring (i.e., there are incomplete observation times due to loss to follow-up or patient withdrawal from a study). Potential tools for analysis include life tables, the Kaplan-Meier method (i.e., the product-limit method), the log-rank test, and Cox regression.
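A minimal Kaplan-Meier sketch, assuming the third-party lifelines package and fabricated follow-up data with right-censoring:

```python
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(5)
durations = rng.exponential(24.0, size=50)   # months to event or last follow-up
observed = rng.random(50) < 0.7              # False = censored observation

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.median_survival_time_)
print(kmf.survival_function_.head())         # product-limit estimate of S(t)
```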

·         VARIABLES AND SCALES OF MEASUREMENT
  1. Variables take on various values.
  2. Independent variables are used to estimate values for dependent variables.
  3. Nominal scales are used to arbitrarily group qualitative data.
  4. Ordinal scales show rank but give no information about the distance between values.
  5. Interval scales have meaningful rank and spacing between values, but no true zero.
  6. Ratio scales have a meaningful zero, which gives meaning to the ratio of two values.

·         The coefficient of variation is a measure of the variability of the values in a sample, relative to the value of the mean. Mathematically, it is simply the ratio of standard deviation to the mean. Of note, the coefficient of variation is truly meaningful only for values measured on a ratio scale.
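For example (made-up ratio-scale measurements):

```python
import numpy as np

x = np.array([4.1, 4.4, 3.9, 4.6, 4.0])   # hypothetical ratio-scale values
cv = np.std(x, ddof=1) / np.mean(x)       # standard deviation / mean
print(f"CV = {cv:.1%}")
```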

·         Measurement error consists of systematic and random error.
Random errors increase the dispersion and are equally likely to fall on either side of the true value. Random errors are considered to occur by chance and thus are not predictable. All data sampling is subject to random error. As the number of measurements in a sample increases, the discrepancy caused by random error will decrease and the summation of random error will tend toward zero.
Systematic measurement error, also known as bias, is caused by a consistent fault in some aspect of the measurement process that causes the values to center around a value other than the true value. Increasing the number of observations has no effect on systematic error.
Precision is lack of random error, causing a close grouping in repeated measurements.
Validity is the lack of systematic error (i.e., bias) leading to measured values approaching the true value.   

·         PROBABILITY DISTRIBUTIONS
  1. Probability distributions describe the likelihood of dichotomous events or traits.
  2. They demonstrate how the occurrence of a dichotomous event becomes increasingly improbable (but not impossible) the further it is from the true incidence of that event or trait in the population.
  3. They have a mean probability of occurrence and a standard deviation about that mean.
  4. They are different for every combination of sample size and population incidence.
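For a dichotomous trait this is the binomial distribution; a sketch with scipy (the 20% incidence and samples of 50 are made up):

```python
from scipy import stats

dist = stats.binom(n=50, p=0.20)   # samples of 50, 20% population incidence
print(dist.mean(), dist.std())     # mean count and standard deviation about it
print(dist.pmf(10))                # probability of exactly the expected count
print(dist.pmf(25))                # far from the true incidence: improbable,
                                   # but not impossible
```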

·         A frequency polygon is a graph formed by joining the midpoints of the tops of histogram columns with straight lines. A frequency polygon smooths out abrupt changes that can occur in histogram plots and helps demonstrate continuity of a variable being analyzed.

·         Geographic distribution map: also known as an epidemiologic map, a case map, or a choropleth map, this type of presentation displays quantitative information in defined geographic areas.

·         The stem-and-leaf diagram, invented by John Tukey, is a unique method of summarizing data without losing the individual data points. The stems in these diagrams are the left-hand digits of the numeric data, and the leaves are the last digit to the right of the stem. The frequency of each stem value (which is the number of leaves) is given in a separate column.

·         A box-and-whisker plot is a visual representation of data as a box divided by a middle line drawn at the value of the median. The lower edge of the box represents the first quartile (i.e., 25th percentile), and the upper edge represents the third quartile (i.e., 75th percentile). The width of the box is arbitrary. Whisker lines are drawn from the minimum value to the bottom of the box and from the maximum value to the top of the box.

·         In order to make α and β as small as possible, α is usually fixed first. Because β is inversely related to α, β will tend to get larger if α is made smaller at a fixed sample size. Consequently, for any designated α, increased sample sizes will lead to statistical tests with greater power and smaller β values.

·         Data mining refers to the potentially inappropriate use of repeated subgroup analysis on a data set until a relationship that is statistically significant is found. Remember that, by convention, at α = 0.05, there is a 1 in 20 chance that a finding will be statistically significant when, in fact, no true difference exists (i.e., a type I error). Consequently, if one does repeated subgroup analysis of a data set, any findings of statistical significance should be confirmed, if possible, by a dedicated research study of their own.

·         A confidence interval is a range of likely values defined by upper and lower endpoints (i.e., confidence limits) within which the true value of an unknown population parameter is likely to fall, based on a preset confidence level.
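A sketch of a 95% confidence interval for a population mean, computed from a made-up sample with the t distribution:

```python
import numpy as np
from scipy import stats

x = np.array([5.2, 4.8, 5.5, 5.0, 4.7, 5.3, 5.1, 4.9])  # hypothetical sample
mean, sem = np.mean(x), stats.sem(x)

low, high = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=sem)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```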

·         In a Venn diagram, events are represented as simple geometric figures, with overlapping areas of events represented by intersections and unions of the figures. Two mutually exclusive events (A and B) are represented in a Venn diagram by two nonintersecting areas.

·         Sample size increases as variance increases.
Sample size increases as the significance level is made smaller (i.e., α decreases).
Sample size increases as the required power increases (i.e., 1-β increases).
Sample size decreases as the absolute value of the distance between the null and alternative means increases.
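These relationships can be verified numerically, assuming the statsmodels package (effect sizes here are Cohen's d, chosen arbitrarily):

```python
from statsmodels.stats.power import TTestIndPower

solve = TTestIndPower().solve_power

# Required n per group for a two-sample t-test:
print(solve(effect_size=0.5, alpha=0.05, power=0.80))  # baseline, ~64
print(solve(effect_size=0.5, alpha=0.01, power=0.80))  # smaller alpha -> larger n
print(solve(effect_size=0.5, alpha=0.05, power=0.90))  # more power -> larger n
print(solve(effect_size=0.8, alpha=0.05, power=0.80))  # bigger difference -> smaller n
```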

·         A kappa of less than 0.6 is usually considered unacceptable, whereas a kappa greater than 0.80 is excellent. A weighted kappa can be used wherein some disagreements are considered worse than others (e.g., normal vs. highly malignant biopsy).
The most common version of the kappa statistic, Cohen's kappa, is calculated by comparing expected versus observed agreement.
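A sketch with scikit-learn and two hypothetical raters' biopsy readings:

```python
from sklearn.metrics import cohen_kappa_score

rater_a = ["normal", "benign", "malignant", "normal", "malignant", "benign"]
rater_b = ["normal", "benign", "benign",    "normal", "malignant", "normal"]

print(cohen_kappa_score(rater_a, rater_b))   # 0.5 here: observed vs. expected agreement
# For ordered categories, weights="linear" or "quadratic" gives a weighted kappa
# (pass labels=[...] so the distance between categories is meaningful).
```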

·         Many tests assume normality (i.e., the data follow a normal or binomial distribution), and these are called parametric tests. Examples of parametric tests include Pearson's correlation and linear regression tests.
Tests that make no assumption of normality are termed nonparametric. Examples of nonparametric tests include the Wilcoxon signed rank test, the Mann-Whitney U test, the Kruskal-Wallis test, and the McNemar test.
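A sketch of the most common nonparametric calls in scipy, on simulated skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
before = rng.exponential(2.0, 25)             # skewed, so normality is doubtful
after = before * rng.uniform(0.5, 1.1, 25)    # paired follow-up measurements

print(stats.wilcoxon(before, after))          # paired -> Wilcoxon signed rank
print(stats.mannwhitneyu(rng.exponential(2.0, 25),
                         rng.exponential(3.0, 25)))   # 2 groups -> Mann-Whitney U
print(stats.kruskal(rng.exponential(2.0, 25),
                    rng.exponential(3.0, 25),
                    rng.exponential(4.0, 25)))        # 3+ groups -> Kruskal-Wallis
```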

·         PEARSON'S CORRELATION COEFFICIENT
  1. Both variables are continuous.
  2. Both variables are normally distributed.
  3. Data are paired.

·         One of the more common test methods for regression line fit employs the F-test. The test statistic is the variance ratio obtained from the ANOVA (analysis of variance) methodology.

·         SPEARMAN'S RANK CORRELATION
1.       No assumptions of normality are made for either variable.
2.       Variables are continuous or discrete.
3.       Data are paired.

·         WILCOXON SIGNED RANK TEST
  1. No assumptions of normality are made about the data (i.e., nonparametric).
  2. Data are paired, continuous, or discrete.
  3. We want to know whether or not the difference within each pair is significant.

·         MANN-WHITNEY U TEST
  1. Data are not assumed to follow a normal distribution.
  2. Data are not paired but represent measurements made in two groups that differ in some way (e.g., exposed vs. unexposed).
  3. We want to know whether or not the difference in the measurements between the two groups is statistically significant by comparing medians.

·         A method similar to Spearman's rank correlation is Kendall's tau. Like Spearman's rank correlation, it is applied to paired data, and the data are assumed to be continuous or discrete. There are no assumptions of normality. In most cases, the results of Spearman's rank correlation and Kendall's tau are comparable, but the latter is more difficult to calculate.
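All three coefficients are one-liners in scipy (paired, simulated data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=40)
y = 0.8 * x + rng.normal(scale=0.5, size=40)   # paired measurements

print(stats.pearsonr(x, y))     # parametric: assumes both variables normal
print(stats.spearmanr(x, y))    # rank-based, no normality assumption
print(stats.kendalltau(x, y))   # usually comparable to Spearman's rho
```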

·         CHI-SQUARE TEST
1.       Variables are categoric.
2.       Data represent counts and can be represented in a table of r rows and c columns, an r × c contingency table.
3.       Cochran's criteria should be met to apply the basic chi-square test.
Cochran's criteria should be fulfilled if the chi-square test statistic is to be used as noted above. These criteria are:
All expected values in each cell should have a frequency count ≥ 1.
At least 80% of expected cell values should be ≥ 5.
If Cochran's criteria for using the chi-square technique are not met (e.g., if the counts in some cells are too low), one can apply the Fisher exact test for 2 × 2 tables.
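A sketch with scipy and a made-up 2 × 2 table; the `expected` array is what Cochran's criteria are checked against:

```python
import numpy as np
from scipy import stats

table = np.array([[18, 7],     # hypothetical 2 x 2 contingency table of counts
                  [6, 19]])

chi2, p, dof, expected = stats.chi2_contingency(table)
print(dof, expected)           # df = (2-1) x (2-1) = 1; inspect expected counts

# If expected counts are too low for a 2 x 2 table, use Fisher's exact test:
odds_ratio, p_exact = stats.fisher_exact(table)
print(p_exact)
```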

·         The Yates correction is a correction for continuity for 2 × 2 tables. Some experts believe that this correction is necessary, especially when frequency counts are low, because a test statistic computed from discrete counts is being compared with the continuous chi-square distribution.
Mathematically, the Yates correction is accomplished by subtracting 0.5 from the absolute value of the difference between the observed (O) and expected (E) values before squaring the numerator: chi-square = Σ [(|O - E| - 0.5)² / E].

·         According to Katz, "Multivariable analysis is a tool for determining the relative contributions of different causes to a single event or outcome."
USES OF MULTIVARIABLE ANALYSIS
  1. To quantify associations.
  2. To look for interaction between independent variables.
  3. To adjust for potential confounders in a controlled study.
  4. To develop models to predict values or probabilities of certain outcomes.

·         An F statistic has two degrees-of-freedom values: one for the between-group variance (the numerator) and one for the within-group variance (the denominator).

·         The Cox proportional hazards model is a form of multivariable regression model used to analyze survival curves and clinical trials with a dichotomous outcome (e.g., dead vs. alive or diseased vs. disease-free). It allows comparison of subjects with differing characteristics and different starting and ending points of observation over time.
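A minimal sketch, assuming the lifelines package and fabricated survival data; the exp(coef) column in the summary is the hazard ratio discussed next.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical data: follow-up time, event flag (0 = censored), covariates.
df = pd.DataFrame({
    "time":    [5, 8, 12, 3, 20, 14, 7, 18],
    "event":   [1, 1, 0, 1, 0, 1, 1, 0],
    "age":     [64, 58, 47, 71, 50, 66, 60, 45],
    "treated": [0, 1, 1, 0, 1, 0, 0, 1],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()   # the exp(coef) column is the hazard ratio per covariate
```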

·         The hazard ratio is the measure of effect in survival analysis that describes the exposure-outcome relationship. A hazard ratio of 1 means no effect. A hazard ratio of 5 indicates that the exposed group has 5 times the hazard of the unexposed group. A hazard ratio of 1/5 indicates that the exposed group has one fifth the hazard of the unexposed group.
