Univariate Summary

Menu locations:

Analysis_Descriptive_Univariate Summary;

Analysis_Descriptive_Weighted Univariate Summary.

This function provides measures of location and dispersion that describe the data in a worksheet column. For each selected variable you are given the number of observations, arithmetic mean, sum, variance, standard deviation, standard error of the arithmetic mean, variance coefficient (standard deviation divided by the mean), confidence interval for the arithmetic mean, geometric mean, coefficient of skewness, coefficient of kurtosis, maximum, upper quartile, median, lower quartile, minimum and range. You can also choose to calculate an additional quantile, which is appended to the results listed above. Incalculable results are displayed as missing data using an asterisk (*).

If you select more than one column of data to describe then you are given an option to save the results to worksheet columns. Each saved column holds one statistic (mean, median etc.) and each of its rows corresponds to one of the variables/columns you selected to describe.

Confidence limits (boundaries of the confidence interval) are given for the arithmetic mean. Please see quantile confidence interval for confidence intervals for the median and other measures of location.

Please refer to one of the general textbooks listed in the reference section for discussion of the application and relative merits of individual descriptive statistics.

Definitions

Valid data and missing data

For each worksheet column that you select, the number of valid data is the number of cells that can be interpreted as numbers. The remaining cells, which cannot be interpreted as numbers (e.g. an empty cell, an asterisk or a text label), are counted as missing. The sample size used in the calculations below is the number of valid data.
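The counting rule above can be sketched in a few lines of Python. This is only an illustration, not StatsDirect's own parser: the function name `count_valid_and_missing` is hypothetical, cells are assumed to arrive as strings (or `None` for an empty cell), and treating non-finite values such as "nan" as missing is an assumption.

```python
import math

def count_valid_and_missing(cells):
    """Return (valid, missing) counts for one worksheet column."""
    valid = 0
    for cell in cells:
        try:
            value = float(cell)
        except (TypeError, ValueError):
            continue  # empty cell, asterisk, text label or None
        # Assumption: "nan" and "inf" parse as floats but are not usable data.
        if math.isfinite(value):
            valid += 1
    return valid, len(cells) - valid

column = ["299.85", "", "*", "n/a", "300.07", None]
print(count_valid_and_missing(column))  # (2, 4)
```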

Sum, mean, variance, standard deviation, standard error and variance coefficient

- where Σ is the summation for all observations (x_i) in a sample, x bar is the sample (arithmetic) mean, n is the sample size, s² is the sample variance, s is the sample standard deviation, SEM is the standard error of the sample mean, upper and lower CL are the confidence limits of the confidence interval for the mean, t_{α, n-1} is the (100*α)% two tailed quantile from the Student t distribution with n-1 degrees of freedom, and VC is the variance coefficient.
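As a rough illustration of how these quantities relate, the sketch below uses the usual textbook forms: variance with an n-1 divisor, SEM = s/√n, confidence limits = mean ± t × SEM, and VC = s/mean. These forms are assumptions consistent with the symbol definitions above. The function name `summarise` is hypothetical, and the two-tailed t quantile is passed in as `t_crit` (for example from statistical tables) because the Python standard library has no Student t distribution.

```python
import math

def summarise(xs, t_crit):
    """Parametric summary of a sample; t_crit is the two-tailed t quantile for n-1 df."""
    n = len(xs)
    total = sum(xs)
    mean = total / n
    # Sample variance with n-1 in the denominator (assumed definition).
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    sd = math.sqrt(variance)
    sem = sd / math.sqrt(n)
    return {
        "n": n,
        "sum": total,
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "sem": sem,
        "lower_cl": mean - t_crit * sem,
        "upper_cl": mean + t_crit * sem,
        "vc": sd / mean,  # variance coefficient; undefined when the mean is zero
    }

# Eight observations, 7 degrees of freedom; 2.364624 is the tabulated 95% two-tailed value.
result = summarise([2, 4, 4, 4, 5, 5, 7, 9], t_crit=2.364624)
print(result["mean"], result["variance"])
```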

Skewness and kurtosis

- where Σ is the summation for all observations (x_i) in a sample, x bar is the sample mean and n is the sample size. Note that there are other definitions of these coefficients used by some other statistical software. StatsDirect uses the standard definitions for which critical values are published in standard statistical tables (Pearson and Hartley, 1970; Stuart and Ord, 1994).
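The sketch below assumes the moment-ratio definitions used with the classical tables: skewness as m3 / m2^(3/2) and kurtosis as m4 / m2², where m_k is the k-th central moment with divisor n. Under this definition a normal distribution has kurtosis near 3, which is in line with the value reported in the example on this page. Whether this matches StatsDirect's formulae exactly is an assumption, and the function name `skewness_kurtosis` is hypothetical.

```python
def skewness_kurtosis(xs):
    """Return (skewness, kurtosis) as moment ratios, or (None, None) if incalculable."""
    n = len(xs)
    mean = sum(xs) / n
    # Central moments with divisor n (assumed definition).
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    if m2 == 0:
        return None, None  # all observations identical
    return m3 / m2 ** 1.5, m4 / m2 ** 2

skew, kurt = skewness_kurtosis([1, 2, 3, 4, 5])
print(skew, kurt)  # symmetric sample: skewness is zero, kurtosis is about 1.7
```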

Geometric mean

The geometric mean is a useful measure of central tendency for samples that are log-normally distributed (i.e. the logarithms of the observations are from an approximately normal distribution). The geometric mean is not calculated for samples that contain negative values.

- where Σ is the summation for all observations (x_i) in a sample, ln is the natural (base e) logarithm, exp is the exponential function (anti-logarithm for base e), GM is the sample geometric mean and n is the sample size.
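Following the symbol definitions above, the geometric mean can be sketched as the exponential of the mean of the natural logarithms. The function name `geometric_mean` is hypothetical. The text above mentions only negative values as incalculable; returning `None` for zeros as well is an assumption made here because ln(0) is undefined.

```python
import math

def geometric_mean(xs):
    """exp(mean(ln x)); returns None when the result is incalculable."""
    if not xs or any(x <= 0 for x in xs):
        return None  # assumption: zeros are treated like negative values
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

print(geometric_mean([1, 10, 100]))   # approximately 10
print(geometric_mean([1, -2, 3]))     # None
```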

Weights

If weights are selected then the weights that you supply are first normalised so that they sum to the total number of observations n:

- where v_i is a user supplied weight and w_i is the normalised weight.
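The normalisation described above amounts to scaling each supplied weight by n / Σv. A minimal sketch follows, with hypothetical function names; the weighted mean is shown only as the standard Σ(w_i x_i) / Σ(w_i) form, which is assumed rather than taken from this page.

```python
def normalise_weights(vs):
    """Scale user-supplied weights v_i so that the normalised w_i sum to n."""
    n = len(vs)
    total = sum(vs)
    return [n * v / total for v in vs]

def weighted_mean(xs, vs):
    """Weighted arithmetic mean using normalised weights (which sum to n)."""
    ws = normalise_weights(vs)
    return sum(w * x for w, x in zip(ws, xs)) / len(xs)

print(normalise_weights([1, 1, 2]))         # [0.75, 0.75, 1.5]
print(weighted_mean([1, 2, 3], [1, 1, 2]))  # 2.25
```

Because the weights are rescaled to sum to n, setting every weight to the same value reproduces the unweighted mean.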

The following formulae replace the mean, variance and moments calculations defined above when weights are used:

Median, quartiles and range

For samples that are not from an approximately normal distribution, for example when data are censored to remove very large and/or very small values, the following nonparametric statistics should be used in place of the arithmetic mean, its variance and the other parametric measures above.

Median (50th centile, quantile 0.5), lower quartile (25th centile, quantile 0.25) and upper quartile (75th centile, quantile 0.75) are defined generally as quantiles:

Two different quantile definitions (Weisberg, 1992; Gleason, 1997; Stuart and Ord, 1994) are used in the summary statistics (see also: quantiles). The first is the conventional quantile that is also used in the quantile confidence interval function, and the second allows for weights:

Method 1:

- where p is a proportion, Q is the pth quantile (e.g. median is Q(0.5)), fix is the integer part of a real number, h is the fractional part of order statistic i, u is an observation from a sample after it has been ordered from smallest to largest value and n is the sample size.
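One common reading of these symbols is the conventional (n+1)p rule: the position i = p(n+1) is split into its integer part fix(i) and fractional part h, and the quantile is interpolated between the neighbouring order statistics. The sketch below assumes that reading, so it may differ in detail from the formula StatsDirect uses. The function name `quantile_method1` and the clamping to the smallest and largest observations at the extremes are also assumptions.

```python
def quantile_method1(xs, p):
    """Conventional quantile: interpolate at position p*(n+1) in the ordered sample."""
    u = sorted(xs)            # order statistics u_1 <= ... <= u_n
    n = len(u)
    pos = p * (n + 1)
    i = int(pos)              # fix: integer part
    h = pos - i               # fractional part
    if i < 1:
        return u[0]           # assumption: clamp below the first order statistic
    if i >= n:
        return u[-1]          # assumption: clamp above the last order statistic
    # u[i-1] is u_i in 1-based notation, u[i] is u_(i+1).
    return (1 - h) * u[i - 1] + h * u[i]

print(quantile_method1([4, 1, 3, 2], 0.5))               # 2.5
print(quantile_method1([1, 2, 3, 4, 5, 6, 7], 0.25))     # 2.0
```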

Method 2:

- where p is a proportion, Q is the pth quantile (e.g. median is Q(0.5)), u is an observation from a sample after it has been ordered from smallest to largest value, n is the sample size, w is a weight normalised so that it sums to n and

Technical validation

The computational methods used in StatsDirect univariate summary statistics, including this function, provide 15 decimal places of precision. This is tested against known standards such as the reference data set used in the example below.

Example

Test workbook (Parametric worksheet: Michelson).

The data are 100 measurements of the speed (millions of meters per second) of light in air recorded by Michelson in 1879 (Dorsey, 1944). The American National Institute of Standards and Technology use these data as part of the Statistical Reference Datasets for testing statistical software (McCullough and Wilson, 1999; http://www.itl.nist.gov/div898/strd/).

Open the test workbook and select the "Michelson" column. Choose Univariate Summary from the Descriptive section of the analysis menu and click on OK when you see a list of descriptive statistics options.

Results from StatsDirect (with decimal places in Analysis_Options set to 12 and centile type 2 selected):

Descriptive statistics

Variables	Michelson
Valid data	100
Missing data	0
Sum	29985.24
Mean	299.8524
Variance	0.006242666667
Standard deviation	0.079010547819
Variance coefficient	0.000263498134
Standard error of mean	0.007901054782
Upper 95% CL of mean	299.868077406834
Lower 95% CL of mean	299.836722593166
Geometric mean	299.852389694496
Skewness	-0.01825961396
Kurtosis	3.263530532311
Maximum	300.07
Upper quartile	299.895
Median	299.85
Lower quartile	299.805
Minimum	299.62
Range	0.45
Centile 95	299.98
Centile 5	299.73
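Several of the reported values can be cross-checked against one another without the raw data. The sketch below recomputes the mean, standard deviation, standard error, variance coefficient, range and confidence limits from other rows of the table. It assumes the textbook relationships (SEM = s/√n, VC = s/mean, CL = mean ± t × SEM) and uses 1.984217 as an approximate tabulated 95% two-tailed t quantile for 99 degrees of freedom. The dictionary keys and the function name `consistency_gaps` are hypothetical.

```python
import math

reported = {
    "n": 100, "sum": 29985.24, "mean": 299.8524,
    "variance": 0.006242666667, "sd": 0.079010547819,
    "vc": 0.000263498134, "sem": 0.007901054782,
    "upper_cl": 299.868077406834, "lower_cl": 299.836722593166,
    "max": 300.07, "min": 299.62, "range": 0.45,
}

def consistency_gaps(r, t_crit=1.984217):
    """Absolute differences between printed values and values derived from other rows."""
    return {
        "mean": abs(r["sum"] / r["n"] - r["mean"]),
        "sd": abs(math.sqrt(r["variance"]) - r["sd"]),
        "sem": abs(r["sd"] / math.sqrt(r["n"]) - r["sem"]),
        "vc": abs(r["sd"] / r["mean"] - r["vc"]),
        "range": abs((r["max"] - r["min"]) - r["range"]),
        "upper_cl": abs(r["mean"] + t_crit * r["sem"] - r["upper_cl"]),
        "lower_cl": abs(r["mean"] - t_crit * r["sem"] - r["lower_cl"]),
    }

for name, gap in consistency_gaps(reported).items():
    print(f"{name}: difference {gap:.2e}")
```

Small non-zero differences are expected because the table is printed to 12 decimal places and the t quantile is rounded.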