Statistics
Lecture 2
Descriptive Statistics: Qualitative and Quantitative Data Items

• David Bartl
• Statistics
• INM/BASTA

Outline of the lecture
• Bar chart
• Histogram
• Measures of central tendency (arithmetic mean, mode, median)
• Measures of variability (range, variance, coefficient of variation)
• Measures of data concentration (skewness, kurtosis)

Statistics
• The purpose of statistics is to present data in a comprehensive form.
• The goal is to analyse the information and reveal relations hidden in the data.
• There are two approaches:
  — Descriptive statistics (categorization, characteristics) → we deal with it now
  — Inductive statistics (assumptions about the origin of the data, probability distributions) → we shall deal with it later

Data items = Variables
• Variable = data item
  — Quantitative = numerical: Discrete or Continuous
  — Qualitative = categorical: Nominal or Ordinal

Example: Employees (a sample of the Dataset)

ID    Gender  Age  Marital Status  Education           Position                Salary per Year  Evaluation
5060  M       65   divorced        secondary           worker                  258800           4
1030  M       60   divorced        university          manager                 630000           2
3049  M       60   married         primary             operator                436600           5
5047  M       60   widowed         primary+vocational  worker                  240600           3
5061  M       60   widowed         primary+vocational  worker                  241800           1
5087  M       60   widowed         secondary           worker                  239500           —
5133  F       60   married         secondary           worker                  241100           4
5177  F       60   widowed         secondary           worker                  239600           4
3030  F       58   widowed         primary             operator                422600           1
3014  F       56   widowed         university          operator                303600           3
5012  F       56   widowed         primary+vocational  worker                  223100           4
5056  M       56   divorced        primary             worker                  225200           5
5101  M       56   unmarried       primary+vocational  worker                  224600           4
5106  M       56   married         primary+vocational  worker                  226100           7
5146  F       56   married         primary+vocational  worker                  224900           3
5153  M       56   divorced        secondary           worker                  224500           4
5189  M       56   married         primary+vocational  worker                  224600           1
5196  M       56   widowed         primary+vocational  worker                  222800           3
1031  M       55   married         university          manager                 429000           —
5016  M       55   divorced        secondary           administrative officer  259000           5
5021  F       55   married         primary+vocational  worker                  220200           —
5062  F       55   widowed         primary+vocational  worker                  221400           5
5107  M       55   divorced        primary+vocational  worker                  220500           4
5154  F       55   widowed         primary+vocational  worker                  219200           5
5195  M       55   married         primary+vocational  worker                  219400           6

Methods to present data in a comprehensive form

Bar chart & Histogram of frequencies
• Bar chart
  — used for qualitative [categorical (nominal or ordinal)] data items
  — can also be used for discrete numerical data items
  — presents the frequency of each category by the height of a rectangular bar (the height is proportional to the frequency)
  — there are gaps between the bars, i.e. the bars are not adjacent

Bar chart of frequencies for qualitative data items
• Example: The dataset of the employees.
• We now examine the nominal data item “Position”:
• Table:

Position                Frequency (number)  Relative frequency
manager                  10                   5.0 %
administrative officer   11                   5.5 %
operator                 29                  14.5 %
worker                  150                  75.0 %
TOTAL                   200                 100.0 %

• Bar chart – the frequencies (numbers) of the nominal data item “Position”:
[Figure: bar chart, x-axis “Position”, y-axis “Frequency”; bar heights: manager 10, administrative officer 11, operator 29, worker 150]

Bar chart for ordinal qualitative data items
• Example: The dataset of the employees.
• We now examine the ordinal data item “Evaluation”:
• Table:

Evaluation       Frequency (number)  Relative frequency
1 — very bad      19                  11.243 %
2 — bad           20                  11.834 %
3 — rather bad    47                  27.811 %
4 — acceptable    32                  18.935 %
5 — quite good    23                  13.609 %
6 — good          16                   9.467 %
7 — very good     12                   7.101 %
TOTAL            169                 100.000 %
(rounded to 3 decimal places)

• Bar chart – the frequencies (numbers) of the ordinal data item “Evaluation”:
[Figure: bar chart, x-axis “Evaluation”, y-axis “Frequency”; bar heights for evaluations 1 to 7: 19, 20, 47, 32, 23, 16, 12]

Measures of central tendency of the data item

Histogram of frequencies for quantitative data items
• Example: The dataset of the employees.
• We now examine the numerical data item “Age”, treated as a continuous variable:
• Table:

Age interval  Frequency (number)  Cumulative frequency  Relative frequency  Cumulative relative frequency
 0 < x ≤ 18    6                    6                     3.0 %                3.0 %
18 < x ≤ 23   30                   36                    15.0 %               18.0 %
23 < x ≤ 28   13                   49                     6.5 %               24.5 %
28 < x ≤ 33   40                   89                    20.0 %               44.5 %
33 < x ≤ 38   11                  100                     5.5 %               50.0 %
38 < x ≤ 43   48                  148                    24.0 %               74.0 %
43 < x ≤ 48   16                  164                     8.0 %               82.0 %
48 < x ≤ 53   26                  190                    13.0 %               95.0 %
53 < x ≤ 58    9                  199                     4.5 %               99.5 %
58 < x ≤ 63    0                  199                     0.0 %               99.5 %
63 < x ≤ 99    1                  200                     0.5 %              100.0 %

Histogram of frequencies for continuous data items
• Histogram – the frequencies (numbers) of the continuous data item “Age”:
[Figure: histogram, x-axis “Age” with class boundaries 18, 23, 28, 33, 38, 43, 48, 53, 58, 63, y-axis “Frequency”]

• If the (ordinary) histogram is used to display relative frequencies, then
  — if the variable is continuous, the histogram gives an estimate of the underlying probability density (after dividing each relative frequency by the width of its interval)
  — if the variable is discrete, the histogram gives an estimate of the underlying probability distribution
• If the cumulative histogram is used to display the cumulative relative frequencies and the variable is continuous, then the cumulative histogram gives an estimate of the cumulative distribution function.

The suggested number of the intervals in the histogram

Histogram of frequencies for quantitative data items

Salary interval         Frequency (number)  Cumulative frequency  Relative frequency  Cumulative relative frequency
      0 < x ≤  70 000    0                    0                     0.0 %                0.0 %
 70 000 < x ≤ 140 000   60                   60                    30.0 %               30.0 %
140 000 < x ≤ 210 000   69                  129                    34.5 %               64.5 %
210 000 < x ≤ 280 000   40                  169                    20.0 %               84.5 %
280 000 < x ≤ 350 000   15                  184                     7.5 %               92.0 %
350 000 < x ≤ 420 000    6                  190                     3.0 %               95.0 %
420 000 < x ≤ 490 000    5                  195                     2.5 %               97.5 %
490 000 < x ≤ 560 000    2                  197                     1.0 %               98.5 %
560 000 < x ≤ 630 000    2                  199                     1.0 %               99.5 %
630 000 < x ≤ 700 000    1                  200                     0.5 %              100.0 %
700 000 < x ≤ 999 999    0                  200                     0.0 %              100.0 %

Histogram of frequencies for continuous data items
• Histogram – the frequencies (numbers) of the continuous data item “Salary”:
[Figure: histogram, x-axis “Salary” with class boundaries 0, 70000, 140000, 210000, 280000, 350000, 420000, 490000, 560000, 630000, 700000, y-axis “Frequency”]

Measures of central tendency
• Assume that a variable (data item) is numerical, i.e. quantitative, discrete or continuous. We then consider several measures of central tendency of the variable:
  — Arithmetic mean
  — Mode
  — Median

Population & Sample
• Assume that we have a set (i.e. a “population”) of values of some phenomenon which we observe, measure or study. In practice, this set may be very large (e.g. some data item whose data units are all the people living on Earth) and thus unknown to us.
Another example is the set of all results of some experiment, including the instances which we have not carried out yet.
• Assume, however, that the set exists (in theory at least) and that it is finite (for simplicity).

Arithmetic mean

Median & Mode

Sample Mean / Median / Mode in Excel
• In Excel, use the functions:
  =AVERAGEA() to calculate the sample arithmetic mean
  =MEDIAN() to find the sample median
  =MODE.SNGL() to find one of the sample modes
  =MODE.MULT() to find all of the sample modes (array formula, press “Ctrl-Shift-Enter”)
  =MODE() to find one of the sample modes (the same as =MODE.SNGL(), deprecated)

Example: Employees (a sample of the Dataset)
• The same 25-row sample of the employees dataset as shown earlier.

Example: Employees, data item “Age”

The measures of central tendency
• Which of the measures of central tendency is the best?
• Consider the following example – monthly salaries in 2001 and 2002:

Employee  Salary in 2001  Salary in 2002
A         10              25
B         10              25
C         10              25
D         20              20
E         20              20
F         20              20
G         20              20
H         20              20
I         20              20
J         20              20
K         20              20
L         20              20
M         20              20
N         50              50
O         50              50
MEAN      22              25
MEDIAN    20              20
MODE      20              20

Measures of variability
• Assume that a variable (data item) is numerical, i.e. quantitative, discrete or continuous. We then consider several measures of variability of the variable:
  — Range
  — Variance (dispersion)
  — Coefficient of variation

Range

Variance (dispersion)

Standard deviation

Variance (dispersion) & Standard deviation

Coefficient of variation

Example

Sample Variance / Standard deviation
• In Excel, use the functions:
  =VARA() to calculate the sample variance
  =STDEVA() to calculate the sample standard deviation
  =VAR.S() to calculate the sample variance (skipping text values)
  =VAR() to calculate the sample variance (skipping text values; the same as =VAR.S(), deprecated)

Population Variance / Standard deviation
• In Excel, use the functions:
  =VARPA() to calculate the population variance
  =STDEVPA() to calculate the population standard deviation
  =VAR.P() to calculate the population variance (skipping text values)

Measures of data concentration
• Assume that a variable (data item) is numerical, i.e. quantitative, discrete or continuous.
We then consider several measures of data concentration of the variable:
  — Skewness
  — Kurtosis

Skewness: Pearson’s moment coefficient of skewness

Skewness: Properties and interpretation

Skewness in Excel
• In Excel, use the functions:
  =SKEW.P() to calculate the population skewness
  =SKEW() to calculate the sample skewness

Kurtosis: Pearson’s moment coefficient of kurtosis

Kurtosis: Properties and interpretation

Excess kurtosis

Kurtosis in Excel
• In Excel, use the function:
  =KURT() to calculate the sample excess kurtosis

Example
[Figure: bar chart, x-axis “Evaluation”, y-axis “Frequency”; bar heights for evaluations 1 to 7: 19, 20, 47, 32, 23, 16, 12]
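
The “Evaluation” frequencies from the lecture can serve to try out the measures discussed above. What follows is a minimal Python sketch, not the lecture's own computation: it expands the frequency table into individual observations and computes the arithmetic mean, median, mode, variance, and the moment coefficients of skewness and excess kurtosis. All names (`frequencies`, `central_moment`, and so on) are illustrative, and the population form of the moments (division by n) is an assumption, so the results need not agree with Excel's sample-adjusted =SKEW() and =KURT().

```python
from statistics import mean, median, multimode

# Frequencies of the ordinal data item "Evaluation" (taken from the lecture's table).
frequencies = {1: 19, 2: 20, 3: 47, 4: 32, 5: 23, 6: 16, 7: 12}

# Expand the frequency table into the list of individual observations.
values = [x for x, f in frequencies.items() for _ in range(f)]
n = len(values)


def central_moment(data, k):
    """k-th central moment with divisor n (population form; an assumption)."""
    m = mean(data)
    return sum((x - m) ** k for x in data) / len(data)


# Measures of central tendency.
avg = mean(values)
med = median(values)
modes = multimode(values)  # list of all most frequent values

# Measures of variability.
variance = central_moment(values, 2)
std_dev = variance ** 0.5
coeff_of_variation = std_dev / avg

# Moment coefficients of skewness and (excess) kurtosis.
skewness = central_moment(values, 3) / std_dev ** 3
excess_kurtosis = central_moment(values, 4) / variance ** 2 - 3

print(f"n = {n}, mean = {avg:.3f}, median = {med}, modes = {modes}")
print(f"variance = {variance:.3f}, std dev = {std_dev:.3f}, CV = {coeff_of_variation:.3f}")
print(f"skewness = {skewness:.3f}, excess kurtosis = {excess_kurtosis:.3f}")
```

With these frequencies the mean lies above the median, and the skewness comes out positive, which matches the longer right tail visible in the bar chart.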