Slezská univerzita v Opavě Obchodně podnikatelská fakulta v Karviné STATISTICAL METHODS FOR ECONOMISTS Filip Tošenovský Karviná 2014 Projekt OP VK č. CZ.1.07/2.2.00/28.0017 „Inovace studijních programů na Slezské univerzitě, Obchodně podnikatelské fakultě v Karviné“ Field: Statistics Annotation: This textbook presents to the reader important parts of statistical data analysis. The subject matter contained in the book focuses on statistical methods which constitute a standard part of scholarly materials used both at domestic and foreign universities. The methods include description of statistical characteristics, hypothesis testing, regression and correlation analysis, analysis of variance, and also other procedures abundantly used in industries for product quality control, such as design of experiments, Taguchi’s methods based on loss functions and control charts. Key words: Statistical characteristics, regression, correlation, hypothesis testing, analysis of variance, design of experiments, Taguchi’s methods. Author: Ing. Filip Tošenovský, Ph.D. Reviewers: Prof. RNDr. Josef Tošenovský, CSc., Ing. Elena Mielcová, Ph.D. ISBN 978-80-7510-033-7 - 3 - CONTENTS INTRODUCTION.................................................................................................................... 5 1 ESSENTIAL STATISTICAL TERMS, CHARACTERISTICS................................ 6 1.1 THE CASE OF A SINGLE VARIABLE ............................................................................................ 7 1.1.1 MEASURES OF CENTRAL TENDENCY ............................................................................ 7 1.1.2 MEASURES OF VARIABILITY ......................................................................................... 9 1.1.3 MEASURES OF DATA CONCENTRATION....................................................................... 11 1.1.4 GENERAL MOMENTS.................................................................................................. 12 1.2 THE CASE OF TWO VARIABLES................................................................................................ 13 2 HYPOTHESIS TESTING IN MARKETING............................................................ 18 2.1 TESTING STATISTICAL HYPOTHESES....................................................................................... 18 2.2 MARKETING STUDY ................................................................................................................ 23 2.3 MEDIAN TEST.......................................................................................................................... 31 2.4 CHI-SQUARED TESTS............................................................................................................... 32 2.4.1 TESTING A DISCRETE PROBABILITY DISTRIBUTION ..................................................... 32 2.4.2 CHI-SQUARED TEST OF INDEPENDENCE.................................................................... 34 3 REGRESSION ANALYSIS......................................................................................... 38 3.1 THE CONCEPT OF REGRESSION ANALYSIS............................................................................... 38 3.2 ESTIMATION OF REGRESSION COEFFICIENTS .......................................................................... 40 3.3 TESTING SIGNIFICANCE OF REGRESSION COEFFICIENTS ......................................................... 45 3.4 CONFIDENCE INTERVALS FOR REGRESSION COEFFICIENTS .................................................... 
46 3.5 TESTING MODEL SIGNIFICANCE.............................................................................................. 46 4 CORRELATION ANALYSIS..................................................................................... 55 4.1 CORRELATION COEFFICIENT................................................................................................... 55 4.2 CORRELATION INDEX ............................................................................................................. 58 4.3 SPEARMAN’S RANK CORRELATION COEFFICIENT................................................................... 58 4.4 MULTIVARIATE DEPENDENCE- THE CASE OF TWO VARIABLES............................................... 60 5 METHODS FOR SALES PREDICTIONS ................................................................ 67 5.1 TIME SERIES............................................................................................................................ 67 5.2 TIME SERIES MODEL DECOMPOSITION.................................................................................... 68 5.2.1 TREND...................................................................................................................... 69 5.2.2 SEASONAL COMPONENT – THE CASE OF CONSTANT SEASONALITY .............................. 73 5.2.3 PROPERTIES OF THE RANDOM COMPONENT OF A REGRESSION MODEL ...................... 76 5.2.4 DURBIN-WATSON’S TEST.......................................................................................... 76 5.3 MOVING AVERAGES................................................................................................................ 79 5.3.1 SIMPLE MOVING AVERAGES....................................................................................... 79 5.4 MAKING PREDICTIONS WITH TIME SERIES MODELS ................................................................ 81 6 ANALYSIS OF VARIANCE ....................................................................................... 86 6.1 ONE-WAY ANOVA ................................................................................................................... 86 6.1.1 ANOVA HYPOTHESES................................................................................................. 88 6.1.2 A MEASURE OF DEPENDENCE ................................................................................... 92 - 4 - 7 TWO-WAY ANOVA AND LATIN SQUARES......................................................... 95 7.1 TWO-WAY ANOVA .................................................................................................................. 95 7.1.1 EFFECT OF FACTOR A............................................................................................... 96 7.1.2 EFFECT OF FACTOR B............................................................................................... 97 7.2 THREE-WAY ANOVA (LATIN SQUARES) .................................................................................. 99 8 FULL FACTORIAL EXPERIMENTAL PLANS................................................... 107 8.1 FOUNDATIONS OF EXPERIMENTING AND ITS APPLICATIONS ................................................ 107 8.2 EXPERIMENTAL PROCEDURE ................................................................................................ 108 8.3 EFFECT OF A FACTOR AND ITS SIGNIFICANCE....................................................................... 111 8.3.1 STATISTICAL TEST OD FACTOR SIGNIFICANCE.......................................................... 
113 8.3.2 GRAPHICAL ASSESSMENT OF FACTOR SIGNIFICANCE ............................ 114 8.3.3 GRAPH OF INTERACTIONS ......................................... 115 8.4 REGRESSION MODEL OF THE 2³ EXPERIMENT ......................................... 116 9 TWO-LEVEL FRACTIONAL PLAN ......................................... 123 9.1 HALF PLANS ......................................... 124 9.2 GRAPHICAL EVALUATION OF FACTOR EFFECT ......................................... 126 10 TAGUCHI’S METHODS – LOSS FUNCTIONS ......................................... 134 10.1 DEFINITION AND PROPERTIES OF LOSS FUNCTIONS ......................................... 134 10.2 LOSS FUNCTIONS FOR DIFFERENT TYPES OF TOLERANCES ......................................... 136 11 TAGUCHI’S METHODS: TOTAL QUALITY COSTS ......................................... 146 11.1 QUALITY COST MONITORING ......................................... 146 11.2 TAGUCHI’S APPROACH – THE CASE OF 100% PROCESS CONTROL ......................................... 147 11.3 THE CASE OF PROCESS CONTROL AFTER N UNITS ......................................... 148 11.4 CONTROL CHARTS ......................................... 149 CONCLUSION ......................................... 156 REFERENCES ......................................... 157 APPENDIX 1 – TABLE FOR DURBIN-WATSON’S TEST ......................................... 158

INTRODUCTION

The presented textbook serves as a study material for the course Statistical Methods for Economists taught at the Karviná-based School of Business Administration of the Silesian University. The course, a follow-up to the course Statistics, stresses the importance of applying statistical methods in economic disciplines such as marketing, management, production planning and quality management. The textbook is divided into twelve chapters, which corresponds to the usual twelve weeks of teaching in a school term. The chapters are roughly comparable in the extent of their contents and in difficulty. The extent of each chapter corresponds to a two-hour lecture presented to full-time students at schools of economics. As part of full-time study, the lecture is accompanied by a seminar in which the explained subject matter is practised on specific numerical examples, often with computer software as well. However, part-time students of the Silesian University may also use the textbook. Part-time study is a form of study which, in the case of Statistical Methods for Economists, requires students to work regularly and persistently, to concentrate on the subject and to take an active approach to solving problems on their own. This is where the textbook should help substitute for full-time teaching and serve as a study material. Other literary resources listed at the end of the textbook may also be of additional help in this respect.
To pass the course Statistical Methods for Economists successfully, it is assumed that students have passed the course Statistics first. It is true that not everything learnt in Statistics is necessary to master Statistical Methods for Economists, because some of the subject matter presented earlier served a different purpose. Nonetheless, the ability to think accurately and logically exercised in the previous course will come in useful, and so will the ability to recognize the mathematical symbols used and a knowledge of the essentials of probability theory and statistics.

Returning to the course Statistical Methods for Economists, let us describe its contents in greater detail. A more accurate name for the subject could be Selected Statistical Methods for Economists, or even more accurately, Selected Statistical Methods of Marketing, Management and Quality Control. These are the three major areas of interest in which university students often encounter applied statistics in real life. Chapter 1 of the textbook revises elementary terms used in statistics, chapters 2-7 deal with the application of statistics in marketing and management, and chapters 8-12 are devoted to statistics in production planning and quality control. The subject matter and the related problems are studied using Excel, as long as the given problem allows Excel to find the solution. Students have already become familiar with Excel in the course Statistics.

As was mentioned at the beginning of this introduction, the text is divided into twelve chapters. Each chapter requires about four to six hours of study. However, the reward waiting at the end of the study is worth it: it is the feeling that something significant has been overcome – an obstacle that separates the world of professionals from the world of nonprofessionals. With such knowledge, one can better analyse the information we are all flooded with at present.

1 ESSENTIAL STATISTICAL TERMS, CHARACTERISTICS

All statistical methods work with certain terms. This way, both the authors of statistical theory and its users can communicate the results of their analyses in a comprehensible way. To simplify the communication, specific terms are introduced in statistics. The advantage of this procedure lies in the fact that the terms are introduced only once, but their validity is permanent for all interested parties. Also, using the simplest terms, one may construct more sophisticated terms or methods. We shall revise as well as extend some of the terms introduced in the course Statistics, and define essential statistical characteristics using these terms. The characteristics will summarize, in a convenient way, the information contained in the data under scrutiny. We note that from now on, whenever we refer to a closed interval, we shall denote the interval with square brackets "[ ]".

The main objective of statistics is to analyse data. Of course, there is a reason why data originate and are maintained: their purpose is to help analyse the form and behaviour of the statistical variable the data are related to. Examples of such a variable are the height of women in the Czech Republic, the political preference of a citizen, the gross domestic product of a country, an average dimension of a produced ball bearing, etc. We shall mainly be interested in numerical variables, which better suit the needs of mathematics.
If this is our case, the analysed variable can take on different values (otherwise it wouldn't be a variable, of course). The set of all values a variable can take on is called the population. Since a population is related to a specific variable, it is a relative term. For instance, if we are interested in the political preferences of the Czechs, the population consists of all Czech citizens, and it will usually not be available unless a census takes place and its results are made available to the public. On the other hand, if we are interested in the school results of a specific group of students who attend the course Statistical Methods for Economists, the group will represent a population which is easily within reach. Statisticians, however, more often than not do not have populations at hand, and in such cases all they can do is perform a sampling from the population, which results in a data sample from the population. There is more than one way to obtain a data sample, and there are branches of statistics whose sole purpose is to analyse various forms of data sampling. In statistics, we usually require that the sampling be random. Random sampling means that every element of the population has the same probability of being selected, the same probability of being present in the final data sample. More precisely, a random data sample of size n is a random vector $(X_1, X_2, \ldots, X_n)$ whose components $X_i$ follow the same probability distribution (population) and are statistically independent. Such a sampling is required because it possesses certain "representative" properties the theory of statistics relies on. If the data sample is available, we can analyse it with proper statistical methods, and based on this analysis we may formulate conclusions about the data structure of the population the sample came from. Such conclusions constitute what is called statistical inference.

If the population can be obtained, the only ambition of statistics might be to describe it. Methods that serve this purpose form descriptive statistics. Descriptive statistics provides us with characteristics that describe the population with a single number. These characteristics are called population characteristics. A characteristic summarizes information about the data. If the population consists of two thousand values, it is certainly better to use a single number – a characteristic – to get a rough idea about the population, rather than to list all its values. This aggregation, however, is not flawless: there must necessarily be a loss of the original information about the population, since a single number cannot reflect the entire amount of the original information. In case only a data sample is available, not the entire population, the so-called sample characteristics are used to describe the structure of the data sample. It is a common habit to denote population characteristics with Greek letters and sample characteristics with Latin letters. In this way, an order is introduced into the notation, and all users of the theory know immediately whether they work with population characteristics or their sample counterparts. We shall now introduce other terms and new characteristics for data that represent values of a single statistical variable. Later, data that contain information about two statistical variables will also be handled.
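The idea of random sampling can be illustrated with a few lines of code. The following Python sketch is only an illustration and is not part of the original text; the population values are made up. It draws a sample without replacement from a finite population so that every element has the same probability of being selected.

    import numpy as np

    # A small, made-up finite population of a numerical variable
    population = np.array([172, 165, 180, 158, 169, 175, 162, 171, 168, 177])

    # Draw a random sample of size n = 4 without replacement:
    # every element of the population has the same probability of being selected.
    rng = np.random.default_rng(seed=1)
    sample = rng.choice(population, size=4, replace=False)
    print(sample)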
1.1 THE CASE OF A SINGLE VARIABLE

Let us have a population consisting of values x1, x2, ..., xn, where n is an integer, i.e. a finite number (we shall work with data of finite size only). Let X be the variable of interest. The numbers x1, x2, ..., xn are the values which the variable can take on. If we apply random sampling to this population, we can regard the variable X as a (discrete) random variable. Although the population contains the values x1, x2, ..., xn, not all of these values must necessarily be different; some of them may repeat. In such cases, X takes on only k different values x1*, x2*, ..., xk*. The value x1* may appear in the population f1 times. The number f1 is called the absolute frequency of the value x1*. Similarly, the value x2* appears in the population f2 times, the value x3* appears f3 times, and so on, until the last value xk*, which appears in the population fk times. Apart from absolute frequencies, we also work with other types of frequencies:

a) the relative frequency of the value xl*, given by the ratio fl/n, where $n = \sum_{i=1}^{k} f_i = f_1 + f_2 + \cdots + f_k$ denotes the population size.

If we sort the values x1*, x2*, ..., xk* in ascending order, we get a sequence satisfying x(1)* ≤ x(2)* ≤ ... ≤ x(k)*. In this notation, x(1)* is the minimum of the set x1*, x2*, ..., xk*, x(2)* is the second smallest number in the set, and so on. If the value x(i)* has an absolute frequency fi*, we may introduce the following new terms:

b) the absolute cumulative frequency of x(l)*, given by the sum $\sum_{i=1}^{l} f_i^*$,
c) the relative cumulative frequency of x(l)*, given by the sum $\sum_{i=1}^{l} f_i^*/n$.

The foregoing types of frequencies can be used both for a population and for a data sample.

1.1.1 MEASURES OF CENTRAL TENDENCY

Let us have a population consisting of values x1, x2, ..., xn. The population arithmetic mean μ is one of the most important measures of central tendency. It is defined as

1-1   $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Since this mean is related to a population, it is called the population mean. If we performed a sampling of size m from the population and obtained values x̃1, x̃2, ..., x̃m, we could estimate the usually unknown population mean by the sample mean

1-2   $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} \tilde{x}_i$.

Excel: In Excel, both characteristics can be calculated using the function average(), which requires only one parameter: a reference to the area of the spreadsheet that contains the data.

If we know the population x1, x2, ..., xn contains only k different values – a value x1* exactly f1 times, a value x2* ... f2 times, etc., and finally a value xk* ... fk times – we may rewrite 1-1 as

1-3   $\mu = \frac{1}{n}\sum_{i=1}^{k} x_i^* f_i = \frac{1}{\sum_{j=1}^{k} f_j}\sum_{i=1}^{k} x_i^* f_i$.

Similarly, equation 1-2 can be rewritten, using the absolute frequencies with which the values x̃1, x̃2, ..., x̃m appear in the data sample. We may also regard 1-3 as a special case of what is called the weighted average. The weighted average of values x1*, x2*, ..., xk* with weights w1, w2, ..., wk is defined as

1-4   $\mu_w = \frac{1}{\sum_{j=1}^{k} w_j}\sum_{i=1}^{k} x_i^* w_i$.

If the sum of the weights equals one, it is clear that formulas 1-3 and 1-4 represent the same thing for $w_i = f_i/\sum_{j=1}^{k} f_j$, i = 1, 2, ..., k.
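The frequency-based formulas above are straightforward to evaluate in software. The following Python sketch is a minimal illustration with made-up values and frequencies; it computes the relative and cumulative frequencies and the weighted mean of formulas 1-3 and 1-4.

    import numpy as np

    values = np.array([2.0, 3.0, 5.0, 8.0])   # distinct values x1*, ..., xk*
    freqs  = np.array([4, 10, 3, 3])          # absolute frequencies f1, ..., fk

    n = freqs.sum()                           # population size n = f1 + ... + fk
    relative = freqs / n                      # relative frequencies fl / n

    # cumulative frequencies of the values sorted in ascending order
    order = np.argsort(values)
    abs_cum = np.cumsum(freqs[order])         # absolute cumulative frequencies
    rel_cum = abs_cum / n                     # relative cumulative frequencies

    # population mean via formula 1-3 (a weighted average with weights fi)
    mu = np.sum(values * freqs) / n
    # the same result with numpy's built-in weighted average (formula 1-4)
    mu_w = np.average(values, weights=freqs)

    print(relative, abs_cum, rel_cum, mu, mu_w)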
Another measure of central tendency is the mode x̂, which is the value with the highest absolute frequency. This definition does not guarantee uniqueness of the mode, however, so it may happen that the data contains more than one mode.

Yet another measure of central tendency is the median, denoted x̃ or x50; we also talk about the middle value. The median is generally not the same as the average (mean). For a data sample consisting of values x1, x2, ..., xn, we may calculate the median in the following steps:
1) we sort the data in ascending order to get x(1) ≤ x(2) ≤ ... ≤ x(n),
2) we calculate z = n · 0,5 + 0,5,
3) if z is an integer (this is true when n is odd), then x̃ = x(z). If z is not an integer (this happens when n is even), then x̃ = (x(z-0,5) + x(z+0,5))/2.

Excel: Excel uses the function median() to determine the median. The function requires only one parameter – a reference to the region of the spreadsheet containing the data. Uniqueness of the median is guaranteed by the definition in this case. We stress that to use the function correctly, each value of the data sample must be written out explicitly, i.e. the argument must not be a data region containing only the different values of the sample together with their respective frequencies (see Problem 1 for how to proceed in this situation).

PROBLEM 1
Let a data sample contain the number 7 with absolute frequency 234, the number 9 with absolute frequency 672 and the number 43 with absolute frequency 347. Calculate the relative frequency of each value, the arithmetic mean of the sample, the mode and the median.

SOLUTION
Since the mode is the value with the highest frequency, the mode is 9 in this case. The relative frequency of 7 equals 234/(234+672+347), the relative frequency of 9 is 672/(234+672+347) and the relative frequency of 43 equals 347/(234+672+347). The arithmetic mean is

$\frac{7 \cdot 234 + 9 \cdot 672 + 43 \cdot 347}{234 + 672 + 347} = 18{,}04.$

The sample size is 1253, an odd number. Therefore, the median is the 627th value in the sorted data sample, which is the number 9.

1.1.2 MEASURES OF VARIABILITY

Measures of central tendency summarize, in a certain sense, where on the real line the values of the observed variable X typically lie. However, by their nature they say nothing about how far the values are from one another. For these purposes, measures of variability were introduced. They describe a "typical" mutual deviation of the individual values of X. For a population x1, x2, ..., xn, the population variance σ², one of the measures of variability, is defined as

1-5   $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$.

If we only have a data sample x̃1, x̃2, ..., x̃m of size m drawn from the population, we estimate the usually unknown population variance by the sample variance s²,

1-6   $s^2 = \frac{1}{m-1}\sum_{i=1}^{m}(\tilde{x}_i - \bar{x})^2$,

where x̄ is the arithmetic mean of the values x̃1, x̃2, ..., x̃m. We note that 1-6 is the typical formula for the sample variance, but not the only one. Equation 1-5 tells us that the variance is the mean squared deviation of the individual values of X from the population mean μ.

Excel: To calculate the population variance, we use the Excel function varp(), which demands only one parameter – the data region in the spreadsheet for which the variance is to be determined. To calculate the sample variance, the function var() is used with the same argument.
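The three-step median rule, the mode and the mean can be checked on the data of Problem 1. The following Python sketch is an illustration only; it expands the value-frequency table into individual observations and reproduces the results of Problem 1 (mean 18,04, mode 9, median 9).

    import numpy as np

    values = np.array([7, 9, 43])
    freqs  = np.array([234, 672, 347])

    # expand the frequency table into individual observations
    data = np.repeat(values, freqs)
    n = data.size                                   # 1253

    mean = data.mean()                              # about 18.04
    mode = values[np.argmax(freqs)]                 # value with the highest frequency: 9

    # median via the three-step rule
    x_sorted = np.sort(data)
    z = 0.5 * n + 0.5
    if z == int(z):                                 # n odd -> z is an integer
        median = x_sorted[int(z) - 1]               # the z-th smallest value
    else:                                           # n even -> average of two middle values
        median = (x_sorted[int(z - 0.5) - 1] + x_sorted[int(z + 0.5) - 1]) / 2

    print(mean, mode, median)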
Just like in the case of the arithmetic mean, we may rewrite equation 1-5 or 1-6 in an equivalent form that works with absolute frequencies. If we know the population x1, x2, ..., xn contains only k different values – a value x1* ... f1 times, a value x2* ... f2 times, ..., a value xk* ... fk times – we may use the following formula instead of 1-5:

1-7   $\sigma^2 = \frac{1}{\sum_{j=1}^{k} f_j}\sum_{i=1}^{k}(x_i^* - \mu)^2 f_i$.

The same analogy applies to equation 1-6 if we use the absolute frequencies with which the data sample contains its different values.

Another measure of variability is the standard deviation, defined as the square root of the variance. If the underlying measure is the population variance, we talk about the population standard deviation σ. If the underlying measure is the sample variance, taking the square root gives the sample standard deviation s.

The range R, defined as R = xmax − xmin, where xmax is the highest value in the data and xmin is the lowest value in the data, also belongs among the measures of variability.

Excel: The highest number in the data may be found using the Excel function max(), while the lowest number can be obtained with the function min(). Both functions require as the parameter a reference to the section of the spreadsheet containing the data.

We finish the section on measures of variability with the coefficient of variation V. If a population is available, we define its population coefficient of variation by

1-8   $V = \frac{\sigma}{|\mu|}$.

If only a sample is within reach, the corresponding sample coefficient of variation is calculated using the sample standard deviation and the sample mean:

1-9   $V = \frac{s}{|\bar{x}|}$.

The coefficient is suitable for situations when two data groups are to be compared in terms of their variability, but each group relates to a variable with different physical units. Under these circumstances, it is meaningless to use the variance as a measure of variability, because it would be calculated in different and squared physical units, making it useless for any comparison. The higher the coefficient of variation, the higher the variability of the given data group.

PROBLEM 2
Let a population contain the number 7 with absolute frequency 234, the number 9 with frequency 672 and the number 43 with frequency 347. Calculate the population variance.

SOLUTION
We shall use formula 1-7. In the previous problem, which presented the same data group as a sample, we calculated the sample mean 18,04. The same mean appears here, but this time it is a population mean. According to 1-7, we have

$\sigma^2 = \frac{(7-18{,}04)^2 \cdot 234 + (9-18{,}04)^2 \cdot 672 + (43-18{,}04)^2 \cdot 347}{234 + 672 + 347} = 239{,}12.$

1.1.3 MEASURES OF DATA CONCENTRATION

The last category of characteristics we are about to describe consists of measures that reflect, in a sense, to what extent or how the data under scrutiny are grouped together. Two major representatives of this category are the kurtosis Ku and the skewness Sk. If a population x1, x2, ..., xn is available, the population kurtosis is defined as

1-9   $Ku = \frac{\sum_{i=1}^{n}(x_i - \mu)^4}{n\sigma^4}$.

Formula 1-9 can also be written equivalently as

1-10  $Ku = \frac{\sum_{i=1}^{k}(x_i^* - \mu)^4 f_i}{\sigma^4 \sum_{j=1}^{k} f_j}$,

provided the population x1, x2, ..., xn contains only k different values: a value x1* ... f1 times, a value x2* ... f2 times, ..., a value xk* ... fk times.
If x1, x2, ..., xn represent only a data sample, we calculate the sample kurtosis using again equation 1-9 or 1-10, but replacing the population mean μ in the corresponding equation with the sample mean x̄, and also replacing the fourth power of the population standard deviation σ⁴ with the fourth power of the sample standard deviation s⁴. It is clear from 1-9 and 1-10 that kurtosis is nonnegative. The interpretation of the characteristic is such that the higher the kurtosis, the higher the concentration of the data lying closer to the mean, compared to the data lying farther from the mean. Formulas 1-9 and 1-10 sometimes appear in an altered form, with the number 3 subtracted from them. This modification compares the kurtosis of the analysed data with the kurtosis of the normal distribution. The kurtosis of the normal distribution is known to be 3, regardless of the parameters of the distribution. This means that if the modified kurtosis of the data is positive, the frequency distribution of the analysed data has a higher kurtosis than the normal distribution. We note that such a modification of the kurtosis is not the only one.

Excel: Excel offers the function kurt() for the calculation of kurtosis. The function has only one parameter, which is a reference to the area of the Excel spreadsheet containing the analysed data. However, the function calculates yet another modification of kurtosis, which is not identical to the most often used definitions 1-9 or 1-10. Nonetheless, the Excel modification can still be used for the comparison of two data groups in terms of their kurtosis, and its interpretation remains the same.

The population skewness is defined as

1-11  $Sk = \frac{\sum_{i=1}^{n}(x_i - \mu)^3}{n\sigma^3}$,

or equivalently as

1-12  $Sk = \frac{\sum_{i=1}^{k}(x_i^* - \mu)^3 f_i}{\sigma^3 \sum_{j=1}^{k} f_j}$,

if the population x1, x2, ..., xn contains only k different values: a value x1* ... f1 times, a value x2* ... f2 times, ..., a value xk* ... fk times. If we were to calculate the sample skewness, the note made in the case of kurtosis applies here as well: we use formulas 1-11 or 1-12 again, with the population mean replaced by the sample mean and the third power of the population standard deviation replaced by the third power of its sample counterpart s³.

As is clear from the defining formulas, skewness can take on any real value. If the skewness is zero, the frequency distribution of the data is symmetric; more intuitively, the concentration of smaller values is the same as that of higher values. If the skewness is positive, we say the frequency distribution of the data is skewed to the right, and the concentration of smaller values is stronger than the concentration of higher values. Finally, if the skewness is negative, we say the data distribution is skewed to the left, and the concentration of higher values is stronger than the concentration of smaller values. In the case of nonzero skewness, the frequency distribution of the data is said to be asymmetric.

The frequency distribution is often portrayed by a two-dimensional graph. The horizontal axis of the graph describes the different values x1*, x2*, ..., xk* appearing in the data group, whereas the vertical axis measures the frequencies with which x1*, x2*, ..., xk* are contained in the data group.
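The frequency-based formulas 1-7, 1-10 and 1-12 can be evaluated directly from the value-frequency representation of a population. The Python sketch below is illustrative only; it uses the data of Problem 2 and returns the population variance (about 239,12) together with the population skewness and kurtosis of the same data.

    import numpy as np

    values = np.array([7.0, 9.0, 43.0])       # distinct values of the population
    freqs  = np.array([234, 672, 347])         # their absolute frequencies

    n  = freqs.sum()
    mu = np.sum(values * freqs) / n                        # population mean (formula 1-3)

    dev = values - mu
    var   = np.sum(dev**2 * freqs) / n                     # population variance 1-7 (about 239.12)
    sigma = np.sqrt(var)                                   # population standard deviation

    skew = np.sum(dev**3 * freqs) / (sigma**3 * n)         # population skewness 1-12
    kurt = np.sum(dev**4 * freqs) / (sigma**4 * n)         # population kurtosis 1-10

    print(var, sigma, skew, kurt)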
PROBLEM 3
A population contains the following values: 111 with absolute frequency 500, 222 with absolute frequency 400, 333 with absolute frequency 600 and 444 with absolute frequency 300. Calculate the population skewness.

SOLUTION
The population size is 1800. The population mean is 265,166 and the population variance equals 13880,14. Thus, the third power of the population standard deviation is 1635275. According to 1-12, we get

$Sk = \frac{(111 - 265{,}16)^3 \cdot 500 + \cdots + (444 - 265{,}16)^3 \cdot 300}{1635275 \cdot 1800} = 0{,}01.$

We may conclude that the frequency distribution is almost perfectly symmetric in this case.

1.1.4 GENERAL MOMENTS

General moments are characteristics that look at data structures from a different angle. There are several reasons why we work with general moments. One reason is that frequency distributions and moments are related to each other uniquely under certain conditions: data groups with the same moments have the same frequency distributions and vice versa. What we are interested in, however, relates to another reason why we work with the moments: some of the foregoing characteristics can be calculated in a more elegant way using these moments. For a population x1, x2, ..., xn, we define the k-th general moment Mk by

1-13  $M_k = \frac{1}{n}\sum_{i=1}^{n} x_i^k$,  k = 1, 2, ...

Thus, the k-th moment is nothing but the average of the k-th powers of the original data. If the data group contains only m different values xi*, i = 1, 2, ..., m, with frequencies fi, we may also evaluate 1-13 with the formula

$M_k = \frac{1}{n}\sum_{i=1}^{m} (x_i^*)^k f_i$,  k = 1, 2, ...

Now, the following relations hold true:

1-14  $M_1 = \mu$,  $M_2 - M_1^2 = \sigma^2$,  $\sigma^{-3}(M_3 - 3M_1 M_2 + 2M_1^3) = Sk$,  $\sigma^{-4}(M_4 - 4M_3 M_1 + 6M_2 M_1^2 - 3M_1^4) = Ku$.

PROBLEM 4
A group of data D contains the value 11 with absolute frequency 4235 and the value 254 with absolute frequency 6543. Calculate the first two general moments.

SOLUTION
According to 1-13, we have

$M_1 = (4235 + 6543)^{-1}(11 \cdot 4235 + 254 \cdot 6543) = 158{,}518$,
$M_2 = (4235 + 6543)^{-1}(11^2 \cdot 4235 + 254^2 \cdot 6543) = 39213{,}27$.

1.2 THE CASE OF TWO VARIABLES

If we have a data group such that for each integer i = 1, 2, ..., m and j = 1, 2, ..., n it contains a pair of values (xi, yj), or even more pairs with these two values, we are working with a data group of two statistical variables. The absolute frequency of the pair (xi, yj) is called the joint frequency of (xi, yj) and is denoted fij. The size of the data group is $r = \sum_{i,j} f_{ij}$. The distribution of the joint frequencies in the data group is described by a two-dimensional table called a contingency table (see table 1). The heading of the table contains the different categories of each variable, and the body of the table contains the joint frequencies with which the combinations of the categories occur in the data group. A variable y, for instance, may represent family status, while a second variable, say x, can describe attained educational level. The joint frequency f11, for example, will then describe the number of individuals who attained the level of education x1 and at the same time have the family status y1. A similar statement holds true for the other joint frequencies. The last column of the table is usually reserved for the sum of the joint frequencies that lie in the same row. The sum is denoted fi. if we work with the i-th row of the table. The last row of the table is reserved for the sum of the joint frequencies that lie in the same column.
This sum is denoted f.j if we talk about the j-th column of the table. These sums are called marginal frequencies.

Table 1: Contingency table

          y1    y2    ...   yn
  x1      f11   f12   ...   f1n   f1.
  x2      f21   f22   ...   f2n   f2.
  ...     ...   ...   ...   ...   ...
  xm      fm1   fm2   ...   fmn   fm.
          f.1   f.2   ...   f.n   r

Source: author's

If we assume that the table represents an entire population, we can calculate the basic characteristics of the two variables, using the symbols introduced for the different types of frequencies, i.e. we can calculate the population means and population variances with the following formulas:

1. Population means

$\mu_X = \frac{1}{r}\sum_i \sum_j x_i f_{ij}$,   $\mu_Y = \frac{1}{r}\sum_j \sum_i y_j f_{ij}$.

2. Population variances

$\sigma_X^2 = \frac{1}{r}\sum_i \sum_j (x_i - \mu_X)^2 f_{ij}$,   $\sigma_Y^2 = \frac{1}{r}\sum_j \sum_i (y_j - \mu_Y)^2 f_{ij}$.

On the other hand, if the table represented only a data sample, we would calculate the sample means and variances according to the formulas:

1. Sample means

$\bar{x} = \frac{1}{r}\sum_i \sum_j x_i f_{ij}$,   $\bar{y} = \frac{1}{r}\sum_j \sum_i y_j f_{ij}$.

2. Sample variances

$s_X^2 = \frac{1}{r-1}\sum_i \sum_j (x_i - \bar{x})^2 f_{ij}$,   $s_Y^2 = \frac{1}{r-1}\sum_j \sum_i (y_j - \bar{y})^2 f_{ij}$.

When working with two variables whose frequencies are given by the aforementioned contingency table, we also define another important characteristic called covariance. The population covariance of variables X and Y, denoted cov(X,Y), is defined by

1-15  $\mathrm{cov}(X,Y) = \frac{1}{r}\sum_i \sum_j (x_i - \mu_X)(y_j - \mu_Y) f_{ij} = \frac{1}{r}\sum_i \sum_j x_i y_j f_{ij} - \mu_X \mu_Y$.

If we work with a data sample of size r, r ≥ 2, we also define the sample covariance:

1-16  $c_{XY} = \frac{1}{r-1}\sum_i \sum_j (x_i - \bar{x})(y_j - \bar{y}) f_{ij}$.

PROBLEM 5
A variable X can take on the values 3, 5, 4, 6, 7 and 9. For these values, values of another variable Y were measured: 1, 2, 7, 9, 11 and 13, respectively, i.e. 3 corresponds to 1, 5 corresponds to 2, etc. The absolute frequency of each pair of values is one. Calculate the population covariance.

SOLUTION
We use 1-15, setting all the frequencies equal to one. The mean of X is 5,66, the mean of Y is 7,16. We get

$\mathrm{cov}(X,Y) = \frac{(3-5{,}66)(1-7{,}16) + \cdots + (9-5{,}66)(13-7{,}16)}{6} = 7{,}55.$

Covariance is used to describe a mutual dependence of variables X and Y, the dependence taking the form of a line, i.e. we deal with a simple linear dependence. If the covariance is positive, we can say that a dependence in the form of a line exists between the two variables to a certain extent. The dependence is such that if one of the variables rises in value, the other rises to a certain extent as well. On the contrary, if the covariance is negative, it signals the existence of a co-movement of the two variables in opposite directions: if one of the variables rises in value, the other drops to an extent. In both cases, the movement of the other variable is to an extent proportionate to the change of the first variable. Zero covariance suggests there is no linear dependence between the variables. As we can see, it is the sign of the covariance that matters. The value of the covariance is further transformed so that the transformed characteristic falls into the closed interval [-1,1]. The transformation takes place to get a characteristic called the paired correlation coefficient, which gives us a better interpretation of how strong the linear dependence between the two variables is. If we work with a population, we obtain the population paired correlation coefficient through this transformation.
If we work with a sample, the result is called the sample paired correlation coefficient. The population paired correlation coefficient is of the form

1-17  $\rho = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}$,

where σX is the population standard deviation of X and σY is the population standard deviation of Y. The sample paired correlation coefficient is given by

1-18  $r = \frac{c_{XY}}{s_X s_Y}$,

where sX is the sample standard deviation of X and sY is the sample standard deviation of Y. Both the population and the sample correlation coefficient can take on only values from the interval [-1,1]. If the population paired correlation equals one, there is an exact relation between the two variables in the form of a rising line. If, on the contrary, the correlation equals minus one, there is an exact relation between the two variables in the form of a declining line. If the population correlation equals zero, the two variables are said to be uncorrelated. There are also other types of correlations, apart from the paired correlation. We shall discuss them in chapter 5.

CONTROL TEST 1
The following questions concern a data group which contains different values of a variable X together with their corresponding absolute frequencies.

  Values of X   Absolute frequencies
  23            2345
  34            6213
  33            456
  35            8876
  37            12134
  31            5436
  16            445

Source: author's

a. Calculate the arithmetic mean, median and mode of X.
b. If the table represents a population, what do the variance, standard deviation, coefficient of variation and range look like? What would these characteristics look like if the table represented a sample?
c. Calculate the first two general moments of X.
d. The following table contains a data sample on two variables.

  X   3   4   5   1   6   7   8
  Y   5   3   4   6   7   8   9

Source: author's

Determine the sample covariance.
e. Estimate the paired correlation between X and Y, using the table above.

Complete the statements:
f. The median and the mean are measures of ............... .
g. The standard deviation and the range are measures of ............... .
h. Skewness and kurtosis are measures of ............... .
i. The coefficient of paired correlation takes on values from the interval ............... .

SOLUTIONS
a. Using 1-3, the mean is 33,85. There are 35905 values available, which is an odd number. Therefore, the median equals the 17953rd value in the data group sorted in ascending order. The following table contains the sorted data:

  Values   Frequencies
  16       445
  23       2345
  31       5436
  33       456
  34       6213
  35       8876
  37       12134

Source: author's

The numbers 16 to 34 form a data subgroup of size 14895. The numbers 16 to 35 make up a data subgroup of size 23771. Thus, the median is obviously equal to 35. The mode is 37 in this problem.
b. The population variance is 16,56 according to 1-7. Its square root equals 4,07, which is the population standard deviation. The range is 37 - 16 = 21. The population coefficient of variation equals 4,07/33,85 = 0,12. The sample variance is (35905/35904)·16,56 = 16,56. Given the relatively large size of the data group, it is the same number as the population variance if the result is rounded to two decimal places; the calculation follows from the relation between the two variances. The sample standard deviation is 4,07 as well. The remaining sample characteristics will therefore be more or less the same (if we use the rounded numbers; precisely speaking, they are not exactly the same).
c. The first general moment = the mean = 33,85. The second general moment = 1162,56.
d. The sample covariance = 3,166.
e. The sample correlation = 0,608.
f. Central tendency.
g. Variability.
h. Data concentration.
i. [-1, 1].

2 HYPOTHESIS TESTING IN MARKETING

The second chapter covers hypothesis testing, one of the most important techniques in statistics. The first part of the chapter revises some fundamental principles of hypothesis testing, part of which was already covered in the course Statistics. Another section of the chapter describes statistical tests which could be considered elementary, because this is how they are treated in many other scholarly texts. We shall also describe some other tests which particularly suit the needs of marketing. The technique of hypothesis testing is explained in greater detail in the accompanying examples.

2.1 TESTING STATISTICAL HYPOTHESES

Statistical hypotheses constitute only a part of all scientific hypotheses. They are related to random variables, and we divide the set of such hypotheses into a subset of parametric hypotheses and a subset of nonparametric hypotheses. Parametric hypotheses deal with the parameters of the probability distribution of a random variable (or of the population of an observed statistical variable). Nonparametric hypotheses are not related to the parameters of such a distribution; they concern some other properties of the distribution, such as its shape, because we may be interested, for instance, in whether the behaviour of a random variable can be described properly by a binomial or a normal distribution.

Every statistical test works with two hypotheses that stand against each other: a tested hypothesis (tested statement), called the null hypothesis and denoted H0, and an alternative hypothesis, denoted H1. H1 is usually the negation of H0. What we usually have available for hypothesis testing is the result of a data sampling. Such a sampling can take the form of a marketing study or poll. Without sampling, which originates in a random way, it is not possible to perform a statistical test. Based on the result of the sampling, we decide whether to accept or reject the null hypothesis. To make such a conclusion, we calculate what is called the test criterion T, a function of the data gathered by the sampling. We also define a subset of the set of real numbers called the critical region. Different tests have different critical regions. If the test criterion T falls into the critical region, the null hypothesis is rejected; in the opposite case, the null hypothesis is accepted. The critical region is usually defined by a critical value K (or it can also be defined by a percentile). The critical value is either found in statistical tables or calculated with suitable software (Excel, for instance).

Let us note that by accepting the null hypothesis we are not proving the validity of the tested statement. Testing hypotheses does not necessarily lead to the right conclusion, which is natural, since it is a process based on the limited amount of information stored in the data sample we work with. The uncertainty in the conclusion of a statistical test is related to the significance level of the test α, a parameter chosen by whoever performs the test (we will talk about this parameter in a moment). Let us point out again that the testing is based on the randomness of the data sampling. In other words, it is based on the fact that the data to be used for the test were gathered independently of one another (independently in the statistical sense of the word).
Whether this is true or not can also be tested [10]. For convenience, let us summarize the steps that lead to acceptance of the null hypothesis or its rejection.

The general procedure of hypothesis testing
1. Formulate the null hypothesis H0 and the alternative hypothesis H1.
2. Calculate the test criterion T.
3. Find the critical value K for a given significance level α (i.e. define the critical region C).
4. Compare K and T, i.e. determine whether T ∈ C, and based on this accept or reject the hypothesis H0.

The conclusion and credibility of hypothesis testing
If T ∈ C, H0 is rejected. If T ∉ C, H0 is accepted. Since the decision whether to reject or accept the null hypothesis depends on the limited amount of information contained in the corresponding data sample, we can make a mistake of one of the following two kinds:
a. We reject a null hypothesis which actually holds true. By doing so, we make an error of the first kind. The probability that this error happens is denoted α and is called the significance level of the test.
b. We accept a null hypothesis which in fact is not true. By doing so, we make an error of the second kind, the probability of which is denoted β. The probability 1 − β is called the power of the test. It is the probability that a false null hypothesis will be correctly rejected.

The significance level α is usually set at 0,05, 0,01 or, less frequently, at 0,1. We then talk about a test at the 5%, 1% or 10% significance level, respectively. Apart from the significance level, the so-called p-values are also used in hypothesis testing. These values are often part of statistical software outputs. A p-value is the probability of reaching or exceeding the observed value of the test criterion, provided the null hypothesis holds. If the p-value is smaller than or equal to the chosen significance level, the null hypothesis is rejected. In the opposite case, the null hypothesis is accepted.

Basic statistical tests
We shall now present the standard and frequently used statistical tests. These are
(A) the one-sample t-test,
(B) the two-sample t-test with equal variances,
(C) the two-sample t-test with unequal variances,
(D) the paired t-test,
(E) the two-sample F-test of variance equality.
Each of the tests is described below by the four-step general procedure of hypothesis testing. The tests can also be performed in Excel if the Excel add-in module Data Analysis is selected. The module contains many statistical methods, including the standard statistical tests.

(A) Testing the population mean (one-sample t-test)
Let X = (X1, ..., Xn) be a random sample from a normal distribution N(µ, σ²), where the variance σ² is unknown.
1. We test the null hypothesis H0: µ = µ0 against the alternative H1: µ ≠ µ0, where µ0 is a given value.
2. The test criterion takes the form

$T = \frac{\bar{X} - \mu_0}{S}\sqrt{n}$,

where
X̄ = the sample mean calculated from the values X1, ..., Xn,
S = the sample standard deviation calculated from the values X1, ..., Xn,
µ0 = the assumed population mean which is tested by the statistician,
n = the sample size.
3. The critical value K is related to a Student's distribution with n − 1 degrees of freedom, and for a significance level α it is denoted tn-1(α). The critical value is defined as the number satisfying P(|X| ≥ tn-1(α)) = α, where X is a random variable following the Student's distribution. The critical value can either be found in statistical tables or calculated in Excel using the function TINV(α; n-1).
The critical region of the test is C = (−∞, −K] ∪ [K, +∞).
4. If |T| ≥ tn-1(α), H0 is rejected and H1 is accepted; in all other cases, H0 is accepted.

(B) Testing a difference between two population means (two-sample t-test with equal variances)
Let there be two independent random samples of sizes n1 and n2, respectively, from normal distributions N(µ1, σ1²) and N(µ2, σ2²), respectively. The variances σ1² and σ2² are unknown, but σ1² = σ2² is assumed.
1. We test the null hypothesis H0: µ1 = µ2 against the alternative H1: µ1 ≠ µ2.
2. The test criterion T takes the form

$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{(n_1-1)S_1^2 + (n_2-1)S_2^2}} \cdot \sqrt{\frac{n_1 n_2 (n_1 + n_2 - 2)}{n_1 + n_2}}$,

where X̄1 is the sample mean calculated from the data obtained from N(µ1, σ1²) and X̄2 is the sample mean calculated from the data obtained from N(µ2, σ2²). We also calculate the sample variance S1² of the first sample and the sample variance S2² of the second sample. The numbers n1 and n2 represent the sizes of the first and the second sample, respectively.
3. The critical value K is related to a Student's distribution with n1 + n2 − 2 degrees of freedom and is denoted tn1+n2-2(α) for a significance level α. The value can either be found in statistical tables or calculated in Excel with the function TINV(α; n1+n2-2).
4. If |T| ≥ tn1+n2-2(α), H0 is rejected and H1 is accepted; in the opposite case, the null hypothesis is accepted.

(C) Testing a difference between two population means (two-sample t-test with unequal variances)
Let there be two independent samples of sizes n1 and n2, respectively, from distributions N(µ1, σ1²) and N(µ2, σ2²), respectively. The sample means X̄1, X̄2 are calculated from the two samples, as well as their sample variances S1², S2², respectively. The population variances σ1² and σ2² are unknown; however, the inequality σ1² ≠ σ2² is assumed this time.
1. We test the hypothesis H0: µ1 = µ2 against the alternative H1: µ1 ≠ µ2.
2. The test criterion takes the form

$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{V_1 + V_2}}$,  where  $V_i = \frac{S_i^2}{n_i}$,  i = 1, 2.

3. The critical value K is calculated using the formula

$K = \frac{V_1 \, t_{n_1-1}(\alpha) + V_2 \, t_{n_2-1}(\alpha)}{V_1 + V_2}$,

where tn1-1(α) and tn2-1(α) are critical values of a Student's distribution with n1 − 1 and n2 − 1 degrees of freedom, respectively, both for a significance level α. The value K can be obtained using the Excel function TINV(α; n1-1) for tn1-1(α) and TINV(α; n2-1) for tn2-1(α).
4. If |T| ≥ K, H0 is rejected and H1 is accepted; in the opposite case, H0 is accepted.

(D) Paired t-test
Let X = (X1, X2, ..., Xn) be a random sample from N(µ1, σ1²), and Y = (Y1, Y2, ..., Yn) be a random sample from N(µ2, σ2²). The corresponding sample means are X̄ and Ȳ, respectively.
1. We test the hypothesis H0: µ1 = µ2 against the alternative H1: µ1 ≠ µ2.
2. The test criterion takes the form

$T = \frac{\bar{D}}{S_D}\sqrt{n}$,  where  $S_D^2 = \frac{1}{n-1}\sum_{i=1}^{n}(D_i - \bar{D})^2$,  $D_i = X_i - Y_i$,  i = 1, 2, ..., n,  $\bar{D} = \bar{X} - \bar{Y}$.

3. The critical value K = tn-1(α) is related to a Student's distribution with n − 1 degrees of freedom and a significance level α. It can be obtained with the Excel function TINV(α; n-1).
4. If |T| ≥ K, H0 is rejected and H1 is accepted; in the opposite case, H0 is accepted.

(E) Two-sample F-test of equality of variances
Let us have two independent random samples from N(µ1, σ1²) and N(µ2, σ2²), respectively, of sizes n1 and n2, respectively.
Let S1² and S2² be their respective sample variances.
1. We test the hypothesis that the population variances are the same, i.e. the hypothesis H0: σ1² = σ2², against the alternative H1: σ1² ≠ σ2².
2. The test criterion takes the form

$T = \frac{\max(S_1^2, S_2^2)}{\min(S_1^2, S_2^2)}$.

3. The critical value K = Fn1-1,n2-1(α) can be found in statistical tables of the Fisher distribution with n1 − 1 and n2 − 1 degrees of freedom, for a significance level α. Alternatively, the critical value can be calculated with the Excel function FINV(α; n1-1; n2-1).
4. If T ≥ K, H0 is rejected and H1 is accepted; in the opposite case, H0 is accepted.

As was mentioned previously, statistical hypotheses represent only a part of all scientific hypotheses – the part which concerns random variables. The corresponding tests include parametric and nonparametric tests. Parametric tests deal with the parameter(s) of a given probability distribution. Nonparametric tests are concerned not with the parameters of a distribution but with other statistical properties of the distribution. It must be noted, however, that the term nonparametric test is also used in a more general sense: it is used for tests that do not have to comply with as many mathematical conditions for their use as other tests do. As we saw earlier, the t-tests, for instance, required several such conditions, including the prerequisite that the random sample come from a normal distribution. There are situations when such a prerequisite cannot be met, and the question of how to proceed then naturally arises. There are more robust statistical tests that demand only very general conditions for their justified use. This is when we talk about nonparametric tests, although such tests may also be used to check the particular form of the parameters of a probability distribution. In order not to get entangled in this terminology, we shall continue to perceive the term nonparametric test as we did up to this point, i.e. we shall view such a procedure as a method that tests distribution properties other than those related directly to the parameters of the distribution.

2.2 MARKETING STUDY

To demonstrate the tests described above and to introduce some new tests, let us draw on a marketing study named Studie. The study will be used now and then for the description of other statistical methods. The procedures that follow are accompanied by Excel calculations.

Studie
A company wants to bring out a new nonalcoholic beverage: a cola-based carbonated drink. There are three versions of the product that are to hit the market: Kafola, Kofikola and Kofolisima. A questionnaire-based poll was run, and answers from 47 respondents were gathered about the consumption of the new products. The results of the poll are contained in table 2 (the data represent the weekly consumption of the corresponding beverage in litres).
Table 2: Studie poll results

  Respondent   Sex   Age   Kafola   Kofikola   Kofolisima
  1            m     20    1,1      0,7        0,5
  2            f     34    1        0,2        0,1
  3            f     43    0,8      0,1        0,2
  4            f     21    1,2      0,6        0,3
  5            m     39    1,1      0,1        0,4
  6            f     51    0,4      0          0,2
  7            m     19    0,9      0,9        0,3
  8            f     45    0,3      0,2        0,2
  9            f     48    1,2      0,1        0,4
  10           f     21    1,4      0,4        0,2
  11           f     52    0,4      0          0,3
  12           f     22    1,2      0,6        0,4
  13           m     62    0,2      0          0,2
  14           f     47    0,6      0,2        0,1
  15           m     23    0,9      0,8        0,2
  16           m     35    0,9      0,1        0,4
The test criterion takes the form

T = \frac{\bar{X} - \mu_0}{S}\,\sqrt{n},

where
X̄ = sample average,
S = sample standard deviation,
µ0 = assumption about the unknown population average; in our case, it is 0,7,
n = sample size; in our case, it is 47.
3. The critical region of the one-sided test is the interval C = [K, +∞), given by the critical value K of the Student's distribution with n−1 degrees of freedom. The critical value in this case is such a value that the probability of exceeding it equals the preset nivel of test α. Given the definition of the Student's distribution critical values, this means that K = t_{n−1}(2α). The number can be calculated using the Excel function TINV(2α; n−1).
4. If T ≥ K, H0 is rejected and H1 accepted. In the opposite case, H0 is accepted.

As can be seen, the one-sided version of the t-test is very similar to its two-sided version. The difference lies in the calculation of the critical value and in the conclusion of the test. In our case, if we formulate the null hypothesis H0: µ < 0,7 against the alternative H1: µ ≥ 0,7, the test criterion results in the same number, of course, but the critical value at the five per cent nivel of test equals TINV(2·0,05; 46) = 1,68, and we again reject the tested hypothesis.

We could also ask for what nivel of test the null hypothesis would be accepted in the one-sided version of the test. In our case, we accept the null hypothesis if and only if T < K. To answer the question, it is convenient to use the Excel function TDIST(K; n−1; tails). The function returns the value α satisfying P(|X| ≥ K) = α, where X follows a t-distribution with n−1 degrees of freedom, if the argument „tails" is set to 2; alternatively, the function returns α satisfying P(X ≥ K) = α if the argument „tails" is set to 1. In our case, if we substitute T for K and use the function TDIST, we find that the probability of exceeding T equals TDIST(2,239; 46; 1) = 0,015. Thus, if T < K is to hold, the critical value K must be such that the probability of exceeding it is smaller than 0,015. This probability, however, is the nivel of test. The conclusion for the one-sided version of the test is therefore that any nivel of test smaller than 0,015 leads to acceptance of the null hypothesis.

PROBLEM 2 (two-sample t-test with equal variances)
Let us demonstrate how to perform the two-sample t-test with equal variances. The equality of variances is assumed to hold true for the moment, although this should be tested as well, using another statistical test (we will talk about that test later). Our objective now is to find out, using Studie, whether the average consumption of Kofikola is the same as that of Kofolisima. The nivel of test is five per cent. The test criterion satisfies

T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{(n_1-1)S_1^2 + (n_2-1)S_2^2}} \cdot \sqrt{\frac{n_1 n_2 (n_1+n_2-2)}{n_1+n_2}},

from which it is obvious that we need to calculate the sample averages in both data samples (the column „Kofikola" represents one sample in this case, the column „Kofolisima" the other sample), as well as the sample variances and the sample sizes. These characteristics are contained in table 3. Again, the calculation of the characteristics can be carried out using the corresponding defining formulas or the Excel functions.

Table 3: entry characteristics for the two-sample t-test
               Kofikola      Kofolisima
average        0,40851064    0,24042553
variance       0,14123034    0,02289547
sample size    47            47
Source: author's
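Before evaluating the criterion by hand, note that the whole calculation can also be reproduced outside Excel. The following minimal Python sketch (assuming the scipy library is available; the variable names are ours) works directly with the summary statistics of table 3:

import math
from scipy.stats import t

n1, n2 = 47, 47
mean1, mean2 = 0.40851064, 0.24042553     # sample averages from table 3
var1, var2 = 0.14123034, 0.02289547       # sample variances from table 3

# pooled two-sample t-test criterion, as defined above
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
T = (mean1 - mean2) / math.sqrt(pooled_var) * math.sqrt(n1 * n2 / (n1 + n2))

# two-sided critical value, the analogue of TINV(0,05; n1+n2-2)
K = t.ppf(1 - 0.05 / 2, n1 + n2 - 2)

print(T, K)   # roughly 2.84 and 1.99

The printed values should agree with the hand calculation and the Excel output that follow.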
Thus, using the data from table 3, the test criterion is T = 2,844. The critical value is K = TINV(0,05; 47+47−2) = 1,986. Therefore, the hypothesis of equal population means is rejected, the nivel of test being five per cent.

Excel: The test can be performed using the Excel analytical tools located at Data/Data Analysis or Tools/Data Analysis. If Excel is installed for the first time, it may happen that the Data Analysis module is hidden in the program. In such cases, the module must be activated through Tools/Add-ins (older Excel versions) or through File/Options/Add-ins (newer Excel versions). The module contains nineteen statistical methods, of which five are related to statistical tests. The main advantage of the module is that the corresponding formulas do not have to be constructed and calculated; everything is done automatically by the module itself. Each test is integrated into a single dialogue window, and its results are presented in a unified table.

If we run the Data Analysis module, a dialogue window with various statistical methods pops up. We select the two-sample t-test with equal variances and confirm the option. In the subsequent window that Excel offers, we place the cursor in the sub-window Sample 1 and designate the area in the spreadsheet containing the data of interest – in our case, the data on the Kofikola consumption, for instance. Similarly, the Sample 2 sub-window will contain a reference to the area of the Excel spreadsheet containing the data on the Kofolisima consumption. We keep the default alpha at 0,05 as well as the default output location offered by the dialogue window. After confirming these options, Excel generates table 4 with all the necessary results.

Table 4: Excel output for the two-sample t-test with equal variances
              sample 1       sample 2
Mean          0,408510638    0,240425532
Variance      0,141230342    0,022895467
Sample size   47             47
:             :              :
t Stat        2,84439379
P(T<=t) (1)   0,002741733
t krit (1)    1,661585397
P(T<=t) (2)   0,005483465
t krit (2)    1,986086317

The table contains the characteristics necessary for carrying out the test, and also the test criterion t Stat and the critical value of the two-sided test t krit (2). Both values confirm that our previous calculations were correct. The conclusion therefore remains the same.

PROBLEM 3 (F-test)
If the two-sample t-test with equal variances is to be credible, we must confirm whether the assumption of equal population variances is correct. We shall now test the assumption at the one per cent nivel of test. The test criterion related to this test is of the form

T = \frac{\max(S_1^2, S_2^2)}{\min(S_1^2, S_2^2)}.

It is a ratio of two sample variances. In our case, the sample variance for the weekly consumption of Kofikola is S₁² = 0,141, and the sample variance for Kofolisima is S₂² = 0,0228. Therefore, T = 6,168. The critical value of the test is K = FINV(0,01; 47−1; 47−1) = 2. This means the null hypothesis is rejected even at a very small nivel of test.

Excel: The same test can be realized using the Data Analysis module in Excel if we select the F-test of equal variances option in the module.
In order for this procedure to give the same result, it is necessary that the data sample with the higher sample variance be used as Sample 1 in the dialogue window that follows the confirmation of the F-test option in the module. In our example, the higher variance relates to the data sample on Kofikola: S₁² = 0,141. Therefore, the Sample 2 option in the dialogue window will contain the reference to the data on Kofolisima. The nivel of test alpha is 0,05 by default; we shall reset it to 0,01 for our purposes. Confirming the options, Excel returns the results in the form of table 5.

Table 5: Excel output on the F-test of equal variances
F-test of equal variances
               Sample 1       Sample 2
Mean           0,408510638    0,240425532
Variance       0,141230342    0,022895467
Observations   47             47
Difference     46             46
F              6,168484848
P(F<=f) (1)    3,7416E-09
F krit (1)     2,006833595

Here, F stands for the test criterion and F krit (1) stands for the critical value of the test: the real number such that the probability of its being exceeded by the test criterion is 0,01. In this case, the critical value equals 2,007.

As we can see, the procedure used in the previous problem, where we worked with the two-sample t-test with equal variances, was not appropriate, since it was based on the assumption of equal variances, and this assumption has just been rejected. The appropriate procedure is to use the two-sample t-test with unequal variances, as demonstrated in the following problem.

PROBLEM 4 (two-sample t-test with unequal variances)
We said that in this case the test criterion takes the form

T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{V_1 + V_2}},

where Vᵢ = Sᵢ²/nᵢ, i = 1, 2, Sᵢ² = sample variance of the i-th sample and nᵢ = size of the i-th sample. Applying these formulas to our case of nonalcoholic beverages, we get

T = \frac{0,408 - 0,24}{\sqrt{0,141/47 + 0,0228/47}} = 2,85.

The critical value of the test is

K = \frac{V_1\, t_{n_1-1}(\alpha) + V_2\, t_{n_2-1}(\alpha)}{V_1 + V_2} = \frac{0,003 \cdot 2,013 + 0,00048 \cdot 2,013}{0,003 + 0,00048} = 2,013.

The conclusion of the test is that we reject the null hypothesis of equal means.

Excel: Using the Data Analysis module in Excel, we can select the two-sample t-test with unequal variances option. Upon confirming the selection, we fill in the required information in the corresponding dialogue window: we insert references to the areas of the Excel spreadsheet containing the data on Sample 1 and Sample 2, just as in the case of the two-sample t-test with equal variances. Also, we set the nivel of test alpha (we leave it at five per cent here). The output of the Excel calculations is contained in table 6.

Table 6: Excel output of the two-sample t-test with unequal variances
Two-sample t-test with unequal variances
               Sample 1       Sample 2
Mean           0,408510638    0,240425532
Variance       0,141230342    0,022895467
Observations   47             47
:              :              :
t Stat         2,84439379
P(T<=t) (1)    0,00302417
t krit (1)     1,670219484
P(T<=t) (2)    0,006048339
t krit (2)     1,999623585

Let us comment on the results of table 6: t Stat represents the test criterion, t krit (2) is the critical value. It is necessary to note that, as in the case of several fundamental statistical characteristics (skewness and kurtosis, in particular), Excel carries out some calculations differently from how they are performed in rigorous statistical texts. More precisely, the critical value of this test can be calculated in more than one way. Different procedures approximate the degrees of freedom of the Student's distribution related to this test in different ways. Therefore, it is quite likely that the critical value for this test provided by Excel will differ from the one defined at the beginning of this chapter. Nonetheless, despite the difference in the critical value, the conclusion to our problem remains the same. Also, as demonstrated above, the difference in the critical values is certainly not severe.
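To illustrate the point, the unequal-variance criterion and an approximate critical value can also be reproduced outside Excel. A minimal Python sketch, assuming the scipy library is available and working only with the summary statistics of table 3, might look as follows:

import math
from scipy.stats import t, ttest_ind_from_stats

n1 = n2 = 47
mean1, mean2 = 0.40851064, 0.24042553    # sample averages (table 3)
var1, var2 = 0.14123034, 0.02289547      # sample variances (table 3)

# test criterion of the two-sample t-test with unequal variances
V1, V2 = var1 / n1, var2 / n2
T = (mean1 - mean2) / math.sqrt(V1 + V2)

# Welch-Satterthwaite approximation of the degrees of freedom,
# which is essentially what Excel uses for its critical value t krit (2)
df = (V1 + V2) ** 2 / (V1 ** 2 / (n1 - 1) + V2 ** 2 / (n2 - 1))
K_welch = t.ppf(1 - 0.05 / 2, df)

# the same criterion via the library routine (equal_var=False)
T_lib, p_value = ttest_ind_from_stats(mean1, math.sqrt(var1), n1,
                                      mean2, math.sqrt(var2), n2,
                                      equal_var=False)

print(T, T_lib)      # both approximately 2.84
print(df, K_welch)   # roughly 60 degrees of freedom and a critical value close to 2.00

The critical value obtained this way is close to the t krit (2) value reported in table 6, whereas the formula introduced at the beginning of the chapter gives 2,013; as noted above, the difference has no effect on the conclusion.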
PROBLEM 5 (paired test)
We emphasize again that while applying the two-sample t-tests with equal or unequal variances, we assume, among other things, that the two samples are independent of each other. This is a very important prerequisite. If it is not met, it is better to apply the paired test. If the condition of independence is met and the analyst applies the paired test instead of the two-sample t-test, nothing serious happens. However, such a procedure is not optimal, because the paired test requires two samples of the same size, as opposed to what is required by the two-sample t-test. Thus, if the analyst wants to use the paired test, he or she may have to throw away some of the data if the samples to be worked with are of different sizes. On the contrary, if the situation requires the paired test because of the conditions necessary for this test, and the analyst uses the two-sample t-test instead, it is a serious mistake, and the conclusions based on such a procedure will be completely unreliable.

We shall now use the paired test to find out whether the average weekly consumption of Kofikola is the same as the average weekly consumption of Kofolisima. First, let us subtract the consumption values that lie in the same row of table 2, in the columns Kofikola and Kofolisima. Doing so, we get the differences Dᵢ = Xᵢ − Yᵢ, where Xᵢ is the consumption of Kofikola from the i-th row of table 2 and the column Kofikola, and Yᵢ is the consumption of Kofolisima from the i-th row of the same table and the column Kofolisima. The average difference equals D̄ = X̄ − Ȳ = 0,408 − 0,24 = 0,168. Secondly, let us calculate the sample standard deviation of the differences

S_D = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(D_i - \bar{D})^2} = 0,4076.

Here, n = 47 = sample size = number of rows in table 2. The test criterion of the paired test is

T = \frac{\bar{D}}{S_D}\,\sqrt{n} = (0,168/0,4076)\cdot\sqrt{47} = 2,826.

The critical value is K = t_{n−1}(α) = TINV(0,05; 47−1) = 2,013. Since T ≥ K, H0 is rejected.

Excel: Selecting the Data/Data Analysis/Two-sample paired t-test option in Excel, and subsequently using the Sample 1 and Sample 2 options for references to the Excel spreadsheet areas with the data on Kofikola and Kofolisima, we get the following output (table 7).

Table 7: Excel output for the paired test
Two-sample paired test of means
                           Sample 1        Sample 2
Averages                   0,408510638     0,240425532
Variances                  0,141230342     0,022895467
Observations               47              47
Pears. correlation         -0,017650836
Hyp. difference of means   0
t Stat                     2,827157048
P(T<=t) (1)                0,003464955
t krit (1)                 1,678660414
P(T<=t) (2)                0,006929909
t krit (2)                 2,012895599

t Stat corresponds to the test criterion of the paired test, t krit (2) is the critical value of the test. As the table confirms, the Excel calculations correspond to the calculations we made by hand. In the next section of the chapter, we shall present some other statistical tests.
We will describe their purpose and use the study material Studie to show how to proceed in each case.

2.3 MEDIAN TEST

This is one of the tests that do not require many conditions for their use. As the name suggests, the median test tries to confirm or reject a hypothesis about the median of a probability distribution. If the distribution possesses the property that its median equals its mean, the test can be regarded as an alternative to the one-sample t-test. The only condition that must be met for the median test to be justified is the requirement that the respective data sample be drawn from a population with a continuous probability distribution. Thus, normality is not required here, unlike in the case of the one-sample t-test. Let us denote the unknown median as µ̃ and the size of the data sample used to perform the test as n. We assume that n is large enough, since the precision of the test we are about to describe improves as n increases.
1. We test H0: µ̃ = µ̃0 versus H1: µ̃ ≠ µ̃0. Here, µ̃0 is a specific value defined by the statistician.
2. The test criterion is

T = \frac{|2m - n|}{\sqrt{n}},

where m is the number of observations in the data sample which are smaller than µ̃0.
3. The critical value is K = z1−α/2, where z1−α/2 is the critical value of the standard normal distribution N(0,1) for a nivel of test α, i.e. it is the real number such that the probability that a standard normal variable does not exceed it equals 1−α/2. The critical value can be found either in statistical tables or using the Excel function NORMSINV(1−α/2).
4. If T ≥ K, H0 is rejected. In the opposite case, H0 is accepted.

PROBLEM 6 (Median test)
Let us test the hypothesis that the average age of the cola-based drink consumer is 33 years (33 is supposed to be the median age). The nivel of test is five per cent. The data available for the test are contained in Studie. Looking at the data, we see that in 20 out of all 47 cases the age of the consumer is below 33. Therefore,

T = \frac{|2 \cdot 20 - 47|}{\sqrt{47}} = 1,02.

The critical value is K = z1−α/2 = NORMSINV(1−0,05/2) = 1,96. This leads to the conclusion that the null hypothesis is accepted.

2.4 CHI-SQUARED TESTS

The last category of statistical tests we are going to deal with is related to chi-squared tests. We shall demonstrate two of these tests, as they are frequently used in social surveys. The first test focuses on the type of the probability distribution from which the data sample used for the test came; the second test verifies the hypothesis of statistical independence of two random variables. Since a chi-squared distribution is exploited in both tests, it is clear where the name of the tests originated.

2.4.1 TESTING A DISCRETE PROBABILITY DISTRIBUTION

As is well known from mathematical statistics, the most frequently used probability distributions are of the discrete or continuous type. The chi-squared test can be used in both situations. For simplicity, we shall work with discrete distributions only. What follows is a theoretical description of the test and an example that demonstrates its purpose.

Let X be an observed variable (not necessarily a numerical variable). The type of beverage consumed may serve as an example of such a variable. Let us assume that there are k different categories of the variable: X1, X2, …, Xk (k different types of beverages, for instance). Let pi represent the relative frequency of occurrence of the i-th category Xi in the population.
If we sample data randomly from this population and write down the absolute frequencies with which the different categories X1, X2, …, Xk occurred in the sample, we can view these absolute frequencies as realizations of random variables X1, X2, …, Xk. The expression

T = \sum_{i=1}^{k} \frac{(X_i - np_i)^2}{np_i}

is then a random variable itself, following approximately a chi-squared distribution with k−1 degrees of freedom. The higher the n, the more precise the approximation. The absolute frequency of the i-th category Xi in the sample is called the empirical frequency; the term npi is called the theoretical or expected frequency. This setting can be used to test a hypothesis about the pi's, i = 1, 2, …, k, concerning the variables X1, X2, …, Xk. From the statistical point of view, this means that parameters of a probability distribution (multinomial in this case) are tested. To do the test, a null hypothesis about the parameters pi is formulated, a sample is drawn from the corresponding population, and the aforementioned test criterion T is calculated. If T ≥ χ²_{k−1}(α), where χ²_{k−1}(α) is the critical value of a chi-squared distribution with k−1 degrees of freedom and a nivel of test α, the null hypothesis is rejected. In the opposite case, when T < χ²_{k−1}(α), the null hypothesis is accepted. The critical value can be found either in statistical tables or by using the Excel function CHIINV(α; k−1).

PROBLEM 7 (Chi-squared test)
To demonstrate the test, let us use the data on cola-based drinks. The population is represented by the Faculty of Business Administration of the Silesian University. We assume that only cola-based drinks are sold at the faculty. We are interested in whether it is true that all three cola-based drinks are consumed in the same amounts. Statistically speaking, this means that we test whether the variable X, a cola-based drink, follows a uniform distribution. We set the nivel of test at five per cent. Table 8 depicts the result of a random sampling – the absolute frequencies for each of the three drinks.

Table 8: Frequencies of consumed drinks
             Number of bottles
Kafola       87
Kofikola     93
Kofolisima   101
Source: author's

We have n = 87 + 93 + 101 = 281 and k = 3. The null hypothesis is H0: p1 = p2 = p3 = 1/3. The test criterion takes the form

T = \sum_{i=1}^{k} \frac{(X_i - np_i)^2}{np_i} = \frac{(87 - 281/3)^2}{281/3} + \frac{(93 - 281/3)^2}{281/3} + \frac{(101 - 281/3)^2}{281/3} = 1,053.

The critical value is K = CHIINV(0,05; 3−1) = 5,99. Therefore, we accept the null hypothesis about the equal consumption of all three drinks.

2.4.2 CHI-SQUARED TEST OF INDEPENDENCE

There is also another problem related to chi-squared testing, and a contingency table is constructed to solve it. Two variables are assumed: a variable A (sex status: male or female, for instance) and a second variable B (remuneration at work, for example). The variable A, a classification variable, exists in two forms, A1 and A2. Similarly, the variable B exists in s possible forms B1, B2, …, Bs, s ≥ 2. A contingency table (see table 9) is constructed.

Table 9: Contingency table for the chi-squared test of independence
Categories of A / B    B1     B2     B3     …     Bs     Sum
A1                     n11    n12    n13    …     n1s    n1.
A2                     n21    n22    n23    …     n2s    n2.
Sum                    n.1    n.2    n.3    …     n.s    n
Source: author's
The symbol nij stands for the number of cases when the variable A took on the value (category) Ai and, at the same time, the variable B reached the level (category) Bj. The symbol

n_{i\bullet} = \sum_{j=1}^{s} n_{ij}

expresses the number of cases when A fell in the i-th category, regardless of what category the variable B fell in, and the symbol n•j = n1j + n2j similarly represents the number of cases when B fell in the j-th category. These frequencies are called marginal frequencies. We want to know whether the two variables A and B are statistically independent.

The test procedure
1. We test, for a nivel of test α, the hypothesis H0: A and B are independent variables vs. the alternative hypothesis H1: A and B are not independent.
2. The test criterion is of the form

T = \sum_{j=1}^{s}\sum_{i=1}^{2} \frac{(n_{ij} - n_{ij}^{T})^2}{n_{ij}^{T}},

where n_{ij}^{T} = (n_{i\bullet}\cdot n_{\bullet j})/n are the theoretical (expected) frequencies. The values nij represent the empirical frequencies acquired by random sampling.
3. The critical value is K = χ²_{s−1}(α).
4. Conclusion of the test: if T ≥ K, H0 is rejected. In the opposite case, H0 is accepted.

PROBLEM 8
Let A be a variable = sex status of respondents and B = form of remuneration awarded to respondents in competitive sport events. Table 10 shows the numbers of randomly selected respondents falling into the different categories classified by the two variables.

Table 10: data on types of remuneration and sex status of respondents
Sex/Remuneration   Financial reward   Soft drink   Subtotal
Men                78                 42           120
Women              46                 34           80
Subtotal           124                76           200
Source: author's

For example, the number 78 is the number of respondents who were men and who said that they had received a financial reward for their sports achievement. The subtotals are calculated by the poll worker who is to process the data. Tables 11 and 12 contain the calculations of the theoretical frequencies and of the terms appearing in the sum of the test criterion, respectively. The symbol Eij stands for the empirical frequency and the symbol Oij denotes the corresponding theoretical frequency. The theoretical frequencies are calculated from table 10 by taking the appropriate row and column marginal frequencies (subtotals), multiplying them and dividing by the number of all respondents (which is 200). Therefore, we get, for instance, 74,4 = 120·124/200 for the first theoretical frequency, and a similar procedure applies to the other theoretical frequencies as well.

Table 11: theoretical frequencies
Oij      Fin. reward   Soft drinks
Men      74,4          45,6
Women    49,6          30,4
Source: author's

Table 12: terms for the test criterion
(Eij-Oij)^2/Oij   Fin. reward    Soft drinks
Men               0,174193548    0,284210526
Women             0,261290323    0,426315789
Source: author's

The final table 13 contains the test criterion T, the degrees of freedom df = s−1 of the test, and the critical value K for the 5% nivel of test.

Table 13: The test criterion T and the critical value K
T      1,14
alfa   0,05
df     s-1 = 2-1
K      3,84
Source: author's

T = 1,14 and the critical value K = CHIINV(0,05; 2−1) = 3,84. Since the test criterion is smaller than the critical value, we accept the hypothesis that there is no relation between the form of reward and the sex status of the rewarded.
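For readers who want to verify this computation outside Excel, the following short Python sketch (assuming the scipy library is available; the variable names are ours) reproduces the test on the data of table 10:

from scipy.stats import chi2, chi2_contingency

observed = [[78, 42], [46, 34]]           # empirical frequencies from table 10
T, p_value, df, expected = chi2_contingency(observed, correction=False)

K = chi2.ppf(1 - 0.05, df)                # critical value, the analogue of CHIINV(0,05; 1)
print(T, df, K)                           # approximately 1.15, 1 and 3.84
print(expected)                           # the theoretical frequencies of table 11

# T < K, so the hypothesis of independence is accepted

The option correction=False is used so that the function computes the plain chi-squared criterion defined above; by default it would apply a continuity correction for 2×2 tables and return a slightly smaller value.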
As a final note, the chi-squared test can be realized in Excel, using the function CHITEST(actual; expected) which has two parameters: the parameter „actual“ is a reference to the Excel spreadsheet area containing the empirical frequencies, whereas the parameter „expected“ is a reference to the spreadsheet area with the expected/theoretical frequencies. The function returns a p-value, which means that the conclusion of the test is constructed as follows: if the p-value is smaller that the nivel of test alpha (or equal), the null hypothesis is rejected; if the p-value is greater than alpha, the null hypothesis is accepted. CONTROL TEST 2 a. The data appearing in the table below represents the result of a random sampling related to a variable Y. Using the one-sample t-test, find out whether the population mean of Y is 17,8, the nivel of test being five per cent. As part of the calculations, state the value of the test criterion and the critical value. Will the conclusion of the test change if the nivel of test at ten per cent is used instead? Provide the critical value for the latter case. Y 16 15 17 18 19 14 13 Source: author’s b. Let the data on two variables Y and X is available: X 6 25 17 18 29 4 15 Y 16 15 17 18 19 14 34 Source: author’s Find out, using the F-test, whether both samples came from populations with the same variance, the nivel of test being set again at five per cent. State the test criterion and critical value as part of your calculations. Check your results against those provided by the Data Analysis module of Excel. c. Perform the two-sample t-test with equal variances and confirm or reject the hypothesis that the variables X and Y have the same population mean. d. A poll concerning cell phone trademarks that are popular among customers led to the results shown in the following table. Set the nivel of test at ten per cent, and test the validity of the hypothesis that 25% of all customers use the Mobil1 cell phones, 33% of Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 37 all the customers incline to the Mobil2 cell phones and the remaining 42% of the customers prefer the Mobil3 trademark. Again, state the test criterion and critical value. Number of users Mobil1 2340 Mobil2 3124 Mobil3 3000 Source: author’s e. Using the test of independence, verify whether severity of car crash depends on sex status of car driver. Do so with one per cent nivel of test. As part of your calculation, state the test criterion, the critical value and the test conclusion. Available are following data: Man Woman Minor accidents 134 127 Accidents of intermediate severity 254 301 Severe accidents 14 4 Source: author’s SOLUTIONS a. Test criterion = -2,2. Critical value = 2,44. The test criterion in absolute value is smaller than the critical value, implying that the null hypothesis is accepted. If the nivel of test was ten per cent, the critical value would be 1,94. In the latter case, the null hypothesis would be rejected. b. Test criterion = 1,78. Critical value = 4,28. The null hypothesis on variance equality is accepted. c. Test criterion = 0,63. Critical value = 3,05. The null hypothesis on equality of means is accepted. d. Test criterion = 43,25. Critical value = 4,6. The hypothesis on uniform distribution is rejected. e. Test criterion = 8,65. Critical value = 9,21. We accept the hypothesis that severity of car crash and car driver sex status are two independent variables. 
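If the reader wants to check the control-test calculations independently of Excel, part a, for instance, can be verified with a few lines of Python (a minimal sketch, assuming the scipy library is available):

from scipy.stats import t, ttest_1samp

y = [16, 15, 17, 18, 19, 14, 13]         # data on Y from part a
T, p_value = ttest_1samp(y, 17.8)        # one-sample t-test of H0: mu = 17,8

K_5 = t.ppf(1 - 0.05 / 2, len(y) - 1)    # critical value for the five per cent nivel of test
K_10 = t.ppf(1 - 0.10 / 2, len(y) - 1)   # critical value for the ten per cent nivel of test

print(T, K_5, K_10)                      # approximately -2.20, 2.45 and 1.94

Since |T| is smaller than K_5 but larger than K_10, the null hypothesis is accepted at the five per cent nivel of test and rejected at the ten per cent nivel of test, in agreement with the solution above.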
3 REGRESSION ANALYSIS

Regression analysis deals with the dependence of a quantitative variable on one or more quantitative variables. In the case of one variable depending on another variable, we talk about simple regression, as opposed to the case when there are more explanatory variables; in the latter case, we talk about multiple regression. In this chapter, the reader should deepen their knowledge of regression presented in the course Statistics [5], in particular as regards multiple regression. Elementary regression terms and conditions are presented at the beginning of the chapter. Further, a formula for the calculation of regression coefficients is derived, as well as a statistical test verifying the significance of the coefficients. At the end of the chapter, the statistical significance of the entire regression model is tested.

3.1 THE CONCEPT OF REGRESSION ANALYSIS

Regression analysis aims to find a mathematical relation – an equation which in a certain sense describes changes of a random variable Y dependent on changes of random variables X1, X2, …, Xk. We shall assume the standard case presented in the literature, i.e. the case when only some values of the variables X1, X2, …, Xk are known or available. These realizations of the random variables are denoted xij = the i-th value of the j-th variable Xj. The values are usually part of a controlled experiment in which the analyst defines/selects the values of X1, X2, …, Xk and then finds or measures the values of Y that correspond to them. The value of Y, measured or found for the i-th values of X1, X2, …, Xk, is denoted Yi.

To give an example, let Y = GDP, which is influenced by factors X1, X2, …, Xk. Different constellations of the factors will give different GDP values, and GDP itself remains a random variable, since it is almost certain that we will not be able to define k factors that describe it completely. Thus, GDP, and so the variable Y as well, will in general be a random variable, and its probability distribution will change as the levels of the factors X1, X2, …, Xk change. This is why we use the lower index i in the symbol Yi.

In this controlled experiment, which tries to find a concrete form of the relation between Y and the variables X1, X2, …, Xk on a specified subset of the set of all possible values of X1, X2, …, Xk, we assume that the relation takes the form

Y = f(X_1, X_2, \ldots, X_k) + \varepsilon.

Here, Y depends on the regression function f, which contains unknown parameters, and on the random term ε, which completes the description of the random behaviour of Y; the systematic part of the model, f(X1, X2, …, Xk), is not able to provide the full description of the behaviour of Y on its own. As was already outlined at the beginning of the chapter, the problem of finding an appropriate relation between the variables will be resolved for the case when the variable Y, the so-called dependent variable, depends on k independent variables, or on the vector X = (X1, …, Xk). The systematic part f(X1, X2, …, Xk) can take on different forms:

f(X_1, X_2, \ldots, X_k) = \beta_1 + \beta_2 X_1,
f(X_1, X_2, \ldots, X_k) = \beta_0 + \beta_1 X_1 + \beta_2 X_2^2,  etc.

The parameters β0, β1, β2 are unknown!
Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 39 If the systematic part satisfies ( ) ( ) ( ) ( )1 2 1 1 2 2, ,..., ...k k kf X X X f X f X f Xβ β β= + + + , we talk about linear regression (linear in parameters), or about linear regression model. We usually consider the model: 3-1 ( )1 2 1 1 2 2, ,..., ...k k kf X X X X X Xβ β β= + + + . We shall start with its simplest form in which f is a linear function of one independent variable: 3-2 ( )1 1 2 1f X Xβ β= + . Thus, we consider the relation 1 2 1 1( )Y X f Xβ β ε ε= + + = + . Our situation is depicted in figure 1. Figure 1: Regression dependence in the form of a line The graph shows that the behaviour of Y is determined by the systematic part of the model, i.e. by the function f(X1) which reflects the effect of a single variable on the variable Y. However, it does not suffice to use the function f(X1) to describe the behaviour of Y, and it is necessary to add the influence of other factors – those which are represented by the term ߝ. And what was just said is also true for any specific value of X1, of course: for a value xi1, for example. At the point xi1, equation Yi = f(xi1) + εi holds true, where f(xi1) is a specific value. Nevertheless, even though f(xi1) is a specific value, we don’t know this value because the expression f(xi1) depends on unknown parameters. We may even know the exact mathematical form of the expression (for instance, we may know that it is a line), and we still won’t be able to evaluate it. The objective of regression is to estimate the unknown parameters. To make the estimation, we need to have some data. For the estimation of a regression line, for instance, data ( ) ( ) ( )11 1 21 2 1, y , , y ,..., , yn nx x x are usually available. In other words, n points from a plane are available. The first coordinates 11 21 1, ,..., nx x x of these points are the specified values of the independent variable X1. These values are defined by the analyst, and the corresponding y’s are obtained later. In our case, we obtained a single value of Y for each value of X1. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 40 The estimate of the function ( )1 2 1 1 2 2, ,..., ...k k kf X X X X X Xβ β β= + + + , calculated for the i-th value of the variables 1 2, ,..., kX X X , is denoted as 1 1 2 2 ...i i i k ikY b x b x b x= + + + ) . If Y depends on two variables X1, X2, the points, obtained experimentally, will be of the form: (x11, x12 ,y1) (x21, x22 ,y2) ... (xn1, xn2 ,yn). These points lie in a three-dimensional space, and are interspersed with a function of the form 1 1 2 2Y b x b x= + ) . The function approximates the relation between Y and the variables X1, X2. More generally, if Y depends on k variables X2,…,Xk, we assume that the following points from a k+1-dimensional space are available: (x11, x12,…, x1k, y1) (x21, x22,…, x2k, y2) ... (xn1, xn2,…, xnk, yn) These points are interspersed with a hyperplane of the form 1 1 2 2 ... k kY b x b x b x= + + + ) . This function approximates the relation between Y and the variables X1, X2,…, Xk. To justify the procedures to be explained, which result in an estimation of the unknown regression coefficients, it is imperative that the following conditions are satisfied. The conditions are related to the random part ε of the regression model: 1. Expected value of εi is zero, i.e. E(εi) = 0 for each i. 2. Variance of εi is constant, independent of i, i.e. Var(εi) = σ2 for each i. 3. Variables εi and εj are not correlated, i.e. Cov(εi, εj) = 0 for i ≠ j. 4. 
Variables εi ’s are normally distributed, i.e. εi ∼ N(0, σ2 ) for each i. As is usually the case, expected value is denoted as E, variance is denoted as Var and covariance uses the symbol Cov. If the reader forgot the symbols, we recommend revision of the foundations of statistics contained in the course Statistics. 3.2 ESTIMATION OF REGRESSION COEFFICIENTS We assume a regression function of the form ( )1 2 0 1 1, ,..., ...k k kf X X X X Xβ β β= + + + . We determine the i-th value of each explanatory variable and obtain a vector (xi1, xi2,…, xik). We do so for i = 1, 2,…, n, so we end up with n vectors (or points). We then find or measure a particular value of Y for each vector (xi1, xi2,…, xik), i = 1, 2,…, n. Since there are n vectors, we shall obtain n values of Y. This is all we have to make the estimation. Therefore, aside from the conditions 1-4, this is what greatly affects the quality of the resulting estimates. Of course, we talk about estimates because we only work with a data sample. The vector of unknown regression parameters 0( ,..., )kβ β β= r corresponds to the set of all possible points of Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 41 the form (xi1, xi2,…, xik, yi). Estimating the unknown parameters, we end up with an estimator 0( ,..., )kb b b= r . The vector 0( ,..., )kb b b= r is obtained by 3-3 ( ) 1 . .T T T b X X X Y − = r r , where X, the matrix of regressors, satisfies 3-4 11 12 1 21 22 2 1 2 1 ... 1 ... ... ... ... ... ... 1 ... k k n n nk x x x x x x X x x x      =       , and 3-5 1 2(Y ,Y ,...,Y )T nY = r . Symbol ZT denotes the transposition of matrix Z, Z-1 means the inverse of Z. To evaluate 3-3, the following data, as mentioned previously, must be available (x11, x12,…, x1k, y1) ... (xn1, xn2,…, xnk, yn). PROBLEM 1 Estimate dependence of electricity consumption Y on power-supply distance X1 and amount of electricity supplied X2. The regression function is assumed to be of the form ( )1 2 0 1 1 2 2, .f X X X Xβ β β= + + Available are the following data: Table 14: data for problem 1 X1 X2 Y 1,2 3,6 3,2 1,3 3,7 3,3 1,3 3,8 3,4 1,4 3,8 3,5 1,4 3,9 3,6 1,5 3,9 3,6 1,5 4 3,7 1,6 4 3,8 1,6 4,1 3,9 1,7 4,2 4 Source: author’s Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 42 - SOLUTION Table 14 represents points which are used to construct the matrices X and Y: 1 1,2 3,6 1 1,3 3,7 1 1,3 3,8 1 1,4 3,8 1 1,4 3,9 1 1,5 3,9 1 1,5 4 1 1,6 4 1 1,6 4,1 1 1,7 4,2 X               =                  3,2 3,3 3,4 3,5 3,6 3,6 3,7 3,8 3,9 4 Y               =                  r We shall now calculate the vector 0 1 2 b b b b     =       r in several steps, using 3-3: 1 1,2 3,6 1 1,3 3,7 1 1,3 3,8 1 1,4 3,8 1 1 1 1 1 1 1 1 1 1 10 14,5 39 1 1,4 3,9 1,2 1,3 1,3 1,4 1,4 1,5 1,5 1,6 1,6 1,7 14,5 21,25 56 1 1,5 3,9 3,6 3,7 3,8 3,8 3,9 3,9 4 4 4,1 4,2 1 1,5 4 1 1,6 4 1 1,6 4,1 1 1,7 4,2 T X X                ⋅ = ⋅ =                    ,8 . 39 56,8 152,4          ( ) 1 245,2 108 103 108 60 50 . 103 50 45 T X X − −   ⋅ = −   − −  Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 43 - 3,2 3,3 3,4 3,5 1 1 1 1 1 1 1 1 1 1 36 3,6 1,2 1,3 1,3 1,4 1,4 1,5 1,5 1,6 1,6 1,7 52,56 . 3,6 3,6 3,7 3,8 3,8 3,9 3,9 4 4 4,1 4,2 140,82 3,7 3,8 3,9 4 T X Y                   ⋅ = ⋅ =                        Using 3-3, we now obtain: ( ) 0 1 1 2 245,2 108 103 36 0,78 108 60 50 52,56 0,60 . 
103 50 45 140,82 0,90 T T T b b X X X Y b b − − −               = ⋅ ⋅ ⋅ = − ⋅ = =               − −        r r Therefore, the resulting regression function is 1 20,78 0,60 0,90Y x x= − + + r . Theoretical values Theoretical values 1 2, ,..., nY Y Y ) ) ) are calculated from the relation 1 20,78 0,60 0,90Y x x= − + + ) by inserting specific values x1 and x2 to the equation: 1 0,78 0,60 1, 2 0,90 3,6 3,18Y = − + ⋅ + ⋅ = ) , 2 0, 78 0,60 1,3 0,90 3,7 3,33Y = − + ⋅ + ⋅ = ) , ... 10 0, 78 0, 60 1, 7 0,90 4, 2 4, 02Y = − + ⋅ + ⋅ = ) . In matrix form, the resulting values can be written as 1 1,2 3,6 3,18 1 1,3 3,7 3,33 1 1,3 3,8 3,42 1 1,4 3,8 3,48 0,78 1 1,4 3,9 3,57 0,6 1 1,5 3,9 3,63 0,9 1 1,5 4 3,72 1 1,6 4 3,78 1 1,6 4,1 3,87 1 1,7 4,2 4,02 T Y X b                        −      = ⋅ = ⋅ =                                   r) 1 2 10 ˆ ˆ . . . . . . . ˆ y y y               =                   Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 44 Vector of residuals The difference between the theoretical value and empirical value is called residual. In vector form, the difference e Y Y= − r )r represents the vector of residuals. In our example, we have: 3,2 3,18 0,02 3,3 3,33 0,03 3,4 3,42 0,02 3,5 3,48 0,02 3,6 3,57 0,03 3,6 3,63 0,03 3,7 3,72 0,02 3,9 3,78 0,02 3,9 3,87 0,03 4 4,02 0,02 e Y Y          −         −                = − = − =     −                              −     r )r 1 2 10 . . . . . . . e e e               =                        The differences are calculated by substracting the corresponding vector coordinates. Variance of the estimates of regression coefficients Since finding regression coefficients results in estimates of the unknown population coefficients, it is convenient to introduce variances of the estimates. The variances reflect the precision of the estimates, and can be found on the main diagonal of the matrix 3-6 ( ) 12 ( ) ,T Var b s X X − = ⋅ ⋅ r where 2 2 1 n i i e s n k = = − ∑ is an estimate of the variance of ߝ. Also, ie = i-th residual, n = sample size (number of points we work with), k = number of parameters in the regression model. For our example, we have: 2 2 1 0,006 0,0008571. 10 3 n i i e s n k = = = = − − ∑ Thus, ( ) . 0386,00429,00883,0 0429,00514,00926,0 0883,00926,02102,0 4550103 5060108 1031082,245 0008571,0)( 12           − − − =           −− − − ⋅=⋅⋅= − XXsbVar T r The main diagonal of the rightmost matrix contains the coefficient variances: Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 45 - s2 (b0) = 0,2102, and the standard deviation estimate s(b0) = 0,4584. s2 (b1) = 0,0514, and the standard deviation estimate s(b1) = 0,2267. s2 (b2) = 0,0386, and the standard deviation estimate s(b2) = 0,1965. When the regression model and the variances of the coefficients are estimated, the result is usually expressed in the form of an equation with the estimated standard deviations, i.e. the square roots of the estimated variances, written under the estimated coefficients that appear in the equation: 1 20, 7 8 0, 6 0 0, 9 0Y x x= − + + ) (0,4584) (0,2267) (0,1965) It may happen that the order of magnitude of the differences among the coefficients can be quite substantial. For instance, it can be the case that b1 = 200 and b2 = 0,02. 
It is then reasonable to ask the question whether there is any point in adding the small coefficient to the model. To find the answer to this question, one can use the following statistical test. 3.3 TESTING SIGNIFICANCE OF REGRESSION COEFFICIENTS The structure of the test is as follows: 1. The tested hypothesis is: H0: βi = 0, H1: βi ≠ 0. 2. The test criterion is ( ) i i b T s b = , where bi is the estimate of βi, s(bi) is the estimated standard deviation of bi . 3. The critical value K = tn-k(α) for a nivel of test α. 4. If T > K, we reject the hypothesis H0 and accept the alternative hypothesis H1 according to which βi is considered to be nonzero or statistically significant. In the opposite case, the null hypothesis is accepted, and the tested parameter is thought to be equal to zero or statistically insignificant. In our case, we have 1 1 1 0,60 2,65 ( ) 0,0514 b T s b = = = , 2 2 2 4,58 ( ) b T s b = = , Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 46 where tn-k(α) = t10-3 (0,05) = 2,365. Since T1 > 2,365 and also T2 > 2,365, both population coefficients are statistically significant, and should be considered in the model. As was said before, one needs to have a sample of points (xj1, xj2,…, xjk, yj) to find the vector of coefficients ( )kbbb ,...,1= r , the estimate of the coefficients ( )1,..., kβ β β= r . In other words, the vector of observations Y and the matrix of regressors X must be known. In practice, tha analyst has to decide how to select the values xij and how many of them should be included in the model. Answers to these questions have a major impact on the final estimates of the coefficients because the subsequent calculations are only a routine procedure. There is a statistical discipline which, among other things, aims to answer these questions. The discipline is called design of experiments. We shall analyse some fundamental designs of experiments in later chapters of this textbook. 3.4 CONFIDENCE INTERVALS FOR REGRESSION COEFFICIENTS Confidence intervals for the parameters β1,…,βk are intervals in which the parameters lie with probability 1-α. Each of these intervals is of the form 3-7 )]().(),().([ ipniipni bstbbstb αα −− +− , where bi = estimate of βi, s(bi) = estimated standard deviation of bi, tn-p(α) = critical value of a Student’s distribution, n = sample size (number of points), p = number of parameters in the model, α = nivel of test, The unknown parameter βi lies in the interval 3-7 with probability 1-α. It is necessary that the random term ߝ of the regression model is normally distributed for the interval to hold true. 3.5 TESTING MODEL SIGNIFICANCE Significance of the regression model can be verified by the following test, which again requires that the random term of the model is normally distributed. The structure of the test is as follows: 1. The hypothesis is: 0...:H 210 ==== kβββ , or, using vectors, 0:H0 rr =β . 0:H1 rr ≠β . 2. The test criterion: , )1/( )/(ˆ −− = knS kS T e Y Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 47 - where 2 ˆ 1 1 1 ( ) , n n i iY i i S Y Y Y Y n= = = − =∑ ∑ ) , 2 2 1 1 ( ) . n n e i i i i i S Y Y e = = = − =∑ ∑ ) 3. The critical value is K = )(1, α−−knkF , where )(1, α−−knkF is the critical value of a Fisher’s distribution with df1 = k a df2 = n-k-1 degrees of freedom. In Excel, one can obtain the value with the function FINV(α; df1; df2). 4. If KT ≥ , H0 is rejected. In the opposite case, H0 is accepted. In our example, we have: 5,346 )310/(006,0 )2/(594,0 = − =T , K = F2,7(0,05) = 4,73. 
Since T exceeds the critical value K, H0 is rejected and the model is viewed as satisfactory, i.e. we reject the hypothesis that all regression coefficients except the constant term β0 are zeros. SUMMARY This chapter dealt with regression analysis which aims to find a relation between a quantitative variable Y, or an explained/dependent variable, and other quantitative variables called explanatory/independent variables. In particular, we were concerned with linear regression models (linear in parameters). At the end of the chapter, different statistical tests were presented, regarding statistical significance of the regression coefficients and the model as a whole, as well as confidence intervals for the estimates of the coefficients. The text was accompanied by examples. The following terms were explained: linear regression, estimates of regression coefficients, theoretical value, residual, variance and standard deviation of regression coefficients, test of regression coefficients, test of the model, confidence intervals for regression coefficients. Some more examples follow. PROBLEM 2 a) Estimate the regression coefficients of the model 0 1 1 2 2Y b b x b x= + + ) , b) Calculate the theoretical values of Y, c) Calculate the residuals of the model, d) Calculate the variance of the coefficient estimates, e) Test significance of the coefficients. f) For the following entry values       = 5121 3111 0X , predict the dependent variable Y0. Perform the tasks a)-f), using the following data: Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 48 y x1 x2 10 1 0 25 3 -1 32 4 0 43 5 1 58 7 -1 62 8 0 67 10 -1 71 10 2 Source: author’s SOLUTION a. The estimates of the coefficients:           = 35 2710 368 YX T ,           =                     −− −− − = 26,0 59,6 47,6 35 2710 368 . 60840240 4064384 2403842887 4664 1T b r . b. The theoretical values: ( )13,06 , 25,98 , 32,83, 39,68 , 52,34 , 59,19 , 72,11, 72,89Y = ) . c. The residuals: ( )89,1,11,5,81,2,66,5,32,3,83,0,98,0,06,3 −−−−−=e r . d. The variances of the coefficient estimates: Since 65,912 =∑i ie ,           =           − −− − − = 39,2...... ...25,0... ......35,11 60840240 4064384 2403842887 4664 1 . 38 65,91 )(bVar r . e. The test of coefficients: Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 49 - 92,1 37,3 47,6 )0( 0 0 === bs b T , 18,13 5,0 59,6 1 ==T , T2 = 0,17 . K = 571,2)05,0()( 38 == −− tt pn α . Statistical significance pertains only to 1β because T1 > K. f. Two predictions of Y0 (at two different points of X): Since       = 5121 3111 0X ,       =                 = 85,86 74,79 26,0 59,6 47,6 . 5121 3111 0Y . PROBLEM 3 Find out whether production depends on corporate investments. A potential dependence is reflected by the parameter 1β in the regression function with two unknown parameters. We know, based on twelve data, that the estimate of 1β is 1 2,1622.b = We also know the standard deviation of the estimate is 1( ) 0,615516.s b = Verify or reject the existence of the dependence by testing the hypothesis: 1 0β = . The nivel of test is five per cent. SOLUTION Since 1 1 , ( ) b T s b = we get 2,1622 3,513. 0,615516 T = = The appropritate critical value, related to a Student’s distribution with 12 – 2 = 10 degrees of freedom, is 10 (0,05) 2,228t = . Since 3,513 > 2,228, we reject the null hypothesis about the zero value of the coefficient. 
Thus, the coefficient is statistically significant, and the dependence may be assumed to exist. PROBLEM 4 (HOTEL SERVICES) Find the linear regression model which describes a dependence of total monthly revenues Y (in tens of thousands of crowns) of a hotel on revenues 1X generated by the catering services Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 50 of the hotel (in tens of thousands of crowns) and on 2X which is a product of the number of beds at the hotel and the number of days in a given month. The data is in the following table. Data Y 1x 2x 12,0 2,0 150 8,0 1,2 94 76,4 14,8 811 17,0 8,3 254 21,3 8,4 399 10,0 3,0 95 12,5 4,8 149 97,3 15,6 312 88,0 16,1 952 25,0 11,5 247 38,6 14,2 400 47,3 14,0 312 Source: author’s SOLUTION The vectors for the dependent and independent variables have the following form: 12,0 1 2,0 150 8,0 1 1,2 94 76,4 1 14,8 811 17,0 1 8,3 254 21,3 1 8,4 399 10,0 1 3,0 95 , 12,5 1 4,8 149 97,3 1 15,6 312 88,0 1 16,1 952 25,0 1 11,5 247 38,6 1 14,2 400 47,3 1 14,0 312 Y X                             = =                              .                 Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 51 - Therefore ( ) ( ) 1 1 12,0 113,90 4175,0 453,4 113,9 1428,43 51958,5 , 6006,8 , 4175,0 51958,50 2266001 230647,8 0,343 0,02629 0,0003 0,026 0,006234 0,000094 , 0,00003 0,000094 0,00000266 T T T T T X X X Y X X X X X − −        = =           − −   = − −   − −  ⋅( ) 9,126450 3,729273 . 0,033091 T Y b −   = =      r The regression function is 1 29,126450 3,729273 0,033091 .Y x x= − + + ) PROBLEM 5 Test statistical significance of the coefficients from the previous example: 21,ββ . The nivel of test is five per cent. SOLUTION First of all, the standard deviations of the coefficients are calculated. Their values are: .0283,0)(,371,1)( 21 == bsbs The corresponding test criterion is )( j j bs b T = , and so .1693,1 0283,0 033091,0 ,7201,2 371,1 729273,3 21 ==== TT The critical value of the test is found for a Student’s distribution with 12 – 3 = 9 degrees of freedom and the five per cent nivel of test: .26,2)05,0(9 =t Comparing the test criterions with the critical value, we see that the null hypothesis concerning the zero value of the coefficient 1β is rejected. The case of the other parameter is different, however. The parameter 2β seems to be insignificant and the second variable should be removed from the model. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 52 CONTROL TEST 3 Yes/No answers: 3.1 Regression analysis examines dependence among quantitative variables? 3.2 Deviation of an empirical value of Y from its theoretical value, modelled by a regression function, is called residual? 3.3 Regression analysis deals only with linear functions? 3.4 The test of significance of regression coefficients requires the critical value of a normal distribution? 3.5 The null hypothesis in the test of model significance is: 0...:H 210 ≠≠≠≠ kβββ ? 3.6 The classical regression model assumes that random terms in the model have _______ expected value and __________ variance. 3.7 The test examining the zero value of an individual regression coefficient is called __________ 3.8 If the model 0 1 1 ... k kY b b x b x= + + + ) contains the term 0b , the first column of the matrix of regressors X consists of value(s) __________ 3.9 Regression analysis exploits dependence among__________ variables. 
3.10 Variances of estimated regression coefficients can be found on __________ __________ of the matrix ( ) ( ) 12 − = XXsbVar T r . 3.11 A personnel department gathered the following data on age (X) of 20 randomly selected employees and the amount of time (Y) they spent out of work due to health reasons. x y x y 20 4 58 20 35 14 46 13 35 15 43 16 34 10 33 10 32 10 29 10 28 9 36 11 25 12 48 14 46 15 55 15 38 15 36 14 50 16 19 6 Source: author’s Estimate regression coefficients of the model 0 1Y b b x= + ) . 3.12 A statistical office examined dependence of yearly savings on yearly income. Both variables are related to families with two children. The result of the survey is in the following table. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 53 - Income (thousands of crowns) 104 125 1 146 1 167 1 111 1 135 1 189 1 196 2 205 2 210 1 170 230 Savings (thousands of crowns) 6 5,6 9 9,2 1 14 8 8 9 9,1 2 20,5 2 29 2 23,2 3 38,5 2 25 40 Source: author’s Find the linear regression model that explains a dependence of savings on income. Using the model, estimate the savings of a family whose yearly income is 205 thousand crowns. 3.13 Eight families were randomly selected from national accounting records. Their gross yearly income (= explanatory variable x, measured in crowns) and their yearly expenses on industrial products (= explained variable Y, measured in crowns) were analysed. The results are in the table. x 211399 306502 250251 264138 274060 297046 328645 249987 Y 42276 72341 49852 53827 54914 60409 71729 47997 Source: author’s a. Estimate the linear regression function which describes a dependence of expenses on income. b. Calculate the theoretical expenses of a family with income exceeding 300 thousand crowns. 3.14 a. Using the data of problem 3.13, calculate the standard deviation of the estimates ,ib i = 0, 1. b. Using the data of problem 3.13, calculate the test criterion used to test the statistical insignificance of 1b . 3.15 The following data is available on the production of France (= variable Y in millions of euros), its amount of fixed capital (= variable X1 in millions of euros) and employment (= variable X2 in thousands of people). Estimate regression coefficients of the model 2211021 ),( xxXXf βββ ++= , where 1 2( , )Y f X X ε= + . Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 54 Economic sector Yi 1ix 2ix Agriculture 288443 18781 1055 Food and drinks 393828 13990 551 Power 330300 33813 223 Semi-products 602182 32022 1101 Production equipment 426720 19520 965 Household goods 34008 1258 49 Transportation means 185887 10462 358 Consumer products 427766 16392 1030 Construction 436926 19828 1472 Trade 495319 36354 2691 Transportation 417147 58196 1268 Market services 1002132 116083 4617 Insurance services 61827 2053 158 Financial services 709297 6908 441 Non-market services 840622 136923 6148 Source: author’s SOLUTIONS 3.1 yes 3.2 yes 3.3 no 3.4 no 3.5 no 3.6 zero, constant 3.7 t-test 3.8 ones 3.9 quantitative 3.10 the main diagonal 3.11 1,394 0,296Y x= + ) , 3.12 26,399 0,274Y x= − + ) ; 29 711 crowns 3.13 a. 19599,4 0,2796 ,Y x= − + ) b. for 300000 ,ix crowns= we get 64298iY = ) crowns. At least 64298 is the answer. 3.14 a. 1( ) 0,03375,s b = b. The test criterion T = 8,284, the critical value ⇒= 447,2)05,0(6t we accept the hypothesis H1 on dependence of expenses on income. 
3.15 Ŷ = 263684,7 + 2,2331x1 + 66,7912x2.

4 CORRELATION ANALYSIS

In the previous chapter, we were solving the problem of finding a functional relation which would describe the dependence of one variable Y on other, explanatory variables represented by a vector X. Mathematically speaking, the relations were linear in parameters. In this chapter, we will be preoccupied with the problem of measuring the intensity of dependence among variables. There is more than one way to do it. Perhaps the simplest way of measuring dependence is what is called correlation analysis. Correlation analysis is closely related to regression [2], as it benefits from the theory of linear regression models. The objective of correlation is different, however. It does not seek a reasonable form of the relations among variables, because it a priori assumes that the relations are linear in parameters and even in the variables as well. Instead, it focuses on the construction of measures of dependence among the variables. The chapter is accompanied by examples to give a better understanding of the subject. After studying correlation, the reader is advised to work out the problems at the end of the chapter.

4.1 CORRELATION COEFFICIENT

In the simplest form, we study the dependence between two random variables Y and X. In this case, the paired correlation coefficient ρxy is used to measure the level of linear dependence between the two variables. The coefficient is defined as

4-1   \rho_{xy} = \frac{Cov(X, Y)}{\sigma(X)\cdot\sigma(Y)}   for σ(X) > 0, σ(Y) > 0, and ρxy = 0 otherwise.

Here, Cov(X,Y) = E(XY) − E(X)·E(Y) is the covariance of the random variables X and Y, a characteristic defined in chapter one. Also, σ(X) and σ(Y) are the standard deviations of X and Y, respectively. The symbol E stands for the expected value of a random variable. Expected value was explained in the course Statistics.

The paired correlation coefficient is an element of the closed interval [−1,1], i.e. ρxy ∈ [−1,1]. If ρxy = 0, we say that the variables X and Y are uncorrelated. If ρxy = 1 or ρxy = −1, an exact functional relation exists between X and Y, the function being a line. If ρxy = 1, the line is increasing; if ρxy = −1, the line is decreasing. If ρxy = 0, we can only conclude that the variables are uncorrelated. We cannot say that they are (statistically) independent. Although independent variables are uncorrelated, the opposite statement is not generally true.

PROBLEM 1
Let us calculate the correlation coefficient ρxy for the data in table 15:

Table 15: Entry data for correlation analysis
X   -2   -1   0   1   2
Y    4    1   0   1   4
Source: author's

All the pairs occur with the same probability p. Table 16 adds some preparatory calculations to make things easier.

Table 16: Preliminary calculations
Xi         Yi         Xi.Yi
-2         4          -8
-1         1          -1
0          0          0
1          1          1
2          4          8
Σxi = 0    Σyi = 10   Σxi.yi = 0

We have

Cov(X, Y) = E(XY) - E(X)\cdot E(Y) = p\sum_i x_i y_i - p^2 \sum_i x_i \sum_i y_i = 0,

and thus ρxy = 0. At the same time, it can be seen that the two variables are not independent – on the contrary, they are even perfectly dependent, one of the variables being the second power of the other. Formula 4-1 defines the population (theoretical) correlation coefficient, which in most cases cannot be calculated, since the population characteristics Cov(X,Y), σ(X) and σ(Y) will most likely be unknown.
The example above is devised artificially, of course. Because the population characteristics are rarely known in practice, the population coefficient is usually estimated by its sample version r_xy, which is calculated from a data sample. The sample correlation coefficient is defined by

4-2  r_xy = (n·Σxᵢyᵢ − Σxᵢ·Σyᵢ) / √[(n·Σxᵢ² − (Σxᵢ)²)·(n·Σyᵢ² − (Σyᵢ)²)].

To decide reasonably whether there is any amount of linear dependence between Y and X, the correlation is tested for significance. The null hypothesis of the test states that ρ_xy = 0, and the alternative hypothesis says the opposite – there exists a nonzero correlation. To perform the test, the sample correlation coefficient is used.

Testing zero value of the paired population correlation:
1. The null hypothesis is H0: ρ_xy = 0 vs. the alternative H1: ρ_xy ≠ 0.
2. The test criterion is of the form T = r_xy·√(n − 2) / √(1 − r_xy²), where n = number of sample pairs (xᵢ, yᵢ).
3. The critical value of the test for a nivel of test alpha is K = t_{n−2}(α). Thus, it concerns a Student’s distribution with n−2 degrees of freedom.
4. If |T| < K, H0 is accepted, i.e. Y is not linearly dependent on X. In the opposite case, H1 is accepted, which means that Y is (at least to a certain extent) linearly dependent on X.

Let us note that there are conditions that must be satisfied for the test to be valid: the main condition is that the pairs were drawn from a two-dimensional normal distribution.

PROBLEM 2
Let us have the following sample pairs made up of values xᵢ and yᵢ (the first two columns of table 17):

Table 17: Result of a random sampling, preparatory calculations
xi    yi    xi·yi   xi²   yi²
−2    −5    10      4     25
−1    −3    3       1     9
 0     0    0       0     0
 1     1    1       1     1
 2     4    8       4     16
Σ = 0  Σ = −3  Σ = 22  Σ = 10  Σ = 51
Source: author’s

Using 4-2, we obtain

r_xy = (5·22 − 0·(−3)) / √[(5·10 − 0²)·(5·51 − (−3)²)] = 0,9918.

This value is a clear signal that there probably is a linear relationship between Y and X. However, the sample size is very small, and so we prefer to test the (in)significance of the correlation, using a one per cent nivel of test.

T = 0,9918·√(5 − 2) / √(1 − 0,9918²) = 1,718 / 0,128 = 13,443.

Since α = 0,01, the critical value is K = t_{5−2}(0,01) = TINV(0,01; 3) = 5,84. As T > K, there is a significant (nonzero) linear dependence of Y on X. This is also confirmed by the p-value of the test, which is TDIST(13,443; 3; 2) = 0,00089. The value is substantially smaller than the nivel of test, suggesting that the correlation is significant for nivels of test 0,00089 and higher, i.e. for all reasonable nivels of test.

4.2 CORRELATION INDEX

If the regression function, based on which the correlation of two variables is assessed, isn’t linear, it is possible to measure the dependence of the two variables with the correlation index:

4-3  I_xy = S_Ŷ / S_Y, where S_Ŷ = √( Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² ), S_Y = √( Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² ).

The symbols were already used in regression analysis. The calculation of I_xy is more laborious than that of r_xy because the regression function has to be found first, so that the theoretical values Ŷᵢ of the dependent variable are available. The values Yᵢ are measured, and their average is Ȳ. The theoretical values Ŷᵢ correspond to the values of the explanatory variable X appearing in the regression, as was described in the chapter on regression analysis: Ŷᵢ = f(xᵢ). The index satisfies 0 ≤ I_xy ≤ 1.
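The calculations of Problem 2 and of formula 4-3 can be reproduced with a few lines of Python. The sketch below is an illustration added to this text: it evaluates 4-2 and the test criterion for the data of table 17, and then computes the correlation index 4-3 for a regression line fitted by least squares (numpy.polyfit is used for the fit).

import math
import numpy as np

x = [-2, -1, 0, 1, 2]
y = [-5, -3, 0, 1, 4]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

# formula 4-2: sample paired correlation coefficient
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# test criterion T = r * sqrt(n-2) / sqrt(1 - r^2); compare |T| with t_{n-2}(alpha)
T = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(r, 4), round(T, 2))    # r = 0.9918, T = 13.47 (the text gets 13.443 from the rounded r)

# formula 4-3: correlation index for a regression line fitted by least squares
b1, b0 = np.polyfit(x, y, 1)                 # slope and intercept of the line
y_hat = [b0 + b1 * a for a in x]             # theoretical values
y_bar = sum(y) / n
I = math.sqrt(sum((v - y_bar) ** 2 for v in y_hat) /
              sum((v - y_bar) ** 2 for v in y))
print(round(I, 4))                           # 0.9918, i.e. equal to |r| for a line

The last printed value illustrates the statement made below: for a regression line, the index coincides with the absolute value of the paired correlation coefficient.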
Discussions about the resulting value of the index are similar to those about r_xy; testing its significance is not performed, however. I_xy can also be used in the case of a regression line. It is then identical to the absolute value of the paired correlation coefficient r_xy.

4.3 SPEARMAN’S RANK CORRELATION COEFFICIENT

If only the ranks of two variables X, Y are known, not their original values, Spearman’s rank correlation coefficient r_S is used instead to measure the dependence of the variables:

4-4  r_S = 1 − 6·Σᵢ dᵢ² / (n·(n² − 1)).

Here, dᵢ is the difference of the i-th ranks of X and Y, and n is the number of numerical pairs of the two variables that are available through a data sampling.

PROBLEM 3
Products were sorted by their quality, the sorting having been implemented by two committees: specialists were on one committee, laymen selected from the general public were on another committee. Determine whether the resulting assessments of the product quality depend on which committee is considered for this purpose. Here, the dependence is understood in the sense of a correlation. The entry data are in table 18, together with the differences in the rankings.

Table 18: Product rankings
Product   Ranks by laymen   Ranks by specialists   di    di²
1         7                 8                      −1    1
2         9                 9                       0    0
3         8                 7                       1    1
4         10                10                      0    0
5         6                 6                       0    0
6         5                 4                       1    1
7         3                 5                      −2    4
8         4                 3                       1    1
9         2                 2                       0    0
10        1                 1                       0    0
Source: author’s

r_S = 1 − 6·Σᵢ dᵢ² / (n·(n² − 1)) = 1 − 6·8 / (10·99) = 0,95.

The coefficient can be used to test the statistical independence (not only correlation!) of the two variables:
1. The tested hypothesis is H0: X, Y are independent vs. H1: X, Y are not independent.
2. The test criterion is T = √(n − 1)·r_S.
3. The critical value K is that of the standard normal distribution N(0,1). For a nivel of test alpha, the value is calculated with the Excel function NORMSINV(1 − alpha/2).
4. If |T| ≥ K, we reject H0. In the opposite case, we accept H0.

If H0 is accepted, we know the variables are independent, and thus uncorrelated as well. If the null hypothesis is rejected, we know the variables are not independent, but we do not know whether they are uncorrelated or not. The test is approximately valid provided that n ≥ 30 and the random vector (X,Y) follows a two-dimensional continuous probability distribution.

4.4 MULTIVARIATE DEPENDENCE - THE CASE OF TWO VARIABLES

If we want to examine the linear dependence of a variable Y on variables X1, X2, ..., Xp, p > 1, we use either:
a. coefficients of partial correlation,
b. coefficients of multivariate correlation.

Ad a. The partial correlation coefficient r_{yx1·x2...xp} measures the intensity of the linear dependence of Y on X1 provided a certain effect of the variables X2, ..., Xp is removed. These are the variables listed behind the symbol „·“. Partial correlation tries to solve the problem that the effect of X1 might be distorted by the contemporaneous effects of the variables X2, ..., Xp. We shall restrict our analysis to the case of p = 2. The coefficient of partial correlation appears again in two forms – as a population coefficient and a sample coefficient. In the latter case, and for p = 2, the sample version is calculated as

4-5  r_{yx1·x2} = (r_{yx1} − r_{yx2}·r_{x1x2}) / √[(1 − r_{yx2}²)·(1 − r_{x1x2}²)],

or

4-6  r_{yx2·x1} = (r_{yx2} − r_{yx1}·r_{x1x2}) / √[(1 − r_{yx1}²)·(1 − r_{x1x2}²)].

In both cases, the coefficients take on a value from the interval [−1,1].
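Formulas 4-5 and 4-6 are easy to evaluate once the three paired correlations are available. The following Python helper is a minimal sketch added for illustration; the sample values fed into it are the paired correlations of Problem 4 below, so the results can be checked against the solution given there.

import math

def partial_corr(r_yx1, r_yx2, r_x1x2):
    """Sample partial correlations 4-5 and 4-6 for the case p = 2."""
    d1 = math.sqrt((1 - r_yx2 ** 2) * (1 - r_x1x2 ** 2))
    d2 = math.sqrt((1 - r_yx1 ** 2) * (1 - r_x1x2 ** 2))
    return ((r_yx1 - r_yx2 * r_x1x2) / d1,     # r_{yx1.x2}
            (r_yx2 - r_yx1 * r_x1x2) / d2)     # r_{yx2.x1}

print(partial_corr(0.85, 0.75, 0.73))          # approximately (0.67, 0.36)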
As can be seen from 4-5 and 4-6, to calculate the partial correlation, it is necessary to evaluate various combinations of paired correlations. Since we talk about a sample partial correlation, it is possible to test the significance of its population version.

Testing the statistical significance of a partial correlation (p = 2):
1. H0: ρ_{yx1·x2} = 0, H1: ρ_{yx1·x2} ≠ 0.
2. The test criterion is T = r_{yx1·x2}·√(n − 3) / √(1 − r_{yx1·x2}²).
3. The critical value for a nivel of test alpha is K = t_{n−3}(α) = TINV(α, n−3).
4. If |T| ≥ t_{n−3}(α), the coefficient of partial correlation is significant, i.e. nonzero.

The test is valid provided the random vector (Y, X1, X2) follows a three-dimensional normal distribution. It is also assumed that n > 3.

Ad b. The coefficient of multiple correlation measures the dependence of a variable Y on all the explanatory variables X1, X2, ..., Xp. If two explanatory variables are considered, the sample version of the coefficient satisfies

4-7  r_{y·x1x2} = √[ (r_{yx1}² + r_{yx2}² − 2·r_{yx1}·r_{yx2}·r_{x1x2}) / (1 − r_{x1x2}²) ],  0 ≤ r_{y·x1x2} ≤ 1.

The significance of the coefficient can also be tested.

Testing the statistical significance of a multiple correlation:
1. H0: ρ_{y·x1x2} = 0 vs. H1: ρ_{y·x1x2} ≠ 0.
2. The test criterion: T = [r_{y·x1x2}²·(n − 3)] / [2·(1 − r_{y·x1x2}²)].
3. The critical value of the test is related to a Fisher’s distribution this time. The distribution has 2 and n−3 degrees of freedom: the critical value is written as F_{2,n−3}(α) for a nivel of test alpha. In Excel, it can be obtained with the function FINV(α, 2, n−3).
4. If T ≥ F_{2,n−3}(α), the coefficient of multiple correlation is statistically significant (the null hypothesis is rejected). In the opposite case, it is statistically insignificant (the null hypothesis is accepted).

The test is valid on condition that the random vector (Y, X1, X2) follows a three-dimensional normal distribution. It is also assumed that n > 3.

SUMMARY

In this chapter, we became familiar with another important statistical term: correlation analysis. We learnt how to compute the correlation coefficient, the correlation index and Spearman’s rank correlation coefficient. The end of the chapter discussed partial correlation and multiple correlation. The theory and examples worked with the case of two explanatory variables because the calculations become more cumbersome if more explanatory variables are considered. The text is now followed by more examples.

PROBLEM 4
Check if there is any linear dependence among the following variables. Calculate the partial correlation coefficients and test the smaller one for significance. The nivel of test is 5%. Also, calculate the coefficient of multiple correlation and test its significance for a 5% nivel of test. Use table 19 for the calculations. The dependent variable is Y, the other variables are explanatory.
Table 19: Entry data for problem 4
Y    12     8      76,4   17     21,3   10
X1   2      1,2    14,8   8,3    8,4    3
X2   150    94     811    254    399    95

Y    12,5   97,3   88     25     38,6   47,3
X1   4,8    15,6   16,1   11,5   14,2   14
X2   149    312    952    247    400    312
Source: author’s

SOLUTION
The paired correlations are as follows: r_{yx1} = 0,85, r_{yx2} = 0,75, r_{x1x2} = 0,73. Inserting them in equations 4-5 and 4-6, we get the partial correlations r_{yx1·x2} = 0,67 and r_{yx2·x1} = 0,36.

Testing of r_{yx2·x1} = 0,36:
1. H0: ρ_{yx2·x1} = 0 vs. H1: ρ_{yx2·x1} ≠ 0.
2. The test criterion: T = r_{yx2·x1}·√(12 − 3) / √(1 − r_{yx2·x1}²) = 1,16.
3. The critical value: t_{12−3}(0,05) = 2,262.
4. Since |T| < t₉(0,05), we accept the null hypothesis and conclude the partial correlation is insignificant.

In the end, we evaluate the multiple correlation coefficient and test its significance. Using equation 4-7, we get

r_{y·x1x2} = √[ (r_{yx1}² + r_{yx2}² − 2·r_{yx1}·r_{yx2}·r_{x1x2}) / (1 − r_{x1x2}²) ] = 0,87.

As we can see, the value is greater than all the paired correlation coefficients. Testing the significance, we have:
1. H0: ρ_{y·x1x2} = 0 (no linear dependence) vs. H1: ρ_{y·x1x2} ≠ 0.
2. The test criterion: T = [r_{y·x1x2}²·(12 − 3)] / [2·(1 − r_{y·x1x2}²)] = 14,54.
3. The critical value = FINV(0,05; 2; 9) = 4,26.
4. Since the test criterion falls in the critical region, we reject the null hypothesis, and we conclude that there is a combined linear effect of the X’s on Y.

PROBLEM 5
The sample coefficient of paired correlation r_xy = 0,23 has been calculated, based on a data sample of size n = 25. Verify for a 1% nivel of test whether there is a linear dependence between the variables X and Y in the population.

SOLUTION
The test criterion satisfies

T = r_xy·√(n − 2) / √(1 − r_xy²) = 0,23·√(25 − 2) / √(1 − 0,23²) = 1,133.

The critical value of the test, found either in statistical tables or in Excel, is equal to t₂₃(0,01) = 2,8. Since 1,133 < 2,8, we cannot reject the null hypothesis, i.e. the existence of a linear dependence has not been proved.

PROBLEM 6
Canard Company has been monitoring a potential dependence of its operational costs per unit of production Y on total production X (in thousands of pieces).

Table 20: Costs Y and production X of Canard Company
xi   60     71     92     144    192    306
yi   5157   2620   1986   1582   1100   954

xi   437    481    747    989    1383
yi   729    456    200    196    110
Source: author’s

Calculate the correlation index provided that a hyperbolic dependence of the form Y = a/X + b + ε is assumed. Here, a and b are unknown regression coefficients.

SOLUTION
The least squares method is used to estimate the coefficients a and b. The estimates â, b̂ are then inserted in the regression equation, and the theoretical values ŷᵢ = â/xᵢ + b̂ are found for the different values of xᵢ. Also, the average value ȳ, calculated from all the empirical values of the dependent variable Y, must be evaluated. Equation 4-3 then gives

I_yx = √[ Σ(ŷᵢ − ȳ)² / Σ(yᵢ − ȳ)² ] = √(19813814 / 22155242) = 0,945.

The index suggests a high level of dependence of the costs on the production, which is not surprising, of course. However, what we are mainly interested in is that it is the hyperbolic equation that seems to describe the relation well, as suggested by the index.

CONTROL TEST 4

Yes/No answers:
4.1 Correlation coefficient measures a dependence of Y on X?
4.2 Correlation coefficient can take on any value from interval [0,1] ? 4.3 The null hypothesis of the test that verifies significance of the correlation coefficient assumes that two variables are uncorrelated? 4.4 It is much simpler to calculate the index of correlation than the paired correlation coefficient? 4.5 Spearman’s correlation coefficient can take on any value from interval [ 1,1]− ? Complete the statement: Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 65 - 4.6 Correlation analysis seeks a measure of __________ 4.7 If 1xyr = , then the line which describes the relation is __________ 4.8 The correlation index can take on any value from interval __________ 4.9 If values of variables X, Y represent ranks, then __________ correlation coefficient is used to describe a linear dependence between the two variables. 4.10 If Y is linearly dependent on ( )mXXXX ,...,, 21= , we use the coefficient of __________ __________to measure the dependence. 4.11 Calculate the paired correlation between coal extraction (in thousands of tons) and costs per ton of the extracted coal (in crowns). The available data are in the table. Mine number č. ix iy 1 350 37 2 351 38 3 329 38 4 329 38,5 5 327 37,5 6 322 39,1 7 321 39,6 8 316 42,1 9 298 42,9 10 286 43,5 ∑ 3229 396,2 Source: author’s 4.12 National accounts were used for a random selection of eight families. The accounts show the gross yearly income X of the families (in crowns) and their yearly expenses Y on consumer products. The data are in the table below. Calculate the correlation index for the case the linear dependence takes the form of a line, and calculate the paired correlation coefficient, as well. ix 211399 306502 250251 264138 iy 42276 72341 49852 53827 ix 274060 297046 328645 249987 iy 59914 60409 71729 47997 Source: author’s Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 66 - 4.13 Ten films were presented to the jury at a film festival, and viewers also participated in the rating of the movies. The final ranking is in the table. Film A B C D E F G H I J Rankings by jury 5 7 9 1 2 8 3 4 6 10 Rankings by viewers 1 6 4 3 8 7 2 5 10 9 Source: author’s Estimate the correlation between the rankings, using Spearman’s correlation coefficient. Test significance of the coefficient for 5% nivel of test. 4.14 Calculate the coefficient of multiple correlation for the data in the next table. The data describes a dependence of output volume Y on fixed capital X1 and employment X2. Economic sector iy 1ix 2ix Agriculture 288443 18781 1055 Food and Drinks 393828 13990 551 Power sector 330300 33813 223 Semi-products 602182 32022 1101 Production equipment 426720 19520 965 Household goods 34008 1258 49 Transit means 185887 10462 358 Consumer goods 427766 16392 1030 Construction sector 436926 19828 1472 Source: author’s SOLUTIONS 4.1 yes 4.2 no 4.3 yes 4.4 no 4.5 yes 4.6 measures dependence 4.7 rising 4.8 [0,1] 4.9 Spearman’s 4.10 Multiple correlation 4.11 0,8967yxr = − 4.12 0,9196yx yxr I= = 4.13 Spearman’s correlation 0,38sr = , 3,44T = , 1,64.K = The coefficient is significant. 4.14 1 2 2 0,6069.y x xr • = Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 67 - 5 METHODS FOR SALES PREDICTIONS Time series theory represents today a very important part of econometrics. The theory enables us to describe systems that change their behaviour in time. It is necessary to say that the dynamics of these systems deepens as the world globalization advances. National economy is a typical example of where the time series analysis is exploited. 
We might be interested in monthly movements of the aggregate price levels, published by the national statistical office, currency exchange rate closing quotes, etc. Time series, however, do not originate only in economy, but in other spheres of human activity, as well. For instance, birthrates and death rates are observed in demography, maximal nad minimal temperatures are recorded in meteorology, or blood pressure readings of a patient, as part of preventive checkups, are documented by the doctor. The time series analysis aims to understand the mechanism which generated the series in question. Understanding the mechanism allows one, to an extent, to control the functioning of the system which generated the series, and thus to set a desired future course of the system by defining appropriately its input parameters. One may also use the insight into the mechanism for prediction of the future behaviour of the system. The system which created a time series is described by a mathematical model. The time series theory is very extensive. It may as well be the most extensive branch of statistics, as some scholars note. In this chapter, we shall deal with the so-called classical time series theory. If we recall the general form of a regression model, the equation consisted of a systematic part and a random part. Whereas the systematic part reflected the systematic effect of the most important factors on the modelled variable, the random part represented the effect of all other and less important factors the separate influence of which is hard, if not impossible, to capture. Different approaches exist, regarding how to build a model that will approximate the mechanism of generating the time series. The classical approach focuses on the systematic part of the regression model. The classical analysis assumes that the systematic part of the model can be decomposed into several elements of a specific type, which will shed more light on the origin of the series. It is understood that it is easier to detect these elements when they are separated rather than when their effect is aggregated. Another reason that stands behind the decomposition is the effort to discover potential seasonality, since it is a common practice to deprive the series of seasonality for different reasons. In our case, to keep things simpler, we shall also assume that the mathematical model describing the time series containts only one explanatory variable t which represents a point in time. 5.1 TIME SERIES A time series {yt} (TS) is a sequence of values that represent a realization of a sequence of random variables. We shall be mainly interested in economic time series, and in time development of sales, in particular. It is usually assumed for a time series that: • the main factor of change is the time t, • equidistant time intervals, i.e. there is always the same time distance between any two neighbouring points in time t and t’, for which we get values yt and yt’ of the series. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 68 A mathematical model is used to describe the time series. The main objective of building the model is to use it for prediction of the future values of the series. Two types of predictions are distinguished: point prediction and interval prediction. 
5.2 TIME SERIES MODEL DECOMPOSITION It is assumed that the model of a time series can be decomposed into four elements which describe different aspects of the time development of the analysed variable: • trend component Tt, • seasonal component St, • cyclical component Ct, • random component tε . The trend component describes the fundamental character of the time development of the series. It tells us the essential character of its movement (whether it rises or drops, whether its level will eventually taper off or accelerate upwards, etc.). The trend expresses the systematic and long-term effect of factors that keep affecting the series in the same way. The trend is either rising or declining. If it’s neither rising nor declining, it is a series with no trend. The seasonal and cyclical components, which combined form the periodic component, capture regular oscilations of the series. The former relates to oscilations that take place within one year. These oscilations repeat every year, at the same moment. The oscilations can be attributed to natural phenomena (different seasons of the year – spring, summer, autumn, winter), or social habits (construction activities are more pronounced in the summer than in winter, for instance). The important feature of seasonality is that the regular oscilations of the same type take place every twelve months at the latest. The cyclical component represents the effect of factors that give rise to longer-term oscilations. We also talk about oscilations around the trend, and the time delay between two oscilations is more than twelve months. It is usually difficult to describe mathematically the cyclical component. Therefore, it is sometimes not included in the time series model at all. The difficulty is given by the fact that the cyclical oscilations are not as regular, and they often vary in their intensity. The trend, seasonal and cyclical components form the deterministic component of the model. It is usually assumed that their combined effect is a result of their addition, i.e. the following model is assumed 5-1 , 0, 1, 2,...t t t t tY T S C tε= + + + = ± ± . In this case, we talk about the additive model of a time series. Two special cases of 5-1 are worth mentioning, as they often appear in economic applications: 1) the case without the periodic component, i.e. the case S Ct t= = 0 , so that we have 5-2 , 0, 1, 2,...t t tY T tε= + = ± ± . In the other case, 0tC = is assumed, which turns 5-1 into 5-3 , 0, 1, 2,...t t t ty T S tε= + + = ± ± . Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 69 This is a time series model with seasonality. Apart from the additive structure of 5-1, there is also a multiplicative version of the model: 5-4 , 0, 1, 2,...t t t t ty T S C tε= ⋅ ⋅ ⋅ = ± ± . The main objective of the time series analysis is to quantify the individual components of the time series model. The stochastic process that generated the time series can also be analysed with other mathematical means, including the so-called adaptive approaches, such as moving averages and exponential smoothing. We shall talk about moving averages later. The following modelling techniques focus on the practical aspect of working with time series, as is the case of other statistical methods in this course. The reader who is more interested in the time series analysis may take specialized courses that cover the subject in a greater detail. 5.2.1 TREND As was mentioned already, time t is now assumed to be the only factor determining the dynamics of the analysed variable. 
The assumption, although largely simplifying the reality, makes it much more straightforward to model the time series under consideration and separate its individual components. One of these components – the trend – is the most important part the series. Let us assume the model can be written as in 5-2. The trend ܶ௧ of this model is very often described by a linear function, polynomial of degree two, exponential function, modified exponential function, logistic curve or Gompertz’s curve. The functions differ in their complexity, which further affects the way their parameters are estimated. If we work with a linear function or polynomial of degree two, both these function are linear in parameters, and thus their parameters can be estimated with the least squares method explained in the chapter on regression. The case of the other functions is different, since the mathematical description of the curves is not linear in parameters any more. Therefore, a different method must be used to find estimates of the unknown model parameters. Linear trend (or polynomial of degree two) Assuming a linear trend, the model 5-2 can be written as 5-5 0 1 , 0, 1, 2,...t tY t tβ β ε= + + = ± ± . Estimates of the parameters ߚ଴, ߚଵ are obtained with the least squares method. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 70 PROBLEM 1 The following table 21 contains a time series data. Table 21: A time series for problem 1 yt 14,1 15,3 17,7 18,2 20,5 22,8 23,4 25,5 27,9 28,9 31 33,1 35,2 t 1 2 3 4 5 6 7 8 9 10 11 12 13 Source: author’s Excel: you can have the series depicted in Excel, together with the trend calculated by the least squares method. The output is in figure 2. Figure 2: The time series and its trend in the form of a line To get the figure in Excel (version 2010), the following steps must be taken: highlight the area of Excel containing the time series data Yt, t = 1, 2, …, n, and select Insert →→→→ Graph →→→→ XY dot →→→→ with straight connecting lines. This procedure will draw the graph of the original series. If it is necessary to convert the scale on the x-axis of the graph into values 1,2, …, n, click on the graph and press the right button of the computer mouse, choose „Select data“ and „Adjust the x-axis“, as proposed in the dialogue window. If you click on the graph again, you will see Graph tools at the top of Excel and Trend line in the corresponding window. Here, you may select the linear trend line and also its equation from the options. One can proceed similarly in the case of polynomials of degree two, the equation of which is ܶ௧ ൌ ߚ଴ ൅ ߚଵ‫ݐ‬ ൅ ߚଶ‫ݐ‬ଶ , ‫ݐ‬ ൌ 0, േ1, േ2, …., or in the case of polynomials of higher degrees. y = 1,7495x + 11,877 0 5 10 15 20 25 30 35 40 1 2 3 4 5 6 7 8 9 10 11 12 13 Series Series Lineární (Series) Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 71 Logistic trend Logistic curves belong to the set of S-curves. This type of curves is often used in situations where a certain business cycle occurs. These cycles experience phases including the phase of saturation, which is what logistic curves capture well because they have a horizontal asymptote. For instance, when a new product is brought to the market, it is expected that it will take some time before customers register the product and try it out. Thus, at the early stages of the product marketing, the product sales volumes will go up slowly. 
Later, however, when the product comes into use, the sales growth curve will probably be steeper as more customers start to move from older versions of the product to the new version. When the sales volumes reach their climax, the product will dominate the market, as it becomes a fad. Later, competitors will catch up, and the sales volumes of the product will start to weaken. All these stages can be depicted by the logistic curve (figure 3). Figure 3: The S-shape of a logistic trend Logistic trend is given by equation 1 0 1 , 0, 0, t = 1,2,... 1 t t T κ κ β β β = > > + . The function has an S-curve shape for ߚଵ ൏ 1, ߚ଴ ൐ 1. The unknown parameters of the trend can be estimated using a method of selected points: Let the length of the time series be ܶ, where ܶ is an odd number, and let us select chronologically the first, the middle (the p-th, say) and the last observation of the series. Then the parameter estimates satisfy 5-6 ܾଵ ൌ ඨ ሺ1/‫ݕ‬௣ሻ െ ሺ1/‫ݕ‬்ሻ ሺ1/‫ݕ‬ଵሻ െ ሺ1/‫ݕ‬௣ሻ ೛షభ , ݇ ൌ ‫ݕ‬ଵሺ1 െ ܾଵ ௣ିଵ ሻ ൫‫ݕ‬ଵ ‫ݕ‬௣⁄ ൯ െ ܾଵ ௣ିଵ, ܾ଴ ൌ ሺ1/‫ݕ‬ଵሻ െ ሺ1/‫ݕ‬௣ሻ ܾଵ െ ܾଵ ௣ · ݇. If T is even, it is necessary to select other values of the series: for example, the first, the p-th and the r-th such that r-pൌp-1, and then formula 5-6 can be used again. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 72 PROBLEM 2 Table 22 contains values of a time series. Describe the series with a logistic trend, using the method of selected points. Table 22: Values of a time series to be captured by a logistic trend t 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 yt 0,4 0,6 0,6 0,7 0,7 0,8 0,9 0,9 1 1,1 1,2 1,2 1,8 1,4 1,6 1,7 t 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 yt 2,3 2,5 2,2 2,6 2,3 2,6 2,7 2,8 3,2 3 3,1 3,1 3,3 3,4 3,6 Source: author’s The method of selected points gives ܾଵ ൌ ඨ ሺ1/‫ݕ‬௣ሻ െ ሺ1/‫ݕ‬்ሻ ሺ1/‫ݕ‬ଵሻ െ ሺ1/‫ݕ‬௣ሻ ೛షభ ൌ ඨ ሺ1/1,73ሻ െ ሺ1/3,56ሻ ሺ1/0,35ሻ െ ሺ1/1,73ሻ భఱ ൌ 0,872996836, ݇ ൌ ‫ݕ‬ଵሺ1 െ ܾଵ ௣ିଵ ሻ ൫‫ݕ‬ଵ ‫ݕ‬௣⁄ ൯ െ ܾଵ ௣ିଵ ൌ 4,2309685, ܾ଴ ൌ ሺ1/‫ݕ‬ଵሻ െ ሺ1/‫ݕ‬௣ሻ ܾଵ െ ܾଵ ௣ · ݇ ൌ 12,70162866. Thus, the final model of the series is ‫ݕ‬ො௧ ൌ 4,23 1 ൅ 12,7 · 0,87௧ . Table 23 contains the theoretical values of the series generated by the final model. This series and the original series are shown in figure 4 for visual comparison. Table 23: The original and modelled (theoretical) time series t 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 yt 0,4 0,6 0,6 0,7 0,7 0,8 0,9 0,9 1 1,1 1,2 1,2 1,8 1,4 1,56 1,73 ࢟ෝ࢚ 0,4 0,4 0,5 0,5 0,6 0,7 0,7 0,8 0,9 1 1,1 1,3 1,4 1,5 1,64 1,79 t 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 yt 2,3 2,5 2,2 2,6 2,3 2,6 2,7 2,8 3,2 3 3,1 3,1 3,3 3,44 3,56 ࢟ෝ࢚ 1,9 2,1 2,2 2,4 2,5 2,7 2,8 2,9 3 3,2 3,3 3,4 3,5 3,54 3,62 Source: author’s Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 73 Figure 4: The original series and its logistic model There are more trends to choose from, of course, such as exponential trend, modified exponential trend, Gompertz’s curve, etc. As we have said at the beginning of the chapter, the readers interested particularly in this subject, may take a specialized course devoted to time series. We note that a suitable trend can be selected, using differences of the original data: ∆ଵ ‫ݕ‬௧ ൌ ‫ݕ‬௧ െ ‫ݕ‬௧ିଵ, ∆ଶ ‫ݕ‬௧ ൌ ∆ଵ ‫ݕ‬௧ െ ∆ଵ ‫ݕ‬௧ିଵ. 
The selection rule based on the differences is as follows: Table 24: Trend selection Criterion Trend 1 yt ≈ constant Linear 1 yt ≈ linear, 2 yt ≈ constant Quadratic yt - yt-1 ≈ Gauss curve Logistic 5.2.2 SEASONAL COMPONENT – THE CASE OF CONSTANT SEASONALITY Effect of seasonal factors can be described not only by a suitable moving average (to be discussed later), but also by a mathematical curve to which the principles of regression are applied. This concept is based on expanding the regression function, which contains the trend component already, by adding the seasonal component to it, the component depending on unknown parameters to be estimated, as well. As a result, more parameters of the regression function in question are to be estimated – some of these parameters relate to the trend component, while others belong to the seasonal component. The seasonal component is represented by auxiliary variables ‫ݔ‬௜, i = 2, 3, ..., s, where s is the number of seasons the model works with. Each variable ‫ݔ‬௜ is a dichotomous variable, which means that it can only take on values 0 and 1. The variable ‫ݔ‬௜ takes on value 1 for the i – th season and 0 otherwise (for all the other seasons). There are s-1 auxiliary variables in the model because otherwise, if the number was equal to s, perfect multicollinearity would exist, which is a problem that inhibits one from estimating the model parameters (the inverse of the matrix ்ܺ ܺ ceases to exist). Therefore, the effect of one of the seasons is incorporated into the absolute term of the model. 0 0,5 1 1,5 2 2,5 3 3,5 4 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Series Model Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 74 The expanded model is of the form ‫ݕ‬௧ ൌ ܶ௧ ൅ ߙଶ‫ݔ‬ଶ ൅ ߙଷ‫ݔ‬ଷ ൅ ‫ڮ‬ ൅ ߙ௦‫ݔ‬௦ ൅ ߝ௧. If a polynomial trend is assumed, we may write ‫ݕ‬௧ ൌ ߚ଴ ൅ ߚଵ‫ݐ‬ ൅ … ൅ ߚ௞‫ݐ‬௞ ൅ ߙଶ‫ݔ‬ଶ ൅ ‫ڮ‬ ൅ ߙ௦‫ݔ‬௦ ൅ ߝ௧, where the absolute term ߚ଴ contains the effect of the first season. This representation of seasonality also assumes that the sum of all seasonal fluctuations equals zero, i.e. the fluctuations cancel out. Also, independence of the seasonality on the trend is assumed as well as the additive decomposition of the entire regression model. Should there be a dependence between the trend and seasonality, the so-called proportional seasonality might be more convenient for this situation. Let us demonstrate now how to work with the model. PROBLEM 3 Table 25 contains values of a time series that spans a four-year time period. These are quaterly data and, as suggested by figure 5, seasonality probably occurs in each quater of the year. We also adopt the view, based on the graph, that the trend component of our model could be linear. Therefore, we decided to use the model ‫ݕ‬௧ ൌ ߚ଴ ൅ ߚଵ‫ݐ‬൅ߙଶ‫ݔ‬ଶ൅ߙଷ‫ݔ‬ଷ ൅ ߙସ‫ݔ‬ସ ൅ ߝ௧. Table 25: Quaterly time series data Season(i) Sz(1) Sz(2) Sz(3) Sz(4) Sz(1) Sz(2) Sz(3) Sz(4) t 1 2 3 4 5 6 7 8 yt 1,74 -0,3 2,27 -2,7 1,19 -1 1,51 -2,8 Season(i) Sz(1) Sz(2) Sz(3) Sz(4) Sz(1) Sz(2) Sz(3) Sz(4) t 9 10 11 12 13 14 15 16 yt 1,46 -0,2 2,35 -3 1,39 -1,3 1,7 -2,5 Source: author’s Figure 5: Time series data from table 25 -4 -2 0 2 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Series Series Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 75 Let us estimate the unknown coefficients of the model, using the least squares method (see the chapter on regression analysis). We get β଴ൌ1,605, βଵൌ-0,023, αଶൌ-2,1, αଷൌ0,56, αସൌ-4,13. Thus, ܻ෠௧ ൌ 1,605 െ 0,023‫ݐ‬ െ 2,1‫ݔ‬ଶ ൅ 0,56‫ݔ‬ଷ െ 4,13‫ݔ‬ସ. 
However, this is not where the calculation ends. We must realize that the value 1,605 includes the effect of the first-quater seasonality, and we would like to know that effect. Also, the estimates ߙ௜ do not represent the effect of the corresponding i-th quarter yet. For example, the value 0,56 reflects an increase in the third quarter, taking into account the effect of the first quarter which is automatically absorbed in the value 1,605. To isolate the seasonal effects Sz୧, i = 1, 2, 3, 4, for all the quarters, we calculate Szଵ ൌ αଶ ൅ αଷ ൅ αସ 4 , Szଶ ൌ αଶ ൅ Szଵ, Szଷ ൌ αଷ ൅ Szଵ, Szସ ൌ αସ ൅ Szଵ. In our case, ܵ‫ݖ‬ଵ ൌ 1,4175, so the other seasonal effects are ܵ‫ݖ‬ଶ ൌ െ0,68, ܵ‫ݖ‬ଷ ൌ 1,977, ܵ‫ݖ‬ସ ൌ െ2,71. Using one more auxiliary variable xଵ for the first season, we now have yො୲ൌሺ1,605-1,4175ሻ-0,023t൅1,4175xଵ-0,68xଶ൅1,977xଷ-2,71xସ. We can evaluate the theoretical values given by the resulting model. This is done in table 26. These values are visually compared to the original values of the time series (figure 6). The graph proves that the model has been selected well enough. Table 26:Theoretical values given by the regression model t 1 2 3 4 5 6 7 8 y-theoretical 1,582 -0,5 2,096 -2,617 1,49 -0,6 2,004 -2,709 t 9 10 11 12 13 14 15 16 y-theoretical 1,398 -0,7 1,912 -2,801 1,306 -0,8 1,82 -2,893 Source: author’s Figure 6: Comparison of the theoretical and empirical time series data -4 -3 -2 -1 0 1 2 3 1 2 3 4 5 6 7 8 9 10111213141516 Series Model Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 76 - 5.2.3 PROPERTIES OF THE RANDOM COMPONENT OF A REGRESSION MODEL We considered in our models, based on their decomposition into the trend and seasonal component, a random component as well. It is this component that gives the final answer to how the time series in question behaves. We also used the principles of regression to estimate the unknown parameters that appeared in our model. Therefore, care should be taken to make sure that such estimated parameters have good statistical properties. To achieve this goal, it is necessary that the random part of the model satisfies the conditions of the classical regression model. These conditions are listed in the chapter on regression, and we repeat them here for convenience: 1. Expected value of εi is zero, or E(εt) = 0 for each t. 2. Variance of εt is constant, i.e. independent of t: Var(εt) = σ2 for each t. 3. Variables εi, εj are uncorrelated, i.e. Cov(εi, εj) = 0 for i ≠ j. 4. Variable εi follows a normal distribution N(0, σ2 ) for each t. If the condition 2 holds, we talk about homoscedasticity (in the opposite case, when the condition does not hold, we talk about heteroscedasticity). The conditions above should be verified with appropriate statistical methods. We shall focus now on the third condition which typically isn’t met in the case of time series. Before we do so, we note the following: condition 1 is not usually verified, and is assumed to be true. If condition 4 holds together with all the other listed conditions, the least squares estimates are the best of all unbiased estimates in a certain sense. If „only“ the first three conditions hold, which is not to be taken for granted, of course, the least squares estimates will „only“ be the best estimates of all the so-called linear unbiased estimates. There are some terms we just used that require an explanation. 
We shall not provide an explanation at this point because the main reason for making the note is to stress that we can still get reasonably good estimates of the unknown regression coefficients using the least squares method, even if condition 4 does not hold. As far as condition 2 is concerned, one may use the Goldfeld-Quandt test to verify the validity of the condition, or some other more general test, such as White’s test. However, heteroscedasticity is a problem typical of cross-section data, not of time series data, and so we shall not be preoccupied with it here. For these reasons, we shall focus exclusively on the problem of autocorrelation (condition 3). An in-depth coverage of the other problems may be found in different school subjects (in Econometrics, for instance).

The analysis of condition 3 is based on analysing the residuals of the model, i.e. it is based on the values

e_t = Y_t − T̂_t − Ŝ_t,  t = 0, ±1, ±2, ...

Here, the original regression model describing the time series is of the form Y_t = T_t + S_t + ε_t, t = 0, ±1, ±2, ... In the expression e_t = Y_t − T̂_t − Ŝ_t, Y_t represents an individual value of the series, T̂_t is an estimate of the trend, and Ŝ_t is an estimate of the seasonal component of the model. To verify that the random terms of the model are uncorrelated, the Durbin-Watson test is frequently used.

5.2.4 DURBIN-WATSON’S TEST

The test verifies the null hypothesis that the random terms of the model are uncorrelated versus the alternative hypothesis that the random terms are correlated, the correlation taking the following first-order autoregressive form AR(1): ε_t = ρ·ε_{t−1} + u_t, where u_t satisfies the conditions of the classical regression. The AR(1) model also contains an unknown parameter ρ, which is the population paired correlation coefficient measuring the correlation between ε_t and ε_{t−1}. The test thus analyses the validity of the null hypothesis that there is no autocorrelation in the model versus the alternative hypothesis that there is an autocorrelation of the form AR(1).

The test is realized in several steps. First, the least squares estimates of the unknown regression parameters of the model are found, and the estimates are used to calculate the residuals e_t of the model. Second, the residuals are used to construct the following test criterion

5-7  T = Σ_{t=2}^{T} (e_t − e_{t−1})² / Σ_{t=1}^{T} e_t²,

where T is the length of the time series. Special statistical tables are then used to compare the criterion with the critical value of the test. The tables are at the end of this textbook. To use the tables properly, for the given number of observations T, the nivel of test α and the number of model parameters k, which does not include the absolute term of the model, a lower level d_L and an upper level d_H are found in the tables. Also, the sample paired correlation r between the neighbouring residuals has to be calculated according to

5-8  r = Σ_{t=2}^{T} e_t·e_{t−1} / Σ_{t=1}^{T} e_t².

If r is positive, the conclusion of the test is as follows: if the test criterion 5-7 is greater than d_H, the null hypothesis is accepted, whereas if the criterion is smaller than d_L, the null hypothesis is rejected. If r is negative, the statistic T* = 4 − T is used instead of T, and the just-described conclusion is worded based on T*. If T or T* falls between the values d_L and d_H, the test is inconclusive; however, it is recommended that an autocorrelation be assumed preventively, because this is usually the case in time series models.
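The whole procedure can be sketched in a few lines of Python. The code below is an illustration added to this text: it fits a linear trend by least squares to the monthly-expenditure data of Problem 4 that follows, computes the residuals, evaluates 5-7 and 5-8, and applies the decision rule with the tabulated bounds quoted in that problem (d_L = 1,077, d_H = 1,361 for T = 15, k = 1 and a 5% nivel of test).

import numpy as np

# data of Problem 4 below (monthly expenses on food, t = 1, ..., 15)
y = np.array([141, 145, 142, 147, 146, 154, 150, 158,
              157, 165, 164, 170, 167, 174, 175], dtype=float)
t = np.arange(1, len(y) + 1, dtype=float)

b1, b0 = np.polyfit(t, y, 1)                    # least squares line Y_t = b0 + b1*t
e = y - (b0 + b1 * t)                           # residuals e_t

dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # criterion 5-7
r = np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)     # correlation 5-8
print(round(dw, 2), round(r, 2))                # about 3.06 and -0.55 (the text, working with rounded residuals, reports 3,1 and -0,56)

d_L, d_H = 1.077, 1.361                         # bounds taken from the tables
stat = dw if r > 0 else 4 - dw                  # use 4 - DW when r is negative
if stat < d_L:
    print("H0 rejected: an AR(1) autocorrelation is assumed")
elif stat > d_H:
    print("H0 accepted: no autocorrelation")
else:
    print("the test is inconclusive")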
PROBLEM 4 Table 27 contains fictitious data on households’ monthly expenses on food (in millions of crowns). The data spans the time period from January 2000 (t = 1) to March 2001 (t = 15). Table 27: Monthly expenditures Yt Yt 141 145 142 147 146 154 150 158 157 165 164 170 167 174 175 t 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Source: author’s Given the character of the series (figure 7), we shall assume that it follows a linear trend y୲ൌβ଴൅βଵx୲൅ε୲ with a single explanatory variable ‫ݔ‬௧ ൌ ‫.ݐ‬ The seasonal component is not assumed. We start with the estimation of the parameters, and the resulting model will be tested for autocorrelation, using the Durbin – Watson’s test. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 78 Figure 7: The time series Yt The least squares method gives ࢈ ൌ ൬ ܾ଴ ܾଵ ൰ ൌ ሺࢄᇱ ࢄሻିଵ ࢄᇱ ࢅ ൌ ቀ 0,295 െ0,028 െ0,028 0,0036 ቁ · ቀ 2355 19552 ቁ ൌ ቀ 136,66 2,543 ቁ. Inserting the estimates in the model, we get the theoretical values ܻ෠௧ ൌ ܾ଴ ൅ ܾଵ‫ݐ‬ and the residuals ݁௧ ൌ ܻ௧ െ ܻ෠௧, ݁௧ିଵ ൌ ܻ௧ିଵ െ ܻ෠௧ିଵ. Calculations necessary for the test are in table 28. Table 28: Residuals and preparatory calculations et et-1 (et - et-1) 2 et 2 et . et-1 1,8 3,24 3,25 1,8 2,1025 10,5625 5,85 -2,28 3,25 30,5809 5,1984 -7,41 0,17 -2,28 6,0025 0,0289 -0,3876 -3,37 0,17 12,5316 11,3569 -0,5729 2,08 -3,37 29,7025 4,3264 -7,0096 -4 2,08 42,6409 19,8025 -9,256 1 -4 29,7025 1 -4,45 -2,54 1 12,5316 6,4516 -2,54 2,91 -2,54 29,7025 8,4681 -7,3914 -0,62 2,91 12,4609 0,3844 -1,8042 2,82 -0,62 11,8336 7,9524 -1,7484 -2,71 2,82 30,5809 7,3441 -7,6422 1,74 -2,71 19,8025 3,0276 -4,7154 0,2 1,74 2,3716 0,04 0,348 Sum 272,547 85,9438 -48,7297 For the Durbin – Watson’s test, we get the criterion ‫ܹܦ‬ ൌ 272,547/85,9 ൌ 3,1. The estimated correlation between the residuals is according to 5-8: ‫ݎ‬ ൌ െ48,73/85,9 ൌ െ0,56. In the tables at the end of the textbook, we find ݀‫ܮ‬ ൌ 1,077 and ݀‫ܪ‬ ൌ 1,361 for ܶ ൌ 15, the number of regressors without the absolute term = 1 and the nivel of test ߙ ൌ 0,05. Since the sample paired correlation between the residuals is negative, we shall use the alternative test Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 79 criterion T* = 4 െ ‫ܹܦ‬ ൌ 0,9. This number is very low, which means that we may assume the presence of an AR(1) autocorrelation in the model. 5.3 MOVING AVERAGES The trends we have devoted ourselves to assumed that the objective was to find a single mathematical curve that would intersperse and describe the entire times series. Such an effort can be justified, for instance, in situations when the analysed time series is fairly short, so that it makes sense to explain the way it was generated with a single function. Another example is the situation when the user of the final model is interested in a longer-term trend of the series rather than its short-term fluctuations. However, we are often encountered with circuimstances under which using a single function to describe the series behaviour is a too ambitious objective. To give an example, let us consider a company that is interested in the short-term future of the series. If this is the case, the company will probably also be interested in the time period of the series that has just preceded the present, for it is this recent past that will most likely affect the future behaviour of the series to a great extent, as opposed to the more distant past. In such a case, it makes more sense to work with models that reflect potential shifts in the behaviour of the series. 
The recognition of the shifts means, in other words, that we distinguish between the behaviour of the series from the recent past and its behaviour from the distant past. This approach does not assume that the mechanism which generated the series remains the same across time. Approaches which assume shifts in the behaviour of the series are called adaptive approaches, and include, among other techniques, moving averages. Moving averages are based on the principle that only a part of the series is modelled with a selected function, and for later analyses, only one of the values of the function – the middle value - is used as a representative. The function used to model the part of the series depends on unknown parameters which can be estimated with the least squares technique. It is assumed that the parameters may change in time. In this chapter, we shall again work with an additive regression function ܻ௧ ൌ ܶ௧ ൅ ߝ௧, where ߝ௧ satisfies all the required conditions, and we also assume that the analytical form of the trend remains the same across time. However, as we move from one part of the series to another, the model used to describe the new part will generally have different parameters. Therefore, it is said that models with changeable parameters are applied. There are different types of moving averages. If a linear function is used to model various sections of the time series, we talk about a simple moving average. If a second-order polynomial is used for these purposes, we talk about a weighted moving average. Let us take an example to see the theoretical and practical aspects of working with moving averages. 5.3.1 SIMPLE MOVING AVERAGES Let ܻଵ, ܻଶ, … , ܻ௡ be a time series. To use the moving average technique, the length m of the series to be modelled must be determined first. Also, the order of the polynomial that will be used as a model must be selected. If a linear function is used, it will be a polynomial of order one. The length m is usually chosen to be an odd number which can be written as ݉ ൌ 2‫݌‬ ൅ 1, where ‫݌‬ is a positive integer. Each part o the series to be modelled has its center point. These are the values ܻ௧, where ‫ݐ‬ ൌ ‫݌‬ ൅ 1, ‫݌‬ ൅ 2, … , ݊ െ ‫.݌‬ This means that the subsequent part of the series to be modelled is the previous part of the series shifted forward by one observation. This is how the modelled section of the series is moved forward, hence the name of the technique. The center of the first section is given by ܻ௣ାଵ, the center of the Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 80 second part equals ܻ௣ାଶ, etc. … until the center of the final part of the series is ܻ௡ି௣. The k-th part of the series includes observations ܻ௣ା௞ି௣, ܻ௣ା௞ିሺ௣ିଵሻ, … , ܻ௣ା௞, ܻ௣ା௞ାଵ, … , ܻ௣ା௞ା௣, which can be written as ൛ܻ௣ା௞ା௝ൟ ௝ୀି௣ ௣ , or ൛ܻ௧ା௝ൟ ௝ୀି௣ ௣ for ‫ݐ‬ ൌ ‫݌‬ ൅ ݇. To model the part of the series with center point at time ‫,ݐ‬ using the least squares, means to minimize the criterion 5-9 ෍ ൣܻ௧ା௝ െ ܾ଴ሺ‫ݐ‬ሻ െ ܾଵሺ‫ݐ‬ሻ݆൧ ଶ . ௣ ௝ୀି௣ Calculating the corresponding partial derivatives and putting them equal to zero to find the minimum, we arrive at equations 5-10 2 ෍ ൣܻ௧ା௝ െ ܾ଴ሺ‫ݐ‬ሻ െ ܾଵሺ‫ݐ‬ሻ݆൧ሺെ1ሻ ൌ 0, ௣ ௝ୀି௣ 2 ෍ ൣܻ௧ା௝ െ ܾ଴ሺ‫ݐ‬ሻ െ ܾଵሺ‫ݐ‬ሻ݆൧ሺെ݆ሻ ൌ 0. ௣ ௝ୀି௣ We said the principle of moving averages lies in working with a single representative of the modelled part of the series, which is to be the theoretical value of the center-point observation of that part. At this center point, ݆ ൌ 0, which means that ܻ෠௧ ൌ ܾ଴ሺ‫ݐ‬ሻ must hold. 
Thus, it suffices to calculate only the absolute term of the model, using 5-10. Doing so, we get 5-11 ܾ଴ሺ‫ݐ‬ሻ ൌ 1 ݉ ෍ ܻ௧ା௝. ௣ ௝ୀି௣ As we can see, the center-point representative of a given section of the series is the simple average of all the observations that make up the section. And this is where the name of the technique – simple moving average - came from. PROBLEM 5 Table 29 contains a time series which we will now model with a simple moving average of length five. Table 29: A time series for problem 5 t 1 2 3 4 5 6 7 8 9 10 Yt 34 40 37 42 45 47 44 51 52 58 t 11 12 13 14 15 16 17 18 19 20 Yt 55 64 59 66 68 62 72 75 72 77 Source: author’s Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 81 Since ݉ ൌ 5, we have ‫݌‬ ൌ 2. The first theoretical value is ܻ෠ଷ ൌ ଵ ହ ∑ ܻଷା௝ ൌ 39,6ଶ ௝ୀିଶ , the second theoretical value is ܻ෠ହ ൌ ଵ ହ ∑ ܻସା௝ ൌ 42,2ଶ ௝ୀିଶ , etc.…until the last theoretical value is ܻ෠ଵ଼ ൌ ଵ ହ ∑ ܻଵ଼ା௝ ൌ 71,6ଶ ௝ୀିଶ . Extending table 29 with the theoretical values, we get table 30. Table 30: The original time series and the moving averages of length five t 1 2 3 4 5 6 7 8 9 10 Yt 34 40 37 42 45 47 44 51 52 58 average 40 42 43 46 48 50 52 56 t 11 12 13 14 15 16 17 18 19 20 Yt 55 64 59 66 68 62 72 75 72 77 average 58 60 62 64 65 69 70 72 Source: author’s Figure 8 compares the empirical and theoretical values of the series. The original series was shortened by leaving out the first two and the last two observations. Figure 8: The time series and its moving averages 5.4 MAKING PREDICTIONS WITH TIME SERIES MODELS One of the most important reasons why any time series model is constructed is to use the model for predictions of the future behaviour of the time series. Sometimes we also talk about making extrapolations. Predictions are based on the principle that the past mechanism which generated the series will keep its properties unchanged in the future. Thus, one may expect the model to work reasonably well even for future realizations of the series. Let the time series model be of the form t t tY T ε= + , t = 1, 2,…, n, where tT is a linear or quadratic trend (i.e. a second-order polynomial), and n be the present time point. The point prediction n hY + % of the uknown value t hY + of the series at a time point n + h, where h is positive and represents the time horizon of the point prediction, is given by: Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 82 - 5-12 n h n hY T+ +=% . Here, hnT + is the trend evaluated at n + h. This value is not known, but can be estimated by the given trend regression function. The point prediction allows to estimate the future realization of the time series with a single number in a very simple and straightforward way – the future time point n+h is used and inserted in the estimated trend component of the model. Apart from point prediction, interval prediction for t hY + is constructed, as well. The interval prediction constructed at a time n for a time period n+i is given by the following confidence interval: - If the trend is linear, the 95% confidence interval is of the form 5-13 [ n iY + % – tn-2(0,05) ( )ns Q i , n iY + % + tn-2(0,05) ( )ns Q i ], where 5-14 2 1 1 ˆ 2 n n t t t t Y T s n = = − = − ∑ ∑ and 5-15 2 2 2 1 1 ( ) ( ) 1n n t n i t Q i n t nt = + − = + + −∑ , ( 1) / 2t n= + . - If the trend is quadratic, the 95% confidence interval is given by 5-16 [ n iY + % – tn-3(0,05) ( )ns Q i , n iY + % + tn-3(0,05) ( )ns Q i ], where 5-17 2 1 1 ˆ 3 n n t t t t Y T s n = = − = − ∑ ∑ and 5-18 2 1 2 2 1 1 1 1 2 4 ( ) 1 1, ,( ) ( ) 1, ,( ) , . . . 
1 TT nQ i n i n i X X n i n i X n n −         = + + + ⋅ ⋅ ⋅ + + =          . Any 95% confidence interval tells us that it contains the unknown realization of Y with probability = 0,95. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 83 - SUMMARY In this chapter, we studied several important types of time series and possibilities of their use for making predictions. First, we examined the trend components of time series models, the components being expressed in the form of a polynomial or S-curve. In the latter case, a procedure of estimating the unknown parameters of the curve was demonstrated. We also analysed the seasonal component of the model, taking the regression approach again. Finally, the random component of the model and its properties were clarified for the model to be credible when exploited for predictions. One of these properties, the lack of autocorrelation, was examined further with the Durbin-Watson’s statistical test. At the end of the chapter, formulas for point and interval predictions were presented for the case when the trend in the time series model is either linear or quadratic. CONTROL TEST 5 5.1 The deterministic component of a time series model consists of (more of the following answers may be correct): a. trend component b. trend and seasonal components c. trend, seasonal and cyclical components d. seasonal and cyclical components 5.2 The periodical component of a time series model consists of: a. seasonal component b. trend and seasonal components c. trend, seasonal and cyclical components d. seasonal and cyclical components 5.3 Select an item from the left column of the following scheme, and decide what item from the right column it belongs to: (1) Additive time series model (A) has components summed together (2) Multiplicative t.s. model (B) is a line (3) Linear trend in a t.s. model (C) has components multiplied together 5.4 Complete the sentences: a. If the variance of the random part of a model is constant, the property is called __________. b. Random components of a model should be mutually __________. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 84 - 5.5. The following table contains a time series. Model the series with a quadratic trend (no other component, apart from the random one, is assumed in the model). Use the least squares. t 1 2 3 4 5 6 7 8 9 10 Yt 1,2 6,3 14,3 37,1 76,5 125 274 349 499 578 t 11 12 13 14 15 16 17 18 19 20 Yt 711 859 987 1114 1135 1349 1506 1680 1721 1890 Source: author’s 5.6. Model the time series below with moving averages of length five. t 1 2 3 4 5 6 7 Yt 18,683 15,236 20,552 20,988 30,598 23,22 38,375 T 8 9 10 11 12 13 14 15 Yt 43,698 47,813 61,403 62,002 68,386 63,904 68,247 67,818 Source: author’s 5.7 Using the data in the tables below, estimate the model ܻ௧ ൌ ߚ଴ ൅ ߚଵ‫ݔ‬௧ଵ ൅ ߚଶ‫ݔ‬௧ଶ ൅ ߝ௧, and verify the validity of the no-autocorrelation assumption with the Durbin-Watson’s test. The nivel of test is five per cent. t 1 2 3 4 5 6 7 8 xt1 3,3 3,4 3,5 3,5 3,4 3,3 3,4 3,2 xt2 5,9 6 6,2 6,3 6,3 5,9 5,9 5,8 Yt 25,3 23,02 19,9 20,95 18,59 16,15 15,22 17,26 t 9 10 11 12 13 14 15 16 17 xt1 3,2 3,1 3,1 3,1 3,2 3,1 3,1 3 3 xt2 5,5 5,4 5,2 4,8 4,8 4,7 4,6 4,5 4,5 Yt 18,98 20,09 18,65 17,79 20,84 16,69 18,33 16,79 16,48 Source: author’s Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 85 - SOLUTIONS 5.1 c. 5.2 d. 5.3 (1) – (A), (2) – (C), (3) – (B) 5.4 a. homoscedasticity b. uncorrelated. 
5.5 The least squares method applied to the model Y = β₀ + β₁t + β₂t² + ε gives the estimates β̂₀ = −127, β̂₁ = 38,34, β̂₂ = 3,27.

5.6 Moving averages:
t         3     4     5     6     7     8     9     10    11    12    13
average   21,2  22,1  26,7  31,4  36,7  42,9  50,7  56,7  60,7  64,8  66,1

5.7 The least squares give the estimates b₀ = 3,5, b₁ = 3,88, b₂ = 0,52. The residuals are

t      1      2      3       4      5       6       7       8       9
e(t)   5,898  3,173  −0,439  0,566  −1,407  −3,256  −4,568  −1,703  0,17

t      10     11     12      13     14      15      16      17
e(t)   1,727  0,389  −0,266  2,401  −1,313  0,379   −0,719  −1,03

The sample correlation is r = Σ_{t=2}^{17} e_t·e_{t−1} / Σ_{t=1}^{17} e_t² = 40,3 / 59,37 = 0,67. Further, the Durbin-Watson statistic equals DW = Σ_{t=2}^{17} (e_t − e_{t−1})² / Σ_{t=1}^{17} e_t² = 71,84 / 59,37 = 1,21. Since the model contains two parameters (the absolute term is excluded from the test), k is equal to 2. The sample size is n = 17. Therefore, the Durbin-Watson test table provides us with the values dL = 1,015 and dH = 1,536. Since DW lies between the two values, the test is inconclusive as to whether there is an autocorrelation in the model or not.

6 ANALYSIS OF VARIANCE

Chapter 2 of this textbook dealt with two-sample t-tests. This chapter is devoted to their extension in the form of analysis of variance - ANOVA. Analysis of variance ranks among the most frequently used statistical procedures in marketing as well as in other areas of data analysis. The method enables one to assess the potential influence of a qualitative or quantitative variable on another quantitative variable. For example, it is possible to evaluate the effects of different forms of a promotional campaign on the sales of a product. In this case, the different promotional campaigns represent different categories of the observed qualitative variable (the promotional campaign). The sales are then the quantitative variable in question. The potential effect can be expressed mathematically in such a way that the expression analyses whether a change in the level of the qualitative/quantitative variable changes the population mean of the other observed quantitative variable. In this sense, ANOVA tests if there are any differences among the population means of the quantitative variable.

Mathematically speaking, the basic idea of ANOVA is given by a decomposition of what is called the total variability of the observed variable. The decomposition is made up of different sources of the total variability. There is more than one term forming the decomposition. Some of the terms represent the main sources of the total variability. Another term is called the residual variability, which reflects the influence upon the total variability of all the other, minor sources. Depending on how many main sources or factors appear in the decomposition, we talk about one-way ANOVA, two-way ANOVA and so on.

A shrewd observer might come up with the suggestion that the two-sample t-test could be used several times instead. Such a procedure would test the potential differences of the various population means under scrutiny. In this case, if none of these tests ended up being significant, i.e. the null hypothesis of equal population means would always be accepted for each pair, we could conclude that all the population means are the same, i.e. we would conclude that the factor has no effect. Theoretically speaking, it is possible to proceed this way, but at the cost of the credibility of this procedure.
Recall that every statistical test is accompanied by errors, and if a whole series of tests is realized, the probability of these errors may accumulate to unacceptable levels. This is the reason why ANOVA was developed as a special procedure that keeps this probability at a reasonable level. We shall discuss one-way and two-way ANOVA in this chapter. ANOVA stands for „ANalysis Of VAriance“. After having studied the subject matter on ANOVA, we encourage the readers to try to solve the problems presented in this book on their own, and check their results against those presented at the end of the chapter.

6.1 ONE-WAY ANOVA

There are many situations when k independent data samples are available, and the samples do not come from the same population. The sample sizes are $n_1, n_2, \ldots, n_k$, k being equal to or greater than 2. The sample mean $\bar{x}_i$ can be calculated for the i-th sample, as well as the sample variance $s_i^2$. In practice, these samples usually originate by classifying the population into k classes by a factor X, and then randomly drawing a sample of size $n_i$ from each of these k classes. The variable X is called a factor, and its levels or categories are given beforehand, so the factor is called controllable. The categories of X are denoted $x_1, x_2, \ldots, x_k$.

Let the factor X be observed at k levels (categories) and potentially influence a quantitative statistical variable Y. The values of Y randomly obtained for the i-th category $x_i$ of X are denoted $y_{i1}, y_{i2}, \ldots, y_{in_i}$. It is convenient to organize the entry data into table 31.

Table 31: ANOVA table
Factor level   Data sample for the factor level                      Sample size   Sample mean   Sample variance
1              $y_{11}, y_{12}, \ldots, y_{1j}, \ldots, y_{1n_1}$    $n_1$         $\bar{y}_1$   $s_1^2$
2              $y_{21}, y_{22}, \ldots, y_{2j}, \ldots, y_{2n_2}$    $n_2$         $\bar{y}_2$   $s_2^2$
...
i              $y_{i1}, y_{i2}, \ldots, y_{ij}, \ldots, y_{in_i}$    $n_i$         $\bar{y}_i$   $s_i^2$
...
k              $y_{k1}, y_{k2}, \ldots, y_{kj}, \ldots, y_{kn_k}$    $n_k$         $\bar{y}_k$   $s_k^2$
Total                                                                N             $\bar{y}$     $s^2$

The main principle of the analysis of variance is to decompose the total variability of the observed variable. The total variability, measured by the sum of squared deviations of the individual values of the variable from their average, is divided by the decomposition into a part that reflects the variability within the samples and a part that reflects the variability between the samples. The total variability is usually measured by the sample variance

$s^2 = \dfrac{\sum_i \sum_j (y_{ij} - \bar{y})^2}{N - 1}$.

In analysis of variance, we are interested only in the numerator of the variance, where

$\bar{y} = \dfrac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}$.

We shall denote the total sum of squares, which represents the total variability, as $S_y$:

6-1   $S_y = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$.

The symbol $S_{y,v}$ will be used for the within-group variability, which is also called the residual variability:

6-2   $S_{y,v} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$.

The among-group variability, denoted $S_{y,m}$, is defined as

6-3   $S_{y,m} = \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2$.

Expressions 6-1 to 6-3 use $\bar{y}$, the sample average of all the values of y, as well as the subgroup averages $\bar{y}_i$ (see table 31). Using algebraic operations, the following fundamental formula for the one-way analysis of variance can be derived:

6-4   $S_y = S_{y,m} + S_{y,v}$.

Anglo-Saxon scholarly literature and software may denote the just-described variabilities with other symbols as well, for instance $S_y$ = SD (D for Difference), $S_{y,m}$ = ST (T for Treatment), $S_{y,v}$ = SR (R for Residual). We shall use our symbols in the rest of this chapter.
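To make formulas 6-1 to 6-4 concrete, the following short Python sketch computes the three sums of squares for a small invented data set and checks the decomposition 6-4. The sample values and the use of numpy are illustrative assumptions, not part of the textbook's procedure.

import numpy as np

# k = 3 invented samples (groups) of a quantitative variable Y
groups = [np.array([8.1, 8.0, 7.9]),
          np.array([7.7, 7.8, 7.9, 7.6]),
          np.array([7.6, 7.5, 7.6])]

all_y = np.concatenate(groups)
y_bar = all_y.mean()                                            # overall average

S_y  = ((all_y - y_bar) ** 2).sum()                             # 6-1: total sum of squares
S_yv = sum(((g - g.mean()) ** 2).sum() for g in groups)         # 6-2: within-group (residual) variability
S_ym = sum(len(g) * (g.mean() - y_bar) ** 2 for g in groups)    # 6-3: among-group variability

print(S_y, S_ym + S_yv)   # 6-4: the two numbers agree up to rounding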
6.1.1 ANOVA HYPOTHESES

Analysis of variance is a statistical test. Therefore, we work with a pair of hypotheses: a null hypothesis and an alternative hypothesis. Before specifying the test, we emphasize that ANOVA has conditions under which it was derived. The method assumes that each of the k random samples comes from a normal distribution, and that the distributions have the same variance. Also, the samples were drawn independently of each other. The prerequisite of normality can be tested in more than one way, using the chi-squared test, the Anderson-Darling test, the Kolmogorov-Smirnov test, the Shapiro-Wilk test, etc. Regarding the condition of constant variance, we described earlier the F-test which verifies the hypothesis of equality of two variances. In analysis of variance, more than two samples are usually worked with, and for this case an extension of the F-test exists in the form of Bartlett's test.

Let us return to ANOVA. Assuming the factor X is observed at k levels, the following relation is considered to hold true:

6-5   $\mu_i = \mu + \alpha_i$,   i = 1, 2, ..., k.

Here, $\mu_i$ is the population mean of the variable Y corresponding to the i-th level of factor X, $\mu$ is a constant, and $\alpha_i$ is called the effect. It is this effect that is supposed to express the potential differences among the population means of Y, the differences being caused by the different levels of factor X. Now, we may ask whether all k samples came from the same population, in other words, whether the populations the samples came from have the same means. Whether the effects $\alpha_i$ are all equal to zero would be yet another equivalent question. This question forms the null hypothesis of ANOVA:

6-6   H0: $\mu_1 = \mu_2 = \ldots = \mu_k$,

or

6-7   H0: $\alpha_1 = \alpha_2 = \ldots = \alpha_k = 0$.

The alternative hypothesis is the negation of 6-6 or 6-7. In the first case, this means that the alternative hypothesis takes the form: H1 - there exist indices i and j such that $\mu_i \neq \mu_j$. The test criterion of ANOVA is

6-8   $T = \dfrac{S_{y,m}/(k-1)}{S_{y,v}/(N-k)}$,

which follows a Fisher's distribution with k-1 and N-k degrees of freedom. The critical value of the test, $F_{k-1,N-k}(\alpha)$, for a nivel of test alpha is tabulated, or it can be obtained with the Excel function FINV(α, k-1, N-k). To sum up, testing the null hypothesis of ANOVA is characterized by the following steps:

Step 1. Select a nivel of test α. Alpha is usually 0,1, 0,05, 0,01, or 10 %, 5 %, 1 %, respectively.

Step 2. Calculate the test criterion T according to 6-8, where formulas 6-2 and 6-3 are used to get the within-group variability and the among-group variability, respectively. Formulas 6-9 and 6-10 may also be used. They are more convenient if the variabilities are to be calculated on a calculator.

6-9   $S_y = \sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}^2 - \dfrac{1}{N}\left(\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}\right)^2$,

6-10   $S_{y,m} = \sum_{i=1}^{k} n_i \bar{y}_i^2 - \dfrac{1}{N}\left(\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}\right)^2$,

6-11   $S_{y,v} = S_y - S_{y,m}$.

Step 3. Compare T from step 2 with the critical value $F_{k-1,N-k}(\alpha)$. If $T < F_{k-1,N-k}(\alpha)$, the null hypothesis H0 is accepted and the factor X can be pronounced not influential in relation to the variable Y. If $T \geq F_{k-1,N-k}(\alpha)$, the null hypothesis H0 is rejected, meaning the factor X has a statistically significant influence on the variable Y. If the test confirms that the factor X affects Y, we may ask which population means are different.
It can be the case that only population means are different, while all the other population means are the same. There are methods that try to answer this question, one of them being devised by Scheffé and one by Tukey. PROBLEM 1 The following table contains data obtained through several independent random samplings. The observed factor is the number of octanes used to describe the quality of car fuel (90, 91, 95, 98 octanes are usually available). Thus, the factor is monitored at four possible levels. For each of the levels, five car drivers using the fuel of the corresponding quality were randomly selected (see table 32). In this case, all samples have the same size, which is not required for one-way ANOVA. We want to know whether the quality of the fuel affects fuel consumption (car mileage). To answer the question, we shall employ ANOVA. Table 32: Car mileage for different types of fuel Factor levels 90 91 95 98 8,1 7,7 7,6 7,5 8 7,8 7,6 7,8 Samples 7,9 7,9 7,5 7,6 7,8 7,6 7,6 7,5 8,2 7,8 7,6 7,5 Source: author’s The nivel of test is set at 5%. Regarding the among-group variability, we must calculate the column averages (or group averages, more generally speaking). These are 8, 7,76, 7,58 and 7,58 for the first, second, third and fourth column of the table, respectively. The total average is 7,73. Using 6-3 with 5in = for every i, we have , 0,594.y mS = The within-group variability is 2 2 2 2 2 , (8,1 8) (8 8) ... (8,2 8) (7,7 7,76) ... (7,5 7,58) 0,228.y vS = − + − + + − + − + + − = We have N = 20 values altogether and the number of factor levels is k = 4. Therefore, , , / ( 1) 0,594 / 3 13,895. / ( ) 0,228 /16 y m y v S k T S N k − = = = − The critical value K = FINV(0,05,3,16) = 3,2389. Since the test criterion is greater than K, we reject the hypothesis that fuel quality has no effect on car mileage. In other words, it seems the factor does have an influence on car mileage. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 91 Excel: The same procedure can be performed in Excel with its Data Analysis module. The module offers one-way analysis of variance (see figure 9). Figure 9: The dialogue window of the Data Analysis module In the subsequent dialogue window (figure 10), it is necessary to insert as the Input Range a reference to the area of the Excel spreadsheet that contains the data samples to be worked with in ANOVA: Figure 10: Insertion of information for ANOVA The analyst must also confirm in the dialogue window whether each sample represents a column in the spreadsheet table, or a row (see figure 10). In our example, we work with columns. Once the nivel of test alpha is confirmed at 5%, or changed, and the location of the ANOVA output is selected, Excel returns a result of the form shown in table 33. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 92 Table 33: ANOVA results presented by Excel ANOVA Source of variability SS Difference MS F P-value F krit Among-groups 0,594 3 0,198 13,89474 0,0001015 3,238872 Within-groups 0,228 16 0,01425 Total 0,822 19 In this table, „F“ represents the test criterion, and „F krit“ stands for the critical value. As we can see, our calculations were correct. 6.1.2 A MEASURE OF DEPENDENCE Variability of iy ’s around y is caused by a dependence of Y on X. We described such variability with the among-group sum of squares ,y mS . The within-group variability ,y vS is, on the other hand, induced by factors other than X. Higher ,y mS implies a stronger dependence of Y on X. 
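Before quantifying this dependence formally, the following Python sketch (numpy and scipy assumed to be available) cross-checks Problem 1 and the Excel output in table 33; it also reports the share of the total variability accounted for by the factor, which anticipates the ratio introduced next.

import numpy as np
from scipy import stats

# Car mileage samples for the four fuel types (table 32)
fuel_90 = [8.1, 8.0, 7.9, 7.8, 8.2]
fuel_91 = [7.7, 7.8, 7.9, 7.6, 7.8]
fuel_95 = [7.6, 7.6, 7.5, 7.6, 7.6]
fuel_98 = [7.5, 7.8, 7.6, 7.5, 7.5]

f_stat, p_value = stats.f_oneway(fuel_90, fuel_91, fuel_95, fuel_98)
print(f_stat, p_value)        # approx. 13.89 and 0.0001, as in table 33

k, N = 4, 20
critical = stats.f.ppf(0.95, k - 1, N - k)    # counterpart of FINV(0.05, 3, 16)
print(critical)               # approx. 3.24; f_stat exceeds it, so H0 is rejected

S_ym, S_yv = 0.594, 0.228     # sums of squares from the manual calculation above
print(S_ym / (S_ym + S_yv))   # approx. 0.72: the share of variability explained by the factor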
Based on 6-4, this dependence can be measured, using the determination ratio, denoted P2 : 6-12 ,2 y m y S P S = . The square root of P2 is called the correlation ratio. P2 can take on any value from interval [0,1].The stronger the dependence of Y on X, the closer the characteristic is to one, and the closer the among-group sum of squares is to the total sum of squares (total variability). On the contrary, the within-group variability approaches zero in this situation. The closer the determination ratio is to zero, the smaller the part of the total variability which is accounted for by the among-group variability. In this case, the dependence of Y on X is weak. SUMMARY If we are interested in whether there is no statistically significant difference between two population means, we can verify such hypothesis or surmise with the two-sample t-test. Analysis of variance (ANOVA) enables us to verify the hypothesis that there is no difference between two or more population means. The procedure makes it also possible to test if different levels of one factor or more factors have any effect on another quantitative variable. Analysis of variance is based on the idea that total variability of a variable can be broken down to sub-variabilities each of which reflects its own source of variation. One type of variability is called residual variability, and is generated by sources that are not of interest to us, and are hard or impossible to identify. Depending on how many sources of the total variability we work with, we talk about one-way ANOVA, two-way ANOVA, three-way ANOVA, etc. This chapter was devoted to the first type: one-way ANOVA. In one-way ANOVA, the total variability/sum of squares is divided into two parts: one part represents the influence of the only factor considered, while the other part is represented by the residual variability/sum of squares. We assume that the only factor X is observed at Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 93 k possible levels, and we formulate the null hypothesis that all samples, each of which is obtained randomly for the given level of X, came from the same population. To verify the null hypothesis, we use statistic 6-8, which follows a Fisher’s distribution if the null hypothesis is true. The appropriate critical value is found for a given nivel of test alpha. In the end, the null hypothesis is either accepted, meaning that X has no effect, or we reject the null hypothesis, in which case the factor does exert an influence. Apart from the testing itself, which provides a yes/no answer to the existence of an effect of the factor, we can also measure the amount of the effect. This is done by evaluating the determination ratio, the values of which belong to closed interval [0, 1]. The stronger the influence, the higher the value of the ratio. Its square root is called the correlation ratio. CONTROL TEST 6 6.1 One-way ANOVA serves for (check the correct answer(s)): a. calculating frequency distribution of individual variables b. testing presence of effect of a factor on a quantitative variable c. finding probability distribution d. testing mutual correlation of statistical variables 6.2 In ANOVA, we a. test the null hypothesis that population means are the same, b. test the null hypothesis that two variables are mutually dependent, c. test the null hypothesis that value of a variable is different from a preedefined value d. test the null hypothesis that two statistical variables are mutually independent. 6.3 ANOVA uses the critical value of: a. 
a student’s distribution, b. a Pearson’s distribution, c. a Fisher’s distribution, d. a normal distribution. 6.4 Determine whether the following statements are true (write T) or false (write F): a. The F-test of equal variances should be used before ANOVA. b. The determination ratio takes on values from interval [0,1]. c. The smaller the among-group variability, the stronger the dependence between X and Y. d. The ANOVA test criterion must fall to a certain set for the ANOVA null hypothesis to be accepted. This set is a union of two intervals. e. The variance of ANOVA sample/group averages reflects the within-group variability. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 94 - 6.5 Complete the statement: a. If the ANOVA test criterion F falls to critical region, the variable Y can be considered to be __________ on/of X for a given nivel of test. b. ANOVA, where the single factor is observed at l different levels, and the total number of all observations of Y equals m, works with a Fisher’s distrbution with __________and __________degrees of freedom. c. The ANOVA test criterion F is always __________ (positive/negative). d. One-way ANOVA tests__________ Y on/of a factor X. 6.6 Complete the statement: a. The square root of the determination ratio is called __________ __________. b. If the ANOVA test criterion F falls to __________ __________ , the null hypothesis is rejected. c. To find the critical region in ANOVA, we must know ___________________ and __________. SOLUTIONS 6.1 b. 6.2 a. 6.3 c. 6.4 T, T, F, F, F 6.5 a. dependent, b. l-1 and m-l, c. positive d. independence 6.6 a. correlation ratio, b. critical region, c. degrees of freedom of a Fisher’s distribution and the nivel of test. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 95 - 7 TWO-WAY ANOVA AND LATIN SQUARES We got acquainted with one-way ANOVA in the previous chapter. Now we shall work with more factors. We are in a situation where influence of two or three factors on a quantitative variable is examined, the factors being again qualitative or quantitative. Therefore, we shall work with two-way and three-way ANOVA. Two-way and three-way ANOVA have their own experimental plans. More on experimental plans will be presented in later chapters. These plans can be designed in such a way that not much data is necessary for ANOVA to yield credible results. The plans play an important role in statistics since the more factors appear in the analysis, the more data is needed, and the increase in the amount of data is then fairly steep. If influence of two factors on a quantitative variable Y is observed, we talk about two-way ANOVA. As in the previous chapter, random sampling can be performed for different combinations of the two factors considered, and the data provides us later with the possibility of examining a potential influence of each of the two factors individually. An interaction of the two factors can also be regarded as another factor. We will skip interactions in our ANOVA presentation. Analogous statements are true for the three-way ANOVA case in which three major factors appear, and if it is required, two-factor interactions and a threefactor interaction may be analysed, as well. As was already said, a lot of data is required when more factors are included in ANOVA. Therefore, it is often the case that only one observation is made for each combination of the factors. We then talk about ANOVA with a single observation in each subgroup. 
The case of the same number of observations in each subgroup was exploited in the previous chapter. However, while one-way ANOVA can do without this experimental plan, the case of two-way and three-way ANOVA is more complicated, and it is strongly recommended that the requirement of the same number of observations be complied with whenever possible in practice since in the opposite case, ANOVA may be carried out in more than one way, and each of the ways give generally different results. This is not the case when the number of observations in each subgroup is the same. 7.1 TWO-WAY ANOVA In two-way ANOVA, two factors are considered. In this case, the total variability, inroduced in chapter six, is decomposed again, but this time into more terms each of which reflects an effect of the corresponding factor. As opposed to the case of one-way ANOVA, where two terms appeared in the decomposition, two-way ANOVA leads to three such terms. The additional term represents an effect of the second factor. The decomposition of the total variability takes the following form: 7-1 A B RS S S S= + + , where 7-2 2 1 1 ( ) k n ij j i S y y = = = −∑∑ , Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 96 - 7-3 ( ) 2 1 n A i i S k y y = = −∑ , 7-4 ( ) 2 1 k B j j S n y y = = −∑ 7-5 .R A BS S S S= − − Here, n is the number of levels/categories of factor A, k is the number of levels/categories of factor B. There are nk observations altogether with a single observation in each subgroup. The symbol iy is used to denote the average observation when A is at its i-th level, while jy is the average observation when B is at its j-th level. The symbol y denotes the average of all observations, as usual. The scheme of the experiment is given in table 34. Table 34: Data scheme for two-way ANOVA Factor B: levels B1 B2 … Bk A1 A2 Factor A: levels . . An jy ijy in the i-th row, j-th column iy The symbol iy can be regarded as the row average in our table representation, and the symbol jy as the column average. The sum AS reflects the effect of A, the sum BS represents the effect of B, the sum RS reflects the effect of all the other factors. The total variability of all observations is described by the term S . Since two factors are involved in this analysis, twoway ANOVA consists of two statistical tests. Each of the tests examines the statistical significance of one of the factors. 7.1.1 EFFECT OF FACTOR A We test the null hypothesis H0: Factor A has no effect on the variable Y. The alternative hypothesis says the opposite: H1: Factor A has an effect on the variable Y. The test criterion T is of the form 7-6 / ( 1) / ( 1) A R S n T S nk n k − = − − + . Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 97 The critical value K = 1, 1( )n nk n kF α− − − + for a nivel of test alpha, i.e. it concerns a Fisher’s distribution with n-1 and nk-n-k+1 degrees of freedom. If T K≥ , we reject the null hypothesis. In this case, we may conclude that the factor A does have an effect on Y. If T K< , we accept the null hypothesis, and the factor A does not have an effect on Y. 7.1.2 EFFECT OF FACTOR B We test the null hypothesis H0: Factor B has no effect on the variable Y. The alternative hypothesis says the opposite: H1: Factor B has an effect on the variable Y. The test criterion T is of the form 7-7 / ( 1) / ( 1) B R S k T S nk n k − = − − + . The critical value K = 1, 1( )k nk n kF α− − − + for a nivel of test alpha. IF T K≥ , we reject the null hypothesis, meaning that the factor B affects Y. 
On the other hand, if T K< , the null hypothesis is accepted, and the factor does not affect Y. PROBLEM 1 Two factors A, B are given. The factor A is observed at three levels, the factor B is considered at four levels. A single observation is available for each combination of the two factors. We assume the observations originated independently of each other, and they follow a normal distribution with equal variances. The nivel of test being five per cent, we want to test the potential influence of the two factors. The observations are in table 35. Table 35: Entry data for problem 1 B B1 B2 B3 B4 A1 24 25 25 23 A A2 22 21 22 25 A3 21 22 21 21 Source: author’s Table 36 is an extension of table 35, containing the row and column averages as well as the total average. Also, n = 3 and k = 4. Using formulas 7-2 through 7-5, we get Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 98 Table 36: ANOVA and various averages B B1 B2 B3 B4 averages A1 24 25 25 23 24,25 A A2 22 21 22 25 22,5 A3 21 22 21 21 21,25 averages 22,33333 22,66667 22,66667 23 22,66667 Total average 2 1 1 ( ) 30,66, k n ij j i S y y = = = − =∑∑ ( ) 2 1 18,166, n A i i S k y y = = − =∑ ( ) 2 1 0,66, k B j j S n y y = = − =∑ 11,833.R A BS S S S= − − = Therefore: To test the effect of A: / ( 1) 18,166 / 2 4,6. / ( 1) 11,833/ 6 A R S n T S nk n k − = = = − − + To test the effect of B: / ( 1) 0,66 / 3 0,1126. / ( 1) 11,833/ 6 B R S k T S nk n k − = = = − − + The critical values are: for the test of A: K = FINV(0,05,3-1,12-3-4+1) = 5,143. for the test of B: K = FINV(0,05,4-1,12-3-4+1) = 4,757. As we can see, none of the factors seems influential. Excel: The same problem can be solved in Excel, using the Data Analysis module. In this module, Anova: two factors without repetition is selected (figure 11). Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 99 Figure 11: The dialogue window of the Data Analysis module in Excel After selecting ANOVA in the dialogue window, a reference is made to the area of the spreadsheet containing the entry data, and the nivel of test is confirmed at five per cent or changed. Excel then returns the following result (table 37): Table 37: Two-way ANOVA results provided by Excel ANOVA Source of variability SS Difference MS F P-vaue F krit Rows 18,16666667 2 9,083333 4,605634 0,06137 5,143253 Columns 0,666666667 3 0,222222 0,112676 0,949497 4,757063 Error 11,83333333 6 1,972222 Total 30,66666667 11 The interpretation of the table is the same as in the case of one-way ANOVA. The second column contains the terms making up the total variability in 7-1. The F symbol is the test criterion for the two tests of ANOVA and F krit stands for the corresponding critical values. 7.2 THREE-WAY ANOVA (LATIN SQUARES) A special case of a three-factor procedure, called Latin squares, belongs to analysis of variance, as well. We shall describe the procedure at the end of this chapter. Latin squares rank among classical methods of experimental design. The name of the procedure dates back to the eighteenth century when L. Euler (1707 – 1783) presented to the Petrohrad-based academy a problem on 36 commissioned officers: the task was to position officers of 6 different ranks from 6 different regiments on a square in such a way that each row and column of the square contained officers of all ranks from all regiments. More generally speaking: Let us have objects with two properties of interest: A and B (example: A = rank, B = regiment). 
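Before continuing with the Latin-square construction, here is a brief numerical aside: a Python sketch (numpy assumed) reproducing the two-way ANOVA of Problem 1 above, i.e. the sums 7-2 to 7-5 and the criteria 7-6 and 7-7.

import numpy as np

# Table 35: rows = levels of A (n = 3), columns = levels of B (k = 4)
y = np.array([[24, 25, 25, 23],
              [22, 21, 22, 25],
              [21, 22, 21, 21]], dtype=float)
n, k = y.shape
y_bar = y.mean()

S   = ((y - y_bar) ** 2).sum()                        # 7-2: total variability
S_A = k * ((y.mean(axis=1) - y_bar) ** 2).sum()       # 7-3: effect of factor A (row averages)
S_B = n * ((y.mean(axis=0) - y_bar) ** 2).sum()       # 7-4: effect of factor B (column averages)
S_R = S - S_A - S_B                                   # 7-5: residual variability

T_A = (S_A / (n - 1)) / (S_R / (n * k - n - k + 1))   # 7-6: approx. 4.61
T_B = (S_B / (k - 1)) / (S_R / (n * k - n - k + 1))   # 7-7: approx. 0.11
print(S_A, S_B, S_R, T_A, T_B)                        # matches table 37 up to rounding

The critical values quoted in the solution, 5,143 and 4,757, correspond to scipy.stats.f.ppf(0.95, 2, 6) and scipy.stats.f.ppf(0.95, 3, 6). We now return to the two properties A and B introduced above.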
Each property may take on n different forms (example: n = 6, 6 different military ranks, such as private, corporal, sergeant, captain, major and colonel; 6 different regiments). The task is to set up the n2 objects, each having a different A and B properties, so that each row and column was occupied by objects that do not have the same A and B properties Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 100 (example: the first row is occupied by the private from the first regiment, the corporal from the second regiment, etc.). Such a scheme is called the Latin square of order n. The wellknown mathematical result discovered by Euler himself says that at least one Latin square of order n exists for any integer n. We will use Latin squares for three-way analysis of variance. Let us have three factors an effect of which on another variable Y is conceivable at the moment. Since three factors are involved in the analysis, it is hard to represent the whole experiment with a two-dimensional table. Another problem is that the number of data increases significantly as more factors are included in the analysis. However, it is possible to consider each combination of the factor levels, and utilize a single observation for each of these combinations. Such a procedure allows us to represent the experiment with a table. The heading of the table will describe different levels of two factors, while the interior of the table will contain a level of the third factor together with the single observation realized by an experiment. The levels of the third factor will be selected in such a way that the whole scheme results in a Latin square. Let us denote the three factors A, B and C. When talking about a Latin square of order n = 3, we may describe our experiment in the form of table 38 Table 38: Three-way ANOVA scheme in the form of a Latin square a b c b c a c a b One side of the square represents the three levels of factor A. The adjoining side of the square represents the three levels of factor B. The interior of the table contains the levels of the third factor C. To read the square correctly, we say, for instance, that when both the factor A and B are at their first level, the factor C is as well at its first level (this corresponds to the element [1,1] of the table). For this combination of the levels, a single observation of the variable Y is realized and inscribed into the table. The analogous procedure holds true for the other elements of the table. One of the merits of this experiment is that we need only 9 observations instead of 27 we would have needed if we had wanted to use all the possible combinations of the factors. At the same time, the design of the experiment is such that the analysis of variance will give credible results. The total variability S in three-way ANOVA has the form 7-8 A B C RS S S S S= + + + , where 7-9 2 1 ( ) n A i i S n y y•• = = −∑ reflects the effect of factor A, and iy •• is the average of Y when A is at its i-th level, Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 101 - 7-10 2 1 ( ) n B j j S n y y• • = = −∑ reflects the effect of factor B, and jy• • is the average of Y when B is at its j-th level, and 7-11 2 1 ( ) n C k k S n y y•• = = −∑ reflects the effect of factor C, where ky •• is the average of Y when C is at its k-th level. Finally, 7-12 2 ( )ijk i j k S y y= −∑∑∑ is the total sum of squares. The expression contains individual observations of Y, i.e. the terms ijky , and also the average of all observations y . Also, 7-13 R A B CS S S S S= − − − . 
is the residual sum of squares. Three-way ANOVA comprises three statistical tests. Each of these tests relates to one of the factors. Also, in each of the tests, the null hypothesis has the form: H0 – the tested factor is insignificant. The alternative hypothesis is: H1 the tested factor has an effect. The tests criteria for each test are summarized in table 39: Table 39: Three-way ANOVA Source of variability Sum of squares Degrees of freedom Estimate of variance Test criterion Factor A SA dfA=n-1 MSA=SA / dfA FA=MSA / MSR Factor B SB dfB=n-1 MSB=SB / dfB FB=MSB / MSR Factor C SC dfC=n-1 MSC=SC / dfC FC=MSC / MSR Residuals SR dfR=(n-1)(n-2) MSR=SR / dfR Total S dfT=n2 -1 If FA ≥ 1,( 1)( 2) ( )n n nF α− − − , the null hypothesis is rejected, and factor A is considered significant. If the opposite inequality holds, the null hypothesis is accepted. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 102 If FB ≥ 1,( 1)( 2) ( )n n nF α− − − , the null hypothesis is rejected, and factor B is considered significant. If the opposite inequality holds, the null hypothesis is accepted. If FC ≥ 1,( 1)( 2) ( )n n nF α− − − , the null hypothesis is rejected, and factor C is considered significant. If the opposite inequality holds, the null hypothesis is accepted. PROBLEM 2 Fuel emission Y is studied, and its potential dependence on the following three factors: Factor 1 = petrol ingredient (A, B, C, D), Factor 2 = car driver (I, II, III, IV), Factor 3 = vehicle used (1, 2, 3, 4). The result of the corresponding experiment is in table 40 Table 40: Entry data for the three-way ANOVA in problem 2 driver\vehicle 1 2 3 4 I A : 21 B : 26 D : 20 C : 25 II D : 23 C : 26 A : 20 B : 27 III B : 15 D : 13 C : 16 A : 16 IV C : 17 A : 15 B : 20 D : 20 We are testing the potential effect of the individual factors on Y, the nivel of test being five per cent. According to formulas 7-9 through 7-13, we have 2 1 1 ( ) 40. n i i S n y y•• = = − =∑ 2 2 1 ( ) 216. n j j S n y y• • = = − =∑ 2 3 1 ( ) 24. n k k S n y y•• = = − =∑ Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 103 - 2 ( ) 296.ijk i j k S y y= − =∑∑∑ 16.R A B CS S S S S= − − − = The test criteria are: For factor 1: (40 / 3) 5. (16 / 6) T = = For factor 2: (216 / 3) 27. (16 / 6) T = = For factor 3: (24 / 3) 3. (16 / 6) T = = The critical value is the same in all three tests: K = FINV(0,05,3,6) = 4,757. This means the factors 1 and 2 are statistically significant, whereas type of vehicle used is not. SUMMARY This chapter described three-way ANOVA, in particular its special form called Latin squares, and also two-way ANOVA with a single observation in each subgroup. We explained the purpose of these methods and the mathematical technique behind it. The reader became familiar with the terms two-factor and three-factor analysis of variance, Latin squares, decomposition of total variability and ANOVA table. The following problems allow the reader to practise the methods, including one-way ANOVA from the previous chapter. CONTROL TEST 7 1) The nivel of test being 5 per cent, test if parsley yields depend on type of fertilizer used. All the necessary observations are in table 41. Table 41: Entry data for ANOVA Fertilizer Yields (1kg/10m2 ) A 40 42 45 40 44 47 B 76 75 82 68 C 60 58 62 64 70 Source: author’s 2) Evaluate intensity of dependence between the parsley yields and the type of fertilizer, using the determination ratio characteristic. 
Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 104 - 3) Six drivers were randomly selected, each of them experiencing a ride with different gasoline. Test whether gasoline consumption depends on type of gasoline used and/or driver. The nivel of test is 5%. The data are in table 42. Table 42: Entry data for ANOVA Driver Gasoline A B C D E F Averages Aral 7,5 6,9 7,9 7,3 6,9 7,8 7,38 Shell 7,6 7,2 7,5 8 7,3 8,2 7,63 Benzina 7,2 8,1 7,8 7,6 7,8 6,9 7,57 Slovnaft 7 7,3 7,2 7,5 8,2 7,7 7,48 Averages 7,33 7,38 7,6 7,6 7,55 7,65 7,5 Source: author’s SOLUTIONS 1) There are three types of fertilizers, i.e. k = 3, and the corresponding samples of observations are of the size 1 2 36, 4 5n n n= = = , respectively. The total number of observations is N = 15. Tested is the hypothesis 0 1 2 3H : µ µ µ= = , i.e. parsley yields do not depend on type of fertilizer used. To perform the test, we make the following preliminary calculations: • Conditional averages 1 in ij j i i y y n = = ∑ , for 1,2...,i k= , where ijy are observations. • Total average 1 1 1 1 1ink k ij i i j i y y y N k= = = = =∑∑ ∑ , • Among-group variability ( ) 2 , 1 k y m i i i S n y y = = −∑ , where: in is the number of observations in the i-th group, iy is the sample average in the i-th data group. • Within-group variability ( ) 2 , 1 1 ink y v ij i i j S y y = = = −∑∑ . • Total variability , ,y y m y vS S S= + . We get: 1 2 3 40 42 ... 47 43, 6 76 75 ... 68 75,25, 4 62,8, y y y + + + = = + + + = = = Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 105 - 43 75,25 62,8 60,35. 3 y + + = = ( ) ( ) ( ) ( ) 2 2 2 2 , 1 6 43 60,35 4 75,25 60,35 5 62,8 60,35 2724,188. k y m i i i S n y y = = − = − + − + − =∑ ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) 2 2 2 2 , 1 1 2 2 2 2 2 2 40 43 42 43 ... 47 43 76 75,25 75 75,25 ... 68 75,25 60 62,8 58 62,8 ... 70 62,8 223,55. ink y v ij i i j S y y = = = − = − + − + + − + + − + − + + − + + − + − + + − = ∑∑ The results are summarized in table 43: Table 43: ANOVA output Source of variability Sums of squares Degrees of freedom Averaged sums of squares Test criterion F Factor x (among-group variability) 2724,188 k – 1 = 2 1362,1 73,12 Residual (within-group) variability 223,55 N – k = 12 18,63 Total variability 2947,74 N – 1 = 14 The test criterion 73,12T = , the critical value 2,12 (0,05) 3,89F = , the critical region is [3,89; )C = + ∞ . Since T belongs to the critical region, we reject the null hypothesis. Parsley yields depend on type of fertilizer used. 2) To answer the question: „How strong is the relationship between type of fertilizer used and parsley yields?“, we calculate the correlation ratio ,y m y S P S = , where myS , is the among-group variability, yS is the total variability. We get 2724,188 0,92 0,96 2947,74 P = = = . Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 106 Raising the result to the second power, we obtain the determination ratio 2 0,922P = . A value of the determination ratio which is close to one signals a strong dependence of parsley yields on type of fertilizer used. 3) The table with entry data already contains the subgroup averages needed for testing dependence of fuel consumption Y on type of gasoline X1 and driver X2. There are four levels of X1, 4=k , and six levels of X2, 6=r . In the case of X1, we test the null hypothesis: factor X1 is not influential, versus the alternative hypthesis: X1 is influential. The hypotheses for the second factor X2 are analogous. The individual sums of squares are: ( ) ( ) ( )1 4 2 2 2 . 1 6 7,38 7,5 ... 
7,48 7,5 0,21.X i i S r y y =  = − = − + + − =  ∑ ( ) ( ) ( )2 6 2 2 2 . 1 4 7,33 7,5 ... 7,65 7,5 0,358.X j j S k y y =  = − = − + + − =  ∑ The most straightforward way to calculate the residual sum of squares RS is to calculate first the total variability S, and then evaluate the residual sum of squares from equation 1 2R X XS S S S= + + . ( ) ( ) ( ) ( ) ( ) 24 6 2 2 2 2 1 1 7,5 7,5 6,9 7,5 ... 8,2 7,5 7,7 7,5 3,79.ij i j S y y = = = − = − + − + + − + − =∑∑ Thus, 3,22.RS = The test criterion for the first factor is ( )( ) 1 0,21 1 3 0,33. 3,22 151 1 X R S kT S k r −= = = − − The critical value is 3,15(0,05) 3,29F = . Since 0,33 < 3,29, we cannot reject the null hypothesis. Therefore, type of gasoline used seems to have no effect on fuel consumption. The test criterion for the second factor is ( )( ) 2 0,36 51 0,33. 3,22 151 1 X R S rT S k r −= = = − − The critical value is 5,15(0,05) 2,9F = . Since 0,34 < 2,9, we accept the null hypothesis again, therefore type of person driving the car does not affect fuel consumption either. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 107 - 8 FULL FACTORIAL EXPERIMENTAL PLANS This chapter covers the foundations of design of experiments, a branch of statistics dominated in industrial applications. To experiment means to change working conditions so that the best possible working procedures are found, and more knowledge is acquired about the product and related working process. The best possible working procedures can be viewed in the following sense: denoting Y the observed quality characteristic of a product (or we can work with more such characteristics 1 2 , ,..., k Y Y Y ) and denoting A, B, C,… the factors which potentially affect the product, their levels being 1 2 3 , , ,...A A A for factor A, 1 2 3 , , ,...B B B for factor B and so on, design of experiments aims to determine which of the factors are influential and what their optimal levels are with respect to optimal levels of the quality characteristics. 8.1 FOUNDATIONS OF EXPERIMENTING AND ITS APPLICATIONS In our context, experimenting means analysing various combinations of the levels of those factors that are thought to affect the observed quality characteristic of a product. The characteristic is a response, a result of an experiment. The response is related to a certain combination of the factor levels. Analysing the relations is how one can attain the objectives outlined in the previous paragraph. The outlined objectives may be achieved in more than one way. Some of the appropriate procedures were already described in the previous chapters. For example, to determine which factor is influential, analysis of variance can be exploited for this purpose. Regression analysis could also be used to define suitable levels of the factors with respect to the process output, represented in the regression equation by the dependent variable. The problem with these methods is that they often require a lot of data. Also, the methods are based on a set of conditions that must be met for the particular method to work correctly. Some of the conditions are even impossible to verify with a prescribed high probability, such as the form of the regression model governing the relation between several variables. For these and other reasons, design of experiments as a separate branch originated decades ago, to help solve reasonably the drawbacks just explained. We shall start the description of the discipline by explaining the meaning of several elementary terms the discipline works with. 
Factor: a parameter or independent variable affecting the observed quality characteristic of a product. We denote a factor with a capital letter, such as A, and its levels with the same letter with a lower index, such as 1 A (the first level of factor A). Two elementary types of factors exist: a) regulated factor is a variable thought to affect the observed quality characteristic. The levels of the variable can be set up and maintained, and it is desirable to do so. b) noise factor is a factor that has an adverse effect on the quality characteristic. The levels of the factor cannot be set up and maintained during the experiment, or it is not desirable to do so. Interaction of factors: a combined effect of two or more factors. In this case, the effect of one factor of the interaction generally depends on the effect of another factor of the interaction. The interaction of factors A and B is denoted as AB. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 108 Design of experiments is applied in areas, such as: simulation, product design and development, process design and development, testing and validation, solving production quality problems, measurement system analysis and improvement. 8.2 EXPERIMENTAL PROCEDURE The steps that must be taken to detect influential factors and their optimal levels form an experimental procedure. These steps are: planning the experiment (using brainstorming, for instance), designing the experiment, realizing the experiment, analysing the the results of the experiment. Planning the experiment means to set up an experimental team in the first place. Representatives of the departments that design the product and the process leading to the product should be members of the team. The team should have 2-15 members. The team members ought to participate in brainstorming sessions which will determine what quality or output characteristics of the product will be monitored, and which factors should be considered together with their starting levels. The planning phase results in defining the objective of the experiment, the objective being related to the product under scrutiny, and it should also define the characteristics of the product, based on which it will be possible to make a judgement on whether the objectives were achieved. Another result of the planning phase is a list of factors that could potentially play a role in defining the product quality. These information are used in the second phase – in designing the experiment, which results in an experimental plan. Such a plan is a table consisting of individual experimental runs. Each run, which is represented by a row in the table, tells the experimenter the levels the factors should take on during the experimental run. Later, we shall demonstrate this procedure in an example. Realizing the experiment takes place either separately at a laboratory or at the production line. The latter case is more challenging, of course, since the production capacity on the one hand, and the requirement to analyse the production through an experiment on the other hand, can lead to a clash of interests. Therefore, in the latter case, night shifts or weekends usually provide an opportunity to run the experiment outside laboratory. Analysing the results of the experiment means to seek a combination of the input factors that will optimize (at least approximately) the observed characteristic of the product. In the final step, the optimal factor set-up is verified by subsequent experiments and/or simulations. 
PROBLEM 1 (Spring) This problem demonstrates how to set up a full factorial experimental plan, using coded variables. What we are now interested in is how much pressure an industrial spring can withstand, the spring being compressed by a machine tool until it breaks down. The following factors are considered in the experiment. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 109 L = length of the spring, G = width of the spring, T = material of the spring. It is to be determined which factors influence the service life of the spring. SOLUTION Let us construct a table of the factor levels considered: we shall use two levels for each factor, therefore the corresponding experimental plan is called a two-level plan (there are also threelevel plans). Table 44: Factors and their levels Factor Symbol Lower level Upper level úroveň- + Spring length pružiny L 10 cm 15 cm Spring width G 5 mm 7 mm Spring material T A B Source: author’s There is more than one way how to build an experimental plan that will prescribe individual experimental runs. The so-called full factorial plan is one of the most frequently used schemes: Table 45: Full factorial plan Run L G T Y 1 10 5 A 2 15 5 A 3 10 7 A 4 15 7 A 5 10 5 B 6 15 5 B 7 10 7 B 8 15 7 B We talk about a full factorial plan because such a plan contains all possible combinations of the factor levels. The symbol Y will be used to denote the result of an experiment, i.e. the response to a specific combination of the factor levels. It is more suitable to prescribe an experimental plan, using the following symbols: the lower and the upper level of a factor is denoted -1 and +1, respectively. Table 45 then takes the form of table 46 Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 110 Table 46: Full factorial experimental plan using coded variables Run L G T Y 1 -1 -1 -1 2 +1 -1 -1 3 -1 +1 -1 4 +1 +1 -1 5 -1 -1 +1 6 +1 -1 +1 7 -1 +1 +1 8 +1 +1 +1 The conversion of the original variables to the coded variables, but not only their upper and lower levels as we do here, can be done according to equation 8-1 max min 0 max min 2 2 c x x x x x x + − = − , where 0 x is the variable in the original physical units, cx is the coded variable, maxx is the upper level of x, minx is the lower level of x. For instance, the conversion of factor L leads to 15 10 10 2 1 15 10 2 cL + − = = − − , or the upper level of factor G, G = 7, is 7 5 7 2 1 7 5 2 cG + − = = + − . A full factorial experimental plan containing k factors represents 2k n = experimental runs. For example, if k = 3 factors are considered in an experiment, there will be 3 2 8n = = experimental runs altogether. Therefore, there will be eight rows in the table of the experimental plan. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 111 Experimental plan prescribes how and under which conditions to proceed with the experiment. After applying the plan, values of the observed variable Y resulting from the experiment can be recorded. In our example, each experimental run has been carried out twice, and the results are recorded in table 47. Table 47: Full factorial plan and measurements Run Factor Factor Factor Output Output Average L G T 1Y 2Y Y 1 - - - 77 81 79 2 + - - 98 96 97 3 - + - 76 74 75 4 + + - 90 94 92 5 - - + 63 65 64 6 + - + 82 86 84 7 - + + 72 74 73 8 + + + 92 88 90 Source: author’s Table 47 ends the preparatory and experimental work. What follows now are calculations which are to determine the influential factors. 
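Before turning to those calculations, the following Python sketch shows one way to generate the coded plan of table 46 and to apply the coding rule 8-1. The helper function name is an illustrative choice, not part of the textbook.

from itertools import product

# All 2**3 = 8 sign combinations for the factors L, G, T
# (the run order may differ from table 46; the set of combinations is the same)
plan = list(product([-1, +1], repeat=3))
for run, (L, G, T) in enumerate(plan, start=1):
    print(run, L, G, T)

def to_coded(x0, x_min, x_max):
    """Coding rule 8-1: maps x_min to -1 and x_max to +1."""
    return (x0 - (x_max + x_min) / 2) / ((x_max - x_min) / 2)

print(to_coded(15, 10, 15))   # upper level of L gives +1
print(to_coded(5, 5, 7))      # lower level of G gives -1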
These factors may also include the interactions LG, LT, GT and LGT; therefore, interactions are usually included in experimental plans as well. This can be done after the experiment has run its course. The levels of the interactions are calculated additionally by multiplying the elements of the table that lie in the same row and in the columns of those factors which form the interaction. For instance, to calculate the levels of the interaction LG, the signs, or ones with the corresponding sign, are taken from a given row and the columns marked „L“ and „G“, and then they are multiplied. Table 48 represents the result of this procedure.

Table 48: Full factorial plan and interactions
Run   L   G   T   LG   LT   GT   LGT
1     -   -   -   +    +    +    -
2     +   -   -   -    -    +    +
3     -   +   -   -    +    -    +
4     +   +   -   +    -    -    -
5     -   -   +   +    -    -    +
6     +   -   +   -    +    -    -
7     -   +   +   -    -    +    -
8     +   +   +   +    +    +    +

8.3 EFFECT OF A FACTOR AND ITS SIGNIFICANCE

The effect of a factor is the change in the quality characteristic Y induced by changing the level of that factor from -1 to +1. To calculate the effect, we shall use the sign method: in the table of the experiment, each value from the column of Y is multiplied by the corresponding number one of the analysed factor (the one from the same row of the table), the products are summed together, and the sum is divided by one half of the number of experimental runs (the number of rows in the table). For instance, to calculate the effect of factor L in our example, we get

effect(L) = (1/4)(-79 + 97 - 75 + 92 - 64 + 84 - 73 + 90) = 18.

The effect of T is

effect(T) = (1/4)(-79 - 97 - 75 - 92 + 64 + 84 + 73 + 90) = -8.

An analogous procedure is used for the other factors, including their interactions:

effect(LG) = (1/4)(79 - 97 - 75 + 92 + 64 - 84 - 73 + 90) = -1.

The effects of all the factors from our example are recorded in the last row of table 49.

Table 49: Full factorial plan and factor effects
Run     L    G     T    LG   LT    GT   LGT   Y
1       -    -     -    +    +     +    -     79
2       +    -     -    -    -     +    +     97
3       -    +     -    -    +     -    +     75
4       +    +     -    +    -     -    -     92
5       -    -     +    +    -     -    +     64
6       +    -     +    -    +     -    -     84
7       -    +     +    -    -     +    -     73
8       +    +     +    +    +     +    +     90
Effect  18   1,5   -8   -1   0,5   6    -0,5
Source: author’s

Now, we would like to find out which of the effects are significant. To do that statistically, the variance of a factor effect, $\sigma_e^2$, must be estimated, the effect being a random variable since we work with a data sample. Under suitable conditions, this variance is the same for all factor effects:

8-2   $\sigma_e^2 = \dfrac{4\sigma^2}{N}$,

where N is the number of experimental runs (including their repetitions, if there are any). In our example, N = 16, as the table has eight rows, i.e. there are eight runs, and each run is implemented twice. If runs are repeated, we can calculate

8-3   $s^2 = \dfrac{\nu_1 s_1^2 + \ldots + \nu_k s_k^2}{\nu_1 + \ldots + \nu_k}$,

where $\nu_i = n_i - 1$, $n_i$ is the number of repetitions of the i-th experimental run, and $s_i^2$ is the sample variance of Y corresponding to the i-th experimental run. Now, the estimate of 8-2 is

8-4   $s_e^2 = \dfrac{4 s^2}{N}$.
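Before testing significance, the sign method above can be checked with a few lines of Python. The sketch below recomputes the effects in table 49 from the averaged responses (numpy assumed).

import numpy as np

# Coded columns of table 49 (runs 1-8) and the averaged responses Y
L = np.array([-1, +1, -1, +1, -1, +1, -1, +1])
G = np.array([-1, -1, +1, +1, -1, -1, +1, +1])
T = np.array([-1, -1, -1, -1, +1, +1, +1, +1])
Y = np.array([79, 97, 75, 92, 64, 84, 73, 90], dtype=float)

def effect(signs):
    # sign method: sum of sign * Y divided by one half of the number of runs
    return (signs * Y).sum() / (len(Y) / 2)

for name, col in [("L", L), ("G", G), ("T", T),
                  ("LG", L * G), ("LT", L * T), ("GT", G * T), ("LGT", L * G * T)]:
    print(name, effect(col))   # 18, 1.5, -8, -1, 0.5, 6, -0.5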
8.3.1 STATISTICAL TEST OF FACTOR SIGNIFICANCE

Estimate 8-4 is used to test the significance of a factor effect:

1. The null hypothesis is H0: The factor effect is insignificant; the alternative is H1: The factor effect is significant.
2. The test criterion is of the form $t = \text{effect} / s_e$.
3. The critical value is K = $t_{n_1 + n_2 + \ldots + n_k - n}(\alpha)$, where $n_1, \ldots, n_k$ are the numbers of repetitions of the first, second, ..., k-th experimental run, respectively (in our example, $n_i = 2$ for each i), and n is the number of experimental runs without the repetitions, i.e. the number of rows in the experimental plan. As we can see, the critical value concerns a Student's distribution.
4. If $|t| \geq t_{n_1 + n_2 + \ldots + n_k - n}(\alpha)$, the null hypothesis is rejected, and the factor is considered significant. In the opposite case, the factor is regarded as insignificant.

PROBLEM 2

Let us return to the Spring example. We have K = $t_{16-8}(0,05) = 2,306$. Using 8-3, we also get

$s^2 = \dfrac{8 + 2 + 2 + 8 + 2 + 8 + 2 + 8}{8} = 5$,

and this result inserted in 8-4 leads to

$s_e^2 = \dfrac{4 s^2}{N} = \dfrac{4 \cdot 5}{16} = 1,25$,   i.e. $s_e = 1,12$.

This allows us to evaluate the test criteria for all the factors, including interactions. The criteria are in table 50.

Table 50: Test of factor significance
Run   Y1   Y2   Effect        t
1     77   81   L = 18        16,07
2     98   96   G = 1,5       1,34
3     76   74   LG = -1,0     -0,89
4     90   94   T = -8,0      -7,14
5     63   65   LT = 0,5      0,45
6     82   86   GT = 6,0      5,36
7     72   74   LGT = -0,5    -0,45
8     92   88
Source: author’s

The critical value K = TINV(0,05,8) = 2,306 is exceeded in absolute value by the test criteria of the factors L, T and GT. Therefore, these three factors are significant, and the remaining factors are not, as far as their effect on the spring service life is concerned.

8.3.2 GRAPHICAL ASSESSMENT OF FACTOR SIGNIFICANCE

If the experimental runs are not repeated, the test above cannot be used. In such cases, a graphical method for detecting the influential factors exists. The method, as suggested by its name, is based on constructing a graph, on the horizontal axis of which the factor effects are designated. The vertical axis of the graph records the values

8-5   $P_i = \dfrac{100\,(i - 0,5)}{m}$,

where i = 1, 2, ..., m, m being the number of all the factors of the experiment including interactions. More precisely, the graph is a set of points [effect(i), $P_i$], where effect(i) is the i-th smallest effect among all the calculated effects. The points of the graph that seem to lie outside the central line running through the middle section of the graph suggest the significant factors. If the graph takes the form of an S-curve, which is the case under certain conditions, then some of the points of the graph will turn away from the approximately linear middle section of the S-curve. Those are the points that signal which factors are influential. When using the graphical method, it is convenient to set up a table similar to table 51, which contains the factor effects sorted in ascending order.

Table 51: Auxiliary data for graphical assessment of factor significance
Number i   1      2      3      4     5      6      7
Effect     -8,0   -1,0   -0,5   0,5   1,5    6,0    18
Factor     T      LG     LGT    LT    G      GT     L
P_i        7,14   21,42  35,71  50    64,29  78,57  92,86
Source: author’s

The second and the fourth row of table 51 are the coordinates of the points that form the graph (figure 12).

Figure 12: Points determining significance of factors

The graph shows that it is the points of the factors whose effects turn out to be significant that lie outside the line running through the middle section of the graph. These points relate to factors L, T and GT.

8.3.3 GRAPH OF INTERACTIONS

Significant interactions are usually accompanied by graphs that allow for a discussion of the optimal levels of the factors making up the interactions. For instance, the interaction GT can be accompanied by a graph which outlines the effect of the factor G on Y, depending on the level of the factor T. To do that, we can scrutinize the full experimental plan and select from it the values of Y corresponding to the different levels of G and T. Table 52 contains these values, together with the average value of Y.

Table 52: Responses of Y to various levels of G and T
G   T   Response 1   Response 2   Average Y
-   -   79           97           88
+   -   75           92           83,5
-   +   64           84           74
+   +   73           90           81,5
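The averages in table 52 can be computed and drawn with a short Python sketch (matplotlib assumed); the discussion below then interprets the two connected lines.

import matplotlib.pyplot as plt

g_levels   = [5, 7]          # physical levels of G (spring width in mm)
avg_T_low  = [88, 83.5]      # average Y for T at its lower level (table 52)
avg_T_high = [74, 81.5]      # average Y for T at its upper level

plt.plot(g_levels, avg_T_low, marker="o", label="T = lower level")
plt.plot(g_levels, avg_T_high, marker="o", label="T = upper level")
plt.xlabel("G (spring width, mm)")
plt.ylabel("average response Y")
plt.legend()
plt.show()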
Each of the coefficients 1 2 123, ,...,b b b can be calculated as one half of the effect of the factor the coefficient belongs to in the model. The absolute term of the model is .0 Yb = These are exactly the values one would get by applying the least squares method to the matrix of regressors, represented by the full experimental plan. In our Spring example, we have ˆy = 81,75 + 9L – 4T + 3GT. There are many reasons why such a model is constructed. It is constructed 1. to determine local minima/maxima of the factors involved, Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 117 - 2. to determine the direction of the so-called dynamic experimental planning, which shifts the experiment to a new subset in the domain of ‫ݕ‬ො, the new subset serving as an area for a new experiment. The shift is usually carried out, using the gradient of the model found. 3. to make local predictions of the quality characteristic ˆy . SUMMARY The reader has been acquainted with foundations of design of experiments in this chapter. It is possible to create a full experimental plan, or a fractional/partial experimental plan which will be analysed in the next chapter. Fractional plans are constructed if conditions of the experiment are such that not all experimental runs can be performed (the whole experiment would be too costly, for instance). Each factor participating in the experiment may potentially affect Y, a variable of interest. An effect is present if a change in the level of the factor leads to a change in Y. The presence of the effect can be verified either by a graphical method or a statistical test. Such procedures were described in the chapter. Planned experiments follow an experimental plan which prescribes individual experimental runs to be carried out and the order of the runs in which they are to be realized. There are two terms that must be distinguished in connection with experimenting: experimental run, which leads to measurements of Y, the variable of interest, under a given set of conditions; these conditions are represented by a specific row in the experimental plan; experiment, which is the set of all experimental runs. The aim of experimental planning is to determine which factors have a statistically significant effect on a quality characteristic Y, and to determine the optimal level of the significant factors so that the variable Y is optimized and/or stabilized. Stability of Y means that the variable remains optimal or close to its optimal state under various conditions (environment, product treatment, etc.). If this is the case, we talk about product robustness. The following terms were described in the chapter: experimental plan, experiment, experimental run, factor effect, test of factor significance, model of experiment. The following examples provide further assistance in the study of this subject matter. PROBLEM 3 A full experimental plan was constructed for two factors A and B. Each experimental run was realized twice. The result of the experiment is in table 53. Table 53: Result of an experiment with two factors A B 1Y 2Y - - 5 6 + - 5 5 - + 7 6 + + 5 4 Source: author’s Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 118 - Calculate: a. Effects of the factors A, B and AB, b. Write the equation of the incomplete quadratic model for this experiment, c. Estimate variance of the factor effects, d. Test significance of A, B and AB (the nivel of test is 5%). SOLUTION a. First, we shall attach average responses of Y to table 53. By doing so, we get table 54. 
Table 54: Average response to different combinations of factor levels A B 1Y 2Y Y - - 5 6 5,5 + - 5 5 5 - + 7 6 6,5 + + 5 4 4,5 The effects are: ( ) 1 5,5 5 6,5 4,5 1,25 2 Ae = − + − + = − , ( ) 1 5,5 5 6,5 4,5 0,25 2 Be = − − + + = , ( ) 1 5,5 5 6,5 4,5 0,75 2 ABe = − − + = − . b. The equation of the model is: 1,25 0,25 0,75 ˆ 5,375 2 2 2 y A B AB= − + − . c. The estimated variance is: 2 2 4 e s s N = , where 2 0,25 0 0,25 0,25 0,1875. 4 s + + + = = Thus, 2 4 0,1875 0,094. 8 es ⋅ = = Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 119 The estimated standard deviation is 0,31es = . d. We test the hypotheses: :0H Factor (effect) is not statistically significant; :1H Factor (effect) is statistically significant. The test criterion is e efekt t s = . We get: 4,03At = − , 0,8Bt = , 2,41ABt = − . The test criteria are compared to the following critical value: ( ) ( )8 40,05 0,05 2,776N nt t− −= = , where N is the number of all experimental runs including their repetitions, and n is the number of experimental runs excluding their repetitions. Since 4,03 2,776, 0,8 2,776, 2,41 2,776− > < − < , A is significant, the other factors are not. PROBLEM 4 Using the graphical method, find out which factor is significant. The factor effects and other data are contained in table 55. Table 55: Entry data for graphical assessment of factor significance i 1 2 3 4 5 6 7 Effect -8 -1 -0,5 0,5 1,5 6 18 Factor C AB ABC AC B BC A iP Source: author’s SOLUTION Expanding table 55 by calculating ( )100 0,5 i i P m − = , mi ,...,2,1= , m = the number of all factors including their interactions, i.e. m = 7 in this case, we have: Table 56: Pi’s for graphical evaluation of factor effects i 1 2 3 4 5 6 7 Effect -8 -1 -0,5 0,5 1,5 6 18 Factor C AB ABC AC B BC A iP 7,14 21,42 35,71 50 64,29 78,57 92,86 Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 120 The resulting graph is: Figure 14: Graphical evaluation of effects The central line running through the middle section of the graph does not seem to contain the „points“: 92,86; 7,14; 78,57. This suggests that the factors A, C and BC are influential. CONTROL TEST 8 Yes/No answers: 8.1 Experimental plan defines the order in which experimental runs are carried out? 8.2 A full plan working with 4 major factors consists of 8 experimental runs? 8.3 Factor effect can take on positive values only? 8.4 When testing factor significance, the corresponding critical value is related to a Fisher’s distribution? 8.5 When testing factor significance with the graphical method, those factors which lie outside the central line of the graph are regarded as significant? Complete the statement: 8.6 Experiment is a system of __________ 8.7 A full plan with three major factors has __________experimental runs. 8.8 The null hypothesis of the test of factor significance is: __________ 8.9 The graphical method of testing factor significance is used when __________ is/are not available. 8.10 The graph constructed for testing factor significance requires calculation of P(i) which is given by the formula __________ 8.11 Complete the table so that it represents a full experimental plan: Run A B 1 2 3 4 Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 121 - 8.12 The table below represents a full plan for factors A and B. Each experimental run has been realized twice. A B 1Y 2Y - - 2,3 2,6 + - 3,1 2,9 - + 3 3,5 + + 1,9 2,2 Source: author’s Calculate: a. the effect of the factors A, B and AB, b. the model of the experiment, c. the estimate of the factor effect variance. 
8.13 Test whether the effects of A, B and AB from 8.12 are significant (nivel of test = 5%). 8.14 Draw the graph of the interaction AB from 8.12. Depict the effect of A on Y, depending on the level of B. What level of B maximizes Y? SOLUTIONS 8.1 yes 8.2 no 8.3 no 8.4 no 8.5 yes 8.6 runs 8.7 823 = 8.8 insignificant 8.9 repetition of individual runs 8.10. ( )100 0,5 ,i i P m − = where mi ,...,2,1= and m is the number of all the factors. 8.11. Run A B 1 - - 2 + - 3 - + 4 + + 8.12. a. effect(A) = – 0,325; effect(B) = – 0,075; effect(AB)= – 0,875. b. 2,69 0,1625 0,0375 0,4375Y A B AB= − − − . c. 2 s = 0,029; es = 0,12. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 122 - 8.13. Only AB is significant, since 7,29 2,776− > . 8.14. The graph shows that the maximal value of Y is achieved for the upper level of B, and the interaction proves to have an effect on Y, as the two lines resemble a cross. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 123 - 9 TWO-LEVEL FRACTIONAL PLAN The previous chapter presented full factorial plans. It is, however, not always possible to construct such a plan, for different reasons including financial limitations. In such cases, fractional plans are used instead. Fractional plan is a partial plan which does not work with all possible combinations of the factors observed. As we shall learn in this chapter, different degrees of full plan reduction exist, the so-called half plans being among them. It is also the half plans we shall devote to, in particular. Full factorial plans prescribe experimental runs for each factor, whereas fractional plans set down runs only for a subset of the factors, for main factors, and the remaining factors (secondary factors) are calculated as combinations of the main factors. The calculation is later used to define runs for the secondary factors. In this way, the total number of experimental runs can be reduced, which leads to a fractional plan. If 2k denotes the number of experimental runs of a full experiment, k being the number of its factors excluding interactions, then pk− 2 denotes the corresponding fractional plan, p being the degree of the reduction in this case. For instance, if we want to cut down 128 runs of a 27 full experiment by one half to 17 7 2 2 2 − = , we get a half plan with 642 17 == − n runs. This is the smallest possible degree of reduction of the full plan. Plans originating from full plans by applying the smallest degree of reduction are called half plans. The degree of reduction p can be greater than one, such as four, leading to a 47 2 − fractional plan, which will have n = 8 runs. If the number of factors exluding interactions is k = 7, the degree of reduction equal to 4 would be the highest possible, based on the rule that the number of runs should be at least as high as the number of all factors. In the opposite case, too few data is available to perform a meaningful analysis. To give another example, if k = 15, the highest possible degree of reduction is 11 since 1115 2 − = 16. The degree of reduction equal to 12 would lead to 1215 2 − = 8, which is too small a number. The degrees of reduction between two and the highest possible degree define central plans. For example, 27-1 and 27-4 belong to this category. To sum up, fractional factorial plans can be divided into a. Half plans, representing the smallest degree of reduction of full plans, b. Plans, representing the highest possible degree of reduction of full plans, c. Central plans. Let us take a closer look at the half plans now. 
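The run-count arithmetic behind these reductions is easy to check with a few lines of code. The sketch below (Python, illustrative only) finds the largest admissible degree of reduction p under the rule just mentioned, namely that the number of runs 2^(k-p) should not drop below the number of factors k.

# For k factors, a 2^(k-p) fractional plan has 2^(k-p) runs. The rule of thumb used above:
# keep at least as many runs as there are factors k. The function finds the largest such p.
def max_reduction(k):
    p = 0
    while 2 ** (k - (p + 1)) >= k:   # would one more halving still leave enough runs?
        p += 1
    return p

for k in (7, 15):
    p = max_reduction(k)
    print(f"k={k}: largest p={p}, runs=2^{k - p}={2 ** (k - p)}")
# k=7 : largest p=4,  runs=2^3=8
# k=15: largest p=11, runs=2^4=16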
Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 124 - 9.1 HALF PLANS We shall demonstrate in an example the effect of reducing a full plan by one half. To understand better what is going on, however, we need to define other fundamental terms from the theory of experimental plans. We do so now. We shall use the symbol I to denote the factor whose column in the experimental plan contains only ones. We call this variable identity factor. We also define multiplication of two factors: such an algebraic operation applied to two factors yields as its result a factor whose column in the experimental plan contains values obtained by multiplying the ones from the corresponding rows of the plan and from the columns of the two factors that appear in the multiplication. The multiplication possesses the following basic algebraic properties: A.A = I A.I = I.A = A (A.B).C = A.(B.C) A.B = B.A Suppose now that A, B, C, D, E are factors for which a half plan is to be constructed. To do so, one must select four of these factors (the main factors) for which the full plan will be created. These can be, for instance, the factors A, B, C, D. The remaining factor(s), the factor E in this case, will be defined as a combination (multiplication) of the main factors: let us define E = ABCD. In this way, instead of working with the 25 full plan, we shall work with the 25-1 half plan. Only one half of all two-level combinations of all the factors is set up before the experiment is carried out, whereas the remaining two-level combinations are assigned by the multiplication. Now, not every combination of the factors is appropriate. Every combination forms a word. Such a word consists of letters. Number of letters defines the length of the word. The equality E = ABCD is called the plan generator. The pk− 2 factorial plans contain p generators. Since E.E = E. ABCD, we get I = ABCDE. Words that yield the indentity factor I are called defining equations. The shortest word among the defining equations is the resolution of the plan. The length of the shortest word is designated by the corresponding Roman figure in the symbol of the plan: for instance, 15 2 − V applies to our example. Using the defining equations, we can find factor pairs (interactions, in general) with the same columns in the experimental plan. Such pairs are called interchangeable, and they play a role in the entire data analysis based on experimental planning. To illustrate the idea, if the plan generator is E = ABCD, the defining equation is I = ABCDE. In this case, an interchangeable pair for the interaction DE, for instance, is obtained by multiplying the defining equation by DE: I = ABCDE /. DE DE.I = DE.ABCDE. Hence, DE = ABC. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 125 The two interactions have the same column of ones in the experimental plan. The following problem shows the principles of working with half plans. PROBLEM 1 (Dyestuff) The amount of dyestuff Y left in a piece of fabric is observed. The amount depends on five factors: A = pH, B = temperature, C = concentration of the solution used for tinting, D = finishing temperature of tinting, E = finishing time. We are to construct a half experimental plan, and use it to detect influential factors. The necessary entry data is in table 57. Table 57: Entry data for problem 1 Factor Symbol -1 +1 pH A 4,5 5,5 temperature B C0 70 C0 80 concentrationn C 1 g/l 3 g/l finishing temp. D C0 170 C0 190 finishing time E 50s. 70s. 
Source: author's

SOLUTION
The main factors are chosen to be the factors A, B, C, D, while E is selected to be the secondary factor. Based on the half plan, constructed as described above and depicted by table 58, the experiment is performed, leading to the following values of Y:

Table 58: The output of the experiment
Run   A   B   C   D   E = ABCD   Y
1     -   -   -   -   +          6,4
2     +   -   -   -   -          9,9
3     -   +   -   -   -          8,1
4     +   +   -   -   +          6,6
5     -   -   +   -   -          9,0
6     +   -   +   -   +          5,3
7     -   +   +   -   +          -5,1
8     +   +   +   -   -          -1,0
9     -   -   -   +   -          10,6
10    +   -   -   +   +          12,7
11    -   +   -   +   +          12,9
12    +   +   -   +   -          11,2
13    -   -   +   +   +          2,4
14    +   -   +   +   -          9,7
15    -   +   +   +   -          4,1
16    +   +   +   +   +          4,0

The factor effects are calculated here the same way as for full experimental plans. For instance, the effect of D is

effect(D) = (1/8)·[-(6,4 + 9,9 + ...) + (10,6 + 12,7 + ... + 4,0)] = 4,8.

The effects of the other factors are obtained similarly. However, since we work with a half plan, interchangeable pairs exist in this case. As we said before, such pairs are factors that have the same column of ones in the experimental plan. The effect calculated for one of these factors does not belong to that factor alone anymore! It now represents the influence of all the interchangeable factors together. This is illustrated in table 59. If the effect were not attributed to all the interchangeable factors, it would lose the original interpretation it had in the case of the full experimental plan.

Table 59: Effects in a half plan
Factor        Effect
A + BCDE      0,0
B + ACDE      -4,4
C + ABDE      -5,0
D + ABCE      4,8
E + ABCD      -0,8
AB + CDE      0,2
AC + BDE      -0,6
AD + BCE      -0,6
AE + BCD      0,5
BC + ADE      -4,2
BD + ACE      1,1
BE + ACD      -0,2
CD + ABE      0,7
CE + ABD      -0,5
DE + ABC      2,4

For instance, the first zero effect represents the collective influence of the factors A and BCDE, the interaction which is interchangeable with A.

9.2 GRAPHICAL EVALUATION OF FACTOR EFFECT
The graphical method we used to verify significance of factor effects in the case of full experimental plans can be exploited for half plans, as well. This means we calculate the Pi's:

Table 60: Sorted effects and their Pi's
i    Factor        Effect   Pi
1    C + ABDE      -5       3,3
2    B + ACDE      -4,4     10
3    BC + ADE      -4,2     16,6
4    E + ABCD      -0,8     23,3
5    AD + BCE      -0,6     30
6    AC + BDE      -0,6     36,6
7    CE + ABD      -0,5     43,3
8    BE + ACD      -0,2     50
9    A + BCDE      -0,0     56,6
10   AB + CDE      0,2      63,3
11   AE + BCD      0,5      70
12   CD + ABE      0,7      76,6
13   BD + ACE      1,1      83,3
14   DE + ABC      2,4      90
15   D + ABCE      4,8      96,6

Once the Pi's are calculated, we construct the familiar graph (figure 15).

Figure 15: Graphical evaluation of factor effect significance

As we can see, the half plan brought us results similar to those obtained with the full experimental plan. This means we get similar results with fewer experimental runs. However, the results are not generally the same, and a certain amount of information has been lost after all, because each calculated effect belongs to the combined influence of several factors, such as A + BCDE, B + ACDE, etc. The fact that a particular effect belongs to A + BCDE does not mean that exactly one half of the effect is caused by A and the other half by BCDE! Generally speaking, it is not known what part of the total effect belongs to A or to BCDE in this case. Nonetheless, it is known that the longer the word representing the interaction/factor, the smaller its contribution to the total effect.
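The construction of the half plan and of its interchangeable pairs can also be reproduced programmatically. The following Python sketch is illustrative only: it generates the 2⁵⁻¹ plan with the generator E = ABCD and derives the alias of any word by multiplying it with the defining word ABCDE, which amounts to cancelling repeated letters; its output matches the pairs listed in table 59.

from itertools import product

factors = "ABCD"                 # main factors; the secondary factor E is generated as E = ABCD
defining_word = set("ABCDE")     # from the generator E = ABCD, since E.E = I gives I = ABCDE

# The 2^(5-1) half plan: a full plan in A, B, C, D with the column E computed as the product A.B.C.D.
plan = []
for levels in product([-1, 1], repeat=4):
    row = dict(zip(factors, levels))
    row["E"] = row["A"] * row["B"] * row["C"] * row["D"]
    plan.append(row)

def alias(word):
    # Multiplying a word by the defining word cancels repeated letters (X.X = I),
    # which is a symmetric difference on the sets of letters.
    return "".join(sorted(set(word) ^ defining_word))

for word in ["A", "B", "C", "D", "E", "AB", "AC", "AD", "AE",
             "BC", "BD", "BE", "CD", "CE", "DE"]:
    print(word, "+", alias(word))   # reproduces the interchangeable pairs of table 59

The sketch only enumerates the pairs; how much of a combined effect each member contributes depends on the length of its word, as noted above.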
Therefore, it is advisable that the fractional plans be generated in such a way that the interchangeable pairs are very long interactions. PROBLEM 2 Let us have five factors A, B, C, D, E, where factor E is to be generated as a secondary factor. There is more than one way of generating E. Let us compare the consequences of two scenarios: a. E = AB, b. E = ABCD. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 128 - SOLUTION The defining equations are: a. I = ABE b. I = ABCDE In a), we have a 15 2 − III plan, whereas in b), we have a 15 2 − V plan. The plan b) is better because its resolution is V, which leads to the following discovery: when seeking the interchangeable pairs for A, for instance, we have a. A = BE b. A = BCDE In the second case, the interchangeable pair has more factors (it is represented by a longer word), which results in the interaction BCDE contributing less to the total effect of the factor A+BCDE. Interchangeable interactions, represented by words with at least three letters, account for such a small part of the total effect that their contribution to the total effect is often neglected. This, of course, facilitates the interpretation of the final effect. SUMMARY We have learnt how it is possible to set up a fractional plan – it is the full plan constructed for a subset of the set of all factors, while the levels of the remaining one-letter factors are generated (calculated). This procedure reduces the total number of experimental runs, making the total experiment cheaper and faster. We’ve also learnt that the graphical method used to detect influential factors of a full plan can be exploited for the same purpose in the case of half plans. Finally, we explained that it is important to select a proper generation of the levels of the secondary factors. A proper generator leads to favourable interchangeable pairs for each factor, facilitating the conclusion over how large effects belong to each factor. What follows is a set of illustrative examples. PROBLEM 3 A half plan was constructed for factors A, B, C and D: a. Complete the table below, b. Using the graphical method, determine which factor is significant. The necessary data are in table 61. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 129 Table 61: Effects and their probabilities Factor Effect i iP A + BCD 1 3 35,7143 B + ACD -0,5 2 21,4286 C + ABD -4 1 D + ABC 3 4 50 AB + CD 9 6 78,5714 AC + BD 6 5 64,2857 AD + BC 17 7 92,8571 Source: author’s SOLUTION a) Using the equation ( )100 0,5 i i P m − = , we get ( ) 1 100 1 0,5 7,14 7 P − = = . b) The following graph 16 implies that the interactions AD and BC and also the factor C could be considered significant. Figure 16: Graphical evaluation of factor significance PROBLEM 4 A half plan has been constructed for factors A, B, C, D, the generator of the plan being D = ABC. The output of the experiment is in table 62. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 130 a. Calculate effects of the factors including the three-factor and four-factor interactions. b. Using the interactions from a), estimate variance of the factor effects (this is a new procedure to be explained!) c. Write the defining equation and interchangeable pairs. d. Assess graphically the factor effects. 
Table 62: Experimental output A B C D = ABC Y ABC ABD BCD ACD ABCD - - - - 77 + - - + 67 - + - + 64 + + - - 51 - - + + 64 + - + - 53 - + + - 73 + + + + 67 Source: author’s SOLUTION a) Inserting the remaining signs + and – (or plus ones and minus ones) in the table, we get the effects: ( ) 1 77 67 64 51 64 53 73 67 10 4 A BCDe e= − + − + − + − + = − = ; 1,5B ACDe e= − = ; 0,5C ABDe e= − = ; 2D ABCe e= = ; 129ABCDe = . b) If there is only one measurement (experimental output) for each combination of the factors (for each row of the table), 2 s needed to estimate the variance is calculated as the average of the second powers of the effects belonging to the longest interactions: ( ) ( ) ( ) 2 2 22 2 2 2 2 2 0,5 10 1,5 129 3349,5 5 4. 1674,75 40,9.e e s s s s n + − + − + − + = = = = ⇒ = c) Since D = ABC, the defining equation is I = ABCD. The interchangeable pairs are AB, CD; AC, BD; AD, BC. d) Calculating the Pi’s, as in table 63, Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 131 Table 63: Factor effects and their Pi’s Factor Effect i Pi A + BCD -20 1 7,14 B + ACD -3 2 21,42 C + ABD -1 3 35,71 D + ABC 4 6 78,57 AB + CD 1 4 50 AC + BD 3 5 64,28 AD + BC 26 7 92,85 we construct the graph (figure 17). Figure 17: Graphical evaluation of factor effects The graph shows that the effects of AD + BC and A+BCD lie outside the central line. These effects are deemed significant. In the latter case, we can restrict ourselves to the factor A, since the part of the total effect of A+BCD belonging to BCD will be small enough to be neglected. CONTROL TEST 9 Yes/No answers: 9.1 Factorial plans contain all combinations of factor levels? 9.2 The commutative property of multiplication known from the theory of real numbers does not hold true in the case of factor multiplication? 9.3 If the generator of the plan is D = BC, the defining equation is I=BCD? 9.4 If interactions ABC and DE have the same columns in the experimental plan, the effect calculated from one of these columns belongs to both interactions? 9.5 When a half plan is constructed, one of the factors is defined as an interaction of other factors? Complete the statement: 9.6 Secondary factors are expressed as a __________ of the main factors. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 132 - 9.7 Factorial plans can be divided into __________plans, __________plans and __________ plans. 9.8 Given a factor A, the following holds: AI = IA = ___, where I is the identity factor. 9.9 Two factors with the same column of ones in an experimental plan are called__________. 9.10 In full plan, effects of factors and their interactions __________ the same as in the related half plan. 9.11 Construct the half plan for factors A, B, C, D, generated by B=ACD. Calculate the effect of C provided the following two measurements for each factor combination were obtained from the experiment: The first series of measurements: 10,11,14,12,12,10,13,14, The second series of measurements: 11,12,12,8,14,12,13,14. 9.12 A half plan was used for factors A, B, C, D. a) Complete the table below, b) Detect significant factors with the graphical method. Effects i iP A + BCD 1 B + ACD -8 C + ABD -10 D + ABC 4 AB + CD 9 AC + BD 7 AD + BC 5 Source: author’s SOLUTIONS 9.1 no 9.2 no 9.3 yes 9.4 yes 9.5 not necessarily 9.6 combination 9.7 half plans, central plans, highest-reduction plans (saturated plans) 9.8 A 9.9 interchangeable 9.10 are not 9.11 1Ce = Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 133 - 9.12 a. 
Effects i iP A + BCD 1 3 35,71 B + ACD -8 2 21,42 C + ABD -10 1 7,14 D + ABC 4 4 50 AB + CD 9 7 92,85 AC + BD 7 6 78,57 AD + BC 5 5 64,28 b. Significant factors: B, C (figure 18): Figure 18: Graphical evaluation of factor effects Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 134 - 10 TAGUCHI’S METHODS – LOSS FUNCTIONS Taguchi’s methods, based on the research by Genichi Taguchi, include online methods used during production, and offline methods reserved for pre-production stages. The former is the contents of the chapters 10 and 11, and it relies heavily on loss functions. In this chapter, we shall explain the logic behind these functions and the way they are constructed and used. The first part of the chapter defines loss function, and presents its properties. The second part of the chapter works with different kinds of loss functions, which is related to different kinds of what is called tolerance interval. The end of the chapter presents examples and a control test with questions and answers. Taguchi’s methods based on loss functions try to measure financial losses experienced by product users due to producers’ inability to fabricate a product that would precisely comply with users’ demands. Most often there is always at least a slight imprecision in the production due to its physical nature, no matter how much the production is surveilled and controlled. Introduction of loss functions brought a new concept to how problems with quality are viewed. Earlier, the standard approach had been such that as long as the observed quality characteristic of a product lied within a tolerance interval, the characteristic not necessarily being equal to its desired optimal value, the product users would not bear any losses incurred by quality imprecisions. Taguchi disagreed with this view of the problem, and introduced simple mathematical functions that suprisingly turned out to be precise enough to measure the losses that occur even when the slightest deviation of the product quality characteristic from its optimal level exists. Let us emphasize that the loss-function approach is only one of many forms of looking at the process or product. Whereas loss functions quantify the process quality, another question is how to improve the quality if it is detected to be inadequate, for example by a loss function. There are many ways how to solve problems within a process: Six Sigma methodology is one technique based on statistical methods (regression, in particular); analyzing process by simulation (Zgodavová and Bober, 2012) is another technique which may be used after process quality characteristics are properly measured or quantified (Zgodavová, 2010), and key concepts are exactly defined (Brannmark et al., 2012). The objective of process analyses is to create an optimal or close-to-optimal process set-up, and keep it that way, regardless of variations of factors which could destabilize the optimal set-up, i.e. keep the process robust (Siva, 2012). 10.1 DEFINITION AND PROPERTIES OF LOSS FUNCTIONS Before defining the functions and presenting their graphs and properties, let us mention some fundamental conditions which are considered to be met in order for the functions to be used correctly and properly: Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 135 - 1. Every product bears a certain quality characteristic (such as its size, weight, mechanical property, etc.), and the quality of the whole product is judged based on that particular quality characteristic. 2. 
A target (optimal) value T is given for the quality characteristic from 1).
3. Lack of product quality is measured by the deviation of the observed product quality characteristic from its target value T.
4. Any deviation of the characteristic from T brings a financial loss that the product user must bear because of the necessity to increase expenses on product maintenance, repairs, etc.

One of the simpler loss functions is of the form

10-1    L(Y) = k(Y - T)²   for Y ∈ (T - d, T + d),
        L(Y) = A           otherwise,

where
T = target value of the quality characteristic,
d = tolerance,
A = maximal loss due to poor quality,
Y = truly achieved value of the quality characteristic (which is a random variable),
L(Y) = financial loss given by the specific value of Y,
k = constant to be determined.

Figure 19 shows the loss function just described. Above the tolerance interval (T - d, T + d), it is a parabola, whereas outside the interval, it is a constant function.

Figure 19: Loss function

If Y ≤ T - d or Y ≥ T + d, in other words, if |Y - T| ≥ d, then L(Y) = A. We can therefore write:

10-2    A = kd².

Since d and A are usually known, 10-2 is used to determine k: k = A/d².

PROBLEM 1
Write the loss function equation for d = 5 and A = 2.

SOLUTION
We have k = 2/5² = 0,08, therefore L(Y) = 0,08(Y - T)².

The variable Y is considered to be a random variable, usually following approximately a normal distribution N(E(Y), σ²). We are often more interested in the average loss E(L) than in the individual loss. The average loss is calculated according to

10-3    E(L) = E[k(Y - T)²] = kE(Y - T)² = kσ²,

provided that E(Y) = T. The symbol σ² denotes the variance of Y, as usual. However, if E(Y) ≠ T, then E(L) = kσ² + k(E(Y) - T)². Therefore, several equations are used in connection with loss functions:
a. the defining equation L(Y) = k(Y - T)²,
b. the equation determining the constant k: A = kd²,
c. the equation for the average loss E(L) = kσ², or E(L) = kσ² + k(E(Y) - T)².

Quality costs can be enumerated in a much more complex manner involving all possible kinds of losses induced by the not-optimal product quality, such as expenses on repairs, expenses on product control, losses due to imprecise quality measurements, etc. We shall work with this concept in chapter 11. There are also loss functions dependent on more quality characteristics, i.e. they are functions of several variables. Some of the characteristics do not even have to be quantitative.

10.2 LOSS FUNCTIONS FOR DIFFERENT TYPES OF TOLERANCES
The loss function described in figure 19 is not the only one. There are more kinds of loss functions, depending on what tolerance interval we work with. What follows is a classification of some of the most fundamental loss functions. Each of the functions is accompanied by the corresponding graphical representation. We distinguish the following types of tolerances:

a) Symmetric N-tolerance

Figure 20: Symmetric N-tolerance

In this case, we write T ± d, where d = tolerance. The interval (T - d, T + d) is called the tolerance interval. The tolerance is symmetric in the sense that the target value lies in the center of the tolerance interval. If the quality characteristic observed is smaller than the lower tolerance limit T - d, the financial loss incurred equals A, and the same is true when the characteristic is greater than the upper tolerance limit T + d.
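A minimal numerical illustration of the symmetric case: the Python sketch below implements the loss function 10-1 and estimates the average loss for a small hypothetical sample, once directly as the mean of the individual losses and once through equation 10-3 with the variance estimated around the target. All numbers, including T, d and A, are assumed for illustration only.

def loss(y, T, d, A):
    # Symmetric N-tolerance loss 10-1: quadratic inside (T - d, T + d), equal to A outside.
    k = A / d ** 2
    return k * (y - T) ** 2 if abs(y - T) < d else A

# Hypothetical sample of the quality characteristic; T, d and A are assumed for illustration.
T, d, A = 25.0, 1.0, 40.0
sample = [25.1, 25.0, 24.9, 25.1, 24.8, 25.2]

individual = [loss(y, T, d, A) for y in sample]
avg_loss_direct = sum(individual) / len(sample)

# The same estimate via 10-3, E(L) = k * sigma^2, with sigma^2 estimated by the mean
# squared deviation from the target; the two agree while all points stay inside the tolerance.
k = A / d ** 2
s2 = sum((y - T) ** 2 for y in sample) / len(sample)
print(avg_loss_direct, k * s2)

The tolerance types described next adjust this basic form to other target values and tolerance limits.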
b) Nonsymmetric N-tolerance In this case, the loss function looks as described by figure 21. Figure 21: Nonsymmetric N-tolerance Here, the tolerance interval is (T - d1, T + d2). We see that there are two tolerances d1 and d2, which are generally different. There are also two generally different maximal losses A1 and A2, depending on whether the quality characteristic is too high or too low. The different maximal losses and different tolerances define two generally different curves above the horizontal axis, one of the curves being above and to the right of T, and the other being above and to the left of T. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 138 On the interval (T-d1, T), we work with equation 10-1, where k = k1. On the interval (T, T+d2) we work again with equation 10-1, however k = k2 in this case. In the former case, k1 = A1/ d1 2 , in the latter case, k2= A2/ d2 2 . c) S-type tolerance (S for Small) For this type of tolerance, the following is true: the smaller the observed quality characteristic Y, the better. The target value T = 0. Figure 22 describes the shape of an S-type tolerance loss function. Figure 22: S-tolerance To give an example of this situation, surface roughness can be the quality characteristic Y. Another example is air pollution. A certain upper tolerance/specification limit is acceptible, and beyond that limit, the losses reach their maximum. On interval (0, USL), we work with equation 10-1, where k = A/USL2 ; beyond USL, the loss function is constant. d) L-type tolerance (L for Large) In this case, the opposite is true: the bigger the characteristic Y, the better. The optimal/target value is ∞=T . Here, the loss function is of the form 10-4 ( ) 2 2 E L A d s= ⋅ ⋅ , where 2 2 (1/ )s E Y= , i.e. the expected value (the „average“) of the random variable 2 1/Y . Equations a) -c) give an individual loss. If we want to determine the average loss, we calculate the average of the individual losses. Since the population average is usually unknown, we estimate it with the sample average. The same is true for the case d) where the unknown population average 2 (1/ )E Y is replaced with its estimate 1 2 n Y− − ∑ . Before presenting examples, let us summarize the essentials. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 139 - SUMMARY We have described all major types of loss functions. We know that a quality characteristic is observed for each product. An optimal or target value is given for the characteristic, and if the optimal value is not achieved, certain financial loss is brought upon the product user. Depending on what the target value and tolerance are, we distinguish the following types of tolerances: 1. N-type tolerance: symmetric and unsymmetric, 2. S-type tolerance: smaller Y is better, T = 0. 3. L-type tolerance: larger T is better, ∞=T . Different loss functions correspond to the cases 1-3. PROBLEM 2 In a crankshaft production, length and diameter of crankshafts are observed. The diameter is supposed to be 25mm ± 1mm, while the length has a prescription of 100mm ± 2mm. If the diameter falls outside the tolerance interval, an individual loss of 40 crowns is generated; for the length, the individual loss is 30 crowns. Ten crankshafts have been taken out of the production line randomly for examination. These are the measurements of their length and diameter: Diameter (in mm): 25,1; 25; 25; 24,9; 25,1; 25; 24,9; 25; 25,1; 24,9. Length (in mm): 99,9; 99,9; 99,8; 100,2; 100; 100; 100,1; 98; 99,9; 100,2. 
Compare the quality of two operations: one that yields a certain diameter of the crankshaft, and the other resulting in a length of the crankshaft. SOLUTION: Diameter: T1 =25, A1 = 40, d1 = 1. ( ) ( ) ( ) 2 2 22 1 25,1 25 25 25 ... 24,9 25 0,006 10 s  = − + − + + − =   . 2 40 ( ) 0,006 0,24 1 Estimated E L = = crowns per unit. Length: T2 = 100, A2 = 30, d2 = 2. ( ) ( ) 2 22 1 99,9 100 ... 100,2 100 0,02 10 s  = − + + − =   . Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 140 - 2 30 ( ) 0,02 0,15 2 Estimated E L = = crowns per unit. We can conclude that the length of the crankshaft is produced at a higher quality than the diameter of this product. The total average loss is estimated to be 0,24 + 0,15 = 0,39 crowns per unit (per crankshaft). PROBLEM 3 Drums to be used in washing machines are supposed to be 30 cm wide (= diameter). The tolerances are specified as follows: 30cm-1cm, 30cm+4cm. If the diameter is smaller than the lower tolerance limit, a loss of 50 crowns is recorded by the consumer. If the diameter exceeds the upper tolerance limit, the loss is 100 crowns. Two companies produce the same drums. Compare their quality if the following data samples on their production are available: Table 64: Production data Company Deviations from the target value (!) A 0; 0; -1; 3; 0; 4; 2; -1; 0; 1; 2; 4 B -1; -1; 0; 0; 0; 3; 2; -1; 1; 2; 0 Source: author’s SOLUTION: The parameters are: A1 = 50, A2 = 100, d1 = 1, d2 = 4. Company A: ( ) ( ) ( )22 2 2 2 2 2 2 1 2 2 1 50 100 ( ) 1 1 3 4 2 1 2 4 12 1 4 Estimated E L   = − + − + + + + + +    ( )1 34,375Estimated E L = crowns per unit. Company B: ( ) ( ) ( ) ( )2 2 2 2 2 2 2 2 2 2 1 50 100 ( ) 1 1 1 3 2 1 2 11 1 4 Estimated E L   = − + − + − + + + +    ( )2 23,864Estimated E L = crowns per unit. The production of company B seems to give a higher quality. Let us take a closer look at the logic behind the calculations performed (the case A, the second case is analogous): we are estimating the average loss by the sample average. Therefore, the data are summed together and divided by the sample size, which is 12. Each of the terms in the summation represents an individual loss, i.e. a functional value of the loss function used. Since the tolerance we work Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 141 with is unsymmetric, each part of the loss function has its own defining equation. Each of these equations lead to a different value of the constant k. In one case, the constant equals 50, in the other case, it is equal to 100/16. PROBLEM 4 Ballbearings are produced by two different companies: Company A produces it with specification 0,4T ± , while the other company B must cling to specification 1T ± . Fifty thousand ballbearings is produced every day. Each ballbearing costs 0,60 crowns. If the tolerance interval is not achieved, the corresponding production unit is scrapped. A spot check at the two companies led to the following data samples: Company A: deviations from the optimal size: -0,3; 0,1; 0,2; 0; 0; -0,2; -0,1; 0; 0,4; 0,1; -0,1; 0, 0; 0,1; -0,2. Company B: deviations from the optimal size: 0; 0; 1; -0,8; -0,8; 0, 0,6; 0,7; 0; -0,3; -0,2; 0; 0; 1; 0,2. Compare the production quality of the two companies. SOLUTION 1. Company A: A = 0,6; d = 0,4. We have ( ) ( ) 2 22 21 0,42 0,3 0,1 ... 0,2 0,028 15 15 s  = − + + + − = =   . and 2 2 2 0,6 ( ) .0,028 0,105 0,4 A Estimated E L s d = = = crowns per unit. The daily loss is 50 000·0,105 = 5 250 crowns. 2. Company B: A = 0,6 ; d = 1. 
In this case, 2 2 21 0 ... 0, 2 0, 287 15 s = + + =   . and 2 0,6 ( ) 0,287 0,172 1 Estimated E L = = crowns per unit. The daily loss in the second case is 50 000·0,172 = 8 600 crowns. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 142 PROBLEM 6 The surface of a piston is adjusted during production so that the surface roughness does not exceed 10 mm. The smoother the surface, the better. If the upper roughness limit is exceeded, it is re-worked with a machine tool for 200 crowns. Two different workers at two different firms have the same job: they are in charge of how smooth the piston surface is. Compare the quality of their work when the following data samples are available (table 65). Table 65: Entry data Worker Surface smoothness 1 0, 1, 9, 6, 10, 2, 3, 0, 9 2 3, 2, 4, 4, 5, 2, 4, 6, 5, 3 Source: author’s SOLUTION The parameters are: A = 200, d = 10 (S-type tolerance) 1st Worker: 2 1 9 s = ( (0-0)2 + (1-0)2 + (9-0)2 + ... + (9-0)2 ) = 34,67. 2 2 2 200 ( ) 34,67 69,34 10 A Estimated E L s d = = = crowns per unit. 2nd Worker: 10 12 =s (32 + 22 + 42 + ... + 52 + 32 ) = 16. ( ) 2 200 16 32 10 Esitmated E L = = crowns per unit. The second worker’s skills are more than twice as good as those of the first worker. PROBLEM 7 Rock-climbing equipment makers are required to produce ropes with stiffness of at least 300 kg. If the lower limit is not achieved, the rope must be re-stiffened at the cost of 50 crowns per metre. Compare two technologies of rope-making if the following production data is available Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 143 Table 66: Production data Technology Rope stiffness 1 305, 350, 350, 410, 310, 300, 350, 400 2 305, 301, 308, 306, 300, 320, 310, 310, 320, 325 Source: author’s SOLUTION The parameters are: A = 50, d = 300. It is the L - type tolerance. 1st technology. Based on 10-4, we have: 2 6 1 2 2 2 1 1 1 1 ... 8,62 10 8 305 350 400 s −  = + + + = ⋅    . The average loss is ( ) 2 6 1. 50 300 8,62 10Est E L − = ⋅ ⋅ ⋅ = 38,79 crowns per metre. 2nd technology Variance: 2 5 2 2 2 2 1 1 1 1 ... 1,03 10 10 305 301 325 s −  = + + + = ⋅    . Average loss: ( ) 2 5 2. 50 300 1,03 10Est E L − = ⋅ ⋅ ⋅ =46,76 crowns per metre. CONTROL TEST 10 Yes/No questions: 10.1 Loss function can be described mathematically as ( ) ( )2 L Y k Y T= − ? 10.2 The higher Y, the better…this is the S –type tolerance? 10.3 If a product feature has its optimal value, a lower-than-optimal quality of the product manifests itself by deviations of the feature value from the optimal value? 10.4 With the N-type tolerance, the optimal value is smaller than the target value? 10.5 Loss functions are in part a parabola? Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 144 Complete the statement: 10.6 Any deviation from the target value T brings __________ . 10.7 A certain product__________is usually observed, based on which we judge the quality of the product. 10.8 A part of loss function is mathematically a __________. 10.9 Based on what is considered the target value T, we distinguish these types of tolerances:__________, __________, __________. 10.10 When working with the S – tolerance, the target value T = __________. 10.11 A product diameter and weight are observed. The diameter is to be T1 = 20cm ± 1 and the weight is to be T2 = 100g ± 2. If the diameter falls outside the tolerance interval, it costs 20 crowns to repair the product or scrap it; for the weight, the same cost is 30 crowns. 
Ten product units have been randomly drawn from the production line: Their diameter was: 20,1; 20; 20; 19,9; 20,1; 20; 19,9; 20, 20,1; 19,9. Their weight was: 99,9; 99,9; 99,8; 100,2; 100; 100; 100,1; 9,8; 99,9; 100,2. Compare the production quality in terms of the ability of the company to keep the target value of the diameter and the weight. 10.12 In filter production, it is required that the throughput of the filter be 10% at the most. Two filter makers were inspected, and these are the results of the inspection: Maker % throughput A 3, 9, 9, 7, 1 B 8, 8, 1, 1, 2, 5 Source: author’s If the throughput tolerance is exceeded, the costs of the maker A rise by 600 crowns, whereas in the case of the maker B, the costs increase by 700 crowns. Which producer yields a betterquality filter? SOLUTIONS 10. 1 yes 10. 2 no 10. 3 yes 10. 4 no 10. 5 yes 10. 6 losses 10. 7 characteristic 10. 8 ( ) ( )2 L Y k Y T= − Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 145 - 10. 9 N (nominal), S (smaller), L (larger) 10. 10 0 10. 11 Diameter: ( ) 0,12E L = crowns per unit; Weight: ( ) 0,15E L = crowns per unit. 10. 12 A: ( ) 265,2E L = crowns per unit; B: ( ) 185,5E L = crowns per unit. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 146 - 11 TAGUCHI’S METHODS: TOTAL QUALITY COSTS In the previous chapter, we worked with loss functions that measured financial losses of customers, resulting from product lower quality. The losses were feature-specific. In this chapter, we shall include in the losses other types of expenses that are not directly linked to a specific product feature. Inclusion of these expenses leads to a total quality costs function. In chapter 11, we shall talk about control charts, as well. This is a very important part of quality management, as it monitors stability of production processes. The chapter is divided into four parts: the first part discusses quality cost monitoring, the second part introduces the total quality cost function when 100% control of the production is realized, the third part of the text provides the reader with the total quality cost function when the production is controlled after every batch of n production units, and the final part of the chapter describes basic types of control charts. 11.1 QUALITY COST MONITORING The term „quality cost“ can mean more than one thing, nevertheless, it is mostly connected to expenses on ensuring or improving quality, as well as expenses of nonproductive nature, such as those resulting from making nonconforming products. From the practical point of view, it is convenient to divide the quality costs into three categories: - quality costs of the producer, - quality costs of the customer, - quality costs of the whole society. We shall focus on the first category. Producers must invest in prevention, production evaluation and defect removal, so that appropriate quality is achieved in all production stages, i.e. in the product development, product manufacturing, product installation and product use. Monitoring these „investments“ allows for product improvement. There are different ways of monitoring the expenses: 1) monitoring based on PAF models, 2) monitoring based on process models, 3) monitoring through the Taguchi’s approach. 
Ad 1) PAF models (Prevention, Appraisal, Failure)
This model is based on dividing company costs into four categories:
- Costs resulting from internal defects (these defects originate within the company before its final product reaches the customer),
- Costs resulting from external defects (these include customer complaints, repairs, handling costs, discounts, expenses due to lawsuits, market share losses, etc.),
- Evaluation costs (these are mainly expenditures on measuring customer satisfaction, measuring equipment, software, certification, laboratory testing, etc.),
- Prevention costs (these are expenses which should rise continuously; they include expenditures on exploring customer demands, management system development, education of employees and others).

Ad 2) Process models
Process models represent a higher degree of monitoring which keeps track of costs related not to each product but to processes. The costs involve expenses on converting process inputs into process outputs according to a plan, as well as expenses on resolving inconsistencies that were not supposed to originate in the process at all.

Ad 3) Taguchi's methods
These methods use mathematics to describe the relations between total quality costs and different factors that contribute to the costs. The exact nature of mathematics makes it possible to optimize the costs, which is, of course, an advantage of this approach. We shall devote ourselves to Taguchi's methods in the following two sections 11.2 and 11.3.

11.2 TAGUCHI'S APPROACH – THE CASE OF 100% PROCESS CONTROL
The total quality costs per unit are calculated in this case according to equation

11-1    L = Q/R + (A/d²)·s₀²,

where
Q = yearly expenses on the 100% control,
R = yearly production (number of product units made),
d = tolerance within which the product remains satisfactory in terms of its quality,
A = losses due to exceeding the tolerance d,
s₀² = [(y₂ - y₁)² + (y₃ - y₂)² + ... + (yₙ - yₙ₋₁)²] / (n - 1),
the y's being measurements of the observed product quality characteristic.

PROBLEM 1
An automated control (i.e. a 100% control) at a factory costs 25 000 crowns a year. Each year, four million product units leave the factory. The tolerance for the quality characteristic observed is 9, and the company's costs rise by 5 crowns each time the tolerance is exceeded. Calculate the total quality costs if a random sampling showed the „variability" of the characteristic to be s₀² = 1.
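Formula 11-1 is straightforward to evaluate numerically. The short Python sketch below uses the inputs of Problem 1 and can serve as a check of the hand calculation in the solution that follows; the variable names are illustrative.

# Total quality costs under 100% control, formula 11-1: L = Q/R + (A/d^2) * s0_sq.
Q = 25_000        # yearly cost of the automated control, in crowns
R = 4_000_000     # yearly production, in units
d = 9             # tolerance of the observed quality characteristic
A = 5             # loss in crowns each time the tolerance is exceeded
s0_sq = 1         # observed "variability" s0^2 of the characteristic

L_per_unit = round(Q / R + (A / d ** 2) * s0_sq, 3)
print(L_per_unit, "crowns per unit")             # 0.068
print(round(L_per_unit * R), "crowns per year")  # 272 000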
11.3 THE CASE OF PROCESS CONTROL AFTER N UNITS If n product units are made between two controls, the total quality costs are calculated according to formula 11-2 2 2 2 2 2 2 2 1 3 ms d A z n u D d AD d A u C n B L +      + + +++= , where A = loss due to exceeding tolerance d, B = product control costs, C = production machinery repair costs, n = control interval, u = average number of units produced between two controls, d = tolerance within which the product remains satisfactory in terms of its quality (the tolerance is defined by the customer), D = tolerance defined by the producer (it is usually more demanding than what the customer demands), z = number of product units made during the control, n B = control costs per unit, u C = repair costs per unit, 3 2 2 D d A = costs resulting from imprecise production,       + + z n u D d A 2 12 2 = costs due to producing defective units, 2 2 ms d A = costs due to imprecise measurements. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 149 Equation 11-2 is a result of Taguchi’s long-time experience, and it was mathematically defined, not derived. However, three terms in the equation are derived from loss functions. The question we usually ask ourselves is: How often should the production control take place so that the total quality costs were minimal? What tolerance should the company define for itself to minimize its total quality costs? The answers to the questions can be obtained by standard optimization procedures which seek local minima of a function: the first-order derivatives with respect to n and D are calculated (the two problems are solved separately), and the derivatives are put equal to zero. This necessary extremum condition leads to an optimal control interval of 11-3 * 2uB d n A D = . and an optimal tolerance 11-4 2 2 * 4 3CD d D Au = . 11.4 CONTROL CHARTS Control charts rank among major statistical tools for production process regulation. The charts were introduced by Walter Shewhart in the 1920s. Their aim is to monitor a characteristic of a process in time, and give a signal if a problem in the process occurs. If such a deterioration of the process takes place, the process owner reacts to the situation, and makes the necessary adjustments to the process. Thus, the charts serve as a problem prevention. Values of the observed characteristic are measured on the y-axis of the chart, whereas its xaxis records points in time at which the characteristic is observed or measured. The time series values of the characteristic should not exceed certain limits, nor should they form an improbable pattern. In either case, the chart signals a systematic impact on the process which has nothing to do with the natural character of the process. Such an impact may result, for instance, from a defect developed in a machine that is used during the process. Although nearly any process characteristic can be observed in time, it should satisfy some basic requirements if it is to be used in the framework of statistics. In our case, such a characteristic should at least approximately follow a normal distribution. Two properties of the process characteristic (or the process in question) are observed: a. its ability to keep itself close to a pre-defined target value, b. its variability around the target value. Therefore, two control charts are usually constructed, each of which observes either the property a) or b). 
Perhaps the most common charts are: the chart for the average x of the characteristic and the chart for the range of the characteristic R; the pair is denoted CC( )Rx, ; or the chart for the average x of the characteristic and the chart for its standard deviation s, i.e. the pair CC( )sx, . Let us take a closer look at the first pair. To construct a CC( )Rx, , we take the following steps: Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 150 a. We gather data about the characteristic at time points t = 1,2, …, m (the first column of table 67). b. The average and range minmax xxR −= are calculated for each of the samples, i.e. for each point in time t. c. An optimal value (central line), an upper limit and a lower limit are calculated to be used in the chart as a reference (see below). d. We plot the time averages in one chart to get the CC( )x chart. Likewise, we plot the individual time ranges in another chart to get the CC(R) graph. The time series depicted by either chart should stay below the calculated upper limit and above the calculated lower limit. Also, it should fluctuate more or less randomly around the central line of the chart. If it doesn’t seem to be the case, the process owner must check the status of the process. Tabulka 67: Samples and characteristics for CC( x , R) charts Data x-axis y-axis i = time average ix range Ri nxxx 11211 ,...,, 1 1x R1 nxxx 22221 ,...,, 2 2x R2 nxxx 33231 ,...,, 3 3x R3 : : : : mnmm xxx ,...,, 21 m mx Rm Figure 23 outlines the fundamental limits of the CC(R) chart: Figure 23: Limits and lines of the CC(R) control chart UCL = upper control limit, LCL = lower control limit, CL = central line. For the CC( x ) chart, the limits are calculated as follows:: 2LCL x A R= − , 2UCL x A R= + , CL x= , Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 151 - where 1 1 m i i x x m = = ∑ is the average of the individual sample averages from different time points. Further, 1 1 m i i R R m = = ∑ , is the average of the individual ranges from different time points. The limits for the CC(R) chart are: 4UCL D R= , 3LCL D R= , CL R= . The constants A2, D3 and D4 are in table 68 (the table has zeros where the data’s missing). Table 68: Constants for LCL and UCL limits of control charts Source: author’s Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 152 PROBLEM 2 Calculate the limits UCL, LCL and CL of CC( x ) and CC(R) if the following data is given Table 69: Data samples for CC( )Rx, Source: author’s SOLUTION Control chart CC( x ): 1 1 9,994 m i i x x m = = =∑ . 1 1 1,271 m i i R R m = = =∑ . 2 9,994 0,720 1,271 0,915LCL x A R= − = − ⋅ = . 2 9,994 0,720 1,271 10,909UCL x A R= + = + ⋅ = . 9,994CL x= = . Control chart CC(R): 4 2,282 1,271 2,9UCL D R= = ⋅ = . 3 0 1,271 0LCL D R= = ⋅ = . 1,271CL R= = . Subsequently, for each point in time, the corresponding individual average would be plotted on the vertical axis of the control chart CC( x ), the time point being plotted on the horizontal axis of the chart, and the same is true for the control chart CC(R), in which the individual range corresponding to the particular point in time would be plotted on the vertical axis. In the end, the charts are evaluated: the basic rule is that none of the points plotted in the chart is either above the UCL limit or below the LCL limit of the chart. If it does happen, the process must be checked as to what occured in the process at the moment when the point exceeded the chart limits. 
It is quite probable that something unnatural or systematic interfered the natural course of the process. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 153 - SUMMARY The chapter presented the total quality cost function, and also served as an introduction to control charts. The cost function was not derived mathematically, it was defined, based on experience of Genuchi Taguchi, a Japanese engineer. If production process is not controlled each time its product unit is made, it is convenient to figure out how often the control should be carried out and how precise it should be so that it didn’t cost too much. The total quality cost function can be used for these purposes. Control charts are a fundamental tool for statistical process regulation. Their objective is to monitor process characteristics, and signal that anomalies occur in the process. There are different types of control charts, depending on the nature of the process characteristic observed. The reader became acquainted with two basic control charts: one of them monitors time development of the average value of the process characteristic, while the other records time development of the range of the same characteristic. What follows is a problem related to the total quality cost functions. PROBLEM 3 A pressing machine produces a set of 8 pressed parts at once. It cost 0,5 crowns to produce each of these parts. The factory controls the pressing by checking the whole set, and if one of the parts is faulty, the whole set is scrapped, the machine is stopped and adjusted at a cost of 70 crowns. The tolerance defined by the customer for the size of each part is 4, while the factory has its own tolerance of 10. Four hundred and eighty parts are produced every hour, and the number of working hours is 2000 per year. The control of the machine lasts 2 minutes, and it costs 10 crowns. Imprecision of the measuring equipment used during the control is not considered here. The time interval between two controls is 4 hours in average. We are to calculate the total quality costs, and determine the optimal control regime, including its contribution to the quality cost reduction. SOLUTION The parameters are: A = 8·0,5 = 4 crowns B = 10 crowns C = 70 crowns Do= 4 d = 10 no = 480 units 162. 60 480 ==z uo = 4.480 = 1920 units Inserting uo, no and Do in 11-2, we have the current quality costs of 2 2 0 2 2 10 70 4 4 4 480 1 4 ( 16) 0,356 480 1920 10 3 10 2 1920 L + = + + + + ⋅ = crowns per unit. To optimize the control, we have Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 154 a. According to 11-3 * 2 2 1920 10 10 244,9 240 4 4 o o u B d n A D ⋅ ⋅ = = = ≈ units, i.e. the control should be performed approximately every 30 minutes. b. Based on 11-4, the optimal tolerance is 2 2 2 2 * 44 3 3 70 4 10 2,57 2 4 1920 o o CD d D Au ⋅ ⋅ ⋅ = = = ≈ ⋅ . c. Cost savings: *2 * *2 * 2 2 2 2 2 2 1 ( ) 3 2 10 70 4 2 4 240 1 2 ( 16) 0,287 crowns per unit. 240 480 10 3 10 2 480 B C A D A n D L z n u d d u L + = + + + + ⋅ + = + + ⋅ + + ⋅ = The cost reduction is =− LL0 0,356 – 0,287 = 0,069 crowns per unit, that is 0,069·480·2000 = 66 240 crowns in savings per year. CONTROL TEST 11 Yes/No answers: 11.1 If a production process is controlled each time its product unit is made, the total quality costs are 2 02 Q A L s R d = + ? 11.2 The fundamental formula for quality cost evaluation is mathematically derived? 
11.3 When working with a process characteristic, we are interested in how well the characteristic clings to its target value and how it fluctuates around the target value? 11.4 The range R of a data sample is calculated as maxR x x= − ? 11.5 CC( )Rx, represents a chart for the average and a chart for the range of a process characteristic? Complete the statement: 11.6 In equation 2 02 Q A L s R d = + , 2 0s =__________. 11.7 If process control is not performed each time a product unit is made, we are interested in how often __________ and __________. 11.8 __________ __________ is a main tool for statistical process regulation. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 155 - 11.9 Two elementary control charts are: for the average and __________; and for the average and __________ __________. SOLUTIONS 11.1 yes 11.2 no 11.3 yes 11.4 no 11.5 yes 11.6 ( ) ( ) ( )[ ]2 1 2 23 2 12 2 0 ... 1 1 −−++−+− − = nn yyyyyy n s 11.7 to control, how precisely to control 11.8 control charts 11.9 range; standard deviation. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 156 - CONCLUSION This textbook has presented selected but frequently used statistical methods which include procedures applied in industry. The logic of the text was based on the fact that industrial procedures draw on statistical terms and techniques, which means, the terms and techniques have to be presented before they can be applied either in industry or other sectors of economy. Classical statistical methods include, but are not restricted to, regression, correlation analysis, hypothesis testing, time series analysis, analysis of variance, and descriptive statistics. These topics were covered in chapters 1-8, whereas the remaining chapters described the fundamentals of the design of experiments, which is closely related to regression, Taguchi’s loss functions and control charts. The structure of the text followed the well-established scheme according to which the subject matter explained is accompanied by examples, and the end of the chapter presents relevant questions. The textbook best serves as an outline of major statistical methods, providing the reader with main ideas and principles of the methods. The extent and depth of the presented topics comply with the subject matter contained in the course Statistical methods for economists, taught at the Faculty of Business Administration of the Silesian University. There are other literary resources, as well, which cover each topic of this textbook. These resources focus specifically only on some of the methods, and therefore elaborate the ideas behind the methods further, as compared to what is presented in this textbook. The reader is encouraged to examine other external scholarly texts, as well. Some of the literary sources are listed on the following page. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 157 - REFERENCES [1]ANTONY, J.: Design of Experiments for Engineers and Scientists, 8th edition, Butterworth-Heinemann, 2003, ISBN: 0-7506-4709-4. [2] BISSELL, B.: Statistical methods for SPC and TQM. 1.vyd. London: Chapman and Hall, 1994, ISBN 9780412394409. [3] BRÄNNMARK M., LANGSTRAND J., JOHANSSON S., HALVARSSON A., ABRAHAMSSON L., WINKEL J.: Researching Lean: Methodological Implications of Loose Definitions, Quality, Innovation, Prosperity, Vol. XVI/2-2012, p. 35-48, ISSN 1335-1745, DOI: 10.12776/qip.v16i2.67. [4] McClave, J., SINCICH, T.: Statistics, 12th edition, Pearson Education Ltd., 2014, ISBN: 978-1-292-02265-9. 
[5] ROY, R.K.: Design of Experiments Using Taguchi Approach, 1st edition, John Wiley and Sons, 2001, ISBN: 0-471-36101-1. [6] SIVA V.: Improvement in Product Development: Use of Back-End Data to Support Upstream Efforts of Robust Design Methodology, Quality, Innovation, Prosperity, Vol. XVI/2-2012, p. 84-102, ISSN: 1335-1745, DOI: 10.12776/qip.v16i2.65. [7] TAGUCHI G., CHOWDHURY, S., WU, Y.: Taguchi’s Quality Engineering Handbook, 1st edition, John Wiley and Sons, 2005, ISBN: 0-471-41334-8. [8] TOŠENOVSKÝ, J., NOSKIEVIČOVÁ, D.: Statistické metody pro zlepšování jakosti. 1.vyd. Ostrava: Montanex, a.s., 2001, ISBN 80-7225-040-X. [9] TOŠENOVSKÝ, J., DUDEK, M.: Základy statistického zpracování dat.1.vyd. Ostrava: VŠB, 2001, ISBN 80-248-0006-3. [10] WITTE, R.S., WITTE, J.S.: Statistics, 9th edition, John Wiley and Sons, 2010, ISBN: 978-470-39222-5. [11] ZGODAVOVÁ, K., BOBER, P.: An Innovative Approach to the integrated Management System Development: SIMPRO-IMS Web-Based Environment, Quality, Innovation, Prosperity, Vol. XVI/2-2012, p. 59-70, ISSN: 1335-1745, DOI: 10.12776/qip.v16i2.69. [12] ZGODAVOVÁ, K.: Complexity of Entities and its Metrological Implications, Proceedings of the 21st International DAAAM Symposium, p. 365-367, 2010, ISSN: 1726- 9679. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 158 APPENDIX 1 – TABLE FOR DURBIN-WATSON’S TEST Table for Durbin – Watson’s test: alpha = 1%, dL = lowel limit, dU = upper limit, n = sample size, k′ = number of model regressors without the absolute term. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 159 - Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 160 Table for Durbin – Watson’s test: alpha = 5%, dL = lower limit, dU = upper limit, n = sample size, k′ = number of model regressors without the absolute term. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 161 - Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 162 Název: Statistical Methods for Economists Autor: Ing. Filip Tošenovský, Ph.D. Vydavatel: Slezská univerzita v Opavě Obchodně podnikatelská fakulta v Karviné Určeno: studentům SU OPF Karviná Počet stran: 162 Vydání: on-line ISBN: 978-80-7510-033-7