Slezská univerzita v Opavě Obchodně podnikatelská fakulta v Karviné STATISTICAL METHODS FOR ECONOMISTS Filip Tošenovský Karviná 2014 Projekt OP VK č. CZ.1.07/2.2.00/28.0017 „Inovace studijních programů na Slezské univerzitě, Obchodně podnikatelské fakultě v Karviné“ Field: Statistics Annotation: This textbook presents to the reader important parts of statistical data analysis. The subject matter contained in the book focuses on statistical methods which constitute a standard part of scholarly materials used both at domestic and foreign universities. The methods include description of statistical characteristics, hypothesis testing, regression and correlation analysis, analysis of variance, and also other procedures abundantly used in industries for product quality control, such as design of experiments, Taguchi’s methods based on loss functions and control charts. Key words: Statistical characteristics, regression, correlation, hypothesis testing, analysis of variance, design of experiments, Taguchi’s methods. Author: Ing. Filip Tošenovský, Ph.D. Reviewers: Prof. RNDr. Josef Tošenovský, CSc., Ing. Elena Mielcová, Ph.D. ISBN 978-80-7510-033-7 - 3 - CONTENTS INTRODUCTION.................................................................................................................... 5 1 ESSENTIAL STATISTICAL TERMS, CHARACTERISTICS................................ 6 1.1 THE CASE OF A SINGLE VARIABLE ............................................................................................ 7 1.1.1 MEASURES OF CENTRAL TENDENCY ............................................................................ 7 1.1.2 MEASURES OF VARIABILITY ......................................................................................... 9 1.1.3 MEASURES OF DATA CONCENTRATION....................................................................... 11 1.1.4 GENERAL MOMENTS.................................................................................................. 12 1.2 THE CASE OF TWO VARIABLES................................................................................................ 13 2 HYPOTHESIS TESTING IN MARKETING............................................................ 18 2.1 TESTING STATISTICAL HYPOTHESES....................................................................................... 18 2.2 MARKETING STUDY ................................................................................................................ 23 2.3 MEDIAN TEST.......................................................................................................................... 31 2.4 CHI-SQUARED TESTS............................................................................................................... 32 2.4.1 TESTING A DISCRETE PROBABILITY DISTRIBUTION ..................................................... 32 2.4.2 CHI-SQUARED TEST OF INDEPENDENCE.................................................................... 34 3 REGRESSION ANALYSIS......................................................................................... 38 3.1 THE CONCEPT OF REGRESSION ANALYSIS............................................................................... 38 3.2 ESTIMATION OF REGRESSION COEFFICIENTS .......................................................................... 40 3.3 TESTING SIGNIFICANCE OF REGRESSION COEFFICIENTS ......................................................... 45 3.4 CONFIDENCE INTERVALS FOR REGRESSION COEFFICIENTS .................................................... 
46 3.5 TESTING MODEL SIGNIFICANCE.............................................................................................. 46 4 CORRELATION ANALYSIS..................................................................................... 55 4.1 CORRELATION COEFFICIENT................................................................................................... 55 4.2 CORRELATION INDEX ............................................................................................................. 58 4.3 SPEARMAN’S RANK CORRELATION COEFFICIENT................................................................... 58 4.4 MULTIVARIATE DEPENDENCE- THE CASE OF TWO VARIABLES............................................... 60 5 METHODS FOR SALES PREDICTIONS ................................................................ 67 5.1 TIME SERIES............................................................................................................................ 67 5.2 TIME SERIES MODEL DECOMPOSITION.................................................................................... 68 5.2.1 TREND...................................................................................................................... 69 5.2.2 SEASONAL COMPONENT – THE CASE OF CONSTANT SEASONALITY .............................. 73 5.2.3 PROPERTIES OF THE RANDOM COMPONENT OF A REGRESSION MODEL ...................... 76 5.2.4 DURBIN-WATSON’S TEST.......................................................................................... 76 5.3 MOVING AVERAGES................................................................................................................ 79 5.3.1 SIMPLE MOVING AVERAGES....................................................................................... 79 5.4 MAKING PREDICTIONS WITH TIME SERIES MODELS ................................................................ 81 6 ANALYSIS OF VARIANCE ....................................................................................... 86 6.1 ONE-WAY ANOVA ................................................................................................................... 86 6.1.1 ANOVA HYPOTHESES................................................................................................. 88 6.1.2 A MEASURE OF DEPENDENCE ................................................................................... 92 - 4 - 7 TWO-WAY ANOVA AND LATIN SQUARES......................................................... 95 7.1 TWO-WAY ANOVA .................................................................................................................. 95 7.1.1 EFFECT OF FACTOR A............................................................................................... 96 7.1.2 EFFECT OF FACTOR B............................................................................................... 97 7.2 THREE-WAY ANOVA (LATIN SQUARES) .................................................................................. 99 8 FULL FACTORIAL EXPERIMENTAL PLANS................................................... 107 8.1 FOUNDATIONS OF EXPERIMENTING AND ITS APPLICATIONS ................................................ 107 8.2 EXPERIMENTAL PROCEDURE ................................................................................................ 108 8.3 EFFECT OF A FACTOR AND ITS SIGNIFICANCE....................................................................... 111 8.3.1 STATISTICAL TEST OD FACTOR SIGNIFICANCE.......................................................... 
113 8.3.2 GRAPHICAL ASSESSMENT OF FACTOR SIGNIFICANCE ............................ 114 8.3.3 GRAPH OF INTERACTIONS ......................................... 115 8.4 REGRESSION MODEL OF THE 2³ EXPERIMENT ......................................... 116 9 TWO-LEVEL FRACTIONAL PLAN ......................................... 123 9.1 HALF PLANS ......................................... 124 9.2 GRAPHICAL EVALUATION OF FACTOR EFFECT ......................................... 126 10 TAGUCHI’S METHODS – LOSS FUNCTIONS ......................................... 134 10.1 DEFINITION AND PROPERTIES OF LOSS FUNCTIONS ......................................... 134 10.2 LOSS FUNCTIONS FOR DIFFERENT TYPES OF TOLERANCES ......................................... 136 11 TAGUCHI’S METHODS: TOTAL QUALITY COSTS ......................................... 146 11.1 QUALITY COST MONITORING ......................................... 146 11.2 TAGUCHI’S APPROACH – THE CASE OF 100% PROCESS CONTROL ......................................... 147 11.3 THE CASE OF PROCESS CONTROL AFTER N UNITS ......................................... 148 11.4 CONTROL CHARTS ......................................... 149 CONCLUSION ......................................... 156 REFERENCES ......................................... 157 APPENDIX 1 – TABLE FOR DURBIN-WATSON’S TEST ......................................... 158

INTRODUCTION

The presented textbook serves as a study material for the course Statistical Methods for Economists taught at the Karviná-based School of Business Administration of the Silesian University. The course, a follow-up to the course Statistics, stresses the importance of applying statistical methods in economic disciplines such as marketing, management, production planning and quality management. The textbook is divided into twelve chapters, which corresponds to the usual twelve weeks of teaching in a school term. The chapters are roughly comparable in the extent of their contents and in difficulty. The extent of each chapter corresponds to a two-hour lecture presented to full-time students at schools of economics. As part of full-time study, the lecture is accompanied by a seminar in which the explained subject matter is practised on specific numerical examples, often with computer software as well. However, part-time students of the Silesian University may also use the textbook. Part-time study is a form of study which, in the case of Statistical Methods for Economists, requires students to work regularly and persistently, to concentrate on the subject and to take an active approach to solving problems on their own. This is where the textbook should help substitute for full-time teaching and serve as a study material. Other literary resources listed at the end of the textbook may also be of additional help in this respect.
To pass the course Statistical Methods for Economists successfully, it is assumed that students have passed the course Statistics first. It is true that not everything learnt in Statistics is necessary to master Statistical Methods for Economists, because some of the subject matter presented earlier served a different purpose. Nonetheless, the ability to think accurately and logically exercised in the previous course will come in useful, and so will the ability to recognize the mathematical symbols used and a knowledge of the essentials of probability theory and statistics.

Returning to the course Statistical Methods for Economists, let us describe its contents in greater detail. A more accurate name for the subject could be Selected Statistical Methods for Economists, or even more accurately, Selected Statistical Methods of Marketing, Management and Quality Control. These are the three major areas of interest in which university students often encounter applied statistics in real life. Chapter 1 of the textbook revises elementary terms used in statistics, chapters 2-7 deal with the application of statistics in marketing and management, and chapters 8-12 are devoted to statistics in production planning and quality control. The subject matter and the related problems are studied using Excel, as long as the given problem allows Excel to find the solution. Students have already become familiar with Excel in the course Statistics.

As was mentioned at the beginning of this introduction, the text is divided into twelve chapters. Each chapter requires about four to six hours of study. However, the reward waiting at the end of the study is worth it: it is the feeling that something significant has been overcome – an obstacle that separates the world of professionals from the world of nonprofessionals. With such knowledge, one can better analyse the information we are all flooded with at present.

1 ESSENTIAL STATISTICAL TERMS, CHARACTERISTICS

All statistical methods work with certain terms. This way, both the authors of statistical theory and its users can communicate the results of their analyses in a comprehensible way. To simplify the communication, specific terms are introduced in statistics. The advantage of this procedure lies in the fact that the terms are introduced only once, but their validity is permanent for all interested parties. Also, using the simplest terms, one may construct more sophisticated terms or methods. We shall revise as well as extend some of the terms introduced in the course Statistics, and define essential statistical characteristics using these terms. The characteristics will summarize, in a convenient way, the information contained in the data under scrutiny. We note that from now on, whenever we refer to a closed interval, we shall denote the interval with square brackets "[ ]".

The main objective of statistics is to analyse data. Of course, there is a reason why data originate and are maintained: their purpose is to help analyse the form and behaviour of the statistical variable the data are related to. Examples of such a variable are the height of women in the Czech Republic, the political preference of a citizen, the gross domestic product of a country, an average dimension of a produced ball bearing, etc. We shall mainly be interested in numerical variables, which better suit the needs of mathematics.
If this is our case, the analysed variable can take on different values (otherwise it wouldn't be a variable, of course). The set of all values a variable can take on is called the population. Since a population is related to a specific variable, it is a relative term. For instance, if we are interested in the political preferences of the Czechs, the population consists of all Czech citizens, and it will usually not be available unless a census takes place and its results are made available to the public. On the other hand, if we are interested in the school results of a specific group of students who attend the course Statistical Methods for Economists, the group will represent a population which is easily within reach. Statisticians, however, more often than not do not have populations at hand, and in such cases all they can do is perform a sampling from the population, which results in a data sample from the population. There is more than one way to obtain a data sample, and there are branches of statistics whose sole purpose is to analyse various forms of data sampling. In statistics, we usually require that the sampling be random. Random sampling means that every element of the population has the same probability of being selected, the same probability of being present in the final data sample. More precisely, a random data sample of size n is a random vector $(X_1, X_2, \ldots, X_n)$ whose components $X_i$ follow the same probability distribution (population) and are statistically independent. Such a sampling is required because it possesses certain "representative" properties the theory of statistics relies on. If the data sample is available, we can analyse it with proper statistical methods, and based on this analysis we may formulate conclusions about the data structure of the population the sample came from. Such conclusions constitute what is called statistical inference.

If the population can be obtained, the only ambition of statistics might be to describe it. Methods that serve this purpose form descriptive statistics. Descriptive statistics provides us with characteristics that describe the population with a single number. These characteristics are called population characteristics. A characteristic summarizes information about the data. If the population consists of two thousand values, it is certainly better to use a single number – a characteristic – to get a rough idea about the population, rather than to list all its values. This aggregation, however, is not flawless: there must necessarily be a loss of the original information about the population, since a single number cannot reflect the entire amount of the original information. In case only a data sample is available, not the entire population, the so-called sample characteristics are used to describe the structure of the data sample. It is a common habit to denote population characteristics with Greek letters and sample characteristics with Latin letters. In this way, an order is introduced into the notation, and all users of the theory know immediately whether they work with population characteristics or their sample counterparts. We shall now introduce other terms and new characteristics for data that represent values of a single statistical variable. Later, data that contain information about two statistical variables will also be handled.
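The idea of random sampling can be illustrated with a few lines of code. The following Python sketch is only an illustration and is not part of the original text; the population values are made up. It draws a sample without replacement from a finite population so that every element has the same probability of being selected.

    import numpy as np

    # A small, made-up finite population of a numerical variable
    population = np.array([172, 165, 180, 158, 169, 175, 162, 171, 168, 177])

    # Draw a random sample of size n = 4 without replacement:
    # every element of the population has the same probability of being selected.
    rng = np.random.default_rng(seed=1)
    sample = rng.choice(population, size=4, replace=False)
    print(sample)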
1.1 THE CASE OF A SINGLE VARIABLE

Let us have a population consisting of values x1, x2, ..., xn, where n is an integer, i.e. a finite number (we shall work with data of finite size only). Let X be the variable of interest. The numbers x1, x2, ..., xn are the values which the variable can take on. If we apply random sampling to this population, we can regard the variable X as a (discrete) random variable. Although the population contains the values x1, x2, ..., xn, not all of these values must necessarily be different; some of them may repeat. In such cases, X takes on only k different values x1*, x2*, ..., xk*. The value x1* may appear in the population f1 times. The number f1 is called the absolute frequency of the value x1*. Similarly, the value x2* appears in the population f2 times, the value x3* appears f3 times, and so on, until the last value xk*, which appears in the population fk times. Apart from absolute frequencies, we also work with other types of frequencies:

a) the relative frequency of the value xl*, given by the ratio fl/n, where $n = \sum_{i=1}^{k} f_i = f_1 + f_2 + \cdots + f_k$ denotes the population size.

If we sort the values x1*, x2*, ..., xk* in ascending order, we get a sequence satisfying x(1)* ≤ x(2)* ≤ ... ≤ x(k)*. In this notation, x(1)* is the minimum of the set x1*, x2*, ..., xk*, x(2)* is the second smallest number in the set, and so on. If the value x(i)* has an absolute frequency fi*, we may introduce the following new terms:

b) the absolute cumulative frequency of x(l)*, given by the sum $\sum_{i=1}^{l} f_i^*$,
c) the relative cumulative frequency of x(l)*, given by the sum $\sum_{i=1}^{l} f_i^*/n$.

The foregoing types of frequencies can be used both for a population and for a data sample.

1.1.1 MEASURES OF CENTRAL TENDENCY

Let us have a population consisting of values x1, x2, ..., xn. The population arithmetic mean μ is one of the most important measures of central tendency. It is defined as

1-1   $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Since this mean is related to a population, it is called the population mean. If we performed a sampling of size m from the population and obtained values x̃1, x̃2, ..., x̃m, we could estimate the usually unknown population mean by the sample mean

1-2   $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} \tilde{x}_i$.

Excel: In Excel, both characteristics can be calculated using the function average(), which requires only one parameter: a reference to the area of the spreadsheet that contains the data.

If we know the population x1, x2, ..., xn contains only k different values – a value x1* exactly f1 times, a value x2* ... f2 times, etc., and finally a value xk* ... fk times – we may rewrite 1-1 as

1-3   $\mu = \frac{1}{n}\sum_{i=1}^{k} x_i^* f_i = \frac{1}{\sum_{j=1}^{k} f_j}\sum_{i=1}^{k} x_i^* f_i$.

Similarly, equation 1-2 can be rewritten, using the absolute frequencies with which the values x̃1, x̃2, ..., x̃m appear in the data sample. We may also regard 1-3 as a special case of what is called the weighted average. The weighted average of values x1*, x2*, ..., xk* with weights w1, w2, ..., wk is defined as

1-4   $\mu_w = \frac{1}{\sum_{j=1}^{k} w_j}\sum_{i=1}^{k} x_i^* w_i$.

If the sum of the weights equals one, it is clear that formulas 1-3 and 1-4 represent the same thing for $w_i = f_i/\sum_{j=1}^{k} f_j$, i = 1, 2, ..., k.
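The frequency-based formulas above are straightforward to evaluate in software. The following Python sketch is a minimal illustration with made-up values and frequencies; it computes the relative and cumulative frequencies and the weighted mean of formulas 1-3 and 1-4.

    import numpy as np

    values = np.array([2.0, 3.0, 5.0, 8.0])   # distinct values x1*, ..., xk*
    freqs  = np.array([4, 10, 3, 3])          # absolute frequencies f1, ..., fk

    n = freqs.sum()                           # population size n = f1 + ... + fk
    relative = freqs / n                      # relative frequencies fl / n

    # cumulative frequencies of the values sorted in ascending order
    order = np.argsort(values)
    abs_cum = np.cumsum(freqs[order])         # absolute cumulative frequencies
    rel_cum = abs_cum / n                     # relative cumulative frequencies

    # population mean via formula 1-3 (a weighted average with weights fi)
    mu = np.sum(values * freqs) / n
    # the same result with numpy's built-in weighted average (formula 1-4)
    mu_w = np.average(values, weights=freqs)

    print(relative, abs_cum, rel_cum, mu, mu_w)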
Another measure of central tendency is the mode x̂, which is the value with the highest absolute frequency. This definition does not guarantee uniqueness of the mode, however, so it may happen that the data contains more than one mode.

Yet another measure of central tendency is the median, denoted x̃ or x50; we also talk about the middle value. The median is generally not the same as the average (mean). For a data sample consisting of values x1, x2, ..., xn, we may calculate the median in the following steps:
1) we sort the data in ascending order to get x(1) ≤ x(2) ≤ ... ≤ x(n),
2) we calculate z = n · 0,5 + 0,5,
3) if z is an integer (this is true when n is odd), then x̃ = x(z). If z is not an integer (this happens when n is even), then x̃ = (x(z-0,5) + x(z+0,5))/2.

Excel: Excel uses the function median() to determine the median. The function requires only one parameter – a reference to the region of the spreadsheet containing the data. Uniqueness of the median is guaranteed by the definition in this case. We stress that to use the function correctly, each value of the data sample must be written out explicitly, i.e. the argument must not be a data region containing only the different values of the sample together with their respective frequencies (see Problem 1 for how to proceed in this situation).

PROBLEM 1
Let a data sample contain the number 7 with absolute frequency 234, the number 9 with absolute frequency 672 and the number 43 with absolute frequency 347. Calculate the relative frequency of each value, the arithmetic mean of the sample, the mode and the median.

SOLUTION
Since the mode is the value with the highest frequency, the mode is 9 in this case. The relative frequency of 7 equals 234/(234+672+347), the relative frequency of 9 is 672/(234+672+347) and the relative frequency of 43 equals 347/(234+672+347). The arithmetic mean is

$\frac{7 \cdot 234 + 9 \cdot 672 + 43 \cdot 347}{234 + 672 + 347} = 18{,}04.$

The sample size is 1253, an odd number. Therefore, the median is the 627th value in the sorted data sample, which is the number 9.

1.1.2 MEASURES OF VARIABILITY

Measures of central tendency summarize, in a certain sense, where on the real line the values of the observed variable X typically lie. However, by their nature they say nothing about how far the values are from one another. For these purposes, measures of variability were introduced. They describe a "typical" mutual deviation of the individual values of X. For a population x1, x2, ..., xn, the population variance σ², one of the measures of variability, is defined as

1-5   $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$.

If we only have a data sample x̃1, x̃2, ..., x̃m of size m drawn from the population, we estimate the usually unknown population variance by the sample variance s²,

1-6   $s^2 = \frac{1}{m-1}\sum_{i=1}^{m}(\tilde{x}_i - \bar{x})^2$,

where x̄ is the arithmetic mean of the values x̃1, x̃2, ..., x̃m. We note that 1-6 is the typical formula for the sample variance, but not the only one. Equation 1-5 tells us that the variance is the mean squared deviation of the individual values of X from the population mean μ.

Excel: To calculate the population variance, we use the Excel function varp(), which demands only one parameter – the data region in the spreadsheet for which the variance is to be determined. To calculate the sample variance, the function var() is used with the same argument.
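The three-step median rule, the mode and the mean can be checked on the data of Problem 1. The following Python sketch is an illustration only; it expands the value-frequency table into individual observations and reproduces the results of Problem 1 (mean 18,04, mode 9, median 9).

    import numpy as np

    values = np.array([7, 9, 43])
    freqs  = np.array([234, 672, 347])

    # expand the frequency table into individual observations
    data = np.repeat(values, freqs)
    n = data.size                                   # 1253

    mean = data.mean()                              # about 18.04
    mode = values[np.argmax(freqs)]                 # value with the highest frequency: 9

    # median via the three-step rule
    x_sorted = np.sort(data)
    z = 0.5 * n + 0.5
    if z == int(z):                                 # n odd -> z is an integer
        median = x_sorted[int(z) - 1]               # the z-th smallest value
    else:                                           # n even -> average of two middle values
        median = (x_sorted[int(z - 0.5) - 1] + x_sorted[int(z + 0.5) - 1]) / 2

    print(mean, mode, median)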
Just like in the case of the arithmetic mean, we may rewrite equation 1-5 or 1-6 in an equivalent form that works with absolute frequencies. If we know the population x1, x2, ..., xn contains only k different values – a value x1* ... f1 times, a value x2* ... f2 times, ..., a value xk* ... fk times – we may use the following formula instead of 1-5:

1-7   $\sigma^2 = \frac{1}{\sum_{j=1}^{k} f_j}\sum_{i=1}^{k}(x_i^* - \mu)^2 f_i$.

The same analogy applies to equation 1-6 if we use the absolute frequencies with which the data sample contains its different values.

Another measure of variability is the standard deviation, defined as the square root of the variance. If the underlying measure is the population variance, we talk about the population standard deviation σ. If the underlying measure is the sample variance, taking the square root gives the sample standard deviation s.

The range R, defined as R = xmax − xmin, where xmax is the highest value in the data and xmin is the lowest value in the data, also belongs among the measures of variability.

Excel: The highest number in the data may be found using the Excel function max(), while the lowest number can be obtained with the function min(). Both functions require as the parameter a reference to the section of the spreadsheet containing the data.

We finish the section on measures of variability with the coefficient of variation V. If a population is available, we define its population coefficient of variation by

1-8   $V = \frac{\sigma}{|\mu|}$.

If only a sample is within reach, the corresponding sample coefficient of variation is calculated using the sample standard deviation and the sample mean:

1-9   $V = \frac{s}{|\bar{x}|}$.

The coefficient is suitable for situations when two data groups are to be compared in terms of their variability, but each group relates to a variable with different physical units. Under these circumstances, it is meaningless to use the variance as a measure of variability, because it would be calculated in different and squared physical units, making it useless for any comparison. The higher the coefficient of variation, the higher the variability of the given data group.

PROBLEM 2
Let a population contain the number 7 with absolute frequency 234, the number 9 with frequency 672 and the number 43 with frequency 347. Calculate the population variance.

SOLUTION
We shall use formula 1-7. In the previous problem, which presented the same data group as a sample, we calculated the sample mean 18,04. The same mean appears here, but this time it is a population mean. According to 1-7, we have

$\sigma^2 = \frac{(7-18{,}04)^2 \cdot 234 + (9-18{,}04)^2 \cdot 672 + (43-18{,}04)^2 \cdot 347}{234 + 672 + 347} = 239{,}12.$

1.1.3 MEASURES OF DATA CONCENTRATION

The last category of characteristics we are about to describe consists of measures that reflect, in a sense, to what extent or how the data under scrutiny are grouped together. Two major representatives of this category are the kurtosis Ku and the skewness Sk. If a population x1, x2, ..., xn is available, the population kurtosis is defined as

1-9   $Ku = \frac{\sum_{i=1}^{n}(x_i - \mu)^4}{n\sigma^4}$.

Formula 1-9 can also be written equivalently as

1-10  $Ku = \frac{\sum_{i=1}^{k}(x_i^* - \mu)^4 f_i}{\sigma^4 \sum_{j=1}^{k} f_j}$,

provided the population x1, x2, ..., xn contains only k different values: a value x1* ... f1 times, a value x2* ... f2 times, ..., a value xk* ... fk times.
If x1, x2, ..., xn represent only a data sample, we calculate the sample kurtosis using again equation 1-9 or 1-10, but replacing the population mean μ in the corresponding equation with the sample mean x̄, and also replacing the fourth power of the population standard deviation σ⁴ with the fourth power of the sample standard deviation s⁴. It is clear from 1-9 and 1-10 that kurtosis is nonnegative. The interpretation of the characteristic is such that the higher the kurtosis, the higher the concentration of the data lying closer to the mean, compared to the data lying farther from the mean. Formulas 1-9 and 1-10 sometimes appear in an altered form, with the number 3 subtracted from them. This modification compares the kurtosis of the analysed data with the kurtosis of the normal distribution. The kurtosis of the normal distribution is known to be 3, regardless of the parameters of the distribution. This means that if the modified kurtosis of the data is positive, the frequency distribution of the analysed data has a higher kurtosis than the normal distribution. We note that such a modification of the kurtosis is not the only one.

Excel: Excel offers the function kurt() for the calculation of kurtosis. The function has only one parameter, which is a reference to the area of the Excel spreadsheet containing the analysed data. However, the function calculates yet another modification of kurtosis, which is not identical to the most often used definitions 1-9 or 1-10. Nonetheless, the Excel modification can still be used for the comparison of two data groups in terms of their kurtosis, and its interpretation remains the same.

The population skewness is defined as

1-11  $Sk = \frac{\sum_{i=1}^{n}(x_i - \mu)^3}{n\sigma^3}$,

or equivalently as

1-12  $Sk = \frac{\sum_{i=1}^{k}(x_i^* - \mu)^3 f_i}{\sigma^3 \sum_{j=1}^{k} f_j}$,

if the population x1, x2, ..., xn contains only k different values: a value x1* ... f1 times, a value x2* ... f2 times, ..., a value xk* ... fk times. If we were to calculate the sample skewness, the note made in the case of kurtosis applies here as well: we use formulas 1-11 or 1-12 again, with the population mean replaced by the sample mean and the third power of the population standard deviation replaced by the third power of its sample counterpart s³.

As is clear from the defining formulas, skewness can take on any real value. If the skewness is zero, the frequency distribution of the data is symmetric; more intuitively, the concentration of smaller values is the same as that of higher values. If the skewness is positive, we say the frequency distribution of the data is skewed to the right, and the concentration of smaller values is stronger than the concentration of higher values. Finally, if the skewness is negative, we say the data distribution is skewed to the left, and the concentration of higher values is stronger than the concentration of smaller values. In the case of nonzero skewness, the frequency distribution of the data is said to be asymmetric.

The frequency distribution is often portrayed by a two-dimensional graph. The horizontal axis of the graph describes the different values x1*, x2*, ..., xk* appearing in the data group, whereas the vertical axis measures the frequencies with which x1*, x2*, ..., xk* are contained in the data group.
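The frequency-based formulas 1-7, 1-10 and 1-12 can be evaluated directly from the value-frequency representation of a population. The Python sketch below is illustrative only; it uses the data of Problem 2 and returns the population variance (about 239,12) together with the population skewness and kurtosis of the same data.

    import numpy as np

    values = np.array([7.0, 9.0, 43.0])       # distinct values of the population
    freqs  = np.array([234, 672, 347])         # their absolute frequencies

    n  = freqs.sum()
    mu = np.sum(values * freqs) / n                        # population mean (formula 1-3)

    dev = values - mu
    var   = np.sum(dev**2 * freqs) / n                     # population variance 1-7 (about 239.12)
    sigma = np.sqrt(var)                                   # population standard deviation

    skew = np.sum(dev**3 * freqs) / (sigma**3 * n)         # population skewness 1-12
    kurt = np.sum(dev**4 * freqs) / (sigma**4 * n)         # population kurtosis 1-10

    print(var, sigma, skew, kurt)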
PROBLEM 3
A population contains the following values: 111 with absolute frequency 500, 222 with absolute frequency 400, 333 with absolute frequency 600 and 444 with absolute frequency 300. Calculate the population skewness.

SOLUTION
The population size is 1800. The population mean is 265,166 and the population variance equals 13880,14. Thus, the third power of the population standard deviation is 1635275. According to 1-12, we get

$Sk = \frac{(111 - 265{,}16)^3 \cdot 500 + \cdots + (444 - 265{,}16)^3 \cdot 300}{1635275 \cdot 1800} = 0{,}01.$

We may conclude that the frequency distribution is almost perfectly symmetric in this case.

1.1.4 GENERAL MOMENTS

General moments are characteristics that look at data structures from a different angle. There are several reasons why we work with general moments. One reason is that frequency distributions and moments are related to each other uniquely under certain conditions: data groups with the same moments have the same frequency distributions and vice versa. What we are interested in, however, relates to another reason why we work with the moments: some of the foregoing characteristics can be calculated in a more elegant way using these moments. For a population x1, x2, ..., xn, we define the k-th general moment Mk by

1-13  $M_k = \frac{1}{n}\sum_{i=1}^{n} x_i^k$,  k = 1, 2, ...

Thus, the k-th moment is nothing but the average of the k-th powers of the original data. If the data group contains only m different values xi*, i = 1, 2, ..., m, with frequencies fi, we may also evaluate 1-13 with the formula

$M_k = \frac{1}{n}\sum_{i=1}^{m} (x_i^*)^k f_i$,  k = 1, 2, ...

Now, the following relations hold true:

1-14  $M_1 = \mu$,  $M_2 - M_1^2 = \sigma^2$,  $\sigma^{-3}(M_3 - 3M_1 M_2 + 2M_1^3) = Sk$,  $\sigma^{-4}(M_4 - 4M_3 M_1 + 6M_2 M_1^2 - 3M_1^4) = Ku$.

PROBLEM 4
A group of data D contains the value 11 with absolute frequency 4235 and the value 254 with absolute frequency 6543. Calculate the first two general moments.

SOLUTION
According to 1-13, we have

$M_1 = (4235 + 6543)^{-1}(11 \cdot 4235 + 254 \cdot 6543) = 158{,}518$,
$M_2 = (4235 + 6543)^{-1}(11^2 \cdot 4235 + 254^2 \cdot 6543) = 39213{,}27$.

1.2 THE CASE OF TWO VARIABLES

If we have a data group such that for each integer i = 1, 2, ..., m and j = 1, 2, ..., n it contains a pair of values (xi, yj), or even more pairs with these two values, we are working with a data group of two statistical variables. The absolute frequency of the pair (xi, yj) is called the joint frequency of (xi, yj) and is denoted fij. The size of the data group is $r = \sum_{i,j} f_{ij}$. The distribution of the joint frequencies in the data group is described by a two-dimensional table called a contingency table (see table 1). The heading of the table contains the different categories of each variable, and the body of the table contains the joint frequencies with which the combinations of the categories occur in the data group. A variable y, for instance, may represent family status, while a second variable, say x, can describe attained educational level. The joint frequency f11, for example, will then describe the number of individuals who attained the level of education x1 and at the same time have the family status y1. A similar statement holds true for the other joint frequencies. The last column of the table is usually reserved for the sum of the joint frequencies that lie in the same row. The sum is denoted fi. if we work with the i-th row of the table. The last row of the table is reserved for the sum of the joint frequencies that lie in the same column.
This sum is denoted f.j if we talk about the j-th column of the table. These sums are called marginal frequencies.

Table 1: Contingency table

          y1    y2    ...   yn
  x1      f11   f12   ...   f1n   f1.
  x2      f21   f22   ...   f2n   f2.
  ...     ...   ...   ...   ...   ...
  xm      fm1   fm2   ...   fmn   fm.
          f.1   f.2   ...   f.n   r

Source: author's

If we assume that the table represents an entire population, we can calculate the basic characteristics of the two variables, using the symbols introduced for the different types of frequencies, i.e. we can calculate the population means and population variances with the following formulas:

1. Population means

$\mu_X = \frac{1}{r}\sum_i \sum_j x_i f_{ij}$,   $\mu_Y = \frac{1}{r}\sum_j \sum_i y_j f_{ij}$.

2. Population variances

$\sigma_X^2 = \frac{1}{r}\sum_i \sum_j (x_i - \mu_X)^2 f_{ij}$,   $\sigma_Y^2 = \frac{1}{r}\sum_j \sum_i (y_j - \mu_Y)^2 f_{ij}$.

On the other hand, if the table represented only a data sample, we would calculate the sample means and variances according to the formulas:

1. Sample means

$\bar{x} = \frac{1}{r}\sum_i \sum_j x_i f_{ij}$,   $\bar{y} = \frac{1}{r}\sum_j \sum_i y_j f_{ij}$.

2. Sample variances

$s_X^2 = \frac{1}{r-1}\sum_i \sum_j (x_i - \bar{x})^2 f_{ij}$,   $s_Y^2 = \frac{1}{r-1}\sum_j \sum_i (y_j - \bar{y})^2 f_{ij}$.

When working with two variables whose frequencies are given by the aforementioned contingency table, we also define another important characteristic called covariance. The population covariance of variables X and Y, denoted cov(X,Y), is defined by

1-15  $\mathrm{cov}(X,Y) = \frac{1}{r}\sum_i \sum_j (x_i - \mu_X)(y_j - \mu_Y) f_{ij} = \frac{1}{r}\sum_i \sum_j x_i y_j f_{ij} - \mu_X \mu_Y$.

If we work with a data sample of size r, r ≥ 2, we also define the sample covariance:

1-16  $c_{XY} = \frac{1}{r-1}\sum_i \sum_j (x_i - \bar{x})(y_j - \bar{y}) f_{ij}$.

PROBLEM 5
A variable X can take on the values 3, 5, 4, 6, 7 and 9. For these values, values of another variable Y were measured: 1, 2, 7, 9, 11 and 13, respectively, i.e. 3 corresponds to 1, 5 corresponds to 2, etc. The absolute frequency of each pair of values is one. Calculate the population covariance.

SOLUTION
We use 1-15, setting all the frequencies equal to one. The mean of X is 5,66, the mean of Y is 7,16. We get

$\mathrm{cov}(X,Y) = \frac{(3-5{,}66)(1-7{,}16) + \cdots + (9-5{,}66)(13-7{,}16)}{6} = 7{,}55.$

Covariance is used to describe a mutual dependence of variables X and Y, the dependence taking the form of a line, i.e. we deal with a simple linear dependence. If the covariance is positive, we can say that a dependence in the form of a line exists between the two variables to a certain extent. The dependence is such that if one of the variables rises in value, the other rises to a certain extent as well. On the contrary, if the covariance is negative, it signals the existence of a co-movement of the two variables in opposite directions: if one of the variables rises in value, the other drops to an extent. In both cases, the movement of the other variable is to an extent proportionate to the change of the first variable. Zero covariance suggests there is no linear dependence between the variables. As we can see, it is the sign of the covariance that matters. The value of the covariance is further transformed so that the transformed characteristic falls into the closed interval [-1,1]. The transformation takes place to get a characteristic called the paired correlation coefficient, which gives us a better interpretation of how strong the linear dependence between the two variables is. If we work with a population, we obtain the population paired correlation coefficient through this transformation.
If we work with a sample, the result is called the sample paired correlation coefficient. The population paired correlation coefficient is of the form

1-17  $\rho = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}$,

where σX is the population standard deviation of X and σY is the population standard deviation of Y. The sample paired correlation coefficient is given by

1-18  $r = \frac{c_{XY}}{s_X s_Y}$,

where sX is the sample standard deviation of X and sY is the sample standard deviation of Y. Both the population and the sample correlation coefficient can take on only values from the interval [-1,1]. If the population paired correlation equals one, there is an exact relation between the two variables in the form of a rising line. If, on the contrary, the correlation equals minus one, there is an exact relation between the two variables in the form of a declining line. If the population correlation equals zero, the two variables are said to be uncorrelated. There are also other types of correlations, apart from the paired correlation. We shall discuss them in chapter 5.

CONTROL TEST 1
The following questions concern a data group which contains different values of a variable X together with their corresponding absolute frequencies.

  Values of X   Absolute frequencies
  23            2345
  34            6213
  33            456
  35            8876
  37            12134
  31            5436
  16            445

Source: author's

a. Calculate the arithmetic mean, median and mode of X.
b. If the table represents a population, what do the variance, standard deviation, coefficient of variation and range look like? What would these characteristics look like if the table represented a sample?
c. Calculate the first two general moments of X.
d. The following table contains a data sample on two variables.

  X   3   4   5   1   6   7   8
  Y   5   3   4   6   7   8   9

Source: author's

Determine the sample covariance.
e. Estimate the paired correlation between X and Y, using the table above.

Complete the statements:
f. The median and the mean are measures of ............... .
g. The standard deviation and the range are measures of ............... .
h. Skewness and kurtosis are measures of ............... .
i. The coefficient of paired correlation takes on values from the interval ............... .

SOLUTIONS
a. Using 1-3, the mean is 33,85. There are 35905 values available, which is an odd number. Therefore, the median equals the 17953rd value in the data group sorted in ascending order. The following table contains the sorted data:

  Values   Frequencies
  16       445
  23       2345
  31       5436
  33       456
  34       6213
  35       8876
  37       12134

Source: author's

The numbers 16 to 34 form a data subgroup of size 14895. The numbers 16 to 35 make up a data subgroup of size 23771. Thus, the median is obviously equal to 35. The mode is 37 in this problem.
b. The population variance is 16,56 according to 1-7. Its square root equals 4,07, which is the population standard deviation. The range is 37 - 16 = 21. The population coefficient of variation equals 4,07/33,85 = 0,12. The sample variance is (35905/35904)·16,56 = 16,56. Given the relatively large size of the data group, it is the same number as the population variance if the result is rounded to two decimal places; the calculation follows from the relation between the two variances. The sample standard deviation is 4,07 as well. The remaining sample characteristics will therefore be more or less the same (if we use the rounded numbers; precisely speaking, they are not exactly the same).
c. The first general moment = the mean = 33,85. The second general moment = 1162,56.
d. The sample covariance = 3,166.
e. The sample correlation = 0,608.
f. Central tendency.
g. Variability.
h. Data concentration.
i. [-1, 1].

2 HYPOTHESIS TESTING IN MARKETING

The second chapter covers hypothesis testing, one of the most important techniques in statistics. The first part of the chapter revises some fundamental principles of hypothesis testing, part of which was already covered in the course Statistics. Another section of the chapter describes statistical tests which could be considered elementary, because this is how they are treated in many other scholarly texts. We shall also describe some other tests which particularly suit the needs of marketing. The technique of hypothesis testing is explained in greater detail in the accompanying examples.

2.1 TESTING STATISTICAL HYPOTHESES

Statistical hypotheses constitute only a part of all scientific hypotheses. They are related to random variables, and we divide the set of such hypotheses into a subset of parametric hypotheses and a subset of nonparametric hypotheses. Parametric hypotheses deal with the parameters of the probability distribution of a random variable (or of the population of an observed statistical variable). Nonparametric hypotheses are not related to the parameters of such a distribution; they concern some other properties of the distribution, such as its shape, because we may be interested, for instance, in whether the behaviour of a random variable can be described properly by a binomial or a normal distribution.

Every statistical test works with two hypotheses that stand against each other: a tested hypothesis (tested statement), called the null hypothesis and denoted H0, and an alternative hypothesis, denoted H1. H1 is usually the negation of H0. What we usually have available for hypothesis testing is the result of a data sampling. Such a sampling can take the form of a marketing study or poll. Without sampling, which originates in a random way, it is not possible to perform a statistical test. Based on the result of the sampling, we decide whether to accept or reject the null hypothesis. To make such a conclusion, we calculate what is called the test criterion T, a function of the data gathered by the sampling. We also define a subset of the set of real numbers called the critical region. Different tests have different critical regions. If the test criterion T falls into the critical region, the null hypothesis is rejected; in the opposite case, the null hypothesis is accepted. The critical region is usually defined by a critical value K (or it can also be defined by a percentile). The critical value is either found in statistical tables or calculated with suitable software (Excel, for instance).

Let us note that by accepting the null hypothesis we are not proving the validity of the tested statement. Testing hypotheses does not necessarily lead to the right conclusion, which is natural, since it is a process based on the limited amount of information stored in the data sample we work with. The uncertainty in the conclusion of a statistical test is related to the significance level of the test α, a parameter chosen by whoever performs the test (we will talk about this parameter in a moment). Let us point out again that the testing is based on the randomness of the data sampling. In other words, it is based on the fact that the data to be used for the test were gathered independently of one another (independently in the statistical sense of the word).
Whether this is true or not can also be tested [10]. For convenience, let us summarize the steps that lead to acceptance of the null hypothesis or its rejection.

The general procedure of hypothesis testing
1. Formulate the null hypothesis H0 and the alternative hypothesis H1.
2. Calculate the test criterion T.
3. Find the critical value K for a given significance level α (i.e. define the critical region C).
4. Compare K and T, i.e. determine whether T ∈ C, and based on this accept or reject the hypothesis H0.

The conclusion and credibility of hypothesis testing
If T ∈ C, H0 is rejected. If T ∉ C, H0 is accepted. Since the decision whether to reject or accept the null hypothesis depends on the limited amount of information contained in the corresponding data sample, we can make a mistake of one of the following two kinds:
a. We reject a null hypothesis which actually holds true. By doing so, we make an error of the first kind. The probability that this error happens is denoted α and is called the significance level of the test.
b. We accept a null hypothesis which in fact is not true. By doing so, we make an error of the second kind, the probability of which is denoted β. The probability 1 − β is called the power of the test. It is the probability that a false null hypothesis will be correctly rejected.

The significance level α is usually set at 0,05, 0,01 or, less frequently, at 0,1. We then talk about a test at the 5%, 1% or 10% significance level, respectively. Apart from the significance level, the so-called p-values are also used in hypothesis testing. These values are often part of statistical software outputs. A p-value is the probability of reaching or exceeding the observed value of the test criterion, provided the null hypothesis holds. If the p-value is smaller than or equal to the chosen significance level, the null hypothesis is rejected. In the opposite case, the null hypothesis is accepted.

Basic statistical tests
We shall now present the standard and frequently used statistical tests. These are
(A) the one-sample t-test,
(B) the two-sample t-test with equal variances,
(C) the two-sample t-test with unequal variances,
(D) the paired t-test,
(E) the two-sample F-test of variance equality.
Each of the tests is described below by the four-step general procedure of hypothesis testing. The tests can also be performed in Excel if the Excel add-in module Data Analysis is selected. The module contains many statistical methods, including the standard statistical tests.

(A) Testing the population mean (one-sample t-test)
Let X = (X1, ..., Xn) be a random sample from a normal distribution N(µ, σ²), where the variance σ² is unknown.
1. We test the null hypothesis H0: µ = µ0 against the alternative H1: µ ≠ µ0, where µ0 is a given value.
2. The test criterion takes the form

$T = \frac{\bar{X} - \mu_0}{S}\sqrt{n}$,

where
X̄ = the sample mean calculated from the values X1, ..., Xn,
S = the sample standard deviation calculated from the values X1, ..., Xn,
µ0 = the assumed population mean which is tested by the statistician,
n = the sample size.
3. The critical value K is related to a Student's distribution with n − 1 degrees of freedom, and for a significance level α it is denoted tn-1(α). The critical value is defined as the number satisfying P(|X| ≥ tn-1(α)) = α, where X is a random variable following the Student's distribution. The critical value can either be found in statistical tables or calculated in Excel using the function TINV(α; n-1).
The critical region of the test is C = (−∞, −K] ∪ [K, +∞).
4. If |T| ≥ tn-1(α), H0 is rejected and H1 is accepted; in all other cases, H0 is accepted.

(B) Testing a difference between two population means (two-sample t-test with equal variances)
Let there be two independent random samples of sizes n1 and n2, respectively, from normal distributions N(µ1, σ1²) and N(µ2, σ2²), respectively. The variances σ1² and σ2² are unknown, but σ1² = σ2² is assumed.
1. We test the null hypothesis H0: µ1 = µ2 against the alternative H1: µ1 ≠ µ2.
2. The test criterion T takes the form

$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{(n_1-1)S_1^2 + (n_2-1)S_2^2}} \cdot \sqrt{\frac{n_1 n_2 (n_1 + n_2 - 2)}{n_1 + n_2}}$,

where X̄1 is the sample mean calculated from the data obtained from N(µ1, σ1²) and X̄2 is the sample mean calculated from the data obtained from N(µ2, σ2²). We also calculate the sample variance S1² of the first sample and the sample variance S2² of the second sample. The numbers n1 and n2 represent the sizes of the first and the second sample, respectively.
3. The critical value K is related to a Student's distribution with n1 + n2 − 2 degrees of freedom and is denoted tn1+n2-2(α) for a significance level α. The value can either be found in statistical tables or calculated in Excel with the function TINV(α; n1+n2-2).
4. If |T| ≥ tn1+n2-2(α), H0 is rejected and H1 is accepted; in the opposite case, the null hypothesis is accepted.

(C) Testing a difference between two population means (two-sample t-test with unequal variances)
Let there be two independent samples of sizes n1 and n2, respectively, from distributions N(µ1, σ1²) and N(µ2, σ2²), respectively. The sample means X̄1, X̄2 are calculated from the two samples, as well as their sample variances S1², S2², respectively. The population variances σ1² and σ2² are unknown; however, the inequality σ1² ≠ σ2² is assumed this time.
1. We test the hypothesis H0: µ1 = µ2 against the alternative H1: µ1 ≠ µ2.
2. The test criterion takes the form

$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{V_1 + V_2}}$,  where  $V_i = \frac{S_i^2}{n_i}$,  i = 1, 2.

3. The critical value K is calculated using the formula

$K = \frac{V_1 \, t_{n_1-1}(\alpha) + V_2 \, t_{n_2-1}(\alpha)}{V_1 + V_2}$,

where tn1-1(α) and tn2-1(α) are critical values of a Student's distribution with n1 − 1 and n2 − 1 degrees of freedom, respectively, both for a significance level α. The value K can be obtained using the Excel function TINV(α; n1-1) for tn1-1(α) and TINV(α; n2-1) for tn2-1(α).
4. If |T| ≥ K, H0 is rejected and H1 is accepted; in the opposite case, H0 is accepted.

(D) Paired t-test
Let X = (X1, X2, ..., Xn) be a random sample from N(µ1, σ1²), and Y = (Y1, Y2, ..., Yn) be a random sample from N(µ2, σ2²). The corresponding sample means are X̄ and Ȳ, respectively.
1. We test the hypothesis H0: µ1 = µ2 against the alternative H1: µ1 ≠ µ2.
2. The test criterion takes the form

$T = \frac{\bar{D}}{S_D}\sqrt{n}$,  where  $S_D^2 = \frac{1}{n-1}\sum_{i=1}^{n}(D_i - \bar{D})^2$,  $D_i = X_i - Y_i$,  i = 1, 2, ..., n,  $\bar{D} = \bar{X} - \bar{Y}$.

3. The critical value K = tn-1(α) is related to a Student's distribution with n − 1 degrees of freedom and a significance level α. It can be obtained with the Excel function TINV(α; n-1).
4. If |T| ≥ K, H0 is rejected and H1 is accepted; in the opposite case, H0 is accepted.

(E) Two-sample F-test of equality of variances
Let us have two independent random samples from N(µ1, σ1²) and N(µ2, σ2²), respectively, of sizes n1 and n2, respectively.
Let S1² and S2² be their respective sample variances.
1. We test the hypothesis that the population variances are the same, i.e. the hypothesis H0: σ1² = σ2², against the alternative H1: σ1² ≠ σ2².
2. The test criterion takes the form

$T = \frac{\max(S_1^2, S_2^2)}{\min(S_1^2, S_2^2)}$.

3. The critical value K = Fn1-1,n2-1(α) can be found in statistical tables of the Fisher distribution with n1 − 1 and n2 − 1 degrees of freedom, for a significance level α. Alternatively, the critical value can be calculated with the Excel function FINV(α; n1-1; n2-1).
4. If T ≥ K, H0 is rejected and H1 is accepted; in the opposite case, H0 is accepted.

As was mentioned previously, statistical hypotheses represent only a part of all scientific hypotheses – the part which concerns random variables. The corresponding tests include parametric and nonparametric tests. Parametric tests deal with the parameter(s) of a given probability distribution. Nonparametric tests are concerned not with the parameters of a distribution but with other statistical properties of the distribution. It must be noted, however, that the term nonparametric test is also used in a more general sense: it is used for tests that do not have to comply with as many mathematical conditions for their use as other tests do. As we saw earlier, the t-tests, for instance, required several such conditions, including the prerequisite that the random sample come from a normal distribution. There are situations when such a prerequisite cannot be met, and the question of how to proceed then naturally arises. There are more robust statistical tests that demand only very general conditions for their justified use. This is when we talk about nonparametric tests, although such tests may also be used to check the particular form of the parameters of a probability distribution. In order not to get entangled in this terminology, we shall continue to perceive the term nonparametric test as we did up to this point, i.e. we shall view such a procedure as a method that tests distribution properties other than those related directly to the parameters of the distribution.

2.2 MARKETING STUDY

To demonstrate the tests described above and to introduce some new tests, let us draw on a marketing study named Studie. The study will be used now and then for the description of other statistical methods. The procedures that follow are accompanied by Excel calculations.

Studie
A company wants to bring out a new nonalcoholic beverage: a cola-based carbonated drink. There are three versions of the product that are to hit the market: Kafola, Kofikola and Kofolisima. A questionnaire-based poll was run, and answers from 47 respondents were gathered about the consumption of the new products. The results of the poll are contained in table 2 (the data represent the weekly consumption of the corresponding beverage in litres).
Table 2: Studie poll results

  Respondent   Sex   Age   Kafola   Kofikola   Kofolisima
  1            m     20    1,1      0,7        0,5
  2            f     34    1        0,2        0,1
  3            f     43    0,8      0,1        0,2
  4            f     21    1,2      0,6        0,3
  5            m     39    1,1      0,1        0,4
  6            f     51    0,4      0          0,2
  7            m     19    0,9      0,9        0,3
  8            f     45    0,3      0,2        0,2
  9            f     48    1,2      0,1        0,4
  10           f     21    1,4      0,4        0,2
  11           f     52    0,4      0          0,3
  12           f     22    1,2      0,6        0,4
  13           m     62    0,2      0          0,2
  14           f     47    0,6      0,2        0,1
  15           m     23    0,9      0,8        0,2
  16           m     35    0,9      0,1        0,4
The test criterion takes the form

T = \frac{\bar{X} - \mu_0}{S}\,\sqrt{n},

where
X̄ = sample average,
S = sample standard deviation,
µ0 = assumption about the unknown population average; in our case, it is 0,7,
n = sample size; in our case, it is 47.
3. The critical region of the one-sided test is the interval C = [K, +∞), given by the critical value K of the Student's distribution with n−1 degrees of freedom. The critical value in this case is such a value that the probability of exceeding it equals the preset nivel of test α. Given the definition of the Student's distribution critical values, this means that K = t_{n−1}(2α). The number can be calculated using the Excel function TINV(2α; n−1).
4. If T ≥ K, H0 is rejected and H1 accepted. In the opposite case, H0 is accepted.

As can be seen, the one-sided version of the t-test is very similar to its two-sided version. The difference lies in the calculation of the critical value and in the conclusion of the test. In our case, if we formulate the null hypothesis H0: µ < 0,7 against the alternative H1: µ ≥ 0,7, the test criterion results in the same number, of course, but the critical value at the five per cent nivel of test equals TINV(2·0,05; 46) = 1,68, and we again reject the tested hypothesis.

We could also ask for what nivel of test the null hypothesis would be accepted in the one-sided version of the test. In our case, we accept the null hypothesis if and only if T < K. To answer the question, it is convenient to use the Excel function TDIST(K; n−1; tails). The function returns the value α satisfying P(|X| ≥ K) = α, where X follows a t-distribution with n−1 degrees of freedom, if the argument „tails" is set to 2; alternatively, the function returns α satisfying P(X ≥ K) = α if the argument „tails" is set to 1. In our case, if we substitute T for K and use the function TDIST, we find that the probability of exceeding T equals TDIST(2,239; 46; 1) = 0,015. Thus, if T < K is to hold, the critical value K must be such that the probability of exceeding it is smaller than 0,015. This probability, however, is the nivel of test. The conclusion for the one-sided version of the test is therefore that any nivel of test smaller than 0,015 leads to acceptance of the null hypothesis.

PROBLEM 2 (two-sample t-test with equal variances)
Let us demonstrate how to perform the two-sample t-test with equal variances. The equality of variances is assumed to hold true for the moment, although this should be tested as well, using another statistical test (we will talk about that test later). Our objective now is to find out, using Studie, whether the average consumption of Kofikola is the same as that of Kofolisima. The nivel of test is five per cent. The test criterion satisfies

T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{(n_1-1)S_1^2 + (n_2-1)S_2^2}} \cdot \sqrt{\frac{n_1 n_2 (n_1+n_2-2)}{n_1+n_2}},

from which it is obvious that we need to calculate the sample averages in both data samples (the column „Kofikola" represents one sample in this case, the column „Kofolisima" the other sample), as well as the sample variances and the sample sizes. These characteristics are contained in table 3. Again, the calculation of the characteristics can be carried out using the corresponding defining formulas or the Excel functions.

Table 3: entry characteristics for the two-sample t-test
               Kofikola      Kofolisima
average        0,40851064    0,24042553
variance       0,14123034    0,02289547
sample size    47            47
Source: author's
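Before evaluating the criterion by hand, note that the whole calculation can also be reproduced outside Excel. The following minimal Python sketch (assuming the scipy library is available; the variable names are ours) works directly with the summary statistics of table 3:

import math
from scipy.stats import t

n1, n2 = 47, 47
mean1, mean2 = 0.40851064, 0.24042553     # sample averages from table 3
var1, var2 = 0.14123034, 0.02289547       # sample variances from table 3

# pooled two-sample t-test criterion, as defined above
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
T = (mean1 - mean2) / math.sqrt(pooled_var) * math.sqrt(n1 * n2 / (n1 + n2))

# two-sided critical value, the analogue of TINV(0,05; n1+n2-2)
K = t.ppf(1 - 0.05 / 2, n1 + n2 - 2)

print(T, K)   # roughly 2.84 and 1.99

The printed values should agree with the hand calculation and the Excel output that follow.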
Thus, using the data from table 3, the test criterion is T = 2,844. The critical value is K = TINV(0,05; 47+47−2) = 1,986. Therefore, the hypothesis of equal population means is rejected, the nivel of test being five per cent.

Excel: The test can be performed using the Excel analytical tools located at Data/Data Analysis or Tools/Data Analysis. If Excel is installed for the first time, it may happen that the Data Analysis module is hidden in the program. In such cases, the module must be activated through Tools/Add-ins (older Excel versions) or through File/Options/Add-ins (newer Excel versions). The module contains nineteen statistical methods, of which five are related to statistical tests. The main advantage of the module is that the corresponding formulas do not have to be constructed and calculated; everything is done automatically by the module itself. Each test is integrated into a single dialogue window, and its results are presented in a unified table.

If we run the Data Analysis module, a dialogue window with various statistical methods pops up. We select the two-sample t-test with equal variances and confirm the option. In the subsequent window that Excel offers, we place the cursor in the sub-window Sample 1 and designate the area in the spreadsheet containing the data of interest – in our case, the data on the Kofikola consumption, for instance. Similarly, the Sample 2 sub-window will contain a reference to the area of the Excel spreadsheet containing the data on the Kofolisima consumption. We keep the default alpha at 0,05 as well as the default output location offered by the dialogue window. After confirming these options, Excel generates table 4 with all the necessary results.

Table 4: Excel output for the two-sample t-test with equal variances
              sample 1       sample 2
Mean          0,408510638    0,240425532
Variance      0,141230342    0,022895467
Sample size   47             47
:             :              :
t Stat        2,84439379
P(T<=t) (1)   0,002741733
t krit (1)    1,661585397
P(T<=t) (2)   0,005483465
t krit (2)    1,986086317

The table contains the characteristics necessary for carrying out the test, and also the test criterion t Stat and the critical value of the two-sided test t krit (2). Both values confirm that our previous calculations were correct. The conclusion therefore remains the same.

PROBLEM 3 (F-test)
If the two-sample t-test with equal variances is to be credible, we must confirm whether the assumption of equal population variances is correct. We shall now test the assumption at the one per cent nivel of test. The test criterion related to this test is of the form

T = \frac{\max(S_1^2, S_2^2)}{\min(S_1^2, S_2^2)}.

It is a ratio of two sample variances. In our case, the sample variance for the weekly consumption of Kofikola is S₁² = 0,141, and the sample variance for Kofolisima is S₂² = 0,0228. Therefore, T = 6,168. The critical value of the test is K = FINV(0,01; 47−1; 47−1) = 2. This means the null hypothesis is rejected even at a very small nivel of test.

Excel: The same test can be realized using the Data Analysis module in Excel if we select the F-test of equal variances option in the module.
In order for this procedure to give the same result, it is necessary that the data sample with the higher sample variance be used as Sample 1 in the dialogue window that follows the confirmation of the F-test option in the module. In our example, the higher variance relates to the data sample on Kofikola: S₁² = 0,141. Therefore, the Sample 2 option in the dialogue window will contain the reference to the data on Kofolisima. The nivel of test alpha is 0,05 by default; we shall reset it to 0,01 for our purposes. Confirming the options, Excel returns the results in the form of table 5.

Table 5: Excel output on the F-test of equal variances
F-test of equal variances
               Sample 1       Sample 2
Mean           0,408510638    0,240425532
Variance       0,141230342    0,022895467
Observations   47             47
Difference     46             46
F              6,168484848
P(F<=f) (1)    3,7416E-09
F krit (1)     2,006833595

Here, F stands for the test criterion and F krit (1) stands for the critical value of the test: the real number such that the probability of its being exceeded by the test criterion is 0,01. In this case, the critical value equals 2,007.

As we can see, the procedure used in the previous problem, where we worked with the two-sample t-test with equal variances, was not appropriate, since it was based on the assumption of equal variances, and this assumption has just been rejected. The appropriate procedure is to use the two-sample t-test with unequal variances, as demonstrated in the following problem.

PROBLEM 4 (two-sample t-test with unequal variances)
We said that in this case the test criterion takes the form

T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{V_1 + V_2}},

where Vᵢ = Sᵢ²/nᵢ, i = 1, 2, Sᵢ² = sample variance of the i-th sample and nᵢ = size of the i-th sample. Applying these formulas to our case of nonalcoholic beverages, we get

T = \frac{0,408 - 0,24}{\sqrt{0,141/47 + 0,0228/47}} = 2,85.

The critical value of the test is

K = \frac{V_1\, t_{n_1-1}(\alpha) + V_2\, t_{n_2-1}(\alpha)}{V_1 + V_2} = \frac{0,003 \cdot 2,013 + 0,00048 \cdot 2,013}{0,003 + 0,00048} = 2,013.

The conclusion of the test is that we reject the null hypothesis of equal means.

Excel: Using the Data Analysis module in Excel, we can select the two-sample t-test with unequal variances option. Upon confirming the selection, we fill in the required information in the corresponding dialogue window: we insert references to the areas of the Excel spreadsheet containing the data on Sample 1 and Sample 2, just as in the case of the two-sample t-test with equal variances. Also, we set the nivel of test alpha (we leave it at five per cent here). The output of the Excel calculations is contained in table 6.

Table 6: Excel output of the two-sample t-test with unequal variances
Two-sample t-test with unequal variances
               Sample 1       Sample 2
Mean           0,408510638    0,240425532
Variance       0,141230342    0,022895467
Observations   47             47
:              :              :
t Stat         2,84439379
P(T<=t) (1)    0,00302417
t krit (1)     1,670219484
P(T<=t) (2)    0,006048339
t krit (2)     1,999623585

Let us comment on the results of table 6: t Stat represents the test criterion, t krit (2) is the critical value. It is necessary to note that, as in the case of several fundamental statistical characteristics (skewness and kurtosis, in particular), Excel carries out some calculations differently from how they are performed in rigorous statistical texts. More precisely, the critical value of this test can be calculated in more than one way. Different procedures approximate the degrees of freedom of the Student's distribution related to this test in different ways. Therefore, it is quite likely that the critical value for this test provided by Excel will differ from the one defined at the beginning of this chapter. Nonetheless, despite the difference in the critical value, the conclusion to our problem remains the same. Also, as demonstrated above, the difference in the critical values is certainly not severe.
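To illustrate the point, the unequal-variance criterion and an approximate critical value can also be reproduced outside Excel. A minimal Python sketch, assuming the scipy library is available and working only with the summary statistics of table 3, might look as follows:

import math
from scipy.stats import t, ttest_ind_from_stats

n1 = n2 = 47
mean1, mean2 = 0.40851064, 0.24042553    # sample averages (table 3)
var1, var2 = 0.14123034, 0.02289547      # sample variances (table 3)

# test criterion of the two-sample t-test with unequal variances
V1, V2 = var1 / n1, var2 / n2
T = (mean1 - mean2) / math.sqrt(V1 + V2)

# Welch-Satterthwaite approximation of the degrees of freedom,
# which is essentially what Excel uses for its critical value t krit (2)
df = (V1 + V2) ** 2 / (V1 ** 2 / (n1 - 1) + V2 ** 2 / (n2 - 1))
K_welch = t.ppf(1 - 0.05 / 2, df)

# the same criterion via the library routine (equal_var=False)
T_lib, p_value = ttest_ind_from_stats(mean1, math.sqrt(var1), n1,
                                      mean2, math.sqrt(var2), n2,
                                      equal_var=False)

print(T, T_lib)      # both approximately 2.84
print(df, K_welch)   # roughly 60 degrees of freedom and a critical value close to 2.00

The critical value obtained this way is close to the t krit (2) value reported in table 6, whereas the formula introduced at the beginning of the chapter gives 2,013; as noted above, the difference has no effect on the conclusion.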
PROBLEM 5 (paired test)
We emphasize again that while applying the two-sample t-tests with equal or unequal variances, we assume, among other things, that the two samples are independent of each other. This is a very important prerequisite. If it is not met, it is better to apply the paired test. If the condition of independence is met and the analyst applies the paired test instead of the two-sample t-test, nothing serious happens. However, such a procedure is not optimal, because the paired test requires two samples of the same size, as opposed to what is required by the two-sample t-test. Thus, if the analyst wants to use the paired test, he or she may have to throw away some of the data if the samples to be worked with are of different sizes. On the contrary, if the situation requires the paired test because of the conditions necessary for this test, and the analyst uses the two-sample t-test instead, it is a serious mistake, and the conclusions based on such a procedure will be completely unreliable.

We shall now use the paired test to find out whether the average weekly consumption of Kofikola is the same as the average weekly consumption of Kofolisima. First, let us subtract the consumption values that lie in the same row of table 2, in the columns Kofikola and Kofolisima. Doing so, we get the differences Dᵢ = Xᵢ − Yᵢ, where Xᵢ is the consumption of Kofikola from the i-th row of table 2 and the column Kofikola, and Yᵢ is the consumption of Kofolisima from the i-th row of the same table and the column Kofolisima. The average difference equals D̄ = X̄ − Ȳ = 0,408 − 0,24 = 0,168. Secondly, let us calculate the sample standard deviation of the differences

S_D = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(D_i - \bar{D})^2} = 0,4076.

Here, n = 47 = sample size = number of rows in table 2. The test criterion of the paired test is

T = \frac{\bar{D}}{S_D}\,\sqrt{n} = (0,168/0,4076)\cdot\sqrt{47} = 2,826.

The critical value is K = t_{n−1}(α) = TINV(0,05; 47−1) = 2,013. Since T ≥ K, H0 is rejected.

Excel: Selecting the Data/Data Analysis/Two-sample paired t-test option in Excel, and subsequently using the Sample 1 and Sample 2 options for references to the Excel spreadsheet areas with the data on Kofikola and Kofolisima, we get the following output (table 7).

Table 7: Excel output for the paired test
Two-sample paired test of means
                           Sample 1        Sample 2
Averages                   0,408510638     0,240425532
Variances                  0,141230342     0,022895467
Observations               47              47
Pears. correlation         -0,017650836
Hyp. difference of means   0
t Stat                     2,827157048
P(T<=t) (1)                0,003464955
t krit (1)                 1,678660414
P(T<=t) (2)                0,006929909
t krit (2)                 2,012895599

t Stat corresponds to the test criterion of the paired test, t krit (2) is the critical value of the test. As the table confirms, the Excel calculations correspond to the calculations we made by hand. In the next section of the chapter, we shall present some other statistical tests.
We will describe their purpose and use the study material Studie to show how to proceed in each case.

2.3 MEDIAN TEST

This is one of the tests that do not require many conditions for their use. As the name suggests, the median test tries to confirm or reject a hypothesis about the median of a probability distribution. If the distribution possesses the property that its median equals its mean, the test can be regarded as an alternative to the one-sample t-test. The only condition that must be met for the median test to be justified is the requirement that the respective data sample be drawn from a population with a continuous probability distribution. Thus, normality is not required here, unlike in the case of the one-sample t-test. Let us denote the unknown median as µ̃ and the size of the data sample used to perform the test as n. We assume that n is large enough, since the precision of the test we are about to describe improves as n increases.
1. We test H0: µ̃ = µ̃0 versus H1: µ̃ ≠ µ̃0. Here, µ̃0 is a specific value defined by the statistician.
2. The test criterion is

T = \frac{|2m - n|}{\sqrt{n}},

where m is the number of observations in the data sample which are smaller than µ̃0.
3. The critical value is K = z1−α/2, where z1−α/2 is the critical value of the standard normal distribution N(0,1) for a nivel of test α, i.e. it is the real number such that the probability that a standard normal variable does not exceed it equals 1−α/2. The critical value can be found either in statistical tables or using the Excel function NORMSINV(1−α/2).
4. If T ≥ K, H0 is rejected. In the opposite case, H0 is accepted.

PROBLEM 6 (Median test)
Let us test the hypothesis that the average age of the cola-based drink consumer is 33 years (33 is supposed to be the median age). The nivel of test is five per cent. The data available for the test are contained in Studie. Looking at the data, we see that in 20 out of all 47 cases the age of the consumer is below 33. Therefore,

T = \frac{|2 \cdot 20 - 47|}{\sqrt{47}} = 1,02.

The critical value is K = z1−α/2 = NORMSINV(1−0,05/2) = 1,96. This leads to the conclusion that the null hypothesis is accepted.

2.4 CHI-SQUARED TESTS

The last category of statistical tests we are going to deal with is related to chi-squared tests. We shall demonstrate two of these tests, as they are frequently used in social surveys. The first test focuses on the type of the probability distribution from which the data sample used for the test came; the second test verifies the hypothesis of statistical independence of two random variables. Since a chi-squared distribution is exploited in both tests, it is clear where the name of the tests originated.

2.4.1 TESTING A DISCRETE PROBABILITY DISTRIBUTION

As is well known from mathematical statistics, the most frequently used probability distributions are of the discrete or continuous type. The chi-squared test can be used in both situations. For simplicity, we shall work with discrete distributions only. What follows is a theoretical description of the test and an example that demonstrates its purpose.

Let X be an observed variable (not necessarily a numerical variable). The type of beverage consumed may serve as an example of such a variable. Let us assume that there are k different categories of the variable: X1, X2, …, Xk (k different types of beverages, for instance). Let pi represent the relative frequency of occurrence of the i-th category Xi in the population.
If we sample data randomly from this population and write down the absolute frequencies with which the different categories X1, X2, …, Xk occurred in the sample, we can view these absolute frequencies as realizations of random variables X1, X2, …, Xk. The expression

T = \sum_{i=1}^{k} \frac{(X_i - np_i)^2}{np_i}

is then a random variable itself, following approximately a chi-squared distribution with k−1 degrees of freedom. The higher the n, the more precise the approximation. The absolute frequency of the i-th category Xi in the sample is called the empirical frequency; the term npi is called the theoretical or expected frequency. This setting can be used to test a hypothesis about the pi's, i = 1, 2, …, k, concerning the variables X1, X2, …, Xk. From the statistical point of view, this means that parameters of a probability distribution (multinomial in this case) are tested. To do the test, a null hypothesis about the parameters pi is formulated, a sample is drawn from the corresponding population, and the aforementioned test criterion T is calculated. If T ≥ χ²_{k−1}(α), where χ²_{k−1}(α) is the critical value of a chi-squared distribution with k−1 degrees of freedom and a nivel of test α, the null hypothesis is rejected. In the opposite case, when T < χ²_{k−1}(α), the null hypothesis is accepted. The critical value can be found either in statistical tables or by using the Excel function CHIINV(α; k−1).

PROBLEM 7 (Chi-squared test)
To demonstrate the test, let us use the data on cola-based drinks. The population is represented by the Faculty of Business Administration of the Silesian University. We assume that only cola-based drinks are sold at the faculty. We are interested in whether it is true that all three cola-based drinks are consumed in the same amounts. Statistically speaking, this means that we test whether the variable X, a cola-based drink, follows a uniform distribution. We set the nivel of test at five per cent. Table 8 depicts the result of a random sampling – the absolute frequencies for each of the three drinks.

Table 8: Frequencies of consumed drinks
             Number of bottles
Kafola       87
Kofikola     93
Kofolisima   101
Source: author's

We have n = 87 + 93 + 101 = 281 and k = 3. The null hypothesis is H0: p1 = p2 = p3 = 1/3. The test criterion takes the form

T = \sum_{i=1}^{k} \frac{(X_i - np_i)^2}{np_i} = \frac{(87 - 281/3)^2}{281/3} + \frac{(93 - 281/3)^2}{281/3} + \frac{(101 - 281/3)^2}{281/3} = 1,053.

The critical value is K = CHIINV(0,05; 3−1) = 5,99. Therefore, we accept the null hypothesis about the equal consumption of all three drinks.

2.4.2 CHI-SQUARED TEST OF INDEPENDENCE

There is also another problem related to chi-squared testing, and a contingency table is constructed to solve it. Two variables are assumed: a variable A (sex status: male or female, for instance) and a second variable B (remuneration at work, for example). The variable A, a classification variable, exists in two forms, A1 and A2. Similarly, the variable B exists in s possible forms B1, B2, …, Bs, s ≥ 2. A contingency table (see table 9) is constructed.

Table 9: Contingency table for the chi-squared test of independence
Categories of A / B    B1     B2     B3     …     Bs     Sum
A1                     n11    n12    n13    …     n1s    n1.
A2                     n21    n22    n23    …     n2s    n2.
Sum                    n.1    n.2    n.3    …     n.s    n
Source: author's
The symbol nij stands for the number of cases when the variable A took on the value (category) Ai and, at the same time, the variable B reached the level (category) Bj. The symbol

n_{i\bullet} = \sum_{j=1}^{s} n_{ij}

expresses the number of cases when A fell in the i-th category, regardless of what category the variable B fell in, and the symbol n•j = n1j + n2j similarly represents the number of cases when B fell in the j-th category. These frequencies are called marginal frequencies. We want to know whether the two variables A and B are statistically independent.

The test procedure
1. We test, for a nivel of test α, the hypothesis H0: A and B are independent variables vs. the alternative hypothesis H1: A and B are not independent.
2. The test criterion is of the form

T = \sum_{j=1}^{s}\sum_{i=1}^{2} \frac{(n_{ij} - n_{ij}^{T})^2}{n_{ij}^{T}},

where n_{ij}^{T} = (n_{i\bullet}\cdot n_{\bullet j})/n are the theoretical (expected) frequencies. The values nij represent the empirical frequencies acquired by random sampling.
3. The critical value is K = χ²_{s−1}(α).
4. Conclusion of the test: if T ≥ K, H0 is rejected. In the opposite case, H0 is accepted.

PROBLEM 8
Let A be a variable = sex status of respondents and B = form of remuneration awarded to respondents in competitive sport events. Table 10 shows the numbers of randomly selected respondents falling into the different categories classified by the two variables.

Table 10: data on types of remuneration and sex status of respondents
Sex/Remuneration   Financial reward   Soft drink   Subtotal
Men                78                 42           120
Women              46                 34           80
Subtotal           124                76           200
Source: author's

For example, the number 78 is the number of respondents who were men and who said that they had received a financial reward for their sports achievement. The subtotals are calculated by the poll worker who is to process the data. Tables 11 and 12 contain the calculations of the theoretical frequencies and of the terms appearing in the sum of the test criterion, respectively. The symbol Eij stands for the empirical frequency and the symbol Oij denotes the corresponding theoretical frequency. The theoretical frequencies are calculated from table 10 by taking the appropriate row and column marginal frequencies (subtotals), multiplying them and dividing by the number of all respondents (which is 200). Therefore, we get, for instance, 74,4 = 120·124/200 for the first theoretical frequency, and a similar procedure applies to the other theoretical frequencies as well.

Table 11: theoretical frequencies
Oij      Fin. reward   Soft drinks
Men      74,4          45,6
Women    49,6          30,4
Source: author's

Table 12: terms for the test criterion
(Eij-Oij)^2/Oij   Fin. reward    Soft drinks
Men               0,174193548    0,284210526
Women             0,261290323    0,426315789
Source: author's

The final table 13 contains the test criterion T, the degrees of freedom df = s−1 of the test, and the critical value K for the 5% nivel of test.

Table 13: The test criterion T and the critical value K
T      1,14
alfa   0,05
df     s-1 = 2-1
K      3,84
Source: author's

T = 1,14 and the critical value K = CHIINV(0,05; 2−1) = 3,84. Since the test criterion is smaller than the critical value, we accept the hypothesis that there is no relation between the form of reward and the sex status of the rewarded.
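For readers who want to verify this computation outside Excel, the following short Python sketch (assuming the scipy library is available; the variable names are ours) reproduces the test on the data of table 10:

from scipy.stats import chi2, chi2_contingency

observed = [[78, 42], [46, 34]]           # empirical frequencies from table 10
T, p_value, df, expected = chi2_contingency(observed, correction=False)

K = chi2.ppf(1 - 0.05, df)                # critical value, the analogue of CHIINV(0,05; 1)
print(T, df, K)                           # approximately 1.15, 1 and 3.84
print(expected)                           # the theoretical frequencies of table 11

# T < K, so the hypothesis of independence is accepted

The option correction=False is used so that the function computes the plain chi-squared criterion defined above; by default it would apply a continuity correction for 2×2 tables and return a slightly smaller value.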
As a final note, the chi-squared test can be realized in Excel, using the function CHITEST(actual; expected) which has two parameters: the parameter „actual“ is a reference to the Excel spreadsheet area containing the empirical frequencies, whereas the parameter „expected“ is a reference to the spreadsheet area with the expected/theoretical frequencies. The function returns a p-value, which means that the conclusion of the test is constructed as follows: if the p-value is smaller that the nivel of test alpha (or equal), the null hypothesis is rejected; if the p-value is greater than alpha, the null hypothesis is accepted. CONTROL TEST 2 a. The data appearing in the table below represents the result of a random sampling related to a variable Y. Using the one-sample t-test, find out whether the population mean of Y is 17,8, the nivel of test being five per cent. As part of the calculations, state the value of the test criterion and the critical value. Will the conclusion of the test change if the nivel of test at ten per cent is used instead? Provide the critical value for the latter case. Y 16 15 17 18 19 14 13 Source: author’s b. Let the data on two variables Y and X is available: X 6 25 17 18 29 4 15 Y 16 15 17 18 19 14 34 Source: author’s Find out, using the F-test, whether both samples came from populations with the same variance, the nivel of test being set again at five per cent. State the test criterion and critical value as part of your calculations. Check your results against those provided by the Data Analysis module of Excel. c. Perform the two-sample t-test with equal variances and confirm or reject the hypothesis that the variables X and Y have the same population mean. d. A poll concerning cell phone trademarks that are popular among customers led to the results shown in the following table. Set the nivel of test at ten per cent, and test the validity of the hypothesis that 25% of all customers use the Mobil1 cell phones, 33% of Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 37 all the customers incline to the Mobil2 cell phones and the remaining 42% of the customers prefer the Mobil3 trademark. Again, state the test criterion and critical value. Number of users Mobil1 2340 Mobil2 3124 Mobil3 3000 Source: author’s e. Using the test of independence, verify whether severity of car crash depends on sex status of car driver. Do so with one per cent nivel of test. As part of your calculation, state the test criterion, the critical value and the test conclusion. Available are following data: Man Woman Minor accidents 134 127 Accidents of intermediate severity 254 301 Severe accidents 14 4 Source: author’s SOLUTIONS a. Test criterion = -2,2. Critical value = 2,44. The test criterion in absolute value is smaller than the critical value, implying that the null hypothesis is accepted. If the nivel of test was ten per cent, the critical value would be 1,94. In the latter case, the null hypothesis would be rejected. b. Test criterion = 1,78. Critical value = 4,28. The null hypothesis on variance equality is accepted. c. Test criterion = 0,63. Critical value = 3,05. The null hypothesis on equality of means is accepted. d. Test criterion = 43,25. Critical value = 4,6. The hypothesis on uniform distribution is rejected. e. Test criterion = 8,65. Critical value = 9,21. We accept the hypothesis that severity of car crash and car driver sex status are two independent variables. 
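If the reader wants to check the control-test calculations independently of Excel, part a, for instance, can be verified with a few lines of Python (a minimal sketch, assuming the scipy library is available):

from scipy.stats import t, ttest_1samp

y = [16, 15, 17, 18, 19, 14, 13]         # data on Y from part a
T, p_value = ttest_1samp(y, 17.8)        # one-sample t-test of H0: mu = 17,8

K_5 = t.ppf(1 - 0.05 / 2, len(y) - 1)    # critical value for the five per cent nivel of test
K_10 = t.ppf(1 - 0.10 / 2, len(y) - 1)   # critical value for the ten per cent nivel of test

print(T, K_5, K_10)                      # approximately -2.20, 2.45 and 1.94

Since |T| is smaller than K_5 but larger than K_10, the null hypothesis is accepted at the five per cent nivel of test and rejected at the ten per cent nivel of test, in agreement with the solution above.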
3 REGRESSION ANALYSIS

Regression analysis deals with the dependence of a quantitative variable on one or more quantitative variables. In the case of one variable depending on another variable, we talk about simple regression, as opposed to the case when there are more explanatory variables; in the latter case, we talk about multiple regression. In this chapter, the reader should deepen their knowledge of regression presented in the course Statistics [5], in particular as regards multiple regression. Elementary regression terms and conditions are presented at the beginning of the chapter. Further, a formula for the calculation of regression coefficients is derived, as well as a statistical test verifying the significance of the coefficients. At the end of the chapter, the statistical significance of the entire regression model is tested.

3.1 THE CONCEPT OF REGRESSION ANALYSIS

Regression analysis aims to find a mathematical relation – an equation which in a certain sense describes changes of a random variable Y dependent on changes of random variables X1, X2, …, Xk. We shall assume the standard case presented in the literature, i.e. the case when only some values of the variables X1, X2, …, Xk are known or available. These realizations of the random variables are denoted xij = the i-th value of the j-th variable Xj. The values are usually part of a controlled experiment in which the analyst defines/selects the values of X1, X2, …, Xk and then finds or measures the values of Y that correspond to them. The value of Y, measured or found for the i-th values of X1, X2, …, Xk, is denoted Yi.

To give an example, let Y = GDP, which is influenced by factors X1, X2, …, Xk. Different constellations of the factors will give different GDP values, and GDP itself remains a random variable, since it is almost certain that we will not be able to define k factors that describe it completely. Thus, GDP, and so the variable Y as well, will in general be a random variable, and its probability distribution will change as the levels of the factors X1, X2, …, Xk change. This is why we use the lower index i in the symbol Yi.

In this controlled experiment, which tries to find a concrete form of the relation between Y and the variables X1, X2, …, Xk on a specified subset of the set of all possible values of X1, X2, …, Xk, we assume that the relation takes the form

Y = f(X_1, X_2, \ldots, X_k) + \varepsilon.

Here, Y depends on the regression function f, which contains unknown parameters, and on the random term ε, which completes the description of the random behaviour of Y; the systematic part of the model, f(X1, X2, …, Xk), is not able to provide the full description of the behaviour of Y on its own. As was already outlined at the beginning of the chapter, the problem of finding an appropriate relation between the variables will be resolved for the case when the variable Y, the so-called dependent variable, depends on k independent variables, or on the vector X = (X1, …, Xk). The systematic part f(X1, X2, …, Xk) can take on different forms:

f(X_1, X_2, \ldots, X_k) = \beta_1 + \beta_2 X_1,
f(X_1, X_2, \ldots, X_k) = \beta_0 + \beta_1 X_1 + \beta_2 X_2^2,  etc.

The parameters β0, β1, β2 are unknown!
Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 39 If the systematic part satisfies ( ) ( ) ( ) ( )1 2 1 1 2 2, ,..., ...k k kf X X X f X f X f Xβ β β= + + + , we talk about linear regression (linear in parameters), or about linear regression model. We usually consider the model: 3-1 ( )1 2 1 1 2 2, ,..., ...k k kf X X X X X Xβ β β= + + + . We shall start with its simplest form in which f is a linear function of one independent variable: 3-2 ( )1 1 2 1f X Xβ β= + . Thus, we consider the relation 1 2 1 1( )Y X f Xβ β ε ε= + + = + . Our situation is depicted in figure 1. Figure 1: Regression dependence in the form of a line The graph shows that the behaviour of Y is determined by the systematic part of the model, i.e. by the function f(X1) which reflects the effect of a single variable on the variable Y. However, it does not suffice to use the function f(X1) to describe the behaviour of Y, and it is necessary to add the influence of other factors – those which are represented by the term ߝ. And what was just said is also true for any specific value of X1, of course: for a value xi1, for example. At the point xi1, equation Yi = f(xi1) + εi holds true, where f(xi1) is a specific value. Nevertheless, even though f(xi1) is a specific value, we don’t know this value because the expression f(xi1) depends on unknown parameters. We may even know the exact mathematical form of the expression (for instance, we may know that it is a line), and we still won’t be able to evaluate it. The objective of regression is to estimate the unknown parameters. To make the estimation, we need to have some data. For the estimation of a regression line, for instance, data ( ) ( ) ( )11 1 21 2 1, y , , y ,..., , yn nx x x are usually available. In other words, n points from a plane are available. The first coordinates 11 21 1, ,..., nx x x of these points are the specified values of the independent variable X1. These values are defined by the analyst, and the corresponding y’s are obtained later. In our case, we obtained a single value of Y for each value of X1. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 40 The estimate of the function ( )1 2 1 1 2 2, ,..., ...k k kf X X X X X Xβ β β= + + + , calculated for the i-th value of the variables 1 2, ,..., kX X X , is denoted as 1 1 2 2 ...i i i k ikY b x b x b x= + + + ) . If Y depends on two variables X1, X2, the points, obtained experimentally, will be of the form: (x11, x12 ,y1) (x21, x22 ,y2) ... (xn1, xn2 ,yn). These points lie in a three-dimensional space, and are interspersed with a function of the form 1 1 2 2Y b x b x= + ) . The function approximates the relation between Y and the variables X1, X2. More generally, if Y depends on k variables X2,…,Xk, we assume that the following points from a k+1-dimensional space are available: (x11, x12,…, x1k, y1) (x21, x22,…, x2k, y2) ... (xn1, xn2,…, xnk, yn) These points are interspersed with a hyperplane of the form 1 1 2 2 ... k kY b x b x b x= + + + ) . This function approximates the relation between Y and the variables X1, X2,…, Xk. To justify the procedures to be explained, which result in an estimation of the unknown regression coefficients, it is imperative that the following conditions are satisfied. The conditions are related to the random part ε of the regression model: 1. Expected value of εi is zero, i.e. E(εi) = 0 for each i. 2. Variance of εi is constant, independent of i, i.e. Var(εi) = σ2 for each i. 3. Variables εi and εj are not correlated, i.e. Cov(εi, εj) = 0 for i ≠ j. 4. 
Variables εi ’s are normally distributed, i.e. εi ∼ N(0, σ2 ) for each i. As is usually the case, expected value is denoted as E, variance is denoted as Var and covariance uses the symbol Cov. If the reader forgot the symbols, we recommend revision of the foundations of statistics contained in the course Statistics. 3.2 ESTIMATION OF REGRESSION COEFFICIENTS We assume a regression function of the form ( )1 2 0 1 1, ,..., ...k k kf X X X X Xβ β β= + + + . We determine the i-th value of each explanatory variable and obtain a vector (xi1, xi2,…, xik). We do so for i = 1, 2,…, n, so we end up with n vectors (or points). We then find or measure a particular value of Y for each vector (xi1, xi2,…, xik), i = 1, 2,…, n. Since there are n vectors, we shall obtain n values of Y. This is all we have to make the estimation. Therefore, aside from the conditions 1-4, this is what greatly affects the quality of the resulting estimates. Of course, we talk about estimates because we only work with a data sample. The vector of unknown regression parameters 0( ,..., )kβ β β= r corresponds to the set of all possible points of Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 41 the form (xi1, xi2,…, xik, yi). Estimating the unknown parameters, we end up with an estimator 0( ,..., )kb b b= r . The vector 0( ,..., )kb b b= r is obtained by 3-3 ( ) 1 . .T T T b X X X Y − = r r , where X, the matrix of regressors, satisfies 3-4 11 12 1 21 22 2 1 2 1 ... 1 ... ... ... ... ... ... 1 ... k k n n nk x x x x x x X x x x      =       , and 3-5 1 2(Y ,Y ,...,Y )T nY = r . Symbol ZT denotes the transposition of matrix Z, Z-1 means the inverse of Z. To evaluate 3-3, the following data, as mentioned previously, must be available (x11, x12,…, x1k, y1) ... (xn1, xn2,…, xnk, yn). PROBLEM 1 Estimate dependence of electricity consumption Y on power-supply distance X1 and amount of electricity supplied X2. The regression function is assumed to be of the form ( )1 2 0 1 1 2 2, .f X X X Xβ β β= + + Available are the following data: Table 14: data for problem 1 X1 X2 Y 1,2 3,6 3,2 1,3 3,7 3,3 1,3 3,8 3,4 1,4 3,8 3,5 1,4 3,9 3,6 1,5 3,9 3,6 1,5 4 3,7 1,6 4 3,8 1,6 4,1 3,9 1,7 4,2 4 Source: author’s Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 42 - SOLUTION Table 14 represents points which are used to construct the matrices X and Y: 1 1,2 3,6 1 1,3 3,7 1 1,3 3,8 1 1,4 3,8 1 1,4 3,9 1 1,5 3,9 1 1,5 4 1 1,6 4 1 1,6 4,1 1 1,7 4,2 X               =                  3,2 3,3 3,4 3,5 3,6 3,6 3,7 3,8 3,9 4 Y               =                  r We shall now calculate the vector 0 1 2 b b b b     =       r in several steps, using 3-3: 1 1,2 3,6 1 1,3 3,7 1 1,3 3,8 1 1,4 3,8 1 1 1 1 1 1 1 1 1 1 10 14,5 39 1 1,4 3,9 1,2 1,3 1,3 1,4 1,4 1,5 1,5 1,6 1,6 1,7 14,5 21,25 56 1 1,5 3,9 3,6 3,7 3,8 3,8 3,9 3,9 4 4 4,1 4,2 1 1,5 4 1 1,6 4 1 1,6 4,1 1 1,7 4,2 T X X                ⋅ = ⋅ =                    ,8 . 39 56,8 152,4          ( ) 1 245,2 108 103 108 60 50 . 103 50 45 T X X − −   ⋅ = −   − −  Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 43 - 3,2 3,3 3,4 3,5 1 1 1 1 1 1 1 1 1 1 36 3,6 1,2 1,3 1,3 1,4 1,4 1,5 1,5 1,6 1,6 1,7 52,56 . 3,6 3,6 3,7 3,8 3,8 3,9 3,9 4 4 4,1 4,2 140,82 3,7 3,8 3,9 4 T X Y                   ⋅ = ⋅ =                        Using 3-3, we now obtain: ( ) 0 1 1 2 245,2 108 103 36 0,78 108 60 50 52,56 0,60 . 
103 50 45 140,82 0,90 T T T b b X X X Y b b − − −               = ⋅ ⋅ ⋅ = − ⋅ = =               − −        r r Therefore, the resulting regression function is 1 20,78 0,60 0,90Y x x= − + + r . Theoretical values Theoretical values 1 2, ,..., nY Y Y ) ) ) are calculated from the relation 1 20,78 0,60 0,90Y x x= − + + ) by inserting specific values x1 and x2 to the equation: 1 0,78 0,60 1, 2 0,90 3,6 3,18Y = − + ⋅ + ⋅ = ) , 2 0, 78 0,60 1,3 0,90 3,7 3,33Y = − + ⋅ + ⋅ = ) , ... 10 0, 78 0, 60 1, 7 0,90 4, 2 4, 02Y = − + ⋅ + ⋅ = ) . In matrix form, the resulting values can be written as 1 1,2 3,6 3,18 1 1,3 3,7 3,33 1 1,3 3,8 3,42 1 1,4 3,8 3,48 0,78 1 1,4 3,9 3,57 0,6 1 1,5 3,9 3,63 0,9 1 1,5 4 3,72 1 1,6 4 3,78 1 1,6 4,1 3,87 1 1,7 4,2 4,02 T Y X b                        −      = ⋅ = ⋅ =                                   r) 1 2 10 ˆ ˆ . . . . . . . ˆ y y y               =                   Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 44 Vector of residuals The difference between the theoretical value and empirical value is called residual. In vector form, the difference e Y Y= − r )r represents the vector of residuals. In our example, we have: 3,2 3,18 0,02 3,3 3,33 0,03 3,4 3,42 0,02 3,5 3,48 0,02 3,6 3,57 0,03 3,6 3,63 0,03 3,7 3,72 0,02 3,9 3,78 0,02 3,9 3,87 0,03 4 4,02 0,02 e Y Y          −         −                = − = − =     −                              −     r )r 1 2 10 . . . . . . . e e e               =                        The differences are calculated by substracting the corresponding vector coordinates. Variance of the estimates of regression coefficients Since finding regression coefficients results in estimates of the unknown population coefficients, it is convenient to introduce variances of the estimates. The variances reflect the precision of the estimates, and can be found on the main diagonal of the matrix 3-6 ( ) 12 ( ) ,T Var b s X X − = ⋅ ⋅ r where 2 2 1 n i i e s n k = = − ∑ is an estimate of the variance of ߝ. Also, ie = i-th residual, n = sample size (number of points we work with), k = number of parameters in the regression model. For our example, we have: 2 2 1 0,006 0,0008571. 10 3 n i i e s n k = = = = − − ∑ Thus, ( ) . 0386,00429,00883,0 0429,00514,00926,0 0883,00926,02102,0 4550103 5060108 1031082,245 0008571,0)( 12           − − − =           −− − − ⋅=⋅⋅= − XXsbVar T r The main diagonal of the rightmost matrix contains the coefficient variances: Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 45 - s2 (b0) = 0,2102, and the standard deviation estimate s(b0) = 0,4584. s2 (b1) = 0,0514, and the standard deviation estimate s(b1) = 0,2267. s2 (b2) = 0,0386, and the standard deviation estimate s(b2) = 0,1965. When the regression model and the variances of the coefficients are estimated, the result is usually expressed in the form of an equation with the estimated standard deviations, i.e. the square roots of the estimated variances, written under the estimated coefficients that appear in the equation: 1 20, 7 8 0, 6 0 0, 9 0Y x x= − + + ) (0,4584) (0,2267) (0,1965) It may happen that the order of magnitude of the differences among the coefficients can be quite substantial. For instance, it can be the case that b1 = 200 and b2 = 0,02. 
It is then reasonable to ask the question whether there is any point in adding the small coefficient to the model. To find the answer to this question, one can use the following statistical test. 3.3 TESTING SIGNIFICANCE OF REGRESSION COEFFICIENTS The structure of the test is as follows: 1. The tested hypothesis is: H0: βi = 0, H1: βi ≠ 0. 2. The test criterion is ( ) i i b T s b = , where bi is the estimate of βi, s(bi) is the estimated standard deviation of bi . 3. The critical value K = tn-k(α) for a nivel of test α. 4. If T > K, we reject the hypothesis H0 and accept the alternative hypothesis H1 according to which βi is considered to be nonzero or statistically significant. In the opposite case, the null hypothesis is accepted, and the tested parameter is thought to be equal to zero or statistically insignificant. In our case, we have 1 1 1 0,60 2,65 ( ) 0,0514 b T s b = = = , 2 2 2 4,58 ( ) b T s b = = , Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 46 where tn-k(α) = t10-3 (0,05) = 2,365. Since T1 > 2,365 and also T2 > 2,365, both population coefficients are statistically significant, and should be considered in the model. As was said before, one needs to have a sample of points (xj1, xj2,…, xjk, yj) to find the vector of coefficients ( )kbbb ,...,1= r , the estimate of the coefficients ( )1,..., kβ β β= r . In other words, the vector of observations Y and the matrix of regressors X must be known. In practice, tha analyst has to decide how to select the values xij and how many of them should be included in the model. Answers to these questions have a major impact on the final estimates of the coefficients because the subsequent calculations are only a routine procedure. There is a statistical discipline which, among other things, aims to answer these questions. The discipline is called design of experiments. We shall analyse some fundamental designs of experiments in later chapters of this textbook. 3.4 CONFIDENCE INTERVALS FOR REGRESSION COEFFICIENTS Confidence intervals for the parameters β1,…,βk are intervals in which the parameters lie with probability 1-α. Each of these intervals is of the form 3-7 )]().(),().([ ipniipni bstbbstb αα −− +− , where bi = estimate of βi, s(bi) = estimated standard deviation of bi, tn-p(α) = critical value of a Student’s distribution, n = sample size (number of points), p = number of parameters in the model, α = nivel of test, The unknown parameter βi lies in the interval 3-7 with probability 1-α. It is necessary that the random term ߝ of the regression model is normally distributed for the interval to hold true. 3.5 TESTING MODEL SIGNIFICANCE Significance of the regression model can be verified by the following test, which again requires that the random term of the model is normally distributed. The structure of the test is as follows: 1. The hypothesis is: 0...:H 210 ==== kβββ , or, using vectors, 0:H0 rr =β . 0:H1 rr ≠β . 2. The test criterion: , )1/( )/(ˆ −− = knS kS T e Y Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 47 - where 2 ˆ 1 1 1 ( ) , n n i iY i i S Y Y Y Y n= = = − =∑ ∑ ) , 2 2 1 1 ( ) . n n e i i i i i S Y Y e = = = − =∑ ∑ ) 3. The critical value is K = )(1, α−−knkF , where )(1, α−−knkF is the critical value of a Fisher’s distribution with df1 = k a df2 = n-k-1 degrees of freedom. In Excel, one can obtain the value with the function FINV(α; df1; df2). 4. If KT ≥ , H0 is rejected. In the opposite case, H0 is accepted. In our example, we have: 5,346 )310/(006,0 )2/(594,0 = − =T , K = F2,7(0,05) = 4,73. 
Since T exceeds the critical value K, H0 is rejected and the model is viewed as satisfactory, i.e. we reject the hypothesis that all regression coefficients except the constant term β0 are zeros. SUMMARY This chapter dealt with regression analysis which aims to find a relation between a quantitative variable Y, or an explained/dependent variable, and other quantitative variables called explanatory/independent variables. In particular, we were concerned with linear regression models (linear in parameters). At the end of the chapter, different statistical tests were presented, regarding statistical significance of the regression coefficients and the model as a whole, as well as confidence intervals for the estimates of the coefficients. The text was accompanied by examples. The following terms were explained: linear regression, estimates of regression coefficients, theoretical value, residual, variance and standard deviation of regression coefficients, test of regression coefficients, test of the model, confidence intervals for regression coefficients. Some more examples follow. PROBLEM 2 a) Estimate the regression coefficients of the model 0 1 1 2 2Y b b x b x= + + ) , b) Calculate the theoretical values of Y, c) Calculate the residuals of the model, d) Calculate the variance of the coefficient estimates, e) Test significance of the coefficients. f) For the following entry values       = 5121 3111 0X , predict the dependent variable Y0. Perform the tasks a)-f), using the following data: Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 48 y x1 x2 10 1 0 25 3 -1 32 4 0 43 5 1 58 7 -1 62 8 0 67 10 -1 71 10 2 Source: author’s SOLUTION a. The estimates of the coefficients:           = 35 2710 368 YX T ,           =                     −− −− − = 26,0 59,6 47,6 35 2710 368 . 60840240 4064384 2403842887 4664 1T b r . b. The theoretical values: ( )13,06 , 25,98 , 32,83, 39,68 , 52,34 , 59,19 , 72,11, 72,89Y = ) . c. The residuals: ( )89,1,11,5,81,2,66,5,32,3,83,0,98,0,06,3 −−−−−=e r . d. The variances of the coefficient estimates: Since 65,912 =∑i ie ,           =           − −− − − = 39,2...... ...25,0... ......35,11 60840240 4064384 2403842887 4664 1 . 38 65,91 )(bVar r . e. The test of coefficients: Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 49 - 92,1 37,3 47,6 )0( 0 0 === bs b T , 18,13 5,0 59,6 1 ==T , T2 = 0,17 . K = 571,2)05,0()( 38 == −− tt pn α . Statistical significance pertains only to 1β because T1 > K. f. Two predictions of Y0 (at two different points of X): Since       = 5121 3111 0X ,       =                 = 85,86 74,79 26,0 59,6 47,6 . 5121 3111 0Y . PROBLEM 3 Find out whether production depends on corporate investments. A potential dependence is reflected by the parameter 1β in the regression function with two unknown parameters. We know, based on twelve data, that the estimate of 1β is 1 2,1622.b = We also know the standard deviation of the estimate is 1( ) 0,615516.s b = Verify or reject the existence of the dependence by testing the hypothesis: 1 0β = . The nivel of test is five per cent. SOLUTION Since 1 1 , ( ) b T s b = we get 2,1622 3,513. 0,615516 T = = The appropritate critical value, related to a Student’s distribution with 12 – 2 = 10 degrees of freedom, is 10 (0,05) 2,228t = . Since 3,513 > 2,228, we reject the null hypothesis about the zero value of the coefficient. 
Thus, the coefficient is statistically significant, and the dependence may be assumed to exist. PROBLEM 4 (HOTEL SERVICES) Find the linear regression model which describes a dependence of total monthly revenues Y (in tens of thousands of crowns) of a hotel on revenues 1X generated by the catering services Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 50 of the hotel (in tens of thousands of crowns) and on 2X which is a product of the number of beds at the hotel and the number of days in a given month. The data is in the following table. Data Y 1x 2x 12,0 2,0 150 8,0 1,2 94 76,4 14,8 811 17,0 8,3 254 21,3 8,4 399 10,0 3,0 95 12,5 4,8 149 97,3 15,6 312 88,0 16,1 952 25,0 11,5 247 38,6 14,2 400 47,3 14,0 312 Source: author’s SOLUTION The vectors for the dependent and independent variables have the following form: 12,0 1 2,0 150 8,0 1 1,2 94 76,4 1 14,8 811 17,0 1 8,3 254 21,3 1 8,4 399 10,0 1 3,0 95 , 12,5 1 4,8 149 97,3 1 15,6 312 88,0 1 16,1 952 25,0 1 11,5 247 38,6 1 14,2 400 47,3 1 14,0 312 Y X                             = =                              .                 Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 51 - Therefore ( ) ( ) 1 1 12,0 113,90 4175,0 453,4 113,9 1428,43 51958,5 , 6006,8 , 4175,0 51958,50 2266001 230647,8 0,343 0,02629 0,0003 0,026 0,006234 0,000094 , 0,00003 0,000094 0,00000266 T T T T T X X X Y X X X X X − −        = =           − −   = − −   − −  ⋅( ) 9,126450 3,729273 . 0,033091 T Y b −   = =      r The regression function is 1 29,126450 3,729273 0,033091 .Y x x= − + + ) PROBLEM 5 Test statistical significance of the coefficients from the previous example: 21,ββ . The nivel of test is five per cent. SOLUTION First of all, the standard deviations of the coefficients are calculated. Their values are: .0283,0)(,371,1)( 21 == bsbs The corresponding test criterion is )( j j bs b T = , and so .1693,1 0283,0 033091,0 ,7201,2 371,1 729273,3 21 ==== TT The critical value of the test is found for a Student’s distribution with 12 – 3 = 9 degrees of freedom and the five per cent nivel of test: .26,2)05,0(9 =t Comparing the test criterions with the critical value, we see that the null hypothesis concerning the zero value of the coefficient 1β is rejected. The case of the other parameter is different, however. The parameter 2β seems to be insignificant and the second variable should be removed from the model. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 52 CONTROL TEST 3 Yes/No answers: 3.1 Regression analysis examines dependence among quantitative variables? 3.2 Deviation of an empirical value of Y from its theoretical value, modelled by a regression function, is called residual? 3.3 Regression analysis deals only with linear functions? 3.4 The test of significance of regression coefficients requires the critical value of a normal distribution? 3.5 The null hypothesis in the test of model significance is: 0...:H 210 ≠≠≠≠ kβββ ? 3.6 The classical regression model assumes that random terms in the model have _______ expected value and __________ variance. 3.7 The test examining the zero value of an individual regression coefficient is called __________ 3.8 If the model 0 1 1 ... k kY b b x b x= + + + ) contains the term 0b , the first column of the matrix of regressors X consists of value(s) __________ 3.9 Regression analysis exploits dependence among__________ variables. 
3.10 Variances of estimated regression coefficients can be found on __________ __________ of the matrix ( ) ( ) 12 − = XXsbVar T r . 3.11 A personnel department gathered the following data on age (X) of 20 randomly selected employees and the amount of time (Y) they spent out of work due to health reasons. x y x y 20 4 58 20 35 14 46 13 35 15 43 16 34 10 33 10 32 10 29 10 28 9 36 11 25 12 48 14 46 15 55 15 38 15 36 14 50 16 19 6 Source: author’s Estimate regression coefficients of the model 0 1Y b b x= + ) . 3.12 A statistical office examined dependence of yearly savings on yearly income. Both variables are related to families with two children. The result of the survey is in the following table. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 53 - Income (thousands of crowns) 104 125 1 146 1 167 1 111 1 135 1 189 1 196 2 205 2 210 1 170 230 Savings (thousands of crowns) 6 5,6 9 9,2 1 14 8 8 9 9,1 2 20,5 2 29 2 23,2 3 38,5 2 25 40 Source: author’s Find the linear regression model that explains a dependence of savings on income. Using the model, estimate the savings of a family whose yearly income is 205 thousand crowns. 3.13 Eight families were randomly selected from national accounting records. Their gross yearly income (= explanatory variable x, measured in crowns) and their yearly expenses on industrial products (= explained variable Y, measured in crowns) were analysed. The results are in the table. x 211399 306502 250251 264138 274060 297046 328645 249987 Y 42276 72341 49852 53827 54914 60409 71729 47997 Source: author’s a. Estimate the linear regression function which describes a dependence of expenses on income. b. Calculate the theoretical expenses of a family with income exceeding 300 thousand crowns. 3.14 a. Using the data of problem 3.13, calculate the standard deviation of the estimates ,ib i = 0, 1. b. Using the data of problem 3.13, calculate the test criterion used to test the statistical insignificance of 1b . 3.15 The following data is available on the production of France (= variable Y in millions of euros), its amount of fixed capital (= variable X1 in millions of euros) and employment (= variable X2 in thousands of people). Estimate regression coefficients of the model 2211021 ),( xxXXf βββ ++= , where 1 2( , )Y f X X ε= + . Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 54 Economic sector Yi 1ix 2ix Agriculture 288443 18781 1055 Food and drinks 393828 13990 551 Power 330300 33813 223 Semi-products 602182 32022 1101 Production equipment 426720 19520 965 Household goods 34008 1258 49 Transportation means 185887 10462 358 Consumer products 427766 16392 1030 Construction 436926 19828 1472 Trade 495319 36354 2691 Transportation 417147 58196 1268 Market services 1002132 116083 4617 Insurance services 61827 2053 158 Financial services 709297 6908 441 Non-market services 840622 136923 6148 Source: author’s SOLUTIONS 3.1 yes 3.2 yes 3.3 no 3.4 no 3.5 no 3.6 zero, constant 3.7 t-test 3.8 ones 3.9 quantitative 3.10 the main diagonal 3.11 1,394 0,296Y x= + ) , 3.12 26,399 0,274Y x= − + ) ; 29 711 crowns 3.13 a. 19599,4 0,2796 ,Y x= − + ) b. for 300000 ,ix crowns= we get 64298iY = ) crowns. At least 64298 is the answer. 3.14 a. 1( ) 0,03375,s b = b. The test criterion T = 8,284, the critical value ⇒= 447,2)05,0(6t we accept the hypothesis H1 on dependence of expenses on income. 
3.15 Ŷ = 263684,7 + 2,2331x1 + 66,7912x2.

4 CORRELATION ANALYSIS

In the previous chapter, we were solving the problem of finding a functional relation which would describe the dependence of one variable Y on other, explanatory variables represented by a vector X. Mathematically speaking, the relations were linear in parameters. In this chapter, we will be preoccupied with the problem of measuring the intensity of dependence among variables. There is more than one way to do it. Perhaps the simplest way of measuring dependence is what is called correlation analysis. Correlation analysis is closely related to regression [2], as it benefits from the theory of linear regression models. The objective of correlation is different, however. It does not seek a reasonable form of the relations among variables, because it a priori assumes that the relations are linear in parameters and even in the variables as well. Instead, it focuses on the construction of measures of dependence among the variables. The chapter is accompanied by examples to give a better understanding of the subject. After studying correlation, the reader is advised to work out the problems at the end of the chapter.

4.1 CORRELATION COEFFICIENT

In the simplest form, we study the dependence between two random variables Y and X. In this case, the paired correlation coefficient ρxy is used to measure the level of linear dependence between the two variables. The coefficient is defined as

4-1   \rho_{xy} = \frac{Cov(X, Y)}{\sigma(X)\cdot\sigma(Y)}   for σ(X) > 0, σ(Y) > 0, and ρxy = 0 otherwise.

Here, Cov(X,Y) = E(XY) − E(X)·E(Y) is the covariance of the random variables X and Y, a characteristic defined in chapter one. Also, σ(X) and σ(Y) are the standard deviations of X and Y, respectively. The symbol E stands for the expected value of a random variable. Expected value was explained in the course Statistics.

The paired correlation coefficient is an element of the closed interval [−1,1], i.e. ρxy ∈ [−1,1]. If ρxy = 0, we say that the variables X and Y are uncorrelated. If ρxy = 1 or ρxy = −1, an exact functional relation exists between X and Y, the function being a line. If ρxy = 1, the line is increasing; if ρxy = −1, the line is decreasing. If ρxy = 0, we can only conclude that the variables are uncorrelated. We cannot say that they are (statistically) independent. Although independent variables are uncorrelated, the opposite statement is not generally true.

PROBLEM 1
Let us calculate the correlation coefficient ρxy for the data in table 15:

Table 15: Entry data for correlation analysis
X   -2   -1   0   1   2
Y    4    1   0   1   4
Source: author's

All the pairs occur with the same probability p. Table 16 adds some preparatory calculations to make things easier.

Table 16: Preliminary calculations
Xi         Yi         Xi.Yi
-2         4          -8
-1         1          -1
0          0          0
1          1          1
2          4          8
Σxi = 0    Σyi = 10   Σxi.yi = 0

We have

Cov(X, Y) = E(XY) - E(X)\cdot E(Y) = p\sum_i x_i y_i - p^2 \sum_i x_i \sum_i y_i = 0,

and thus ρxy = 0. At the same time, it can be seen that the two variables are not independent – on the contrary, they are even perfectly dependent, one of the variables being the second power of the other. Formula 4-1 defines the population (theoretical) correlation coefficient, which in most cases cannot be calculated, since the population characteristics Cov(X,Y), σ(X) and σ(Y) will most likely be unknown.
The example above is devised artificially, of course. Because the population characteristics are rarely known in practice, the population coefficient is usually estimated by its sample version r_xy, which is calculated from a data sample. The sample correlation coefficient is defined by

4-2  r_xy = (n·Σxᵢyᵢ − Σxᵢ·Σyᵢ) / √[(n·Σxᵢ² − (Σxᵢ)²)·(n·Σyᵢ² − (Σyᵢ)²)].

To decide reasonably whether there is any amount of linear dependence between Y and X, the correlation is tested for significance. The null hypothesis of the test states that ρ_xy = 0, and the alternative hypothesis says the opposite – there exists a nonzero correlation. To perform the test, the sample correlation coefficient is used.

Testing zero value of the paired population correlation:
1. The null hypothesis is H0: ρ_xy = 0 vs. the alternative H1: ρ_xy ≠ 0.
2. The test criterion is of the form T = r_xy·√(n − 2) / √(1 − r_xy²), where n = number of sample pairs (xᵢ, yᵢ).
3. The critical value of the test for a nivel of test alpha is K = t_{n−2}(α). Thus, it concerns a Student’s distribution with n−2 degrees of freedom.
4. If |T| < K, H0 is accepted, i.e. Y is not linearly dependent on X. In the opposite case, H1 is accepted, which means that Y is (at least to a certain extent) linearly dependent on X.

Let us note that there are conditions that must be satisfied for the test to be valid: the main condition is that the pairs were drawn from a two-dimensional normal distribution.

PROBLEM 2
Let us have the following sample pairs made up of values xᵢ and yᵢ (the first two columns of table 17):

Table 17: Result of a random sampling, preparatory calculations
xi    yi    xi·yi   xi²   yi²
−2    −5    10      4     25
−1    −3    3       1     9
 0     0    0       0     0
 1     1    1       1     1
 2     4    8       4     16
Σ = 0  Σ = −3  Σ = 22  Σ = 10  Σ = 51
Source: author’s

Using 4-2, we obtain

r_xy = (5·22 − 0·(−3)) / √[(5·10 − 0²)·(5·51 − (−3)²)] = 0,9918.

This value is a clear signal that there probably is a linear relationship between Y and X. However, the sample size is very small, and so we prefer to test the (in)significance of the correlation, using a one per cent nivel of test.

T = 0,9918·√(5 − 2) / √(1 − 0,9918²) = 1,718 / 0,128 = 13,443.

Since α = 0,01, the critical value is K = t_{5−2}(0,01) = TINV(0,01; 3) = 5,84. As T > K, there is a significant (nonzero) linear dependence of Y on X. This is also confirmed by the p-value of the test, which is TDIST(13,443; 3; 2) = 0,00089. The value is substantially smaller than the nivel of test, suggesting that the correlation is significant for nivels of test 0,00089 and higher, i.e. for all reasonable nivels of test.

4.2 CORRELATION INDEX

If the regression function, based on which the correlation of two variables is assessed, isn’t linear, it is possible to measure the dependence of the two variables with the correlation index:

4-3  I_xy = S_Ŷ / S_Y, where S_Ŷ = √( Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² ), S_Y = √( Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² ).

The symbols were already used in regression analysis. The calculation of I_xy is more laborious than that of r_xy because the regression function has to be found first, so that the theoretical values Ŷᵢ of the dependent variable are available. The values Yᵢ are measured, and their average is Ȳ. The theoretical values Ŷᵢ correspond to the values of the explanatory variable X appearing in the regression, as was described in the chapter on regression analysis: Ŷᵢ = f(xᵢ). The index satisfies 0 ≤ I_xy ≤ 1.
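The calculations of Problem 2 and of formula 4-3 can be reproduced with a few lines of Python. The sketch below is an illustration added to this text: it evaluates 4-2 and the test criterion for the data of table 17, and then computes the correlation index 4-3 for a regression line fitted by least squares (numpy.polyfit is used for the fit).

import math
import numpy as np

x = [-2, -1, 0, 1, 2]
y = [-5, -3, 0, 1, 4]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

# formula 4-2: sample paired correlation coefficient
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# test criterion T = r * sqrt(n-2) / sqrt(1 - r^2); compare |T| with t_{n-2}(alpha)
T = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(r, 4), round(T, 2))    # r = 0.9918, T = 13.47 (the text gets 13.443 from the rounded r)

# formula 4-3: correlation index for a regression line fitted by least squares
b1, b0 = np.polyfit(x, y, 1)                 # slope and intercept of the line
y_hat = [b0 + b1 * a for a in x]             # theoretical values
y_bar = sum(y) / n
I = math.sqrt(sum((v - y_bar) ** 2 for v in y_hat) /
              sum((v - y_bar) ** 2 for v in y))
print(round(I, 4))                           # 0.9918, i.e. equal to |r| for a line

The last printed value illustrates the statement made below: for a regression line, the index coincides with the absolute value of the paired correlation coefficient.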
Discussions about the resulting value of the index are similar to those about r_xy; testing its significance is not performed, however. I_xy can also be used in the case of a regression line. It is then identical to the absolute value of the paired correlation coefficient r_xy.

4.3 SPEARMAN’S RANK CORRELATION COEFFICIENT

If only the ranks of two variables X, Y are known, not their original values, Spearman’s rank correlation coefficient r_S is used instead to measure the dependence of the variables:

4-4  r_S = 1 − 6·Σᵢ dᵢ² / (n·(n² − 1)).

Here, dᵢ is the difference of the i-th ranks of X and Y, and n is the number of numerical pairs of the two variables that are available through a data sampling.

PROBLEM 3
Products were sorted by their quality, the sorting having been implemented by two committees: specialists were on one committee, laymen selected from the general public were on another committee. Determine whether the resulting assessments of the product quality depend on which committee is considered for this purpose. Here, the dependence is understood in the sense of a correlation. The entry data are in table 18, together with the differences in the rankings.

Table 18: Product rankings
Product   Ranks by laymen   Ranks by specialists   di    di²
1         7                 8                      −1    1
2         9                 9                       0    0
3         8                 7                       1    1
4         10                10                      0    0
5         6                 6                       0    0
6         5                 4                       1    1
7         3                 5                      −2    4
8         4                 3                       1    1
9         2                 2                       0    0
10        1                 1                       0    0
Source: author’s

r_S = 1 − 6·Σᵢ dᵢ² / (n·(n² − 1)) = 1 − 6·8 / (10·99) = 0,95.

The coefficient can be used to test the statistical independence (not only correlation!) of the two variables:
1. The tested hypothesis is H0: X, Y are independent vs. H1: X, Y are not independent.
2. The test criterion is T = √(n − 1)·r_S.
3. The critical value K is that of the standard normal distribution N(0,1). For a nivel of test alpha, the value is calculated with the Excel function NORMSINV(1 − alpha/2).
4. If |T| ≥ K, we reject H0. In the opposite case, we accept H0.

If H0 is accepted, we know the variables are independent, and thus uncorrelated as well. If the null hypothesis is rejected, we know the variables are not independent, but we do not know whether they are uncorrelated or not. The test is approximately valid provided that n ≥ 30 and the random vector (X,Y) follows a two-dimensional continuous probability distribution.

4.4 MULTIVARIATE DEPENDENCE - THE CASE OF TWO VARIABLES

If we want to examine the linear dependence of a variable Y on variables X1, X2, ..., Xp, p > 1, we use either:
a. coefficients of partial correlation,
b. coefficients of multivariate correlation.

Ad a. The partial correlation coefficient r_{yx1·x2...xp} measures the intensity of the linear dependence of Y on X1 provided a certain effect of the variables X2, ..., Xp is removed. These are the variables listed behind the symbol „·“. Partial correlation tries to solve the problem that the effect of X1 might be distorted by the contemporaneous effects of the variables X2, ..., Xp. We shall restrict our analysis to the case of p = 2. The coefficient of partial correlation appears again in two forms – as a population coefficient and a sample coefficient. In the latter case, and for p = 2, the sample version is calculated as

4-5  r_{yx1·x2} = (r_{yx1} − r_{yx2}·r_{x1x2}) / √[(1 − r_{yx2}²)·(1 − r_{x1x2}²)],

or

4-6  r_{yx2·x1} = (r_{yx2} − r_{yx1}·r_{x1x2}) / √[(1 − r_{yx1}²)·(1 − r_{x1x2}²)].

In both cases, the coefficients take on a value from the interval [−1,1].
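Formulas 4-5 and 4-6 are easy to evaluate once the three paired correlations are available. The following Python helper is a minimal sketch added for illustration; the sample values fed into it are the paired correlations of Problem 4 below, so the results can be checked against the solution given there.

import math

def partial_corr(r_yx1, r_yx2, r_x1x2):
    """Sample partial correlations 4-5 and 4-6 for the case p = 2."""
    d1 = math.sqrt((1 - r_yx2 ** 2) * (1 - r_x1x2 ** 2))
    d2 = math.sqrt((1 - r_yx1 ** 2) * (1 - r_x1x2 ** 2))
    return ((r_yx1 - r_yx2 * r_x1x2) / d1,     # r_{yx1.x2}
            (r_yx2 - r_yx1 * r_x1x2) / d2)     # r_{yx2.x1}

print(partial_corr(0.85, 0.75, 0.73))          # approximately (0.67, 0.36)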
As can be seen from 4-5 and 4-6, to calculate the partial correlation, it is necessary to evaluate various combinations of paired correlations. Since we talk about a sample partial correlation, it is possible to test the significance of its population version.

Testing the statistical significance of a partial correlation (p = 2):
1. H0: ρ_{yx1·x2} = 0, H1: ρ_{yx1·x2} ≠ 0.
2. The test criterion is T = r_{yx1·x2}·√(n − 3) / √(1 − r_{yx1·x2}²).
3. The critical value for a nivel of test alpha is K = t_{n−3}(α) = TINV(α, n−3).
4. If |T| ≥ t_{n−3}(α), the coefficient of partial correlation is significant, i.e. nonzero.

The test is valid provided the random vector (Y, X1, X2) follows a three-dimensional normal distribution. It is also assumed that n > 3.

Ad b. The coefficient of multiple correlation measures the dependence of a variable Y on all the explanatory variables X1, X2, ..., Xp. If two explanatory variables are considered, the sample version of the coefficient satisfies

4-7  r_{y·x1x2} = √[ (r_{yx1}² + r_{yx2}² − 2·r_{yx1}·r_{yx2}·r_{x1x2}) / (1 − r_{x1x2}²) ],  0 ≤ r_{y·x1x2} ≤ 1.

The significance of the coefficient can also be tested.

Testing the statistical significance of a multiple correlation:
1. H0: ρ_{y·x1x2} = 0 vs. H1: ρ_{y·x1x2} ≠ 0.
2. The test criterion: T = [r_{y·x1x2}²·(n − 3)] / [2·(1 − r_{y·x1x2}²)].
3. The critical value of the test is related to a Fisher’s distribution this time. The distribution has 2 and n−3 degrees of freedom: the critical value is written as F_{2,n−3}(α) for a nivel of test alpha. In Excel, it can be obtained with the function FINV(α, 2, n−3).
4. If T ≥ F_{2,n−3}(α), the coefficient of multiple correlation is statistically significant (the null hypothesis is rejected). In the opposite case, it is statistically insignificant (the null hypothesis is accepted).

The test is valid on condition that the random vector (Y, X1, X2) follows a three-dimensional normal distribution. It is also assumed that n > 3.

SUMMARY

In this chapter, we became familiar with another important statistical term: correlation analysis. We learnt how to compute the correlation coefficient, the correlation index and Spearman’s rank correlation coefficient. The end of the chapter discussed partial correlation and multiple correlation. The theory and examples worked with the case of two explanatory variables because the calculations become more cumbersome if more explanatory variables are considered. The text is now followed by more examples.

PROBLEM 4
Check if there is any linear dependence among the following variables. Calculate the partial correlation coefficients and test the smaller one for significance. The nivel of test is 5%. Also, calculate the coefficient of multiple correlation and test its significance for a 5% nivel of test. Use table 19 for the calculations. The dependent variable is Y, the other variables are explanatory.
Table 19: Entry data for problem 4
Y    12     8      76,4   17     21,3   10
X1   2      1,2    14,8   8,3    8,4    3
X2   150    94     811    254    399    95

Y    12,5   97,3   88     25     38,6   47,3
X1   4,8    15,6   16,1   11,5   14,2   14
X2   149    312    952    247    400    312
Source: author’s

SOLUTION
The paired correlations are as follows: r_{yx1} = 0,85, r_{yx2} = 0,75, r_{x1x2} = 0,73. Inserting them in equations 4-5 and 4-6, we get the partial correlations r_{yx1·x2} = 0,67 and r_{yx2·x1} = 0,36.

Testing of r_{yx2·x1} = 0,36:
1. H0: ρ_{yx2·x1} = 0 vs. H1: ρ_{yx2·x1} ≠ 0.
2. The test criterion: T = r_{yx2·x1}·√(12 − 3) / √(1 − r_{yx2·x1}²) = 1,16.
3. The critical value: t_{12−3}(0,05) = 2,262.
4. Since |T| < t₉(0,05), we accept the null hypothesis and conclude the partial correlation is insignificant.

In the end, we evaluate the multiple correlation coefficient and test its significance. Using equation 4-7, we get

r_{y·x1x2} = √[ (r_{yx1}² + r_{yx2}² − 2·r_{yx1}·r_{yx2}·r_{x1x2}) / (1 − r_{x1x2}²) ] = 0,87.

As we can see, the value is greater than all the paired correlation coefficients. Testing the significance, we have:
1. H0: ρ_{y·x1x2} = 0 (no linear dependence) vs. H1: ρ_{y·x1x2} ≠ 0.
2. The test criterion: T = [r_{y·x1x2}²·(12 − 3)] / [2·(1 − r_{y·x1x2}²)] = 14,54.
3. The critical value = FINV(0,05; 2; 9) = 4,26.
4. Since the test criterion falls in the critical region, we reject the null hypothesis, and we conclude that there is a combined linear effect of the X’s on Y.

PROBLEM 5
The sample coefficient of paired correlation r_xy = 0,23 has been calculated, based on a data sample of size n = 25. Verify for a 1% nivel of test whether there is a linear dependence between the variables X and Y in the population.

SOLUTION
The test criterion satisfies

T = r_xy·√(n − 2) / √(1 − r_xy²) = 0,23·√(25 − 2) / √(1 − 0,23²) = 1,133.

The critical value of the test, found either in statistical tables or in Excel, is equal to t₂₃(0,01) = 2,8. Since 1,133 < 2,8, we cannot reject the null hypothesis, i.e. the existence of a linear dependence has not been proved.

PROBLEM 6
Canard Company has been monitoring a potential dependence of its operational costs per unit of production Y on total production X (in thousands of pieces).

Table 20: Costs Y and production X of Canard Company
xi   60     71     92     144    192    306
yi   5157   2620   1986   1582   1100   954

xi   437    481    747    989    1383
yi   729    456    200    196    110
Source: author’s

Calculate the correlation index provided that a hyperbolic dependence of the form Y = a/X + b + ε is assumed. Here, a and b are unknown regression coefficients.

SOLUTION
The least squares method is used to estimate the coefficients a and b. The estimates â, b̂ are then inserted in the regression equation, and the theoretical values ŷᵢ = â/xᵢ + b̂ are found for the different values of xᵢ. Also, the average value ȳ, calculated from all the empirical values of the dependent variable Y, must be evaluated. Equation 4-3 then gives

I_yx = √[ Σ(ŷᵢ − ȳ)² / Σ(yᵢ − ȳ)² ] = √(19813814 / 22155242) = 0,945.

The index suggests a high level of dependence of the costs on the production, which is not surprising, of course. However, what we are mainly interested in is that it is the hyperbolic equation that seems to describe the relation well, as suggested by the index.

CONTROL TEST 4

Yes/No answers:
4.1 Correlation coefficient measures a dependence of Y on X?
4.2 Correlation coefficient can take on any value from interval [0,1] ? 4.3 The null hypothesis of the test that verifies significance of the correlation coefficient assumes that two variables are uncorrelated? 4.4 It is much simpler to calculate the index of correlation than the paired correlation coefficient? 4.5 Spearman’s correlation coefficient can take on any value from interval [ 1,1]− ? Complete the statement: Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 65 - 4.6 Correlation analysis seeks a measure of __________ 4.7 If 1xyr = , then the line which describes the relation is __________ 4.8 The correlation index can take on any value from interval __________ 4.9 If values of variables X, Y represent ranks, then __________ correlation coefficient is used to describe a linear dependence between the two variables. 4.10 If Y is linearly dependent on ( )mXXXX ,...,, 21= , we use the coefficient of __________ __________to measure the dependence. 4.11 Calculate the paired correlation between coal extraction (in thousands of tons) and costs per ton of the extracted coal (in crowns). The available data are in the table. Mine number č. ix iy 1 350 37 2 351 38 3 329 38 4 329 38,5 5 327 37,5 6 322 39,1 7 321 39,6 8 316 42,1 9 298 42,9 10 286 43,5 ∑ 3229 396,2 Source: author’s 4.12 National accounts were used for a random selection of eight families. The accounts show the gross yearly income X of the families (in crowns) and their yearly expenses Y on consumer products. The data are in the table below. Calculate the correlation index for the case the linear dependence takes the form of a line, and calculate the paired correlation coefficient, as well. ix 211399 306502 250251 264138 iy 42276 72341 49852 53827 ix 274060 297046 328645 249987 iy 59914 60409 71729 47997 Source: author’s Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 66 - 4.13 Ten films were presented to the jury at a film festival, and viewers also participated in the rating of the movies. The final ranking is in the table. Film A B C D E F G H I J Rankings by jury 5 7 9 1 2 8 3 4 6 10 Rankings by viewers 1 6 4 3 8 7 2 5 10 9 Source: author’s Estimate the correlation between the rankings, using Spearman’s correlation coefficient. Test significance of the coefficient for 5% nivel of test. 4.14 Calculate the coefficient of multiple correlation for the data in the next table. The data describes a dependence of output volume Y on fixed capital X1 and employment X2. Economic sector iy 1ix 2ix Agriculture 288443 18781 1055 Food and Drinks 393828 13990 551 Power sector 330300 33813 223 Semi-products 602182 32022 1101 Production equipment 426720 19520 965 Household goods 34008 1258 49 Transit means 185887 10462 358 Consumer goods 427766 16392 1030 Construction sector 436926 19828 1472 Source: author’s SOLUTIONS 4.1 yes 4.2 no 4.3 yes 4.4 no 4.5 yes 4.6 measures dependence 4.7 rising 4.8 [0,1] 4.9 Spearman’s 4.10 Multiple correlation 4.11 0,8967yxr = − 4.12 0,9196yx yxr I= = 4.13 Spearman’s correlation 0,38sr = , 3,44T = , 1,64.K = The coefficient is significant. 4.14 1 2 2 0,6069.y x xr • = Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 67 - 5 METHODS FOR SALES PREDICTIONS Time series theory represents today a very important part of econometrics. The theory enables us to describe systems that change their behaviour in time. It is necessary to say that the dynamics of these systems deepens as the world globalization advances. National economy is a typical example of where the time series analysis is exploited. 
We might be interested in monthly movements of the aggregate price levels, published by the national statistical office, currency exchange rate closing quotes, etc. Time series, however, do not originate only in economy, but in other spheres of human activity, as well. For instance, birthrates and death rates are observed in demography, maximal nad minimal temperatures are recorded in meteorology, or blood pressure readings of a patient, as part of preventive checkups, are documented by the doctor. The time series analysis aims to understand the mechanism which generated the series in question. Understanding the mechanism allows one, to an extent, to control the functioning of the system which generated the series, and thus to set a desired future course of the system by defining appropriately its input parameters. One may also use the insight into the mechanism for prediction of the future behaviour of the system. The system which created a time series is described by a mathematical model. The time series theory is very extensive. It may as well be the most extensive branch of statistics, as some scholars note. In this chapter, we shall deal with the so-called classical time series theory. If we recall the general form of a regression model, the equation consisted of a systematic part and a random part. Whereas the systematic part reflected the systematic effect of the most important factors on the modelled variable, the random part represented the effect of all other and less important factors the separate influence of which is hard, if not impossible, to capture. Different approaches exist, regarding how to build a model that will approximate the mechanism of generating the time series. The classical approach focuses on the systematic part of the regression model. The classical analysis assumes that the systematic part of the model can be decomposed into several elements of a specific type, which will shed more light on the origin of the series. It is understood that it is easier to detect these elements when they are separated rather than when their effect is aggregated. Another reason that stands behind the decomposition is the effort to discover potential seasonality, since it is a common practice to deprive the series of seasonality for different reasons. In our case, to keep things simpler, we shall also assume that the mathematical model describing the time series containts only one explanatory variable t which represents a point in time. 5.1 TIME SERIES A time series {yt} (TS) is a sequence of values that represent a realization of a sequence of random variables. We shall be mainly interested in economic time series, and in time development of sales, in particular. It is usually assumed for a time series that: • the main factor of change is the time t, • equidistant time intervals, i.e. there is always the same time distance between any two neighbouring points in time t and t’, for which we get values yt and yt’ of the series. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 68 A mathematical model is used to describe the time series. The main objective of building the model is to use it for prediction of the future values of the series. Two types of predictions are distinguished: point prediction and interval prediction. 
5.2 TIME SERIES MODEL DECOMPOSITION It is assumed that the model of a time series can be decomposed into four elements which describe different aspects of the time development of the analysed variable: • trend component Tt, • seasonal component St, • cyclical component Ct, • random component tε . The trend component describes the fundamental character of the time development of the series. It tells us the essential character of its movement (whether it rises or drops, whether its level will eventually taper off or accelerate upwards, etc.). The trend expresses the systematic and long-term effect of factors that keep affecting the series in the same way. The trend is either rising or declining. If it’s neither rising nor declining, it is a series with no trend. The seasonal and cyclical components, which combined form the periodic component, capture regular oscilations of the series. The former relates to oscilations that take place within one year. These oscilations repeat every year, at the same moment. The oscilations can be attributed to natural phenomena (different seasons of the year – spring, summer, autumn, winter), or social habits (construction activities are more pronounced in the summer than in winter, for instance). The important feature of seasonality is that the regular oscilations of the same type take place every twelve months at the latest. The cyclical component represents the effect of factors that give rise to longer-term oscilations. We also talk about oscilations around the trend, and the time delay between two oscilations is more than twelve months. It is usually difficult to describe mathematically the cyclical component. Therefore, it is sometimes not included in the time series model at all. The difficulty is given by the fact that the cyclical oscilations are not as regular, and they often vary in their intensity. The trend, seasonal and cyclical components form the deterministic component of the model. It is usually assumed that their combined effect is a result of their addition, i.e. the following model is assumed 5-1 , 0, 1, 2,...t t t t tY T S C tε= + + + = ± ± . In this case, we talk about the additive model of a time series. Two special cases of 5-1 are worth mentioning, as they often appear in economic applications: 1) the case without the periodic component, i.e. the case S Ct t= = 0 , so that we have 5-2 , 0, 1, 2,...t t tY T tε= + = ± ± . In the other case, 0tC = is assumed, which turns 5-1 into 5-3 , 0, 1, 2,...t t t ty T S tε= + + = ± ± . Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 69 This is a time series model with seasonality. Apart from the additive structure of 5-1, there is also a multiplicative version of the model: 5-4 , 0, 1, 2,...t t t t ty T S C tε= ⋅ ⋅ ⋅ = ± ± . The main objective of the time series analysis is to quantify the individual components of the time series model. The stochastic process that generated the time series can also be analysed with other mathematical means, including the so-called adaptive approaches, such as moving averages and exponential smoothing. We shall talk about moving averages later. The following modelling techniques focus on the practical aspect of working with time series, as is the case of other statistical methods in this course. The reader who is more interested in the time series analysis may take specialized courses that cover the subject in a greater detail. 5.2.1 TREND As was mentioned already, time t is now assumed to be the only factor determining the dynamics of the analysed variable. 
The assumption, although largely simplifying the reality, makes it much more straightforward to model the time series under consideration and separate its individual components. One of these components – the trend – is the most important part the series. Let us assume the model can be written as in 5-2. The trend ܶ௧ of this model is very often described by a linear function, polynomial of degree two, exponential function, modified exponential function, logistic curve or Gompertz’s curve. The functions differ in their complexity, which further affects the way their parameters are estimated. If we work with a linear function or polynomial of degree two, both these function are linear in parameters, and thus their parameters can be estimated with the least squares method explained in the chapter on regression. The case of the other functions is different, since the mathematical description of the curves is not linear in parameters any more. Therefore, a different method must be used to find estimates of the unknown model parameters. Linear trend (or polynomial of degree two) Assuming a linear trend, the model 5-2 can be written as 5-5 0 1 , 0, 1, 2,...t tY t tβ β ε= + + = ± ± . Estimates of the parameters ߚ଴, ߚଵ are obtained with the least squares method. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 70 PROBLEM 1 The following table 21 contains a time series data. Table 21: A time series for problem 1 yt 14,1 15,3 17,7 18,2 20,5 22,8 23,4 25,5 27,9 28,9 31 33,1 35,2 t 1 2 3 4 5 6 7 8 9 10 11 12 13 Source: author’s Excel: you can have the series depicted in Excel, together with the trend calculated by the least squares method. The output is in figure 2. Figure 2: The time series and its trend in the form of a line To get the figure in Excel (version 2010), the following steps must be taken: highlight the area of Excel containing the time series data Yt, t = 1, 2, …, n, and select Insert →→→→ Graph →→→→ XY dot →→→→ with straight connecting lines. This procedure will draw the graph of the original series. If it is necessary to convert the scale on the x-axis of the graph into values 1,2, …, n, click on the graph and press the right button of the computer mouse, choose „Select data“ and „Adjust the x-axis“, as proposed in the dialogue window. If you click on the graph again, you will see Graph tools at the top of Excel and Trend line in the corresponding window. Here, you may select the linear trend line and also its equation from the options. One can proceed similarly in the case of polynomials of degree two, the equation of which is ܶ௧ ൌ ߚ଴ ൅ ߚଵ‫ݐ‬ ൅ ߚଶ‫ݐ‬ଶ , ‫ݐ‬ ൌ 0, േ1, േ2, …., or in the case of polynomials of higher degrees. y = 1,7495x + 11,877 0 5 10 15 20 25 30 35 40 1 2 3 4 5 6 7 8 9 10 11 12 13 Series Series Lineární (Series) Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 71 Logistic trend Logistic curves belong to the set of S-curves. This type of curves is often used in situations where a certain business cycle occurs. These cycles experience phases including the phase of saturation, which is what logistic curves capture well because they have a horizontal asymptote. For instance, when a new product is brought to the market, it is expected that it will take some time before customers register the product and try it out. Thus, at the early stages of the product marketing, the product sales volumes will go up slowly. 
Later, however, when the product comes into use, the sales growth curve will probably be steeper as more customers start to move from older versions of the product to the new version. When the sales volumes reach their climax, the product will dominate the market, as it becomes a fad. Later, competitors will catch up, and the sales volumes of the product will start to weaken. All these stages can be depicted by the logistic curve (figure 3). Figure 3: The S-shape of a logistic trend Logistic trend is given by equation 1 0 1 , 0, 0, t = 1,2,... 1 t t T κ κ β β β = > > + . The function has an S-curve shape for ߚଵ ൏ 1, ߚ଴ ൐ 1. The unknown parameters of the trend can be estimated using a method of selected points: Let the length of the time series be ܶ, where ܶ is an odd number, and let us select chronologically the first, the middle (the p-th, say) and the last observation of the series. Then the parameter estimates satisfy 5-6 ܾଵ ൌ ඨ ሺ1/‫ݕ‬௣ሻ െ ሺ1/‫ݕ‬்ሻ ሺ1/‫ݕ‬ଵሻ െ ሺ1/‫ݕ‬௣ሻ ೛షభ , ݇ ൌ ‫ݕ‬ଵሺ1 െ ܾଵ ௣ିଵ ሻ ൫‫ݕ‬ଵ ‫ݕ‬௣⁄ ൯ െ ܾଵ ௣ିଵ, ܾ଴ ൌ ሺ1/‫ݕ‬ଵሻ െ ሺ1/‫ݕ‬௣ሻ ܾଵ െ ܾଵ ௣ · ݇. If T is even, it is necessary to select other values of the series: for example, the first, the p-th and the r-th such that r-pൌp-1, and then formula 5-6 can be used again. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 72 PROBLEM 2 Table 22 contains values of a time series. Describe the series with a logistic trend, using the method of selected points. Table 22: Values of a time series to be captured by a logistic trend t 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 yt 0,4 0,6 0,6 0,7 0,7 0,8 0,9 0,9 1 1,1 1,2 1,2 1,8 1,4 1,6 1,7 t 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 yt 2,3 2,5 2,2 2,6 2,3 2,6 2,7 2,8 3,2 3 3,1 3,1 3,3 3,4 3,6 Source: author’s The method of selected points gives ܾଵ ൌ ඨ ሺ1/‫ݕ‬௣ሻ െ ሺ1/‫ݕ‬்ሻ ሺ1/‫ݕ‬ଵሻ െ ሺ1/‫ݕ‬௣ሻ ೛షభ ൌ ඨ ሺ1/1,73ሻ െ ሺ1/3,56ሻ ሺ1/0,35ሻ െ ሺ1/1,73ሻ భఱ ൌ 0,872996836, ݇ ൌ ‫ݕ‬ଵሺ1 െ ܾଵ ௣ିଵ ሻ ൫‫ݕ‬ଵ ‫ݕ‬௣⁄ ൯ െ ܾଵ ௣ିଵ ൌ 4,2309685, ܾ଴ ൌ ሺ1/‫ݕ‬ଵሻ െ ሺ1/‫ݕ‬௣ሻ ܾଵ െ ܾଵ ௣ · ݇ ൌ 12,70162866. Thus, the final model of the series is ‫ݕ‬ො௧ ൌ 4,23 1 ൅ 12,7 · 0,87௧ . Table 23 contains the theoretical values of the series generated by the final model. This series and the original series are shown in figure 4 for visual comparison. Table 23: The original and modelled (theoretical) time series t 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 yt 0,4 0,6 0,6 0,7 0,7 0,8 0,9 0,9 1 1,1 1,2 1,2 1,8 1,4 1,56 1,73 ࢟ෝ࢚ 0,4 0,4 0,5 0,5 0,6 0,7 0,7 0,8 0,9 1 1,1 1,3 1,4 1,5 1,64 1,79 t 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 yt 2,3 2,5 2,2 2,6 2,3 2,6 2,7 2,8 3,2 3 3,1 3,1 3,3 3,44 3,56 ࢟ෝ࢚ 1,9 2,1 2,2 2,4 2,5 2,7 2,8 2,9 3 3,2 3,3 3,4 3,5 3,54 3,62 Source: author’s Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 73 Figure 4: The original series and its logistic model There are more trends to choose from, of course, such as exponential trend, modified exponential trend, Gompertz’s curve, etc. As we have said at the beginning of the chapter, the readers interested particularly in this subject, may take a specialized course devoted to time series. We note that a suitable trend can be selected, using differences of the original data: ∆ଵ ‫ݕ‬௧ ൌ ‫ݕ‬௧ െ ‫ݕ‬௧ିଵ, ∆ଶ ‫ݕ‬௧ ൌ ∆ଵ ‫ݕ‬௧ െ ∆ଵ ‫ݕ‬௧ିଵ. 
The selection rule based on the differences is as follows: Table 24: Trend selection Criterion Trend 1 yt ≈ constant Linear 1 yt ≈ linear, 2 yt ≈ constant Quadratic yt - yt-1 ≈ Gauss curve Logistic 5.2.2 SEASONAL COMPONENT – THE CASE OF CONSTANT SEASONALITY Effect of seasonal factors can be described not only by a suitable moving average (to be discussed later), but also by a mathematical curve to which the principles of regression are applied. This concept is based on expanding the regression function, which contains the trend component already, by adding the seasonal component to it, the component depending on unknown parameters to be estimated, as well. As a result, more parameters of the regression function in question are to be estimated – some of these parameters relate to the trend component, while others belong to the seasonal component. The seasonal component is represented by auxiliary variables ‫ݔ‬௜, i = 2, 3, ..., s, where s is the number of seasons the model works with. Each variable ‫ݔ‬௜ is a dichotomous variable, which means that it can only take on values 0 and 1. The variable ‫ݔ‬௜ takes on value 1 for the i – th season and 0 otherwise (for all the other seasons). There are s-1 auxiliary variables in the model because otherwise, if the number was equal to s, perfect multicollinearity would exist, which is a problem that inhibits one from estimating the model parameters (the inverse of the matrix ்ܺ ܺ ceases to exist). Therefore, the effect of one of the seasons is incorporated into the absolute term of the model. 0 0,5 1 1,5 2 2,5 3 3,5 4 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Series Model Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 74 The expanded model is of the form ‫ݕ‬௧ ൌ ܶ௧ ൅ ߙଶ‫ݔ‬ଶ ൅ ߙଷ‫ݔ‬ଷ ൅ ‫ڮ‬ ൅ ߙ௦‫ݔ‬௦ ൅ ߝ௧. If a polynomial trend is assumed, we may write ‫ݕ‬௧ ൌ ߚ଴ ൅ ߚଵ‫ݐ‬ ൅ … ൅ ߚ௞‫ݐ‬௞ ൅ ߙଶ‫ݔ‬ଶ ൅ ‫ڮ‬ ൅ ߙ௦‫ݔ‬௦ ൅ ߝ௧, where the absolute term ߚ଴ contains the effect of the first season. This representation of seasonality also assumes that the sum of all seasonal fluctuations equals zero, i.e. the fluctuations cancel out. Also, independence of the seasonality on the trend is assumed as well as the additive decomposition of the entire regression model. Should there be a dependence between the trend and seasonality, the so-called proportional seasonality might be more convenient for this situation. Let us demonstrate now how to work with the model. PROBLEM 3 Table 25 contains values of a time series that spans a four-year time period. These are quaterly data and, as suggested by figure 5, seasonality probably occurs in each quater of the year. We also adopt the view, based on the graph, that the trend component of our model could be linear. Therefore, we decided to use the model ‫ݕ‬௧ ൌ ߚ଴ ൅ ߚଵ‫ݐ‬൅ߙଶ‫ݔ‬ଶ൅ߙଷ‫ݔ‬ଷ ൅ ߙସ‫ݔ‬ସ ൅ ߝ௧. Table 25: Quaterly time series data Season(i) Sz(1) Sz(2) Sz(3) Sz(4) Sz(1) Sz(2) Sz(3) Sz(4) t 1 2 3 4 5 6 7 8 yt 1,74 -0,3 2,27 -2,7 1,19 -1 1,51 -2,8 Season(i) Sz(1) Sz(2) Sz(3) Sz(4) Sz(1) Sz(2) Sz(3) Sz(4) t 9 10 11 12 13 14 15 16 yt 1,46 -0,2 2,35 -3 1,39 -1,3 1,7 -2,5 Source: author’s Figure 5: Time series data from table 25 -4 -2 0 2 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Series Series Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 75 Let us estimate the unknown coefficients of the model, using the least squares method (see the chapter on regression analysis). We get β଴ൌ1,605, βଵൌ-0,023, αଶൌ-2,1, αଷൌ0,56, αସൌ-4,13. Thus, ܻ෠௧ ൌ 1,605 െ 0,023‫ݐ‬ െ 2,1‫ݔ‬ଶ ൅ 0,56‫ݔ‬ଷ െ 4,13‫ݔ‬ସ. 
However, this is not where the calculation ends. We must realize that the value 1,605 includes the effect of the first-quater seasonality, and we would like to know that effect. Also, the estimates ߙ௜ do not represent the effect of the corresponding i-th quarter yet. For example, the value 0,56 reflects an increase in the third quarter, taking into account the effect of the first quarter which is automatically absorbed in the value 1,605. To isolate the seasonal effects Sz୧, i = 1, 2, 3, 4, for all the quarters, we calculate Szଵ ൌ αଶ ൅ αଷ ൅ αସ 4 , Szଶ ൌ αଶ ൅ Szଵ, Szଷ ൌ αଷ ൅ Szଵ, Szସ ൌ αସ ൅ Szଵ. In our case, ܵ‫ݖ‬ଵ ൌ 1,4175, so the other seasonal effects are ܵ‫ݖ‬ଶ ൌ െ0,68, ܵ‫ݖ‬ଷ ൌ 1,977, ܵ‫ݖ‬ସ ൌ െ2,71. Using one more auxiliary variable xଵ for the first season, we now have yො୲ൌሺ1,605-1,4175ሻ-0,023t൅1,4175xଵ-0,68xଶ൅1,977xଷ-2,71xସ. We can evaluate the theoretical values given by the resulting model. This is done in table 26. These values are visually compared to the original values of the time series (figure 6). The graph proves that the model has been selected well enough. Table 26:Theoretical values given by the regression model t 1 2 3 4 5 6 7 8 y-theoretical 1,582 -0,5 2,096 -2,617 1,49 -0,6 2,004 -2,709 t 9 10 11 12 13 14 15 16 y-theoretical 1,398 -0,7 1,912 -2,801 1,306 -0,8 1,82 -2,893 Source: author’s Figure 6: Comparison of the theoretical and empirical time series data -4 -3 -2 -1 0 1 2 3 1 2 3 4 5 6 7 8 9 10111213141516 Series Model Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 76 - 5.2.3 PROPERTIES OF THE RANDOM COMPONENT OF A REGRESSION MODEL We considered in our models, based on their decomposition into the trend and seasonal component, a random component as well. It is this component that gives the final answer to how the time series in question behaves. We also used the principles of regression to estimate the unknown parameters that appeared in our model. Therefore, care should be taken to make sure that such estimated parameters have good statistical properties. To achieve this goal, it is necessary that the random part of the model satisfies the conditions of the classical regression model. These conditions are listed in the chapter on regression, and we repeat them here for convenience: 1. Expected value of εi is zero, or E(εt) = 0 for each t. 2. Variance of εt is constant, i.e. independent of t: Var(εt) = σ2 for each t. 3. Variables εi, εj are uncorrelated, i.e. Cov(εi, εj) = 0 for i ≠ j. 4. Variable εi follows a normal distribution N(0, σ2 ) for each t. If the condition 2 holds, we talk about homoscedasticity (in the opposite case, when the condition does not hold, we talk about heteroscedasticity). The conditions above should be verified with appropriate statistical methods. We shall focus now on the third condition which typically isn’t met in the case of time series. Before we do so, we note the following: condition 1 is not usually verified, and is assumed to be true. If condition 4 holds together with all the other listed conditions, the least squares estimates are the best of all unbiased estimates in a certain sense. If „only“ the first three conditions hold, which is not to be taken for granted, of course, the least squares estimates will „only“ be the best estimates of all the so-called linear unbiased estimates. There are some terms we just used that require an explanation. 
We shall not provide an explanation at this point because the main reason for making the note is to stress that we can still get reasonably good estimates of the unknown regression coefficients using the least squares method, even if condition 4 does not hold. As far as condition 2 is concerned, one may use the Goldfeld-Quandt test to verify the validity of the condition, or some other more general test, such as White’s test. However, heteroscedasticity is a problem typical of cross-section data, not of time series data, and so we shall not be preoccupied with it here. For these reasons, we shall focus exclusively on the problem of autocorrelation (condition 3). An in-depth coverage of the other problems may be found in different school subjects (in Econometrics, for instance).

The analysis of condition 3 is based on analysing the residuals of the model, i.e. it is based on the values

e_t = Y_t − T̂_t − Ŝ_t,  t = 0, ±1, ±2, ...

Here, the original regression model describing the time series is of the form Y_t = T_t + S_t + ε_t, t = 0, ±1, ±2, ... In the expression e_t = Y_t − T̂_t − Ŝ_t, Y_t represents an individual value of the series, T̂_t is an estimate of the trend, and Ŝ_t is an estimate of the seasonal component of the model. To verify that the random terms of the model are uncorrelated, the Durbin-Watson test is frequently used.

5.2.4 DURBIN-WATSON’S TEST

The test verifies the null hypothesis that the random terms of the model are uncorrelated versus the alternative hypothesis that the random terms are correlated, the correlation taking the following first-order autoregressive form AR(1): ε_t = ρ·ε_{t−1} + u_t, where u_t satisfies the conditions of the classical regression. The AR(1) model also contains an unknown parameter ρ, which is the population paired correlation coefficient measuring the correlation between ε_t and ε_{t−1}. The test thus analyses the validity of the null hypothesis that there is no autocorrelation in the model versus the alternative hypothesis that there is an autocorrelation of the form AR(1).

The test is realized in several steps. First, the least squares estimates of the unknown regression parameters of the model are found, and the estimates are used to calculate the residuals e_t of the model. Second, the residuals are used to construct the following test criterion

5-7  T = Σ_{t=2}^{T} (e_t − e_{t−1})² / Σ_{t=1}^{T} e_t²,

where T is the length of the time series. Special statistical tables are then used to compare the criterion with the critical value of the test. The tables are at the end of this textbook. To use the tables properly, for the given number of observations T, the nivel of test α and the number of model parameters k, which does not include the absolute term of the model, a lower level d_L and an upper level d_H are found in the tables. Also, the sample paired correlation r between the neighbouring residuals has to be calculated according to

5-8  r = Σ_{t=2}^{T} e_t·e_{t−1} / Σ_{t=1}^{T} e_t².

If r is positive, the conclusion of the test is as follows: if the test criterion 5-7 is greater than d_H, the null hypothesis is accepted, whereas if the criterion is smaller than d_L, the null hypothesis is rejected. If r is negative, the statistic T* = 4 − T is used instead of T, and the just-described conclusion is worded based on T*. If T or T* falls between the values d_L and d_H, the test is inconclusive; however, it is recommended that an autocorrelation be assumed preventively, because this is usually the case in time series models.
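The whole procedure can be sketched in a few lines of Python. The code below is an illustration added to this text: it fits a linear trend by least squares to the monthly-expenditure data of Problem 4 that follows, computes the residuals, evaluates 5-7 and 5-8, and applies the decision rule with the tabulated bounds quoted in that problem (d_L = 1,077, d_H = 1,361 for T = 15, k = 1 and a 5% nivel of test).

import numpy as np

# data of Problem 4 below (monthly expenses on food, t = 1, ..., 15)
y = np.array([141, 145, 142, 147, 146, 154, 150, 158,
              157, 165, 164, 170, 167, 174, 175], dtype=float)
t = np.arange(1, len(y) + 1, dtype=float)

b1, b0 = np.polyfit(t, y, 1)                    # least squares line Y_t = b0 + b1*t
e = y - (b0 + b1 * t)                           # residuals e_t

dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # criterion 5-7
r = np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)     # correlation 5-8
print(round(dw, 2), round(r, 2))                # about 3.06 and -0.55 (the text, working with rounded residuals, reports 3,1 and -0,56)

d_L, d_H = 1.077, 1.361                         # bounds taken from the tables
stat = dw if r > 0 else 4 - dw                  # use 4 - DW when r is negative
if stat < d_L:
    print("H0 rejected: an AR(1) autocorrelation is assumed")
elif stat > d_H:
    print("H0 accepted: no autocorrelation")
else:
    print("the test is inconclusive")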
PROBLEM 4 Table 27 contains fictitious data on households’ monthly expenses on food (in millions of crowns). The data spans the time period from January 2000 (t = 1) to March 2001 (t = 15). Table 27: Monthly expenditures Yt Yt 141 145 142 147 146 154 150 158 157 165 164 170 167 174 175 t 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Source: author’s Given the character of the series (figure 7), we shall assume that it follows a linear trend y୲ൌβ଴൅βଵx୲൅ε୲ with a single explanatory variable ‫ݔ‬௧ ൌ ‫.ݐ‬ The seasonal component is not assumed. We start with the estimation of the parameters, and the resulting model will be tested for autocorrelation, using the Durbin – Watson’s test. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 78 Figure 7: The time series Yt The least squares method gives ࢈ ൌ ൬ ܾ଴ ܾଵ ൰ ൌ ሺࢄᇱ ࢄሻିଵ ࢄᇱ ࢅ ൌ ቀ 0,295 െ0,028 െ0,028 0,0036 ቁ · ቀ 2355 19552 ቁ ൌ ቀ 136,66 2,543 ቁ. Inserting the estimates in the model, we get the theoretical values ܻ෠௧ ൌ ܾ଴ ൅ ܾଵ‫ݐ‬ and the residuals ݁௧ ൌ ܻ௧ െ ܻ෠௧, ݁௧ିଵ ൌ ܻ௧ିଵ െ ܻ෠௧ିଵ. Calculations necessary for the test are in table 28. Table 28: Residuals and preparatory calculations et et-1 (et - et-1) 2 et 2 et . et-1 1,8 3,24 3,25 1,8 2,1025 10,5625 5,85 -2,28 3,25 30,5809 5,1984 -7,41 0,17 -2,28 6,0025 0,0289 -0,3876 -3,37 0,17 12,5316 11,3569 -0,5729 2,08 -3,37 29,7025 4,3264 -7,0096 -4 2,08 42,6409 19,8025 -9,256 1 -4 29,7025 1 -4,45 -2,54 1 12,5316 6,4516 -2,54 2,91 -2,54 29,7025 8,4681 -7,3914 -0,62 2,91 12,4609 0,3844 -1,8042 2,82 -0,62 11,8336 7,9524 -1,7484 -2,71 2,82 30,5809 7,3441 -7,6422 1,74 -2,71 19,8025 3,0276 -4,7154 0,2 1,74 2,3716 0,04 0,348 Sum 272,547 85,9438 -48,7297 For the Durbin – Watson’s test, we get the criterion ‫ܹܦ‬ ൌ 272,547/85,9 ൌ 3,1. The estimated correlation between the residuals is according to 5-8: ‫ݎ‬ ൌ െ48,73/85,9 ൌ െ0,56. In the tables at the end of the textbook, we find ݀‫ܮ‬ ൌ 1,077 and ݀‫ܪ‬ ൌ 1,361 for ܶ ൌ 15, the number of regressors without the absolute term = 1 and the nivel of test ߙ ൌ 0,05. Since the sample paired correlation between the residuals is negative, we shall use the alternative test Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 79 criterion T* = 4 െ ‫ܹܦ‬ ൌ 0,9. This number is very low, which means that we may assume the presence of an AR(1) autocorrelation in the model. 5.3 MOVING AVERAGES The trends we have devoted ourselves to assumed that the objective was to find a single mathematical curve that would intersperse and describe the entire times series. Such an effort can be justified, for instance, in situations when the analysed time series is fairly short, so that it makes sense to explain the way it was generated with a single function. Another example is the situation when the user of the final model is interested in a longer-term trend of the series rather than its short-term fluctuations. However, we are often encountered with circuimstances under which using a single function to describe the series behaviour is a too ambitious objective. To give an example, let us consider a company that is interested in the short-term future of the series. If this is the case, the company will probably also be interested in the time period of the series that has just preceded the present, for it is this recent past that will most likely affect the future behaviour of the series to a great extent, as opposed to the more distant past. In such a case, it makes more sense to work with models that reflect potential shifts in the behaviour of the series. 
The recognition of the shifts means, in other words, that we distinguish between the behaviour of the series from the recent past and its behaviour from the distant past. This approach does not assume that the mechanism which generated the series remains the same across time. Approaches which assume shifts in the behaviour of the series are called adaptive approaches, and include, among other techniques, moving averages. Moving averages are based on the principle that only a part of the series is modelled with a selected function, and for later analyses, only one of the values of the function – the middle value - is used as a representative. The function used to model the part of the series depends on unknown parameters which can be estimated with the least squares technique. It is assumed that the parameters may change in time. In this chapter, we shall again work with an additive regression function ܻ௧ ൌ ܶ௧ ൅ ߝ௧, where ߝ௧ satisfies all the required conditions, and we also assume that the analytical form of the trend remains the same across time. However, as we move from one part of the series to another, the model used to describe the new part will generally have different parameters. Therefore, it is said that models with changeable parameters are applied. There are different types of moving averages. If a linear function is used to model various sections of the time series, we talk about a simple moving average. If a second-order polynomial is used for these purposes, we talk about a weighted moving average. Let us take an example to see the theoretical and practical aspects of working with moving averages. 5.3.1 SIMPLE MOVING AVERAGES Let ܻଵ, ܻଶ, … , ܻ௡ be a time series. To use the moving average technique, the length m of the series to be modelled must be determined first. Also, the order of the polynomial that will be used as a model must be selected. If a linear function is used, it will be a polynomial of order one. The length m is usually chosen to be an odd number which can be written as ݉ ൌ 2‫݌‬ ൅ 1, where ‫݌‬ is a positive integer. Each part o the series to be modelled has its center point. These are the values ܻ௧, where ‫ݐ‬ ൌ ‫݌‬ ൅ 1, ‫݌‬ ൅ 2, … , ݊ െ ‫.݌‬ This means that the subsequent part of the series to be modelled is the previous part of the series shifted forward by one observation. This is how the modelled section of the series is moved forward, hence the name of the technique. The center of the first section is given by ܻ௣ାଵ, the center of the Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 80 second part equals ܻ௣ାଶ, etc. … until the center of the final part of the series is ܻ௡ି௣. The k-th part of the series includes observations ܻ௣ା௞ି௣, ܻ௣ା௞ିሺ௣ିଵሻ, … , ܻ௣ା௞, ܻ௣ା௞ାଵ, … , ܻ௣ା௞ା௣, which can be written as ൛ܻ௣ା௞ା௝ൟ ௝ୀି௣ ௣ , or ൛ܻ௧ା௝ൟ ௝ୀି௣ ௣ for ‫ݐ‬ ൌ ‫݌‬ ൅ ݇. To model the part of the series with center point at time ‫,ݐ‬ using the least squares, means to minimize the criterion 5-9 ෍ ൣܻ௧ା௝ െ ܾ଴ሺ‫ݐ‬ሻ െ ܾଵሺ‫ݐ‬ሻ݆൧ ଶ . ௣ ௝ୀି௣ Calculating the corresponding partial derivatives and putting them equal to zero to find the minimum, we arrive at equations 5-10 2 ෍ ൣܻ௧ା௝ െ ܾ଴ሺ‫ݐ‬ሻ െ ܾଵሺ‫ݐ‬ሻ݆൧ሺെ1ሻ ൌ 0, ௣ ௝ୀି௣ 2 ෍ ൣܻ௧ା௝ െ ܾ଴ሺ‫ݐ‬ሻ െ ܾଵሺ‫ݐ‬ሻ݆൧ሺെ݆ሻ ൌ 0. ௣ ௝ୀି௣ We said the principle of moving averages lies in working with a single representative of the modelled part of the series, which is to be the theoretical value of the center-point observation of that part. At this center point, ݆ ൌ 0, which means that ܻ෠௧ ൌ ܾ଴ሺ‫ݐ‬ሻ must hold. 
Thus, it suffices to calculate only the absolute term of the model, using 5-10. Doing so, we get 5-11 ܾ଴ሺ‫ݐ‬ሻ ൌ 1 ݉ ෍ ܻ௧ା௝. ௣ ௝ୀି௣ As we can see, the center-point representative of a given section of the series is the simple average of all the observations that make up the section. And this is where the name of the technique – simple moving average - came from. PROBLEM 5 Table 29 contains a time series which we will now model with a simple moving average of length five. Table 29: A time series for problem 5 t 1 2 3 4 5 6 7 8 9 10 Yt 34 40 37 42 45 47 44 51 52 58 t 11 12 13 14 15 16 17 18 19 20 Yt 55 64 59 66 68 62 72 75 72 77 Source: author’s Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 81 Since ݉ ൌ 5, we have ‫݌‬ ൌ 2. The first theoretical value is ܻ෠ଷ ൌ ଵ ହ ∑ ܻଷା௝ ൌ 39,6ଶ ௝ୀିଶ , the second theoretical value is ܻ෠ହ ൌ ଵ ହ ∑ ܻସା௝ ൌ 42,2ଶ ௝ୀିଶ , etc.…until the last theoretical value is ܻ෠ଵ଼ ൌ ଵ ହ ∑ ܻଵ଼ା௝ ൌ 71,6ଶ ௝ୀିଶ . Extending table 29 with the theoretical values, we get table 30. Table 30: The original time series and the moving averages of length five t 1 2 3 4 5 6 7 8 9 10 Yt 34 40 37 42 45 47 44 51 52 58 average 40 42 43 46 48 50 52 56 t 11 12 13 14 15 16 17 18 19 20 Yt 55 64 59 66 68 62 72 75 72 77 average 58 60 62 64 65 69 70 72 Source: author’s Figure 8 compares the empirical and theoretical values of the series. The original series was shortened by leaving out the first two and the last two observations. Figure 8: The time series and its moving averages 5.4 MAKING PREDICTIONS WITH TIME SERIES MODELS One of the most important reasons why any time series model is constructed is to use the model for predictions of the future behaviour of the time series. Sometimes we also talk about making extrapolations. Predictions are based on the principle that the past mechanism which generated the series will keep its properties unchanged in the future. Thus, one may expect the model to work reasonably well even for future realizations of the series. Let the time series model be of the form t t tY T ε= + , t = 1, 2,…, n, where tT is a linear or quadratic trend (i.e. a second-order polynomial), and n be the present time point. The point prediction n hY + % of the uknown value t hY + of the series at a time point n + h, where h is positive and represents the time horizon of the point prediction, is given by: Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 82 - 5-12 n h n hY T+ +=% . Here, hnT + is the trend evaluated at n + h. This value is not known, but can be estimated by the given trend regression function. The point prediction allows to estimate the future realization of the time series with a single number in a very simple and straightforward way – the future time point n+h is used and inserted in the estimated trend component of the model. Apart from point prediction, interval prediction for t hY + is constructed, as well. The interval prediction constructed at a time n for a time period n+i is given by the following confidence interval: - If the trend is linear, the 95% confidence interval is of the form 5-13 [ n iY + % – tn-2(0,05) ( )ns Q i , n iY + % + tn-2(0,05) ( )ns Q i ], where 5-14 2 1 1 ˆ 2 n n t t t t Y T s n = = − = − ∑ ∑ and 5-15 2 2 2 1 1 ( ) ( ) 1n n t n i t Q i n t nt = + − = + + −∑ , ( 1) / 2t n= + . - If the trend is quadratic, the 95% confidence interval is given by 5-16 [ n iY + % – tn-3(0,05) ( )ns Q i , n iY + % + tn-3(0,05) ( )ns Q i ], where 5-17 2 1 1 ˆ 3 n n t t t t Y T s n = = − = − ∑ ∑ and 5-18 2 1 2 2 1 1 1 1 2 4 ( ) 1 1, ,( ) ( ) 1, ,( ) , . . . 
1 TT nQ i n i n i X X n i n i X n n −         = + + + ⋅ ⋅ ⋅ + + =          . Any 95% confidence interval tells us that it contains the unknown realization of Y with probability = 0,95. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 83 - SUMMARY In this chapter, we studied several important types of time series and possibilities of their use for making predictions. First, we examined the trend components of time series models, the components being expressed in the form of a polynomial or S-curve. In the latter case, a procedure of estimating the unknown parameters of the curve was demonstrated. We also analysed the seasonal component of the model, taking the regression approach again. Finally, the random component of the model and its properties were clarified for the model to be credible when exploited for predictions. One of these properties, the lack of autocorrelation, was examined further with the Durbin-Watson’s statistical test. At the end of the chapter, formulas for point and interval predictions were presented for the case when the trend in the time series model is either linear or quadratic. CONTROL TEST 5 5.1 The deterministic component of a time series model consists of (more of the following answers may be correct): a. trend component b. trend and seasonal components c. trend, seasonal and cyclical components d. seasonal and cyclical components 5.2 The periodical component of a time series model consists of: a. seasonal component b. trend and seasonal components c. trend, seasonal and cyclical components d. seasonal and cyclical components 5.3 Select an item from the left column of the following scheme, and decide what item from the right column it belongs to: (1) Additive time series model (A) has components summed together (2) Multiplicative t.s. model (B) is a line (3) Linear trend in a t.s. model (C) has components multiplied together 5.4 Complete the sentences: a. If the variance of the random part of a model is constant, the property is called __________. b. Random components of a model should be mutually __________. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 84 - 5.5. The following table contains a time series. Model the series with a quadratic trend (no other component, apart from the random one, is assumed in the model). Use the least squares. t 1 2 3 4 5 6 7 8 9 10 Yt 1,2 6,3 14,3 37,1 76,5 125 274 349 499 578 t 11 12 13 14 15 16 17 18 19 20 Yt 711 859 987 1114 1135 1349 1506 1680 1721 1890 Source: author’s 5.6. Model the time series below with moving averages of length five. t 1 2 3 4 5 6 7 Yt 18,683 15,236 20,552 20,988 30,598 23,22 38,375 T 8 9 10 11 12 13 14 15 Yt 43,698 47,813 61,403 62,002 68,386 63,904 68,247 67,818 Source: author’s 5.7 Using the data in the tables below, estimate the model ܻ௧ ൌ ߚ଴ ൅ ߚଵ‫ݔ‬௧ଵ ൅ ߚଶ‫ݔ‬௧ଶ ൅ ߝ௧, and verify the validity of the no-autocorrelation assumption with the Durbin-Watson’s test. The nivel of test is five per cent. t 1 2 3 4 5 6 7 8 xt1 3,3 3,4 3,5 3,5 3,4 3,3 3,4 3,2 xt2 5,9 6 6,2 6,3 6,3 5,9 5,9 5,8 Yt 25,3 23,02 19,9 20,95 18,59 16,15 15,22 17,26 t 9 10 11 12 13 14 15 16 17 xt1 3,2 3,1 3,1 3,1 3,2 3,1 3,1 3 3 xt2 5,5 5,4 5,2 4,8 4,8 4,7 4,6 4,5 4,5 Yt 18,98 20,09 18,65 17,79 20,84 16,69 18,33 16,79 16,48 Source: author’s Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 85 - SOLUTIONS 5.1 c. 5.2 d. 5.3 (1) – (A), (2) – (C), (3) – (B) 5.4 a. homoscedasticity b. uncorrelated. 
5.5 The least squares method applied to the model Y = β₀ + β₁t + β₂t² + ε gives the estimates β̂₀ = −127, β̂₁ = 38,34, β̂₂ = 3,27.

5.6 Moving averages:
t         3     4     5     6     7     8     9     10    11    12    13
average   21,2  22,1  26,7  31,4  36,7  42,9  50,7  56,7  60,7  64,8  66,1

5.7 The least squares give the estimates b₀ = 3,5, b₁ = 3,88, b₂ = 0,52. The residuals are

t      1      2      3       4      5       6       7       8       9
e(t)   5,898  3,173  −0,439  0,566  −1,407  −3,256  −4,568  −1,703  0,17

t      10     11     12      13     14      15      16      17
e(t)   1,727  0,389  −0,266  2,401  −1,313  0,379   −0,719  −1,03

The sample correlation is r = Σ_{t=2}^{17} e_t·e_{t−1} / Σ_{t=1}^{17} e_t² = 40,3 / 59,37 = 0,67. Further, the Durbin-Watson statistic equals DW = Σ_{t=2}^{17} (e_t − e_{t−1})² / Σ_{t=1}^{17} e_t² = 71,84 / 59,37 = 1,21. Since the model contains two parameters (the absolute term is excluded from the test), k is equal to 2. The sample size is n = 17. Therefore, the Durbin-Watson test table provides us with the values dL = 1,015 and dH = 1,536. Since DW lies between the two values, the test is inconclusive as to whether there is an autocorrelation in the model or not.

6 ANALYSIS OF VARIANCE

Chapter 2 of this textbook dealt with two-sample t-tests. This chapter is devoted to their extension in the form of analysis of variance - ANOVA. Analysis of variance ranks among the most frequently used statistical procedures in marketing as well as in other areas of data analysis. The method enables one to assess the potential influence of a qualitative or quantitative variable on another quantitative variable. For example, it is possible to evaluate the effects of different forms of a promotional campaign on the sales of a product. In this case, the different promotional campaigns represent different categories of the observed qualitative variable (the promotional campaign). The sales are then the quantitative variable in question. The potential effect can be expressed mathematically in such a way that the expression analyses whether a change in the level of the qualitative/quantitative variable changes the population mean of the other observed quantitative variable. In this sense, ANOVA tests if there are any differences among the population means of the quantitative variable.

Mathematically speaking, the basic idea of ANOVA is given by a decomposition of what is called the total variability of the observed variable. The decomposition is made up of different sources of the total variability. There is more than one term forming the decomposition. Some of the terms represent the main sources of the total variability. Another term is called the residual variability, which reflects the influence upon the total variability of all the other, minor sources. Depending on how many main sources or factors appear in the decomposition, we talk about one-way ANOVA, two-way ANOVA and so on.

A shrewd observer might come up with the suggestion that the two-sample t-test could be used several times instead. Such a procedure would test the potential differences of the various population means under scrutiny. In this case, if none of these tests ended up being significant, i.e. the null hypothesis of equal population means would always be accepted for each pair, we could conclude that all the population means are the same, i.e. we would conclude that the factor has no effect. Theoretically speaking, it is possible to proceed this way, but at the cost of the credibility of this procedure.
Recall that every statistical test is accompanied by errors, and if a whole series of tests is realized, the probability of these errors may accumulate to unacceptable levels. This is the reason why ANOVA was developed as a special procedure that keeps this probability at a reasonable level. We shall discuss one-way and two-way ANOVA in this chapter. ANOVA stands for „ANalysis Of VAriance“. After having studied the subject matter on ANOVA, we encourage the readers to try to solve the problems presented in this book on their own, and check their results against those presented at the end of the chapter.

6.1 ONE-WAY ANOVA

There are many situations when k independent data samples are available, and the samples do not come from the same population. The sample sizes are $n_1, n_2, \ldots, n_k$, k being equal to or greater than 2. The sample mean $\bar{x}_i$ can be calculated for the i-th sample, as well as the sample variance $s_i^2$. In practice, these samples usually originate by classifying the population into k classes by a factor X, and then randomly drawing a sample of size $n_i$ from each of these k classes. The variable X is called a factor, and its levels or categories are given beforehand, so the factor is called controllable. The categories of X are denoted $x_1, x_2, \ldots, x_k$.

Let the factor X be observed at k levels (categories) and potentially influence a quantitative statistical variable Y. The values of Y randomly obtained for the i-th category $x_i$ of X are denoted $y_{i1}, y_{i2}, \ldots, y_{in_i}$. It is convenient to organize the entry data into table 31.

Table 31: ANOVA table
Factor level   Data sample for the factor level                      Sample size   Sample mean   Sample variance
1              $y_{11}, y_{12}, \ldots, y_{1j}, \ldots, y_{1n_1}$    $n_1$         $\bar{y}_1$   $s_1^2$
2              $y_{21}, y_{22}, \ldots, y_{2j}, \ldots, y_{2n_2}$    $n_2$         $\bar{y}_2$   $s_2^2$
...
i              $y_{i1}, y_{i2}, \ldots, y_{ij}, \ldots, y_{in_i}$    $n_i$         $\bar{y}_i$   $s_i^2$
...
k              $y_{k1}, y_{k2}, \ldots, y_{kj}, \ldots, y_{kn_k}$    $n_k$         $\bar{y}_k$   $s_k^2$
Total                                                                N             $\bar{y}$     $s^2$

The main principle of the analysis of variance is to decompose the total variability of the observed variable. The total variability, measured by the sum of squared deviations of the individual values of the variable from their average, is divided by the decomposition into a part that reflects the variability within the samples and a part that reflects the variability between the samples. The total variability is usually measured by the sample variance

$s^2 = \dfrac{\sum_i \sum_j (y_{ij} - \bar{y})^2}{N - 1}$.

In analysis of variance, we are interested only in the numerator of the variance, where

$\bar{y} = \dfrac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}$.

We shall denote the total sum of squares, which represents the total variability, as $S_y$:

6-1   $S_y = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$.

The symbol $S_{y,v}$ will be used for the within-group variability, which is also called the residual variability:

6-2   $S_{y,v} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$.

The among-group variability, denoted $S_{y,m}$, is defined as

6-3   $S_{y,m} = \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2$.

Expressions 6-1 to 6-3 use $\bar{y}$, the sample average of all the values of y, as well as the subgroup averages $\bar{y}_i$ (see table 31). Using algebraic operations, the following fundamental formula for the one-way analysis of variance can be derived:

6-4   $S_y = S_{y,m} + S_{y,v}$.

Anglo-Saxon scholarly literature and software may denote the just-described variabilities with other symbols as well, for instance $S_y$ = SD (D for Difference), $S_{y,m}$ = ST (T for Treatment), $S_{y,v}$ = SR (R for Residual). We shall use our symbols in the rest of this chapter.
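To make formulas 6-1 to 6-4 concrete, the following short Python sketch computes the three sums of squares for a small invented data set and checks the decomposition 6-4. The sample values and the use of numpy are illustrative assumptions, not part of the textbook's procedure.

import numpy as np

# k = 3 invented samples (groups) of a quantitative variable Y
groups = [np.array([8.1, 8.0, 7.9]),
          np.array([7.7, 7.8, 7.9, 7.6]),
          np.array([7.6, 7.5, 7.6])]

all_y = np.concatenate(groups)
y_bar = all_y.mean()                                            # overall average

S_y  = ((all_y - y_bar) ** 2).sum()                             # 6-1: total sum of squares
S_yv = sum(((g - g.mean()) ** 2).sum() for g in groups)         # 6-2: within-group (residual) variability
S_ym = sum(len(g) * (g.mean() - y_bar) ** 2 for g in groups)    # 6-3: among-group variability

print(S_y, S_ym + S_yv)   # 6-4: the two numbers agree up to rounding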
6.1.1 ANOVA HYPOTHESES

Analysis of variance is a statistical test. Therefore, we work with a pair of hypotheses: a null hypothesis and an alternative hypothesis. Before specifying the test, we emphasize that ANOVA has conditions under which it was derived. The method assumes that each of the k random samples comes from a normal distribution, and that the distributions have the same variance. Also, the samples were drawn independently of each other. The prerequisite of normality can be tested in more than one way, using the chi-squared test, the Anderson-Darling test, the Kolmogorov-Smirnov test, the Shapiro-Wilk test, etc. Regarding the condition of constant variance, we described earlier the F-test which verifies the hypothesis of equality of two variances. In analysis of variance, more than two samples are usually worked with, and for this case an extension of the F-test exists in the form of Bartlett's test.

Let us return to ANOVA. Assuming the factor X is observed at k levels, the following relation is considered to hold true:

6-5   $\mu_i = \mu + \alpha_i$,   i = 1, 2, ..., k.

Here, $\mu_i$ is the population mean of the variable Y corresponding to the i-th level of factor X, $\mu$ is a constant, and $\alpha_i$ is called the effect. It is this effect that is supposed to express the potential differences among the population means of Y, the differences being caused by the different levels of factor X. Now, we may ask whether all k samples came from the same population, in other words, whether the populations the samples came from have the same means. Whether the effects $\alpha_i$ are all equal to zero would be yet another equivalent question. This question forms the null hypothesis of ANOVA:

6-6   H0: $\mu_1 = \mu_2 = \ldots = \mu_k$,

or

6-7   H0: $\alpha_1 = \alpha_2 = \ldots = \alpha_k = 0$.

The alternative hypothesis is the negation of 6-6 or 6-7. In the first case, this means that the alternative hypothesis takes the form: H1 - there exist indices i and j such that $\mu_i \neq \mu_j$. The test criterion of ANOVA is

6-8   $T = \dfrac{S_{y,m}/(k-1)}{S_{y,v}/(N-k)}$,

which follows a Fisher's distribution with k-1 and N-k degrees of freedom. The critical value of the test, $F_{k-1,N-k}(\alpha)$, for a nivel of test alpha is tabulated, or it can be obtained with the Excel function FINV(α, k-1, N-k). To sum up, testing the null hypothesis of ANOVA is characterized by the following steps:

Step 1. Select a nivel of test α. Alpha is usually 0,1, 0,05, 0,01, or 10 %, 5 %, 1 %, respectively.

Step 2. Calculate the test criterion T according to 6-8, where formulas 6-2 and 6-3 are used to get the within-group variability and the among-group variability, respectively. Formulas 6-9 and 6-10 may also be used. They are more convenient if the variabilities are to be calculated on a calculator.

6-9   $S_y = \sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}^2 - \dfrac{1}{N}\left(\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}\right)^2$,

6-10   $S_{y,m} = \sum_{i=1}^{k} n_i \bar{y}_i^2 - \dfrac{1}{N}\left(\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}\right)^2$,

6-11   $S_{y,v} = S_y - S_{y,m}$.

Step 3. Compare T from step 2 with the critical value $F_{k-1,N-k}(\alpha)$. If $T < F_{k-1,N-k}(\alpha)$, the null hypothesis H0 is accepted and the factor X can be pronounced not influential in relation to the variable Y. If $T \geq F_{k-1,N-k}(\alpha)$, the null hypothesis H0 is rejected, meaning the factor X has a statistically significant influence on the variable Y. If the test confirms that the factor X affects Y, we may ask which population means are different.
It can be the case that only population means are different, while all the other population means are the same. There are methods that try to answer this question, one of them being devised by Scheffé and one by Tukey. PROBLEM 1 The following table contains data obtained through several independent random samplings. The observed factor is the number of octanes used to describe the quality of car fuel (90, 91, 95, 98 octanes are usually available). Thus, the factor is monitored at four possible levels. For each of the levels, five car drivers using the fuel of the corresponding quality were randomly selected (see table 32). In this case, all samples have the same size, which is not required for one-way ANOVA. We want to know whether the quality of the fuel affects fuel consumption (car mileage). To answer the question, we shall employ ANOVA. Table 32: Car mileage for different types of fuel Factor levels 90 91 95 98 8,1 7,7 7,6 7,5 8 7,8 7,6 7,8 Samples 7,9 7,9 7,5 7,6 7,8 7,6 7,6 7,5 8,2 7,8 7,6 7,5 Source: author’s The nivel of test is set at 5%. Regarding the among-group variability, we must calculate the column averages (or group averages, more generally speaking). These are 8, 7,76, 7,58 and 7,58 for the first, second, third and fourth column of the table, respectively. The total average is 7,73. Using 6-3 with 5in = for every i, we have , 0,594.y mS = The within-group variability is 2 2 2 2 2 , (8,1 8) (8 8) ... (8,2 8) (7,7 7,76) ... (7,5 7,58) 0,228.y vS = − + − + + − + − + + − = We have N = 20 values altogether and the number of factor levels is k = 4. Therefore, , , / ( 1) 0,594 / 3 13,895. / ( ) 0,228 /16 y m y v S k T S N k − = = = − The critical value K = FINV(0,05,3,16) = 3,2389. Since the test criterion is greater than K, we reject the hypothesis that fuel quality has no effect on car mileage. In other words, it seems the factor does have an influence on car mileage. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 91 Excel: The same procedure can be performed in Excel with its Data Analysis module. The module offers one-way analysis of variance (see figure 9). Figure 9: The dialogue window of the Data Analysis module In the subsequent dialogue window (figure 10), it is necessary to insert as the Input Range a reference to the area of the Excel spreadsheet that contains the data samples to be worked with in ANOVA: Figure 10: Insertion of information for ANOVA The analyst must also confirm in the dialogue window whether each sample represents a column in the spreadsheet table, or a row (see figure 10). In our example, we work with columns. Once the nivel of test alpha is confirmed at 5%, or changed, and the location of the ANOVA output is selected, Excel returns a result of the form shown in table 33. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 92 Table 33: ANOVA results presented by Excel ANOVA Source of variability SS Difference MS F P-value F krit Among-groups 0,594 3 0,198 13,89474 0,0001015 3,238872 Within-groups 0,228 16 0,01425 Total 0,822 19 In this table, „F“ represents the test criterion, and „F krit“ stands for the critical value. As we can see, our calculations were correct. 6.1.2 A MEASURE OF DEPENDENCE Variability of iy ’s around y is caused by a dependence of Y on X. We described such variability with the among-group sum of squares ,y mS . The within-group variability ,y vS is, on the other hand, induced by factors other than X. Higher ,y mS implies a stronger dependence of Y on X. 
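Before quantifying this dependence formally, the following Python sketch (numpy and scipy assumed to be available) cross-checks Problem 1 and the Excel output in table 33; it also reports the share of the total variability accounted for by the factor, which anticipates the ratio introduced next.

import numpy as np
from scipy import stats

# Car mileage samples for the four fuel types (table 32)
fuel_90 = [8.1, 8.0, 7.9, 7.8, 8.2]
fuel_91 = [7.7, 7.8, 7.9, 7.6, 7.8]
fuel_95 = [7.6, 7.6, 7.5, 7.6, 7.6]
fuel_98 = [7.5, 7.8, 7.6, 7.5, 7.5]

f_stat, p_value = stats.f_oneway(fuel_90, fuel_91, fuel_95, fuel_98)
print(f_stat, p_value)        # approx. 13.89 and 0.0001, as in table 33

k, N = 4, 20
critical = stats.f.ppf(0.95, k - 1, N - k)    # counterpart of FINV(0.05, 3, 16)
print(critical)               # approx. 3.24; f_stat exceeds it, so H0 is rejected

S_ym, S_yv = 0.594, 0.228     # sums of squares from the manual calculation above
print(S_ym / (S_ym + S_yv))   # approx. 0.72: the share of variability explained by the factor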
Based on 6-4, this dependence can be measured, using the determination ratio, denoted P2 : 6-12 ,2 y m y S P S = . The square root of P2 is called the correlation ratio. P2 can take on any value from interval [0,1].The stronger the dependence of Y on X, the closer the characteristic is to one, and the closer the among-group sum of squares is to the total sum of squares (total variability). On the contrary, the within-group variability approaches zero in this situation. The closer the determination ratio is to zero, the smaller the part of the total variability which is accounted for by the among-group variability. In this case, the dependence of Y on X is weak. SUMMARY If we are interested in whether there is no statistically significant difference between two population means, we can verify such hypothesis or surmise with the two-sample t-test. Analysis of variance (ANOVA) enables us to verify the hypothesis that there is no difference between two or more population means. The procedure makes it also possible to test if different levels of one factor or more factors have any effect on another quantitative variable. Analysis of variance is based on the idea that total variability of a variable can be broken down to sub-variabilities each of which reflects its own source of variation. One type of variability is called residual variability, and is generated by sources that are not of interest to us, and are hard or impossible to identify. Depending on how many sources of the total variability we work with, we talk about one-way ANOVA, two-way ANOVA, three-way ANOVA, etc. This chapter was devoted to the first type: one-way ANOVA. In one-way ANOVA, the total variability/sum of squares is divided into two parts: one part represents the influence of the only factor considered, while the other part is represented by the residual variability/sum of squares. We assume that the only factor X is observed at Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 93 k possible levels, and we formulate the null hypothesis that all samples, each of which is obtained randomly for the given level of X, came from the same population. To verify the null hypothesis, we use statistic 6-8, which follows a Fisher’s distribution if the null hypothesis is true. The appropriate critical value is found for a given nivel of test alpha. In the end, the null hypothesis is either accepted, meaning that X has no effect, or we reject the null hypothesis, in which case the factor does exert an influence. Apart from the testing itself, which provides a yes/no answer to the existence of an effect of the factor, we can also measure the amount of the effect. This is done by evaluating the determination ratio, the values of which belong to closed interval [0, 1]. The stronger the influence, the higher the value of the ratio. Its square root is called the correlation ratio. CONTROL TEST 6 6.1 One-way ANOVA serves for (check the correct answer(s)): a. calculating frequency distribution of individual variables b. testing presence of effect of a factor on a quantitative variable c. finding probability distribution d. testing mutual correlation of statistical variables 6.2 In ANOVA, we a. test the null hypothesis that population means are the same, b. test the null hypothesis that two variables are mutually dependent, c. test the null hypothesis that value of a variable is different from a preedefined value d. test the null hypothesis that two statistical variables are mutually independent. 6.3 ANOVA uses the critical value of: a. 
a student’s distribution, b. a Pearson’s distribution, c. a Fisher’s distribution, d. a normal distribution. 6.4 Determine whether the following statements are true (write T) or false (write F): a. The F-test of equal variances should be used before ANOVA. b. The determination ratio takes on values from interval [0,1]. c. The smaller the among-group variability, the stronger the dependence between X and Y. d. The ANOVA test criterion must fall to a certain set for the ANOVA null hypothesis to be accepted. This set is a union of two intervals. e. The variance of ANOVA sample/group averages reflects the within-group variability. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 94 - 6.5 Complete the statement: a. If the ANOVA test criterion F falls to critical region, the variable Y can be considered to be __________ on/of X for a given nivel of test. b. ANOVA, where the single factor is observed at l different levels, and the total number of all observations of Y equals m, works with a Fisher’s distrbution with __________and __________degrees of freedom. c. The ANOVA test criterion F is always __________ (positive/negative). d. One-way ANOVA tests__________ Y on/of a factor X. 6.6 Complete the statement: a. The square root of the determination ratio is called __________ __________. b. If the ANOVA test criterion F falls to __________ __________ , the null hypothesis is rejected. c. To find the critical region in ANOVA, we must know ___________________ and __________. SOLUTIONS 6.1 b. 6.2 a. 6.3 c. 6.4 T, T, F, F, F 6.5 a. dependent, b. l-1 and m-l, c. positive d. independence 6.6 a. correlation ratio, b. critical region, c. degrees of freedom of a Fisher’s distribution and the nivel of test. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 95 - 7 TWO-WAY ANOVA AND LATIN SQUARES We got acquainted with one-way ANOVA in the previous chapter. Now we shall work with more factors. We are in a situation where influence of two or three factors on a quantitative variable is examined, the factors being again qualitative or quantitative. Therefore, we shall work with two-way and three-way ANOVA. Two-way and three-way ANOVA have their own experimental plans. More on experimental plans will be presented in later chapters. These plans can be designed in such a way that not much data is necessary for ANOVA to yield credible results. The plans play an important role in statistics since the more factors appear in the analysis, the more data is needed, and the increase in the amount of data is then fairly steep. If influence of two factors on a quantitative variable Y is observed, we talk about two-way ANOVA. As in the previous chapter, random sampling can be performed for different combinations of the two factors considered, and the data provides us later with the possibility of examining a potential influence of each of the two factors individually. An interaction of the two factors can also be regarded as another factor. We will skip interactions in our ANOVA presentation. Analogous statements are true for the three-way ANOVA case in which three major factors appear, and if it is required, two-factor interactions and a threefactor interaction may be analysed, as well. As was already said, a lot of data is required when more factors are included in ANOVA. Therefore, it is often the case that only one observation is made for each combination of the factors. We then talk about ANOVA with a single observation in each subgroup. 
The case of the same number of observations in each subgroup was exploited in the previous chapter. However, while one-way ANOVA can do without this experimental plan, the case of two-way and three-way ANOVA is more complicated, and it is strongly recommended that the requirement of the same number of observations be complied with whenever possible in practice since in the opposite case, ANOVA may be carried out in more than one way, and each of the ways give generally different results. This is not the case when the number of observations in each subgroup is the same. 7.1 TWO-WAY ANOVA In two-way ANOVA, two factors are considered. In this case, the total variability, inroduced in chapter six, is decomposed again, but this time into more terms each of which reflects an effect of the corresponding factor. As opposed to the case of one-way ANOVA, where two terms appeared in the decomposition, two-way ANOVA leads to three such terms. The additional term represents an effect of the second factor. The decomposition of the total variability takes the following form: 7-1 A B RS S S S= + + , where 7-2 2 1 1 ( ) k n ij j i S y y = = = −∑∑ , Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 96 - 7-3 ( ) 2 1 n A i i S k y y = = −∑ , 7-4 ( ) 2 1 k B j j S n y y = = −∑ 7-5 .R A BS S S S= − − Here, n is the number of levels/categories of factor A, k is the number of levels/categories of factor B. There are nk observations altogether with a single observation in each subgroup. The symbol iy is used to denote the average observation when A is at its i-th level, while jy is the average observation when B is at its j-th level. The symbol y denotes the average of all observations, as usual. The scheme of the experiment is given in table 34. Table 34: Data scheme for two-way ANOVA Factor B: levels B1 B2 … Bk A1 A2 Factor A: levels . . An jy ijy in the i-th row, j-th column iy The symbol iy can be regarded as the row average in our table representation, and the symbol jy as the column average. The sum AS reflects the effect of A, the sum BS represents the effect of B, the sum RS reflects the effect of all the other factors. The total variability of all observations is described by the term S . Since two factors are involved in this analysis, twoway ANOVA consists of two statistical tests. Each of the tests examines the statistical significance of one of the factors. 7.1.1 EFFECT OF FACTOR A We test the null hypothesis H0: Factor A has no effect on the variable Y. The alternative hypothesis says the opposite: H1: Factor A has an effect on the variable Y. The test criterion T is of the form 7-6 / ( 1) / ( 1) A R S n T S nk n k − = − − + . Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 97 The critical value K = 1, 1( )n nk n kF α− − − + for a nivel of test alpha, i.e. it concerns a Fisher’s distribution with n-1 and nk-n-k+1 degrees of freedom. If T K≥ , we reject the null hypothesis. In this case, we may conclude that the factor A does have an effect on Y. If T K< , we accept the null hypothesis, and the factor A does not have an effect on Y. 7.1.2 EFFECT OF FACTOR B We test the null hypothesis H0: Factor B has no effect on the variable Y. The alternative hypothesis says the opposite: H1: Factor B has an effect on the variable Y. The test criterion T is of the form 7-7 / ( 1) / ( 1) B R S k T S nk n k − = − − + . The critical value K = 1, 1( )k nk n kF α− − − + for a nivel of test alpha. IF T K≥ , we reject the null hypothesis, meaning that the factor B affects Y. 
On the other hand, if T K< , the null hypothesis is accepted, and the factor does not affect Y. PROBLEM 1 Two factors A, B are given. The factor A is observed at three levels, the factor B is considered at four levels. A single observation is available for each combination of the two factors. We assume the observations originated independently of each other, and they follow a normal distribution with equal variances. The nivel of test being five per cent, we want to test the potential influence of the two factors. The observations are in table 35. Table 35: Entry data for problem 1 B B1 B2 B3 B4 A1 24 25 25 23 A A2 22 21 22 25 A3 21 22 21 21 Source: author’s Table 36 is an extension of table 35, containing the row and column averages as well as the total average. Also, n = 3 and k = 4. Using formulas 7-2 through 7-5, we get Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 98 Table 36: ANOVA and various averages B B1 B2 B3 B4 averages A1 24 25 25 23 24,25 A A2 22 21 22 25 22,5 A3 21 22 21 21 21,25 averages 22,33333 22,66667 22,66667 23 22,66667 Total average 2 1 1 ( ) 30,66, k n ij j i S y y = = = − =∑∑ ( ) 2 1 18,166, n A i i S k y y = = − =∑ ( ) 2 1 0,66, k B j j S n y y = = − =∑ 11,833.R A BS S S S= − − = Therefore: To test the effect of A: / ( 1) 18,166 / 2 4,6. / ( 1) 11,833/ 6 A R S n T S nk n k − = = = − − + To test the effect of B: / ( 1) 0,66 / 3 0,1126. / ( 1) 11,833/ 6 B R S k T S nk n k − = = = − − + The critical values are: for the test of A: K = FINV(0,05,3-1,12-3-4+1) = 5,143. for the test of B: K = FINV(0,05,4-1,12-3-4+1) = 4,757. As we can see, none of the factors seems influential. Excel: The same problem can be solved in Excel, using the Data Analysis module. In this module, Anova: two factors without repetition is selected (figure 11). Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 99 Figure 11: The dialogue window of the Data Analysis module in Excel After selecting ANOVA in the dialogue window, a reference is made to the area of the spreadsheet containing the entry data, and the nivel of test is confirmed at five per cent or changed. Excel then returns the following result (table 37): Table 37: Two-way ANOVA results provided by Excel ANOVA Source of variability SS Difference MS F P-vaue F krit Rows 18,16666667 2 9,083333 4,605634 0,06137 5,143253 Columns 0,666666667 3 0,222222 0,112676 0,949497 4,757063 Error 11,83333333 6 1,972222 Total 30,66666667 11 The interpretation of the table is the same as in the case of one-way ANOVA. The second column contains the terms making up the total variability in 7-1. The F symbol is the test criterion for the two tests of ANOVA and F krit stands for the corresponding critical values. 7.2 THREE-WAY ANOVA (LATIN SQUARES) A special case of a three-factor procedure, called Latin squares, belongs to analysis of variance, as well. We shall describe the procedure at the end of this chapter. Latin squares rank among classical methods of experimental design. The name of the procedure dates back to the eighteenth century when L. Euler (1707 – 1783) presented to the Petrohrad-based academy a problem on 36 commissioned officers: the task was to position officers of 6 different ranks from 6 different regiments on a square in such a way that each row and column of the square contained officers of all ranks from all regiments. More generally speaking: Let us have objects with two properties of interest: A and B (example: A = rank, B = regiment). 
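Before continuing with the Latin-square construction, here is a brief numerical aside: a Python sketch (numpy assumed) reproducing the two-way ANOVA of Problem 1 above, i.e. the sums 7-2 to 7-5 and the criteria 7-6 and 7-7.

import numpy as np

# Table 35: rows = levels of A (n = 3), columns = levels of B (k = 4)
y = np.array([[24, 25, 25, 23],
              [22, 21, 22, 25],
              [21, 22, 21, 21]], dtype=float)
n, k = y.shape
y_bar = y.mean()

S   = ((y - y_bar) ** 2).sum()                        # 7-2: total variability
S_A = k * ((y.mean(axis=1) - y_bar) ** 2).sum()       # 7-3: effect of factor A (row averages)
S_B = n * ((y.mean(axis=0) - y_bar) ** 2).sum()       # 7-4: effect of factor B (column averages)
S_R = S - S_A - S_B                                   # 7-5: residual variability

T_A = (S_A / (n - 1)) / (S_R / (n * k - n - k + 1))   # 7-6: approx. 4.61
T_B = (S_B / (k - 1)) / (S_R / (n * k - n - k + 1))   # 7-7: approx. 0.11
print(S_A, S_B, S_R, T_A, T_B)                        # matches table 37 up to rounding

The critical values quoted in the solution, 5,143 and 4,757, correspond to scipy.stats.f.ppf(0.95, 2, 6) and scipy.stats.f.ppf(0.95, 3, 6). We now return to the two properties A and B introduced above.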
Each property may take on n different forms (example: n = 6, 6 different military ranks, such as private, corporal, sergeant, captain, major and colonel; 6 different regiments). The task is to set up the n2 objects, each having a different A and B properties, so that each row and column was occupied by objects that do not have the same A and B properties Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 100 (example: the first row is occupied by the private from the first regiment, the corporal from the second regiment, etc.). Such a scheme is called the Latin square of order n. The wellknown mathematical result discovered by Euler himself says that at least one Latin square of order n exists for any integer n. We will use Latin squares for three-way analysis of variance. Let us have three factors an effect of which on another variable Y is conceivable at the moment. Since three factors are involved in the analysis, it is hard to represent the whole experiment with a two-dimensional table. Another problem is that the number of data increases significantly as more factors are included in the analysis. However, it is possible to consider each combination of the factor levels, and utilize a single observation for each of these combinations. Such a procedure allows us to represent the experiment with a table. The heading of the table will describe different levels of two factors, while the interior of the table will contain a level of the third factor together with the single observation realized by an experiment. The levels of the third factor will be selected in such a way that the whole scheme results in a Latin square. Let us denote the three factors A, B and C. When talking about a Latin square of order n = 3, we may describe our experiment in the form of table 38 Table 38: Three-way ANOVA scheme in the form of a Latin square a b c b c a c a b One side of the square represents the three levels of factor A. The adjoining side of the square represents the three levels of factor B. The interior of the table contains the levels of the third factor C. To read the square correctly, we say, for instance, that when both the factor A and B are at their first level, the factor C is as well at its first level (this corresponds to the element [1,1] of the table). For this combination of the levels, a single observation of the variable Y is realized and inscribed into the table. The analogous procedure holds true for the other elements of the table. One of the merits of this experiment is that we need only 9 observations instead of 27 we would have needed if we had wanted to use all the possible combinations of the factors. At the same time, the design of the experiment is such that the analysis of variance will give credible results. The total variability S in three-way ANOVA has the form 7-8 A B C RS S S S S= + + + , where 7-9 2 1 ( ) n A i i S n y y•• = = −∑ reflects the effect of factor A, and iy •• is the average of Y when A is at its i-th level, Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 101 - 7-10 2 1 ( ) n B j j S n y y• • = = −∑ reflects the effect of factor B, and jy• • is the average of Y when B is at its j-th level, and 7-11 2 1 ( ) n C k k S n y y•• = = −∑ reflects the effect of factor C, where ky •• is the average of Y when C is at its k-th level. Finally, 7-12 2 ( )ijk i j k S y y= −∑∑∑ is the total sum of squares. The expression contains individual observations of Y, i.e. the terms ijky , and also the average of all observations y . Also, 7-13 R A B CS S S S S= − − − . 
is the residual sum of squares. Three-way ANOVA comprises three statistical tests. Each of these tests relates to one of the factors. Also, in each of the tests, the null hypothesis has the form: H0 – the tested factor is insignificant. The alternative hypothesis is: H1 the tested factor has an effect. The tests criteria for each test are summarized in table 39: Table 39: Three-way ANOVA Source of variability Sum of squares Degrees of freedom Estimate of variance Test criterion Factor A SA dfA=n-1 MSA=SA / dfA FA=MSA / MSR Factor B SB dfB=n-1 MSB=SB / dfB FB=MSB / MSR Factor C SC dfC=n-1 MSC=SC / dfC FC=MSC / MSR Residuals SR dfR=(n-1)(n-2) MSR=SR / dfR Total S dfT=n2 -1 If FA ≥ 1,( 1)( 2) ( )n n nF α− − − , the null hypothesis is rejected, and factor A is considered significant. If the opposite inequality holds, the null hypothesis is accepted. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 102 If FB ≥ 1,( 1)( 2) ( )n n nF α− − − , the null hypothesis is rejected, and factor B is considered significant. If the opposite inequality holds, the null hypothesis is accepted. If FC ≥ 1,( 1)( 2) ( )n n nF α− − − , the null hypothesis is rejected, and factor C is considered significant. If the opposite inequality holds, the null hypothesis is accepted. PROBLEM 2 Fuel emission Y is studied, and its potential dependence on the following three factors: Factor 1 = petrol ingredient (A, B, C, D), Factor 2 = car driver (I, II, III, IV), Factor 3 = vehicle used (1, 2, 3, 4). The result of the corresponding experiment is in table 40 Table 40: Entry data for the three-way ANOVA in problem 2 driver\vehicle 1 2 3 4 I A : 21 B : 26 D : 20 C : 25 II D : 23 C : 26 A : 20 B : 27 III B : 15 D : 13 C : 16 A : 16 IV C : 17 A : 15 B : 20 D : 20 We are testing the potential effect of the individual factors on Y, the nivel of test being five per cent. According to formulas 7-9 through 7-13, we have 2 1 1 ( ) 40. n i i S n y y•• = = − =∑ 2 2 1 ( ) 216. n j j S n y y• • = = − =∑ 2 3 1 ( ) 24. n k k S n y y•• = = − =∑ Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 103 - 2 ( ) 296.ijk i j k S y y= − =∑∑∑ 16.R A B CS S S S S= − − − = The test criteria are: For factor 1: (40 / 3) 5. (16 / 6) T = = For factor 2: (216 / 3) 27. (16 / 6) T = = For factor 3: (24 / 3) 3. (16 / 6) T = = The critical value is the same in all three tests: K = FINV(0,05,3,6) = 4,757. This means the factors 1 and 2 are statistically significant, whereas type of vehicle used is not. SUMMARY This chapter described three-way ANOVA, in particular its special form called Latin squares, and also two-way ANOVA with a single observation in each subgroup. We explained the purpose of these methods and the mathematical technique behind it. The reader became familiar with the terms two-factor and three-factor analysis of variance, Latin squares, decomposition of total variability and ANOVA table. The following problems allow the reader to practise the methods, including one-way ANOVA from the previous chapter. CONTROL TEST 7 1) The nivel of test being 5 per cent, test if parsley yields depend on type of fertilizer used. All the necessary observations are in table 41. Table 41: Entry data for ANOVA Fertilizer Yields (1kg/10m2 ) A 40 42 45 40 44 47 B 76 75 82 68 C 60 58 62 64 70 Source: author’s 2) Evaluate intensity of dependence between the parsley yields and the type of fertilizer, using the determination ratio characteristic. 
Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 104 - 3) Six drivers were randomly selected, each of them experiencing a ride with different gasoline. Test whether gasoline consumption depends on type of gasoline used and/or driver. The nivel of test is 5%. The data are in table 42. Table 42: Entry data for ANOVA Driver Gasoline A B C D E F Averages Aral 7,5 6,9 7,9 7,3 6,9 7,8 7,38 Shell 7,6 7,2 7,5 8 7,3 8,2 7,63 Benzina 7,2 8,1 7,8 7,6 7,8 6,9 7,57 Slovnaft 7 7,3 7,2 7,5 8,2 7,7 7,48 Averages 7,33 7,38 7,6 7,6 7,55 7,65 7,5 Source: author’s SOLUTIONS 1) There are three types of fertilizers, i.e. k = 3, and the corresponding samples of observations are of the size 1 2 36, 4 5n n n= = = , respectively. The total number of observations is N = 15. Tested is the hypothesis 0 1 2 3H : µ µ µ= = , i.e. parsley yields do not depend on type of fertilizer used. To perform the test, we make the following preliminary calculations: • Conditional averages 1 in ij j i i y y n = = ∑ , for 1,2...,i k= , where ijy are observations. • Total average 1 1 1 1 1ink k ij i i j i y y y N k= = = = =∑∑ ∑ , • Among-group variability ( ) 2 , 1 k y m i i i S n y y = = −∑ , where: in is the number of observations in the i-th group, iy is the sample average in the i-th data group. • Within-group variability ( ) 2 , 1 1 ink y v ij i i j S y y = = = −∑∑ . • Total variability , ,y y m y vS S S= + . We get: 1 2 3 40 42 ... 47 43, 6 76 75 ... 68 75,25, 4 62,8, y y y + + + = = + + + = = = Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 105 - 43 75,25 62,8 60,35. 3 y + + = = ( ) ( ) ( ) ( ) 2 2 2 2 , 1 6 43 60,35 4 75,25 60,35 5 62,8 60,35 2724,188. k y m i i i S n y y = = − = − + − + − =∑ ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) 2 2 2 2 , 1 1 2 2 2 2 2 2 40 43 42 43 ... 47 43 76 75,25 75 75,25 ... 68 75,25 60 62,8 58 62,8 ... 70 62,8 223,55. ink y v ij i i j S y y = = = − = − + − + + − + + − + − + + − + + − + − + + − = ∑∑ The results are summarized in table 43: Table 43: ANOVA output Source of variability Sums of squares Degrees of freedom Averaged sums of squares Test criterion F Factor x (among-group variability) 2724,188 k – 1 = 2 1362,1 73,12 Residual (within-group) variability 223,55 N – k = 12 18,63 Total variability 2947,74 N – 1 = 14 The test criterion 73,12T = , the critical value 2,12 (0,05) 3,89F = , the critical region is [3,89; )C = + ∞ . Since T belongs to the critical region, we reject the null hypothesis. Parsley yields depend on type of fertilizer used. 2) To answer the question: „How strong is the relationship between type of fertilizer used and parsley yields?“, we calculate the correlation ratio ,y m y S P S = , where myS , is the among-group variability, yS is the total variability. We get 2724,188 0,92 0,96 2947,74 P = = = . Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 106 Raising the result to the second power, we obtain the determination ratio 2 0,922P = . A value of the determination ratio which is close to one signals a strong dependence of parsley yields on type of fertilizer used. 3) The table with entry data already contains the subgroup averages needed for testing dependence of fuel consumption Y on type of gasoline X1 and driver X2. There are four levels of X1, 4=k , and six levels of X2, 6=r . In the case of X1, we test the null hypothesis: factor X1 is not influential, versus the alternative hypthesis: X1 is influential. The hypotheses for the second factor X2 are analogous. The individual sums of squares are: ( ) ( ) ( )1 4 2 2 2 . 1 6 7,38 7,5 ... 
7,48 7,5 0,21.X i i S r y y =  = − = − + + − =  ∑ ( ) ( ) ( )2 6 2 2 2 . 1 4 7,33 7,5 ... 7,65 7,5 0,358.X j j S k y y =  = − = − + + − =  ∑ The most straightforward way to calculate the residual sum of squares RS is to calculate first the total variability S, and then evaluate the residual sum of squares from equation 1 2R X XS S S S= + + . ( ) ( ) ( ) ( ) ( ) 24 6 2 2 2 2 1 1 7,5 7,5 6,9 7,5 ... 8,2 7,5 7,7 7,5 3,79.ij i j S y y = = = − = − + − + + − + − =∑∑ Thus, 3,22.RS = The test criterion for the first factor is ( )( ) 1 0,21 1 3 0,33. 3,22 151 1 X R S kT S k r −= = = − − The critical value is 3,15(0,05) 3,29F = . Since 0,33 < 3,29, we cannot reject the null hypothesis. Therefore, type of gasoline used seems to have no effect on fuel consumption. The test criterion for the second factor is ( )( ) 2 0,36 51 0,33. 3,22 151 1 X R S rT S k r −= = = − − The critical value is 5,15(0,05) 2,9F = . Since 0,34 < 2,9, we accept the null hypothesis again, therefore type of person driving the car does not affect fuel consumption either. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 107 - 8 FULL FACTORIAL EXPERIMENTAL PLANS This chapter covers the foundations of design of experiments, a branch of statistics dominated in industrial applications. To experiment means to change working conditions so that the best possible working procedures are found, and more knowledge is acquired about the product and related working process. The best possible working procedures can be viewed in the following sense: denoting Y the observed quality characteristic of a product (or we can work with more such characteristics 1 2 , ,..., k Y Y Y ) and denoting A, B, C,… the factors which potentially affect the product, their levels being 1 2 3 , , ,...A A A for factor A, 1 2 3 , , ,...B B B for factor B and so on, design of experiments aims to determine which of the factors are influential and what their optimal levels are with respect to optimal levels of the quality characteristics. 8.1 FOUNDATIONS OF EXPERIMENTING AND ITS APPLICATIONS In our context, experimenting means analysing various combinations of the levels of those factors that are thought to affect the observed quality characteristic of a product. The characteristic is a response, a result of an experiment. The response is related to a certain combination of the factor levels. Analysing the relations is how one can attain the objectives outlined in the previous paragraph. The outlined objectives may be achieved in more than one way. Some of the appropriate procedures were already described in the previous chapters. For example, to determine which factor is influential, analysis of variance can be exploited for this purpose. Regression analysis could also be used to define suitable levels of the factors with respect to the process output, represented in the regression equation by the dependent variable. The problem with these methods is that they often require a lot of data. Also, the methods are based on a set of conditions that must be met for the particular method to work correctly. Some of the conditions are even impossible to verify with a prescribed high probability, such as the form of the regression model governing the relation between several variables. For these and other reasons, design of experiments as a separate branch originated decades ago, to help solve reasonably the drawbacks just explained. We shall start the description of the discipline by explaining the meaning of several elementary terms the discipline works with. 
Factor: a parameter or independent variable affecting the observed quality characteristic of a product. We denote a factor with a capital letter, such as A, and its levels with the same letter with a lower index, such as 1 A (the first level of factor A). Two elementary types of factors exist: a) regulated factor is a variable thought to affect the observed quality characteristic. The levels of the variable can be set up and maintained, and it is desirable to do so. b) noise factor is a factor that has an adverse effect on the quality characteristic. The levels of the factor cannot be set up and maintained during the experiment, or it is not desirable to do so. Interaction of factors: a combined effect of two or more factors. In this case, the effect of one factor of the interaction generally depends on the effect of another factor of the interaction. The interaction of factors A and B is denoted as AB. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 108 Design of experiments is applied in areas, such as: simulation, product design and development, process design and development, testing and validation, solving production quality problems, measurement system analysis and improvement. 8.2 EXPERIMENTAL PROCEDURE The steps that must be taken to detect influential factors and their optimal levels form an experimental procedure. These steps are: planning the experiment (using brainstorming, for instance), designing the experiment, realizing the experiment, analysing the the results of the experiment. Planning the experiment means to set up an experimental team in the first place. Representatives of the departments that design the product and the process leading to the product should be members of the team. The team should have 2-15 members. The team members ought to participate in brainstorming sessions which will determine what quality or output characteristics of the product will be monitored, and which factors should be considered together with their starting levels. The planning phase results in defining the objective of the experiment, the objective being related to the product under scrutiny, and it should also define the characteristics of the product, based on which it will be possible to make a judgement on whether the objectives were achieved. Another result of the planning phase is a list of factors that could potentially play a role in defining the product quality. These information are used in the second phase – in designing the experiment, which results in an experimental plan. Such a plan is a table consisting of individual experimental runs. Each run, which is represented by a row in the table, tells the experimenter the levels the factors should take on during the experimental run. Later, we shall demonstrate this procedure in an example. Realizing the experiment takes place either separately at a laboratory or at the production line. The latter case is more challenging, of course, since the production capacity on the one hand, and the requirement to analyse the production through an experiment on the other hand, can lead to a clash of interests. Therefore, in the latter case, night shifts or weekends usually provide an opportunity to run the experiment outside laboratory. Analysing the results of the experiment means to seek a combination of the input factors that will optimize (at least approximately) the observed characteristic of the product. In the final step, the optimal factor set-up is verified by subsequent experiments and/or simulations. 
PROBLEM 1 (Spring) This problem demonstrates how to set up a full factorial experimental plan, using coded variables. What we are now interested in is how much pressure an industrial spring can withstand, the spring being compressed by a machine tool until it breaks down. The following factors are considered in the experiment. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 109 L = length of the spring, G = width of the spring, T = material of the spring. It is to be determined which factors influence the service life of the spring. SOLUTION Let us construct a table of the factor levels considered: we shall use two levels for each factor, therefore the corresponding experimental plan is called a two-level plan (there are also threelevel plans). Table 44: Factors and their levels Factor Symbol Lower level Upper level úroveň- + Spring length pružiny L 10 cm 15 cm Spring width G 5 mm 7 mm Spring material T A B Source: author’s There is more than one way how to build an experimental plan that will prescribe individual experimental runs. The so-called full factorial plan is one of the most frequently used schemes: Table 45: Full factorial plan Run L G T Y 1 10 5 A 2 15 5 A 3 10 7 A 4 15 7 A 5 10 5 B 6 15 5 B 7 10 7 B 8 15 7 B We talk about a full factorial plan because such a plan contains all possible combinations of the factor levels. The symbol Y will be used to denote the result of an experiment, i.e. the response to a specific combination of the factor levels. It is more suitable to prescribe an experimental plan, using the following symbols: the lower and the upper level of a factor is denoted -1 and +1, respectively. Table 45 then takes the form of table 46 Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 110 Table 46: Full factorial experimental plan using coded variables Run L G T Y 1 -1 -1 -1 2 +1 -1 -1 3 -1 +1 -1 4 +1 +1 -1 5 -1 -1 +1 6 +1 -1 +1 7 -1 +1 +1 8 +1 +1 +1 The conversion of the original variables to the coded variables, but not only their upper and lower levels as we do here, can be done according to equation 8-1 max min 0 max min 2 2 c x x x x x x + − = − , where 0 x is the variable in the original physical units, cx is the coded variable, maxx is the upper level of x, minx is the lower level of x. For instance, the conversion of factor L leads to 15 10 10 2 1 15 10 2 cL + − = = − − , or the upper level of factor G, G = 7, is 7 5 7 2 1 7 5 2 cG + − = = + − . A full factorial experimental plan containing k factors represents 2k n = experimental runs. For example, if k = 3 factors are considered in an experiment, there will be 3 2 8n = = experimental runs altogether. Therefore, there will be eight rows in the table of the experimental plan. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 111 Experimental plan prescribes how and under which conditions to proceed with the experiment. After applying the plan, values of the observed variable Y resulting from the experiment can be recorded. In our example, each experimental run has been carried out twice, and the results are recorded in table 47. Table 47: Full factorial plan and measurements Run Factor Factor Factor Output Output Average L G T 1Y 2Y Y 1 - - - 77 81 79 2 + - - 98 96 97 3 - + - 76 74 75 4 + + - 90 94 92 5 - - + 63 65 64 6 + - + 82 86 84 7 - + + 72 74 73 8 + + + 92 88 90 Source: author’s Table 47 ends the preparatory and experimental work. What follows now are calculations which are to determine the influential factors. 
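Before turning to those calculations, the following Python sketch shows one way to generate the coded plan of table 46 and to apply the coding rule 8-1. The helper function name is an illustrative choice, not part of the textbook.

from itertools import product

# All 2**3 = 8 sign combinations for the factors L, G, T
# (the run order may differ from table 46; the set of combinations is the same)
plan = list(product([-1, +1], repeat=3))
for run, (L, G, T) in enumerate(plan, start=1):
    print(run, L, G, T)

def to_coded(x0, x_min, x_max):
    """Coding rule 8-1: maps x_min to -1 and x_max to +1."""
    return (x0 - (x_max + x_min) / 2) / ((x_max - x_min) / 2)

print(to_coded(15, 10, 15))   # upper level of L gives +1
print(to_coded(5, 5, 7))      # lower level of G gives -1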
These factors may also include the interactions LG, LT, GT and LGT; therefore, interactions are usually included in experimental plans as well. This can be done after the experiment has run its course. The levels of the interactions are calculated additionally by multiplying the elements of the table that lie in the same row and in the columns of those factors which form the interaction. For instance, to calculate the levels of the interaction LG, the signs, or ones with the corresponding sign, are taken from a given row and the columns marked „L“ and „G“, and then they are multiplied. Table 48 represents the result of this procedure.

Table 48: Full factorial plan and interactions
Run   L   G   T   LG   LT   GT   LGT
1     -   -   -   +    +    +    -
2     +   -   -   -    -    +    +
3     -   +   -   -    +    -    +
4     +   +   -   +    -    -    -
5     -   -   +   +    -    -    +
6     +   -   +   -    +    -    -
7     -   +   +   -    -    +    -
8     +   +   +   +    +    +    +

8.3 EFFECT OF A FACTOR AND ITS SIGNIFICANCE

The effect of a factor is the change in the quality characteristic Y induced by changing the level of that factor from -1 to +1. To calculate the effect, we shall use the sign method: in the table of the experiment, each value from the column of Y is multiplied by the corresponding number one of the analysed factor (the one from the same row of the table), the products are summed together, and the sum is divided by one half of the number of experimental runs (the number of rows in the table). For instance, to calculate the effect of factor L in our example, we get

effect(L) = (1/4)(-79 + 97 - 75 + 92 - 64 + 84 - 73 + 90) = 18.

The effect of T is

effect(T) = (1/4)(-79 - 97 - 75 - 92 + 64 + 84 + 73 + 90) = -8.

An analogous procedure is used for the other factors, including their interactions:

effect(LG) = (1/4)(79 - 97 - 75 + 92 + 64 - 84 - 73 + 90) = -1.

The effects of all the factors from our example are recorded in the last row of table 49.

Table 49: Full factorial plan and factor effects
Run     L    G     T    LG   LT    GT   LGT   Y
1       -    -     -    +    +     +    -     79
2       +    -     -    -    -     +    +     97
3       -    +     -    -    +     -    +     75
4       +    +     -    +    -     -    -     92
5       -    -     +    +    -     -    +     64
6       +    -     +    -    +     -    -     84
7       -    +     +    -    -     +    -     73
8       +    +     +    +    +     +    +     90
Effect  18   1,5   -8   -1   0,5   6    -0,5
Source: author’s

Now, we would like to find out which of the effects are significant. To do that statistically, the variance of a factor effect, $\sigma_e^2$, must be estimated, the effect being a random variable since we work with a data sample. Under suitable conditions, this variance is the same for all factor effects:

8-2   $\sigma_e^2 = \dfrac{4\sigma^2}{N}$,

where N is the number of experimental runs (including their repetitions, if there are any). In our example, N = 16, as the table has eight rows, i.e. there are eight runs, and each run is implemented twice. If runs are repeated, we can calculate

8-3   $s^2 = \dfrac{\nu_1 s_1^2 + \ldots + \nu_k s_k^2}{\nu_1 + \ldots + \nu_k}$,

where $\nu_i = n_i - 1$, $n_i$ is the number of repetitions of the i-th experimental run, and $s_i^2$ is the sample variance of Y corresponding to the i-th experimental run. Now, the estimate of 8-2 is

8-4   $s_e^2 = \dfrac{4 s^2}{N}$.
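Before testing significance, the sign method above can be checked with a few lines of Python. The sketch below recomputes the effects in table 49 from the averaged responses (numpy assumed).

import numpy as np

# Coded columns of table 49 (runs 1-8) and the averaged responses Y
L = np.array([-1, +1, -1, +1, -1, +1, -1, +1])
G = np.array([-1, -1, +1, +1, -1, -1, +1, +1])
T = np.array([-1, -1, -1, -1, +1, +1, +1, +1])
Y = np.array([79, 97, 75, 92, 64, 84, 73, 90], dtype=float)

def effect(signs):
    # sign method: sum of sign * Y divided by one half of the number of runs
    return (signs * Y).sum() / (len(Y) / 2)

for name, col in [("L", L), ("G", G), ("T", T),
                  ("LG", L * G), ("LT", L * T), ("GT", G * T), ("LGT", L * G * T)]:
    print(name, effect(col))   # 18, 1.5, -8, -1, 0.5, 6, -0.5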
8.3.1 STATISTICAL TEST OF FACTOR SIGNIFICANCE

Estimate 8-4 is used to test the significance of a factor effect:

1. The null hypothesis is H0: The factor effect is insignificant; the alternative is H1: The factor effect is significant.
2. The test criterion is of the form $t = \text{effect} / s_e$.
3. The critical value is K = $t_{n_1 + n_2 + \ldots + n_k - n}(\alpha)$, where $n_1, \ldots, n_k$ are the numbers of repetitions of the first, second, ..., k-th experimental run, respectively (in our example, $n_i = 2$ for each i), and n is the number of experimental runs without the repetitions, i.e. the number of rows in the experimental plan. As we can see, the critical value concerns a Student's distribution.
4. If $|t| \geq t_{n_1 + n_2 + \ldots + n_k - n}(\alpha)$, the null hypothesis is rejected, and the factor is considered significant. In the opposite case, the factor is regarded as insignificant.

PROBLEM 2

Let us return to the Spring example. We have K = $t_{16-8}(0,05) = 2,306$. Using 8-3, we also get

$s^2 = \dfrac{8 + 2 + 2 + 8 + 2 + 8 + 2 + 8}{8} = 5$,

and this result inserted in 8-4 leads to

$s_e^2 = \dfrac{4 s^2}{N} = \dfrac{4 \cdot 5}{16} = 1,25$,   i.e. $s_e = 1,12$.

This allows us to evaluate the test criteria for all the factors, including interactions. The criteria are in table 50.

Table 50: Test of factor significance
Run   Y1   Y2   Effect        t
1     77   81   L = 18        16,07
2     98   96   G = 1,5       1,34
3     76   74   LG = -1,0     -0,89
4     90   94   T = -8,0      -7,14
5     63   65   LT = 0,5      0,45
6     82   86   GT = 6,0      5,36
7     72   74   LGT = -0,5    -0,45
8     92   88
Source: author’s

The critical value K = TINV(0,05,8) = 2,306 is exceeded in absolute value by the test criteria of the factors L, T and GT. Therefore, these three factors are significant, and the remaining factors are not, as far as their effect on the spring service life is concerned.

8.3.2 GRAPHICAL ASSESSMENT OF FACTOR SIGNIFICANCE

If the experimental runs are not repeated, the test above cannot be used. In such cases, a graphical method for detecting the influential factors exists. The method, as suggested by its name, is based on constructing a graph, on the horizontal axis of which the factor effects are designated. The vertical axis of the graph records the values

8-5   $P_i = \dfrac{100\,(i - 0,5)}{m}$,

where i = 1, 2, ..., m, m being the number of all the factors of the experiment including interactions. More precisely, the graph is a set of points [effect(i), $P_i$], where effect(i) is the i-th smallest effect among all the calculated effects. The points of the graph that seem to lie outside the central line running through the middle section of the graph suggest the significant factors. If the graph takes the form of an S-curve, which is the case under certain conditions, then some of the points of the graph will turn away from the approximately linear middle section of the S-curve. Those are the points that signal which factors are influential. When using the graphical method, it is convenient to set up a table similar to table 51, which contains the factor effects sorted in ascending order.

Table 51: Auxiliary data for graphical assessment of factor significance
Number i   1      2      3      4     5      6      7
Effect     -8,0   -1,0   -0,5   0,5   1,5    6,0    18
Factor     T      LG     LGT    LT    G      GT     L
P_i        7,14   21,42  35,71  50    64,29  78,57  92,86
Source: author’s

The second and the fourth row of table 51 are the coordinates of the points that form the graph (figure 12).

Figure 12: Points determining significance of factors

The graph shows that it is the points of the factors whose effects turn out to be significant that lie outside the line running through the middle section of the graph. These points relate to factors L, T and GT.

8.3.3 GRAPH OF INTERACTIONS

Significant interactions are usually accompanied by graphs that allow for a discussion of the optimal levels of the factors making up the interactions. For instance, the interaction GT can be accompanied by a graph which outlines the effect of the factor G on Y, depending on the level of the factor T. To do that, we can scrutinize the full experimental plan and select from it the values of Y corresponding to the different levels of G and T. Table 52 contains these values, together with the average value of Y.

Table 52: Responses of Y to various levels of G and T
G   T   Response 1   Response 2   Average Y
-   -   79           97           88
+   -   75           92           83,5
-   +   64           84           74
+   +   73           90           81,5
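The averages in table 52 can be computed and drawn with a short Python sketch (matplotlib assumed); the discussion below then interprets the two connected lines.

import matplotlib.pyplot as plt

g_levels   = [5, 7]          # physical levels of G (spring width in mm)
avg_T_low  = [88, 83.5]      # average Y for T at its lower level (table 52)
avg_T_high = [74, 81.5]      # average Y for T at its upper level

plt.plot(g_levels, avg_T_low, marker="o", label="T = lower level")
plt.plot(g_levels, avg_T_high, marker="o", label="T = upper level")
plt.xlabel("G (spring width, mm)")
plt.ylabel("average response Y")
plt.legend()
plt.show()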
Each of the coefficients 1 2 123, ,...,b b b can be calculated as one half of the effect of the factor the coefficient belongs to in the model. The absolute term of the model is .0 Yb = These are exactly the values one would get by applying the least squares method to the matrix of regressors, represented by the full experimental plan. In our Spring example, we have ˆy = 81,75 + 9L – 4T + 3GT. There are many reasons why such a model is constructed. It is constructed 1. to determine local minima/maxima of the factors involved, Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 117 - 2. to determine the direction of the so-called dynamic experimental planning, which shifts the experiment to a new subset in the domain of ‫ݕ‬ො, the new subset serving as an area for a new experiment. The shift is usually carried out, using the gradient of the model found. 3. to make local predictions of the quality characteristic ˆy . SUMMARY The reader has been acquainted with foundations of design of experiments in this chapter. It is possible to create a full experimental plan, or a fractional/partial experimental plan which will be analysed in the next chapter. Fractional plans are constructed if conditions of the experiment are such that not all experimental runs can be performed (the whole experiment would be too costly, for instance). Each factor participating in the experiment may potentially affect Y, a variable of interest. An effect is present if a change in the level of the factor leads to a change in Y. The presence of the effect can be verified either by a graphical method or a statistical test. Such procedures were described in the chapter. Planned experiments follow an experimental plan which prescribes individual experimental runs to be carried out and the order of the runs in which they are to be realized. There are two terms that must be distinguished in connection with experimenting: experimental run, which leads to measurements of Y, the variable of interest, under a given set of conditions; these conditions are represented by a specific row in the experimental plan; experiment, which is the set of all experimental runs. The aim of experimental planning is to determine which factors have a statistically significant effect on a quality characteristic Y, and to determine the optimal level of the significant factors so that the variable Y is optimized and/or stabilized. Stability of Y means that the variable remains optimal or close to its optimal state under various conditions (environment, product treatment, etc.). If this is the case, we talk about product robustness. The following terms were described in the chapter: experimental plan, experiment, experimental run, factor effect, test of factor significance, model of experiment. The following examples provide further assistance in the study of this subject matter. PROBLEM 3 A full experimental plan was constructed for two factors A and B. Each experimental run was realized twice. The result of the experiment is in table 53. Table 53: Result of an experiment with two factors A B 1Y 2Y - - 5 6 + - 5 5 - + 7 6 + + 5 4 Source: author’s Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 118 - Calculate: a. Effects of the factors A, B and AB, b. Write the equation of the incomplete quadratic model for this experiment, c. Estimate variance of the factor effects, d. Test significance of A, B and AB (the nivel of test is 5%). SOLUTION a. First, we shall attach average responses of Y to table 53. By doing so, we get table 54. 
Table 54: Average response to different combinations of factor levels A B 1Y 2Y Y - - 5 6 5,5 + - 5 5 5 - + 7 6 6,5 + + 5 4 4,5 The effects are: ( ) 1 5,5 5 6,5 4,5 1,25 2 Ae = − + − + = − , ( ) 1 5,5 5 6,5 4,5 0,25 2 Be = − − + + = , ( ) 1 5,5 5 6,5 4,5 0,75 2 ABe = − − + = − . b. The equation of the model is: 1,25 0,25 0,75 ˆ 5,375 2 2 2 y A B AB= − + − . c. The estimated variance is: 2 2 4 e s s N = , where 2 0,25 0 0,25 0,25 0,1875. 4 s + + + = = Thus, 2 4 0,1875 0,094. 8 es ⋅ = = Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 119 The estimated standard deviation is 0,31es = . d. We test the hypotheses: :0H Factor (effect) is not statistically significant; :1H Factor (effect) is statistically significant. The test criterion is e efekt t s = . We get: 4,03At = − , 0,8Bt = , 2,41ABt = − . The test criteria are compared to the following critical value: ( ) ( )8 40,05 0,05 2,776N nt t− −= = , where N is the number of all experimental runs including their repetitions, and n is the number of experimental runs excluding their repetitions. Since 4,03 2,776, 0,8 2,776, 2,41 2,776− > < − < , A is significant, the other factors are not. PROBLEM 4 Using the graphical method, find out which factor is significant. The factor effects and other data are contained in table 55. Table 55: Entry data for graphical assessment of factor significance i 1 2 3 4 5 6 7 Effect -8 -1 -0,5 0,5 1,5 6 18 Factor C AB ABC AC B BC A iP Source: author’s SOLUTION Expanding table 55 by calculating ( )100 0,5 i i P m − = , mi ,...,2,1= , m = the number of all factors including their interactions, i.e. m = 7 in this case, we have: Table 56: Pi’s for graphical evaluation of factor effects i 1 2 3 4 5 6 7 Effect -8 -1 -0,5 0,5 1,5 6 18 Factor C AB ABC AC B BC A iP 7,14 21,42 35,71 50 64,29 78,57 92,86 Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 120 The resulting graph is: Figure 14: Graphical evaluation of effects The central line running through the middle section of the graph does not seem to contain the „points“: 92,86; 7,14; 78,57. This suggests that the factors A, C and BC are influential. CONTROL TEST 8 Yes/No answers: 8.1 Experimental plan defines the order in which experimental runs are carried out? 8.2 A full plan working with 4 major factors consists of 8 experimental runs? 8.3 Factor effect can take on positive values only? 8.4 When testing factor significance, the corresponding critical value is related to a Fisher’s distribution? 8.5 When testing factor significance with the graphical method, those factors which lie outside the central line of the graph are regarded as significant? Complete the statement: 8.6 Experiment is a system of __________ 8.7 A full plan with three major factors has __________experimental runs. 8.8 The null hypothesis of the test of factor significance is: __________ 8.9 The graphical method of testing factor significance is used when __________ is/are not available. 8.10 The graph constructed for testing factor significance requires calculation of P(i) which is given by the formula __________ 8.11 Complete the table so that it represents a full experimental plan: Run A B 1 2 3 4 Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 121 - 8.12 The table below represents a full plan for factors A and B. Each experimental run has been realized twice. A B 1Y 2Y - - 2,3 2,6 + - 3,1 2,9 - + 3 3,5 + + 1,9 2,2 Source: author’s Calculate: a. the effect of the factors A, B and AB, b. the model of the experiment, c. the estimate of the factor effect variance. 
8.13 Test whether the effects of A, B and AB from 8.12 are significant (nivel of test = 5%). 8.14 Draw the graph of the interaction AB from 8.12. Depict the effect of A on Y, depending on the level of B. What level of B maximizes Y? SOLUTIONS 8.1 yes 8.2 no 8.3 no 8.4 no 8.5 yes 8.6 runs 8.7 823 = 8.8 insignificant 8.9 repetition of individual runs 8.10. ( )100 0,5 ,i i P m − = where mi ,...,2,1= and m is the number of all the factors. 8.11. Run A B 1 - - 2 + - 3 - + 4 + + 8.12. a. effect(A) = – 0,325; effect(B) = – 0,075; effect(AB)= – 0,875. b. 2,69 0,1625 0,0375 0,4375Y A B AB= − − − . c. 2 s = 0,029; es = 0,12. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 122 - 8.13. Only AB is significant, since 7,29 2,776− > . 8.14. The graph shows that the maximal value of Y is achieved for the upper level of B, and the interaction proves to have an effect on Y, as the two lines resemble a cross. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 123 - 9 TWO-LEVEL FRACTIONAL PLAN The previous chapter presented full factorial plans. It is, however, not always possible to construct such a plan, for different reasons including financial limitations. In such cases, fractional plans are used instead. Fractional plan is a partial plan which does not work with all possible combinations of the factors observed. As we shall learn in this chapter, different degrees of full plan reduction exist, the so-called half plans being among them. It is also the half plans we shall devote to, in particular. Full factorial plans prescribe experimental runs for each factor, whereas fractional plans set down runs only for a subset of the factors, for main factors, and the remaining factors (secondary factors) are calculated as combinations of the main factors. The calculation is later used to define runs for the secondary factors. In this way, the total number of experimental runs can be reduced, which leads to a fractional plan. If 2k denotes the number of experimental runs of a full experiment, k being the number of its factors excluding interactions, then pk− 2 denotes the corresponding fractional plan, p being the degree of the reduction in this case. For instance, if we want to cut down 128 runs of a 27 full experiment by one half to 17 7 2 2 2 − = , we get a half plan with 642 17 == − n runs. This is the smallest possible degree of reduction of the full plan. Plans originating from full plans by applying the smallest degree of reduction are called half plans. The degree of reduction p can be greater than one, such as four, leading to a 47 2 − fractional plan, which will have n = 8 runs. If the number of factors exluding interactions is k = 7, the degree of reduction equal to 4 would be the highest possible, based on the rule that the number of runs should be at least as high as the number of all factors. In the opposite case, too few data is available to perform a meaningful analysis. To give another example, if k = 15, the highest possible degree of reduction is 11 since 1115 2 − = 16. The degree of reduction equal to 12 would lead to 1215 2 − = 8, which is too small a number. The degrees of reduction between two and the highest possible degree define central plans. For example, 27-1 and 27-4 belong to this category. To sum up, fractional factorial plans can be divided into a. Half plans, representing the smallest degree of reduction of full plans, b. Plans, representing the highest possible degree of reduction of full plans, c. Central plans. Let us take a closer look at the half plans now. 
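The run-count arithmetic behind these reductions is easy to check with a few lines of code. The sketch below (Python, illustrative only) finds the largest admissible degree of reduction p under the rule just mentioned, namely that the number of runs 2^(k-p) should not drop below the number of factors k.

# For k factors, a 2^(k-p) fractional plan has 2^(k-p) runs. The rule of thumb used above:
# keep at least as many runs as there are factors k. The function finds the largest such p.
def max_reduction(k):
    p = 0
    while 2 ** (k - (p + 1)) >= k:   # would one more halving still leave enough runs?
        p += 1
    return p

for k in (7, 15):
    p = max_reduction(k)
    print(f"k={k}: largest p={p}, runs=2^{k - p}={2 ** (k - p)}")
# k=7 : largest p=4,  runs=2^3=8
# k=15: largest p=11, runs=2^4=16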
Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 124 - 9.1 HALF PLANS We shall demonstrate in an example the effect of reducing a full plan by one half. To understand better what is going on, however, we need to define other fundamental terms from the theory of experimental plans. We do so now. We shall use the symbol I to denote the factor whose column in the experimental plan contains only ones. We call this variable identity factor. We also define multiplication of two factors: such an algebraic operation applied to two factors yields as its result a factor whose column in the experimental plan contains values obtained by multiplying the ones from the corresponding rows of the plan and from the columns of the two factors that appear in the multiplication. The multiplication possesses the following basic algebraic properties: A.A = I A.I = I.A = A (A.B).C = A.(B.C) A.B = B.A Suppose now that A, B, C, D, E are factors for which a half plan is to be constructed. To do so, one must select four of these factors (the main factors) for which the full plan will be created. These can be, for instance, the factors A, B, C, D. The remaining factor(s), the factor E in this case, will be defined as a combination (multiplication) of the main factors: let us define E = ABCD. In this way, instead of working with the 25 full plan, we shall work with the 25-1 half plan. Only one half of all two-level combinations of all the factors is set up before the experiment is carried out, whereas the remaining two-level combinations are assigned by the multiplication. Now, not every combination of the factors is appropriate. Every combination forms a word. Such a word consists of letters. Number of letters defines the length of the word. The equality E = ABCD is called the plan generator. The pk− 2 factorial plans contain p generators. Since E.E = E. ABCD, we get I = ABCDE. Words that yield the indentity factor I are called defining equations. The shortest word among the defining equations is the resolution of the plan. The length of the shortest word is designated by the corresponding Roman figure in the symbol of the plan: for instance, 15 2 − V applies to our example. Using the defining equations, we can find factor pairs (interactions, in general) with the same columns in the experimental plan. Such pairs are called interchangeable, and they play a role in the entire data analysis based on experimental planning. To illustrate the idea, if the plan generator is E = ABCD, the defining equation is I = ABCDE. In this case, an interchangeable pair for the interaction DE, for instance, is obtained by multiplying the defining equation by DE: I = ABCDE /. DE DE.I = DE.ABCDE. Hence, DE = ABC. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 125 The two interactions have the same column of ones in the experimental plan. The following problem shows the principles of working with half plans. PROBLEM 1 (Dyestuff) The amount of dyestuff Y left in a piece of fabric is observed. The amount depends on five factors: A = pH, B = temperature, C = concentration of the solution used for tinting, D = finishing temperature of tinting, E = finishing time. We are to construct a half experimental plan, and use it to detect influential factors. The necessary entry data is in table 57. Table 57: Entry data for problem 1 Factor Symbol -1 +1 pH A 4,5 5,5 temperature B C0 70 C0 80 concentrationn C 1 g/l 3 g/l finishing temp. D C0 170 C0 190 finishing time E 50s. 70s. 
Source: author's

SOLUTION
The main factors are chosen to be the factors A, B, C, D, while E is selected to be the secondary factor. Based on the half plan, constructed as described above and depicted by table 58, the experiment is performed, leading to the following values of Y:

Table 58: The output of the experiment
Run   A   B   C   D   E = ABCD   Y
1     -   -   -   -   +          6,4
2     +   -   -   -   -          9,9
3     -   +   -   -   -          8,1
4     +   +   -   -   +          6,6
5     -   -   +   -   -          9,0
6     +   -   +   -   +          5,3
7     -   +   +   -   +          -5,1
8     +   +   +   -   -          -1,0
9     -   -   -   +   -          10,6
10    +   -   -   +   +          12,7
11    -   +   -   +   +          12,9
12    +   +   -   +   -          11,2
13    -   -   +   +   +          2,4
14    +   -   +   +   -          9,7
15    -   +   +   +   -          4,1
16    +   +   +   +   +          4,0

The factor effects are calculated here the same way as for full experimental plans. For instance, the effect of D is

effect(D) = (1/8)·[-(6,4 + 9,9 + ...) + (10,6 + 12,7 + ... + 4,0)] = 4,8.

The effects of the other factors are obtained similarly. However, since we work with a half plan, interchangeable pairs exist in this case. As we said before, such pairs are factors that have the same column of ones in the experimental plan. The effect calculated for one of these factors does not belong to that factor alone anymore! It now represents the influence of all the interchangeable factors together. This is illustrated in table 59. If the effect were not attributed to all the interchangeable factors, it would lose the original interpretation it had in the case of the full experimental plan.

Table 59: Effects in a half plan
Factor        Effect
A + BCDE      0,0
B + ACDE      -4,4
C + ABDE      -5,0
D + ABCE      4,8
E + ABCD      -0,8
AB + CDE      0,2
AC + BDE      -0,6
AD + BCE      -0,6
AE + BCD      0,5
BC + ADE      -4,2
BD + ACE      1,1
BE + ACD      -0,2
CD + ABE      0,7
CE + ABD      -0,5
DE + ABC      2,4

For instance, the first zero effect represents the collective influence of the factors A and BCDE, the interaction which is interchangeable with A.

9.2 GRAPHICAL EVALUATION OF FACTOR EFFECT
The graphical method we used to verify significance of factor effects in the case of full experimental plans can be exploited for half plans, as well. This means we calculate the Pi's:

Table 60: Sorted effects and their Pi's
i    Factor        Effect   Pi
1    C + ABDE      -5       3,3
2    B + ACDE      -4,4     10
3    BC + ADE      -4,2     16,6
4    E + ABCD      -0,8     23,3
5    AD + BCE      -0,6     30
6    AC + BDE      -0,6     36,6
7    CE + ABD      -0,5     43,3
8    BE + ACD      -0,2     50
9    A + BCDE      -0,0     56,6
10   AB + CDE      0,2      63,3
11   AE + BCD      0,5      70
12   CD + ABE      0,7      76,6
13   BD + ACE      1,1      83,3
14   DE + ABC      2,4      90
15   D + ABCE      4,8      96,6

Once the Pi's are calculated, we construct the familiar graph (figure 15).

Figure 15: Graphical evaluation of factor effect significance

As we can see, the half plan brought us results similar to those obtained with the full experimental plan. This means we get similar results with fewer experimental runs. However, the results are not generally the same, and a certain amount of information has been lost after all, because each calculated effect belongs to the combined influence of several factors, such as A + BCDE, B + ACDE, etc. The fact that a particular effect belongs to A + BCDE does not mean that exactly one half of the effect is caused by A and the other half by BCDE! Generally speaking, it is not known what part of the total effect belongs to A or to BCDE in this case. Nonetheless, it is known that the longer the word representing the interaction/factor, the smaller its contribution to the total effect.
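The construction of the half plan and of its interchangeable pairs can also be reproduced programmatically. The following Python sketch is illustrative only: it generates the 2⁵⁻¹ plan with the generator E = ABCD and derives the alias of any word by multiplying it with the defining word ABCDE, which amounts to cancelling repeated letters; its output matches the pairs listed in table 59.

from itertools import product

factors = "ABCD"                 # main factors; the secondary factor E is generated as E = ABCD
defining_word = set("ABCDE")     # from the generator E = ABCD, since E.E = I gives I = ABCDE

# The 2^(5-1) half plan: a full plan in A, B, C, D with the column E computed as the product A.B.C.D.
plan = []
for levels in product([-1, 1], repeat=4):
    row = dict(zip(factors, levels))
    row["E"] = row["A"] * row["B"] * row["C"] * row["D"]
    plan.append(row)

def alias(word):
    # Multiplying a word by the defining word cancels repeated letters (X.X = I),
    # which is a symmetric difference on the sets of letters.
    return "".join(sorted(set(word) ^ defining_word))

for word in ["A", "B", "C", "D", "E", "AB", "AC", "AD", "AE",
             "BC", "BD", "BE", "CD", "CE", "DE"]:
    print(word, "+", alias(word))   # reproduces the interchangeable pairs of table 59

The sketch only enumerates the pairs; how much of a combined effect each member contributes depends on the length of its word, as noted above.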
Therefore, it is advisable that the fractional plans be generated in such a way that the interchangeable pairs are very long interactions. PROBLEM 2 Let us have five factors A, B, C, D, E, where factor E is to be generated as a secondary factor. There is more than one way of generating E. Let us compare the consequences of two scenarios: a. E = AB, b. E = ABCD. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 128 - SOLUTION The defining equations are: a. I = ABE b. I = ABCDE In a), we have a 15 2 − III plan, whereas in b), we have a 15 2 − V plan. The plan b) is better because its resolution is V, which leads to the following discovery: when seeking the interchangeable pairs for A, for instance, we have a. A = BE b. A = BCDE In the second case, the interchangeable pair has more factors (it is represented by a longer word), which results in the interaction BCDE contributing less to the total effect of the factor A+BCDE. Interchangeable interactions, represented by words with at least three letters, account for such a small part of the total effect that their contribution to the total effect is often neglected. This, of course, facilitates the interpretation of the final effect. SUMMARY We have learnt how it is possible to set up a fractional plan – it is the full plan constructed for a subset of the set of all factors, while the levels of the remaining one-letter factors are generated (calculated). This procedure reduces the total number of experimental runs, making the total experiment cheaper and faster. We’ve also learnt that the graphical method used to detect influential factors of a full plan can be exploited for the same purpose in the case of half plans. Finally, we explained that it is important to select a proper generation of the levels of the secondary factors. A proper generator leads to favourable interchangeable pairs for each factor, facilitating the conclusion over how large effects belong to each factor. What follows is a set of illustrative examples. PROBLEM 3 A half plan was constructed for factors A, B, C and D: a. Complete the table below, b. Using the graphical method, determine which factor is significant. The necessary data are in table 61. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 129 Table 61: Effects and their probabilities Factor Effect i iP A + BCD 1 3 35,7143 B + ACD -0,5 2 21,4286 C + ABD -4 1 D + ABC 3 4 50 AB + CD 9 6 78,5714 AC + BD 6 5 64,2857 AD + BC 17 7 92,8571 Source: author’s SOLUTION a) Using the equation ( )100 0,5 i i P m − = , we get ( ) 1 100 1 0,5 7,14 7 P − = = . b) The following graph 16 implies that the interactions AD and BC and also the factor C could be considered significant. Figure 16: Graphical evaluation of factor significance PROBLEM 4 A half plan has been constructed for factors A, B, C, D, the generator of the plan being D = ABC. The output of the experiment is in table 62. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 130 a. Calculate effects of the factors including the three-factor and four-factor interactions. b. Using the interactions from a), estimate variance of the factor effects (this is a new procedure to be explained!) c. Write the defining equation and interchangeable pairs. d. Assess graphically the factor effects. 
Table 62: Experimental output A B C D = ABC Y ABC ABD BCD ACD ABCD - - - - 77 + - - + 67 - + - + 64 + + - - 51 - - + + 64 + - + - 53 - + + - 73 + + + + 67 Source: author’s SOLUTION a) Inserting the remaining signs + and – (or plus ones and minus ones) in the table, we get the effects: ( ) 1 77 67 64 51 64 53 73 67 10 4 A BCDe e= − + − + − + − + = − = ; 1,5B ACDe e= − = ; 0,5C ABDe e= − = ; 2D ABCe e= = ; 129ABCDe = . b) If there is only one measurement (experimental output) for each combination of the factors (for each row of the table), 2 s needed to estimate the variance is calculated as the average of the second powers of the effects belonging to the longest interactions: ( ) ( ) ( ) 2 2 22 2 2 2 2 2 0,5 10 1,5 129 3349,5 5 4. 1674,75 40,9.e e s s s s n + − + − + − + = = = = ⇒ = c) Since D = ABC, the defining equation is I = ABCD. The interchangeable pairs are AB, CD; AC, BD; AD, BC. d) Calculating the Pi’s, as in table 63, Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 131 Table 63: Factor effects and their Pi’s Factor Effect i Pi A + BCD -20 1 7,14 B + ACD -3 2 21,42 C + ABD -1 3 35,71 D + ABC 4 6 78,57 AB + CD 1 4 50 AC + BD 3 5 64,28 AD + BC 26 7 92,85 we construct the graph (figure 17). Figure 17: Graphical evaluation of factor effects The graph shows that the effects of AD + BC and A+BCD lie outside the central line. These effects are deemed significant. In the latter case, we can restrict ourselves to the factor A, since the part of the total effect of A+BCD belonging to BCD will be small enough to be neglected. CONTROL TEST 9 Yes/No answers: 9.1 Factorial plans contain all combinations of factor levels? 9.2 The commutative property of multiplication known from the theory of real numbers does not hold true in the case of factor multiplication? 9.3 If the generator of the plan is D = BC, the defining equation is I=BCD? 9.4 If interactions ABC and DE have the same columns in the experimental plan, the effect calculated from one of these columns belongs to both interactions? 9.5 When a half plan is constructed, one of the factors is defined as an interaction of other factors? Complete the statement: 9.6 Secondary factors are expressed as a __________ of the main factors. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 132 - 9.7 Factorial plans can be divided into __________plans, __________plans and __________ plans. 9.8 Given a factor A, the following holds: AI = IA = ___, where I is the identity factor. 9.9 Two factors with the same column of ones in an experimental plan are called__________. 9.10 In full plan, effects of factors and their interactions __________ the same as in the related half plan. 9.11 Construct the half plan for factors A, B, C, D, generated by B=ACD. Calculate the effect of C provided the following two measurements for each factor combination were obtained from the experiment: The first series of measurements: 10,11,14,12,12,10,13,14, The second series of measurements: 11,12,12,8,14,12,13,14. 9.12 A half plan was used for factors A, B, C, D. a) Complete the table below, b) Detect significant factors with the graphical method. Effects i iP A + BCD 1 B + ACD -8 C + ABD -10 D + ABC 4 AB + CD 9 AC + BD 7 AD + BC 5 Source: author’s SOLUTIONS 9.1 no 9.2 no 9.3 yes 9.4 yes 9.5 not necessarily 9.6 combination 9.7 half plans, central plans, highest-reduction plans (saturated plans) 9.8 A 9.9 interchangeable 9.10 are not 9.11 1Ce = Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 133 - 9.12 a. 
Effects i iP A + BCD 1 3 35,71 B + ACD -8 2 21,42 C + ABD -10 1 7,14 D + ABC 4 4 50 AB + CD 9 7 92,85 AC + BD 7 6 78,57 AD + BC 5 5 64,28 b. Significant factors: B, C (figure 18): Figure 18: Graphical evaluation of factor effects Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 134 - 10 TAGUCHI’S METHODS – LOSS FUNCTIONS Taguchi’s methods, based on the research by Genichi Taguchi, include online methods used during production, and offline methods reserved for pre-production stages. The former is the contents of the chapters 10 and 11, and it relies heavily on loss functions. In this chapter, we shall explain the logic behind these functions and the way they are constructed and used. The first part of the chapter defines loss function, and presents its properties. The second part of the chapter works with different kinds of loss functions, which is related to different kinds of what is called tolerance interval. The end of the chapter presents examples and a control test with questions and answers. Taguchi’s methods based on loss functions try to measure financial losses experienced by product users due to producers’ inability to fabricate a product that would precisely comply with users’ demands. Most often there is always at least a slight imprecision in the production due to its physical nature, no matter how much the production is surveilled and controlled. Introduction of loss functions brought a new concept to how problems with quality are viewed. Earlier, the standard approach had been such that as long as the observed quality characteristic of a product lied within a tolerance interval, the characteristic not necessarily being equal to its desired optimal value, the product users would not bear any losses incurred by quality imprecisions. Taguchi disagreed with this view of the problem, and introduced simple mathematical functions that suprisingly turned out to be precise enough to measure the losses that occur even when the slightest deviation of the product quality characteristic from its optimal level exists. Let us emphasize that the loss-function approach is only one of many forms of looking at the process or product. Whereas loss functions quantify the process quality, another question is how to improve the quality if it is detected to be inadequate, for example by a loss function. There are many ways how to solve problems within a process: Six Sigma methodology is one technique based on statistical methods (regression, in particular); analyzing process by simulation (Zgodavová and Bober, 2012) is another technique which may be used after process quality characteristics are properly measured or quantified (Zgodavová, 2010), and key concepts are exactly defined (Brannmark et al., 2012). The objective of process analyses is to create an optimal or close-to-optimal process set-up, and keep it that way, regardless of variations of factors which could destabilize the optimal set-up, i.e. keep the process robust (Siva, 2012). 10.1 DEFINITION AND PROPERTIES OF LOSS FUNCTIONS Before defining the functions and presenting their graphs and properties, let us mention some fundamental conditions which are considered to be met in order for the functions to be used correctly and properly: Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 135 - 1. Every product bears a certain quality characteristic (such as its size, weight, mechanical property, etc.), and the quality of the whole product is judged based on that particular quality characteristic. 2. 
A target (optimal) value T is given for the quality characteristic from 1).
3. Lack of product quality is measured by the deviation of the observed product quality characteristic from its target value T.
4. Any deviation of the characteristic from T brings a financial loss that the product user must bear because of the necessity to increase expenses on product maintenance, repairs, etc.

One of the simpler loss functions is of the form

10-1    L(Y) = k(Y - T)²   for Y ∈ (T - d, T + d),
        L(Y) = A           otherwise,

where
T = target value of the quality characteristic,
d = tolerance,
A = maximal loss due to poor quality,
Y = truly achieved value of the quality characteristic (which is a random variable),
L(Y) = financial loss given by the specific value of Y,
k = constant to be determined.

Figure 19 shows the loss function just described. Above the tolerance interval (T - d, T + d), it is a parabola, whereas outside the interval, it is a constant function.

Figure 19: Loss function

If Y ≤ T - d or Y ≥ T + d, in other words, if |Y - T| ≥ d, then L(Y) = A. We can therefore write:

10-2    A = kd².

Since d and A are usually known, 10-2 is used to determine k: k = A/d².

PROBLEM 1
Write the loss function equation for d = 5 and A = 2.

SOLUTION
We have k = 2/5² = 0,08, therefore L(Y) = 0,08(Y - T)².

The variable Y is considered to be a random variable, usually following approximately a normal distribution N(E(Y), σ²). We are often more interested in the average loss E(L) than in the individual loss. The average loss is calculated according to

10-3    E(L) = E[k(Y - T)²] = kE(Y - T)² = kσ²,

provided that E(Y) = T. The symbol σ² denotes the variance of Y, as usual. However, if E(Y) ≠ T, then E(L) = kσ² + k(E(Y) - T)². Therefore, several equations are used in connection with loss functions:
a. the defining equation L(Y) = k(Y - T)²,
b. the equation determining the constant k: A = kd²,
c. the equation for the average loss E(L) = kσ², or E(L) = kσ² + k(E(Y) - T)².

Quality costs can be enumerated in a much more complex manner involving all possible kinds of losses induced by the not-optimal product quality, such as expenses on repairs, expenses on product control, losses due to imprecise quality measurements, etc. We shall work with this concept in chapter 11. There are also loss functions dependent on more quality characteristics, i.e. they are functions of several variables. Some of the characteristics do not even have to be quantitative.

10.2 LOSS FUNCTIONS FOR DIFFERENT TYPES OF TOLERANCES
The loss function described in figure 19 is not the only one. There are more kinds of loss functions, depending on what tolerance interval we work with. What follows is a classification of some of the most fundamental loss functions. Each of the functions is accompanied by the corresponding graphical representation. We distinguish the following types of tolerances:

a) Symmetric N-tolerance

Figure 20: Symmetric N-tolerance

In this case, we write T ± d, where d = tolerance. The interval (T - d, T + d) is called the tolerance interval. The tolerance is symmetric in the sense that the target value lies in the center of the tolerance interval. If the quality characteristic observed is smaller than the lower tolerance limit T - d, the financial loss incurred equals A, and the same is true when the characteristic is greater than the upper tolerance limit T + d.
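A minimal numerical illustration of the symmetric case: the Python sketch below implements the loss function 10-1 and estimates the average loss for a small hypothetical sample, once directly as the mean of the individual losses and once through equation 10-3 with the variance estimated around the target. All numbers, including T, d and A, are assumed for illustration only.

def loss(y, T, d, A):
    # Symmetric N-tolerance loss 10-1: quadratic inside (T - d, T + d), equal to A outside.
    k = A / d ** 2
    return k * (y - T) ** 2 if abs(y - T) < d else A

# Hypothetical sample of the quality characteristic; T, d and A are assumed for illustration.
T, d, A = 25.0, 1.0, 40.0
sample = [25.1, 25.0, 24.9, 25.1, 24.8, 25.2]

individual = [loss(y, T, d, A) for y in sample]
avg_loss_direct = sum(individual) / len(sample)

# The same estimate via 10-3, E(L) = k * sigma^2, with sigma^2 estimated by the mean
# squared deviation from the target; the two agree while all points stay inside the tolerance.
k = A / d ** 2
s2 = sum((y - T) ** 2 for y in sample) / len(sample)
print(avg_loss_direct, k * s2)

The tolerance types described next adjust this basic form to other target values and tolerance limits.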
b) Nonsymmetric N-tolerance In this case, the loss function looks as described by figure 21. Figure 21: Nonsymmetric N-tolerance Here, the tolerance interval is (T - d1, T + d2). We see that there are two tolerances d1 and d2, which are generally different. There are also two generally different maximal losses A1 and A2, depending on whether the quality characteristic is too high or too low. The different maximal losses and different tolerances define two generally different curves above the horizontal axis, one of the curves being above and to the right of T, and the other being above and to the left of T. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 138 On the interval (T-d1, T), we work with equation 10-1, where k = k1. On the interval (T, T+d2) we work again with equation 10-1, however k = k2 in this case. In the former case, k1 = A1/ d1 2 , in the latter case, k2= A2/ d2 2 . c) S-type tolerance (S for Small) For this type of tolerance, the following is true: the smaller the observed quality characteristic Y, the better. The target value T = 0. Figure 22 describes the shape of an S-type tolerance loss function. Figure 22: S-tolerance To give an example of this situation, surface roughness can be the quality characteristic Y. Another example is air pollution. A certain upper tolerance/specification limit is acceptible, and beyond that limit, the losses reach their maximum. On interval (0, USL), we work with equation 10-1, where k = A/USL2 ; beyond USL, the loss function is constant. d) L-type tolerance (L for Large) In this case, the opposite is true: the bigger the characteristic Y, the better. The optimal/target value is ∞=T . Here, the loss function is of the form 10-4 ( ) 2 2 E L A d s= ⋅ ⋅ , where 2 2 (1/ )s E Y= , i.e. the expected value (the „average“) of the random variable 2 1/Y . Equations a) -c) give an individual loss. If we want to determine the average loss, we calculate the average of the individual losses. Since the population average is usually unknown, we estimate it with the sample average. The same is true for the case d) where the unknown population average 2 (1/ )E Y is replaced with its estimate 1 2 n Y− − ∑ . Before presenting examples, let us summarize the essentials. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 139 - SUMMARY We have described all major types of loss functions. We know that a quality characteristic is observed for each product. An optimal or target value is given for the characteristic, and if the optimal value is not achieved, certain financial loss is brought upon the product user. Depending on what the target value and tolerance are, we distinguish the following types of tolerances: 1. N-type tolerance: symmetric and unsymmetric, 2. S-type tolerance: smaller Y is better, T = 0. 3. L-type tolerance: larger T is better, ∞=T . Different loss functions correspond to the cases 1-3. PROBLEM 2 In a crankshaft production, length and diameter of crankshafts are observed. The diameter is supposed to be 25mm ± 1mm, while the length has a prescription of 100mm ± 2mm. If the diameter falls outside the tolerance interval, an individual loss of 40 crowns is generated; for the length, the individual loss is 30 crowns. Ten crankshafts have been taken out of the production line randomly for examination. These are the measurements of their length and diameter: Diameter (in mm): 25,1; 25; 25; 24,9; 25,1; 25; 24,9; 25; 25,1; 24,9. Length (in mm): 99,9; 99,9; 99,8; 100,2; 100; 100; 100,1; 98; 99,9; 100,2. 
Compare the quality of two operations: one that yields a certain diameter of the crankshaft, and the other resulting in a length of the crankshaft. SOLUTION: Diameter: T1 =25, A1 = 40, d1 = 1. ( ) ( ) ( ) 2 2 22 1 25,1 25 25 25 ... 24,9 25 0,006 10 s  = − + − + + − =   . 2 40 ( ) 0,006 0,24 1 Estimated E L = = crowns per unit. Length: T2 = 100, A2 = 30, d2 = 2. ( ) ( ) 2 22 1 99,9 100 ... 100,2 100 0,02 10 s  = − + + − =   . Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 140 - 2 30 ( ) 0,02 0,15 2 Estimated E L = = crowns per unit. We can conclude that the length of the crankshaft is produced at a higher quality than the diameter of this product. The total average loss is estimated to be 0,24 + 0,15 = 0,39 crowns per unit (per crankshaft). PROBLEM 3 Drums to be used in washing machines are supposed to be 30 cm wide (= diameter). The tolerances are specified as follows: 30cm-1cm, 30cm+4cm. If the diameter is smaller than the lower tolerance limit, a loss of 50 crowns is recorded by the consumer. If the diameter exceeds the upper tolerance limit, the loss is 100 crowns. Two companies produce the same drums. Compare their quality if the following data samples on their production are available: Table 64: Production data Company Deviations from the target value (!) A 0; 0; -1; 3; 0; 4; 2; -1; 0; 1; 2; 4 B -1; -1; 0; 0; 0; 3; 2; -1; 1; 2; 0 Source: author’s SOLUTION: The parameters are: A1 = 50, A2 = 100, d1 = 1, d2 = 4. Company A: ( ) ( ) ( )22 2 2 2 2 2 2 1 2 2 1 50 100 ( ) 1 1 3 4 2 1 2 4 12 1 4 Estimated E L   = − + − + + + + + +    ( )1 34,375Estimated E L = crowns per unit. Company B: ( ) ( ) ( ) ( )2 2 2 2 2 2 2 2 2 2 1 50 100 ( ) 1 1 1 3 2 1 2 11 1 4 Estimated E L   = − + − + − + + + +    ( )2 23,864Estimated E L = crowns per unit. The production of company B seems to give a higher quality. Let us take a closer look at the logic behind the calculations performed (the case A, the second case is analogous): we are estimating the average loss by the sample average. Therefore, the data are summed together and divided by the sample size, which is 12. Each of the terms in the summation represents an individual loss, i.e. a functional value of the loss function used. Since the tolerance we work Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 141 with is unsymmetric, each part of the loss function has its own defining equation. Each of these equations lead to a different value of the constant k. In one case, the constant equals 50, in the other case, it is equal to 100/16. PROBLEM 4 Ballbearings are produced by two different companies: Company A produces it with specification 0,4T ± , while the other company B must cling to specification 1T ± . Fifty thousand ballbearings is produced every day. Each ballbearing costs 0,60 crowns. If the tolerance interval is not achieved, the corresponding production unit is scrapped. A spot check at the two companies led to the following data samples: Company A: deviations from the optimal size: -0,3; 0,1; 0,2; 0; 0; -0,2; -0,1; 0; 0,4; 0,1; -0,1; 0, 0; 0,1; -0,2. Company B: deviations from the optimal size: 0; 0; 1; -0,8; -0,8; 0, 0,6; 0,7; 0; -0,3; -0,2; 0; 0; 1; 0,2. Compare the production quality of the two companies. SOLUTION 1. Company A: A = 0,6; d = 0,4. We have ( ) ( ) 2 22 21 0,42 0,3 0,1 ... 0,2 0,028 15 15 s  = − + + + − = =   . and 2 2 2 0,6 ( ) .0,028 0,105 0,4 A Estimated E L s d = = = crowns per unit. The daily loss is 50 000·0,105 = 5 250 crowns. 2. Company B: A = 0,6 ; d = 1. 
In this case, 2 2 21 0 ... 0, 2 0, 287 15 s = + + =   . and 2 0,6 ( ) 0,287 0,172 1 Estimated E L = = crowns per unit. The daily loss in the second case is 50 000·0,172 = 8 600 crowns. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 142 PROBLEM 6 The surface of a piston is adjusted during production so that the surface roughness does not exceed 10 mm. The smoother the surface, the better. If the upper roughness limit is exceeded, it is re-worked with a machine tool for 200 crowns. Two different workers at two different firms have the same job: they are in charge of how smooth the piston surface is. Compare the quality of their work when the following data samples are available (table 65). Table 65: Entry data Worker Surface smoothness 1 0, 1, 9, 6, 10, 2, 3, 0, 9 2 3, 2, 4, 4, 5, 2, 4, 6, 5, 3 Source: author’s SOLUTION The parameters are: A = 200, d = 10 (S-type tolerance) 1st Worker: 2 1 9 s = ( (0-0)2 + (1-0)2 + (9-0)2 + ... + (9-0)2 ) = 34,67. 2 2 2 200 ( ) 34,67 69,34 10 A Estimated E L s d = = = crowns per unit. 2nd Worker: 10 12 =s (32 + 22 + 42 + ... + 52 + 32 ) = 16. ( ) 2 200 16 32 10 Esitmated E L = = crowns per unit. The second worker’s skills are more than twice as good as those of the first worker. PROBLEM 7 Rock-climbing equipment makers are required to produce ropes with stiffness of at least 300 kg. If the lower limit is not achieved, the rope must be re-stiffened at the cost of 50 crowns per metre. Compare two technologies of rope-making if the following production data is available Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 143 Table 66: Production data Technology Rope stiffness 1 305, 350, 350, 410, 310, 300, 350, 400 2 305, 301, 308, 306, 300, 320, 310, 310, 320, 325 Source: author’s SOLUTION The parameters are: A = 50, d = 300. It is the L - type tolerance. 1st technology. Based on 10-4, we have: 2 6 1 2 2 2 1 1 1 1 ... 8,62 10 8 305 350 400 s −  = + + + = ⋅    . The average loss is ( ) 2 6 1. 50 300 8,62 10Est E L − = ⋅ ⋅ ⋅ = 38,79 crowns per metre. 2nd technology Variance: 2 5 2 2 2 2 1 1 1 1 ... 1,03 10 10 305 301 325 s −  = + + + = ⋅    . Average loss: ( ) 2 5 2. 50 300 1,03 10Est E L − = ⋅ ⋅ ⋅ =46,76 crowns per metre. CONTROL TEST 10 Yes/No questions: 10.1 Loss function can be described mathematically as ( ) ( )2 L Y k Y T= − ? 10.2 The higher Y, the better…this is the S –type tolerance? 10.3 If a product feature has its optimal value, a lower-than-optimal quality of the product manifests itself by deviations of the feature value from the optimal value? 10.4 With the N-type tolerance, the optimal value is smaller than the target value? 10.5 Loss functions are in part a parabola? Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 144 Complete the statement: 10.6 Any deviation from the target value T brings __________ . 10.7 A certain product__________is usually observed, based on which we judge the quality of the product. 10.8 A part of loss function is mathematically a __________. 10.9 Based on what is considered the target value T, we distinguish these types of tolerances:__________, __________, __________. 10.10 When working with the S – tolerance, the target value T = __________. 10.11 A product diameter and weight are observed. The diameter is to be T1 = 20cm ± 1 and the weight is to be T2 = 100g ± 2. If the diameter falls outside the tolerance interval, it costs 20 crowns to repair the product or scrap it; for the weight, the same cost is 30 crowns. 
Ten product units have been randomly drawn from the production line: Their diameter was: 20,1; 20; 20; 19,9; 20,1; 20; 19,9; 20, 20,1; 19,9. Their weight was: 99,9; 99,9; 99,8; 100,2; 100; 100; 100,1; 9,8; 99,9; 100,2. Compare the production quality in terms of the ability of the company to keep the target value of the diameter and the weight. 10.12 In filter production, it is required that the throughput of the filter be 10% at the most. Two filter makers were inspected, and these are the results of the inspection: Maker % throughput A 3, 9, 9, 7, 1 B 8, 8, 1, 1, 2, 5 Source: author’s If the throughput tolerance is exceeded, the costs of the maker A rise by 600 crowns, whereas in the case of the maker B, the costs increase by 700 crowns. Which producer yields a betterquality filter? SOLUTIONS 10. 1 yes 10. 2 no 10. 3 yes 10. 4 no 10. 5 yes 10. 6 losses 10. 7 characteristic 10. 8 ( ) ( )2 L Y k Y T= − Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 145 - 10. 9 N (nominal), S (smaller), L (larger) 10. 10 0 10. 11 Diameter: ( ) 0,12E L = crowns per unit; Weight: ( ) 0,15E L = crowns per unit. 10. 12 A: ( ) 265,2E L = crowns per unit; B: ( ) 185,5E L = crowns per unit. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 146 - 11 TAGUCHI’S METHODS: TOTAL QUALITY COSTS In the previous chapter, we worked with loss functions that measured financial losses of customers, resulting from product lower quality. The losses were feature-specific. In this chapter, we shall include in the losses other types of expenses that are not directly linked to a specific product feature. Inclusion of these expenses leads to a total quality costs function. In chapter 11, we shall talk about control charts, as well. This is a very important part of quality management, as it monitors stability of production processes. The chapter is divided into four parts: the first part discusses quality cost monitoring, the second part introduces the total quality cost function when 100% control of the production is realized, the third part of the text provides the reader with the total quality cost function when the production is controlled after every batch of n production units, and the final part of the chapter describes basic types of control charts. 11.1 QUALITY COST MONITORING The term „quality cost“ can mean more than one thing, nevertheless, it is mostly connected to expenses on ensuring or improving quality, as well as expenses of nonproductive nature, such as those resulting from making nonconforming products. From the practical point of view, it is convenient to divide the quality costs into three categories: - quality costs of the producer, - quality costs of the customer, - quality costs of the whole society. We shall focus on the first category. Producers must invest in prevention, production evaluation and defect removal, so that appropriate quality is achieved in all production stages, i.e. in the product development, product manufacturing, product installation and product use. Monitoring these „investments“ allows for product improvement. There are different ways of monitoring the expenses: 1) monitoring based on PAF models, 2) monitoring based on process models, 3) monitoring through the Taguchi’s approach. 
Ad 1) PAF models (Prevention, Appraisal, Failure)
This model is based on dividing company costs into four categories:
- Costs resulting from internal defects (these defects originate within the company before its final product reaches the customer),
- Costs resulting from external defects (these include customer complaints, repairs, handling costs, discounts, expenses due to lawsuits, market share losses, etc.),
- Evaluation costs (these are mainly expenditures on measuring customer satisfaction, measuring equipment, software, certification, laboratory testing, etc.),
- Prevention costs (these are expenses which should rise continuously; they include expenditures on exploring customer demands, management system development, education of employees and others).

Ad 2) Process models
Process models represent a higher degree of monitoring which keeps track of costs related not to each product but to processes. The costs involve expenses on converting process inputs into process outputs according to a plan, as well as expenses on resolving inconsistencies that were not supposed to originate in the process at all.

Ad 3) Taguchi's methods
These methods use mathematics to describe the relations between total quality costs and different factors that contribute to the costs. The exact nature of mathematics makes it possible to optimize the costs, which is, of course, an advantage of this approach. We shall devote ourselves to Taguchi's methods in the following two sections 11.2 and 11.3.

11.2 TAGUCHI'S APPROACH – THE CASE OF 100% PROCESS CONTROL
The total quality costs per unit are calculated in this case according to equation

11-1    L = Q/R + (A/d²)·s₀²,

where
Q = yearly expenses on the 100% control,
R = yearly production (number of product units made),
d = tolerance within which the product remains satisfactory in terms of its quality,
A = losses due to exceeding the tolerance d,
s₀² = [(y₂ - y₁)² + (y₃ - y₂)² + ... + (yₙ - yₙ₋₁)²] / (n - 1),
the y's being measurements of the observed product quality characteristic.

PROBLEM 1
An automated control (i.e. a 100% control) at a factory costs 25 000 crowns a year. Each year, four million product units leave the factory. The tolerance for the quality characteristic observed is 9, and the company's costs rise by 5 crowns each time the tolerance is exceeded. Calculate the total quality costs if a random sampling showed the „variability" of the characteristic to be s₀² = 1.
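Formula 11-1 is straightforward to evaluate numerically. The short Python sketch below uses the inputs of Problem 1 and can serve as a check of the hand calculation in the solution that follows; the variable names are illustrative.

# Total quality costs under 100% control, formula 11-1: L = Q/R + (A/d^2) * s0_sq.
Q = 25_000        # yearly cost of the automated control, in crowns
R = 4_000_000     # yearly production, in units
d = 9             # tolerance of the observed quality characteristic
A = 5             # loss in crowns each time the tolerance is exceeded
s0_sq = 1         # observed "variability" s0^2 of the characteristic

L_per_unit = round(Q / R + (A / d ** 2) * s0_sq, 3)
print(L_per_unit, "crowns per unit")             # 0.068
print(round(L_per_unit * R), "crowns per year")  # 272 000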
11.3 THE CASE OF PROCESS CONTROL AFTER N UNITS If n product units are made between two controls, the total quality costs are calculated according to formula 11-2 2 2 2 2 2 2 2 1 3 ms d A z n u D d AD d A u C n B L +      + + +++= , where A = loss due to exceeding tolerance d, B = product control costs, C = production machinery repair costs, n = control interval, u = average number of units produced between two controls, d = tolerance within which the product remains satisfactory in terms of its quality (the tolerance is defined by the customer), D = tolerance defined by the producer (it is usually more demanding than what the customer demands), z = number of product units made during the control, n B = control costs per unit, u C = repair costs per unit, 3 2 2 D d A = costs resulting from imprecise production,       + + z n u D d A 2 12 2 = costs due to producing defective units, 2 2 ms d A = costs due to imprecise measurements. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 149 Equation 11-2 is a result of Taguchi’s long-time experience, and it was mathematically defined, not derived. However, three terms in the equation are derived from loss functions. The question we usually ask ourselves is: How often should the production control take place so that the total quality costs were minimal? What tolerance should the company define for itself to minimize its total quality costs? The answers to the questions can be obtained by standard optimization procedures which seek local minima of a function: the first-order derivatives with respect to n and D are calculated (the two problems are solved separately), and the derivatives are put equal to zero. This necessary extremum condition leads to an optimal control interval of 11-3 * 2uB d n A D = . and an optimal tolerance 11-4 2 2 * 4 3CD d D Au = . 11.4 CONTROL CHARTS Control charts rank among major statistical tools for production process regulation. The charts were introduced by Walter Shewhart in the 1920s. Their aim is to monitor a characteristic of a process in time, and give a signal if a problem in the process occurs. If such a deterioration of the process takes place, the process owner reacts to the situation, and makes the necessary adjustments to the process. Thus, the charts serve as a problem prevention. Values of the observed characteristic are measured on the y-axis of the chart, whereas its xaxis records points in time at which the characteristic is observed or measured. The time series values of the characteristic should not exceed certain limits, nor should they form an improbable pattern. In either case, the chart signals a systematic impact on the process which has nothing to do with the natural character of the process. Such an impact may result, for instance, from a defect developed in a machine that is used during the process. Although nearly any process characteristic can be observed in time, it should satisfy some basic requirements if it is to be used in the framework of statistics. In our case, such a characteristic should at least approximately follow a normal distribution. Two properties of the process characteristic (or the process in question) are observed: a. its ability to keep itself close to a pre-defined target value, b. its variability around the target value. Therefore, two control charts are usually constructed, each of which observes either the property a) or b). 
Perhaps the most common charts are: the chart for the average x of the characteristic and the chart for the range of the characteristic R; the pair is denoted CC( )Rx, ; or the chart for the average x of the characteristic and the chart for its standard deviation s, i.e. the pair CC( )sx, . Let us take a closer look at the first pair. To construct a CC( )Rx, , we take the following steps: Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 150 a. We gather data about the characteristic at time points t = 1,2, …, m (the first column of table 67). b. The average and range minmax xxR −= are calculated for each of the samples, i.e. for each point in time t. c. An optimal value (central line), an upper limit and a lower limit are calculated to be used in the chart as a reference (see below). d. We plot the time averages in one chart to get the CC( )x chart. Likewise, we plot the individual time ranges in another chart to get the CC(R) graph. The time series depicted by either chart should stay below the calculated upper limit and above the calculated lower limit. Also, it should fluctuate more or less randomly around the central line of the chart. If it doesn’t seem to be the case, the process owner must check the status of the process. Tabulka 67: Samples and characteristics for CC( x , R) charts Data x-axis y-axis i = time average ix range Ri nxxx 11211 ,...,, 1 1x R1 nxxx 22221 ,...,, 2 2x R2 nxxx 33231 ,...,, 3 3x R3 : : : : mnmm xxx ,...,, 21 m mx Rm Figure 23 outlines the fundamental limits of the CC(R) chart: Figure 23: Limits and lines of the CC(R) control chart UCL = upper control limit, LCL = lower control limit, CL = central line. For the CC( x ) chart, the limits are calculated as follows:: 2LCL x A R= − , 2UCL x A R= + , CL x= , Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 151 - where 1 1 m i i x x m = = ∑ is the average of the individual sample averages from different time points. Further, 1 1 m i i R R m = = ∑ , is the average of the individual ranges from different time points. The limits for the CC(R) chart are: 4UCL D R= , 3LCL D R= , CL R= . The constants A2, D3 and D4 are in table 68 (the table has zeros where the data’s missing). Table 68: Constants for LCL and UCL limits of control charts Source: author’s Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 152 PROBLEM 2 Calculate the limits UCL, LCL and CL of CC( x ) and CC(R) if the following data is given Table 69: Data samples for CC( )Rx, Source: author’s SOLUTION Control chart CC( x ): 1 1 9,994 m i i x x m = = =∑ . 1 1 1,271 m i i R R m = = =∑ . 2 9,994 0,720 1,271 0,915LCL x A R= − = − ⋅ = . 2 9,994 0,720 1,271 10,909UCL x A R= + = + ⋅ = . 9,994CL x= = . Control chart CC(R): 4 2,282 1,271 2,9UCL D R= = ⋅ = . 3 0 1,271 0LCL D R= = ⋅ = . 1,271CL R= = . Subsequently, for each point in time, the corresponding individual average would be plotted on the vertical axis of the control chart CC( x ), the time point being plotted on the horizontal axis of the chart, and the same is true for the control chart CC(R), in which the individual range corresponding to the particular point in time would be plotted on the vertical axis. In the end, the charts are evaluated: the basic rule is that none of the points plotted in the chart is either above the UCL limit or below the LCL limit of the chart. If it does happen, the process must be checked as to what occured in the process at the moment when the point exceeded the chart limits. 
It is quite probable that something unnatural or systematic interfered the natural course of the process. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 153 - SUMMARY The chapter presented the total quality cost function, and also served as an introduction to control charts. The cost function was not derived mathematically, it was defined, based on experience of Genuchi Taguchi, a Japanese engineer. If production process is not controlled each time its product unit is made, it is convenient to figure out how often the control should be carried out and how precise it should be so that it didn’t cost too much. The total quality cost function can be used for these purposes. Control charts are a fundamental tool for statistical process regulation. Their objective is to monitor process characteristics, and signal that anomalies occur in the process. There are different types of control charts, depending on the nature of the process characteristic observed. The reader became acquainted with two basic control charts: one of them monitors time development of the average value of the process characteristic, while the other records time development of the range of the same characteristic. What follows is a problem related to the total quality cost functions. PROBLEM 3 A pressing machine produces a set of 8 pressed parts at once. It cost 0,5 crowns to produce each of these parts. The factory controls the pressing by checking the whole set, and if one of the parts is faulty, the whole set is scrapped, the machine is stopped and adjusted at a cost of 70 crowns. The tolerance defined by the customer for the size of each part is 4, while the factory has its own tolerance of 10. Four hundred and eighty parts are produced every hour, and the number of working hours is 2000 per year. The control of the machine lasts 2 minutes, and it costs 10 crowns. Imprecision of the measuring equipment used during the control is not considered here. The time interval between two controls is 4 hours in average. We are to calculate the total quality costs, and determine the optimal control regime, including its contribution to the quality cost reduction. SOLUTION The parameters are: A = 8·0,5 = 4 crowns B = 10 crowns C = 70 crowns Do= 4 d = 10 no = 480 units 162. 60 480 ==z uo = 4.480 = 1920 units Inserting uo, no and Do in 11-2, we have the current quality costs of 2 2 0 2 2 10 70 4 4 4 480 1 4 ( 16) 0,356 480 1920 10 3 10 2 1920 L + = + + + + ⋅ = crowns per unit. To optimize the control, we have Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 154 a. According to 11-3 * 2 2 1920 10 10 244,9 240 4 4 o o u B d n A D ⋅ ⋅ = = = ≈ units, i.e. the control should be performed approximately every 30 minutes. b. Based on 11-4, the optimal tolerance is 2 2 2 2 * 44 3 3 70 4 10 2,57 2 4 1920 o o CD d D Au ⋅ ⋅ ⋅ = = = ≈ ⋅ . c. Cost savings: *2 * *2 * 2 2 2 2 2 2 1 ( ) 3 2 10 70 4 2 4 240 1 2 ( 16) 0,287 crowns per unit. 240 480 10 3 10 2 480 B C A D A n D L z n u d d u L + = + + + + ⋅ + = + + ⋅ + + ⋅ = The cost reduction is =− LL0 0,356 – 0,287 = 0,069 crowns per unit, that is 0,069·480·2000 = 66 240 crowns in savings per year. CONTROL TEST 11 Yes/No answers: 11.1 If a production process is controlled each time its product unit is made, the total quality costs are 2 02 Q A L s R d = + ? 11.2 The fundamental formula for quality cost evaluation is mathematically derived? 
11.3 When working with a process characteristic, we are interested in how well the characteristic clings to its target value and how it fluctuates around the target value? 11.4 The range R of a data sample is calculated as maxR x x= − ? 11.5 CC( )Rx, represents a chart for the average and a chart for the range of a process characteristic? Complete the statement: 11.6 In equation 2 02 Q A L s R d = + , 2 0s =__________. 11.7 If process control is not performed each time a product unit is made, we are interested in how often __________ and __________. 11.8 __________ __________ is a main tool for statistical process regulation. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 155 - 11.9 Two elementary control charts are: for the average and __________; and for the average and __________ __________. SOLUTIONS 11.1 yes 11.2 no 11.3 yes 11.4 no 11.5 yes 11.6 ( ) ( ) ( )[ ]2 1 2 23 2 12 2 0 ... 1 1 −−++−+− − = nn yyyyyy n s 11.7 to control, how precisely to control 11.8 control charts 11.9 range; standard deviation. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 156 - CONCLUSION This textbook has presented selected but frequently used statistical methods which include procedures applied in industry. The logic of the text was based on the fact that industrial procedures draw on statistical terms and techniques, which means, the terms and techniques have to be presented before they can be applied either in industry or other sectors of economy. Classical statistical methods include, but are not restricted to, regression, correlation analysis, hypothesis testing, time series analysis, analysis of variance, and descriptive statistics. These topics were covered in chapters 1-8, whereas the remaining chapters described the fundamentals of the design of experiments, which is closely related to regression, Taguchi’s loss functions and control charts. The structure of the text followed the well-established scheme according to which the subject matter explained is accompanied by examples, and the end of the chapter presents relevant questions. The textbook best serves as an outline of major statistical methods, providing the reader with main ideas and principles of the methods. The extent and depth of the presented topics comply with the subject matter contained in the course Statistical methods for economists, taught at the Faculty of Business Administration of the Silesian University. There are other literary resources, as well, which cover each topic of this textbook. These resources focus specifically only on some of the methods, and therefore elaborate the ideas behind the methods further, as compared to what is presented in this textbook. The reader is encouraged to examine other external scholarly texts, as well. Some of the literary sources are listed on the following page. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 157 - REFERENCES [1]ANTONY, J.: Design of Experiments for Engineers and Scientists, 8th edition, Butterworth-Heinemann, 2003, ISBN: 0-7506-4709-4. [2] BISSELL, B.: Statistical methods for SPC and TQM. 1.vyd. London: Chapman and Hall, 1994, ISBN 9780412394409. [3] BRÄNNMARK M., LANGSTRAND J., JOHANSSON S., HALVARSSON A., ABRAHAMSSON L., WINKEL J.: Researching Lean: Methodological Implications of Loose Definitions, Quality, Innovation, Prosperity, Vol. XVI/2-2012, p. 35-48, ISSN 1335-1745, DOI: 10.12776/qip.v16i2.67. [4] McClave, J., SINCICH, T.: Statistics, 12th edition, Pearson Education Ltd., 2014, ISBN: 978-1-292-02265-9. 
[5] ROY, R.K.: Design of Experiments Using Taguchi Approach, 1st edition, John Wiley and Sons, 2001, ISBN: 0-471-36101-1. [6] SIVA V.: Improvement in Product Development: Use of Back-End Data to Support Upstream Efforts of Robust Design Methodology, Quality, Innovation, Prosperity, Vol. XVI/2-2012, p. 84-102, ISSN: 1335-1745, DOI: 10.12776/qip.v16i2.65. [7] TAGUCHI G., CHOWDHURY, S., WU, Y.: Taguchi’s Quality Engineering Handbook, 1st edition, John Wiley and Sons, 2005, ISBN: 0-471-41334-8. [8] TOŠENOVSKÝ, J., NOSKIEVIČOVÁ, D.: Statistické metody pro zlepšování jakosti. 1.vyd. Ostrava: Montanex, a.s., 2001, ISBN 80-7225-040-X. [9] TOŠENOVSKÝ, J., DUDEK, M.: Základy statistického zpracování dat.1.vyd. Ostrava: VŠB, 2001, ISBN 80-248-0006-3. [10] WITTE, R.S., WITTE, J.S.: Statistics, 9th edition, John Wiley and Sons, 2010, ISBN: 978-470-39222-5. [11] ZGODAVOVÁ, K., BOBER, P.: An Innovative Approach to the integrated Management System Development: SIMPRO-IMS Web-Based Environment, Quality, Innovation, Prosperity, Vol. XVI/2-2012, p. 59-70, ISSN: 1335-1745, DOI: 10.12776/qip.v16i2.69. [12] ZGODAVOVÁ, K.: Complexity of Entities and its Metrological Implications, Proceedings of the 21st International DAAAM Symposium, p. 365-367, 2010, ISSN: 1726- 9679. Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 158 APPENDIX 1 – TABLE FOR DURBIN-WATSON’S TEST Table for Durbin – Watson’s test: alpha = 1%, dL = lowel limit, dU = upper limit, n = sample size, k′ = number of model regressors without the absolute term. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 159 - Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 160 Table for Durbin – Watson’s test: alpha = 5%, dL = lower limit, dU = upper limit, n = sample size, k′ = number of model regressors without the absolute term. Filip Tošenovský;STATISTICAL METHODS FOR ECONOMISTS - 161 - Filip Tošenovský; STATISTICAL METHODS FOR ECONOMISTS - 162 Název: Statistical Methods for Economists Autor: Ing. Filip Tošenovský, Ph.D. Vydavatel: Slezská univerzita v Opavě Obchodně podnikatelská fakulta v Karviné Určeno: studentům SU OPF Karviná Počet stran: 162 Vydání: on-line ISBN: 978-80-7510-033-7