Charles University in Prague
Faculty of Social Sciences
Institute of Economic Studies
RIGOROUS DIPLOMA THESIS
On the predictability of Central European stock
returns
“Do Neural Networks outperform modern econometric techniques?”
Author: Mgr. Jozef Baruník
Supervisor: PhDr. Filip Žikeš
Academic year: 2005/2006
Declaration:
Hereby I declare that I elaborated this diploma thesis on my own, and that
the only literature and sources I used are those listed in the references.
July the 14th, 2006
Author’s signature
ABSTRACT
In this thesis we apply neural networks, as nonparametric and nonlinear
methods, to modelling the returns of Central European stock markets (Czech,
Polish, Hungarian and German). In the first two chapters we define the
prediction task and link classical econometric analysis to neural networks. We
also present the optimization methods used in the tests: conjugate gradient,
Levenberg-Marquardt, and an evolutionary search method. Further on, we present
statistical methods for comparing the predictive accuracy of non-nested
models, as well as measures of economic significance. In the empirical tests we
first demonstrate the power of neural networks on the Mackey-Glass chaotic time
series, followed by real-world data: the daily and weekly returns of the
mentioned stock exchanges over the 2000–2006 period. We find that neural
networks have significantly lower prediction error than classical models for
the daily DAX series and the weekly PX50 and BUX series. Lags of the time
series were used, and cross-country predictability was also tested, but the
results were not significantly different. We also achieved economic
significance of predictions for both daily and weekly PX50, BUX and DAX, with
60% directional accuracy. Finally, we use a neural network to learn the
Black-Scholes model and compare the pricing errors of the Black-Scholes and
neural network approaches on the European call warrant on CEZ. We find that
networks can be used as an alternative pricing method, as they were able to
approximate the market price of the call warrant with significantly lower error
than Black-Scholes itself. Our last finding is that the Levenberg-Marquardt
optimization algorithm used with evolutionary search provides significantly
lower errors than conjugate gradient or gradient descent.
Keywords: emerging stock markets, predictability of stock returns, neural
networks, optimization algorithms, derivative pricing using neural networks
JEL classification: C22, C32, C45, C53, E44, G14, G15
ABSTRAKT (in Czech)
In this thesis, neural networks are applied as a nonparametric, nonlinear
modelling method to the Central European markets (Czech, Polish, Hungarian and
German). The first two chapters define forecasting in the context of classical
econometric analysis and link it to neural networks. The optimization methods
used in the tests are then presented (conjugate gradient, Levenberg-Marquardt
and genetic algorithms), followed by statistical methods for comparing the
predictive accuracy of different models and their economic significance.
The empirical part first demonstrates the performance of a neural network on
the Mackey-Glass chaotic time series. It is followed by an analysis of real
daily and weekly series of the Central European indices for the 2000–2006
period, where it is shown that neural networks predict the daily DAX returns
and the weekly PX50 and BUX returns from the time series of historical returns
with significantly lower error than the other econometric methods. Similar
results were achieved when predicting a national return from the lagged returns
of at least one of the other indices. It is also shown that a neural network
achieved economic significance of the predictions of both daily and weekly
PX50, BUX and DAX returns. The directional accuracy of the forecasts of the
tested series is around 60%, which we consider a good result. In the last
chapter, a neural network is used to price the European call warrant on CEZ
using the time series of historical prices. It is shown that the network can
also be used as an alternative pricing method, since it approximates the
market price better than the Black-Scholes model. The final tests showed that
the Levenberg-Marquardt optimization method used with a genetic algorithm
exhibits significantly lower estimation errors than the other methods.
Keywords: stock returns and their prediction using neural networks,
optimization algorithms, derivative pricing using neural networks
JEL classification: C22, C32, C45, C53, E44, G14, G15
Contents
INTRODUCTION
CHAPTER 1 STOCK RETURNS PREDICTABILITY USING MODERN ECONOMETRIC METHODS
1.1 PROPERTIES OF STOCK RETURNS TIME SERIES
1.2 EFFICIENT MARKET HYPOTHESIS
1.2.1 Martingale model
1.2.2 Random Walk model
1.3 DEFINITION OF THE PREDICTION TASK
1.4 LINEAR REGRESSION MODELS
1.4.1 Classical regression model
1.4.2 Autoregressive model
1.4.3 The ARIMA(p,1,q) model
1.5 GARCH MODELS
CHAPTER 2 NEURAL NETWORKS
2.1 THE METHODOLOGY PROBLEMS
2.2 WHAT IS A NEURAL NETWORK?
2.2.1 Feedforward networks
2.2.2 Transformation functions – log-sigmoid, tansig and Gaussian
2.3 MULTI-LAYERED FEEDFORWARD NETWORKS
2.4 LEARNING ALGORITHMS
2.4.1 Stochastic gradient descent backpropagation learning algorithm
2.4.2 Conjugate gradient learning algorithm
2.4.3 Levenberg-Marquardt learning algorithm
2.5 THE NONLINEAR ESTIMATION PROBLEM
2.5.1 Stochastic evolutionary search
2.5.2 Hybrid learning as a solution?
2.6 PREPROCESSING THE DATA
2.6.1 Curse of dimensionality
2.6.2 Principal Component Analysis
2.6.3 Nonlinear principal components using neural networks
2.6.4 Stationarity: Dickey–Fuller test
2.6.5 Data scaling
2.7 EVALUATION OF ESTIMATED MODELS
2.7.1 Normality
2.7.2 Goodness of fit
2.7.3 Schwarz Information Criterion
2.7.4 Q-statistics
2.7.5 Root mean squared error statistic
2.8 STATISTICAL COMPARISON OF PREDICTIVE ACCURACY
2.8.1 Optimal forecast under different loss functions
2.8.2 Diebold-Mariano test
2.9 ECONOMIC SIGNIFICANCE TESTS
2.9.1 The Henriksson-Merton measure
2.9.2 The break-even transaction costs
2.9.3 Pesaran and Timmermann nonparametric market timing
2.10 BLACK-BOX CRITICISM
2.11 CONCLUDING REMARKS
CHAPTER 3 APPLICATION TO CENTRAL EUROPEAN STOCK MARKET RETURNS MODELLING
3.1 EXAMPLE OF A MACKEY-GLASS ARTIFICIAL SERIES
3.2 EUROPEAN STOCK MARKETS
3.2.1 Data description
3.2.2 Empirical results – daily returns
3.2.3 Empirical results – weekly returns
3.3 PX50: GAINING THE PREDICTIVE EDGE
3.3.1 Cointegration of the BUX, WIG, DAX and PX50 markets
3.3.2 Cross-market predictions
3.4 CONCLUDING REMARKS
CHAPTER 4 APPLICATION TO PRICING DERIVATIVES
4.1 THEORETICAL FRAMEWORK PROPOSED BY BLACK AND SCHOLES
4.2 NEURAL NETWORK APPROACH TO DERIVATIVES PRICING
4.3 PRICING OF THE CEZ CALL WARRANT
4.3.1 The data
4.3.2 Learning the Black-Scholes formula
4.3.3 Performance of the neural network in warrant pricing
4.4 CONCLUDING REMARKS
CONCLUSION
APPENDIX A: DISTRIBUTION OF THE MACKEY-GLASS SERIES
APPENDIX B: OLS ESTIMATION RESULTS
REFERENCES
Acknowledgments
First and foremost, I would like to thank Filip Žikeš from the Faculty of
Social Sciences, Charles University, for his guidance, many useful suggestions
and valuable comments, and for supervising my work on this thesis. I also owe a
great deal to the people at Brokerjet a.s. (Prague) for giving me the chance to
understand market behavior from the “inside”, especially to Petr Ondřej and
Tomáš Provazník for various discussions on trading issues over the past three
years.
Last, but not least, I would like to thank my parents for their never-ending
love and support.
Introduction
“One of the earliest and most enduring questions of financial econometrics
is whether financial asset prices are forecastable. Perhaps because of the obvious
analogy between financial investments and games of chance, mathematical
models of asset prices have an unusually rich history that predates virtually every
other aspect of economic analysis. The fact that many prominent mathematicians
and scientists have applied their considerable skills to forecasting financial
securities prices is a testament to the fascination and the challenges of this
problem. Indeed, modern financial economics is firmly rooted in early attempts to
“beat the market”, an endeavor that is still of current interest, discussed and
debated in journal articles, conferences, and at cocktail parties!”
Campbell, Lo and MacKinlay (1997), p.27
Life must be understood looking backwards, but must be lived looking
forward. The past is helpful for predicting the future, but we have to know
which approximating models to use, in combination with past data, to predict
future events. Žikeš (2003) finds that European stock returns do not follow a
random walk, and thus contain predictable components, and presents modern
econometric techniques that help us uncover part of the pattern. We would like
to link these methods with neural network research and provide a useful bridge
which is lacking in most of the literature. This thesis is an extension of
previous work on the predictability of Central European stock market returns,
presenting the neural network approach to the problem.
On the basis of the universal approximation theorem, we use neural
networks in the hope that they will improve the prediction task, as they are
able to approximate any function, as Hornik, Stinchcombe, and White (1989)
show. Thus, we will compare the results of econometric modelling and neural
network modelling to see whether neural networks bring us closer insight into
the patterns of stock returns or not. The reader shall see that the neural
network is a very useful nonparametric econometric technique. Criticism arises
mainly from the fact that, because neural networks drew their motivation from
biological phenomena, from the physiology of nerve cells, they have become part
of a separate literature (see Hertz, Krogh and Palmer (1991), Hutchinson, Lo,
and Poggio (1994), Poggio and Girosi (1990), and White (1988), resp. (1992),
for an overview). We also take up this discussion in this thesis. The structure
is as follows:
We start with the theoretical framework of stock return predictability in
the first chapter, where we present the Efficient Market Hypothesis, define the
prediction task, and present linear regression models and GARCH modelling.
In the second chapter we move on to neural networks. We first discuss
methodology problems to avoid confusion, then present the basic forms of
networks and transformation functions, which will be tested in the next two
chapters. We also discuss the optimization methods used, which are the most
important part. Starting with quasi-Newton stochastic gradient search, through
conjugate gradient and Levenberg-Marquardt, we get to stochastic evolutionary
searches and discuss the nonlinear estimation problem. At the end of the
chapter we pay attention to the evaluation of estimated models, and to
statistical methods of predictive accuracy and economic significance. We close
the chapter with a discussion of the black-box criticism, where we comment on
its irrelevance.
In the third chapter we apply the presented methods to Central European
stock market returns. We start with modelling the Mackey-Glass chaotic time
series to show how neural networks perform on artificial data. On the basis of
the universal approximation theorem, we expect the neural network to
approximate the process very well. We will also compare it to the common
techniques presented in the first chapter to illustrate the power of the
networks. In the rest of the chapter we model the PX50, BUX, DAX and WIG daily
and weekly returns. On in-sample and, more importantly, out-of-sample criteria
we test classical autoregressive models, ARIMA(p,1,q) and GARCH against neural
networks. For the comparison we use the statistical tests described in the
theoretical part, as well as tests of the economic relevance of the prediction
model.
In the last chapter we examine the use of neural networks in derivatives
pricing. If the price of a derivative is determined by the Black-Scholes
formula, a neural network can be used to estimate the Black-Scholes formula
with a sufficient degree of accuracy. If the assumptions of the Black-Scholes
model are violated, neural networks can be used as better and more efficient
derivative pricing models. We follow this analysis as the logical implication
of the findings in the third chapter: since the assumptions of Black-Scholes,
such as the lognormal distribution of stock prices, geometric Brownian motion,
constant volatility and frictionless markets, are unrealistic, we expect the
neural network to be able to price derivatives more efficiently. We conduct the
empirical analysis on the European call warrant on CEZ, the second most liquid
security on the Czech stock market. The methodology is simple. First we test
whether the neural network is able to approximate Black-Scholes on artificial
data for the call warrant on CEZ. Then we use real market prices and test
whether neural networks can be used as a nonparametric derivative pricing
method more effectively than Black-Scholes itself.
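For reference, the Black-Scholes price of a European call, which the network is trained to approximate, can be sketched in a few lines of Python. This is an illustrative sketch only; the parameter values below are hypothetical and are not the actual terms of the CEZ warrant.

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF expressed via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call on a non-dividend-paying asset."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# Hypothetical at-the-money example: spot 100, strike 100, half a year
# to maturity, 3% risk-free rate, 25% annualized volatility.
price = bs_call(S=100.0, K=100.0, T=0.5, r=0.03, sigma=0.25)
print(round(price, 2))
```

A network trained on inputs (S/K, T) against such formula prices is the "artificial data" experiment; the real-data experiment replaces the formula prices with observed market prices of the warrant.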
The thesis concludes with a summary of the empirical results we achieve and
suggestions for further research.
Chapter 1
Stock returns predictability using
modern econometric methods
The predictability of stock returns has been attracting the attention of many
academics and professionals for a long time1. It concerns forecasting future
returns from past (observed) returns, as well as cross-sectional forecasting
from other financial or macroeconomic variables2 that relate to the returns.
The basic assumption is that history tends to repeat itself, meaning that past
patterns of price behavior in individual stocks will tend to repeat in the
future. Thus the way to predict future returns is to uncover those patterns.
The economic rationale for doing so is very strong: abnormal returns. At first
glance, the problem seems simple. All we need is the historical prices of the
returns we want to forecast, and “user-friendly” econometric software which
will do the work for us and recognize the patterns in the data. Costs are
negligible even for a common investor, and the possible results of correctly
modeled returns are very attractive.
This chapter outlines commonly used techniques for time series prediction,
and presents enhanced modern econometric methods for modelling time series and
detecting the presence of regular patterns. Although it presents most of the
important concepts and introduces the reader to the problem, it serves just as
an introductory chapter to the main concept presented in this thesis – neural
networks.
1 Campbell, Lo and MacKinlay (1997) can be used to find references addressing
almost any question of the problem. See also Hellstrom and Holmstrom (1998),
Hawawini and Keim (1993).
2 The main references for this research are Fama and French (1988, 1989, 1990),
Chen, Roll and Ross (1986), and Barro (1990).
The organization is as follows. First, the Efficient Market Hypothesis, the
idea which stands at the beginning of this research, is presented in its three
forms in (1.2). The martingale and random walk processes help to close the
basic framework of stock return predictability. In (1.4) we present classical
linear regression modelling, along with the more general autoregressive and
ARIMA(p,1,q) models. Subchapter (1.5) follows with an exploration of nonlinear,
time-varying models which stand on generalized autoregressive conditional
heteroskedasticity, GARCH.
1.1 Properties of stock returns time series
First of all, we present the basic properties of stock returns as
motivation. All of the problems will be discussed in detail in the following
subchapters, where the reader can find references. Statistical and
distributional properties (i.e. heavy tails) will also not be mentioned here,
as we will discuss them further in the empirical testing of the presented
models. This part should only serve as an essential introduction to the basic
concepts of stock return predictability.
i) Stock return time series often behave nearly like a random walk process,
which means that from a theoretical point of view there are no predictable
regular patterns. The predictability of stock returns has also been questioned
within the scope of the efficient market hypothesis.
ii) Statistical properties of the time series are different at different points
in time.
iii) Financial time series are very noisy, meaning that there is a large amount
of random day-to-day variation.
1.2 Efficient market hypothesis
The efficient market hypothesis (EMH) has been one of the most
important concepts in modern financial theory, as it has found broad
acceptance3. As summarized by Fama (1970), “a market in which prices always
‘fully reflect’ available information is called ‘efficient’.” As Campbell, Lo
and MacKinlay (1997) remark, the quotation marks around ‘fully reflect’
suggest that the formulation needs to be explained in detail. Malkiel (1987)
expands Fama’s definition with the idea of judging the efficiency of a market
by measuring the profits that can be made by trading on the available
information. He writes: “If the market is efficient, it is impossible to make
economic profit by trading on the information.” Thus if the current price
reflected all information available in the market, no prediction of future
changes would be possible. As new information enters the market, it is
immediately reflected and a new market price develops. Depending on the type
of information set, Roberts (1967) distinguishes:
3 Anthony and Biggs (1995), Malkiel (1987), White (1998)
• Weak-form efficiency: The information set includes only the history of
the prices or returns themselves. In other words, technical analysis4 is
of no use.
• Semi-strong-form efficiency: The information set includes all publicly
available information known to all market participants. In other words,
fundamental analysis5 is of no use.
• Strong-form efficiency: The information set includes all privately
available information known to any market participant. In other words,
even insider information is of no use.
As this work considers stock return predictability, we will work only with
weak-form efficiency, which allows us to hope that we will be able to predict
future returns from past ones.
1.2.1 Martingale model
The martingale model was perhaps the earliest financial asset pricing
model, which grew out of the history of games of chance and probability theory.
Girolamo Cardano (1565) proposed that the “most fundamental principle of
4 Technical analysis is based on creating various basic indicators, such as
trendlines, support and resistance, volatility, momentum indicators etc., from
past prices and volume. The indicators are used to produce trading (buy/sell)
signals or rules. This is done mainly graphically, by comparing the price and
a trading rule.
5 Fundamental analysis is mainly based on the financial analysis of a company’s
value, aiming at profitability, efficiency and the true value of the company’s
stock.
gambling is equal conditions.” Thus, by the means of a fair game, the
stochastic process $\{P_t\}_{t=0}^{\infty}$ satisfies the following condition:

$$E\left[P_{t+1} \mid \mathcal{F}_t\right] = P_t, \qquad (1.1)$$

where $P_t$ is the stock price at time $t$ and is $\mathcal{F}_t$-measurable,
and $E[P_{t+1} \mid \mathcal{F}_t]$ is the conditional expectation defined on
the probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}, P)$, where
$\Omega$ is the space of market situations, $\mathcal{F}$ is the
$\sigma$-algebra of the subsets of $\Omega$, $\{\mathcal{F}_t\}$ is the usual
filtration, $\mathcal{F}_t = \sigma\{P_t, P_{t-1}, \ldots, P_1\}$, which is
also called the information set, and $P$ is a probability measure on
$\mathcal{F}$. Then tomorrow’s price is expected to be equal to today’s price,
given the historical prices as the information set. The martingale hypothesis
implies that the expected return is zero, as:

$$E\left[P_{t+1} \mid \mathcal{F}_t\right] = P_t + E\left[r_{t+1} \mid \mathcal{F}_t\right], \qquad (1.2)$$

or, if equation (1.1) holds,

$$E\left[P_{t+1} - P_t \mid \mathcal{F}_t\right] = E\left[r_{t+1} \mid \mathcal{F}_t\right] = 0, \qquad (1.3)$$

where $r_t$ is the stock price change. The reader should note that the
martingale hypothesis implies that price changes are uncorrelated at all lags.
Increments in value (changes in price) are unpredictable conditional on the
information set, which is fully reflected in prices. Hence any attempt at
linear or nonlinear forecasting rules is ineffective, as

$$\mathrm{Cov}\left[f(r_t),\, g(r_{t+j})\right] = 0, \qquad (1.4)$$

where $f(\cdot)$ and $g(\cdot)$ are two arbitrary functions
$f, g: \mathbb{R} \to \mathbb{R}$, and $r_t$ and $r_{t+j}$ are stock price
changes, or returns, in two periods, for all $t$ and $j \neq 0$.
In fact, the martingale was considered to be a necessary condition for an
efficient market. Roberts (1967) considers it to be weak-form market
efficiency. The main drawback of the martingale model is that it does not allow
a trade-off between risk and expected return. If the expected return were zero,
no one would invest in the security. It has been shown that the martingale is
neither a necessary nor a sufficient condition for rational markets6.
6 See, e.g., LeRoy (1973)
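The orthogonality condition (1.4) is easy to illustrate numerically. The following Python sketch (our own illustration, not part of the thesis's empirical work) simulates iid returns, for which (1.4) holds for arbitrary f and g, and checks that the sample covariance of two nonlinear transformations at a nonzero lag is close to zero:

```python
import numpy as np

# Simulate iid returns; an iid sequence is a martingale difference, so
# Cov[f(r_t), g(r_{t+j})] = 0 for any functions f, g and any lag j != 0.
rng = np.random.default_rng(0)
r = rng.standard_normal(100_000)

def lagged_cov(f, g, r, j):
    """Sample covariance of f(r_t) and g(r_{t+j}) for lag j >= 1."""
    x, y = f(r[:-j]), g(r[j:])
    return float(np.mean((x - x.mean()) * (y - y.mean())))

# Even nonlinear transformations of past returns carry no information
# about future returns: the sample covariance is statistically zero.
c = lagged_cov(np.tanh, np.square, r, j=1)
print(abs(c) < 0.05)
```

With 100,000 observations, the sampling error of the covariance estimate is of order 0.003, so a value far from zero would signal a predictable pattern.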
1.2.2 Random Walk model
The martingale model given by (1.1), resp. (1.2), can be rewritten equivalently
as

$$P_{t+1} = P_t + \epsilon_t, \qquad (1.5)$$

where $\{\epsilon_t\}$ is a martingale difference sequence. In this form, it is
nearly identical to the random walk model, the forerunner of the theory of
efficient capital markets. The martingale, however, is less restrictive than
the random walk: it requires only independence of the conditional expectation
of price changes from the information available. The random walk model
requires, furthermore, independence involving the higher conditional moments of
the probability distribution of price changes.
Campbell, Lo and MacKinlay (1997) distinguish between three versions of
the random walk hypothesis. The simplest one is Random Walk 1, or RW1, with
independently and identically distributed (iid7) increments, in which the
dynamics of $\{p_t\}$8 are given by:

$$p_t = \mu + p_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \mathrm{IID}(0, \sigma^2), \qquad (1.6)$$

where $\epsilon_t$ is a random variable with zero mean and variance $\sigma^2$,
and $\mu$ is the expected price change, or drift. The conditional mean and
variance are linear functions of time9, which implies that the random walk is
nonstationary. We will assume that the natural logarithm of prices follows a
random walk with iid increments, to avoid the problem of the limited liability
of stock returns: if $\{P_t\}$ were normally distributed, there would always be
a positive probability of $P_t < 0$, which is unrealistic.
The random walk is thus a sufficient, but not necessary, condition for market
efficiency in its weak form. Hence rejecting the null hypothesis H0 that stock
returns follow a random walk does not imply market inefficiency. The second
version, RW2, relaxes the identical distribution assumption, which allows
time-varying, unconditional volatility. RW1 is thus a special case of RW2,
which contains more general price processes and allows for unconditional
heteroskedasticity10. RW3 is an even more general version – the one most often
tested in the literature – which relaxes the independence assumption and
includes price processes with dependent but uncorrelated increments. Lo and
MacKinlay (1988) explore simple random walk tests in detail. We will not
describe the tests here, as the reader can follow the reference if needed.
7 From this point, iid will be used as the standard notation for an
independently and identically distributed variable.
8 Continuously compounded returns $r_t = p_t - p_{t-1}$, where
$p_t = \ln P_t$ is the natural logarithm of the price.
9 $E[p_t \mid p_0] = p_0 + \mu t$, $\mathrm{Var}[p_t \mid p_0] = \sigma^2 t$.
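To make footnote 9 concrete, the RW1 dynamics in (1.6) can be simulated directly. The sketch below (our own illustration; the drift and volatility values are arbitrary, not estimates for any of the studied indices) generates many log-price paths and confirms that the conditional mean and variance grow linearly with t:

```python
import numpy as np

# RW1 for log prices: p_t = mu + p_{t-1} + eps_t, eps_t ~ IID(0, sigma^2),
# so E[p_t | p_0] = p_0 + mu*t and Var[p_t | p_0] = sigma^2 * t.
rng = np.random.default_rng(1)
mu, sigma, t, n_paths = 0.05, 0.2, 250, 20_000

# Each path is the sum of t drifted iid increments starting from p0.
eps = rng.normal(0.0, sigma, size=(n_paths, t))
p0 = 0.0
p_t = p0 + mu * t + eps.sum(axis=1)

print(abs(p_t.mean() - (p0 + mu * t)) < 0.1)  # mean grows linearly in t
print(abs(p_t.var() - sigma**2 * t) < 0.5)    # variance grows linearly in t
```

The linearly growing variance is exactly the nonstationarity mentioned above, and it is why the empirical chapters model returns (first differences of log prices) rather than prices.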
Now that we have discussed the basic idea of stock return predictability,
we can move on to more sophisticated methods, but before we do so, a short
conclusion of the EMH framework is in order. The paradox of efficient markets
is that if every investor believed a market was efficient, then the market
would not be efficient, because the participants would not want to trade, as
they would not expect a profit. In effect, efficient markets depend on market
participants who believe the market is inefficient and trade securities in an
attempt to outperform the market. For a deeper analysis, see Grossman and
Stiglitz (1980).
Although market efficiency is not really testable because of the joint
hypothesis problem11, it provides a basic framework for stock return
prediction. It started the discussion, and non-rejection of the random walk
hypothesis implies that there are no patterns to be found in the stock returns.
Even though we cannot test market efficiency, in reality we find most
markets to be neither perfectly efficient nor completely inefficient. For
evidence, Cambazoglu (2003), Hellstrom and Holmstrom (1998), Lo and MacKinlay
(1988), Žikeš (2003) and many other researchers have found predictable
patterns in various world stock markets and provided evidence that the tested
markets are predictable to some extent. From another point of view, we can say
that all markets are efficient to a certain extent, some more so than others.
“Rather than being an issue of black or white, market efficiency is more a
matter of shades of gray”12. In markets with substantial impairments of
efficiency, more knowledgeable investors can outperform less knowledgeable
ones. Hence abnormal returns, even if small ones, will necessarily exist to
compensate participants for taking their risk, even if predictable patterns
are not found. This debate is the starting point for the predictability models
which will be discussed in the next chapters.
10 In the recent literature the reader can find dozens of pieces of empirical
evidence that returns are conditionally heteroskedastic; e.g. Campbell, Lo and
MacKinlay (1997) contains the references.
11 Any test of efficiency must assume an equilibrium model that defines normal
returns. Rejecting market efficiency implies that the market is truly
inefficient or that an incorrect equilibrium model has been assumed. Hence,
market efficiency as such can never be rejected; Fama (1991).
12 Lo and MacKinlay (1988)
1.3 Definition of the prediction task
The prediction problem can be formulated in various ways. We will restrict ourselves to
defining the stock returns prediction, as it is the primary concern of the thesis,
even if stock prices are not the only financial time series of the general
economist's interest. The general prediction task can be defined as follows:

Let $P_t$ be a random variable defined on a probability space $(\Omega, \mathcal{F}, P)$,
where $\Omega$ is the space of outcomes, $\mathcal{F}$ is a $\sigma$-algebra of subsets of $\Omega$, $P$ is a
probability measure on $\mathcal{F}$, and $\{\mathcal{F}_t\}$ is the usual filtration. The conditional
probability $P(P_{t+1} \in \cdot \mid \mathcal{F}_t)$ is the probability of a set of values of $P_{t+1}$ being evaluated with
the information available in the $\sigma$-algebra $\mathcal{F}_t$.

Now let us assume the following economic agent's utility function:

$u_{t+h} = g\big(w, P_{t+h}, \delta(\hat{P}_{t+h})\big)$, (1.7)

where the agent's utility $u(\cdot)$ depends on the variable $P$ at time $t+h$, the decision
function $\delta(\cdot)$ applied to the forecast $\hat{P}_{t+h}$ with forecasting horizon $h \geq 1$, and $w$ is a reward
variable. For illustration, let us set $h = 1$. At time $t+1$, the agent's utility depends on
the realization of $p_{t+1}$ and the accuracy of its forecast, $\hat{p}_{t+1}$. Forecasting is thus a
major ingredient of the decision rule.
Let $E[P_{t+h} \mid \mathcal{F}_t] = \hat{P}_{t+h} = h(X_t, \theta^*)$ be the expectation of $P_{t+h}$ conditional on
the information set $\mathcal{F}_t$, where $\theta^* \in \Theta$ is an unknown vector of parameters,
$\Theta \subset \mathbb{R}^k$ is compact, and $X_t$ is an $\mathcal{F}_t$-measurable vector of
variables observable at time $t$.

$X_t$ may include lagged values $P_{t-n}$, but also some exogenous variables, indicators,
etc. Thus the reader may note that an optimal forecast by our definition does
not exclude misspecification or failure to include relevant information in $X_t$,
which may have a crucial impact on the predictions. Under this imperfect setting,
the utility function will be negatively correlated with the forecast error, which can be
defined as $\varepsilon_{t+h} = p_{t+h} - \hat{p}_{t+h}$.
Maximizing the utility function requires finding the optimal forecast $\hat{P}^*_{t+h}$ and
establishing the optimal decision $\delta(\cdot)$ based on this forecast. Optimality here can be
achieved by minimizing the expected loss function $L$:

$\hat{P}^*_{t+h} : \arg\min_{\theta} E\big[L(P_{t+h}, X_t, \theta, \alpha) \mid \mathcal{F}_t\big]$, (1.8)

where $\alpha$ is a degree of asymmetry. The reader can find in-depth discussions of
possible error functions and their assumptions in Patton, Timmermann (2004, 2006),
as this general definition of the loss function is sufficient for our definition of the
prediction task. A rigorous discussion of the prediction task can also be found in
Hamilton (1994). For illustration, we define just the optimal forecast for a
loss function which depends only on the forecast errors. This form13 will also be used
further in our tests:
$\hat{P}^*_{t+h} = \min E\big[L(P_{t+h} - \hat{P}_{t+h}) \mid \mathcal{F}_t\big] = \min E\big[L(\varepsilon_{t+h}) \mid \mathcal{F}_t\big]$. (1.9)
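To make eq. (1.9) concrete, the following sketch verifies numerically a standard implication: under squared-error loss the optimal forecast is the (conditional) mean. The return values below are purely illustrative, not data from the thesis.

```python
# Under L(eps) = eps^2, the forecast minimizing average loss equals the
# sample mean. We check this by grid search over candidate forecasts.

def expected_squared_loss(forecast, realizations):
    """Average of L(eps) = eps^2 over the observed realizations."""
    return sum((p - forecast) ** 2 for p in realizations) / len(realizations)

realizations = [0.012, -0.004, 0.007, 0.001, -0.009]  # hypothetical returns

# Grid of candidate forecasts with step 0.0001.
candidates = [i / 10000.0 for i in range(-200, 201)]
best = min(candidates, key=lambda f: expected_squared_loss(f, realizations))

mean = sum(realizations) / len(realizations)
print(round(best, 4), round(mean, 4))  # both 0.0014
```

The same exercise with absolute-error loss would instead pick out the sample median, which is why the choice of loss function matters for what "optimal forecast" means.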
Later, in chapter (2.8) - Statistical Comparison of Predictive Accuracy, we will
present the optimal forecast under different loss functions.

In the next sections we will consider classical linear and nonlinear regression
models as common choices for estimating $E[P_{t+h} \mid \mathcal{F}_t]$, through which we will get
to another possibility, neural network models.
1.4 Linear regression models
Mounting evidence can be found in the literature that stock prices do not
follow a random walk. Lo, MacKinlay (1988) decisively reject the null hypothesis
that weekly U.S. stock returns follow a random walk process. Žikeš (2003) finds
that Central European markets also do not follow a random walk. Filacek et al.
(1998) find that daily returns of the PSE's14 main index PX50 are significantly
positively autocorrelated. In this subchapter we introduce basic linear and
nonlinear regression models, so that the principle of the modern forecasting
techniques can be extended in the next chapters by neural network models.

13 e.g. MSE (mean squared error) and MAE (mean absolute error) have this form
14 Prague Stock Exchange, Czech Republic
1.4.1 Classical regression model

When predicting, we usually start with a linear regression model, where a
given output variable y is predicted from information on a set x of observed
variables. In time series, input variables might include the lagged output variable or
contemporaneous exogenous variables. The model is defined by the following
equation:

$y_t = \sum_{i=1}^{p} \beta_i x_{i,t} + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2)$, (1.10)

where $\varepsilon_t$ is the random disturbance term with $E[\varepsilon_t \mid x_t] = 0$, $\{\beta_p\}$ are parameters to be
estimated, while $\{\hat{\beta}_p\}$ represents the estimated set of coefficients and $\{\hat{y}\}$ denotes
the estimated (predicted) output variable. The main goal is to find $\{\hat{\beta}_p\}$ minimizing
the sum of squared differences, or residuals $\varepsilon$, between the observed variable y
and the model-predicted variable $\hat{y}$. There are various estimation methods15
for the problem:

$\min \sum_{t=1}^{T} \varepsilon_t^2 = \sum_{t=1}^{T} (y_t - \hat{y}_t)^2$, (1.11)

where

$y_t = \sum_{i=1}^{p} \beta_i x_{i,t} + \varepsilon_t, \qquad \hat{y}_t = \sum_{i=1}^{p} \hat{\beta}_i x_{i,t}, \qquad \varepsilon_t \sim N(0, \sigma^2)$.

15 With different assumptions about the distribution of the disturbance term $\varepsilon_t$, about the constancy of
its variance $\sigma^2$, and about the independence of the input variables; the reader can find these
methods in any standard econometrics textbook, e.g. Greene (1993) or Baltagi (2002)
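A minimal illustration of the least-squares criterion in eq. (1.11): for a single regressor with intercept, the minimizing coefficients have the familiar closed form. The data points below are made up for the example.

```python
# OLS for y_t = b0 + b1*x_t + eps_t, minimizing the sum of squared residuals.

def ols_simple(x, y):
    """Return (b0, b1) minimizing sum_t (y_t - b0 - b1*x_t)^2."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    b0 = my - b1 * mx
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]          # roughly y = 2x
b0, b1 = ols_simple(x, y)

# The minimized objective of eq. (1.11): sum of squared residuals.
ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
print(round(b0, 3), round(b1, 3), round(ssr, 4))  # 0.05 1.99 0.107
```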
1.4.2 Autoregressive model

A commonly used linear model which enhances classical regression is the
autoregressive model:

$y_t = \sum_{i=1}^{p} \beta_i y_{t-i} + \sum_{j=1}^{q} \gamma_j x_{j,t} + \varepsilon_t$, (1.12)

where $\varepsilon_t \sim N(0, \sigma^2)$, and where there are q exogenous x variables with
coefficients $\gamma_j$, p lags of the dependent variable y, and $p + q$ coefficients to be
estimated. In the time-series literature this is known as the linear ARX model, since
the autoregressive components are given by lagged y variables and it
incorporates exogenous x variables.
1.4.3 The ARIMA(p,1,q) model

A generalization of the simple Random Walk Model and the Autoregressive Model is
obtained by allowing for serial correlation in the disturbances $\varepsilon_t$. The autoregressive
integrated moving average model, ARIMA(p,1,q), is the most widely applied linear model for
approximating stock returns processes. It puts together three processes for
modelling the serial correlation in the disturbances: AR(p), MA(q) and the
integration order term. The processes are as follows.

The AR(p) process includes p lagged values of the returns in the forecasting
equation for the unconditional residual. An autoregressive model of order p has
the form:

$r_t = \sum_{i=1}^{p} \phi_i r_{t-i} + \varepsilon_t$, (1.13)

or, represented using the lag operator $L$, $\forall n \in \{1, \ldots, p\}: L^n r_t = r_{t-n}$:

$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) r_t = \varepsilon_t$. (1.14)

The second, the integration order term, corresponds to differencing the values
being forecast. In this model the first difference is enough for stationarity
to be achieved. Third, the MA(q) process uses lagged values of the forecast error to
improve the current forecasts. For order q it has the form:

$r_t = \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$, (1.15)

or

$r_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t$. (1.16)

Thus the ARIMA(p,1,q)16 model can be generally represented by:

$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L) r_t = \mu + \left(1 + \sum_{j=1}^{q} \theta_j L^j\right) \varepsilon_t$. (1.17)

A common way to estimate the ARIMA(p,1,q) model was proposed by Box and
Jenkins (1976). The time series is first differenced to achieve stationarity. Then
a guess of p and q is made by inspecting the autocorrelation and partial autocorrelation
functions. Nonlinear least squares or the maximum likelihood method is then applied
to estimate the model, and diagnostic tests are run to see whether the guess of the p and q
orders was appropriate. The Box-Jenkins methodology is widely used and the reader
can find the details in Box and Jenkins (1976).
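As a hedged sketch of the AR step of the Box-Jenkins procedure for p = 1: on a stationary (already differenced) series, the AR(1) coefficient can be estimated by conditional least squares, i.e. regressing $r_t$ on $r_{t-1}$. The series below is simulated, not real market data.

```python
import random

random.seed(42)
phi_true = 0.6                 # true AR(1) coefficient of the simulated series
r = [0.0]
for _ in range(5000):
    r.append(phi_true * r[-1] + random.gauss(0.0, 1.0))

# Conditional least squares: phi_hat = sum(r_t * r_{t-1}) / sum(r_{t-1}^2).
num = sum(r[t] * r[t - 1] for t in range(1, len(r)))
den = sum(r[t - 1] ** 2 for t in range(1, len(r)))
phi_hat = num / den
print(round(phi_hat, 2))       # close to 0.6 for this sample size
```

In practice one would also inspect the residual autocorrelations afterwards, as the diagnostic stage of Box-Jenkins prescribes.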
While choosing p, q in a "let the data speak" manner is attacked by
researchers because it is a process of guessing, the ARIMA model still helps
researchers understand the behavior of stock prices. Linear models may
be of very good use mainly in markets with long-term trends and only
small, symmetric changes in the variable. For volatile markets, however,
nonlinear processes in the returns may come into the researcher's sight. Linear
models may then fail to capture the turning points, bubbles and unexpected
moves in prices. For this reason, we will present nonlinear forecasting
techniques.
1.5 GARCH models

There are many types of nonlinear functional forms to use as
alternatives to linear ones. The main approach is the GARCH-type model17. These
models are based on a main principle of modern finance: risk, which is
related to expected stock returns. To measure the risk of an asset, the
standard deviation of returns from the unconditional mean is used. This measure is
also interpreted as the volatility of stock returns, hence the main use of GARCH

16 Note that ARIMA(0,1,0) is a random walk, which is a special case of this general process.
17 GARCH stands for generalized autoregressive conditional heteroskedasticity. The model was
introduced by Engle (1982), who received the Nobel prize in 2003 for his work on this model, and
generalized by Bollerslev (1986).

models is for volatility prediction. The following describes a general GARCH(p,q)
model:

$r_t = \beta_0 + x_t^T \beta_1 + \varepsilon_t$, (1.18)

$\varepsilon_t \sim N(0, \sigma_t^2)$,

$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2$, (1.19)

where $r_t$ is the rate of return and $\varepsilon_t$ is normally distributed with zero mean and
conditional variance $\sigma_t^2$. The $\alpha$'s and $\beta$'s govern the evolution of the conditional variance.
The condition $\sum_{i=1}^{\max(p,q)} (\alpha_i + \beta_i) < 1$ is imposed so that the unconditional variance is finite,
whereas the conditional variance evolves over time.
For demonstrative purposes we set p, q to 1 and present the GARCH(1,1)
model, which is the most common in financial time series predictions:

$\sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2$. (1.20)
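The recursion in eq. (1.20) is easy to trace by hand. The sketch below runs it forward for a few periods; the parameter values and residuals are chosen purely for illustration.

```python
# GARCH(1,1) conditional variance recursion:
#   sigma2_t = alpha0 + alpha1 * eps_{t-1}^2 + beta1 * sigma2_{t-1}

def garch_variance(eps, alpha0, alpha1, beta1, sigma2_0):
    """Return the conditional variance path sigma2_t for t = 1..len(eps)."""
    sigma2 = [sigma2_0]
    for e in eps:
        sigma2.append(alpha0 + alpha1 * e ** 2 + beta1 * sigma2[-1])
    return sigma2[1:]

eps = [0.5, -1.2, 0.3, 2.0, -0.4]       # hypothetical residuals
path = garch_variance(eps, alpha0=0.1, alpha1=0.1, beta1=0.8, sigma2_0=1.0)

# Volatility clustering: the large shock (2.0) raises next-period variance.
print([round(s, 4) for s in path])
```

Note that alpha1 + beta1 = 0.9 < 1 here, so the finiteness condition on the unconditional variance stated above is satisfied.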
The GARCH-M type model is another useful alternative, as it accounts for
the possibility that returns are dependent on the volatility. In GARCH-M models,
the variance of the disturbance term directly affects the mean of the dependent
variable. Thus it includes volatility in the return equation:

$r_t = \beta_0 + \beta_1 \sigma_t^2 + \varepsilon_t$, (1.21)

$\sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2$. (1.22)

The GARCH-M model is a stochastic recursive system, given the initial
conditions $\varepsilon_0^2$ and $\sigma_0^2$ as well as the parameter estimates. The random shock is drawn from the
normal distribution, hence we can use maximum likelihood estimation. The
likelihood function $\mathcal{L}$ is the joint likelihood of observing $\{y_t\}$ for $t = 1, \ldots, T$ and
has the following form:

$\mathcal{L} = \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi\hat{\sigma}_t^2}} \exp\left(-\frac{(y_t - \hat{y}_t)^2}{2\hat{\sigma}_t^2}\right)$, (1.23)

$\hat{y}_t = \beta_0 + \beta_1 \sigma_t^2$, (1.24)

$\varepsilon_t = y_t - \hat{y}_t$, (1.25)

$\sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2$. (1.26)
The usual method of obtaining the parameter estimates $\{\beta_0, \beta_1, \alpha_0, \alpha_1, \beta_1\}$ is
maximizing the logarithm of the likelihood function with respect to the parameters, under the restriction
that the variance is greater than zero and $\alpha > 0$, $\beta > 0$:

$\max_{\{\beta_0, \beta_1, \alpha_0, \alpha_1, \beta_1\}} \ln \mathcal{L} = -\frac{T}{2} \ln(2\pi) - \frac{1}{2} \sum_{t=1}^{T} \ln \sigma_t^2 - \frac{1}{2} \sum_{t=1}^{T} \frac{(y_t - \hat{y}_t)^2}{\sigma_t^2}$, (1.27)

where $t = 1, \ldots, T$ and $\sigma_t^2 > 0$.
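The log-likelihood in eq. (1.27) can be computed directly once the fitted means and conditional variances are in hand. The sketch below evaluates it for a tiny illustrative sample (the numbers are not from the thesis); a full estimation routine would maximize this value over the parameters.

```python
import math

def garch_loglik(y, y_hat, sigma2):
    """ln L = -T/2 ln(2 pi) - 1/2 sum ln(sigma2_t) - 1/2 sum (y_t - y_hat_t)^2 / sigma2_t"""
    T = len(y)
    ll = -0.5 * T * math.log(2.0 * math.pi)
    for yt, yh, s2 in zip(y, y_hat, sigma2):
        ll -= 0.5 * math.log(s2)
        ll -= 0.5 * (yt - yh) ** 2 / s2
    return ll

y      = [0.01, -0.02, 0.015]           # hypothetical returns
y_hat  = [0.0, 0.0, 0.0]                # fitted means
sigma2 = [0.0004, 0.0005, 0.0006]       # fitted conditional variances
ll = garch_loglik(y, y_hat, sigma2)
print(round(ll, 3))
```

In practice the maximization over the parameter vector is done numerically (e.g. by a quasi-Newton optimizer) because the variance recursion makes the likelihood nonlinear in the parameters.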
What is nice about the GARCH approach is that it captures the source of
nonlinearity. The conditional variance is a nonlinear function of past values; the variance is
a function of past prediction errors. Thus the risk factor in
forecasting the dynamics of asset returns is captured well by the
model. GARCH models are also able to capture a well-observed phenomenon in
stock returns time series, volatility clustering: periods of high volatility are
followed by high volatility, and the same holds for periods of low volatility. Thus we
have a specific set of parameters to be estimated with a well-defined meaning,
interpretation and rationale. But the model is restrictive, because we are limited
to these well-defined sets of parameters, a distribution, and a specific form.

One possibility for reducing this restrictiveness is to follow Bollerslev (1986)
and use his proposed Student's t-distribution, which better fits financial
time series as they are often leptokurtic18 and fat-tailed. Bollerslev and
Wooldridge (1988) also derive the quasi-maximum likelihood estimation method.
An interested reader should look for the details in the mentioned references, as
the main interest of this thesis is neural network models and we just outline the
principles of the modern econometric tools for predicting time series so we can
compare and link them to the neural network approach in the next sections. Even though
it is not the main aim of this thesis, it can also serve to some extent as an
overview of all main methods, linear and nonlinear regression types and also
neural network types. By starting the thesis with this first chapter, where the reader
could find not only the framework for prediction in the form of the EMH and the Random
Walk but also a preview of the approaches, we can do so. After this brief
introductory chapter to the problem, we will continue with the neural networks.

18 A variable is called leptokurtic when the standardized fourth moment, the kurtosis, is higher than 3,
sometimes referred to as excess kurtosis. This also results in "fatter tails" of the density function.
Chapter 2
Neural Networks
Neural network learning methods provide a robust approach to
approximating real-valued, vector-valued, and discrete-valued functions. The
study of artificial neural networks (ANNs) has been inspired by the observation
that biological learning systems are built of very complex webs of interconnected
neurons. ANNs are analogously built webs of interconnected sets of simple units,
whose inputs may be outputs of other units and which produce simple outputs
that may in turn become inputs to other units, Mitchell (1997). The interested reader is
recommended to use the reference for further details, as we will put the neural
networks to use with financial time series, mainly stock returns. By referring to
"neural networks" we will consider mainly research targeting the development of
systems capable of approximating complex functions efficiently and robustly in the
manner of definition (1.3).
The main motivation for using neural networks in predicting stock returns,
or other financial time series, is the same as presented in the first chapter. While
classical econometric models provide us some insights into the behavior of stock
returns, we believe that neural networks will do better. We believe that the
learning process of neural networks will help approximate the learning process of
agents or investors more efficiently, resulting in a better understanding of
stock prices. Contrary to the EMH, several researchers claim the stock market
exhibits chaos19. Chaos is a nonlinear deterministic process which appears
random but cannot be easily expressed. With the neural network's ability to

19 Hsieh (1991), Barkoulas, Travlos (1998), Peters (1994)

learn nonlinear, chaotic systems, it may be possible to outperform the traditional
analysis presented in the previous chapters.
McNelis (2005) shows very good results in predicting artificial data and
chaos processes by neural networks, and shows how artificial intelligence could shed
more light on time-series processes than the econometric tools presented in
the first chapter. He also tests the predictive power of the models on industry data and
inflation, but tests on stock markets and volatility are missing. In the following
chapters, we will follow his and other works with empirical research on Central
European markets, as we believe that emerging markets in particular, or
markets with great innovation and change, represent a great opportunity for the
use of neural networks in the prediction task. The reasons are intuitive:
the data are often very noisy, either because of thinness of the markets or
information gaps from discontinuous trading20. Thus we have to deal with many
asymmetries and nonlinearities whose form cannot simply be assumed. The other reason is
that agents in these markets are themselves in the process of learning, mainly by
trial and error. Often they cannot assess the impact of policy news or legal changes
on the market simply because they have not seen any real examples in their past.
Thus, the information set for the prediction task is very limited. As we will show,
parameter estimates of neural networks are themselves a result of "learning by
mistake" and of a search process, and can be compared to the parameters used by
agents to forecast and make decisions.
In this chapter we will present the theoretical framework of the neural networks
used further in the work for empirical modelling. We begin with methodology
problems, introducing the basic definitions of neural networks, feedforward and
multilayered feedforward neural networks. On the basis of the universal
approximation theorem, these forms can approximate any continuous real
function, as Hornik, Stinchcombe, and White (1989) show. We show that the neural
network is not a black-box instrument by describing transformation functions
and neurons and by defining the system mathematically. Then we follow with a
discussion of the crucial learning algorithms, as tools for optimization in terms of error
minimization. We discuss basic gradient descent search, the more sophisticated
conjugate gradient method, and the Levenberg-Marquardt method, which seems to be
the most efficient. We close the discussion by presenting a stochastic evolutionary
search and discussing the nonlinear estimation problem.

20 Often there are many stocks with no or very low trading volume at these markets

Finally, we turn to the crucial data preprocessing and the testing statistics used for
comparison in the analysis conducted in the following chapters. We introduce
nonlinear Principal Component Analysis as a tool for dealing with the curse of
dimensionality.

After this exhaustive introduction to the neural network estimation procedure,
we close the chapter by attending to the black-box criticism and try to
argue in favor of neural network usage in econometric modelling.
2.1 The methodology problems
Much of the early development and work on neural network analysis has
been within psychology and neuroscience, related to pattern recognition problems.
Genetic algorithms used for the empirical implementation of neural networks have
followed a similar pattern of development in applied mathematics, in the optimization of
dynamic nonlinear and discrete systems, moving into data engineering.

Thus these systems have been developed in different surroundings than
econometric and statistical models, which results in confusion in the literature,
mainly over simple technical and naming conventions. A model is known as
an architecture, and we train rather than estimate the network architecture. A
researcher uses a training set and a test set of data instead of in-sample and out-of-sample
data, and the confusion should disappear whenever the reader expects
coefficients instead of weights.
If we consider the application of neural networks, or Artificial Intelligence
itself, the gap is almost widening. Broad literature on neural networks is simply
not relevant to financial professionals or academics. Also, the mounting publications
and empirical works on the usage of neural networks in finance do not link to the
preceding theoretical financial literature, which is probably the reason why
most of this literature is not taken seriously by the broader financial and
economic academic community. As McNelis (2005) remarks: "The appeal of the
neural network approach lies in the assumption of bounded rationality: when we
forecast in financial markets, we are forecasting the forecasts of others, or
approximating the expectations of others." Thus, market participants are
continuously learning and adapting their beliefs from past mistakes.

The basic idea is that reactions of market participants are not linear and
proportionate, but asymmetric and nonlinear in the changes of variables. Neural
networks approximate this behavior in a very intuitive way, while our definition
from (1.3) still holds. A very important point is approximation through the
learning process. As market agents are continuously learning, the neural network
tries to capture the learning process and build on it. The difference between
neural network models and the presented econometric models is also that
researchers are not making hypotheses about the coefficients to be estimated or
about the functional form of the model. The coefficients, or as mentioned, weights,
cannot be interpreted. In this manner the methodology of prediction is
different, while in econometrics one strives to obtain consistent, accurate,
unbiased estimates of parameters to be interpreted.
2.2 What is a Neural Network?
Like linear or nonlinear methods, a neural network relates a set of input
variables, say $\{x_i\}, i = 1, \ldots, k$, to a set of one or more output variables, say
$\{y_j\}, j = 1, \ldots, k^*$. Let us recall the definition of the stock returns prediction
problem from chapter (1.3). It defines the prediction problem in a very similar
manner. The only difference between a network and other approximation methods
is that the approximating function uses one or more so-called hidden layers, in
which the input variables are squashed or transformed by a special function,
known as the logistic or log-sigmoid transformation. While this approach may seem
"esoteric" or maybe "mystical" at first glance, the reader will soon see that
it may be used as a very efficient way to model nonlinear processes.

The reason we turn to neural networks is straightforward. It is the goal of
the prediction problem to find an approach or method that best forecasts data
generated by unknown, nonlinear processes, with as few parameters as
possible, as simple as achievable and as easy to estimate as it can be.
Even if it seems impossible now, we may be surprised by the findings of the next
chapters. Moreover, it has been shown that "neural networks can approximate
any function with finitely many discontinuities to arbitrary precision"21. This is
known as the universal approximation theorem.
21 Hornik, Stinchcombe, White (1989)
2.2.1 Feedforward Networks
FIGURE 2.1: Feedforward neural network (inputs x1, x2, x3; hidden layer with neurons n1, n2; output y)
The structure of the most basic and most commonly used neural network in finance,
with one hidden layer22 containing two neurons, three input variables and one
output, is shown schematically in FIGURE 2.1. We can see that in comparison with
classical linear models, there are two more neurons which process inputs to
improve the predictions. It should be mentioned here that the connections
between input variables and neurons (also called input neurons) and the connections
between neurons and the output (output neurons) are called synapses.

The reader might note that the simple linear regression model is just a
special case of the feedforward neural network, namely a network with one neuron
which contains a linear approximation function. The simplest example of an
artificial neural network is the binary threshold model of McCulloch and Pitts
(1943), in which an output Y can be either zero or one, related to I input
variables. The model may be formalized as follows23:
$Y = f\left(\sum_{i=1}^{I} \beta_i X_i - \mu\right)$, (2.1)

$f(u) = \begin{cases} 1 & \text{if } u \geq 0 \\ 0 & \text{if } u < 0 \end{cases}$, (2.2)
where $f(u)$ is the activation function of the hidden layer which transforms the inputs
into the neuron; if the weighted sum of inputs is greater than $\mu$, the neuron is
activated. Now we can discuss in detail the most common functional forms of the
"mystic" neurons' work.

22 Sometimes referred to as a multi-perceptron network
23 We include this simple example here because it is a very illustrative connection between classical
regression models and neural network models, and we feel that this connection often goes unexplained
in neural network financial research papers. This results in confusion about, and refusal of,
these approaches.
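The threshold unit of eqs. (2.1)-(2.2) is small enough to write out in full. In the sketch below the weights are chosen by hand (an illustrative choice, not from the thesis) so that the unit realizes the logical AND of two binary inputs.

```python
# McCulloch-Pitts binary threshold neuron: fires (outputs 1) when the
# weighted input sum reaches the threshold mu, as in eqs. (2.1)-(2.2).

def threshold_neuron(x, beta, mu):
    u = sum(b * xi for b, xi in zip(beta, x)) - mu
    return 1 if u >= 0 else 0

# With weights (1, 1) and threshold mu = 2 the unit computes AND:
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, threshold_neuron(x, beta=(1, 1), mu=2))
```

Lowering the threshold to 1 would turn the same unit into logical OR, which is exactly the sense in which the weights, not the functional form, carry the model's content.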
2.2.2 Transformation functions - log-sigmoid, tansig and Gaussian
Maybe most of the confusion about neural networks comes from the presence of the
hidden layer and the function of the neurons. They process inputs by forming linear
combinations of them and then squashing these combinations using the
log-sigmoid function. In this part we will describe these squasher or
transformation functions, but for illustrative purposes we start with the figure
of a typical logistic function, which transforms inputs, say $\{x_i\}, i = -5, \ldots, 5$,
before transmitting their effects to the output.
FIGURE 2.2: Log-sigmoid function
This function reflects the learning behavior of the networks, more precisely
"learning by doing". The function is increasingly steep until the inflection point,
from which it becomes increasingly flat and its slope moves exponentially to zero.
The nonlinear sigmoid function captures a learning process in the formation of
expectations characterized by bounded rationality. Kuan, White (1994) describe
it as the "tendency of certain types of neurons to be quiescent at modest levels of
input activity, and to become active only after the input activity passes a certain
threshold, while beyond this, increases in input activity have little further effect".
The feedforward or multilayered perceptron (MLP) network can be described by the
following equations:

$n_{k,t} = \omega_{k,0} + \sum_{i=1}^{i^*} \omega_{k,i} x_{i,t}$, (2.3)

$N_{k,t} = \Lambda(n_{k,t}) = \dfrac{1}{1 + e^{-n_{k,t}}}$, (2.4)

$y_t = \gamma_0 + \sum_{k=1}^{k^*} \gamma_k N_{k,t}$, (2.5)

where $\Lambda(n_{k,t})$ is the log-sigmoid activation function. There are $i^*$ input variables
$\{x\}$ and $k^*$ neurons; $\omega_{k,i}$ represents the coefficient vector, or input weights vector.
The variable $n_{k,t}$ is squashed by the log-sigmoid function and becomes a neuron $N_{k,t}$
at time t. The set of $k^*$ neurons is then combined linearly with the vector of
coefficients $\{\gamma_k\}, k = 1, \ldots, k^*$, and forms the final output, which is the forecast $\hat{y}_t$. This
model is the workhorse of the neural network forecasting approach, as almost all
researchers start with this network as the first alternative to the linear models.
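A single forward pass through the network of eqs. (2.3)-(2.5) can be sketched in a few lines. The weights below are arbitrary illustrative values, not trained ones, and the three inputs mirror the layout of FIGURE 2.1.

```python
import math

def logsigmoid(n):
    """Eq. (2.4): squash a linear combination into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-n))

def forward(x, omega, gamma):
    """omega[k] = (omega_k0, omega_k1, ..., omega_ki*); gamma = (gamma_0, gamma_1, ...)."""
    neurons = []
    for w in omega:                       # eq. (2.3): bias + weighted inputs
        n_k = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
        neurons.append(logsigmoid(n_k))   # eq. (2.4): squash into a neuron
    # eq. (2.5): linear combination of neurons gives the forecast
    return gamma[0] + sum(g * N for g, N in zip(gamma[1:], neurons))

x = (0.2, -0.1, 0.05)                                     # three inputs
omega = [(0.1, 0.5, -0.3, 0.2), (-0.2, 0.1, 0.4, 0.6)]    # two hidden neurons
gamma = (0.05, 0.7, -0.4)                                 # output weights
print(round(forward(x, omega, gamma), 4))
```

Training, discussed in section 2.4, amounts to adjusting omega and gamma so that this output tracks the target series.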
An alternative to the log-sigmoid activation function is the tansig, or hyperbolic
tangent (tanh), function. Its behavior is very similar to the log-sigmoid
function, but it squashes the linear combinations into the wider interval
$[-1, 1]$ rather than $[0, 1]$. The formalization of the network with tansig squasher
functions is as follows:

$n_{k,t} = \omega_{k,0} + \sum_{i=1}^{i^*} \omega_{k,i} x_{i,t}$, (2.6)

$N_{k,t} = T(n_{k,t}) = \dfrac{e^{n_{k,t}} - e^{-n_{k,t}}}{e^{n_{k,t}} + e^{-n_{k,t}}}$, (2.7)

$y_t = \gamma_0 + \sum_{k=1}^{k^*} \gamma_k N_{k,t}$, (2.8)

where $T(n_{k,t})$ is the tansig activation function.
Another activation function is the cumulative Gaussian function, commonly
referred to as the normal distribution function. FIGURE 2.3 plots this activation
function against the log-sigmoid function.

FIGURE 2.3: Cumulative Gaussian (normal distribution) function against the log-sigmoid function

The advantage of using the Gaussian function is that it has thinner tails,
thus it does not respond to extreme values. It can be observed from the
figure that it shows very little or no response to extreme values below -2 and
above +2, while the log-sigmoid responds to them much more. The mathematical
formalization of the neural network using the Gaussian activation function can be
represented by the following system:

$n_{k,t} = \omega_{k,0} + \sum_{i=1}^{i^*} \omega_{k,i} x_{i,t}$, (2.9)

$N_{k,t} = \Phi(n_{k,t}) = \dfrac{1}{\sqrt{2\pi}} \int_{-\infty}^{n_{k,t}} e^{-\frac{s^2}{2}} \, ds$, (2.10)

$y_t = \gamma_0 + \sum_{k=1}^{k^*} \gamma_k N_{k,t}$, (2.11)

where $\Phi(n_{k,t})$ is the standard cumulative Gaussian function.
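The three squashers can be compared numerically; the sketch below evaluates each at a few points (the standard normal CDF is obtained from the error function via $\Phi(n) = (1 + \mathrm{erf}(n/\sqrt{2}))/2$). It reproduces the thin-tail behavior described above: at $n = -4$ the Gaussian CDF is essentially zero while the log-sigmoid still responds.

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

def tansig(n):
    return math.tanh(n)   # equals (e^n - e^-n)/(e^n + e^-n), eq. (2.7)

def gauss_cdf(n):
    # Standard normal CDF, eq. (2.10), via the error function.
    return 0.5 * (1.0 + math.erf(n / math.sqrt(2.0)))

for n in (-4.0, 0.0, 4.0):
    print(n, round(logsig(n), 4), round(tansig(n), 4), round(gauss_cdf(n), 4))
```

All three agree at the center (the log-sigmoid and Gaussian CDF both pass through 0.5 at zero), so the choice matters mainly for how extreme inputs are treated.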
We have described the basic functional forms of neural networks with the most
commonly used transformation functions. The reader is now probably asking:
"OK, but what transformation function should I use?", or "Are there
any other transformation functions?". There are in fact many other possible
transformation functions. The reason we describe these few is that they
performed best in our tests and are also used in each of the references used in
this paper.

The answer to the first question is not as simple as the answer to the second
one. Each transformation function transforms inputs in a different manner. Some
respond to extreme values, some do not, thus they do not serve equally well in
approximating the unknown function. Hence, choosing the form of the squasher
function is often up to the researcher and the data used. The best way is to
perform tests with different transformation functions used in the neurons and use
the one which performs best. As this takes time, it is one of the main drawbacks of
neural networks, which will be discussed in further detail at the end of this chapter.
2.3 Multilayered Feedforward Networks
By making use of two or more hidden layers, we may be able to
approximate more complex systems. FIGURE 2.4 illustrates a neural network with
two hidden layers, each consisting of two neurons. In the figure we also illustrate
an example of time series modelling with a neural network. Say we have returns
$\{x_t\}$ through time t and we want to forecast them. Then we simply use the inputs
$\{x_{t-2}, x_{t-1}, x_t\}$ to produce the output $\{x_{t+1}\}$. For generality of the illustration, we denote
the output variable y.

The mathematical representation of the system with $i^*$ input variables, $k^*$
neurons in the first hidden layer, and $l^*$ neurons in the second hidden layer follows:
$n_{k,t} = \omega_{k,0} + \sum_{i=1}^{i^*} \omega_{k,i} x_{i,t}$, (2.12)

$N_{k,t} = \dfrac{1}{1 + e^{-n_{k,t}}}$, (2.13)

$p_{l,t} = \rho_{l,0} + \sum_{k=1}^{k^*} \rho_{l,k} N_{k,t}$, (2.14)

$P_{l,t} = \dfrac{1}{1 + e^{-p_{l,t}}}$, (2.15)

$y_t = \gamma_0 + \sum_{l=1}^{l^*} \gamma_l P_{l,t}$. (2.16)
FIGURE 2.4: Feedforward network with two hidden layers
Adding a second hidden layer increases the number of parameters to be
estimated, and this is basically the cost of the complexity gained by using
more hidden layers. Researchers should note that with more parameters, not only
is greater training time a problem; there is a much greater probability that the
parameter estimates will converge to a local rather than a global optimum. This
problem is further discussed in chapter (2.5). As shown by Dayhoff and DeLeo
(2001), the simplicity of a network brings better results, and we will probably manage
with smaller networks in our tests as well:
“A general function approximation theorem has been proven for threelayer
neural networks. This result shows that artificial neural networks with two layers
of trainable weights are capable of approximating any nonlinear function. This is
powerful computational property that is robust and has ramifications for many
different applications of neural networks. Neural networks can approximate a
multifactor function in such a way that creating the functional form and fitting the
function are performed at the same time, unlike nonlinear regression in which a
fit is forced to a prechosen function. This capability gives neural networks a
decided advantage over traditional statistical multivariate regression techniques.”
(Dayhoff and DeLeo (2001, p. 1624))
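The two-hidden-layer system of eqs. (2.12)-(2.16) extends the single-layer forward pass by chaining two squashed layers. The sketch below uses the log-sigmoid at both layers and illustrative (untrained) weights; the three inputs stand in for lagged returns as in FIGURE 2.4.

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

def layer(inputs, weights):
    """Each row of weights is (bias, w_1, ..., w_m); returns squashed outputs."""
    return [logsig(w[0] + sum(wi * xi for wi, xi in zip(w[1:], inputs)))
            for w in weights]

def forward_two_layers(x, omega, rho, gamma):
    N = layer(x, omega)   # eqs. (2.12)-(2.13): first hidden layer
    P = layer(N, rho)     # eqs. (2.14)-(2.15): second hidden layer
    return gamma[0] + sum(g * p for g, p in zip(gamma[1:], P))  # eq. (2.16)

x = (0.01, -0.02, 0.015)                  # e.g. lagged returns x_{t-2}, x_{t-1}, x_t
omega = [(0.1, 1.0, -0.5, 0.3), (0.0, 0.2, 0.4, -0.1)]   # 2 neurons, layer 1
rho = [(-0.1, 0.6, 0.3), (0.2, -0.4, 0.5)]               # 2 neurons, layer 2
gamma = (0.0, 0.8, -0.3)                                 # output weights
y_hat = forward_two_layers(x, omega, rho, gamma)
print(round(y_hat, 4))
```

Counting the entries of omega, rho and gamma makes the cost of the extra layer concrete: this tiny network already carries 17 parameters, compared with 11 for its one-layer counterpart with the same widths.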
2.4 Learning algorithms
In order to be able to approximate the target function, in our case stock
returns, the neural network has to be able to "learn". The process of learning is
defined as the adjustment of weights using a learning algorithm. We present the common
backpropagation algorithm and two more specific ones, the conjugate gradient algorithm
and the Levenberg-Marquardt algorithm. These two are presented mainly because
they provided the most impressive results in comparison to other common methods,
as the reader can see in the next chapters.

The most common way to train a neural network is by a learning algorithm
called "backpropagation" or "error backpropagation". Let us assume the following
error function:

$\Psi(\omega) = \dfrac{1}{T} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2$, (2.17)

where $\hat{y}_t$ is the estimated output variable of the network (the forecast) and $y_t$ is the
variable being forecast at time $t \in \{1, \ldots, T\}$. Then, according to
our definition of the prediction task in (1.3), the main goal of the learning process is
to minimize $\Psi(\omega)$, the sum of prediction errors over all training examples.
The training phase is thus an unconstrained nonlinear optimization problem, where the
goal is to find the optimal set of weight parameters by solving the minimization
problem

$\min \{\Psi(\omega) : \omega \in \mathbb{R}^n\}$, (2.18)

where $\Psi : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable.
2.4.1 Stochastic gradient descent backpropagation learning
algorithm
There are several ways of achieving the minimization of $\Psi(\omega)$, but
basically the algorithm is as follows24:

(i) choose random initial values for the model weights $\omega$
(ii) calculate the gradient G of the error function $\Psi(\omega)$ with respect to each weight
(iii) adjust the model weights so that we move a short distance in the direction of the greatest rate of decrease of the error, i.e. in the direction of (-G)
(iv) repeat steps (ii) and (iii) until G is zero and $\Psi(\omega)$ is minimized.
So we are searching for the gradient $G = \nabla\Psi(\omega)$ of the function $\Psi$, which is the
vector of first partial derivatives of the error function $\Psi(\omega)$ with respect to the
weight vector $\omega$:

$\nabla\Psi(\omega) = \left( \dfrac{\partial\Psi(\omega)}{\partial\omega_1}, \dfrac{\partial\Psi(\omega)}{\partial\omega_2}, \ldots, \dfrac{\partial\Psi(\omega)}{\partial\omega_n} \right)$. (2.19)

Furthermore, the gradient specifies the direction that produces the steepest
increase in $\Psi$. The negative of this vector thus gives us the direction of steepest
decrease.
FIGURE 2.5 shows25 the behavior of \varepsilon(\omega) with respect to one weight
\omega. In order to find the minimum, we always have to increase/decrease \omega in
the opposite direction to the slope, by \Delta\omega_{ji}, where \eta, most
commonly26 0 < \eta \le 0.5, is the learning rate that determines the size of the
steps taken by the algorithm; the rest is the partial derivative of
\varepsilon(\omega) with respect to the weights. Thus:
24 Schraudolph and Cummins (2002)
25 Please note that the figure is only schematic; in a real neural network we will work with many
more weights than one.
26 Note that this is the usual interval used as a rule of thumb. If \eta is too small, near zero, it may
take a very long time to converge to the optimal weights. If \eta is too big, the algorithm may “jump”
from positive to negative gradient and the optimum will not be found at all.
\Delta\omega_{ji} = -\eta \, \frac{\partial\varepsilon(\omega)}{\partial\omega_{ji}} ,  (2.20)

and finally the algorithm will find the final weights, with the minimum of the error
function, by

\omega_{ji}^{t+1} \leftarrow \omega_{ji}^{t} + \Delta\omega_{ji} .  (2.21)

FIGURE 2.5: Gradient descent
So if we find a negative gradient in step (ii) of the algorithm, we will increase
\omega in step (iii), and vice versa. In this way we will move towards the minimum
\nabla\varepsilon(\omega) = 0 by repeating the algorithm in N steps.
An important feature of this algorithm is that it assumes a quadratic error
function, hence there exists only one minimum. In practice the error function will
have, apart from the global minimum, multiple local minima. At this point the
reader probably knows what will follow – the alert that the algorithm can converge to
a local minimum and fail to find the global one. Other drawbacks of this method are
the need to specify \eta and, much worse, its slow convergence.
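To make the loop of steps (i)–(iv) concrete, here is a minimal Python sketch of the update rule (2.20)–(2.21) on a toy quadratic error function. The function names and the test function are illustrative assumptions, not the estimation code used in the thesis:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, tol=1e-8, max_iter=10000):
    """Steepest-descent loop: w <- w + delta_w, with delta_w = -eta * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad(w)                  # step (ii): gradient of the error
        if np.linalg.norm(g) < tol:  # step (iv): stop when G is (nearly) zero
            break
        w = w - eta * g              # step (iii): move against the gradient
    return w

# Toy quadratic error eps(w) = (w1-1)^2 + 2*(w2+3)^2, minimum at (1, -3)
grad = lambda w: np.array([2*(w[0] - 1), 4*(w[1] + 3)])
w_min = gradient_descent(grad, [0.0, 0.0])
```

With this toy gradient the iteration contracts towards the minimum at (1, −3); a learning rate that is too large would make it oscillate, exactly as footnote 26 warns.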
2.4.2 Conjugate Gradient Learning Algorithm
Besides the popular steepest descent algorithm, the conjugate gradient algorithm is
another search method that can be used to minimize the network error function
\varepsilon(\omega), in conjugate directions. This method makes use of orthogonal and
linearly independent nonzero vectors and in some cases brings better
convergence results than the previous method.
Definition: Two vectors d_i and d_j are mutually G-conjugate if:

d_i^T G d_j = 0 .  (2.22)
Then, to minimize the error function, we begin by initializing the parameter
vector \omega of n elements at some random value \omega_0 with \varepsilon(\omega_0) = c.
Then we iterate on the weight set \omega until the minimum of \varepsilon(\omega) is found.
The error function is represented by the following second-order Taylor expansion:

\varepsilon(\omega) = c + \Gamma^T \omega + \frac{1}{2} \omega^T G \omega ,  (2.23)
where \Gamma is the gradient of the error function with respect to the weight set
\omega and G is the Hessian of the error function, an n \times n symmetric and
positive definite matrix. The name conjugate27 comes from the fact that in this
iteration, the weight vectors are conjugates of the Hessian.
Choosing \omega_0 = (\omega_{0,1}, \ldots, \omega_{0,k}) as the set of k initial
parameters, we search for the direction d_0 = -\Gamma_0. The gradient vector is
defined as:

\Gamma_0 = \begin{pmatrix}
[\varepsilon(\omega_{0,1}+h_1, \ldots, \omega_{0,k}) - \varepsilon(\omega_{0,1}, \ldots, \omega_{0,k})]/h_1 \\
\vdots \\
[\varepsilon(\omega_{0,1}, \ldots, \omega_{0,i}+h_i, \ldots, \omega_{0,k}) - \varepsilon(\omega_{0,1}, \ldots, \omega_{0,k})]/h_i \\
\vdots \\
[\varepsilon(\omega_{0,1}, \ldots, \omega_{0,k}+h_k) - \varepsilon(\omega_{0,1}, \ldots, \omega_{0,k})]/h_k
\end{pmatrix}  (2.24)

The h_i is set as \max(\epsilon, \epsilon\,\omega_{0,i}) with \epsilon = 10^{-6}. The
Hessian G_0 is the matrix of second-order partial derivatives of \varepsilon(\omega)
with respect to \omega_0 and is computed similarly to the Jacobian or gradient vector:
27 The method was originally proposed by Hestenes and Stiefel (1952)
G_0 = \begin{pmatrix}
\frac{\partial^2\varepsilon(\omega)}{\partial\omega_{0,1}^2} & \frac{\partial^2\varepsilon(\omega)}{\partial\omega_{0,1}\partial\omega_{0,2}} & \cdots & \frac{\partial^2\varepsilon(\omega)}{\partial\omega_{0,1}\partial\omega_{0,k}} \\
\frac{\partial^2\varepsilon(\omega)}{\partial\omega_{0,2}\partial\omega_{0,1}} & \frac{\partial^2\varepsilon(\omega)}{\partial\omega_{0,2}^2} & \cdots & \frac{\partial^2\varepsilon(\omega)}{\partial\omega_{0,2}\partial\omega_{0,k}} \\
\vdots & & \ddots & \vdots \\
\frac{\partial^2\varepsilon(\omega)}{\partial\omega_{0,k}\partial\omega_{0,1}} & \frac{\partial^2\varepsilon(\omega)}{\partial\omega_{0,k}\partial\omega_{0,2}} & \cdots & \frac{\partial^2\varepsilon(\omega)}{\partial\omega_{0,k}^2}
\end{pmatrix}  (2.25)
The off-diagonal elements of the matrix are given by:

\frac{\partial^2\varepsilon(\omega)}{\partial\omega_{0,i}\partial\omega_{0,j}} = \frac{1}{h_i h_j} \times \left[ \varepsilon(\omega_{0,1}, \ldots, \omega_{0,i}+h_i, \omega_{0,j}+h_j, \ldots, \omega_{0,k}) - \varepsilon(\omega_{0,1}, \ldots, \omega_{0,j}+h_j, \ldots, \omega_{0,k}) - \varepsilon(\omega_{0,1}, \ldots, \omega_{0,i}+h_i, \ldots, \omega_{0,k}) + \varepsilon(\omega_{0,1}, \ldots, \omega_{0,k}) \right]  (2.26)
And the diagonal elements are given by:

\frac{\partial^2\varepsilon(\omega)}{\partial\omega_{0,i}^2} = \frac{1}{h_i^2} \times \left[ \varepsilon(\omega_{0,1}, \ldots, \omega_{0,i}+h_i, \ldots, \omega_{0,k}) - 2\,\varepsilon(\omega_{0,1}, \ldots, \omega_{0,k}) + \varepsilon(\omega_{0,1}, \ldots, \omega_{0,i}-h_i, \ldots, \omega_{0,k}) \right]  (2.27)
Having found the direction d_0, we can follow the iteration process to solve the
minimization problem of \varepsilon(\omega):

\omega_{k+1} = \omega_k + \alpha_k d_k ,  (2.28)

d_{k+1} = -\Gamma_{k+1} + \beta_k d_k ,  (2.29)

where \alpha and \beta are momentum terms to avoid oscillations. Let
\mu_k = \frac{1}{1+\beta_k}. Equation (2.29) can be rewritten as follows:

d_{k+1} = \frac{1}{\mu_k} \left[ \mu_k (-\Gamma_{k+1}) + (1-\mu_k) d_k \right] ,  (2.30)

which allows us to look at the search direction as a convex combination of the
current steepest descent direction and the direction of the last move. The search
distance in each direction is varied. The value of \alpha_k can be found by line
search techniques such as Brent’s algorithm28, so that \varepsilon(\omega_k + \alpha_k d_k)
is minimized for fixed \omega_k and d_k.
28 Brent (1973)
\beta_k is then calculated by one of the following three formulae:

Hestenes and Stiefel’s formula29: \beta_k = \frac{\Gamma_{k+1}^T [\Gamma_{k+1} - \Gamma_k]}{d_k^T [\Gamma_{k+1} - \Gamma_k]} .  (2.31)

Polak and Ribiére’s formula30: \beta_k = \frac{\Gamma_{k+1}^T [\Gamma_{k+1} - \Gamma_k]}{\Gamma_k^T \Gamma_k} .  (2.32)

Fletcher and Reeves’s formula31: \beta_k = \frac{\Gamma_{k+1}^T \Gamma_{k+1}}{\Gamma_k^T \Gamma_k} .  (2.33)
Shanno’s inexact line search32 considers the conjugate method as a memoryless
quasi-Newton method and derives the following formula for computing d_{k+1}:

d_{k+1} = -\Gamma_{k+1} - \left[ \left( 1 + \frac{y_k^T y_k}{p_k^T y_k} \right) \frac{p_k^T \Gamma_{k+1}}{p_k^T y_k} - \frac{y_k^T \Gamma_{k+1}}{p_k^T y_k} \right] p_k + \frac{p_k^T \Gamma_{k+1}}{p_k^T y_k} y_k ,  (2.34)

where p_k = \alpha_k d_k and y_k = \Gamma_{k+1} - \Gamma_k.
The conjugate gradient method finds the optimal vector \omega along the current
gradient by doing the line search, and converges to the solution faster than
steepest descent. The method computes the gradient at the new point and projects it
onto the subspace defined by the complement of the space spanned by all previously
chosen gradients. The new direction is orthogonal to all previous search directions.
Before moving to the Levenberg–Marquardt algorithm, we will sum up the
conjugate gradient algorithm in a few simple steps:
(i) set k = 1, initialize \omega_0
(ii) compute \Gamma_0 = \nabla\varepsilon(\omega_0)
(iii) set d_0 = -\Gamma_0
(iv) compute \alpha_k by line search, where \alpha_k = \arg\min_\alpha \varepsilon(\omega_k + \alpha d_k)
(v) update the weight vector by \omega_{k+1} = \omega_k + \alpha_k d_k
(vi) if the network error \varepsilon(\omega) is less than a preset minimum value or the maximum number of iterations has been reached, stop; else go to the next step
(vii) if k+1 > n, then \omega_1 = \omega_{k+1}, k = 1 and go to step (ii)
29 Hestenes, Stiefel (1952)
30 Polak (1971)
31 Dai, Yuan (1996)
32 Shanno (1978)
else
1) set k = k+1
2) compute \Gamma_{k+1} = \nabla\varepsilon(\omega_{k+1})
3) compute \beta_k
4) compute the new direction d_{k+1} = -\Gamma_{k+1} + \beta_k d_k
5) go to step (iv)
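The steps above can be sketched as follows; a coarse grid search stands in for Brent’s algorithm in step (iv), and \beta is computed by the Fletcher–Reeves formula (2.33). This is an illustrative sketch under those assumptions, not the implementation used in the thesis:

```python
import numpy as np

def conjugate_gradient(eps, grad, w0, max_iter=200, tol=1e-12):
    """Nonlinear conjugate gradient with Fletcher-Reeves beta (eq. 2.33),
    periodic restarts (step vii) and a crude grid line search (step iv)."""
    w = np.asarray(w0, dtype=float)
    g = grad(w)
    d = -g                                    # step (iii): d0 = -Gamma0
    for k in range(max_iter):
        # step (iv): alpha_k = argmin_a eps(w + a*d), by a coarse scan
        alphas = np.linspace(1e-3, 1.0, 400)
        alpha = alphas[np.argmin([eps(w + a * d) for a in alphas])]
        w = w + alpha * d                     # step (v)
        g_new = grad(w)
        if g_new @ g_new < tol:               # step (vi): stopping rule
            break
        beta = (g_new @ g_new) / (g @ g)      # Fletcher-Reeves (eq. 2.33)
        if (k + 1) % len(w) == 0:
            d = -g_new                        # step (vii): periodic restart
        else:
            d = -g_new + beta * d             # eq. (2.29)
        g = g_new
    return w

# Toy quadratic error, minimum of (w1-1)^2 + 2*(w2+3)^2 at (1, -3)
eps  = lambda w: (w[0] - 1)**2 + 2*(w[1] + 3)**2
grad = lambda w: np.array([2*(w[0] - 1), 4*(w[1] + 3)])
w_min = conjugate_gradient(eps, grad, [0.0, 0.0])
```

On a quadratic surface an exact line search would make the conjugate directions terminate in n steps; the grid search only approximates this, which is why the restart rule matters in practice.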
We do not expect the conjugate gradient approach to minimize the error function
better, but we do expect more efficiency, as it should provide results faster. Next
we introduce the last method, the Levenberg–Marquardt algorithm, from which we will
also expect a better level of minimization.
2.4.3 Levenberg–Marquardt Learning Algorithm
Gradient descent works for simple models, but is too simplistic for more
complex ones, so we may want to use more sophisticated methods to obtain
better results. The technique invented by Levenberg33 involves blending the
introduced steepest gradient with a quadratic approximation. It uses the
steepest gradient to approach the minimum, and then switches to the quadratic
approximation. We can formalize it as follows. Let \lambda be a “blending factor”,
a constant which determines the mix between the two methods. The update rule
here is:

\omega_{k+1} = \omega_k - (H + \lambda I)^{-1} d ,  (2.35)

where again \omega is the weight vector, H is the Hessian matrix of the error
function, d is the gradient of the error function, and I is the identity matrix.
Depending on the value of \lambda we can approach the following forms. With
\lambda \to 0 we get \omega_{k+1} = \omega_k - H^{-1} d, which is basically the
quadratic approximation, and with growing \lambda we get
\omega_{k+1} = \omega_k - \frac{1}{\lambda} d, which the reader can compare to
equation (2.21) and find that it is steepest gradient.
The algorithm adjusts the value of \lambda according to whether \varepsilon(\omega)
is increasing or decreasing, as follows:
(i) do the update according to equation (2.35)
(ii) evaluate the error at the new weight vector
(iii) if the error has increased as a result of step (i), retract the weights to their previous values and increase \lambda by34 a factor of 10, then go to (i); else (if the error decreased), accept the weights and decrease \lambda by a factor of 10.
33 Levenberg (1944)
If the error is increasing, the quadratic approximation is not working well and we
are far from the minimum, so we need to approach simple descent by increasing
\lambda to locate the minimum. Conversely, if the error is decreasing, the
approximation is working well; we expect that we are closer to the minimum, so we
try to lean towards the Hessian by decreasing \lambda.
Marquardt (1963) improved this method with a clever incorporation of
estimated local curvature information. His insight was that when \lambda is high
and we are doing essentially gradient descent, we can still benefit from the
Hessian matrix we estimated. He suggested that we should move further in the
directions in which the gradient is smaller, in order to get around the error
valley problem. Marquardt replaced the identity matrix from equation (2.35) with
the diagonal of the Hessian:

\omega_{k+1} = \omega_k - (H + \lambda \, \mathrm{diag}[H])^{-1} d .  (2.36)
We can see that this method does not require other computations than the previous
methods. All we need is \varepsilon(\omega), the error function of the estimated
and desired output, and its gradient \nabla\varepsilon(\omega).
It is important to notice that this is nothing more than a heuristic method. It is
not optimal for any defined criterion of speed or final error. What is so appealing
is that it works extremely well in practice. Its only drawback is that it requires
a matrix inversion step, and thus becomes much slower than backpropagation or
conjugate gradient for more complex models. On the other hand, it gives much
better results, as the reader will see in further chapters.
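A minimal sketch of the \lambda-adaptation loop, using the Levenberg form (2.35) with the identity matrix (Marquardt’s refinement (2.36) would swap I for diag[H]). The test surface and all names are our illustrative assumptions:

```python
import numpy as np

def levenberg_marquardt(eps, grad, hess, w0, lam=1e-3, max_iter=200):
    """Blend of steepest descent and quadratic approximation, eq. (2.35):
    w <- w - (H + lam*I)^(-1) d.  lam is divided by 10 when a step lowers the
    error, and multiplied by 10 (with the step retracted) when it does not."""
    w = np.asarray(w0, dtype=float)
    e = eps(w)
    I = np.eye(len(w))
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < 1e-10:
            break
        step = np.linalg.solve(hess(w) + lam * I, g)
        w_try = w - step                 # step (i): candidate update
        e_try = eps(w_try)               # step (ii): error at new weights
        if e_try < e:                    # error fell: accept, trust Hessian more
            w, e, lam = w_try, e_try, lam / 10.0
        else:                            # error rose: retract, lean to descent
            lam *= 10.0
    return w

# Toy error surface with a curved valley and a unique minimum at (1, 1)
eps  = lambda w: (1 - w[0])**2 + 5.0 * (w[1] - w[0]**2)**2
grad = lambda w: np.array([-2*(1 - w[0]) - 20*w[0]*(w[1] - w[0]**2),
                           10*(w[1] - w[0]**2)])
hess = lambda w: np.array([[2 - 20*(w[1] - w[0]**2) + 40*w[0]**2, -20*w[0]],
                           [-20*w[0], 10.0]])
w_min = levenberg_marquardt(eps, grad, hess, [-1.0, 2.0])
```

The single `np.linalg.solve` call per iteration is the matrix inversion step mentioned above as the method’s main cost.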
2.5 The Nonlinear Estimation Problem
As we saw in the previous subchapters, finding the coefficient values of
nonlinear models is no easy job, as a neural network is a highly complex
nonlinear system. We can hit several locally optimal solutions, but none of these
need be the best solution in terms of minimizing the error between our model
prediction \hat{y} and the actual value y.
In any nonlinear system, we start the estimation with initial conditions, as
we saw in the previous chapter. These are a guess or a random variable, and we
run into the problem of some parameters being guessed better than others. This may
end in converging to a local rather than the global optimum – and thus to the best
forecast in the local neighborhood of the initial guess, but not the best forecast
beyond the “initial area”. This is intuitively illustrated in FIGURE 2.6:
34 Or another significant factor; 10 was originally proposed by Levenberg.
FIGURE 2.6: Problem of search for local optima (global maximum, local maximum, saddle point, local minimum, global minimum)
As we can see, the initial set of weights may lie nearer to a local maximum
than to a minimum, or near a saddle point, while our search for the minimum of the
error function is using derivatives of the error function. Thus we also have to
recognize the curvature around our point by second derivatives, which provide
better insight. If the change of the gradient, or second derivative, is positive,
we know that we are near a minimum, and vice versa for a maximum.
So as we adjust weights by the presented algorithms, we can easily get stuck at
any of the positions in FIGURE 2.6 where the derivative is zero or the function has
a flat slope (blue lines in the figure). If we adjust the weights by too large
steps, the algorithm can easily jump from the near-global minimum to a maximum or
another point. If, on the contrary, we adjust by too small steps, the algorithm may
get stuck at a saddle point for a long time during the training period and may not
converge to a minimum at all.
Maybe the reader is asking: “but what can we do to avoid this problem?” There are
several techniques for minimizing the chance of converging to “the wrong” optimum.
A very intuitive way is re-estimation of the whole model; another is the stochastic
evolutionary search presented in the following subchapter.
2.5.1 Stochastic evolutionary search
A genetic algorithm reduces the likelihood of landing in a local minimum. We
do not need to approximate the Hessian; we start with a “population” of p initial
guesses, \{\omega_{0,1}, \omega_{0,2}, \ldots, \omega_{0,p}\}, and update them by
genetic selection, breeding, and mutation, for many generations, until the best
coefficient vector is found. Let us have a closer look at this process.
(i) Population creation
We start with a population of N* random vectors \omega. Let p be the size of each
vector, representing the total number of parameters to be estimated. Then we create
the following population:

\begin{pmatrix}
\omega_1^1 & \omega_2^1 & \cdots & \omega_{N^*}^1 \\
\omega_1^2 & \omega_2^2 & \cdots & \omega_{N^*}^2 \\
\vdots & \vdots & \ddots & \vdots \\
\omega_1^p & \omega_2^p & \cdots & \omega_{N^*}^p
\end{pmatrix} , \qquad i = 1, \ldots, N^* .  (2.37)
(ii) Selection
The next step is the selection of two pairs from the population at random,
with replacement, and the evaluation of their fitness according to the sum of
squared errors. Weights with lower error receive better fitness values. The two
winning vectors (i, j) with the best fitness are then chosen for “breeding”.
(iii) Crossover
Now these two vectors (i, j) will “breed children”, meaning they will be
associated with another pair of vectors C1(i) and C2(j) by one of three methods,
chosen randomly with the same probability of 1/3: Shuffle crossover, for which
random draws from a binomial distribution are made and the new vectors are swapped
or left unchanged; Arithmetic crossover, for which a random value c \in (0,1) is
chosen and the new vectors are linear combinations of the old ones,
c\,\omega_{i,p} + (1-c)\,\omega_{j,p} and c\,\omega_{j,p} + (1-c)\,\omega_{i,p};
or, last, Single-point crossover, where an integer I is randomly chosen from the
set [1, k-1]; the vectors are then cut at this integer and the parameters are
swapped.
(iv) Mutation
Now the “children” C1(i) and C2(j) have to mutate in generations
G = 1, 2, \ldots, G^* with a probability35 of, say, p = 0.15 + 0.33/G assigned to
them. Randomly drawing real numbers r_1, r_2 \in (0,1) and a random number s from a
standard normal distribution, the mutated weight \tilde\omega_{i,p} is given by

\tilde\omega_{i,p} = \begin{cases}
\omega_{i,p} + s \left[ 1 - r_2^{(1-G/G^*)^b} \right] & \text{if } r_1 > 0.5 , \\
\omega_{i,p} - s \left[ 1 - r_2^{(1-G/G^*)^b} \right] & \text{if } r_1 \le 0.5 ,
\end{cases}  (2.38)

where G is the number of the generation, G^* is the maximum number of generations,
and b is the degree to which the mutation is nonuniform; usually b = 2. The
probability of creating a new coefficient far from the current coefficient
diminishes as G approaches G^*. This allows a more precise search for weights when
approaching a global optimum.
(v) Election tournament
The last step is the “tournament”, in which all chosen weights compete on the
fitness criterion. Again, the two vectors with the best fitness “survive” and pass
to the next generation. If the older pair has better fitness, it wins the
tournament and the younger one is eliminated.
The process is repeated from (i) through (v) for G^* generations. Convergence is
obtained if we do not see improvement in the fitness of the last – optimal –
weights. Unfortunately, the literature does not provide us with an optimal value of
G^*, as it will be different for each problem. What we can do is add a simple
if-then rule: if no improvement in the sum of squared errors, or fitness, is seen,
the algorithm stops.
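Steps (i)–(v) can be condensed into the following simplified sketch. For brevity only arithmetic crossover and the non-uniform mutation (2.38) are implemented – shuffle and single-point crossover are omitted – so this is a reduced illustration of the algorithm, not a faithful reimplementation:

```python
import numpy as np
rng = np.random.default_rng(0)

def genetic_minimize(sse, k, pop_size=30, G_star=200, b=2):
    """Simplified evolutionary search: selection by SSE fitness, arithmetic
    crossover, non-uniform mutation (eq. 2.38), and an election tournament."""
    pop = rng.normal(size=(pop_size, k))          # (i) initial population
    for G in range(1, G_star + 1):
        p_mut = 0.15 + 0.33 / G                   # example mutation probability
        new_pop = []
        for _ in range(pop_size // 2):
            # (ii) selection: from two random pairs, the fitter member breeds
            i, j = (min(rng.integers(pop_size, size=2),
                        key=lambda m: sse(pop[m])) for _ in range(2))
            # (iii) arithmetic crossover: children are convex combinations
            c = rng.uniform()
            c1 = c * pop[i] + (1 - c) * pop[j]
            c2 = c * pop[j] + (1 - c) * pop[i]
            # (iv) non-uniform mutation, eq. (2.38): noise shrinks with G
            for child in (c1, c2):
                for m in range(k):
                    if rng.uniform() < p_mut:
                        r1, r2, s = rng.uniform(), rng.uniform(), rng.normal()
                        shrink = 1 - r2 ** ((1 - G / G_star) ** b)
                        child[m] += s * shrink if r1 > 0.5 else -s * shrink
            # (v) election tournament: best two of parents and children survive
            four = [pop[i], pop[j], c1, c2]
            four.sort(key=sse)
            new_pop += [four[0], four[1]]
        pop = np.array(new_pop)
    return min(pop, key=sse)

# Toy "sum of squared errors": distance of the weights from the target (2, -1)
sse = lambda w: float((w[0] - 2)**2 + (w[1] + 1)**2)
w_best = genetic_minimize(sse, k=2)
```

Because the mutation term collapses as G approaches G*, the late generations refine rather than explore, matching the verbal description above.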
35 Probability here is just an example
2.5.2 Hybrid learning as a solution?
One of the main drawbacks of genetic algorithms is their extreme slowness.
Even for a reasonable dimension of the weight vector \omega, the number of
combinations and permutations of elements that the genetic algorithm might find
optimal may become very large. In the next subchapter we will discuss the curse of
dimensionality problem, but even if we manage to reduce the dimension
significantly, the time taken to converge to a global optimum may be extremely
long. On the other hand, it has been mathematically proved36 that convergence
occurs.
The hybrid approach partially solves the problem of the slowness of the
genetic algorithm. We may run the genetic algorithm for a reasonable number of
generations, say 50 or 100, which takes little time, and then use the obtained
vector of weights as initial weights in the gradient search algorithms.
Problems arise even with the hybrid approach because of the nature of neural
networks. A neural network structure can give different results with some kinds of
data, as the initial guess may fall into the local optimum trap, as we saw in the
previous chapters. We can use repeated estimation for robustness of the results.
Granger and Jeon (2002) have suggested the simple idea of thick modelling. The
framework of this idea is to repeatedly estimate a given data set with different
specifications and then use the mean of the obtained information. They mainly use
this method for forecasting; thus they find the mean of repeated forecasts to be
the optimal one. They find that this method outperformed simple linear models,
while it also outperformed individual network results in macroeconomic data
modelling.
2.6 Preprocessing the data
One of the first steps of research when modelling time series is adjusting and
scaling the data and removing nonstationarity. These procedures are known as
data preprocessing and are often crucial for the results. In this subchapter we
discuss the problems of preprocessing the data, including the curse of
dimensionality.
36 See Hartl (1990) or Mitchell (1997)
2.6.1 Curse of dimensionality
One of the most important steps in designing a neural network is the
choice of appropriate data pre- and postprocessing. The first problem arises
with choosing the variables that may explain our observations best. In forecasting
stock market prices, there may be many variables that have an influence on the
price. If we use all possible candidates as regressors in the model, we face
the curse of dimensionality, first mentioned by Bellman (1961). It simply means
that the sample size needed to estimate a model with a given degree of accuracy
grows exponentially with the number of variables in the model.
Thus the intuitive assumption that “more data will provide greater insight into
the process” does not necessarily hold, and reduction of dimensionality is often
necessary for a good, simple predictive model, as it is crucial for the model to
use the variables that influence the observations most. In other words, we need to
reduce the number of regressors to a manageable subset if we want to have
sufficient degrees of freedom for any meaningful conclusions.
2.6.2 Principal Component Analysis
Principal component analysis (PCA) is basically an approach to reducing a
large set of variables to a smaller subset – a reduction of dimensionality –
while preserving as much of the information contained in the data as possible. PCA
identifies linear combinations of the data that explain most of the variation of
the original data. For N vectors, N linearly independent combinations will explain
the total variation of the data. However, what if only two or three linear
combinations, or principal components, explain most of the variation of the total
data set? We can then significantly reduce the dimension of the model. This should
be done with caution, because we may reduce important information away.
2.6.2.1 Karhunen–Loève Transformation
The goal of principal component analysis is to map d-dimensional vectors
x_i to m-dimensional vectors z_i with m < d. We can express the vector x as a
linear combination of a set of d orthonormal vectors u_i:

x = \sum_{i=1}^{d} z_i u_i ,  (2.39)

where the vectors u_i satisfy the orthonormality relationship

u_i^T u_j = \delta_{ij} ,  (2.40)

where \delta_{ij} is the Kronecker delta37. Explicitly, the coefficients z_i can
be found as

z_i = u_i^T x .  (2.41)

The dimensionality reduction then works as follows: the coefficients z_i for
i > m are replaced by constants b_i, so that the vector x is approximated as:

\tilde{x} = \sum_{i=1}^{m} z_i u_i + \sum_{i=m+1}^{d} b_i u_i .  (2.42)
So again we are solving a minimization problem for the sum of squared errors over
a data set of N samples, which is defined as:

E_m = \frac{1}{2} \sum_{n=1}^{N} \| x^n - \tilde{x}^n \|^2 = \frac{1}{2} \sum_{n=1}^{N} \sum_{i=m+1}^{d} (z_i^n - b_i)^2 .  (2.43)

If we set \partial E_m / \partial b_i = 0, then

b_i = \frac{1}{N} \sum_{n=1}^{N} z_i^n = u_i^T \bar{x} ,  (2.44)
with \bar{x} being the arithmetic mean, and using (2.41) we can rewrite E_m as:

E_m = \frac{1}{2} \sum_{i=m+1}^{d} \sum_{n=1}^{N} \left( u_i^T (x^n - \bar{x}) \right)^2 = \frac{1}{2} \sum_{i=m+1}^{d} u_i^T \Sigma u_i ,  (2.45)

where \Sigma = \sum_{n=1}^{N} (x^n - \bar{x})(x^n - \bar{x})^T is the covariance
matrix of the x_i. As shown in Bishop (1996), the minimum is found when the basis
vectors satisfy the condition \Sigma u_i = \lambda_i u_i, so they are eigenvectors
of the covariance matrix. Note that since the covariance matrix is real and
symmetric, its eigenvectors can be chosen orthonormal, as assumed. The value of the
error at the minimum is then equal to:
E_m(\min) = \frac{1}{2} \sum_{i=m+1}^{d} \lambda_i ,  (2.46)

and the minimum is found by choosing the d-m smallest eigenvalues – and their
corresponding eigenvectors u_i, or principal components – to discard.
37 The Kronecker delta is a function of two variables, usually integers, which is 1 if they are equal
and 0 otherwise: \delta_{ij} = 1 if i = j, \delta_{ij} = 0 if i \ne j. For example \delta_{1,2} = 0,
but \delta_{3,3} = 1.
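Numerically, the Karhunen–Loève reduction amounts to an eigendecomposition of the covariance matrix. A small sketch on artificial data lying close to a plane (all names and the toy data are our assumptions):

```python
import numpy as np

def pca_reduce(X, m):
    """Karhunen-Loeve transform: project the d-dimensional rows of X onto the
    m eigenvectors of the covariance matrix with the largest eigenvalues."""
    x_bar = X.mean(axis=0)
    Sigma = (X - x_bar).T @ (X - x_bar)      # covariance matrix (unnormalized)
    eigval, eigvec = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    U = eigvec[:, ::-1][:, :m]               # the m principal components
    Z = (X - x_bar) @ U                      # coefficients z_i = u_i^T x
    # Discarded error (2.46): half the sum of the d-m smallest eigenvalues
    err = 0.5 * eigval[:-m].sum() if m < X.shape[1] else 0.0
    return Z, U, err

# Toy data: 3-dimensional points that in fact lie exactly on a 2-D plane
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 2))
X = np.column_stack([A[:, 0], A[:, 1], 0.3 * A[:, 0] - 0.2 * A[:, 1]])
Z, U, err = pca_reduce(X, m=2)
```

Since the third coordinate is an exact linear combination of the first two, the discarded eigenvalue is essentially zero: reducing to two components loses almost no information, which is exactly the favorable case described above.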
2.6.3 Nonlinear Principal Components using neural networks
Neural networks can also be used for the reduction of dimensionality problem.
A network is trained to map the d-dimensional input space onto itself over an
m-dimensional (m < d) hidden layer. Let us consider a four-input-variable network
encoded by two log-sigmoid functions of neurons n in a dimensionality reduction
mapping, as shown in FIGURE 2.7.
FIGURE 2.7: Neural principal components (inputs x1–x4 are encoded by N-neurons, compressed into H units, and decoded by Q-neurons back into the inputs)
First, two N-neurons for the dimensionality reduction mapping are linearly
combined to form H neural principal components. These are then decoded by further
log-sigmoid Q-neurons for the reconstruction mapping, which are linearly combined
to generate the inputs as the output layer. Thus the inputs x_1, \ldots, x_n are
mapped into themselves. Letting X be a matrix with K columns, and J and P the
numbers of neurons, the model can be formalized by the following system of
equations:
n_j = \sum_{k=1}^{K} \nu_{j,k} X_k , \qquad N_j = \frac{1}{1+\exp(-n_j)} ,

H_p = \sum_{j=1}^{J} \beta_{p,j} N_j , \qquad q_j = \sum_{p=1}^{P} \varphi_{j,p} H_p ,

Q_j = \frac{1}{1+\exp(-q_j)} , \qquad \hat{X}_k = \sum_{j=1}^{J} \psi_{k,j} Q_j .  (2.47)
Naturally, this system of equations can be optimized by solving the sum of squared
errors minimization problem \min \{ \Psi(x, \hat{x}) : \Omega \in \mathbb{R}^n \},
where \Psi is a loss function.
McNelis (2005) shows that nonlinear principal component analysis outperforms the
linear one with much better accuracy. The main drawback is again the time needed to
find the optimum.
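A single forward pass through the system (2.47) can be sketched as follows; the weights here are random, whereas in practice they would be fitted by the minimization above. Dimensions and all names are illustrative assumptions:

```python
import numpy as np
rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def autoassociative_forward(X, J=2, P=2, H=1):
    """One forward pass of system (2.47): encode K inputs through J
    log-sigmoid neurons into H neural principal components, then decode
    back to K outputs.  Training would fit the four weight matrices by
    minimizing the sum of squared reconstruction errors."""
    K = X.shape[1]
    nu   = rng.normal(size=(J, K))    # encoder weights nu_{j,k}
    beta = rng.normal(size=(H, J))    # compression weights beta_{p,j}
    phi  = rng.normal(size=(P, H))    # decoder weights phi_{j,p}
    psi  = rng.normal(size=(K, P))    # reconstruction weights psi_{k,j}
    N  = sigmoid(X @ nu.T)            # n_j = sum_k nu_{j,k} X_k, then N_j
    Hc = N @ beta.T                   # neural principal components H_p
    Q  = sigmoid(Hc @ phi.T)          # q_j = sum_p phi_{j,p} H_p, then Q_j
    X_hat = Q @ psi.T                 # X_hat_k = sum_j psi_{k,j} Q_j
    return Hc, X_hat

X = rng.normal(size=(100, 4))         # 100 observations of 4 inputs
Hc, X_hat = autoassociative_forward(X)
```

The bottleneck column Hc plays the role of the nonlinear principal component: the whole 4-dimensional input is forced through a single unit before being reconstructed.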
2.6.4 Stationarity: Dickey–Fuller Test
Most of the time series considered in this thesis are time dependent, and
before starting to work with them we need to difference the data to obtain
covariance stationary time series. A series is said to be (weakly or covariance)
stationary if its first and second moments38 are constant through time.
The most commonly used test for stationarity is the one proposed by Dickey
and Fuller (1979). For a given series \{y_t\}:

\Delta y_t = \rho y_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \varepsilon_t ,  (2.48)
where \Delta y_t = y_t - y_{t-1}, \rho and \gamma_i are coefficients to be
estimated, and \varepsilon_t is a random disturbance term with E(\varepsilon_t) = 0
and E(\varepsilon_t^2) = \sigma^2. Under the null hypothesis, \rho = 0. From
equation (2.48) we can see that if this holds, y_t at any time will be equal to
y_{t-1} plus/minus the effects of the remaining terms. Thus the long-run expected
value of the series is uncertain, as y_t = y_{t-1} and E(y_t) = E(\varepsilon_t) = 0.
Series with \rho = 0 are called nonstationary, or a unit root process.
If there is some persistence in the model, with \rho falling in the interval
(-1, 0), the relevant regression changes to:

y_t = (1+\rho) y_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \varepsilon_t .  (2.49)

In the long run it is still valid that \Delta y_{t-i} = 0 for i = 1, \ldots, k.
But the long-run mean reduces to the following, with \rho^* = (1+\rho):

y_t (1 - \rho^*) = \varepsilon_t .  (2.50)

Then the expected value of y_t is E(y_t) = E(\varepsilon_t) / (1 - \rho^*).
For stationarity it is necessary that the coefficient \rho be significantly less
than zero. The Dickey–Fuller tests are modified, one-sided t-tests of the
hypothesis \rho < 0 in a linear regression, and allow the presence of constant and
trend terms in the regressions.
Most of the time, financial series are nonstationary themselves and need to be
first-differenced to achieve stationarity. Logarithmic first differencing usually
helps, and it is nothing else than transforming the financial series into returns:

r_t = \ln(P_t) - \ln(P_{t-1}) ,  (2.51)

where r_t is the return of the series and P_t is the series itself. After
transforming the series, we should use the proposed Dickey–Fuller test39 to make
sure that our testing series are stationary.
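The transformation (2.51) and the simplest form of the test regression (2.48), without lagged difference terms, can be sketched as follows. The resulting t-statistic must be compared to Dickey–Fuller critical values, not to the usual Student t table; the simulated series and all names are ours:

```python
import numpy as np

def log_returns(P):
    """Eq. (2.51): r_t = ln(P_t) - ln(P_{t-1})."""
    P = np.asarray(P, dtype=float)
    return np.log(P[1:]) - np.log(P[:-1])

def df_tstat(y):
    """Dickey-Fuller regression without lag terms: dy_t = rho*y_{t-1} + e_t.
    Returns the t-statistic of rho, to be compared with DF critical values."""
    y = np.asarray(y, dtype=float)
    dy, ylag = np.diff(y), y[:-1]
    rho = (ylag @ dy) / (ylag @ ylag)        # OLS estimate of rho
    resid = dy - rho * ylag
    s2 = (resid @ resid) / (len(dy) - 1)     # residual variance
    se = np.sqrt(s2 / (ylag @ ylag))         # standard error of rho
    return rho / se

# Simulated random-walk "price": the level has a unit root, the returns do not
rng = np.random.default_rng(3)
prices = np.exp(np.cumsum(0.01 * rng.normal(size=1000)) + 5.0)
r = log_returns(prices)
t_level = df_tstat(np.log(prices))   # to be compared with DF critical values
t_ret   = df_tstat(r)                # strongly negative for stationary returns
```

For the stationary return series the statistic is far below any DF critical value, while for the random-walk level it is not, illustrating why the differencing step above matters.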
2.6.5 Data scaling
Sometimes we use data with very high or low numbers, or outliers, which may cause a
computer to assign zero to the values being minimized. Sometimes we want to test
differently scaled data, e.g. if we want to test the effects of interest rate
changes on the market, or sometimes our data simply contain too many zero values,
which causes errors in the estimation process. In all these cases it may be crucial
to scale the data right after we have gained stationarity.
The reader should also note that using, e.g., log-sigmoid functions for estimation
may cause large values to simply be assigned40 1 and low values 0. Then it is very
likely that we lose information, so there may be a need to transform the data. The
simplest is a linear scaling function to the range (0,1) or (-1,1), which uses the
maximum and minimum values of the series x. The following equations represent
scaling to the intervals (0,1) and (-1,1), respectively:
39 Or an alternative to it
40 For illustration see FIGURE 2.2. : Logsigmoid function, p. 19
x_{i,t}^{*} = \frac{x_{i,t} - \min(x_i)}{\max(x_i) - \min(x_i)} ,  (2.52)

x_{i,t}^{*} = 2 \cdot \frac{x_{i,t} - \min(x_i)}{\max(x_i) - \min(x_i)} - 1 .  (2.53)
There are also nonlinear methods of scaling the data, which transform the series
x_i, say, to z_i, e.g. in the following way. First we standardize the series, and
then use the nonlinear transformation:

x^{*} = \frac{1}{1+\exp(-z)} ,  (2.54)

z = \frac{x - \bar{x}}{\sigma_x} .  (2.55)
Of course, it is often very hard to say which of the transformations should be
used. It depends only on the results obtained, so the researcher is left with the
trial-and-error method. Luckily for us, most financial series do not need scaling,
since first differencing usually helps to “keep” the data in narrow ranges. In
other words, how many times do we see 1, representing a 100% return, between two
consecutive time periods x_t and x_{t+1}? On the other hand, we should always keep
this possibility of data preprocessing in mind.
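The scaling transformations (2.52)–(2.55) are one-liners; a short sketch (the function names are ours):

```python
import numpy as np

def scale01(x):
    """Linear scaling to (0,1), eq. (2.52)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def scale11(x):
    """Linear scaling to (-1,1), eq. (2.53)."""
    return 2.0 * scale01(x) - 1.0

def scale_logistic(x):
    """Nonlinear scaling, eqs. (2.54)-(2.55): standardize, then squash."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 4.0, 6.0, 10.0])
a = scale01(x)        # endpoints of the series map to 0 and 1
b = scale11(x)        # endpoints map to -1 and 1
c = scale_logistic(x) # all values squashed strictly into (0,1)
```

Note that the min-max variants are sensitive to outliers (a single extreme value compresses the rest of the series), which is one reason the logistic variant may be preferred.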
2.7 Evaluation of estimated models
Until now we have presented the complex procedure of estimation with neural
networks. In this section we briefly present a few criteria which will help us
interpret the results. We will work with in-sample criteria, or training period
results, which in fact evaluate how well the estimates fit the modelled data. We
will see that a model which explains most of the variation of the training data
may turn out to be inapplicable for forecasting purposes – that is, on
out-of-sample data which the model “did not see before”, also called testing data.
The out-of-sample criteria will be the most important for us in the testing part.
The framework of the empirical testing is the following. After preprocessing
the data, we divide it into 2 or 3 samples – training, cross-validation and
testing sets. The neural network is estimated using the training data, and the
optimal weights are found at this stage. Then the weights are applied to the
cross-validation data and may be slightly adapted if we find that the in-sample
criteria have deteriorated. Only after that is the last set of data put to the
test: the coefficients obtained from training are used to perform on new data
which had no impact on calculating the coefficients, which is the most important
part. The reader should also be aware of the proportion of the training to the
testing data set. Most studies cut 20–25% of the data for testing purposes, but it
can be crucial for our results to do this with patience. Imagine we want to model
AAA stock returns, and on 15.01.2002 there were huge reforms in the company leading
to consistently higher than expected profits. This would also have an impact on the
returns of our AAA company, and if we train the network on the data until
15.01.2002 and try to test it further on, we may be extremely disappointed. Our
model will know only the pattern from the pre-reform period, while the pattern of
returns changed after that date, so our model will not be capable of dealing with
it.
2.7.1 Normality
It is common practice in econometric modelling to assume that residuals come from
a Gaussian or normal distribution. The assumption may be needed for efficiency,
and we often do not relax it in neural network modelling either. The well-known
Jarque–Bera (1980) test starts from the fact that a normal distribution has zero
third moment – skewness41 S – and a fourth moment – kurtosis42 K – of 3, and
measures the difference from the normal distribution. Given the residual vector
\hat\varepsilon, the Jarque–Bera statistic is formalized as follows:

JB(\hat\varepsilon) = \frac{N-k}{6} \left[ S^2 + \frac{(K-3)^2}{4} \right] .  (2.56)
41 Skewness is a measure of the asymmetry of the distribution of the series around its mean. We
compute it as S = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \bar{y}}{\hat\sigma} \right)^3,
where \hat\sigma is an estimator of the standard deviation. Positive skewness means a long right
tail, negative skewness implies a long left tail.
42 Kurtosis measures the peakedness or flatness of the distribution of the series. We compute it as
K = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \bar{y}}{\hat\sigma} \right)^4. If kurtosis
exceeds 3, the distribution is peaked, said to be leptokurtic; if it is less than 3, it is flat –
platykurtic – relative to the normal distribution.
Under the null hypothesis of a normal distribution, the Jarque–Bera statistic is
distributed as \chi^2(2). The reported probability is the probability that the
Jarque–Bera statistic exceeds the observed value in absolute value under the null
hypothesis. Thus, e.g., JB = 4.32 (p = 0.0395) tells us that we reject the null
hypothesis of normal distribution at the 5% significance level, but not at the 1%
significance level.
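The statistic (2.56), together with the moment formulas of footnotes 41–42, can be computed directly (the simulated residuals below are illustrative):

```python
import numpy as np

def jarque_bera(resid, k=0):
    """Eq. (2.56): JB = ((N-k)/6) * (S^2 + (K-3)^2/4), with skewness S and
    kurtosis K computed as in footnotes 41-42."""
    e = np.asarray(resid, dtype=float)
    N = len(e)
    sd = e.std()                              # estimator of the std deviation
    S = np.mean(((e - e.mean()) / sd) ** 3)   # sample skewness
    K = np.mean(((e - e.mean()) / sd) ** 4)   # sample kurtosis
    return (N - k) / 6.0 * (S ** 2 + (K - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(4)
jb_normal = jarque_bera(rng.normal(size=5000))       # tends to be small
jb_skewed = jarque_bera(rng.exponential(size=5000))  # huge: S ~ 2, K ~ 9
```

For the exponential sample both moments deviate sharply from their Gaussian values, so JB is far beyond the 5% critical value of \chi^2(2) and normality is rejected.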
2.7.2 Goodness of fit
The R-squared coefficient (multiple correlation coefficient) is probably the most
commonly used measure of the overall goodness of fit of a model. It can be simply
interpreted as the fraction of the variance of the dependent variable explained by
the independent variables. The value of the statistic falls into the (0,1)
interval43: if it is 0, the model fits the data no better than the simple mean of
the dependent variable; if it is 1, the model explains the variance perfectly. The
statistic is represented by:

R^2 = \frac{\sum_{t=1}^{T} (\hat{y}_t - \bar{y})^2}{\sum_{t=1}^{T} (y_t - \bar{y})^2} .  (2.57)
One problem with using R^2 as a measure of goodness of fit is that it never
decreases as we add regressors. As an extreme case, we can obtain R^2 = 1 if we
include as many independent regressors as there are sample observations. Thus the
adjusted R^2 measure, \bar{R}^2 or adjR^2, is used, which penalizes R^2 for the
addition of regressors that do not contribute to the explanatory power of the
model. It can be computed as follows:

\bar{R}^2 = 1 - (1 - R^2) \frac{T-1}{T-k} ,  (2.58)

and is naturally never larger than R^2. In all our tests, if we refer to R^2, we
refer to the adjusted statistic.
43 Please note that for a number of reasons this coefficient can also be negative in standard
econometric modelling, for example if the regression does not have an intercept or constant, if it
contains restrictions, or if the model is two-stage least squares or ARCH.
2.7.3 Schwarz Information Criterion
One way to modify the R^2 statistic is to make use of the Schwarz (1978)
information criterion, which corrects the performance of a model for the number of
parameters, k, it uses. The model with the lowest value of the statistic is simply
preferred:

    SC = \ln\left(\frac{1}{T}\sum_{t=1}^{T}(y_t - \hat{y}_t)^2\right) + \frac{k\ln(T)}{T} .    (2.59)

Alternatively, the Akaike or Hannan-Quinn statistics may be used, which punish a
given model by a factor of 2k/T or 2k\ln(\ln(T))/T respectively. The Schwarz
criterion punishes the model more than the others, by a factor of k\ln(T)/T.
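The three criteria share the same goodness-of-fit term and differ only in the penalty, which can be sketched as follows (a minimal Python sketch; function names are ours):

```python
import numpy as np

def schwarz_criterion(residuals, k):
    """SC as in (2.59): ln(mean squared residual) + k*ln(T)/T."""
    e = np.asarray(residuals, float)
    T = len(e)
    return np.log(np.mean(e ** 2)) + k * np.log(T) / T

def akaike_criterion(residuals, k):
    # Same fit term, penalty 2k/T.
    e = np.asarray(residuals, float)
    T = len(e)
    return np.log(np.mean(e ** 2)) + 2 * k / T

def hannan_quinn_criterion(residuals, k):
    # Same fit term, penalty 2k*ln(ln(T))/T.
    e = np.asarray(residuals, float)
    T = len(e)
    return np.log(np.mean(e ** 2)) + 2 * k * np.log(np.log(T)) / T
```

For T > e^2 observations the Schwarz penalty k ln(T)/T exceeds the Akaike penalty 2k/T, which is why SC tends to select more parsimonious models.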
2.7.4 Q-Statistics
Besides the properties of constancy of variance and normality of residuals,
serial independence is the next step in evaluating whether there is any
information content left in the residuals. If the model is well specified, the
residuals should not contain any pattern in their first and second moments. Thus
we need to test for serial independence and homoskedasticity, or constancy of
variance.
If autocorrelation is absent, the residuals are unpredictable from past
data. The autocorrelation function is defined by the following equation, for
different lag lengths m:

    \hat{\rho}_m = \frac{\sum_{t=m+1}^{T} \hat{\varepsilon}_t \hat{\varepsilon}_{t-m}}{\sum_{t=1}^{T} \hat{\varepsilon}_t^2} .    (2.60)

The following statistic proposed by Ljung and Box (1978) is then used for examining
the joint significance of the first M residual autocorrelations, with asymptotic chi-squared
distribution with M degrees of freedom, \chi^2(M):

    Q(M) = T(T+2)\sum_{m=1}^{M}\frac{\hat{\rho}_m^2}{T-m} .    (2.61)
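Equations (2.60) and (2.61) translate directly into code (a minimal Python sketch; the function name is ours):

```python
import numpy as np

def ljung_box_q(residuals, M):
    """Ljung-Box Q(M), eqs. (2.60)-(2.61); asymptotically chi^2(M) under
    the null of no autocorrelation in the first M lags."""
    e = np.asarray(residuals, float)
    e = e - e.mean()                  # work with demeaned residuals
    T = len(e)
    denom = np.sum(e ** 2)
    q = 0.0
    for m in range(1, M + 1):
        rho_m = np.sum(e[m:] * e[:-m]) / denom   # rho_hat_m from (2.60)
        q += rho_m ** 2 / (T - m)
    return T * (T + 2) * q
```

A strongly alternating residual series, for instance, yields Q(1) far above the 5% chi-squared critical value of 3.84, so the null of serial independence is rejected.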
2.7.5 Root Mean Squared Error Statistic
The most common statistic for evaluating out-of-sample fitness under a
quadratic loss function is the root mean squared error statistic derived from the
mean squared error:

    RMSE = \sqrt{\frac{1}{T}\sum_{t=1}^{T}(y_t - \hat{y}_t)^2} ,    (2.62)

where T is the number of observations. The Normalized Mean Squared Error is also used
and is given by:

    NMSE = \frac{\sum_{t=1}^{T}(y_t - \hat{y}_t)^2}{\sum_{t=1}^{T}(y_t - \bar{y})^2} .    (2.63)

Please note that NMSE can also be expressed as 1 - R^2; see equation (2.57).
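Both statistics in (2.62) and (2.63) can be computed as follows (a minimal Python sketch; function names are ours):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error, eq. (2.62)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def nmse(y, y_hat):
    """Normalized MSE, eq. (2.63): squared errors relative to the variation
    of y around its mean; equals 1 for a constant-mean forecast."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

An NMSE below 1 therefore means the forecast beats the unconditional sample mean.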
2.8 Statistical Comparison of Predictive Accuracy
The key question in forecasting is the measurement of the accuracy of different
forecasts, as we are interested in the model producing the most accurate forecasts.
As we will compare the performance of various econometric models and neural
network models, we have to consider statistical methods for comparing the
results, so that we are able to identify whether neural network models help us in
producing more accurate results or not. This needs to be done by out-of-sample
model evaluation.
Let us consider two h-step forecasts, \{\hat{p}^i_{t+h|t}\}_{t=1}^{T} and \{\hat{p}^j_{t+h|t}\}_{t=1}^{T}, of the time
series \{p_{t+h}\}_{t=1}^{T}, with forecast errors \{\varepsilon^i_{t+h|t}\}_{t=1}^{T} and \{\varepsilon^j_{t+h|t}\}_{t=1}^{T}. To choose the model
with significantly lower prediction error, thus better accuracy, we wish to compare
the expected loss associated with both forecasts. Of course this will depend on
the chosen loss function as defined in (1.3). We will restrict ourselves here to loss functions
dependent on the forecast error, L(\varepsilon_{t+h|t}), and we will try to find the optimal h-step
prediction:

    \hat{P}^*_{t+h} \equiv \arg\min E\left[L(\varepsilon_{t+h|t}) \mid F_t\right] .    (2.64)
Thus we will test the null hypothesis of equal forecast accuracy for two
forecasts against the alternative hypothesis of unequal forecast accuracy:

    H_0: E\left[L(\varepsilon^i_{t+h|t}) - L(\varepsilon^j_{t+h|t})\right] = 0 ,    (2.65)

    H_1: E\left[L(\varepsilon^i_{t+h|t}) - L(\varepsilon^j_{t+h|t})\right] \neq 0 ,    (2.66)

where L(\varepsilon_{t+h|t}) is a positive loss function and L(\varepsilon^i_{t+h|t}) - L(\varepsilon^j_{t+h|t}) is the loss
differential.
In testing the null hypothesis, a choice of the loss function is
needed. In the next subchapter, we will present the quadratic loss function, when we
basically choose model j if (\varepsilon^j_{t+h|t})^2 < (\varepsilon^i_{t+h|t})^2; the mean absolute loss function, when
we choose model j if |\varepsilon^j_{t+h|t}| < |\varepsilon^i_{t+h|t}|; and also asymmetric loss functions, when we
are more concerned about positive errors than negative, or vice versa.
2.8.1 Optimal forecast under different loss functions
Quadratic loss function – under the quadratic loss function, we can define the
optimal h-step forecast as follows:

    \hat{P}^*_{t+h} \equiv \arg\min E\left[(p_{t+h} - \hat{p}_{t+h|t})^2 \mid F_t\right] ,    (2.67)

where the prediction is the conditional expectation \hat{P}_{t+h|t} = E\left[P_{t+h} \mid F_t\right] on the information
set F_t. Thus, considering two forecasts \{\hat{p}^i_{t+h|t}\}_{t=1}^{T} and \{\hat{p}^j_{t+h|t}\}_{t=1}^{T}, we will choose
forecast \{\hat{p}^j_{t+h|t}\}_{t=1}^{T} if it satisfies E\left[(p_{t+h} - \hat{p}^j_{t+h|t})^2\right] < E\left[(p_{t+h} - \hat{p}^i_{t+h|t})^2\right].
The quadratic loss function is the most popular in the literature; it is monotonically
increasing, symmetric, homogeneous of degree 2 and differentiable everywhere.
Mean absolute loss function – under the mean absolute loss function, the
optimal h-step forecast will be:

    \hat{P}^*_{t+h} \equiv \arg\min E\left[\,|p_{t+h} - \hat{p}_{t+h|t}|\, \mid F_t\right] ,    (2.68)

where the loss function L(\varepsilon_{t+h|t}) = |p_{t+h} - \hat{p}_{t+h|t}| is monotonically increasing, symmetric,
homogeneous and differentiable everywhere except at \varepsilon_{t+h|t} = 0.
Asymmetric loss functions – sometimes the researcher is more concerned
about positive errors (p_{t+h} - \hat{p}_{t+h|t} > 0) than about negative errors
(p_{t+h} - \hat{p}_{t+h|t} < 0), as they may be more costly. Two well-known asymmetric loss
functions are the linear exponential loss function – Linex, and the linear-linear loss
function – LinLin:
Linex loss function – under the linear exponential loss function, the optimal h-step
forecast will be:

    \hat{P}^*_{t+h} \equiv \arg\min E\left[\exp\left(a(p_{t+h} - \hat{p}_{t+h|t})\right) - a(p_{t+h} - \hat{p}_{t+h|t}) - 1 \mid F_t\right] ,    (2.69)

for a \neq 0. The function is asymmetric: for a > 0 it is almost linear to the left of the
y-axis and almost exponential to the right, and vice versa for a < 0. For this loss
function, we will try to find the \{\hat{p}^j_{t+h|t}\}_{t=1}^{T} which satisfies the following condition:

    E\left[\exp\left(a(p_{t+h} - \hat{p}^j_{t+h|t})\right) - a(p_{t+h} - \hat{p}^j_{t+h|t}) - 1\right] < E\left[\exp\left(a(p_{t+h} - \hat{p}^i_{t+h|t})\right) - a(p_{t+h} - \hat{p}^i_{t+h|t}) - 1\right] , \quad a \neq 0 .
Piecewise asymmetric loss functions take the form

    L(\varepsilon_{t+h|t}; a, b, \gamma) = \begin{cases} a\,L_1(\varepsilon_{t+h|t}; \gamma), & \varepsilon_{t+h|t} > 0 \\ b\,L_2(\varepsilon_{t+h|t}; \gamma), & \varepsilon_{t+h|t} \leq 0 \end{cases} , \quad a, b > 0 ,    (2.70)

where typically L_1(\varepsilon_{t+h|t}; \gamma) = L_2(\varepsilon_{t+h|t}; \gamma) = |\varepsilon_{t+h|t}|^{\gamma}. Special cases are \gamma = 1, the Lin-Lin
loss function, and \gamma = 2, the quad-quad loss function, both non-differentiable at
zero, but continuous, and asymmetric for a \neq b.
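The loss functions of this subchapter can be collected in a short sketch (Python; function names and the sign conventions of the comments are ours):

```python
import numpy as np

def quadratic_loss(e):
    """Quadratic loss: e^2."""
    return np.asarray(e, float) ** 2

def absolute_loss(e):
    """Mean absolute loss: |e|."""
    return np.abs(np.asarray(e, float))

def linex_loss(e, a):
    """Linex loss, eq. (2.69): exp(a*e) - a*e - 1, a != 0.
    For a > 0, positive errors are penalized roughly exponentially
    and negative errors roughly linearly."""
    e = np.asarray(e, float)
    return np.exp(a * e) - a * e - 1.0

def linlin_loss(e, a, b):
    """Lin-lin loss, eq. (2.70) with gamma = 1: slope a for e > 0,
    slope b for e <= 0."""
    e = np.asarray(e, float)
    return np.where(e > 0, a * e, -b * e)
```

Evaluating these on the two error series and comparing the sample means implements the model-choice rules described above.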
2.8.2 Diebold-Mariano Test
The most important question is how we can determine whether the out-of-sample
fit of one model is significantly better than the out-of-sample fit of
another. Diebold and Mariano (1995) have proposed a test of the null
hypothesis of equal predictive ability against the alternative of unequal
predictive ability. For two non-nested44 models, let \{\varepsilon^i_{t+h|t}\}_{t=1}^{T} and \{\varepsilon^j_{t+h|t}\}_{t=1}^{T} be
the h-step-ahead prediction errors. Under the assumption that the errors are strictly
stationary, the null hypothesis of equal predictive accuracy is specified as
H_0: E\left[L(\varepsilon^i_{t+h|t}) - L(\varepsilon^j_{t+h|t})\right] = 0, and H_1: E\left[L(\varepsilon^i_{t+h|t}) - L(\varepsilon^j_{t+h|t})\right] \neq 0. The statistic is
based on the loss differential,

    d_t = L(\varepsilon^i_{t+h|t}) - L(\varepsilon^j_{t+h|t}) ,    (2.71)

and is the following:
    DM = \frac{\frac{1}{T}\sum_{t=1}^{T} d_t}{\sqrt{\frac{1}{T}\sum_{\tau=-(S(T)-1)}^{S(T)-1} 1\left\{\frac{\tau}{S(T)}\right\}\hat{\gamma}(\tau)}} \;\overset{a}{\sim}\; N(0,1) ,    (2.72)

where \hat{\gamma}(\tau) = \frac{1}{T}\sum_{t=|\tau|+1}^{T}(d_t - \bar{d})(d_{t-|\tau|} - \bar{d}), 1\{\tau/S(T)\} is the lag window, and
S(T) is the truncation lag. The statistic is based on the idea that for large
samples the mean loss differential, which is the numerator in (2.72), is
approximately normally distributed with mean \mu_d and variance 2\pi f_d(0)/T. The
denominator of (2.72) contains a consistent estimate of 2\pi f_d(0), which is a
weighted sum of the available sample autocovariances. For further details please
see Diebold, Mariano (1995).
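The construction of the statistic can be sketched as follows (a minimal Python implementation assuming a rectangular lag window with truncation lag h - 1, a common choice for h-step forecasts; the function name and defaults are ours):

```python
import numpy as np

def diebold_mariano(e_i, e_j, h=1, loss=lambda e: e ** 2):
    """DM statistic, eqs. (2.71)-(2.72). With d_t = L(e_i) - L(e_j),
    positive values favour model j; asymptotically N(0,1) under H0."""
    d = loss(np.asarray(e_i, float)) - loss(np.asarray(e_j, float))
    T = len(d)
    d_bar = d.mean()
    # gamma_hat(tau): sample autocovariances of the loss differential
    gamma = [np.mean((d[tau:] - d_bar) * (d[:T - tau] - d_bar))
             for tau in range(h)]
    var_d_bar = (gamma[0] + 2.0 * sum(gamma[1:])) / T
    return d_bar / np.sqrt(var_d_bar)
```

The returned value is then compared to the standard normal critical values, e.g. ±1.96 at the 5% level.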
Thus we will test whether the competing neural network model with out-of-sample
prediction errors \{\varepsilon^j_{t+h|t}\}_{t=1}^{T} is significantly better than a benchmark model with
prediction errors \{\varepsilon^i_{t+h|t}\}_{t=1}^{T}. The DM statistic is approximately normally
distributed under the null hypothesis of no significant difference in the predictive
accuracy of the models. Thus, if the neural network's prediction errors are
significantly lower than those of, for example, an ARIMA(p,d,q) model, the DM
statistic should exceed in absolute value the critical value of 1.96 at the 5%
significance level. We will therefore report the statistics and the p-values for them.
44 neither one is a special case of the other
2.9 Economic significance tests
In the final analysis, the criterion will rest on the question: "how do the
results of a neural network lend themselves to interpretations that make
economic sense and give us better information for decision making?".
Let z_{t+1|t} \equiv E\left[r^{\pi}_{t+1} \mid F_t\right] be the expected return on an optimal portfolio \pi for
period t+1, and r_{t+1} the rate of return on a risk-free asset at t+1, whose value is
known at time t. For this study, portfolio \pi will always consist of the asset being
predicted. A simple asset allocation strategy is formed45:

    \omega_{t+1} = \begin{cases} 1 & \text{if } z_{t+1|t} > r_{t+1} \\ 0 & \text{otherwise,} \end{cases}    (2.73)

where \omega_{t+1} is the fraction of wealth invested in the portfolio \pi. So we will invest in
the asset being predicted if its expected return is greater than the risk-free return,
and vice versa. The realized return on this trading strategy, x_{t+1}, will be
x_{t+1} = \omega_{t+1} r^{\pi}_{t+1} + (1 - \omega_{t+1}) r_{t+1}.
2.9.1 The Henriksson-Merton measure
Henriksson and Merton (1981) proposed a nonparametric measure to
evaluate the performance of the trading strategy described above. Let p_1 denote
the probability of a correct forecast in a "down" market and p_2 the
probability of a correct forecast in an "up" market:

    p_1 = \Pr\left[\omega_t = 0 \mid r^{\pi}_t \leq r_t\right] ,
    p_2 = \Pr\left[\omega_t = 1 \mid r^{\pi}_t > r_t\right] .

45 See Henriksson and Merton (1981), Lo and MacKinlay (1997).
p_1 + p_2 is a sufficient statistic for assessing the predictions46. A sufficient
condition for a forecast to have positive economic value is p_1 + p_2 > 1, while the
null hypothesis of no predictability can be formed as:

    H_0: p_1 + p_2 = 1  against  H_1: p_1 + p_2 > 1 .    (2.74)

Under the null hypothesis, n_1, the number of successful predictions in a "down"
market, has a hypergeometric distribution that can be asymptotically approximated
by a normal distribution:

    n_1 \;\overset{a}{\sim}\; N\left(\frac{nN_1}{N},\; \frac{nN_1N_2(N-n)}{N^2(N-1)}\right) ,    (2.75)

where N = N_1 + N_2 is the total number of observations, N_1 is the number of observations where
r^{\pi}_t \leq r_t, n = n_1 + n_2 is the total number of predictions that r^{\pi}_t \leq r_t, while n_1 is the number
of successful predictions given r^{\pi}_t \leq r_t, and n_2 the number of unsuccessful
predictions. Thus the null hypothesis can be tested with this statistic by referring n_1 to the critical values of the normal distribution.
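The normal approximation in (2.75) can be evaluated directly (a minimal Python sketch with a one-sided p-value; the function name and the 0/1 encodings are ours):

```python
import numpy as np
from math import erf, sqrt

def henriksson_merton(omega, down):
    """HM market-timing measure. omega: 1 = invested, 0 = out of the market;
    down: 1 if the realized portfolio return was at or below the risk-free
    rate. Returns (n1, mean, std, one-sided p-value) from eq. (2.75)."""
    omega, down = np.asarray(omega), np.asarray(down)
    N = len(omega)
    N1 = int(down.sum())                           # "down" market observations
    N2 = N - N1
    n = int((omega == 0).sum())                    # total "down" predictions
    n1 = int(((omega == 0) & (down == 1)).sum())   # correct "down" calls
    mean = n * N1 / N
    var = n * N1 * N2 * (N - n) / (N ** 2 * (N - 1))
    z = (n1 - mean) / sqrt(var)
    p = 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))     # P(Z > z)
    return n1, mean, sqrt(var), p
```

A small p-value rejects the null of no market-timing ability in favour of p_1 + p_2 > 1.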
2.9.2 The Break-Even Transaction Costs
Another direct measure of the economic significance of stock return
predictability can be found in Lo and MacKinlay (1997). Basically, they measure
break-even transaction costs by equating the total return on an active market-timing
trading strategy with the total return on a passive investment. The end-of-period
values of a one-dollar investment over the entire period can be defined as:

    W^P_T = \prod_{t=1}^{T}\left(1 + r^{\pi}_t\right) ,
    W^A_T = \prod_{t=1}^{T}\left(1 + \omega_t r^{\pi}_t + (1 - \omega_t) r_t\right) ,

where A and P denote the active and passive strategies. If we switch between these two portfolios k
times, the one-way transaction costs (100 x c) can be found from the equation:
46 Merton (1981)

    W^P_T = W^A_T (1 - c)^k ,

hence

    c = 1 - \left(\frac{W^P_T}{W^A_T}\right)^{1/k} .    (2.76)

(100 x c) are the implied transaction costs, and if we compare them with real-world
transaction costs, we obtain a measure of the economic significance of stock
return predictability.
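The break-even cost in (2.76) follows from two cumulative products (a minimal Python sketch; the function name is ours, and the active returns are assumed to be the per-period returns of the switching strategy):

```python
import numpy as np

def breakeven_cost(active_returns, passive_returns, k):
    """Break-even one-way transaction cost c from (2.76):
    W_T^P = W_T^A * (1 - c)^k  =>  c = 1 - (W_T^P / W_T^A)**(1/k),
    where k is the number of switches between the two portfolios."""
    w_a = np.prod(1.0 + np.asarray(active_returns, float))
    w_p = np.prod(1.0 + np.asarray(passive_returns, float))
    return 1.0 - (w_p / w_a) ** (1.0 / k)
```

If the active strategy earns no more than the passive one, the implied cost is zero or negative, i.e. the timing signal has no economic value.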
2.9.3 Pesaran and Timmermann nonparametric market timing
In financial time series one is often interested more in the sign of the
stock return predictions than in their exact value. If we have a good sign-predicting
model, we can use it for the construction of simple signals. If the model
predicts a positive change, a buy signal is created; if a negative change, a sell
signal. Furthermore, if the predicted sign is the same as for the
previous period, a hold signal is created.
Such a statistic was formalized by Pesaran and Timmermann (1992) and is
based on the null hypothesis that a given model has no economic value in
forecasting the direction of change. The statistic is defined as follows:

    PT = \frac{SR - SRI}{\sqrt{\mathrm{var}(SR) - \mathrm{var}(SRI)}} \;\overset{a}{\sim}\; N(0,1) ,    (2.77)

where SR is the success ratio, computed as a weighted average of
I_h = 1\{p_{t+h} \cdot \hat{p}_{t+h} > 0\}, and SRI is the estimate of the probability of correctly predicting the
direction of change assuming independence between the actual and the predicted
directions, SRI = \hat{D}D + (1 - \hat{D})(1 - D), where D and \hat{D} are weighted averages of
I^{actual}_h = 1\{p_{t+h} > 0\} and I^{predicted}_h = 1\{\hat{p}_{t+h} > 0\} respectively.
Thus the PT statistic is approximately distributed as standard normal
under the null hypothesis that the signs of the forecasts and the signs of the actual
variables are independent. Hence, if we have a model with very good
predictive accuracy, the forecasted and actual signs will be statistically dependent,
and the forecasting model will have economic significance.
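For the equal-weights, one-step case the statistic can be sketched as follows (Python; the function name is ours, and the variance expressions follow the standard Pesaran-Timmermann (1992) formulas):

```python
import numpy as np
from math import sqrt

def pesaran_timmermann(actual, predicted):
    """PT directional-accuracy statistic, eq. (2.77), with equal weights.
    Asymptotically N(0,1) under independence of actual and predicted signs."""
    y = np.asarray(actual, float)
    y_hat = np.asarray(predicted, float)
    T = len(y)
    sr = np.mean(y * y_hat > 0)                 # success ratio SR
    d = np.mean(y > 0)                          # share of actual "up" moves
    d_hat = np.mean(y_hat > 0)                  # share of predicted "up" moves
    sri = d * d_hat + (1 - d) * (1 - d_hat)     # SR under independence
    var_sr = sri * (1 - sri) / T
    var_sri = ((2 * d_hat - 1) ** 2 * d * (1 - d) / T
               + (2 * d - 1) ** 2 * d_hat * (1 - d_hat) / T
               + 4 * d * d_hat * (1 - d) * (1 - d_hat) / T ** 2)
    return (sr - sri) / sqrt(var_sr - var_sri)
```

A value above 1.96 rejects sign independence at the 5% level, i.e. the forecasts have directional content.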
2.10 Blackbox criticism
The growth in popularity of neural networks in recent years has led some
researchers to make partial judgments in favor or against these models. In this
section, we will review a few of these claims and discuss the blackbox criticism.
Let us start with a few statements:
(i) Networks do not require the type of distributional assumptions used in
econometrics
(ii) Networks are intelligent systems that learn
(iii) The early stopping procedure requires arbitrary decisions by the
researcher
Some researchers, such as Aiken and Bsat (1999), claim that neural
networks are not constrained by the distributional assumptions used in other
statistical methods. However, as demonstrated by Sarle (1998), neural networks
involve exactly the same type of distributional assumptions as other statistical
methods. For more than a century, statisticians have studied the properties of
various estimators and have identified the conditions under which these
estimators are efficient, i.e. when they yield consistent unbiased estimates with a
minimal variance. They discovered, for example, that efficient results are
obtained when the errors are normally distributed with zero mean, are
uncorrelated with each other, and have a constant variance throughout the
sample. By rigorously identifying these optimality conditions, statisticians have
been able to assess the consequences of the violation of these conditions. Since
many neural networks are equivalent to statistical methods, they require the
exact same conditions to attain an optimal performance. This implies, among
others, that the residuals of a neural network should be subjected to the same
diagnostic tests that are applied to the residuals of a linear regression model.
Researchers who ignore these optimality conditions and proceed to estimate their
network weights will obtain suboptimal estimates. Most empirical studies
involving neural networks do not pay attention to these optimality conditions.
Researchers also tend to ignore issues of stationarity when building their
network. A prudent researcher should verify that all variables in the network are
stationary before experimenting with different architectures. In fact, level
variables that are trend stationary but that are not bounded could also pose
problems for the network. Since a hidden unit produces a value that is bounded,
the use of input variables that grow continuously over time could eventually lead
the hidden units to reach their maximal or minimal value. The contribution of
each hidden unit to the network's output (which is given by the value of the
hidden unit multiplied by the weight connecting it to the output unit) would then
remain constant, even if the boundless input continues to grow over time. This
would result in a deterioration of forecasting accuracy for subsequent periods.
Similar problems would arise when attempting to forecast a level variable that
grows continuously over time. Hence, even trend stationary level variables should
be transformed so that they do not grow continuously over time (e.g. by using
the first difference, the growth rate, the ratio to GDP, etc.)
Also when implementing the early stopping procedure, the researcher must
make a certain number of arbitrary decisions that can have a significant bearing
on the estimation results. First, the researcher must divide the sample into
training, validation, and test sets. A commonly used "rule of thumb" consists in
retaining 25 percent of the sample for the validation and test sets, with the
remainder being allocated to the training set. However, this guideline does not
have any theoretical or empirical foundations, as results vary depending on the data
used. In addition, the researcher must decide which observations to include in
each set. Some researchers assemble their validation set from the most recent
observations in their time series, while others randomly select observations from
the entire sample. Once again, there is no objective rule to this effect.
This criticism should not be overemphasized since a researcher can estimate
the network using a different division of the data into the various sets and thus
assess the sensitivity of the results to this allotment. Moreover, it is important to
remember that econometricians make similar arbitrary decisions when they
withhold observations from their sample in order to make outofsample
forecasts. Econometricians using timeseries data typically withhold an arbitrary
number of observations from the end of their sample, since they are interested in
assessing the model's capacity to forecast the future. To the extent that
researchers in the neural network field assemble their validation and test sets
from the last observations of the sample, they will be consistent with standard
econometric practice.
The beauty of neural networks is that they can model the behavior of agents
in the process of learning, without giving them a model according to
which they should change their behavior. A nice example is the Black-Scholes option
pricing model47, which was found to approximate the behavior of agents in the
markets who are searching for arbitrage opportunities. Nowadays, the model is
used for option pricing and, in fact, agents have adjusted their decisions to it.
Hutchinson, Lo and Poggio (1994) showed that neural networks can
learn Black-Scholes very quickly; the reader can use this reference to learn more
about this research. In the last chapter, we will use a neural network to price a
warrant on a Czech security and compare it to Black-Scholes pricing. This is an
example which shows us that neural networks have great potential in
approximating the behavior of agents without "knowing" the model first. The neural
network is able to find the price of the option even more efficiently than Black-Scholes,
without using it, just by the process of learning. Thus, even if it remains a philosophical
question, the black-box criticism can be countered by this argument:
neural networks learn in a very efficient way, just as economic
agents do in their own learning process.
2.11 Concluding remarks
We discussed the process of modelling series by neural networks in this
chapter in depth, so that we can move on to test the theory on real data, as "Gray
is the theory, green is the life"48. We defined neural networks, and discussed and
formalized the learning processes of finding optimal solutions. We also discussed
data preprocessing methods and closed the chapter by defining the estimation
criteria for our modelling. So we are ready to put the theory to the test in the next
chapter.
We saw that when facing the task of estimating a model we have a large
number of choices at all stages of the modelling process. We can assign different
weights to in-sample and out-of-sample performance. We also have to decide, e.g.,
whether to take logarithms and first-difference the data, whether to deseasonalize or scale
them, what type of network specification to use, which diagnostics should
carry more weight, and so on.
Most of these questions generally take care of themselves in the process of
modelling. In general, as we want to find out and compare the performance relative to linear
models, we use the same data preprocessing and lags as we would use in linear
models. Thus sometimes, linear models can help us in choosing the input
47 Black and Scholes (1973), Merton (1973)
48 Mephistopheles words from Goethe’s tragedy Faust, Erster Teil, Studierzimmer.
variables of the network by estimating its in-sample performance. Of course, if
we have a poorly specified linear model, it will not be hard for the network to
outperform it. The in-sample performance of the network should also be better in
comparison to a well-specified linear model; the real test of performance is out-of-sample.
After the input specification, we start with the simplest networks and
search algorithms, moving to more complex ones. We always compare the
performance by the estimation criteria, and if these do not improve with more complex
methods, we should stick to the simpler ones. Commonly, with more variables and
more complexity we may have a better feeling of explaining the variance of the data,
but we may also end up disappointed when we test the model out-of-sample.
Generally, we should not lose parsimony, as parsimonious models
often outperform the more complex ones.
So the reader can see that it is a very complex process, we could say "state of
the art", where the researcher can influence the process in many ways and can directly
improve the results by choosing different optimization algorithms or
transformation functions in neurons. This is also one of the main drawbacks raised
in criticism of neural networks – the slowness of the estimation process. As we will
see, the time investment may bear some fruit.
Chapter 3
Application to CentralEuropean Stock
Market returns modelling
In this chapter, we will use the presented theory for modelling49 the
Central European stock markets, with emphasis on the prediction task defined in
1.3. We believe that the emerging markets represent the best ground for the use
of neural network models. The data are often much noisier because the
markets are very thin, and also due to the speed with which news spreads
among the market agents. Thus our assumption is that a neural network should be
able to help uncover the process.
As motivation for good modelling results on emerging markets, the
reader may be interested in the following research that has been carried out recently.
Almost all results are very impressive. Nygren (2004) examines the predictability
of the Swedish stock exchange, Mohan, Jha, Laha, and Dutta (2005) examine the neural
networks' predictive power on the Bombay stock exchange, and Cambazoglu (2003) finds
impressive patterns on the Turkish stock exchange. Finally, Yao, Tan, and Poh (1999)
study the Kuala Lumpur Stock Exchange with some impressive results.
Encouraged by previous research, we move on to test the power of neural
networks in the Central European markets against the linear methods discussed in the
first chapter. The outline of this chapter is as follows: firstly, we will use the artificial
Mackey-Glass time series for testing, as it is not constrained by sample
49 Please note that all tests were carried out using EViews 4.1 and NeuroSolutions 5.0 software – a
product that provides an environment for neural network modelling and the development of any learning
procedure. A free 60-day, fully functional evaluation copy can be ordered at
http://www.neurosolutions.com/, also with MATLAB or EXCEL extensions.
length and should prove the ability of networks to discover and learn the pattern
of a series almost perfectly. Then we will model the daily and weekly returns of the Central
European stock indices, namely of the Prague, Warsaw and Budapest stock
exchanges, which we believe describe the corresponding stock markets well. For
comparison, and for the development of a more complex forecasting model, we will also
analyze the index of Deutsche Börse, which is believed to be the most liquid in continental
Europe.
Finally, on the basis of cointegration analysis, we will develop a robust
forecasting model in which the indices are predicted from each other's lags, as
there have been recent studies of European stock market cointegration – see Žikeš
(2003) – who found European markets to be cointegrated.
3.1 Example of a Mackey-Glass artificial series
To show the power of the neural network approach relative to autoregressive
linear models, we start with a simple example of artificial data modelling50. A very
good motivation for the use of these data is that there is no size-of-the-sample
limit! The data are artificial, which means that they are produced by a model, and
thus we know that a functional form exists. According to the general
approximation theorem, a neural network should be able to learn the system by
which the data are generated. For this purpose, we use the Mackey-Glass51 time
series produced by the following time-delay differential system:

    \frac{dx}{dt} = -\beta x(t) + \frac{\gamma\, x(t-\tau)}{1 + x(t-\tau)^{10}} ,    (3.1)

where x(t) is the value of the time series at time t. This system is chaotic
for \tau > 16.8. We use the value \tau = 30, and \gamma, \beta values of 0.2 and 0.1
respectively. The data are scaled to the (-1,1) interval:
50 The reader is encouraged to use the McNelis (2005) reference for more examples of artificial data
modelling.
51 Mackey and Glass (1977)
FIGURE 3.1: Mackey-Glass chaotic time series (values scaled to the (-1,1) interval)
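A series of this kind can be generated, for instance, by a simple Euler discretization of (3.1) (a Python sketch; the step size, initial history and min-max scaling are our own choices, not those of the original experiment):

```python
import numpy as np

def mackey_glass(n, tau=30, gamma=0.2, beta=0.1, dt=1.0, x0=1.2):
    """Euler integration of dx/dt = -beta*x(t) + gamma*x(t-tau)/(1+x(t-tau)^10).
    tau > 16.8 gives chaotic dynamics; the result is scaled to (-1, 1)."""
    history = int(tau / dt)                 # delay buffer in steps
    x = np.full(n + history, x0)            # constant initial history
    for t in range(history, n + history - 1):
        x_tau = x[t - history]
        x[t + 1] = x[t] + dt * (-beta * x[t] + gamma * x_tau / (1 + x_tau ** 10))
    series = x[history:]
    # min-max scale to the (-1, 1) interval as in the text
    return 2 * (series - series.min()) / (series.max() - series.min()) - 1
```

Because the generating system is fully deterministic, a sufficiently flexible approximator should be able to learn it almost exactly, which is what the results below illustrate.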
Firstly, we reject the null hypothesis of normality with the help of the Jarque-Bera
test, the statistic being equal to 86.3, at the 1% significance level52. The value of the
Augmented Dickey-Fuller test statistic exceeds the critical values, so we can reject the
null hypothesis of a unit root; the series is thus stationary. We find strong
autocorrelation in the data, but we first try a simple regression with y_t being
explained by y_{t-1}, y_{t-2} and y_{t-3}. Autocorrelation remains strong in the residuals
even after estimating an ARMA(p,q) model. We find that ARMA(2,2) fits the
data best, but the Q-statistics still indicate serial dependence in the
residuals. The ARCH-LM test strongly suggests the presence of heteroskedastic
residuals, but we found that even a GARCH(1,1) model did not help.
Table 1: Estimation results: Mackey-Glass chaotic time series

Statistics          data        Autoregression   ARMA(2,2)   NN
adj R^2                         0.80             0.84        0.99
Q-stats                         165*             155*
Schwarz criterion               0.234583         0.431651    7.6489
ARCH-LM                         80.729*          50.39*
Dickey-Fuller       7.289867*
Jarque-Bera         86.73115*

Out-of-sample results
RMSE                            0.212            0.1916      0.0503969
NMSE                            0.162            0.132       0.0100132

                    AR vs. NN        ARMA vs. NN
DM(0)               14.88*           14.11*
DM(1)               17.86*           14.42*
DM(2)               14.33*           12.26*
DM(3)               16.53*           13.39*

* 1% significance level; DM statistics compare NN models with the benchmark
linear models.
52 For the distributions of the time series and all other test results see Appendix A
The in-sample performance of the models is quite good. The classical
regression and ARMA(2,2) explain 80% and 84% of the variance in the data,
respectively. A feedforward neural network with one layer and 3 neurons with a
log-sigmoid function and Levenberg-Marquardt optimization was chosen as an
alternative to the linear models. As we can observe from the results, it explains 99% of
the in-sample data. The Schwarz information criterion is also much better. The results are
very good for both the linear models and the network, but the real test will be out-of-sample
testing53.
FIGURE 3.2: Out-of-sample prediction error comparison (errors of the linear model, ARMA(2,2), and the neural network over 289 out-of-sample observations)
For the out-of-sample comparison, we use the Diebold-Mariano test (chapter 2.8.2) to compare
the errors of the simple autoregression and ARMA(2,2) with the neural network errors. The DM
statistic strongly rejects the null of no significant difference in predictive accuracy at the 1%
significance level for all tested lags. The neural network also managed to explain
98% of the data. The errors can be compared in figure (3.2).
Of course, we were testing artificial data, thus data which were "created"
and which obviously must contain a pattern. One would expect that if the data are
artificial, a good predicting model should recognize the pattern and use it for
powerful predictions. As we see, the linear models managed to uncover the pattern of the
artificial data quite well (ARMA a little better than the simple regression), but the neural
model was still much better at this task, as it predicted with significantly better
accuracy than the other models. We chose this example to show the power
of neural networks and their ability to learn a pattern. Clearly, if the underlying
data are generated by a process with a learnable pattern, networks will be preferred over the
other tested models. Thus we showed that the general approximation theorem is
valid, and in the next sections we will see how the models perform on real data – or,
maybe better said, whether the data are generated by any process that can be
uncovered at all.
53 We reserved 20% of the observations for out-of-sample forecasting.
3.2 European Stock markets
3.2.1 Data description
In the prediction task, we focus on a sample of 1566 daily returns54 from
January 2000 until April 2006 and 382 weekly returns from January 1999 until
April 2006 of the value-weighted indices PX50, WIG, BUX and DAX55. All the data
were downloaded from Bloomberg and regularly updated during the research.
Monthly returns were omitted because the sample size is very small, even for a neural
network. The descriptive statistics of the series are summarized in the following
table.
Table 2: The descriptive statistics
Daily (1564 observations) Weekly (381 observations)
BUX DAX PX50 WIG BUX DAX PX50 WIG
Mean 0,00067 0,00005 0,00075 0,00056 0,00293 0,00337 0,00028 0,00329
Median 0,00049 0,00045 0,00081 0,00047 0,00240 0,00525 0,00332 0,00507
Maximum 0,06004 0,07553 0,04179 0,05593 0,09569 0,08719 0,12887 0,11501
Minimum 0,07433 0,08875 0,06000 0,08468 0,13579 0,09876 0,13919 0,18100
Std. Dev. 0,01410 0,01690 0,01248 0,01281 0,02967 0,02748 0,03383 0,03402
Skewness 0,14797 0,01262 0,27616 0,12427 0,20928 0,23586 0,17928 0,40852
Kurtosis 4,88697 5,61569 4,38258 5,54571 4,61303 3,69753 4,27986 5,35253
JarqueBera 237,74* 445,90* 144,45* 426,35* 44,09* 11,26* 28,05* 98,46*
*Significant at the 1% level.
The Jarque-Bera test statistics tell us that all indices deviate from the normal
distribution for both daily and weekly returns. This is no surprise to us, because
financial time series are well known to be leptokurtic, but we will have a closer
look at the distributions to learn more about their shape. We will report the
histogram and the nonparametric Epanechnikov kernel density estimator – which
has the form K(u) = \frac{3}{4}(1 - u^2)\,I(|u| \leq 1) – for all series. The bandwidth h was
selected according to Silverman's rule of thumb, h = 0.9\min(s, R/1.34)N^{-1/5}, where
s is the sample standard deviation and R the interquartile range. See
Silverman (1986, equation 3.31).
54 To achieve stationarity, all the data are first differences of the log series, r_t = \ln P_t - \ln P_{t-1}
55 PX50 – Prague Stock Exchange, WIG – Warsaw Stock Exchange, BUX – Budapest Stock Exchange
and DAX – Deutsche Börse
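The kernel estimator and Silverman bandwidth can be sketched as follows (Python; function names are ours, and the grid-based evaluation is an illustration rather than the exact procedure used for the figures):

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule of thumb: h = 0.9 * min(s, R/1.34) * N^(-1/5)."""
    x = np.asarray(x, float)
    s = x.std(ddof=1)
    r = np.subtract(*np.percentile(x, [75, 25]))   # interquartile range
    return 0.9 * min(s, r / 1.34) * len(x) ** (-0.2)

def epanechnikov_kde(x, grid, h=None):
    """Kernel density estimate with K(u) = 0.75*(1 - u^2) on |u| <= 1."""
    x = np.asarray(x, float)
    h = silverman_bandwidth(x) if h is None else h
    u = (np.asarray(grid, float)[:, None] - x[None, :]) / h
    k = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)
    return k.mean(axis=1) / h
```

The resulting density integrates to one over any grid that covers the support of the data plus one bandwidth on each side.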
FIGURE 3.3: Histograms and kernel density functions compared to the normal distribution (kernel density estimate in orange, normal distribution in brown) for the daily and weekly returns of PX50, WIG, BUX and DAX
The distributions of the Central European stock markets are in line with the
developed stock market distributions. They are leptokurtic, as expected, which
means they are said to have heavy or fat tails. This may be attributed to
conditional heteroskedasticity, so it is important to note this before estimation.
3.2.2Empirical results – daily returns
We start with modelling the daily returns of each index with ARIMA
estimation. Augmented DickeyFuller statistics exceed the critical values on 1%
significant level, thus we can reject the null of presence of unit root and state that
all tested series are stationary. PX50 seems to follow ARIMA (1,0,1) best. BUX
returns seems to be explained well by ARIMA(2,0,2), WIG and DAX does not
contain AR and MA errors thus the random walk hypothesis can not be rejected
for them. LjungBox Q statistics shows us the presence of conditional
heteroskedasticity in the residuals from ARIMA models. So we will try to model it
by GARCH(1,1) model as it turns out that this model rules not only with its
parsimony, but also performance with these series. We find these ARIMAGARCH
models to be most appropriately specified. ARIMA(1,0,1)GARCH(1,1) for PX50,
ARIMA(2,0,2)GARCH(1,1) for BUX, and GARCH(1,1) for DAX and WIG returns.
According to the results in Table 4, the null hypothesis of no serial
correlation can be clearly rejected for the PX50 model and also for the BUX
model. These models therefore do not explain all of the variance and should be
used with caution for the prediction task. We will use them only as
representatives of linear modelling against the neural networks, because we
did not find any better specification for the data. This might be explained by
the use of daily stock returns, which are autocorrelated due to the effect of
nonsynchronous trading56. The use of weekly data in the following sections
should therefore improve the performance of these models.
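The Ljung-Box Q(k) statistic used throughout these tables can be sketched directly from its definition (a minimal numpy sketch; in practice one would compare Q against a chi-squared critical value with k degrees of freedom):

```python
import numpy as np

def ljung_box_q(x, k):
    """Ljung-Box portmanteau statistic Q(k) for serial correlation:
    Q = n(n+2) * sum_{j=1..k} rho_j^2 / (n-j)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    denom = np.sum(x**2)
    q = 0.0
    for j in range(1, k + 1):
        rho_j = np.sum(x[j:] * x[:-j]) / denom  # lag-j sample autocorrelation
        q += rho_j**2 / (n - j)
    return n * (n + 2) * q
```

Applied to ARIMA residuals, a large Q(k) signals remaining serial dependence, which is what motivates the GARCH(1,1) extension above.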
Table 4: In-sample performance on daily returns
                   PX50                 BUX                  WIG                  DAX
                   linear    neural     linear    neural     linear    neural     linear    neural
Adj. R-squared     0.004888  0.19       0.021550  0.11       0.0019618 0.09       0.024190  0.16
Schwarz criterion  -6.020920 -9.283     -5.732452 -8.58      -6.002524 -8.61      -5.715768 -6.5
Ljung-Box Q(4)     8.96*                3.8**                7.7                  3.31
Ljung-Box Q(8)     13.312**             5.98                 10.48                7.18
Ljung-Box Q(12)    16.42**              9.775                13.82                13.481
*, **, *** significance at the 1%, 5% and 10% levels
56 For more details on this issue see Campbell, Lo, MacKinlay (1997)
Table 5: Out-of-sample performance on daily returns
            PX50                    BUX                     WIG                     DAX
            linear      neural      linear      neural      linear      neural      linear      neural
RMSE        0.0199      0.00966     0.0154      0.0149      0.012       0.011       0.0086      0.008
NMSE        1.003       0.965       0.999       0.981       1.012       0.99        1.007       1.012
DM(0)       1.1 (0.14)              0.59 (0.25)             0.98 (0.16)             1.72 (0.04)
DM(1)       0.91 (0.18)             0.82 (0.2)              1.09 (0.13)             1.78 (0.036)
DM(2)       0.83 (0.21)             0.79 (0.22)             1.2 (0.11)              2.02 (0.022)
DM(3)       0.71 (0.24)             0.78 (0.21)             1.16 (0.13)             1.72 (0.04)
HM          1 (0.00)    1.08 (0.00) 1.01 (0.02) 1.02 (0.00) 1 (0.00)    1 (0.1)     1 (0.00)    1.03 (0.15)
PT          51% (0.2)   56% (0.07)  53% (0.12)  54% (0.02)  54% (0.5)   54% (0.3)   62% (0.4)   47% (0.13)
TC          0.002%      1.2%        0.31%       1.1%        0.002%      0.2%        0.03%       0.3%
DM: Diebold-Mariano statistic (p-values), HM: Henriksson-Merton statistic, PT: Pesaran-
Timmermann (SR with p-value), TC: total costs
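The Diebold-Mariano statistics reported as DM(0)–DM(3) can be sketched as follows. A hedged reading of the tables is that the argument indexes the number of autocovariance lags in the long-run variance estimate; the sketch below uses squared-error loss, which is the standard choice:

```python
import numpy as np

def diebold_mariano(e1, e2, h=0):
    """Diebold-Mariano test of equal predictive accuracy.
    e1, e2: forecast errors of the two competing models.
    h: number of autocovariance lags in the long-run variance.
    Returns a statistic that is ~ N(0,1) under the null of equal accuracy;
    positive values mean model 2 has lower squared-error loss."""
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    d = e1**2 - e2**2              # loss differential series
    n = len(d)
    dbar = d.mean()
    dc = d - dbar
    var = np.sum(dc**2) / n        # lag-0 autocovariance
    for j in range(1, h + 1):
        var += 2.0 * np.sum(dc[j:] * dc[:-j]) / n
    return dbar / np.sqrt(var / n)
```

A usage pattern matching the tables: compute the statistic on the linear-model and neural-network out-of-sample errors and read off a one-sided p-value from the standard normal distribution.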
To compare with these econometric tools, we model the stock returns
using the neural network methodology presented earlier. A simple feed-forward
time-delayed network with one hidden layer, trained by the Levenberg-Marquardt
algorithm, is used in the tests. Three lagged variables mapped into three input
neurons served as inputs, as we found this provided the best results. From the
results obtained we can see that there is only a very weak pattern to be learned
from our data: although the index returns are predictable to some extent, the
extent is very small. The neural networks perform somewhat better in explaining
the in-sample data. R2 increases from 0.4% achieved by the linear model to 19%
achieved by the neural net on the PX50 index, and similarly for the other
indices, as shown in Table 4. The Schwarz criterion also favours the neural
networks.
The real test, out-of-sample, however, does not reveal a large difference
between the linear and neural network models. We withheld 20% of the data for
out-of-sample testing as a rule of thumb. According to the Diebold-Mariano
test, we cannot reject the null hypothesis of equal predictive accuracy of the
linear and neural network models for any tested series except DAX. Thus the
neural network model does not seem to produce significantly different errors
for the tested daily returns.
The economic significance of the predictions differs. For all linear
models we cannot reject the null hypothesis of no predictability with either
the Henriksson-Merton statistic57 or the Pesaran-Timmermann statistic. Linear
models therefore have no economic value and should not be used for real
predictions; even the implied transaction costs are at a very low level. The
situation is a little different for the neural network models. With the PX50
and BUX data we can reject the null of no predictability, as HM is significant
at the 1% level for both data sets. PT is significant at the 10% level for
PX50 and also for BUX, which means that the null hypothesis of independence
between actual and forecasted signs can be rejected at the 10% significance
level. The implied transaction costs are higher than real-world transaction
costs. Thus even though the neural networks could not beat the linear models
with statistically significantly lower errors, they do seem to have economic
value for at least two of the tested series.
57 We use PRIBOR as the risk-free rate; it will also be used in the following tests.
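The sign-independence test reported as PT follows the standard Pesaran-Timmermann (1992) construction, which can be sketched as follows (a sketch of the textbook statistic, not the thesis's original code):

```python
import numpy as np

def pesaran_timmermann(actual, forecast):
    """Pesaran-Timmermann test of directional accuracy.
    Returns a statistic ~ N(0,1) under the null that actual and
    forecasted signs are independent."""
    y = np.asarray(actual, float)
    z = np.asarray(forecast, float)
    n = len(y)
    p_hat = np.mean(np.sign(y) == np.sign(z))     # success ratio (SR)
    py = np.mean(y > 0)
    pz = np.mean(z > 0)
    p_star = py * pz + (1 - py) * (1 - pz)        # expected SR under independence
    v_p = p_star * (1 - p_star) / n
    v_star = ((2 * py - 1) ** 2 * pz * (1 - pz) / n
              + (2 * pz - 1) ** 2 * py * (1 - py) / n
              + 4 * py * pz * (1 - py) * (1 - pz) / n ** 2)
    return (p_hat - p_star) / np.sqrt(v_p - v_star)
```

The success ratio p_hat is the percentage figure quoted in the PT rows of the tables; the statistic measures how far it lies above the level expected by chance.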
Although we can gain some predictive edge in the daily European stock
returns, the time series do not explain themselves very well. This may be
caused by autocorrelation that we could not remove, but given the
approximation power of neural networks, we think the tested daily returns may
simply be unpredictable, or yield only insignificant predictions. In the
following section we will see whether weekly data bring better results and
allow us to gain more predictive edge with the neural network models.
3.2.3 Empirical results – weekly returns
We start with a very similar approach for the weekly returns. The ADF
test confirms stationarity of the data, so we can proceed to the Box-Jenkins
methodology. PX50 follows an ARIMA(1,0,0); this result is interesting because
the weekly data no longer contain MA errors. The other weekly returns are best
explained by the same models as the daily ones. After inspecting the Q
statistics we add a GARCH(1,1) to model the heteroskedasticity in the
residuals, and we end with ARIMA(1,0,0)-GARCH(1,1) for PX50,
ARIMA(2,0,2)-GARCH(1,1) for BUX, and GARCH(1,1) for the DAX and WIG returns.
Interestingly, the null hypothesis of no serial correlation cannot be rejected
at the 1%, 5% or even 10% significance level. The models thus seem to explain
most of the variance in the data and can be used for prediction.
A feed-forward time-delayed neural network architecture with one hidden
layer, three inputs (lagged variables), a log-sigmoid squasher function and
the Levenberg-Marquardt algorithm is put to the test. From the results
obtained we can see that the in-sample improvement achieved by the neural
network is really significant, both in explanatory power and in the Schwarz
criterion.
Table 6: In-sample performance on weekly returns
                   PX50              BUX               WIG               DAX
                   linear   neural   linear   neural   linear   neural   linear   neural
Adj. R-squared     0.018    0.48     0.014    0.15     0.00056  0.28     0.00447  0.34
Schwarz criterion  -4.3765  -9.456   -3.95    -11.17   -4.25    -7.2966  -4.12    -6.87
Ljung-Box Q(4)     0.1034            1.445             7.07              3.236
Ljung-Box Q(8)     5.942             2.2489            16.331***         6.28
Ljung-Box Q(12)    8.087             7.6128            19.147***         10.814
*, **, *** significance at the 1%, 5% and 10% levels
Table 7: Out-of-sample performance on weekly returns
            PX50                    BUX                     WIG                     DAX
            linear      neural      linear      neural      linear      neural      linear      neural
RMSE        0.0206      0.01915     0.0342      0.015       0.025       0.019       0.022       0.02
NMSE        0.9806      0.978       0.993       0.987       1.03        0.97        1.029       0.99
DM(0)       2.01 (0.022)            1.78 (0.0375)           0.65 (0.25)             0.67 (0.25)
DM(1)       2.03 (0.021)            1.89 (0.029)            0.62 (0.26)             0.54 (0.29)
DM(2)       1.85 (0.031)            1.98 (0.023)            0.68 (0.24)             0.48 (0.31)
DM(3)       1.94 (0.025)            1.8 (0.035)             0.8 (0.21)              0.51 (0.29)
HM          1.07 (0.04) 1.09 (0.00) 0.9 (0.02)  1.01 (0.05) 1 (0.2)     1.1 (0.00)  1 (0.00)    1.2 (0.06)
PT          58% (0.25)  60% (0.09)  58% (0.2)   60% (0.12)  55% (0.15)  58% (0.09)  55% (0.3)   58% (0.07)
TC          0.4%        0.8%        0.6%        0.1%        0.01%       1%          0.03%       0.7%
DM: Diebold-Mariano statistic (p-values), HM: Henriksson-Merton statistic, PT: Pesaran-
Timmermann (SR with p-value), TC: total costs
Let us turn to the more interesting out-of-sample forecasts. The
Diebold-Mariano test tells us that the neural networks have significantly
lower errors than the linear models for PX50 and BUX, as the null hypothesis
of equal predictive accuracy can be rejected at the 5% significance level for
all lags. For the other two tested series, WIG and DAX, the null of equal
predictive accuracy cannot be rejected, so on these data the models perform
statistically similarly. As to the economic significance of the forecasts, we
reject the null hypothesis of no predictability using HM for the PX50 and WIG
series at the 1% significance level, and for DAX at the 10% significance
level. According to PT, the neural networks also produce significant sign
predictions, as the null hypothesis of independence between predicted and
actual signs can be rejected at the 10% significance level. The implied
transaction costs are quite low, but slightly higher than real-world ones58.
Only on the BUX series did the models not perform well: there we can reject
neither the null hypothesis of no predictability nor the null of sign
independence.
From the preceding tests we can conclude that there is a predictive edge
in the European stock markets. Neural networks seem to explain the time series
a little better than the classical approach, and when facing the prediction
task the results are also improved. We can say that with significant odds of
3:2 the next week's return can be predicted from raw price data with a neural
network. We use these results as the starting point for the development of a
more robust model in the next subchapter. While it is clear that one can gain
abnormal returns using the presented methods, we will try to propose a
different model which uses not only the lagged variables of the time series
itself, but also other variables, to gain more explanatory power and robust
results even on daily returns.
58 We found real-world transaction costs for a 10,000 EUR investment in the Czech Republic to be
about 0.05% on average.
3.3 PX50: Gaining the predictive edge
In the previous sections we found that the European stock markets contain
predictable components, but models using lagged data do not provide strong
results59 on daily data. On weekly data the models performed significantly
better in two cases, and we gained economic significance for almost all tested
series. In this chapter we continue with a different approach: we will look
for an empirical relationship between the European stock markets, and if we
find one, we will use it to build a model that brings a deeper understanding
of the PX50 stock market returns. In this part we use the same daily data as
described in the previous section60.
3.3.1 Cointegration of the BUX, WIG, DAX and PX50 markets
Our first hypothesis is that PX50, DAX, BUX and WIG are co-moving, and
thus the returns of these markets can be used to shed more light on their
patterns and to predict each other. Žikeš (2003) provided results of a
Johansen multivariate cointegration analysis and found that every market is
influenced by at least one lagged variable of the neighbouring markets.
Instead of conducting the same research and obtaining the same results, we use
his findings in our modelling. Let us first examine the very illustrative
FIGURE 3.4, where we plot the daily prices of all indices normalized to the
interval (0,1).
59 The results are not that bad, though. The reader should keep in mind that if we can predict future
returns with 55%-60% accuracy, we have roughly a 3:2 ratio of winning to losing trades. If we manage
to predict returns with 70% accuracy, it is actually an excellent result, as we have a 7:3 ratio of
winning to losing trades and can consistently earn abnormal returns from the market.
60 We remind the reader that for all of the tests we divide the sample into 70% in-sample, 10%
cross-validation and 20% out-of-sample for the neural nets, and 80% : 20% for the regressions.
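The (0,1) normalization used for FIGURE 3.4 is plain min-max scaling; a minimal sketch (not the thesis's plotting code):

```python
import numpy as np

def scale01(p):
    """Min-max scaling of a price series onto [0, 1], so that
    several indices can be plotted on a common axis."""
    p = np.asarray(p, dtype=float)
    return (p - p.min()) / (p.max() - p.min())
```

Each index series is scaled independently, so only the shape of the co-movement is comparable, not the levels.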
FIGURE 3.4: Daily price of BUX, DAX, WIG, PX50 scaled to (0,1)
From the figure it is clear that the Czech, Hungarian and Polish markets
move together; the German market fell much faster in the 2002-2003 period, and
in the middle of 2003 it joined the other markets but underperformed them.
From this period we can see that the markets are co-moving. With the rigorous
empirical background of Žikeš's analysis, we can use this information for
predicting the PX50 stock market returns.
First of all, we conduct a PCA to find which vectors influence the
market returns most. We conduct a classical regression PCA and also the
nonlinear neural network PCA described in section (2.6.3) for all four
indices; the log-sigmoid squasher function and the Levenberg-Marquardt
optimization mechanism are used. The results are in the following table:
Table 8: Results of PCA
                   PX50               BUX                WIG                DAX
                   classical neural   classical neural   classical neural   classical neural
Adj. R-squared     0.281     0.31     0.352     0.4      0.323     0.34     0.184     0.24
Schwarz criterion  -6.36     -9.13    -6.2      -9.13    -6.47     -9.25    -5.42     -8.2
Ljung-Box Q(4)     10.98**            20.2*              8.58***            17.23*
Ljung-Box Q(8)     13.241             30.058*            9.91               55.03*
Ljung-Box Q(12)    15.232             33.97*             14.69              59.95*
1 PX50 returns are explained by BUX, WIG and DAX with coefficients 0.276*, 0.243* and 0.076* resp.
2 BUX returns are explained by PX50, WIG and DAX with coefficients 0.322*, 0.376* and 0.117* resp.
3 WIG returns are explained by PX50, BUX and DAX with coefficients 0.2189*, 0.290* and 0.1075* resp.
4 DAX returns are explained by PX50, BUX and WIG with coefficients 0.191*, 0.258* and 0.30* resp.
*, **, *** significance at the 1%, 5% and 10% levels resp.
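The classical PCA step can be sketched as an eigendecomposition of the correlation matrix of the four return series (the nonlinear autoassociative-network variant of section 2.6.3 is not reproduced here):

```python
import numpy as np

def pca_components(R):
    """R: (T, k) matrix of return series. Returns the eigenvalues
    (descending) and corresponding eigenvectors (loadings) of the
    correlation matrix; the first component captures the common
    co-movement of the markets."""
    C = np.corrcoef(R, rowvar=False)
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]
```

When the markets co-move, the first eigenvalue dominates: a single common factor explains a large share of the joint variance, consistent with the mutual-influence coefficients in the table's footnotes.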
We can thus see that all the markets really do influence each other and
move in a tight range. We may therefore try to use the lags of PX50, BUX and
WIG to explain their variance, following the analysis of the previous
chapters. The reader will notice that the DAX coefficients are the smallest,
so DAX surprisingly does not have such a big influence on the three market
indices; they explain themselves best, and this information will also be used
for their prediction in the following text.
Not surprisingly, DAX is not explained well by the PX50, BUX and WIG
returns. This is caused mainly by the fact that for half of the tested period
DAX moved against the remaining markets. If we divided the data into two
subsets, pre-2003 and post-2003, we would find much better results in the
second period; but this is obvious from FIGURE 3.4, so we leave it as an
exercise for interested readers, as we will provide the division into
subperiods in the out-of-sample forecasting tests that follow. For now the
results are clear, and we move on to use them for real forecasting of the
market returns.
3.3.2 Cross-market predictions
In the previous subchapter we found that the PX50, BUX, WIG and DAX
returns are co-moving; now we ask whether this information can be used for
forecasting. The methodology here is quite different: we try to forecast the
one-day return of each market using the lags of the three remaining markets.
For this purpose we apply correlation analysis61 to find which lags influence
the returns most. Then we use a linear OLS estimate and the feed-forward
neural network again, with the best-performing log-sigmoid transformation
function and the Levenberg-Marquardt search algorithm. The following models
were developed62:
PX50_{t+1} = β0 + β1 PX50_{t−1} + β2 PX50_{t−5} + β3 BUX_{t−1} + β4 BUX_{t−3} + β5 DAX_t   (3.2)
BUX_{t+1} = β0 + β1 BUX_{t−3} + β2 BUX_{t−5} + β3 DAX_t + β4 DAX_{t−2}   (3.3)
WIG_{t+1} = β0 + β1 WIG_t + β2 WIG_{t−5} + β3 PX50_t + β4 PX50_{t−3} + β5 DAX_t + β6 DAX_{t−2}   (3.4)
DAX_{t+1} = β0 + β1 DAX_{t−3} + β2 DAX_{t−4} + β3 PX50_{t−5} + β4 WIG_{t−4} + β5 BUX_{t−1}   (3.5)
61 We use the sample correlation coefficient (the Pearson product-moment correlation coefficient),
the best estimate of the correlation between two series, to determine the potential explanatory
variables. We pick all variables whose correlation coefficient is statistically significant at the
1%, 5% or 10% level.
62 Estimates can be found in Appendix B.
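Estimation of a cross-market model such as (3.2) by OLS can be sketched as below. The data in the usage test are synthetic and illustrative only; the thesis's actual estimates are in Appendix B.

```python
import numpy as np

def ols(y, X):
    """Least-squares fit with intercept; returns [beta_0, beta_1, ...]."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return beta

def build_design(px50, bux, dax):
    """Regressors of eq. (3.2): PX50_{t-1}, PX50_{t-5}, BUX_{t-1},
    BUX_{t-3}, DAX_t, aligned with the target PX50_{t+1}."""
    T = len(px50)
    t = np.arange(5, T - 1)  # indices for which all lags exist
    X = np.column_stack([px50[t - 1], px50[t - 5],
                         bux[t - 1], bux[t - 3], dax[t]])
    y = px50[t + 1]
    return X, y
```

The same design matrix then feeds the neural network, which is the "identify variables linearly, refine nonlinearly" workflow suggested below.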
Table 9: In-sample performance of the daily models for the whole tested period
in-sample          PX50                BUX                 WIG                 DAX
                   classical neural    classical neural    classical neural    classical neural
Adj. R-squared     0.0234    0.12      0.015     0.202     0.023     0.17      0.019     0.11
Schwarz criterion  -6.038    -8.99     -5.78     -8.85     -6.08     -9.409    -5.23     -7.98
Ljung-Box Q(4)     1.31                5.53                0.96                2.8
Ljung-Box Q(8)     2.99                5.94                1.302               24.69
Ljung-Box Q(12)    3.016               6.53                4.05                30.51*
*, **, *** significance levels of 1%, 5% and 10% resp.
As we can see, the PX50, BUX, WIG and DAX returns do seem to be explained
to some extent by their mutual lags. In the comparison with the neural
network, the autoregressive models are left far behind: the neural network
explains 12%-20% of the variance of the returns in the individual models,
while the autoregression explains only 1.5%-2.34%, and the Schwarz criterion
also clearly prefers the networks. The implication for modelling is thus very
intuitive: use the linear regression model to identify the significance of the
variables, and then improve the estimates with neural networks. The reader may
also observe a very interesting thing: there is almost no autocorrelation
present in the models, as the Ljung-Box Q statistics are insignificant at all
lags for all series except DAX at longer lags. The results therefore suggest
that we could gain some predictive edge from these models.
Again, we are more concerned with out-of-sample testing than with
in-sample. Table 10 presents the results for the whole testing period.
Table 10: Out-of-sample performance of the daily models for the whole tested period
            PX50                    BUX                     WIG                     DAX
            linear      neural      linear      neural      linear      neural      linear      neural
RMSE        0.009       0.009       0.0152      0.014       0.012       0.0095      0.0086      0.008
NMSE        0.985       0.978       0.994       0.96        1.014       0.999       1.014       0.9878
DM(0)       0.71 (0.23)             0.26 (0.59)             0.06 (0.52)             1.9 (0.02)
DM(1)       0.71 (0.22)             0.23 (0.59)             0.08 (0.52)             1.99 (0.02)
DM(2)       0.68 (0.24)             0.23 (0.59)             0.08 (0.53)             1.91 (0.02)
DM(3)       0.65 (0.25)             0.24 (0.59)             0.077 (0.53)            1.8 (0.036)
HM          1.03 (0.06) 1.07 (0.00) 1.04 (0.2)  1.06 (0.05) 0.98 (0.00) 1 (0.05)    1.02 (0.15) 1.07 (0.00)
PT          56% (0.056) 59% (0.06)  52% (0.27)  57% (0.05)  52% (0.61)  56% (0.21)  53% (0.22)  57% (0.12)
TC          0.6%        1.2%        0.45%       1.7%        1%          0.46%       0.62%       0.2%
DM: Diebold-Mariano statistic (p-values), HM: Henriksson-Merton statistic, PT: Pesaran-
Timmermann (SR with p-value), TC: total costs
In our final tests, the Diebold-Mariano statistic tells us that for
almost all series the errors of the linear and neural models are almost
identical: we cannot reject the null hypothesis of equal predictive accuracy
for PX50, BUX and WIG. But we can reject the null hypothesis of no
predictability for PX50, BUX and DAX at the 1%, 5% and 1% significance levels
respectively. We also reject the null hypothesis of independence between the
directional changes of the actual and predicted series for PX50 and BUX at the
10% significance level. The implied transaction costs are in line with the HM
and PT statistics and confirm economic significance. Again, however, we were
not able to gain consistently and significantly better predictive power for
all tested series, even with the neural network models. This may imply that
daily European stock market returns are simply unpredictable, as the lags of
the surrounding markets helped explain the variance only very little.
3.4 Concluding remarks
At the very beginning of this chapter we illustrated the power of neural
network modelling. Our hypothesis was that if a neural network can approximate
any function, it must be capable of approximating an artificial chaotic time
series. We showed that it performed very well on the Mackey-Glass chaotic time
series, even in prediction. We compared classical econometric approaches to
modelling the Mackey-Glass series with the neural network and showed that the
network performs much better in this task, with significantly lower errors.
This demonstrated that the neural network can approximate a very wide class of
processes and is thus a very strong instrument for our prediction task.
Next we moved to real-world data: the Central European stock market
returns represented by the PX50, BUX, WIG and DAX indices. We first described
the data and found no deviation from the distributional properties of other,
more developed and more liquid markets, which was no surprise. More
interestingly, we conducted an in-depth analysis of daily returns, followed by
weekly returns, and found that neural networks improve the predictive power of
the classical models only slightly. For daily returns, neural networks improve
only the economic significance; the predictions are not significantly
different from those of the linear models. We conclude that daily European
stock market returns may not contain any significant pattern to be uncovered
from historical prices. With weekly returns, neural networks performed
significantly better than the linear models on the PX50 and BUX markets.
Economic significance was also gained for three out of four markets, with the
networks achieving around 60% directional accuracy on weekly data, which is
quite a good result.
Finally, on the basis of the cointegration analysis, we modelled the
returns with lagged variables of all four indices, as we found them
significant for the returns. This is in fact a logical step, since the markets
move together, and especially in these days of globalization world markets
trade very tightly. At the time this research was being conducted, NASDAQ63
unsuccessfully bid for the London Stock Exchange; a few months later, in late
May 2006, the NYSE64 and the EURONEXT65 bourse announced a merger creating the
first transatlantic exchange behemoth, the largest stock exchange in the
world. Markets thus no longer depend only on local economic issues, and the
weaker exchanges surely follow the stronger ones.
But the analysis did not bear fruit, as the daily lags of the surrounding
markets did not improve our results. Again we could not reject the null
hypothesis of equal predictive accuracy of the models used, and the economic
significance was very similar to that of the analysis conducted in the
preceding chapters. We therefore conclude that daily European stock markets
may not contain any predictable pattern, even when the lags of the surrounding
markets are used.
An attentive reader will note that one could try to improve or modify the
model for real trading by using indicators or smoothed prices. We obtained our
results with raw stock market returns, but if, for instance, exponential
moving averages are used to smooth the returns, the prediction of short-term
direction is even stronger. We showed that a good predictive model can be
built from raw data, and we leave the exercise of using other inputs, such as
moving averages or indicators, to the reader; for example, a lagged 5-day
moving average may predict the one-week-ahead return well, as it smooths the
series. Many more models can be built, depending on the strategy we want to
pursue. We also draw attention to the problem of the relevance of the data
used: a neural network can approximate almost any process, but when building a
model, bear in mind that if you feed it data of no importance, it will return
nothing but inapplicable forecasts. The relevance of the inputs is crucial for
good results. In the next chapter we derive implications for derivative
pricing methods.
63 The US technology stock exchange
64 The New York Stock Exchange
65 The second largest European bourse, an integration of the Brussels, Paris and Amsterdam exchanges
Chapter 4
Application to pricing derivatives
In the previous chapter we concluded that with the use of neural networks
we are able to gain a predictive edge. This is of course a very strong
implication for markets and traders, but it is still of rather speculative
use, and there are many problems with using these models in real trading. The
main drawback, for example, is that most of the models tend to predict the
movement with some lag. This is fine if the markets are steady and the model
captures the short-term trends well, but if there are unexpected exogenous
moves or stock market crashes, the models very often fail to warn us. This of
course depends on the input variables used, but one should still never base a
trading strategy only on the predictive model, as the other part of success is
understanding the market and reacting properly to economic news. Modelling the
market returns and uncovering their patterns nevertheless serves a trader very
well in gaining abnormal returns, provided it is combined with an
understanding of the markets.
Much stronger implications of our findings can be made for another very
interesting area: the pricing and hedging of derivatives. The well-known
Black-Scholes66 model for pricing European call options is based on
assumptions which are unrealistic: stock prices follow a geometric Brownian
motion under a lognormal distribution, volatility is constant over time, and
returns are normally distributed. Our study extends the empirical literature
showing that, given these assumptions, Black-Scholes cannot be used for fully
rational pricing of options. We have just shown that the returns contain
predictable components and are thus far from a random walk, and the biggest
problem is the constancy of volatility. One solution is to re-estimate the
model every day with an updated volatility which is then held constant, but
this approach, for example, does not decrease the hedging error, which is
crucial for big institutional traders.
66 Black and Scholes (1973), Merton (1973)
In the following chapter we show how neural networks can be used for
option pricing much more efficiently, on the basis of the universal
approximation theorem. We start with a very brief theoretical introduction,
followed by an application to the pricing of a warrant whose underlying is the
largest and second most liquid stock on the Prague Stock Exchange, CEZ.
4.1 Theoretical framework proposed by Black and Scholes
Much of the growth of the market for options and other derivatives is
linked to the famous papers by Black and Scholes (1973) and Merton (1973), in
which closed-form option pricing formulas were obtained through a dynamic
hedging argument and a no-arbitrage condition. This approach has been
generalized to the pricing of an array of securities, and even where there is
no closed-form solution, pricing formulas can be obtained numerically.
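The closed-form price for a European call on a non-dividend-paying underlying, which later serves as the benchmark against the network, can be sketched as a standard reference implementation:

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    """Black-Scholes European call price.
    S: spot, K: strike, T: years to maturity,
    r: risk-free rate, sigma: volatility."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)
```

For example, with S = K = 100, T = 1, r = 5% and sigma = 20%, the formula gives the textbook value of about 10.45.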
The basics of the model lie in the hedging/no-arbitrage approach and the
underlying price dynamics S(t), which is assumed to follow a geometric
Brownian motion:
dS_t = μ S_t dt + σ S_t dW_t ,   (4.1)
where μ is the expected gain or constant drift, σ the volatility, and W(t) a
Wiener process67. Let C(S, t) be the value or price of the European68 call
option on non