To appear on the European Journal of Operational Research (In Press)
Download PDF
Predicting the exchange traded fund DIA with a
combination of Genetic Algorithms and Neural Networks
Massimiliano Versace, Rushi Bhatt, Oliver Hinds, Mark
Shiffer
Department of Cognitive and Neural
Abstract: We evaluate the performance of a heterogeneous mixture of neural network
algorithms for predicting the exchange-traded fund DIA. A genetic algorithm is
utilized to find the best mixture of neural networks, the topology of
individual networks in the ensemble, and to determine the features set. The Genetic Algorithm also determines the
window size of the input time-series supplied to the individual classifiers in
the mixture of experts. The mixtures of neural network experts consist of
recurrent back-propagation networks, and Radial Basis Function networks. The
application of Genetic Algorithm on the heterogeneous mixture of powerful
neural network architectures shows promise for prediction of stock market time
series. These highly non-linear, stochastic and highly non-stationary time
series have been found to be notoriously difficult to predict using
conventional linear statistical methods. In this paper, we propose a
biologically inspired methodology to tackle such hard problems using a
multi-faceted solution.
1.
Introduction
During the last few years, financial institutions as well as
individual investors have found a wide range of uses for Artificial
Intelligence (AI) technologies such as expert systems, artificial neural
networks (ANNs), genetic algorithms (GAs), and fuzzy logic. Financial
engineering has therefore become the natural theatre where pattern
classification algorithms can be contrasted and tested on highly demanding
grounds. The main rationale motivating the use of these techniques is to
identify and exploit the regularity that is hidden in the apparently chaotic
price trend of a given security. Parallel to the development if AI
technologies, the last decade has also witnessed the development of several
nonlinear time series models (Granger and Anderson, 1978, Tong and Lim, 1980,
Engle, 1982). These nonlinear models, although not suffering from a priori
assumption of a linear relationship between data and time series, are limited
by the fact that an a priori assumption of the nature of the nonlinear
relationship must be formulated, the latter task being even more demanding than
the linear case.
Within the wider field of AI, ANNs have been demonstrated to
be a natural solution to problems where an explicit description of the nature
of the data is not available due to their capability to generate data-driven
hypotheses. Historically, the development of these algorithms was inspired by
investigations into the functioning of the nervous system. ANNs have
subsequently been found to have sound theoretical basis from the perspective of
statistical learning theory, and they usually yield good performance when used
for real-world data analysis or in predicting nonlinear dynamical systems
(Lapedes and Farber, 1987, 1988; Haykin, 1999; Duda, Hart and Stork, 2000).
ANNs have good generalization capabilities and are usually robust against noisy
or missing data, all of which are highly desirable properties time series
prediction. Importantly, the class of unsupervised neural networks can operate
a much richer class of functions compared to linear regression based techniques.
Therefore, unsupervised neural networks have the capability of discovering
associations between features that may not have been expected or looked for.
There is an extensive literature on financial applications
of ANNs (Trippi and Turban, 1993; Azoff, 1994; Refenes, 1995; Gately, 1996,
Odom and Sharda, 1990; Coleman et al., 1991; Salchenkerger et al., 1992; Tam
and Kiang, 1992; Wilson and Sharda, 1994, Weigend et al., 1992; Zhang, 1998,
for a review). Although encouraging results have been reported in which
ANNs-based systems outperformed widely-used well-established statistical
methods, many inconsistent reports have been undermining the robustness of
these findings. Among the reasons of these discrepancies are well-known
problems that characterize ANNs, in particular:
1)
Different network type (linear filters,
multilayer perceptrons, radial basis functions network, or RBF, and self-organizing
maps, among others) can lead to different results when trained and tested on
the same database. This is mainly due to the different classes of decision
boundaries that different ANN types prefer;
2)
For a given network type, ANNs are sensitive to
the choices topology and size for a given data set;
3)
ANNs are prone to overfitting, unless great care
is taken in choosing the size and connectivity of the network;
4)
The highly variable nature of financial time
series can prevent a single ANN from producing accurate forecasts for an
extended trading period, even thought it can perform well above chance in a
given testing set.
Combining classifiers and boosting
methods often lead to improvement in performance over single neural networks. Several studies (Pelikan et al., 1992; Ginzburg
and Horn, 1994; Zhang, 1994; Chu and Widjaja, 1994) have shown improvement of
performance in combining different ANNs or training similar ANNs on different
features of the input data and then recombining their output in a later stage. Our
usage of a mixture of experts is motivated by the hypothesis that a
particularly difficult problem can be broken into smaller sub-problems which
are easier to solve with a single classifier. This divide et impera principle,
so common in computer science, is particularly suited for difficult problems,
like modelling of financial time series. The mixture of experts framework
specifies that a prediction is made up of a series of predictions from separate
experts, or networks. In general, the final output is decided by a function
that combines individual classification outputs:
(1)
where
is a vector of individual classifier output, and O is
the output of the mixture of experts. A common implementation of a mixture of
expert uses a weighted combination of yi:
(2)
![]()
where m is the number of experts in the model,
is the
prediction of the expert i and gi is the weighting on expert
i. This weighting can itself be the
prediction of another model, known as the gate, or can simply be set to 1 for
each classifier. The results of the application of mixture of experts are
encouraging, with these systems often outperforming standard statistical
techniques and other classification methods, although it is difficult to say in
general in what kind of tasks mixtures-of-experts or any other kind of model
will perform well on any given dataset (Kang and Oh, 1997, 2000; Jordan and
Jacobs, 1994; Jordan and Xu, 1995; Waterhouse,
2002).
In the ANNs literature, inappropriate topology selection and
weight training are frequently blamed for poor performance. Increasing the
number of hidden layer neurons helps improving network performance, yet many
problems could be solved with very few neurons if only the network took its
optimal configuration. Due to the large size of the hypothesis space that these
ANNs are capable of generating, it is difficult to guarantee that an ANN will
converge to a globally optimal hypothesis for the distribution of the
underlying the given dataset.
The inherent nonlinearity of ANN results in the existence of
many sub-optimal networks, and the great majority of training algorithms
converge to these sub-optimal configurations (local minima). Solutions to local
minima problems often imply methods that are probabilistic in their nature. The
methods can in fact find the globally optimal solution with a certain
probability, which usually depends on the number of iterations of the
algorithm. At the same time, the danger of overfitting is always present,
rendering the problem of the solution of an optimal network configuration very
difficult.
To summarize, there are multiple factors that influence the
choice of a given system of networks, and within a given network type there are
multiple combinations of parameters, network architecture, activation and
learning functions, input selection and preprocessing that produce a
combinatorial explosion of possible systems.
A natural solution for searching this very
large space of system configuration is provided by GAs. GAs are a class of probabilistic search techniques that has been
developed in the past decades as a general-purpose optimization tool (Holland,
1975; Goldberg, 1989). GAs mimic biological evolution by using a massively
parallel search mechanism that involves a)
the initialization of a random population of systems (or candidate solutions) b) ordering the systems based on a
measure of success in approximating the desired solution (fitness) c) a reproduction stage, in which the
best exemplars of the last generation have the chance to produce an offspring
through the application of some GA operators, namely crossover and mutation.
GAs are suited for particularly hard problems when little or no knowledge of
the optimal function is given and the search space is very large. GAs
are thus an ideal tool for solving the problem of discovering appropriate
parameterizations for ANNs, and good results have been obtained in combining
GAs and ANNs in hybrid systems (Belew et al, 1990; Montana and Davis, 1989;
Shaffer et al, 1990; Chang and Lippmann, 1991; Harp and Samad, 1991; Miller et
al, 1989; Whitley et al, 1989; Kingdon, 1997; Venkatesan and Kumar, 2002).
This paper presents an application of such a hybrid system
that uses GAs for selecting an appropriate combination of networks, parameters
and training regimen for predicting the direction of variation of the closing
price of an exchange traded fund, DIA. The "Diamonds" (AMEX ticker:
DIA) is an exchange traded fund tracking the 30 corporations of the Dow Jones
Industrial Average. The price of DIA, which is worth about 1/100th of the Dow Jones' value, faithfully tracks the
fluctuations in the Industrial Average. We have decided to concentrate on this
stock because DIA is a readily available instrument for trading on the Dow
Jones Industrial Average, a canonical financial index.
Section 2 of this paper presents a description of the data
set employed to predict the daily direction of variation of the DIA closing
price. In Section 3 we describe the prediction model, which embodies a
combination of GA and ANNs. In Section 4 we analyze the performance of the
model on 63 test trading days on which the ANNs of the model have not been
trained. Finally, in Section 5 we make some concluding remarks.
2.1 Data collection
Data selection and preprocessing constitute a crucial step in any
modeling effort. The phases of data preparation can be broadly classified into
three distinct areas: variable selection and collection, data inspection, and
data pre-processing. Since our goal was to trade DIA frequently, we were able
to narrow the list of choices to those variables relevant to the time frame.
The data were obtained using a java-based query interface into a database of
historical stock data available freely from Yahoo! (http://finance.yahoo.com).

Table
1. The raw
data loaded from Yahoo! Database. The to-be-predicted stock, DIA, is loaded
along with other historical values of indices, currencies, bonds and
commodities. The sampling interval is from the 11th of November 2001
trough
A list of the variables employed in this study
is shown in Table 1. The data were collected over the period from
2.2 Data
inspection
Even though the data were obtained from a
reliable source, significant errors and missing data were present in the
database. The validity of the input data was checked both visually and
statistically. Missing data (holidays) were filled-in with the last trading day
available. This filling-in has negligible consequences on the final performance
of the system, since the data are pre-processed in the form of a normalized
difference between the actual value and its moving average. This pre-processing
allows normalizing the indicator, therefore eliminating the bias given by its
absolute magnitude in favor of a measure of its relative variation. This
procedure allows using relatively long time series without the risk of
overweighting the data with the higher absolute value.
2.3 Data pre-processing
In order to transform the data into a format acceptable by the prediction algorithms, derive a measure of the statistical fluctuations in the feature vectors used therein, and render the feature vectors independently of the absolute value assumed by the underlying indices from which the data are derived, two preprocessing operators were applied to the data. The renormalization operator (RNO) is used in some cases to extract daily changes in indicators that are independent of the absolute magniture of the indicator over time and to transform the fluctuations into percentages at others. The RNO operator is specified by
(3)
![]()
where x,y and z take values specific to the indicator operated on. The Fluctuation Sensitive Function (FSF) is used when the absolute magnitude of an indicator should be retained for the prediction algorithms. This operator is simply specified by x-y.
After the raw data listed in Table 1 were loaded and crosschecked for accuracy, each field of this dataset, called dataset A, was independently normalized using the RNO operator. The data in the dataset A were further pre-processed and saved in a second dataset called dataset B, in order to reduce the learning load of the system and allowing the learning algorithms to concentrate the more predictive portions of the input (Table 2). This procedure spares the networks from extracting knowledge that can be provided directly in the input pattern, thereby allowing the system to extract higher-order relationships between these complex variables. The values indicated in Table 1 were preprocessed and re-arranged in the final input matrix, consisting of 64 data types and 320 data points. The pre-processing performed on the raw data is described in Tables 2 and 3. In general, the to-be-predicted stock data was extensively preprocessed, adopting a percentage measure of the difference between the indicator (Open, High, Low, Close, Volume) and its moving average (10 days), among other measurements of variation like Rate of Change (ROC), percentage difference between Open and Close, etc. Some widely used Technical Analysis indicators were also employed, namely Moving Average Convergence/Divergence (MACD), Relative Strength Index (RSI), Chainkin Volatility and MOMENTUM (Table 2). For the other indicators a measure of the variation given by Open, High, Low, and Close indicators and a Moving Average was used. More details on the nature of the pre-processing are reported in Tables 2 and 3. This new input matrix was further independently normalized for each field, resulting in each column assuming a value between 0 and 1. At the end of pre-processing, the absolute magnitude of all data was discarded, allowing usage of an arbitrarily long dataset although the absolute values of the indicators part of the dataset might substantially vary over time. The two datasets A and B differ by the extensive preprocessing that characterizes the database B. Before being fed to the component networks of the prediction model, an additional preprocessing step called complement coding was applied. This is a normalization procedure in which a given input vector x is coded along with its complement xc = (1-x). As a result, the input vector and the number of input nodes double in size, allowing automatic normalization of the input vector and, more importantly, an improvement of the pattern classification capabilities of the network (Carpenter et al., 1991). Complement coding allows a higher-order hidden node to be associated to both the presence and the absence of a given feature.
Finally, an additional preprocessing stage consists in the application of Principal Component Analysis (PCA). The input vector can therefore be substituted by a vector with the nth Principal Component of the original one.
Which dataset the network will be trained on, as well as the application of complement coding and PCA, are determined by the chromosome (further details in the section 3).
|
Security |
Operators |
Equation |
DIA |
% ROC
Close |
[(Closet
Closet-1)/Closet-1] · 100 |
|
DIA |
% Diff.
Open - Close |
[(Open
Close)/Open] · 100 |
|
DIA |
% Diff.
High - Low |
[(High
Low)/ Low] · 100 |
|
DIA |
% Diff.
Open and Mob. Avg. 10 dd |
[(Open
Mavg10dd)/ Mavg10dd] · 100 |
|
DIA |
% Diff. High
and Mob. Avg. 10 dd |
[(High
Mavg10dd)/ Mavg10dd] · 100 |
|
DIA |
% Diff.
Low and Mob. Avg. 10 dd |
[(Low
Mavg10dd)/ Mavg10dd] · 100 |
|
DIA |
% Diff.
Close and Mob. Avg. 10 dd |
[(Close
Mavg10dd)/ Mavg10dd] · 100 |
|
DIA |
% Diff.
Volume and Mob. Avg. 10 dd |
[(Volume
Mavg10dd)/ Mavg10dd] · 100 |
|
DIA |
Chaikin
Volatility |
|
|
DIA |
MACD |
MACDt = EMavg1 -
EMavg2 |
|
DIA |
MOMENTUM |
Momentum
= Closet Closet-n |
|
DIA |
RSI |
RS = (Avg. price change on up days
Avg. Price change on down days) RSI =
100 - (100/1+RS) |
|
Nasdaq S.Poor 500 Dow Dow Jones Trasp Dow Jones Util Dow Jones Comp Nikkei Bovespa Dax Ftse 100 Dollar/yen Dollar/Swiss Frank T-Bond 30 Years T-Bond 10 Years T-Bond 5 Years T-Bond 13 weeks |
% Diff.
Open and Mob. Avg. 10 dd % Diff.
High and Mob. Avg. 10 dd % Diff.
Low and Mob. Avg. 10 dd % Diff.
Close and Mob. Avg. 10 dd |
[(Open Mavg10dd)/
Mavg10dd] · 100 [(High
Mavg10dd)/ Mavg10dd] · 100 [(Low
Mavg10dd)/ Mavg10dd] · 100 [(Close
Mavg10dd)/ Mavg10dd] · 100 |
|
Eurobond Gold |
% Diff.
Open and Mob. Avg. 10 dd % Diff.
High and Mob. Avg. 10 dd % Diff.
Low and Mob. Avg. 10 dd % Diff.
Close and Mob. Avg. 10 dd % Diff.
Volume and Mob. Avg. 10 dd |
[(Open
Mavg10dd)/ Mavg10dd] · 100 [(High
Mavg10dd)/ Mavg10dd] · 100 [(Low
Mavg10dd)/ Mavg10dd] · 100 [(Close
Mavg10dd)/ Mavg10dd] · 100 [(Volume
Mavg10dd)/ Mavg10dd] · 100 |
|
DJIA
and T Bond 30 years |
% Diff.
between Normalized DJIA and Normalized T Bond |
Norm. DJIA
Norm. Tbond |
Table 2: This table summarizes the
preprocessing stages applied to the raw data. On the left column, the
securities for which each operator is calculated are listed. The second column
contains the name of the operator and the third column contains the equation by
which the operator is calculated. Abbreviations: ROC = Rate of change; Diff. = Difference;
MAvg = Mobile Average; EMAvg = Exponential
The GA employed in this study is used to
estimate appropriate parameters for m mixtures of networks (MON) used to
predict the closing price of a security. The chromosome used by this GA has 11
genes, which directly define the important parameters of the network, as well
as the type of network and the input data type (database A or B, Table 3 for
details). Every ith MON is
composed of j networks to be selected from one of two types: Recurrent
Backpropagation (Elman, 1991) or RBF networks (Duda et al., 2000). Elman
networks are particularly suited for data when the temporal order of the data
plays a crucial role (Elman, 1991), as it does in financial time series. RBF
networks provide a highly compact representation for the underlying data
distribution in cases where the bases accurately reflect the distribution. Each
jth network is initialized by the chromosome
and it is trained on the section of data defined, again, by the chromosome
(Figure 1). The chromosome defines network type, the architecture, the training
set, and the most important parameters of the network as shown inTable 3.
Every network belonging to the ith MON
is then tested on a blind data set, and single network responses are then
conveyed into a voting procedure. Here a majority voting scheme chooses the
prediction of the ith MON over the blind testing data, and
assesses the fitness of the ith MON. The majority vote is
simply given by:
(4)
sign[(
)/n]
where yj
is the output of a component network, and n is the number of networks in
a given MON.
In this study, the fitness of a MON is given by
the percentage of correct responses over the blind data set. Note that a
different measure of performance, namely the net return of the system over
time, could have been employed. A correct response is defined as a match
between the predicted and the actual direction of variation of the closing
price of DIA in the following trading day.
It follows from this choice that the best
MON in the population is not necessarily the one that has the
lowest MSE on the training set, or the best jth network on
the blind data set, but the one that expresses the highest number of correct voting predictions. This fitness
function allows the dissociation of the performance on the training set for any
single network to the actual generalization capabilities due to selection based
on the collective voting measure, and on a blind data set that the component
networks have not been trained on.
This choice follows the assumption that a
single network cannot perform well on an extensive blind testing data set, but
a family of networks trained on different time intervals could achieve better
results. In this study a simple voting procedure was used. More sophisticated techniques
could include super-ordinate networks that learn a nonlinear combination of the
output of the component networks.

Figure 1: a) Every generation contains a population of MON which in turn
contains n networks (Experts). b)
Each MON is composed of n networks (or experts), which are defined by a
chromosome that specifies the architecture and parameters of the network.

Figure 2: The typical GA cycle. A
chromosome defines the architecture of a network (in our study, a mixture of
networks), which are then selected based on a measure of their fitness. The
winning exemplars are then allowed to generate offspring, which is defined
again trough a new chromosome.
After fitness is calculated, a new set of MON
is created using an even number of high-fitness populations of the past
generation as a basis. In this simulation, the best 4 populations were selected
for reproduction. The chromosomes specifying the networks of two of the winning
populations are coupled with the corresponding chromosomes of another winning
population, and crossover and mutation
|
Gene # |
Expressing these
parameters |
For this network
type |
Param. Range
(min-max) |
0 |
Network type |
RBF
or ELMAN |
0
or 1 |
1 |
Training
epochs |
ELMAN |
20
- 1000 |
|
2 |
Learning rate |
ELMAN |
0.1
- 0.3 |
|
3 |
Type of input
data (Database A or B) |
RBF
and ELMAN |
0
or 1 |
|
4 |
Number of
training data upper bound |
RBF
and ELMAN |
2 |
|
5 |