{"title": "The Observer-Observation Dilemma in Neuro-Forecasting", "book": "Advances in Neural Information Processing Systems", "page_first": 992, "page_last": 998, "abstract": null, "full_text": "The Observer-Observation Dilemma \n\nin Neuro-Forecasting \n\nHans Georg Zimmermann \n\nSiemens AG \n\nCorporate Technology \n\nD-81730 München, Germany \n\nRalph Neuneier \n\nSiemens AG \n\nCorporate Technology \n\nD-81730 München, Germany \n\nGeorg.Zimmermann@mchp.siemens.de \n\nRalph.Neuneier@mchp.siemens.de \n\nAbstract \n\nWe explain how the training data can be separated into clean information and unexplainable noise. Analogous to the data, the neural network is separated into a time invariant structure used for forecasting, and a noisy part. We propose a unified theory connecting the optimization algorithms for cleaning and learning together with algorithms that control the data noise and the parameter noise. The combined algorithm allows a data-driven local control of the liability of the network parameters and therefore an improvement in generalization. The approach is proven to be very useful at the task of forecasting the German bond market. \n\n1 Introduction: The Observer-Observation Dilemma \n\nHuman beings believe that they are able to solve a psychological version of the Observer-Observation Dilemma. On the one hand, they use their observations to constitute an understanding of the laws of the world; on the other hand, they use this understanding to evaluate the correctness of the incoming pieces of information. Of course, as everybody knows, human beings are not free from making mistakes in this psychological dilemma. We encounter a similar situation when we try to build a mathematical model using data. Learning relationships from the data is only one part of the model building process. 
Overrating this part often leads to the phenomenon of overfitting in many applications (especially in economic forecasting). In practice, evaluation of the data is often done by external knowledge, i.e. by optimizing the model under constraints of smoothness and regularization [7]. If we assume that our model summarizes the best knowledge of the system to be identified, why should we not use the model itself to evaluate the correctness of the data? One approach to do this is called Clearning [11]. In this paper, we present a unified approach of the interaction between the data and a neural network (see also [8]). It includes a new symmetric view on the optimization algorithms, here learning and cleaning, and their control by parameter and data noise. \n\n2 Learning \n\n2.1 Learning reviewed \nWe are especially interested in using the output of a neural network y(x, w), given the input pattern x and the weight vector w, as a forecast of financial time series. In the context of neural networks, learning normally means the minimization of an error function E by changing the weight vector w in order to achieve good generalization performance. Typical error functions can be written as a sum of individual terms over all T training patterns, E = \\frac{1}{T} \\sum_{t=1}^{T} E_t. For example, the maximum-likelihood principle leads to \n\nE_t = \\frac{1}{2} (y(x_t, w) - y_t^d)^2 , \n\n(1) \n\nwith y_t^d as the given target pattern. If the error function is a nonlinear function of the parameters, learning has to be done iteratively by a search through the weight space, changing the weights from step \\tau to \\tau + 1 according to: \n\nw^{(\\tau+1)} = w^{(\\tau)} + \\Delta w^{(\\tau)} . \n\n(2) \n\nThere are several algorithms for choosing the weight increment \\Delta w^{(\\tau)}, the simplest being gradient descent. After each presentation of an input pattern, the gradient g_t := \\nabla E_t|_w of the error function with respect to the weights is computed. 
In the batch version of gradient descent, the increments are based on all training patterns: \n\n\\Delta w^{(\\tau)} = -\\eta \\bar{g} = -\\eta \\frac{1}{T} \\sum_{t=1}^{T} g_t , \n\n(3) \n\nwhereas the pattern-by-pattern version changes the weights after each presentation of a pattern x_t (often randomly chosen from the training set): \n\n\\Delta w^{(\\tau)} = -\\eta g_t . \n\n(4) \n\nThe learning rate \\eta is typically held constant or follows an annealing procedure during training to assure convergence. Our experiments have shown that small batches are most useful, especially in combination with Vario-Eta, a stochastic approximation of a Quasi-Newton method [3]: \n\n\\Delta w^{(\\tau)} = -\\frac{\\eta}{\\sqrt{\\frac{1}{N} \\sum_{t=1}^{N} (g_t - \\bar{g})^2}} \\cdot \\frac{1}{N} \\sum_{t=1}^{N} g_t , \n\n(5) \n\nwith N \\approx 20. Learning pattern-by-pattern or with small batches can be viewed as a stochastic search process because we can write the weight increments as: \n\n\\Delta w^{(\\tau)} = -\\eta \\bar{g} - \\eta \\left( \\frac{1}{N} \\sum_{t=1}^{N} g_t - \\bar{g} \\right) . \n\n(6) \n\nThese increments consist of the term \\bar{g} with a drift to a local minimum and of noise terms (\\frac{1}{N} \\sum_{t=1}^{N} g_t - \\bar{g}) disturbing this drift. \n\n2.2 Parameter Noise as an Implicit Penalty Function \nConsider the Taylor expansion of E(w) around some point w in the weight space \n\nE(w + \\Delta w) = E(w) + \\nabla E \\, \\Delta w + \\frac{1}{2} \\Delta w' H \\Delta w \n\n(7) \n\nwith H as the Hessian of the error function. Assume a given sequence of T disturbance vectors \\Delta w_t, whose elements \\Delta w_t(i) are identically, independently distributed (i.i.d.) with zero mean and variance (row-)vector var(\\Delta w(i)), to approximate the expectation \\langle E(w) \\rangle by \n\n\\langle E(w) \\rangle \\approx \\frac{1}{T} \\sum_t E(w + \\Delta w_t) = E(w) + \\frac{1}{2} \\sum_i var(\\Delta w(i)) H_{ii} , \n\n(8) \n\nwith H_{ii} as the diagonal elements of H. In eq. 8, noise on the weights acts implicitly as a penalty term to the error function given by the second derivatives H_{ii}. The noise variances var(\\Delta w(i)) operate as penalty parameters. As a result, flat minima solutions, which may be important for achieving good generalization performance, are favored [5]. 
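The update rules of section 2.1 (eqs. 3-5) can be sketched in code. The following is an illustrative sketch for a toy linear model y(x, w) = w . x with squared error; the function names and the toy data are our own additions, not the authors' implementation.

```python
import numpy as np

def gradients(w, X, Y):
    """Per-pattern gradients g_t = (y(x_t, w) - y_t) * x_t, shape (N, dim)."""
    return (X @ w - Y)[:, None] * X

def batch_update(w, X, Y, eta):
    """Eq. 3: one step along the averaged gradient g_bar."""
    return w - eta * gradients(w, X, Y).mean(axis=0)

def vario_eta_update(w, X, Y, eta, eps=1e-8):
    """Eq. 5: the averaged gradient is rescaled per weight by the inverse
    standard deviation of the per-pattern gradients (stochastic
    approximation of a Quasi-Newton method)."""
    G = gradients(w, X, Y)
    g_bar = G.mean(axis=0)
    sigma = np.sqrt(((G - g_bar) ** 2).mean(axis=0)) + eps
    return w - eta / sigma * g_bar

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # one small batch, N ~ 20
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true
w = np.zeros(3)                         # Vario-Eta training
w_b = np.zeros(3)                       # plain batch gradient descent, for comparison
for _ in range(300):
    w = vario_eta_update(w, X, Y, eta=0.1)
    w_b = batch_update(w_b, X, Y, eta=0.1)
```

Note that near convergence the per-weight rescaling in `vario_eta_update` makes the effective step size roughly uniform across weights, which is exactly the simplification discussed in section 2.2.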
\nLearning pattern-by-pattern introduces such noise in the training procedure, i.e., \\Delta w_t = -\\eta \\cdot g_t. Close to convergence, we can assume that g_t is i.i.d. with zero mean and variance vector var(g_i), so that the expected value can be approximated by \n\n\\langle E(w) \\rangle \\approx E(w) + \\frac{\\eta^2}{2} \\sum_i var(g_i) \\frac{\\partial^2 E}{\\partial w_i^2} . \n\n(9) \n\nThis type of learning introduces a local penalty parameter var(\\Delta w(i)) = \\eta^2 var(g_i), characterizing the stability of the weights w = [w_i]_{i=1,...,k}. \nThe noise effect due to Vario-Eta learning, \\Delta w_t(i) = -\\frac{\\eta}{\\sqrt{var(g_i)}} \\cdot g_{ti}, leads to \n\n\\langle E(w) \\rangle \\approx E(w) + \\frac{\\eta^2}{2} \\sum_i \\frac{\\partial^2 E}{\\partial w_i^2} . \n\n(10) \n\nBy canceling the term var(g_i) in eq. 9, Vario-Eta achieves a simplified uniform penalty parameter, which depends only on the learning rate \\eta. Whereas pattern-by-pattern learning is a slow algorithm with a locally adjusted penalty control, Vario-Eta is fast only at the cost of a simplified uniform penalty term. We summarize this section by giving some advice on how to reach flat minima solutions: \n\n\u2022 Train the network to a minimal training error solution with Vario-Eta, which is a stochastic approximation of a Quasi-Newton method and therefore very fast. \n\n\u2022 Add a final phase of pattern-by-pattern learning with uniform learning rate to fine tune the local curvature structure by the local penalty parameters (eq. 9). \n\n\u2022 Use a learning rate \\eta as high as possible to keep the penalty effective. The training error may vary a bit, but the inclusion of the implicit penalty is more important. \n\n3 Cleaning \n3.1 Cleaning reviewed \nWhen training neural networks, one typically assumes that the data is noise-free and one forces the network to fit the data exactly. Even the control procedures to minimize overfitting effects (e.g., pruning) consider the inputs as exact values. 
However, this assumption is often violated, especially in the field of financial analysis, and we are taught by the phenomenon of overfitting not to follow the data exactly. Clearning, as a combination of cleaning and learning, has been introduced in [11]. The motivation was to minimize overfitting effects by considering the input data as corrupted by noise whose distribution also has to be learned. The Clearning error function for the pattern t is given by the sum of two terms \n\nE_t^{y,x} = \\frac{1}{2} [ (y_t - y_t^d)^2 + (x_t - x_t^d)^2 ] = E_t^y + E_t^x \n\n(11) \n\nwith x_t^d, y_t^d as the observed data point. In pattern-by-pattern learning, the network output y(x_t, w) determines the weight adaptation as usual, \n\n\\Delta w^{(\\tau)} = -\\eta (y_t - y_t^d) \\frac{\\partial y}{\\partial w} . \n\n(12) \n\nWe also have to memorize correction vectors \\Delta x_t for all input data of the training set to present the cleaned input x_t to the network, \n\nx_t = x_t^d + \\Delta x_t . \n\n(13) \n\nThe update rule for the corrections, initialized with \\Delta x_t^{(0)} = 0, can be described as \n\n\\Delta x_t^{(\\tau+1)} = (1 - \\eta) \\Delta x_t^{(\\tau)} - \\eta (y_t - y_t^d) \\frac{\\partial y}{\\partial x} . \n\n(14) \n\nAll the necessary quantities, i.e. (y_t - y_t^d) \\frac{\\partial y(x, w)}{\\partial x}, are computed by typical backpropagation algorithms anyway. We have found that the algorithms work well if the same learning rate \\eta is used for both the weight and the cleaning updates. For regression, cleaning forces the acceptance of a small error in x, which can in turn decrease the error in y dramatically, especially in the case of outliers. Successful applications of Clearning are reported in [11] and [9]. \n\nAlthough the network may learn an optimal model for the cleaned input data, there is no easy way to work with cleaned data on the test set. As a consequence, the model is evaluated on a test set with a different noise characteristic compared to the training set. 
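The Clearning updates of eqs. 12-14 can be sketched as follows, again for a toy linear model y(x, w) = w . x; the variable names, the noise level, and the toy data are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, dim = 50, 3
w_true = np.array([1.0, -1.0, 2.0])
X_clean = rng.normal(size=(T, dim))
X_obs = X_clean + 0.3 * rng.normal(size=(T, dim))   # observed, noisy inputs x_t^d
Y_obs = X_clean @ w_true                            # targets generated from clean inputs

eta = 0.05
w = np.zeros(dim)
dX = np.zeros((T, dim))          # correction vectors, initialized to 0 (eq. 14)

for epoch in range(300):
    for t in rng.permutation(T):                   # pattern-by-pattern learning
        x = X_obs[t] + dX[t]                       # cleaned input (eq. 13)
        r = x @ w - Y_obs[t]                       # residual y(x_t, w) - y_t^d
        grad_w = r * x                             # dE^y/dw for the linear model
        grad_x = r * w                             # dE^y/dx drives the cleaning step
        w -= eta * grad_w                          # weight update (eq. 12)
        dX[t] = (1 - eta) * dX[t] - eta * grad_x   # correction update (eq. 14)
```

In this sketch the fixed point of eq. 14 accepts a small error in x in exchange for a much smaller error in y, as described above; the same learning rate is used for both updates.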
We will later propose a combination of learning with noise and cleaning to work around this serious disadvantage. \n\n3.2 Data Noise reviewed \nArtificial noise on the input data is often used during training because it creates an infinite number of training examples and expands the data to empty parts of the input space. As a result, the tendency of learning by heart may be limited because smoother regression functions are produced. \nNow we consider again the Taylor expansion, this time applied to E(x) around some point x in the input space. The expected value \\langle E(x) \\rangle is approximated by \n\n\\langle E(x) \\rangle \\approx \\frac{1}{T} \\sum_t E(x + \\Delta x_t) = E(x) + \\frac{1}{2} \\sum_j var(\\Delta x(j)) H_{jj} , \n\n(15) \n\nwith H_{jj} as the diagonal elements of the Hessian H_{xx} of the error function with respect to the inputs x. Again, in eq. 15, noise on the inputs acts implicitly as a penalty term to the error function, with the noise variances var(\\Delta x(j)) operating as penalty parameters. Noise on the inputs improves generalization behavior by favoring smooth models [1]. \n\nThe noise levels can be set to a constant value, e.g. given by a priori knowledge, or adapted as described now. We will concentrate on a uniform or normal noise distribution. Then, the adaptive noise level \\xi_j is estimated for each input j individually. Suppressing pattern indices, we define the noise levels \\xi_j or \\xi_j^2 as the average residual errors: \n\nuniform residual error: \\xi_j = \\frac{1}{T} \\sum_t | \\frac{\\partial E^y}{\\partial x_j} | , \n\n(16) \n\nGaussian residual error: \\xi_j^2 = \\frac{1}{T} \\sum_t ( \\frac{\\partial E^y}{\\partial x_j} )^2 . \n\n(17) \n\nActual implementations use stochastic approximation, e.g. for the uniform residual error: \n\n\\xi_j^{(\\tau+1)} = (1 - \\frac{1}{T}) \\xi_j^{(\\tau)} + \\frac{1}{T} | \\frac{\\partial E^y}{\\partial x_j} | . \n\n(18) \n\nThe different residual error levels can be interpreted as follows: a small level \\xi_j may indicate an unimportant input j or a perfect fit of the network concerning this input j. 
In both cases, a small noise level is appropriate. On the other hand, a high value of \\xi_j for an input j indicates an important but imperfectly fitted input. In this case, high noise levels are advisable. High values of \\xi_j lead to a stiffer regression model and may therefore increase the generalization performance of the network. \n\n3.3 Cleaning with Noise \nTypically, training with noisy inputs takes a data point and adds a random variable drawn from a fixed or adaptive distribution. This new data point x_t is used as an input to the network. If we assume that the data is corrupted by outliers and other influences, it is preferable to add the noise term to the cleaned input. For the case of Gaussian noise the resulting new input is: \n\nx_t = x_t^d + \\Delta x_t + \\xi \\phi , \n\n(19) \n\nwith \\phi drawn from the normal distribution. The cleaning of the data leads to a corrected mean of the data and therefore to a more symmetric noise distribution, which also covers the observed data x_t^d. \n\nWe propose a variant which allows more complicated noise distributions: \n\nx_t = x_t^d + \\Delta x_t - \\Delta x_k , \n\n(20) \n\nwith k as a random number drawn from the indices of the correction vectors (\\Delta x_t)_{t=1,...,T}. In this way we use a possibly asymmetric and/or dependent noise distribution, which still covers the observed data x_t^d by definition of the algorithm. \nOne might wonder why we disturb the cleaned input x_t^d + \\Delta x_t with an additional noise term \\Delta x_k. The reason is that we want to benefit from representing the whole input distribution to the network instead of only using one particular realization. \n\n4 A Unifying Approach \n\n4.1 The Separation of Structure and Noise \nIn the previous sections we explained how the data can be separated into clean information and unexplainable noise. Analogously, the neural network is described as a time invariant structure (otherwise no forecasting is possible) and a noisy part. 
\n\ndata -> cleaned data + time invariant data noise \nneural network -> time invariant parameters + parameter noise \n\nWe propose to use cleaning and adaptive noise to separate the data, and learning and stochastic search to separate the structure of the neural network: \n\ndata -> cleaning(neural network) + adaptive noise(neural network) \nneural network -> learning(data) + stochastic search(data) \n\nThe algorithms analyzing the data depend directly on the network, whereas the methods searching for structure are directly related to the data. It should be clear that the model building process should combine both aspects in an alternate or simultaneous manner. The interaction of the algorithms concerning data analysis and network structure enables the realization of the concept of the Observer-Observation Dilemma. \n\nThe aim of the unified approach can be described, assuming here a Gaussian noise model as an example, as the minimization of the error due to both the structure and the data: \n\n\\frac{1}{2T} \\sum_{t=1}^{T} [ (y_t - y_t^d)^2 + (x_t - x_t^d)^2 ] \\to \\min_{w, \\Delta x} \n\n(21) \n\nCombining the algorithms and approximating the cumulative gradient \\bar{g} by a running estimate \\tilde{g}, we receive \n\ndata: \\Delta x_t^{(\\tau+1)} = (1 - \\alpha) \\Delta x_t^{(\\tau)} - \\alpha (y_t - y_t^d) \\frac{\\partial y}{\\partial x} \nstructure: w^{(\\tau+1)} = w^{(\\tau)} - \\eta \\tilde{g}^{(\\tau)} [learning] - \\eta (g_t - \\tilde{g}^{(\\tau)}) [noise] \n\n(22) \n\nThe cleaning of the data by the network computes an individual correction term for each training pattern. The adaptive noise procedure according to eq. 20 generates a potentially asymmetric and dependent noise distribution which also covers the observed data. The implied curvature penalty, whose strength depends on the individual liability of the input variables, can improve the generalization performance of the neural network. \n\nThe learning of the structure searches for time invariant parameters, characterized by \\frac{1}{T} \\sum_t g_t = 0. 
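One combined step of eq. 22 can be sketched for a toy linear model; the running-average coefficient, the rates \eta and \alpha, and the toy data are our own illustrative assumptions, and the input noise follows eq. 20 by drawing a random correction vector.

```python
import numpy as np

rng = np.random.default_rng(2)
T, dim = 40, 3
X_obs = rng.normal(size=(T, dim))
w_true = np.array([0.5, -1.5, 1.0])
Y_obs = X_obs @ w_true + 0.1 * rng.normal(size=T)

eta, alpha = 0.05, 0.05
w = np.zeros(dim)
dX = np.zeros((T, dim))                    # correction vectors
g_tilde = np.zeros(dim)                    # running estimate of the mean gradient

for step in range(4000):
    t = rng.integers(T)
    k = rng.integers(T)                    # random index for the data-noise term (eq. 20)
    x = X_obs[t] + dX[t] - dX[k]           # cleaned input plus data noise
    r = x @ w - Y_obs[t]
    g_t = r * x                            # per-pattern gradient for the linear model
    # structure: learning term plus parameter-noise term; together they
    # collapse to -eta * g_t, the simple pattern-by-pattern step of eq. 4
    w += -eta * g_tilde - eta * (g_t - g_tilde)
    g_tilde = 0.9 * g_tilde + 0.1 * g_t    # update the cumulative-gradient estimate
    # data: correction update with rate alpha
    dX[t] = (1 - alpha) * dX[t] - alpha * r * w
```

The split of the weight step into a drift and a noise part is purely expository here, mirroring the text: the sum equals the plain stochastic step, but the decomposition makes the analogy between the data updates and the structure updates visible.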
The parameter noise supports this exploration as a stochastic search to find \nbetter \"global\" minima. Additionally, the generalization performance may be further im(cid:173)\nproved by the implied curvature penalty depending on the local liability of the parameters. \nNote that, although the description of the weight updates collapses to the simple form of \neq. 4, we preferred the formula above to emphasize the analogy between the mechanism \nwhich handles the data and the structure. \n\nIn searching for an optimal combination of data and parameters, the noise of both parts is \nnot a disastrous failure to build a perfect model but it is an important element to control the \ninteraction of data and structure. \n\n4.2 Pruning \nThe neural network topology represents only a hypothesis of the true underlying class of \nfunctions. Due to possible misspecification, we may have defects of the parameter noise \ndistribution. Pruning algorithms are not only a way to limit the memory of the network, but \nthey also appear useful to correct the noise distribution in different ways. \nStochastic-Pruning [2] is basically a t-test on the weights w. Weights with low testw values \nconstitute candidates for pruning to cancel weights with low liability measured by the size \nof the weight divided by the standard deviation of its fluctuations. By this, we get a stabi(cid:173)\nlization of the learning against resampling of the training data. A further weight pruning \nmethod is EBD, Early-Brain-Damage [10], which is based on the often cited OBD prun(cid:173)\ning method of [6]. In contrast to OBD, EBD allows its application before the training has \nreached a local minimum. One of the advantages of EBD over OBD is the possibility to \nperform the testing while being slidely away from a local minimum. In our training pro(cid:173)\ncedure we propose to use noise even in the final part of learning and therefore we are only \nnearby a local minimum. 
Furthermore, EBD is also able to revive already pruned weights. Similar to Stochastic Pruning, EBD favors weights with a low rate of fluctuation. If a weight is pushed around by high noise, the implicit curvature penalty favors a flat minimum around this weight, which leads to its elimination by EBD. \n\n5 Experiments \nIn a research project sponsored by the European Community we are applying the proposed approach to estimate the returns of 3 financial markets for each of the G7 countries, subsequently using these estimations in an asset allocation scheme to create a Markowitz-optimal portfolio [4]. This paper reports the 6 month forecasts of the German bond rate, which is one of the more difficult tasks due to the reunification of Germany and the GDR. The inputs consist of 39 variables achieved by preprocessing 16 relevant financial time series. The training set covers the time from April 1974 to December 1991; the test set runs from January 1992 to May 1996. The network architecture consists of one hidden layer (20 neurons, tanh transfer function) and one linear output. First, we trained the neural network until convergence with pattern-by-pattern learning using a small batch size of 20 patterns (classical approach). Then, we trained the network using the unified approach as described in section 4.1, also with pattern-by-pattern learning. We compare the resulting predictions of the networks on the basis of four performance measures (see table). First, the hit rate counts how often the sign of the return of the bond has been correctly predicted. As to the other measures, the step from the forecast model to a trading system is here kept very simple: if the output is positive, we buy shares of the bond, otherwise we sell them. The potential realized is the ratio of the return to the maximum possible return over the test (training) set. 
The annualized return is the average yearly profit of the trading systems. Our approach turns out to be superior: we almost doubled the annualized return from 4.5% to 8.5% on the test set. The figure compares the accumulated return of the two approaches on the test set. The unified approach not only shows a higher profitability, but also has a far smaller maximal drawdown. \n\napproach | hit rate | realized potential | annualized return \nour | 81% (96%) | 75% (100%) | 8.5% (11.2%) \nclassical | 66% (93%) | 44% (96%) | 4.5% (10.1%) \n(test set values; training set values in parentheses) \n\n[Figure: accumulated return of the unified and the classical approach on the test set.] \n\nReferences \n[1] Christopher M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, 1995. \n[2] W. Finnoff, F. Hergert, and H. G. Zimmermann. Improving generalization performance by nonconvergent model selection methods. In Proc. of ICANN-92, 1992. \n[3] W. Finnoff, F. Hergert, and H. G. Zimmermann. Neuronale Lernverfahren mit variabler Schrittweite. Tech. report, Siemens AG, 1993. \n[4] P. Herve, P. Naim, and H. G. Zimmermann. Advanced Adaptive Architectures for Asset Allocation: A Trial Application. In Forecasting Financial Markets, 1996. \n[5] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1-42, 1997. \n[6] Y. le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. NIPS*89, 1990. \n[7] J. E. Moody and T. S. Rognvaldsson. Smoothing regularizers for projective basis function networks. NIPS 9, 1997. \n[8] R. Neuneier and H. G. Zimmermann. How to Train Neural Networks. In Neural Networks: Tricks of the Trade. Springer Verlag, Berlin, 1998. \n[9] B. Tang, W. Hsieh, and F. Tangang. 
Clearning neural networks with continuity constraints for prediction of noisy time series. ICONIP '96, 1996. \n[10] V. Tresp, R. Neuneier, and H. G. Zimmermann. Early brain damage. NIPS 9, 1997. \n[11] A. S. Weigend, H. G. Zimmermann, and R. Neuneier. Clearning. In Neural Networks in Financial Engineering (NNCM 95), 1995. \n", "award": [], "sourceid": 1385, "authors": [{"given_name": "Hans-Georg", "family_name": "Zimmermann", "institution": null}, {"given_name": "Ralph", "family_name": "Neuneier", "institution": null}]}