Long Memory Models
Summary and Keywords
Long memory models are statistical models that describe strong correlation or dependence across time series data. This kind of phenomenon is often referred to as “long memory” or “long-range dependence.” It refers to persisting correlation between distant observations in a time series. For scalar time series observed at equal intervals of time that are covariance stationary, so that the mean, variance, and autocovariances (between observations separated by a lag j) do not vary over time, it typically implies that the autocovariances decay so slowly, as j increases, as not to be absolutely summable. However, it can also refer to certain nonstationary time series, including ones with an autoregressive unit root, that exhibit even stronger correlation at long lags. Evidence of long memory has often been found in economic and financial time series, where the noted extension to possible nonstationarity can cover many macroeconomic time series, as well as in such fields as astronomy, agriculture, geophysics, and chemistry.
As long memory is now a technically well developed topic, formal definitions are needed. But by way of partial motivation, long memory models can be thought of as complementary to the very well known and widely applied stationary and invertible autoregressive and moving average (ARMA) models, whose autocovariances are not only summable but decay exponentially fast as a function of lag j. Such models are often referred to as “short memory” models, because there is negligible correlation across distant time intervals. These models are often combined with the most basic long memory ones, however, because together they offer the ability to describe both short and long memory features in many time series.
Introductory Definitions and Discussion
Some basic notation must be introduced. Let $x_t$, $t = 0, \pm 1, \pm 2, \ldots$, be an equally spaced, real-valued time series. We suppose initially that $x_t$ is covariance stationary, so that the mean

$\mu = E(x_t)$

and lag-$j$ autocovariances (or variance when $j = 0$)

$\gamma_j = E[(x_t - \mu)(x_{t+j} - \mu)]$

do not depend on $t$. We further suppose that $x_t$ has a spectral density, denoted $f(\lambda)$ and satisfying

$\gamma_j = \int_{-\pi}^{\pi} f(\lambda)\cos(j\lambda)\,d\lambda,$

where $\lambda$ denotes “frequency.” Note that $f(\lambda)$ is a non-negative, even function. We might then say that $x_t$ has “long memory” if

$f(\lambda) \to \infty \quad \text{as } \lambda \to 0+,$

so that $f(\lambda)$ diverges at frequency zero. The extreme alternative that

$f(\lambda) \to 0 \quad \text{as } \lambda \to 0+$

is, on the other hand, possible; this phenomenon is sometimes referred to as “negative dependence” or “anti-persistence.” The intermediate situation is

$0 < f(0) < \infty,$

when we say that $x_t$ has “short memory.” It is also possible that $f(\lambda)$ might diverge or be zero at one or more frequencies in $(0, \pi]$, possibly indicating seasonal or cyclic behavior. The modeling of such phenomena will be discussed, but the main focus is on behavior at zero frequency, which empirically seems the most interesting. An excellent textbook reference to theory and methods for long memory is Giraitis, Koul, and Surgailis (2012).
Nonparametric estimates of $f(\lambda)$ have been found to be heavily peaked around zero frequency in case of many economic time series, going back to Adelman (1965), lending support for the presence of long memory. Moreover, empirical evidence of long memory in various fields, such as astronomy, chemistry, agriculture, and geophysics, dates from much earlier times; see for example Fairfield Smith (1938) and Hurst (1951).
One feature of interest in early work was behavior of the sample mean,

$\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t.$

If $f(\lambda)$ is continuous and positive at $\lambda = 0$,

$\mathrm{var}(\bar{x}) \sim \frac{2\pi f(0)}{n} \quad \text{as } n \to \infty,$

but, for example, Fairfield Smith (1938) fitted a law $cn^{2H-2}$, $\tfrac{1}{2} < H < 1$, to spatial agricultural data, disputing the $n^{-1}$ law. At this point it is convenient to switch notation, to $d = H - \tfrac{1}{2}$, because $d$, referred to as the “differencing” parameter, features more commonly in econometric modeling. Fairfield Smith’s (1938) law for the variance of the sample mean is thus $cn^{2d-1}$, which arises if

(1.1) $\gamma_j \sim c j^{2d-1} \quad \text{as } j \to \infty,$

for $0 < d < \tfrac{1}{2}$. Under additional conditions (see Yong, 1974), (1.1) is equivalent to a corresponding power law for $f(\lambda)$ near zero frequency,

(1.2) $f(\lambda) \sim c\lambda^{-2d} \quad \text{as } \lambda \to 0+,$

for $0 < d < \tfrac{1}{2}$. The behavior of the sample mean under such circumstances, and the form and behavior of the best linear unbiased estimate of the population mean, was discussed by Adenstedt (1974). He anticipated the practical usefulness of (1.2) in the long memory range $0 < d < \tfrac{1}{2}$, but also treated the anti-persistent case $-\tfrac{1}{2} < d < 0$. The sample mean tends to be highly statistically inefficient under anti-persistence, but for long memory Samarov and Taqqu (1988) found it to have remarkably good efficiency.
A number of explanations of how long memory behavior might arise have been proposed. Macroeconomic time series, in particular, can be thought of as aggregating across micro-units. Consider the random-parameter autoregressive model of order 1 (AR(1)),

$x_{it} = \rho_i x_{i,t-1} + \varepsilon_{it},$

where $i$ indexes micro-units, the $\varepsilon_{it}$ are independent and homoscedastic with zero mean across $i$ and $t$, and $\rho_i$ is a random variable with support $(-1, 1)$ or $(0, 1)$. Then, conditional on $\rho_i$, $x_{it}$ is a stationary sequence. Robinson (1978a) showed that the “unconditional autocovariance,” which we again denote by $\gamma_j$, is given by

(1.3) $\gamma_j = \sigma^2 E\!\left(\frac{\rho^j}{1 - \rho^2}\right),$

and that the “unconditional spectrum” at $\lambda = 0$ is proportional to $E\{(1 - \rho)^{-2}\}$, and thus infinite, if $\rho$ has a probability density with a zero at 1 of order less than or equal to 1. One class with this property considered by Robinson (1978a) was the (possibly translated) Beta distribution, for which Granger (1980) explicitly derived the corresponding power law behavior of the spectral density of cross-sectional aggregates of the $x_{it}$, where the $x_{it}$ are independent drawings: clearly the aggregate spectrum is proportional to the expected individual one due to the independence properties. Indeed, if $\rho^2$ has a Beta$(p, q)$ distribution on $(0, 1)$, for $p > 0$ and $1 < q < 2$, the density of $\rho$ decays like $(1 - \rho)^{q-1}$ near 1, so (1.3) decays like $j^{1-q}$, as in (1.1) with $d = 1 - q/2$. Intuitively, a sufficient density of individuals with close-to-unit-root behavior produces the aggregate long memory. For further developments in relation to more general models see, for example, Lippi and Zaffaroni (1998).
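A minimal simulation sketch, in Python, of this aggregation mechanism, assuming independent standard normal innovations and an illustrative Beta distribution for $\rho_i^2$; the parameter values and series lengths are arbitrary choices for illustration, not values used by Robinson (1978a) or Granger (1980).

```python
# Cross-sectional aggregation of AR(1) micro-units with random coefficients can
# produce slowly decaying autocorrelation in the aggregate (a rough illustration).
import numpy as np

rng = np.random.default_rng(0)
n, N = 5000, 2000          # time series length, number of micro-units
p, q = 1.0, 1.5            # Beta(p, q) for rho^2; 1 < q < 2 gives d = 1 - q/2 in (0, 1/2)

rho = np.sqrt(rng.beta(p, q, size=N))      # AR(1) coefficients, one per micro-unit
eps = rng.standard_normal((N, n))          # independent innovations across units and time
x = np.zeros((N, n))
for t in range(1, n):                      # simulate each micro-unit AR(1)
    x[:, t] = rho * x[:, t - 1] + eps[:, t]
X = x.sum(axis=0)                          # cross-sectional aggregate

def sample_acf(y, nlags):
    y = y - y.mean()
    c0 = np.dot(y, y) / len(y)
    return np.array([np.dot(y[:-k], y[k:]) / len(y) / c0 for k in range(1, nlags + 1)])

print(sample_acf(X, 50)[[0, 9, 24, 49]])   # autocorrelations at lags 1, 10, 25, 50 decay slowly
```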
The differencing parameter, $d$, concisely describes long memory properties, and much interest in the possibility of long memory or anti-persistence focuses on the question of its value. In practice $d$ is typically regarded as unknown, and so its estimation has been the focus of much research. Indeed, an estimate of $d$ is useful even in estimating the variance of the sample mean.
In order to estimate $d$ we need to consider the modeling of dependence in more detail. The simplest possible realistic model for a covariance stationary series is a parametric one that expresses $\gamma_j$, for all $j$, or $f(\lambda)$, for all $\lambda$, as a parametric function of just two parameters, $d$ and an unknown scale factor. The earliest such model is “fractional noise,” which arises from considerations of self-similarity. A stochastic process $y(t)$, $t \ge 0$, is self-similar with “self-similarity parameter” $H$ if, for any $a > 0$, $y(at)$ has the same distribution as $a^H y(t)$. If the differences $x_t = y(t) - y(t-1)$, for integer $t$, are covariance stationary, we obtain

$\gamma_j = \frac{\sigma^2}{2}\left(|j+1|^{2H} - 2|j|^{2H} + |j-1|^{2H}\right).$

This decays like $j^{2H-2}$ as $j \to \infty$, so on taking $d = H - \tfrac{1}{2}$ we have again the asymptotic law (1.1); $\sigma^2$ is the unknown scale parameter in this model.
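A short numerical check, in Python, of the decay rate just stated, using the standard fractional Gaussian noise autocovariance reconstructed above; the value of $H$ is an arbitrary illustration.

```python
# Verify numerically that gamma_j decays roughly like j^{2H-2}, i.e. j^{2d-1} with d = H - 1/2.
import numpy as np

def fgn_autocov(j, H, sigma2=1.0):
    """Autocovariance of fractional Gaussian noise at lag j."""
    j = np.abs(np.asarray(j, dtype=float))
    return 0.5 * sigma2 * ((j + 1) ** (2 * H) - 2 * j ** (2 * H) + np.abs(j - 1) ** (2 * H))

H = 0.8                                   # long memory: d = H - 0.5 = 0.3
lags = np.array([10, 100, 1000, 10000])
g = fgn_autocov(lags, H)
# slopes of log gamma_j against log j should approach 2H - 2 = -0.4
print(np.diff(np.log(g)) / np.diff(np.log(lags)))
```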
This model was studied by Mandelbrot and Van Ness (1968) and others, but it extends less naturally to richer stationary series, and nonstationary series, and has an unpleasant spectral form (see the discussion of Whittle estimates), so it has received less attention in recent years than another two-parameter model, the “fractional differencing” model proposed by Adenstedt (1974):

(2.1) $f(\lambda) = \frac{\sigma^2}{2\pi}\left|2\sin\frac{\lambda}{2}\right|^{-2d}, \quad -\pi < \lambda \le \pi.$

When $d = 0$, this is just the spectral density of a white noise series (with variance $\sigma^2$), while for $0 < d < \tfrac{1}{2}$ both properties (1.1) and (1.2) hold, Adenstedt (1974) giving a formula for $\gamma_j$ as well as other properties. Note that $d < \tfrac{1}{2}$ is necessary for integrability of $f(\lambda)$, that is for $x_t$ to have finite variance; this restriction is sometimes called the stationarity condition on $d$. Another mathematically important restriction is that of invertibility, $d > -\tfrac{1}{2}$.
The “typical spectral shape of an economic variable” was identified by Granger (1966) as entailing not only spectral divergence at zero frequency, but monotonic decay with frequency. Both “fractional differencing” and “fractional noise” models have this simple property. But even if monotonicity holds, as it may, at least approximately, in case of deseasonalized series, the notion that the entire autocorrelation structure can be explained by a single parameter, $d$, is highly questionable. Though $d$ determines the long-run or low-frequency behavior of $x_t$, greater flexibility in modeling short-run, high-frequency behavior may be desired. The model (2.1) was referred to as “fractional differencing” because it is the spectral density of $x_t$ generated by

(2.2) $(1 - L)^d x_t = \varepsilon_t,$

where $\varepsilon_t$ is a sequence of uncorrelated variables with zero mean and variance $\sigma^2$, $L$ is the lag operator, and

$(1 - L)^d = \sum_{j=0}^{\infty} \frac{\Gamma(j - d)}{\Gamma(-d)\,\Gamma(j + 1)} L^j,$

where $\Gamma$ denotes the gamma function. With $d = 1$ (and a suitable initial condition), (2.2) would describe a random walk model. The model

(2.3) $a(L)(1 - L)^d x_t = b(L)\varepsilon_t$

was stressed by Box and Jenkins (1971), $d$ here being a positive integer, and $a(L)$, $b(L)$ being the polynomials

$a(L) = 1 - \sum_{j=1}^{p} a_j L^j, \qquad b(L) = 1 + \sum_{j=1}^{q} b_j L^j,$

all of whose zeros are outside the unit circle, with $a(L)$ and $b(L)$ having no zero in common to ensure identifiability of the autoregressive (AR) order $p$ and the moving average (MA) order $q$. Granger and Joyeux (1980) considered instead fractional $d$ in (2.3), giving a fractional autoregressive integrated moving average model of orders $p$, $d$, and $q$ (often abbreviated as FARIMA($p, d, q$) or ARFIMA($p, d, q$)). It has spectral density

(2.4) $f(\lambda) = \frac{\sigma^2}{2\pi}\,\frac{|b(e^{i\lambda})|^2}{|a(e^{i\lambda})|^2}\left|2\sin\frac{\lambda}{2}\right|^{-2d}.$
Granger and Joyeux (1980) principally discussed the simple case (2.1) of Adenstedt (1974), but they also considered estimation of $d$, prediction, and simulation of long memory series. Further discussion of FARIMA models was provided by Hosking (1981), much of it based on Adenstedt’s (1974) model (2.1), but he also gave results for the general case (2.4).
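For concreteness, here is a minimal Python sketch of the expansion coefficients in (2.2) and of approximate simulation of a FARIMA(0, d, 0) series by truncating the MA representation of $(1 - L)^{-d}$; the truncation length M is an arbitrary illustrative choice, not a recommendation from the literature cited above.

```python
# Fractional differencing weights via the Gamma-ratio recursion, and a crude simulator.
import numpy as np

def ar_weights(d, M):
    """Coefficients of (1 - L)^d = sum_j pi_j L^j, computed recursively."""
    pi = np.empty(M + 1)
    pi[0] = 1.0
    for j in range(1, M + 1):
        pi[j] = pi[j - 1] * (j - 1 - d) / j
    return pi

def simulate_farima0d0(n, d, M=2000, rng=None):
    """Approximate FARIMA(0, d, 0): x_t = sum_{j<=M} psi_j eps_{t-j}, psi_j from (1 - L)^{-d}."""
    rng = np.random.default_rng() if rng is None else rng
    psi = ar_weights(-d, M)                  # MA weights are the AR weights with d -> -d
    eps = rng.standard_normal(n + M)
    return np.convolve(eps, psi, mode="full")[M:M + n]

x = simulate_farima0d0(10000, d=0.3, rng=np.random.default_rng(1))
print(x.mean(), x.var())
```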
An enduringly popular proposal for estimating $H$, or $d$, used the adjusted rescaled range (R/S) statistic

$R/S = \frac{\max_{1 \le k \le n}\sum_{t=1}^{k}(x_t - \bar{x}) \;-\; \min_{1 \le k \le n}\sum_{t=1}^{k}(x_t - \bar{x})}{\left(\frac{1}{n}\sum_{t=1}^{n}(x_t - \bar{x})^2\right)^{1/2}}$

of Hurst (1951) and Mandelbrot and Wallis (1969). Large sample statistical properties of the R/S statistic were studied by Mandelbrot and Taqqu (1979) and Taqqu (1975), and it was considered in an economic context by Bloomfield (1972). But its limit distribution is nonstandard and difficult to use in statistical inference, while it has no known optimal efficiency properties with respect to any known family of distributions.
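As an illustration, a small Python sketch of the adjusted rescaled range under the textbook definition reconstructed above; details of centering and scaling vary across authors, so this should be read as one plausible version rather than the definitive statistic.

```python
# Adjusted rescaled range: range of partial sums of deviations over the standard deviation.
import numpy as np

def rescaled_range(x):
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    partial = np.cumsum(dev)
    r = partial.max() - partial.min()
    s = dev.std()                      # sqrt of n^{-1} * sum of squared deviations
    return r / s

rng = np.random.default_rng(2)
iid = rng.standard_normal(5000)
# Roughly O(1) for short memory; under long memory R/S grows like n^{d + 1/2}, so this ratio grows like n^d.
print(rescaled_range(iid) / np.sqrt(len(iid)))
```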
Despite the distinctive features of long memory series, there is no overriding reason why traditional approaches to parametric estimation in time series should be abandoned in favor of rather special approaches like R/S. In fact, if $x_t$ is assumed Gaussian, the Gaussian maximum likelihood estimate (MLE) might be expected to have optimal asymptotic statistical properties, and unlike R/S, can be tailored to the particular parametric model assumed.
The literature on the Gaussian MLE developed first with short memory processes in mind (see, e.g., Whittle, 1951; Hannan, 1973). One important finding was that the Gaussian likelihood can be replaced by various approximations without affecting first order limit distributional behavior. Under suitable conditions, estimates maximizing such approximations, called “Whittle estimates,” are all $n^{1/2}$-consistent and have the same limit normal distribution as the Gaussian MLE.
One particular Whittle estimate that seems particularly computationally advantageous is the discrete-frequency form. Suppose the parametric spectral density has form $f(\lambda) = \frac{\sigma^2}{2\pi} h(\lambda; \theta)$, where $\theta$ is a finite-dimensional unknown parameter vector and $\sigma^2$ is a scalar as in (2.1). If $\theta$ is regarded as varying freely from $\sigma^2$, and $\int_{-\pi}^{\pi}\log h(\lambda; \theta)\,d\lambda = 0$ for all admissible values of $\theta$, then we have what might be called a “standard parameterization.” For example, we have a standard parameterization in (2.1) with $\theta = d$, and in (2.4) with $\theta$ determining the $a_j$, the $b_j$, and $d$. Define also the periodogram

$I(\lambda) = \frac{1}{2\pi n}\left|\sum_{t=1}^{n} x_t e^{it\lambda}\right|^2$

and the Fourier frequencies $\lambda_j = 2\pi j/n$. Denoting by $\theta_0$ the true value of $\theta$, the discrete frequency Whittle estimate of $\theta_0$ minimizes (2.5), which approximates, up to a constant, minus the Gaussian log likelihood,

(2.5) $Q(\theta) = \log\left(\frac{1}{n}\sum_{j=1}^{n-1}\frac{I(\lambda_j)}{h(\lambda_j; \theta)}\right) + \frac{1}{n}\sum_{j=1}^{n-1}\log h(\lambda_j; \theta).$
Hannan (1973) stressed this estimate. It has the advantage of using directly the form of $h(\lambda; \theta)$, which is readily written down in case of autoregressive moving average (ARMA) models, Bloomfield’s (1972) spectral model, and others; on the other hand, autocovariances, partial autocovariances, AR coefficients, and MA coefficients, which variously occur in other types of Whittle estimates, tend to be more complicated except in special cases; indeed, for (2.4) the form of autocovariances, for example, can depend on the question of multiplicity of zeros of $a(L)$. Another advantage of (2.5) is that it makes direct use of the fast Fourier transform, which enables the periodogram ordinates to be rapidly computed even when $n$ is very large. A third advantage is that mean-correction of $x_t$ is dealt with simply by omission of the zero frequency, $j = 0$.
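To make the discrete-frequency form concrete, here is a minimal Python sketch of the FFT-based periodogram and of Whittle estimation of $d$ in the one-parameter model (2.1). The concentrated objective is the form reconstructed as (2.5) above, and the grid search is a simple illustrative device, not the article’s prescription.

```python
# Discrete-frequency Whittle estimation for the FARIMA(0, d, 0) model (2.1).
import numpy as np

def periodogram(x):
    """Periodogram I(lambda_j) at Fourier frequencies lambda_j = 2*pi*j/n, j = 1..n-1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    I = np.abs(np.fft.fft(x)) ** 2 / (2 * np.pi * n)
    lam = 2 * np.pi * np.arange(n) / n
    return lam[1:], I[1:]              # drop j = 0, which also takes care of mean correction

def whittle_objective(d, lam, I):
    """Concentrated Whittle objective with h(lambda; d) = |2 sin(lambda/2)|^{-2d}."""
    h = (2 * np.sin(lam / 2)) ** (-2 * d)
    return np.log(np.mean(I / h)) + np.mean(np.log(h))

def whittle_d(x, grid=np.linspace(-0.49, 0.49, 197)):
    lam, I = periodogram(x)
    vals = [whittle_objective(d, lam, I) for d in grid]
    return grid[int(np.argmin(vals))]

# Example (using the hypothetical simulator from the earlier sketch):
# print(whittle_d(simulate_farima0d0(5000, 0.3)))
```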
A notable feature of Whittle estimates of $\theta$, first established in case of short memory series, is that while they are only asymptotically efficient when $x_t$ is Gaussian, their limit distribution (in case of “standard parameterizations”) is unchanged by many departures from Gaussianity. Thus the same, relatively convenient, statistical methods (hypothesis testing, interval estimation) can be used without worrying too much about the question of Gaussianity. Hannan (1973) established asymptotic statistical properties for several Whittle forms in case $x_t$ has a linear representation in homoscedastic stationary martingale differences having finite variance.
It is worth noting that Hannan (1973) first established consistency under only ergodicity of $x_t$, so that long memory was actually included here. However, for his asymptotic normality result, with $n^{1/2}$-convergence, which is crucial for developing statistical inference, his conditions excluded long memory, and clearly (2.5) appears easier to handle technically in the presence of a smooth $f(\lambda)$ than of one with a singularity. Robinson (1978b) developed extensions to cover “nonstandard parameterizations,” his treatment hinting at how a modest degree of long memory might be covered. He reduced the problem to a central limit theorem for finitely many sample autocovariances, whose asymptotic normality had been shown by Hannan (1976) to rest crucially on square integrability of the spectral density; note that (2.1) and (2.4) are square integrable only for $d < \tfrac{1}{4}$. In fact for some forms of Whittle estimate, Yajima (1985) established the central limit theorem, again with $n^{1/2}$-rate, in case of model (2.1) with $d < \tfrac{1}{4}$.
Fox and Taqqu (1986) provided the major breakthrough in justifying Whittle estimation in long memory models. Their objective function was not (2.5) but the continuous frequency form

(2.6) $\int_{-\pi}^{\pi}\left\{\log f(\lambda; \theta) + \frac{I(\lambda)}{f(\lambda; \theta)}\right\}d\lambda,$

but their basic insight applies to (2.5) also. Because the periodogram is an asymptotically unbiased estimate of the spectral density only at continuity points, it can be expected to “blow up” as $\lambda \to 0+$ under long memory. However, because $f(\lambda; \theta)$ also “blows up” as $\lambda \to 0+$ and appears in the denominator, some “compensation” can be expected. Actually, limiting distributional behavior depends on the “score” (the derivative in $\theta$ of (2.6) or (2.5)) at $\theta = \theta_0$ being asymptotically normal; Fox and Taqqu (1987) gave general conditions for such quadratic forms to be asymptotically normal, which then apply to Whittle estimates under long memory with $0 < d < \tfrac{1}{2}$.
Gaussianity of $x_t$ was assumed by Fox and Taqqu (1986), and by Dahlhaus (1989), who considered the actual Gaussian MLE and discrete-frequency Whittle estimate, and established asymptotic efficiency. For (2.6) Giraitis and Surgailis (1990) relaxed Gaussianity to a linear process in independent and identically distributed (iid) innovations, thus providing a partial extension of Hannan’s (1973) work to long memory. The bulk of this asymptotic theory has not directly concerned the discrete frequency form (2.5), and has focused mainly on the continuous frequency form (2.6), though the former benefits from the neat form of the spectral density in case of the popular class (2.4); on evaluating the integral in (2.6), we have a quadratic form involving the Fourier coefficients of $1/f(\lambda; \theta)$, which are generally rather complicated for long memory models. Also, in (2.6) and the Gaussian MLE, correction for an unknown mean must be explicitly carried out, not dealt with merely by dropping the zero frequency.
Other estimates have been considered. While Whittle estimation of the models (2.2) and (2.3) requires numerical optimization, Kashyap and Eom (1988) proposed a closed-form estimate of $d$ in (2.2) by a log periodogram regression across the Fourier frequencies $\lambda_j$. This idea does not extend nicely to models (2.3) with $p > 0$ or $q > 0$, but it does to

(2.7) $f(\lambda) = \frac{\sigma^2}{2\pi}\left|2\sin\frac{\lambda}{2}\right|^{-2d}\exp\left(\sum_{j=1}^{p}\theta_j\cos(j\lambda)\right)$

(see Robinson, 1994a), which combines (2.2) with Bloomfield’s (1972) short memory exponential model; Moulines and Soulier (1999) provided asymptotic theory for log periodogram regression estimation of (2.7). They assumed Gaussianity, which, for technical reasons, is harder to avoid when a nonlinear function of the periodogram, such as the log, is involved than in Whittle estimation, despite the latter being originally motivated by Gaussianity. Whittle estimation is also feasible with (2.7); indeed Robinson (1994a) noted that it can be reparameterized in such a way that the limiting covariance matrix of the Whittle estimates is desirably diagonal.
In econometrics the generalized method of moments (GMM) has been proposed for estimating many models, including long memory models. But GMM objective functions seem in general to be less computationally attractive than (2.5), require stronger regularity conditions in asymptotic theory, and do not deal so nicely with an unknown mean. Also, unless a suitable weighting is employed they will be less efficient than Whittle estimates in the Gaussian case, have a relatively cumbersome limiting covariance matrix, and are not even asymptotically normal when $d \ge \tfrac{1}{4}$. But note that $n^{1/2}$-consistency and asymptotic normality of Whittle estimates cannot even be taken for granted, having been shown not to hold over some or all of the range $0 < d < \tfrac{1}{2}$ for certain nonlinear functions of an underlying Gaussian long memory process (see, e.g., Giraitis & Taqqu, 1999).
Even assuming Gaussianity of $x_t$, nonstandard limit distributional behavior for Whittle estimates can arise in certain models. As observed, a spectral pole (or zero) could arise at a non-zero frequency, to explain a form of cyclic behavior. Gray, Zhang, and Woodward (1989) proposed the “Gegenbauer” model

(2.8) $f(\lambda) = \frac{\sigma^2}{2\pi}\left|2\left(\cos\lambda - \cos\omega\right)\right|^{-2d},$

for $0 < \omega < \pi$. To compare with (2.1), $f(\lambda)$ diverges at frequency $\omega$ if $d > 0$. When $\omega$ is known, the previous discussion of estimation and asymptotic theory applies. If $\omega$ is unknown, then Whittle procedures can be adapted, but it seems that such estimates of $\omega$ (but not of the other parameters) will be $n$-consistent with a nonstandard limit distribution. Giraitis, Hidalgo, and Robinson (2001) established $n$-consistency for an estimate of $\omega$ that, after being suitably standardized, cannot converge in distribution.
“Semiparametric” models for long memory retain the differencing parameter $d$ but treat the short memory component nonparametrically. Correct specification of $p$ and $q$ is very important in parametric fractional autoregressive integrated moving average models. In particular, under-specification of $p$ or $q$ leads to inconsistent estimation not only of autoregressive (AR) and moving average (MA) coefficients, but also of $d$, as does over-specification of both, due to a loss of identifiability. Procedures of order-determination developed for short memory models, such as Akaike’s information criterion (AIC), have been adapted to FARIMA models, but there is no guarantee that the underlying model belongs to the finite-parameter class proposed. That an attempt to seriously model short-run features can lead to inconsistent estimation of long-run properties seems very unfortunate, especially if the latter happen to be the aspect of most interest.
Short-run modeling is seen from (1.1) and (1.2) to be almost irrelevant at very low frequencies and very long lags, where $d$ dominates. This suggests that estimates of $d$ can be based on information arising from only one or the other of these domains, and that such estimates should have validity across a wide range of short memory behavior. Because this robustness requires estimates to essentially be based on only a vanishingly small fraction of the data as sample size increases, one expects slower rates of convergence than for estimates based on a correct finite-parameter model. But in very long series, such as arise in finance, the degrees of freedom available may be sufficient to provide adequate precision. These estimates are usually referred to as “semiparametric,” though their slow convergence rates make them more akin to “nonparametric” estimates in other areas of statistics; indeed, some are closely related to the smoothed nonparametric spectrum estimates familiar from short memory time series analysis.
It is worth stressing that not just point estimation of $d$ is of interest, but also interval estimation and hypothesis testing. Probably the test of most interest to practitioners is a test of long memory, or rather, a test of short memory, $d = 0$, against long memory alternatives, $d > 0$, or anti-persistent alternatives, $d < 0$, or both, $d \ne 0$. For this we need a statistic with a distribution that can be satisfactorily approximated, and computed, under $d = 0$, and that has good power. In a parametric setting, tests of $d = 0$—perhaps of Wald, Lagrange multiplier or likelihood-ratio type—can be based on Whittle functions such as (2.5) and the FARIMA family. Actually, much of the limit distribution theory for Whittle estimation primarily concerned with stationary long memory, $0 < d < \tfrac{1}{2}$, does not cover $d = 0$, or $d < 0$, but other earlier short memory theory, such as Hannan’s (1973), can provide null limit theory for testing $d = 0$. Because the test statistic is based on assumed $p$ and $q$, the null limit distribution developed on this basis is generally invalid if $p$ and $q$ are misspecified, as discussed earlier; this can lead, for example, to mistaking unaccounted-for short memory behavior for long memory, and rejecting the null too often. The invalidity of tests of $d = 0$ based on the R/S statistic introduced previously in the presence of unanticipated short memory autocorrelation was observed by Lo (1991), who proposed a corrected statistic (using smoothed nonparametric spectral estimation at frequency zero) and developed its limit distribution under $d = 0$ in the presence of a wide range of short memory dependence (described by mixing conditions), and tested stock returns for long memory.
The null limit theory of Lo’s (1991) modified R/S statistic is nonstandard. Any number of possible statistics has sensitivity to long memory. Of these, some have the character of “method-of-moments” estimates, minimizing a “distance” between population and sample properties. Robinson (1994b) proposed an “averaged periodogram” estimate of $d$, employing the averaged periodogram as an estimate of the spectral distribution function near zero frequency, establishing consistency under finiteness of only second moments and allowing for the presence of an unknown slowly varying factor in $f(\lambda)$, so that (1.2) is relaxed to

(3.1) $f(\lambda) \sim L(1/\lambda)\,\lambda^{-2d} \quad \text{as } \lambda \to 0+,$

where $L(\cdot)$ is slowly varying at infinity. In this setting, Delgado and Robinson (1996) proposed data-dependent choices of the bandwidth number $m$ (analogous to the one discussed later in relation to log periodogram estimation, for example) that is required in the estimation, and Lobato and Robinson (1996) established limit distribution theory, which is complicated: the estimate is asymptotically normal for $0 < d < \tfrac{1}{4}$, but non-normal for $\tfrac{1}{4} < d < \tfrac{1}{2}$. Various other semiparametric estimates of $d$ share this latter property, which is due to $f(\lambda)$ not being square-integrable for $d \ge \tfrac{1}{4}$.
The traditional statistical practice of regression turns out to be fruitful. The asymptotic law (1.1) suggests two approaches: nonlinearly regressing sample autocovariances on the power law $cj^{2d-1}$, and ordinary least squares (OLS) regression of logged sample autocovariances on $\log j$ and an intercept, as proposed by Robinson (1994a). But the limit distributional properties of these estimates are as complicated as those for the averaged periodogram estimate, intuitively because OLS is a very ad hoc procedure in this setting, the implied “disturbances” in the “regression model” being far from uncorrelated or homoscedastic.
We can expect OLS to yield nice results only if the disturbances are suitably “whitened.” In case at least of short memory series, the (Toeplitz) covariance matrix of $(x_1, \ldots, x_n)'$ is approximately diagonalized by a unitary transformation, such that suitably normalized discrete Fourier transforms (and thence normalized periodograms $I(\lambda_j)/f(\lambda_j)$; cf. (2.4)) sufficiently resemble a zero-mean, uncorrelated, homoscedastic sequence. In case of long memory series, (1.2) suggests consideration of the regression relation

$\log I(\lambda_j) \approx \log c - 2d\log\lambda_j + \text{error},$

for a positive constant $c$ and $\lambda_j$ close to zero, as pursued by Geweke and Porter-Hudak (1983), though they instead employed a narrow band version of the “fractional differencing” model (2.1), specifically replacing $\lambda_j^{-2d}$ by $|2\sin(\lambda_j/2)|^{-2d}$. They carried out OLS regression over $j = 1, \ldots, m$, where $m$, a bandwidth or smoothing number, is much less than $n$ but is regarded as increasing slowly with $n$ in asymptotic theory. (Geweke and Porter-Hudak’s (1983) approach was anticipated by a remark of Granger and Joyeux (1980).) Geweke and Porter-Hudak argued, in effect, that as $n \to \infty$ their estimate $\hat{d}$ satisfies

(3.2) $m^{1/2}(\hat{d} - d) \to_d N\!\left(0, \frac{\pi^2}{24}\right),$

giving rise to extremely simple inferential procedures. But the heuristics underlying their argument are defective, and they, and some subsequent authors, did not come close to providing a rigorous proof of (3.2). One problem with their heuristics is that for long memory (and anti-persistent) series the normalized periodograms $I(\lambda_j)/f(\lambda_j)$ are not actually asymptotically uncorrelated or homoscedastic for fixed $j$ as $n \to \infty$, as shown by Künsch (1986), and elaborated upon by Hurvich and Beltrao (1993) and Robinson (1995a). Robinson (1995a) showed that this in itself invalidates Geweke and Porter-Hudak’s (1983) argument. Even for $j$ increasing with $n$, the approximation of the normalized periodograms by an uncorrelated, homoscedastic sequence is not very good, and this, and the nonlinear involvement of the periodogram, makes a proof of (3.2) non-trivial.
In Robinson (1995a), (3.2) was established, explicitly in case of the approximation (3.1) rather than Geweke and Porter-Hudak’s version, though indicating that the same result holds there. His result applies to the range $-\tfrac{1}{2} < d < \tfrac{1}{2}$, providing simple interval estimates as well as a simple test of short memory, $d = 0$. Robinson (1995a) assumed Gaussianity, but Velasco (2000) gave an extension to linear processes, both authors employing Künsch’s (1986) suggestion of trimming out the lowest frequencies $\lambda_j$ to avoid the anomalous behavior of periodograms there, but Hurvich, Deo, and Brodsky (1998) showed that this was unnecessary for (3.2) to hold, under suitable conditions. These authors also addressed the issue of choice of the bandwidth, $m$, providing optimal asymptotic minimum mean-squared error theory. If $f(\lambda)$ is twice differentiable at $\lambda = 0$, the optimal bandwidth is of order $n^{4/5}$, but the multiplying constant depends on unknown population quantities. A consistent estimate of this constant was proposed by Hurvich and Deo (1999), and hence a feasible, data-dependent choice of $m$. Hurvich and Beltrao (1994) had related mean squared error to integrated mean squared error in spectral density estimation, and thence proposed cross-validation procedures for choosing both $m$ and the trimming constant.

The “log-periodogram estimates” just discussed have been greatly used empirically, deservedly so in view of their nice asymptotic properties and strong intuitive appeal. But in view of the limited information they employ there is a concern about precision, and it is worth asking at least whether the information can be used more efficiently. In fact Robinson (1995a) showed that indeed the asymptotic variance in (3.2) can be reduced by “pooling” adjacent periodograms, prior to logging.
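A minimal Python sketch of the log periodogram regression just described, regressing $\log I(\lambda_j)$ on $-2\log|2\sin(\lambda_j/2)|$ over $j = 1, \ldots, m$; the bandwidth rule $m = \sqrt{n}$ is an arbitrary illustrative choice rather than an optimal one, and the standard error uses the asymptotic variance in (3.2).

```python
# Geweke-Porter-Hudak style log periodogram regression estimate of d.
import numpy as np

def gph_estimate(x, m=None):
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = int(np.sqrt(n)) if m is None else m
    j = np.arange(1, m + 1)
    lam = 2 * np.pi * j / n
    I = np.abs(np.fft.fft(x))[j] ** 2 / (2 * np.pi * n)
    regressor = -2 * np.log(2 * np.sin(lam / 2))
    X = np.column_stack([np.ones(m), regressor])
    coef, *_ = np.linalg.lstsq(X, np.log(I), rcond=None)
    d_hat = coef[1]
    se = np.pi / np.sqrt(24 * m)          # asymptotic standard error from (3.2)
    return d_hat, se

# Usage: d_hat, se = gph_estimate(x); an approximate 95% interval is d_hat +/- 1.96 * se.
```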
A proposal of Künsch (1987), however, leads to an alternative frequency-domain estimate that does even better. He suggested a narrow-band discrete-frequency Whittle estimate (cf. (2.5)). This essentially involves Whittle estimation of the “model” $f(\lambda) = C\lambda^{-2d}$ over frequencies $\lambda_j$, $j = 1, \ldots, m$, where $m$ plays a similar role as in log periodogram estimation. After that, $C$ can be eliminated by a side calculation (much as the innovation variance is eliminated in getting (2.5)), and $d$ is estimated by $\hat{d}$, which minimizes

(3.3) $R(d) = \log\left(\frac{1}{m}\sum_{j=1}^{m}\lambda_j^{2d} I(\lambda_j)\right) - \frac{2d}{m}\sum_{j=1}^{m}\log\lambda_j.$

There is no closed-form solution to (3.3), but it is easy to handle numerically. Robinson (1995b) established that

(3.4) $m^{1/2}(\hat{d} - d) \to_d N\!\left(0, \tfrac{1}{4}\right).$

For the same $m$ sequence, $\hat{d}$ is then more efficient than the log periodogram estimate (cf. (3.2)), while the pooled log periodogram estimate of Robinson (1995a) has asymptotic variance that converges to $\tfrac{1}{4}$ from above as the degree of pooling increases. While $\hat{d}$ is only implicitly defined, it is nevertheless easy to locate, and the linear involvement of the periodogram in (3.3) makes it possible to establish (3.4) under simpler and milder conditions than needed for (3.2), Robinson (1995b) employing a linear process for $x_t$ with martingale difference innovations. This, and the coverage of all $d$ in $(-\tfrac{1}{2}, \tfrac{1}{2})$, may have implications also for further development of the asymptotic theory of parametric Whittle estimates discussed earlier. An additional feature of the asymptotic theory of Robinson (1995a), and that of Robinson (1995b), is the purely local nature of the assumptions on $f(\lambda)$ and the way in which the theory fits in with earlier work on smoothed nonparametric spectral estimation for short memory series; (1.2) is refined to

$f(\lambda) \sim c\lambda^{-2d}\left(1 + O(\lambda^{\beta})\right) \quad \text{as } \lambda \to 0+,$

where $\beta \in (0, 2]$ is analogous to the local smoothness parameter involved in the spectral estimation work, and no smoothness, or even boundedness, is imposed on $f(\lambda)$ away from zero frequency. Note that the parameter $\beta$ also enters into rules for optimal choice of $m$; see Henry and Robinson (1996). Lobato and Robinson (1998) provided a Lagrange multiplier test of the short memory hypothesis, $d = 0$, based on (3.3) that avoids estimation of $d$.
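A minimal Python sketch of the narrow-band (local) Whittle estimate minimizing the objective reconstructed as (3.3) above; the bandwidth rule, the admissible range for $d$, and the grid search are illustrative choices rather than recommendations from the literature cited.

```python
# Local Whittle estimate of d by grid search over the objective R(d) in (3.3).
import numpy as np

def local_whittle(x, m=None, grid=np.linspace(-0.49, 0.49, 197)):
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = int(n ** 0.65) if m is None else m
    j = np.arange(1, m + 1)
    lam = 2 * np.pi * j / n
    I = np.abs(np.fft.fft(x))[j] ** 2 / (2 * np.pi * n)

    def R(d):
        return np.log(np.mean(lam ** (2 * d) * I)) - 2 * d * np.mean(np.log(lam))

    d_hat = grid[int(np.argmin([R(d) for d in grid]))]
    se = 0.5 / np.sqrt(m)                 # asymptotic standard error from (3.4)
    return d_hat, se
```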
Various refinements to the semiparametric estimates, and their asymptotic theory, have been developed. Hurvich and Beltrao (1994) and Hurvich and Deo (1999) have proposed bias-reduced estimates, while Andrews and Guggenberger (2003) and Robinson and Henry (2003) have developed estimates that can further reduce the bias, and have smaller asymptotic minimum mean squared error, using, respectively, an extended regression and higher-order kernels, Robinson and Henry (2003) at the same time introducing a unified $M$-estimate class that includes the log periodogram and local Whittle estimates as special cases. Giraitis and Robinson’s (2003) development of an Edgeworth expansion for a modified version of the local Whittle estimate also leads to bias reduction, and a rule for bandwidth choice. An alternative refinement of the local Whittle estimate was developed by Andrews and Sun (2004). Additionally, Moulines and Soulier (1999, 2000) and Hurvich and Brodsky (2001) considered a broadband version of log periodogram estimation originally proposed by Janacek (1982), effectively extending the regression over all Fourier frequencies after including cosinusoidal terms, corresponding to the model (2.7) with $p$, now a bandwidth number, increasing slowly with $n$. These authors showed that if $f(\lambda)$ is analytic over all frequencies, an asymptotic mean squared error of order $(\log n)/n$ can thereby be obtained, which is not achievable by the refinements of the narrow-band estimates discussed, though the latter require only local-to-zero assumptions on $f(\lambda)$.
For financial time series, “long memory” has been found not so much in raw time series as in nonlinear instantaneous functions such as their squares, $x_t^2$. Thus, whereas we have so far presented long memory as purely a second-order property of a time series, referring to autocovariances or spectral structure, these do not completely describe non-Gaussian processes, where “memory” might usefully take on a rather different meaning. Passing a process through a nonlinear filter can change asymptotic autocovariance structure, and as Rosenblatt (1961) showed, if $x_t$ is a stationary long memory Gaussian process satisfying (1.1), then $x_t^2$ has autocovariance decaying like $j^{4d-2}$, so $x_t^2$ has “long memory” only when $d > \tfrac{1}{4}$, and even here, because $4d - 2 < 2d - 1$, $x_t^2$ has “less memory” than $x_t$.
Financial time series frequently suggest a reverse kind of behavior. In particular, asset returns, or logged asset returns, may exhibit little autocorrelation, as is consistent with the efficient markets hypothesis, whereas their squares are noticeably correlated. Whereas our previous focus on second order moments led to linear time series models, we must now consider nonlinear ones. There is any number of possibilities, but Engle (1982) proposed to model this phenomenon by the autoregressive conditionally heteroscedastic model of order $p$ (ARCH($p$)), such that

(4.1) $x_t = \sigma_t\varepsilon_t, \qquad \sigma_t^2 = \alpha_0 + \sum_{j=1}^{p}\alpha_j x_{t-j}^2,$

with $\alpha_0 > 0$, $\alpha_j \ge 0$, $j = 1, \ldots, p$, and $\varepsilon_t$ a sequence of independent and identically distributed (iid) random variables (possibly Gaussian). Under suitable conditions on the $\alpha_j$, it follows that the $x_t$ are martingale differences (and thus uncorrelated), whereas the $x_t^2$ have an AR($p$) representation, in terms of martingale difference (but not conditionally homoscedastic) innovations. The ARCH($p$) model was extended by Bollerslev (1986) to the generalized autoregressive conditionally heteroscedastic model of index $(p, q)$ (GARCH($p, q$)), which implies that the $x_t^2$ have an ARMA representation in a similar sense.
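A minimal simulation sketch, in Python, of a GARCH(1, 1) process with illustrative parameter values, showing near-zero first-order autocorrelation in $x_t$ but clearly positive autocorrelation in $x_t^2$.

```python
# GARCH(1, 1) simulation: returns look uncorrelated, squared returns do not.
import numpy as np

def simulate_garch11(n, a0=0.05, a1=0.1, b1=0.85, rng=None, burn=500):
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    sig2 = a0 / (1 - a1 - b1)              # start at the unconditional variance
    for t in range(1, n + burn):
        sig2 = a0 + a1 * x[t - 1] ** 2 + b1 * sig2
        x[t] = np.sqrt(sig2) * eps[t]
    return x[burn:]

def acf1(y):
    y = y - y.mean()
    return np.dot(y[:-1], y[1:]) / np.dot(y, y)

x = simulate_garch11(20000, rng=np.random.default_rng(3))
print(acf1(x), acf1(x ** 2))               # first is near zero, second is clearly positive
```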
The ARCH and GARCH models have found considerable use in finance. But they imply that the autocorrelations of the squares either eventually cut off completely or decay exponentially, whereas empirical evidence of slower decay perhaps consistent with long memory has accumulated; see, for example, Whistler (1990) and Ding, Granger, and Engle (1993). Robinson (1991) had already suggested ARCH-type models capable of explaining greater autocorrelation in squares, so that (4.1) is extended to

(4.2) $\sigma_t^2 = \alpha_0 + \sum_{j=1}^{\infty}\alpha_j x_{t-j}^2,$

or replaced by

(4.3) $\sigma_t^2 = \left(\alpha_0 + \sum_{j=1}^{\infty}\alpha_j x_{t-j}\right)^{2}.$

In case of both models, and related situations, Robinson (1991) developed Lagrange multiplier or score tests of “no-ARCH” (which is consistent with $\alpha_j = 0$ for all $j \ge 1$) against general parameterizations in (4.2) and (4.3); such tests should be better at detecting autocorrelation in $x_t^2$ that falls off more slowly than ones based on the finite-order case of (4.2), say.
We can formally rewrite (4.2) as

(4.4) $x_t^2 = \alpha_0 + \sum_{j=1}^{\infty}\alpha_j x_{t-j}^2 + \nu_t,$

where the $\nu_t = x_t^2 - \sigma_t^2$ are martingale differences. Robinson (1991) suggested the possibility of using for the $\alpha_j$ in (4.4) the AR weights from the FARIMA$(0, d, 0)$ model (see (2.1)), taking $0 < d < \tfrac{1}{2}$, and Whistler (1990) applied this version of his test to test for long memory in exchange rate series. This case was further considered by Ding and Granger (1996), along with other possibilities, but sufficient conditions of Giraitis, Kokoszka, and Leipus (2000) for existence of a covariance stationary solution of (4.4) rule out long memory, though they do permit strong autocorrelation in $x_t^2$ that very closely approaches it, and Giraitis and Robinson (2001) have established asymptotic properties of Whittle estimates based on squares for this model. For FARIMA$(0, d, 0)$ AR weights in (4.2), $x_t$ is not covariance stationary when $d > 0$, and Baillie, Bollerslev, and Mikkelsen (1996) called this FIGARCH, a model that has since been widely applied in finance.
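A hedged Python sketch, not a definitive FIGARCH implementation, of the ARCH($\infty$) recursion (4.2) with weights taken from the expansion of $1 - (1 - L)^d$, the choice discussed above in connection with FIGARCH; truncation length and parameter values are illustrative assumptions.

```python
# ARCH(infinity)-type recursion with hyperbolically decaying weights from 1 - (1 - L)^d.
import numpy as np

def figarch_like(n, d=0.4, a0=0.1, M=1000, burn=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # weights of 1 - (1 - L)^d: alpha_j = -pi_j, with pi_j from the Gamma-ratio recursion
    pi = np.empty(M + 1)
    pi[0] = 1.0
    for j in range(1, M + 1):
        pi[j] = pi[j - 1] * (j - 1 - d) / j
    alpha = -pi[1:]                        # positive weights, summing to (almost) one
    eps = rng.standard_normal(n + burn)
    x2 = np.zeros(n + burn)
    for t in range(n + burn):
        k = min(t, M)
        sig2 = a0 + np.dot(alpha[:k], x2[t - 1::-1][:k]) if k > 0 else a0
        x2[t] = sig2 * eps[t] ** 2
    return np.sqrt(x2[burn:]) * np.sign(eps[burn:])   # recover x_t = sigma_t * eps_t
```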
For model (4.3), Giraitis, Robinson, and Surgailis (2000) have shown that if the weights $\alpha_j$ decay like $j^{d-1}$, $0 < d < \tfrac{1}{2}$, then any integral power of $x_t$, such as the square, has long memory autocorrelation, satisfying (1.1) irrespective of the power. This model also has the advantage over (4.2) of avoiding the non-negativity constraints on the $\alpha_j$, and an ability to explain leverage.
An alternative approach to modeling autocorrelation in squares, and other nonlinear functions, alongside possible lack of autocorrelation in $x_t$, expresses $\sigma_t^2$ directly in terms of past $\varepsilon_t$, rather than past $x_t$, leading to a nonlinear MA form. Nelson (1991) proposed the exponential GARCH (EGARCH) model, where we take

$\log\sigma_t^2 = \alpha_0 + \sum_{j=1}^{\infty}\alpha_j g(\varepsilon_{t-j}),$
$g(\cdot)$ being a user-chosen nonlinear function; for example, Nelson stressed $g(\varepsilon) = \theta\varepsilon + \gamma\left(|\varepsilon| - E|\varepsilon_t|\right)$, which is useful in describing a leverage effect. Nelson (1991) noted the potential for choosing the $\alpha_j$ to imply long memory in $\log\sigma_t^2$, but stressed short memory, ARMA, weights. On the other hand, Robinson and Zaffaroni (1997) proposed nonlinear MA models, such as

(4.5) $x_t = \varepsilon_t\left(\rho + \sum_{j=1}^{\infty}\alpha_j\varepsilon_{t-j}\right),$
where the $\varepsilon_t$ are an iid sequence. They showed the ability to choose the $\alpha_j$ such that $x_t^2$ has long memory autocorrelation, and proposed use of Whittle estimation based on the $x_t^2$.
Another model, closely related to (4.5), proposed by Robinson and Zaffaroni (1998), replaces the first factor $\varepsilon_t$ in (4.5) by $\eta_t$, where the $\eta_t$ are iid and independent of the $\varepsilon_t$, and again long memory potential was shown. This model is a special case of (4.6) and (4.7), the $\alpha_j$ being MA weights in the $\varepsilon_t$. They considered Whittle estimation based on squares, discussing its consistency, and applying the model to stock price data.
Asymptotic theory for ML estimates of models such as (4.5), (4.6), and (4.7) is considerably more difficult to derive; indeed, it is hard to write down the likelihood, given, say, Gaussian assumptions on $\varepsilon_t$ and $\eta_t$. In order to ease mathematical tractability in view of the nonlinearity in (4.7), Gaussianity of $\varepsilon_t$ was stressed by Breidt, Crato, and de Lima (1998). In that case, we can write the exponent of $\sigma_t$ in (4.7) as $z_t$, where $z_t$ is a stationary Gaussian, possibly long memory, process, and likewise the second factor in (4.5). Such models are all covered by modeling $x_t$ as a general nonlinear function of a vector unobservable Gaussian process $z_t$. Starting from an asymptotic expansion for the covariance of functions of multivariate normal vectors, Robinson (2001) indicated how long memory in nonlinear functions of $z_t$ depends on the long memory in $z_t$ and the nature of the nonlinearity involved, with application also to cyclic behavior, cross-sectional and temporal aggregation, and multivariate models. Allowance for quite general nonlinearity means that relatively little generality is lost by the Gaussianity assumption on $z_t$, while the scope for studying autocorrelation structure of functions such as $x_t^2$ can avoid the assumption of a finite fourth moment in $x_t$, which has been controversial.
Semiparametric models and methods for long memory in volatility have also been considered. In particular, Hurvich, Moulines, and Soulier (2005) investigated properties of the narrow-band discrete-frequency Whittle estimate based on the $\log x_t^2$ series.
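In that spirit, one might apply the local Whittle sketch given earlier to log squared returns; this is only an illustrative device under that sketch's assumptions, not the procedure of Hurvich, Moulines, and Soulier (2005), and the small offset guarding against log of zero is an ad hoc choice.

```python
# Hypothetical usage, reusing local_whittle from the earlier sketch:
import numpy as np

# x = ...                                   # a series of returns
# d_vol, se = local_whittle(np.log(x ** 2 + 1e-12))   # memory parameter of log squared returns
```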
In time series econometrics, unit root models have been a major focus since the late 1980s. Prior to this, modeling of economic time series typically involved a combination of short memory, $I(0)$, series and ones that are nonstochastic, either in the sense of sequences such as dummy variables or polynomial time trends, or of conditioning on predetermined economic variables. On the other hand, unit root modeling starts from the random walk model, that is, (2.2) for $d = 1$ with white noise $\varepsilon_t$ and a suitable initial condition, and then generalizes $\varepsilon_t$ to be a more general $I(0)$ process, modeled either parametrically or nonparametrically; $x_t$ is then said to be an $I(1)$ process. Such models, often with the involvement also of nonstationary time trends, have been successfully used in macroeconometrics, frequently in connection with cointegration analysis.
One essential preliminary step is the testing of the unit root hypothesis. Numerous such tests have been proposed, often directed against AR alternatives, and using classical Wald, Lagrange multiplier, and likelihood-ratio procedures; see, for example, Dickey and Fuller (1979). In classical situations, these lead to a null $\chi^2$ limit distribution, a non-central $\chi^2$ local limit distribution, Pitman efficiency, and a considerable degree of scope for robustness to the precise implementation of the test statistics, for example to the estimate of the asymptotic variance matrix that is employed. The unit root tests against AR alternatives lose such properties; for example, the null limit distribution is nonstandard. This nonstandard behavior arises essentially because the unit root is nested unsmoothly in an AR system: in the AR(1) case, the process is stationary with exponentially decaying autocovariance structure when the AR coefficient lies between -1 and 1, has unit root nonstationarity at 1, and is “explosive” for values exceeding 1. The tests directed against AR alternatives seem not to have very good power against fractional alternatives, as the Monte Carlo investigation of Diebold and Rudebusch (1991) suggests.
Any number of models can potentially nest a unit root, and the fractional class turns out to have the “smooth” properties that lead classically to the standard, optimal asymptotic behavior referred to earlier. Robinson (1994c) considered the model

(5.1) $\phi(L)x_t = u_t, \quad t \ge 1; \qquad x_t = 0, \quad t \le 0,$

where $u_t$ is an $I(0)$ process with parametric autocorrelation and

(5.2) $\phi(L) = (1 - L)^{d_1}(1 + L)^{d_2}\prod_{j=3}^{h}\left(1 - 2\cos\omega_j L + L^2\right)^{d_j},$
where the $\omega_j$ are given distinct real numbers in $(0, \pi)$, and the $d_j$, $j = 1, \ldots, h$, are arbitrary real numbers. The initial condition in (5.1) avoids an unbounded variance, the main interest being in nonstationary $x_t$. Robinson (1994c) proposed tests for specified values of the $d_j$ against fractional alternatives in the class (5.2). For example, in the simplest case the unit root hypothesis $d_1 = 1$ can be tested, but against fractional alternatives $d_1 < 1$, or $d_1 > 1$. Some other null may be of interest, for example, $d_1 = \tfrac{1}{2}$, this being the boundary between stationarity and nonstationarity in the fractional domain. The region $\tfrac{1}{2} \le d_1 < 1$ has been referred to as mean-reverting, the MA coefficients of $x_t$ decaying, albeit more slowly than under stationarity, $d_1 < \tfrac{1}{2}$. Note that the models (5.1) and (5.2) also cover seasonal and cyclical components (cf. the Gegenbauer model (2.8)) as well as stationary and overdifferenced ones. Robinson (1994c) showed that his Lagrange multiplier tests enjoy the classical large-sample properties of such tests.
To intuitively explain this outcome, note that unlike in unit root tests against AR alternatives, the test statistics are based on the null-differenced $x_t$, which are $I(0)$ under the null hypothesis. This would suggest that estimates of memory parameters in (5.1) and (5.2) and of parameters describing $u_t$, such as Whittle estimates, will also continue to possess the kind of standard asymptotic properties—$n^{1/2}$-consistency and asymptotic normality—under nonstationarity as we have encountered in stationary circumstances. Beran (1995), in a case with white noise $u_t$, indicated this, though the initial consistency proof he provides, an essential preliminary to asymptotic distribution theory for his implicitly defined estimate, appears to assume that the estimate lies in a neighborhood of the true value.