Performance Measures for Evaluating the Accuracy of Time Series Hybrid Model Using High Frequency Data

Abstract: The traditional ARIMAX model has rarely been applied to climate change and environmental agents, which are the agents most closely associated with exogenous variables. To neutralize the model for better prediction of such systems, a distributional form of the error term is required that is robust and sufficient in capturing and accommodating both the external covariate(s) and high frequency data. This study therefore evaluates the forecasting accuracy of two forecasting models, namely ARIMAX and log-ARIMAX. The monthly adjusted high frequency data recorded by four Oil and Gas companies from 2005 to 2020 were used. The forecastability of the two models was evaluated with different error metrics. The effects of the Akaike Information Criterion (AIC) and of the linear correlation on candidate models among the oil spill data considered are discussed. Model selection results with respect to AIC show that log-ARIMAX is more efficient and performed better than the traditional ARIMAX model for observations characterized by kurtosis, skewness, outliers, high frequency and large fluctuation series with heavy-tailed traits, as seen in environmental data.


I. Introduction
Non-linear time series models are limited because they are neither robust nor sufficient in capturing high-frequency observational time series, such as those including exogenous factor(s). To improve forecast accuracy, an autoregressive integrated moving average model with exogenous covariate(s), ARIMAX (p, d, q, b), was developed for short-memory observational time series data (such as interest rates, number of stocks sold, and changes in monthly commodity prices). However, for long-memory frequency observations, such as climate and ecological data, a modification is necessary to neutralize the model for a better and improved prediction of the system.

For long-memory frequency observations, such as climate-measured daily, weekly or average monthly temperatures, changes in daily recorded climate, currency exchange rates, the Consumer Price Index (CPI) and GDP, a modification is needed to neutralize or log-linearize the ARIMAX (p,d,q,b) model so as to improve the constant rate of slope change, constant trend (constant mean), transfer function, residual process and prediction of the system. The logarithm is known to help, in differencing or transformation, to stabilize the mean of the time series and to eliminate (or reduce) trends and any seasonality signal in series characterized by high frequency or long memory. A log function would therefore be added to ARIMAX (p,d,q,b) to form Log-ARIMAX (p,d,q,b), neutralizing the threat posed by long-memory traits that might otherwise affect both the parameterization (over-parameterization or under-parameterization) and the reliability of the resulting system forecasts. Forecasts generally emanate from the generalization of the residual processes attached to the model. Log-ARIMAX (p,d,q,b) would be no exception, but additional components of in-sample and out-of-sample forecasts would be incorporated and tested with forecast indexes through the AR, MA and exogenous residual processes. The Log-ARIMAX abstraction of reality would not only make room for merging linear/non-linear regression with the ARMA model, broadening the applicability of non-linear time series models, but would also serve as a platform for introducing generalized non-linear/linear time series for the transfer function (otherwise called the mean function and impulse weight function in the Generalized Linear Model (GLM)). This technique of handling residuals with different structures, distributional forms and variable types would be employed to treat the white noise of the Log-ARIMAX model.
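The stabilizing effect of the logarithm described above can be illustrated with a minimal sketch. It assumes a strictly positive series; the function names and the toy series (10% growth per period) are hypothetical and are not taken from the study's data. Log-transforming and then first-differencing a series with constant proportional growth yields a constant series, which is the sense in which the log removes an exponential trend.

```python
import math

def log_transform(series):
    """Natural log of a strictly positive series."""
    if any(x <= 0 for x in series):
        raise ValueError("log transform requires strictly positive values")
    return [math.log(x) for x in series]

def difference(series, d=1):
    """Apply d rounds of first differencing (the 'd' in ARIMAX)."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

# Hypothetical series growing by 10% per period (illustrative only).
raw = [100.0 * 1.1 ** t for t in range(6)]

# Differencing the raw series leaves a growing sequence ...
raw_diff = difference(raw, d=1)
# ... while differencing the logged series gives a constant, log(1.1).
log_diff = difference(log_transform(raw), d=1)
```

In a Log-ARIMAX workflow of this kind, the logged series would then be passed to the ARIMAX fitting step, and forecasts would be mapped back with the exponential function.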
Additionally, appropriate formulations for the autocorrelation structure (Partial Autocorrelation Function (PACF)) of the error term from the regression equation function (transfer function) of the long-memory (high frequency) data would be identified and compared with those of the ARMA and ARMAX models. Another merit of both the Log-ARIMAX and ARIMAX models would be the attachment of a degree of fit (the contribution of the exogenous variables), as measured by the coefficient of R-squared and its variants, to fitted models, together with the capturing of the dynamics of seasonal variation patterns over time. Long-memory (high frequency) series associated with economic, environmental, climate change and wave data (records of sea and ocean wave patterns) are typical examples of long-memory data because of their fluctuations, higher values, dependency and switching circular traits.

II. Review of Literature
Shilpa & Sephardi (2019) modeled an ARIMAX model, an extension of the ARIMA model incorporating exogenous variables, for short-term load forecasting (STLF) on time series data of the Karnataka State demand pattern. They enhanced the ARIMA model by considering hours of the day and days of the week as the independent variables for the ARIMAX model. Electrical load demand data from the Karnataka Power Transmission Corporation Ltd. (KPTCL) website were used to develop and test the proposed ARIMAX forecasting model. However, for an economic index like inflation, which is a persistent and appreciable rise in the general level of prices, the general price level might be either a response or a predictor variable (Frimpong and Oteng-Abayie, 2010). Another example of a long-memory series with conditional covariates is the exchange rate (the value of the domestic currency in terms of foreign currency). Exchange rate changes can affect relative prices, and thereby the competitiveness of domestic and foreign producers. Theoretically, the exchange rate will have a negative or positive relationship with economic growth. This is because currency depreciation fosters a country's exports, which leads to an increase in Gross Domestic Product (GDP), while currency depreciation also discourages a country's imports, thus leading to a decrease in the GDP of that country. It turns out that appreciation of the exchange rate exerts a positive influence on GDP and real economic growth (Aliyu, 2011). Therefore, exchange rate and GDP might each be the dependent variable or a predictor in an ARIMAX or ARIMA model, depending on the context. Among other natural examples of realizations that best fit the Log-ARIMAX conceptualization are the environmental and climate change series of oil spillage and temperature. Gopinath and Kavithamani (2019) constructed and fitted an ARIMAX model to the production of sugarcane in India in order to forecast future values of sugarcane production from 2015 to 2026.
The aim of the research was centered on accurate prediction. The secondary data used were collected from the Sugarcane Breeding Institute, Coimbatore, India. The ARIMAX model was introduced, and the ascertainment of its order, parameter estimation and diagnostic checking followed the Box and Jenkins method, with the ARIMA and ARIMAX models compared for accuracy. It has also been reported that inappropriate spraying, untreated incidence and inexact forecasting of cocoa black pod disease around the world have led to losses of more than $400 million. External factors such as relative humidity, rainfall and temperature were identified as contributing to this disease.
Iflah and Parul (2020) noted that Auto-Regressive Integrated Moving Average (ARIMA) and Artificial Neural Networks (ANN) are the leading linear and non-linear models, respectively, in machine learning for time series forecasting. Their survey paper reviews recent advances in machine learning and artificial intelligence techniques used for forecasting different events. It presents an extensive survey of work in which hybrid models are compared with the basic models for forecasting on the basis of error parameters such as Mean Absolute Deviation (MAD), Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Normalized Root Mean Square Error (NRMSE). Their results summarize the discussion on the basis of parameters that explain the efficiency of hybrid models relative to models used in isolation. They concluded that hybrid models achieve more accurate results than models used in isolation, although some research papers argue that hybrids cannot always outperform individual models.
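The error parameters listed above can be computed directly from actual and forecast values. The following is a minimal sketch under stated assumptions: the function name and the toy numbers are hypothetical, MAPE requires non-zero actual values, and NRMSE is normalized here by the range of the actual values, which is only one of several conventions in use. In forecasting usage MAD is often computed in the same way as MAE, so it is not listed separately.

```python
import math

def forecast_errors(actual, predicted):
    """Return common forecast error metrics as a dict."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE is undefined when any actual value is zero.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    # Assumption: normalize RMSE by the range of the actual values.
    nrmse = rmse / (max(actual) - min(actual))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "NRMSE": nrmse}

# Toy values for illustration only (not from the study).
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
metrics = forecast_errors(actual, predicted)
```

For these toy values the errors are -2, 2, -3 and 1, giving MAE = 2.0 and MSE = 4.5.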
Ahmed et al. (2020) proposed a new heavy-tailed exponential distribution that accommodates bathtub, upside-down bathtub, decreasing, decreasing-constant, and increasing hazard rates for actuarial data. Actuarial measures including value at risk, tail value at risk, tail variance, and tail variance premium are derived. A computational study for these actuarial measures was conducted, proving that the proposed distribution has a heavier tail as compared with the alpha power exponential, exponentiated exponential, and exponential distributions.
The literature shows that quite a number of researchers have studied ARIMAX with exogenous covariate(s) using different short-memory frequency data, with little or no strength to capture long-memory (high frequency) observations with heavy-tailed traits. The conventional ARIMAX model has rarely been applied to climate change and environmental agents, which are the agents most closely associated with exogenous variables and are usually characterized by kurtosis, skewness, outliers, long memory (high frequency) and large fluctuation series. A distributional form of the error term that is robust and sufficient in capturing and accommodating both the external covariate(s) and long-memory (high frequency) data is therefore needed to neutralize the model for a better and improved prediction of the system. This motivates the proposal and formulation of a Log-ARIMAX model whose distributional form would be robust and sufficient in capturing and accommodating both the external covariate(s) and the heavy-tailed properties of long-memory (high frequency) observational time series events.

Autoregressive Moving Average
The Autoregressive Moving Average (ARMA) model is considered a mixture of the AR and MA models; it is given by Equations (3) to (5).
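For reference, the standard textbook forms of the AR(p), MA(q) and ARMA(p, q) models are sketched below. The notation is an assumption and may differ from the authors' own Equations (3) to (5): X_t is the observed series, c and \mu are constants, \phi_i and \theta_j are the autoregressive and moving average coefficients, and \varepsilon_t is white noise.

```latex
% AR(p): current value regressed on its own p lags
X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t

% MA(q): current value as a weighted sum of the last q shocks
X_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}

% ARMA(p, q): mixture of the two
X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
```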

a. Criteria for Selection of Optimal Models
The criteria for selection of optimal models are the Final Prediction Error (FPE), Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) (13).
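As an illustration of two of these criteria, the sketch below uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where ln L is the maximized log-likelihood, k the number of estimated parameters and n the number of observations. These are the usual textbook forms and may differ in scaling from the authors' own expressions; the FPE is not reconstructed here. The candidate names and log-likelihood values are hypothetical.

```python
import math

def aic(log_likelihood, k):
    """Akaike's Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical candidates: name -> (maximized log-likelihood, parameters).
candidates = {
    "model_a": (-120.0, 3),
    "model_b": (-118.5, 5),
}
n_obs = 100  # hypothetical sample size

aic_scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
bic_scores = {name: bic(ll, k, n_obs) for name, (ll, k) in candidates.items()}

# The preferred model is the one with the smallest criterion value.
best_by_aic = min(aic_scores, key=aic_scores.get)
```

Here model_b fits slightly better, but its two extra parameters are penalized, so model_a has the smaller AIC (246 against 247).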

Figure 2. A Time Plot of the Observed Data for BG (2007-2020)
Figure 2 shows the in-sample time plots of the selected models in the second time regime (2007-2020). It is evident from the graph that all the candidate models performed well, as each model followed the time plot of the observed data (BG). However, the candidate models followed the observed data (BG) more closely in the first regime than in the second regime (2007-2020). Error metrics for the second time regime are large compared to those of the first time regime, and the graph in the second time regime is not as compact as the one in the first time regime.
Figure 4 shows the in-sample time plots of the selected models in the second time regime (2007-2020). It is evident from the graph that all the candidate models performed well, as each model followed the time plot of the observed data (BP). However, the candidate models followed the observed data (BP) more closely in the first regime than in the second regime (2007-2020).
Fundamental measures that are instrumental in the forecastability of the candidate methods are the Akaike Information Criterion (AIC) and the linear correlation between the considered oil spills given the respective history of data. These oil spills are: BUNGE LIMITED (BG GRP), BP, CAINE ENERGY (CNE) and TULLOW OIL (TLW). The linear correlation between the considered oil spills given the six-year history of monthly adjusted close price data is given in Table 2, and the linear correlation given the four-year history of monthly adjusted close price data is given in Table 3. Bunge Limited is a company in the Oil and Gas Industry in Nigeria. With reference to the correlation matrix in the first time regime, it has the lowest correlation with BP (-0.2490) and the highest with TLW (+0.9114). Likewise, in the second time regime it has its lowest and highest correlations with BP and TLW, at -0.0678 and +0.8106 respectively.
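The selection of exogenous candidates by their linear correlation with the target series can be sketched as follows. This is a minimal illustration, not the authors' procedure: the series and the names target, x_a, x_b and x_c are hypothetical toy data, and "lowest" is taken to mean the smallest signed Pearson coefficient, consistent with the negative values reported for BP.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient between two sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy series for illustration only (not the study's data).
target = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "x_a": [2.0, 4.0, 6.0, 8.0, 11.0],
    "x_b": [3.0, 1.0, 4.0, 1.0, 5.0],
    "x_c": [5.0, 4.0, 3.0, 2.0, 1.0],
}

corr = {name: pearson(target, series) for name, series in candidates.items()}

# Candidates with the highest and lowest signed correlation with the target.
highest = max(corr, key=corr.get)
lowest = min(corr, key=corr.get)
```

The two selected series would then each be supplied in turn as the exogenous regressor when fitting the ARIMAX model for the target.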
The oil spills with the lowest and highest correlation with BG served as the exogenous variables to the ARIMAX model, using BG as the univariate variable, in the case of the two time regimes shown in Table 2. Unlike in the second time regime, in the first time regime ARIMAX with and without the exogenous variable had the same model but significantly different Akaike's Information Criterion (AIC) values. Even though the AICs for the respective models in the two time regimes are not far from each other, it is evident that the ARIMAX model with an exogenous variable had smaller AICs: the exogenous variable with the highest correlation with the univariate variable (BG) had the lowest AIC, followed by the exogenous variable with the lowest correlation. The AICs of the models considered in the two regimes are arranged in ascending order as (DBG/DTLW, DBG/DBP, DBG) and (DBG*/DTLW*, DBG*/DBP*, DBG*) respectively. In both regimes, ARIMAX had the largest AIC. The considered risk metrics (i.e., MAE, RMSE and MSE) had smaller values for exogenous variables highly correlated with DBG. The linear correlation amongst the variables seems to have a significant impact on both the AIC and the risk metrics. Likewise, the AIC has some level of impact on the error metrics. This is evident in Table 5.

Discussion
From the analysis above, Fig 4.1 to Fig 4.8 are the time plots of the observed data for the two different time horizons, which show an upward pattern of growth in the oil spill data from BG, BP, CNE and TLW. In addition, the graphs depict heavy fluctuations and outliers in the observed oil spill data in the two time horizons.
Also, from the analysis above, Tables 2 and 3 show the linear correlation between the considered oil spills of the four oil companies in the two time regimes of 2005-2020 and 2007-2020 respectively. The results show that the volumes of oil spills from the four oil companies are not significantly correlated. None of the random walk tests of the considered oil spills in the Oil and Gas Industry was significant, with either homoskedastic or heteroskedastic errors (see Tables 4.3 to 4.5).

Summary
The summary of the findings as well as the conclusion are presented in this chapter. Recommendations suggested by the researcher are also included in this section, which provides a framework of how stakeholders in the financial industry can improve upon in-sample stock forecasting accuracy.
From the background to the problem, the objectives of the research study and the data used and analysed, the researcher has established the status of, and how to improve, the in-sample forecasting accuracy of oil spills using the ARIMAX models with and without an exogenous variable.
With reference to the first objective of this study, it is empirically evident that the ARIMAX model with an exogenous variable (LOG-ARIMAX) performed creditably well in all cases and scenarios as outlined in chapter four. This emphasizes that, when improving the in-sample forecasting accuracy of oil spills using the Box-Jenkins model, it is in order to incorporate an exogenous variable to further augment the accuracy of the in-sample forecast. In this study, historical adjusted oil spills recorded by four Oil and Gas companies in Nigeria were used as possible exogenous variables or as public information.
On the other hand, the linear correlation with the exogenous variable in the ARIMAX model did very little to improve the in-sample forecasting accuracy in the scenarios considered in this study. In most cases, the high and low linear correlation between oil spills of candidate models only gave a signal of the corresponding Akaike Information Criterion (AIC) value. High correlation in most cases gave a lower value of the AIC, and vice versa. However, this pattern was not consistent. Evidently, the Diebold and Mariano test of accuracy is dependent on the AIC of the candidate models. In most cases, smaller AIC values tend to minimize the considered error metrics (i.e., MAE, RMSE and MSE), and vice versa. This is evident throughout the results. The linear correlation, on the other hand, had little or no impact on the performing models.
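The Diebold and Mariano comparison of two sets of forecasts can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses squared-error loss, one-step-ahead forecasts (so only the lag-0 variance of the loss differential is used) and a normal approximation for the p-value. The function name and all numbers are hypothetical.

```python
import math
from statistics import NormalDist

def diebold_mariano(actual, forecast_1, forecast_2):
    """DM statistic and two-sided p-value under squared-error loss.

    Assumes one-step-ahead forecasts, so the long-run variance of the
    loss differential reduces to its lag-0 variance.
    """
    d = [(a - f1) ** 2 - (a - f2) ** 2
         for a, f1, f2 in zip(actual, forecast_1, forecast_2)]
    n = len(d)
    d_bar = sum(d) / n
    var_d = sum((x - d_bar) ** 2 for x in d) / n
    if var_d == 0:
        raise ValueError("loss differential has zero variance")
    dm = d_bar / math.sqrt(var_d / n)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(dm)))
    return dm, p_value

# Toy values for illustration only (not from the study).
actual = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
forecast_1 = [10.5, 11.5, 11.5, 12.5, 12.5, 13.5]
forecast_2 = [11.0, 13.5, 9.0, 14.0, 10.5, 16.0]

dm_stat, p_value = diebold_mariano(actual, forecast_1, forecast_2)
```

A negative statistic indicates that the first set of forecasts has the lower average squared error; with samples as small as this toy example the normal approximation is rough and the result is indicative only.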
The Box-Jenkins method with or without an exogenous variable supports the semi-strong form of the Efficient Market Hypothesis (EMH). Thus, the information set comprising the past and current oil spills and all publicly available information supports the EMH in its semi-strong form. Timmermann and Granger (2004), in their paper "Efficient market hypothesis and forecasting", argued that traditional time series forecasting methods relying on individual forecasting models or stable combinations of these are not likely to be useful. This in one way or another confirms our finding that the Log-ARIMAX model is an improvement on the ARIMAX model in most cases.

V. Conclusion
This study proposes a hybrid ARIMAX model to capture and accommodate both the external covariate(s) and the heavy-tailed properties of observational time series events, using secondary long-memory datasets of oil spillage. The results of the analysis show that the hybridization of the logarithm and ARIMAX (LOG-ARIMAX) as propounded in this work is more robust, efficient, sufficient and reliable in forecasting long-memory data characterized by heavy-tailed traits.