Compatible deep neural network framework with financial time series data, including data preprocessor, neural network model and trading strategy

Mohammadmahdi Ghahramani (1), Hamid Esmaeili Najafabadi (2)
1 Mathematics Department, University of Padova, Padua, Italy
2 Electrical and Computer Engineering Department, University of Calgary, Calgary, Canada
muhammadmahdi0007@gmail.com, hamid.esmaeilinajafa@ucalgary.ca
arXiv:2205.08382v1 [cs.LG] 11 May 2022

Abstract

Experience has shown that trading in stock and cryptocurrency markets has the potential to be highly profitable. In this light, considerable effort has recently been devoted to investigating how machine learning and deep learning can be applied to interpret and predict market behavior. This research introduces a new deep neural network architecture and a novel idea of how to prepare financial data before feeding them to the model. In the data preparation part, the first step is to generate many features using technical indicators and then apply the XGBoost model for feature engineering. In the second step, we split the data into three categories and use separate autoencoders to extract high-level mixed features. This data preprocessing is introduced to predict price movements. Regarding modeling, different convolutional layers, a long short-term memory unit, and several fully-connected layers have been designed to perform binary classification. This research also introduces a trading strategy to exploit the trained model outputs. Three different datasets are used to evaluate this method, and the results indicate that this framework can provide profitable and robust predictions.

Introduction

Acting on cryptocurrency and stock markets has recently gained a great deal of popularity. People's curiosity about financial keywords such as Bitcoin (BTC), Ethereum (ETH), Tether (USDT), cryptocurrency, technical analysis, etc., has increased significantly; according to Google Trends data, this interest grew by 200 percent in some Asian countries during the last year. Apart from people's willingness, researchers have also been seeking to develop efficient methods to deal with the complexity of financial markets. Focusing more on modeling, a significant number of studies have been conducted on how to predict the price movement direction using machine learning algorithms and deep neural networks (DNNs). Thanks to advances in neural networks, the dream of forecasting financial time series is now much closer to reality.

Generally speaking, when it comes to financial data, there are two main problems we should address. The first is to introduce robust features through which the market can be correctly represented [1], and the second is market complexity and volatility [2]. Because of market elements' nonlinear relationships [3–5], it is not
surprising that deep neural networks are widely used to predict market fluctuations. They vary from fully-connected neural networks [6] to convolutional neural networks (CNNs) [7–10], recurrent neural networks (RNNs) [2, 11] and long short-term memory (LSTM) [12–14]. The critical point, however, is that instead of spending too much time designing complex neural network architectures, we should care equally about extracting valuable features.

This study is divided into two parts: data preparation and modeling. For each step, this research tries to provide intuition that keeps it close to what actually happens in the stock or cryptocurrency markets. Feature generation (FG), data denoising (DD), feature engineering (FE), data splitting (DS), creating a representative dataset (CD), and using auto-encoders (AE) are the six crucial steps of the first part. The modeling part involves convolutional layers, LSTM units, and fully-connected layers for binary classification. Since robustness has always been a concern in financial tasks, the performance of the proposed method is evaluated on three different datasets: four-hour data of the BTC and ETH coins from the cryptocurrency market and daily data for the S&P500 index.

For a trader, it is imperative to use technical analysis correctly according to the current market state. Traders cannot afford to rely on only one specific indicator from which they might have continuously gained profit, e.g., the exponential moving average (EMA). Although it may have brought them significant profit in one market state, it quickly loses its performance as the market state changes. They also need to utilize other tools, such as price channels, or other indicators like the relative strength index (RSI). This is indeed the quality distinguishing amateur traders from professionals. Taking advantage of this intuition, it quickly becomes clear that feature engineering should be dynamic. Gathering a set of features that represents all or most corresponding states, not only for different markets but also for different seasons of a single market, is critical. In this light, we first generate numerous potential features from open, high, low, close prices and volume data (OHLCV) to design a dynamic feature extractor. Then, we apply a wavelet transform to denoise the generated features [5, 15]. Finally, feature engineering is applied to the generated features since their importance levels are still unknown [16].

Neural network models are notably susceptible to feature quality. If they are fed with low-quality features, their performance is unlikely to be as desired. Here, the idea of using auto-encoders is introduced to ensure that only high-quality features are fed to our primary model. Auto-encoders have been used in other studies both to extract high-quality features [5] and to denoise data [17]. However, the approach used in this paper differs from these studies since we design customized CNN-based auto-encoders [18] and use two auto-encoders instead of one. An auto-encoder aims to learn a compressed, distributed representation for a dataset or to reduce data dimensionality, which can give the proposed model an edge. An auto-encoder's performance depends on feature correlation: a higher degree of correlation between the input features boosts its performance. When it comes to financial markets, features are not necessarily correlated enough.
Hence, feeding them all directly into an auto-encoder would not result in extracting high-quality features. To address this challenge, we consider the data splitting step. Before feeding the split data to separate auto-encoders, it is worth deliberating on how traders trade: do they only look at one single candle, or do they tend to review the price path when deciding to buy or sell? This intuition tells us that a single candle should not be considered a training instance, but rather a set of candles capturing the recent trend. In this light, a new dataset is created in step CD, where sets of candles are considered training instances. The data is now ready to be fed to the auto-encoders.

The rest of this paper is about designing a neural network-based architecture to
binary-classify the prepared data coming from the preparation part. Even though some studies attempted to predict the price value directly using regression methods and then defined a trading strategy to exploit the results [5], it seems more efficient and closer to reality to predict the price movement direction instead of the price value. This paper also considers a trading strategy part that determines how the model should be applied in our financial activities to exploit the model outputs efficiently. This part has rarely been considered in other studies, which is precisely why they mainly cannot be used by organizations and traders to earn profit from financial markets. It first explains several unrealistic assumptions which may make us wrongly evaluate our model and then suggests intuitive solutions to address them. Secondly, it discusses indexes with which to correctly assess the model, because a highly accurate model does not necessarily result in a profitable trading system [12]; conversely, not every low-accuracy model is useless.

Related works

Access to high-performance GPUs and hardware, along with advances in deep learning and machine learning, has given researchers the opportunity to utilize a wide range of deep learning methods and techniques. Consequently, standard deep learning methods that are typically used in specific areas have been tested on financial time series data. Vahala [6] reviews the performance of fully-connected layers to predict financial markets. Focusing on feature extraction, Xiao Ding et al. [19] use a deep convolutional network to improve prediction accuracy. Using the same concept, Jia Wang et al. [7] propose one-dimensional convolutional neural networks with the aim of market movement forecasting. With respect to the nature of financial data, RNNs and LSTM units are also widely used. Using these networks, Razvan Zanc et al. [12] conclude that accuracy is not a good indicator of model profitability, and high values of accuracy may not contribute positively to a profitable trading system. In another paper, Avraam Tsantekidis et al. [2] exploit RNNs on time-series data from limit order books. Justin Sirignano et al. [20] note that there is path-dependence in price dynamics. Wei Bao et al. [5] utilize stacked auto-encoders in combination with LSTM units to predict the one-step-ahead close price. In another work by Ali Caner Türkmen et al. [17], an array of features is generated and the performance of several machine learning algorithms is tested.

Even though various innovative ideas have been implemented in this field, financial data preprocessing and taking advantage of traders' intuitions are not adequately considered. This paper covers this topic by considering a dynamic feature engineering method. In related research, the intuition behind selecting a modeling method is explained; however, none of these works selects the model's parameters according to the common rules used in the market and among traders. As a second edge, not only does this research make an effort to construct an intuitive network, it also suggests setting the network's parameters intuitively. Finally, while traders would not deny the importance of having a trading strategy, defining one is another topic that other researchers have not considered. This work also proposes a strategy to turn the model's output into profit.
Materials and methods

1 Data preparation

Providing an informative set of data has always played a vital role in machine learning tasks. The message of this part is that gathering and engineering data for the market is as crucial as designing complex neural networks. One of traders' primary tools to analyze price movement is OHLCV data. The interaction between these features helps market actors to predict upward or downward trends in advance. Based on their importance, we start with a raw dataset including OHLCV data.

1.1 Feature generation

Financial markets are renowned for their complexity [21, 22]. This complexity comes from volatile supply and demand pressure, which determines the price. Consequently, traders encounter market states where price behavior is unpredictable, since different technical tools might provide them with opposite signals. This condition gets even more challenging when traders need to struggle with setting hyperparameters for their technical tools. Suppose it is desired to earn five percent worth of profit by buying Bitcoin, but there is uncertainty about whether this is the right time. One possible approach is to take a look at the BTC/USDT chart and analyze the price. Then, it must be figured out which set of indicators is better to use. Having done that, optimum parameters for the selected indicators should be set. For example, an EMA indicator with a window of 21 may provide a better prediction than one with a window of 35. To take this fact into account, this part generates a large number of indicators with different parameters, summarized in Table 1. Therefore, the new dataset may include several EMA columns, each with its own window. Consequently, variety in both indicators and parameters is provided.

1.2 Data denoising

People use exchanges to trade in financial markets. Anyone can trade on them, whether they are professionals or novices. As a result, financial markets get noisier than they already are. All market activities affect demand and supply pressure and, as a result, the price. It is important to consider only transactions that contain information. This research therefore suggests taking a data denoising step. However, handling all kinds of noise in financial markets is impossible. This paper recommends using wavelet transforms to convert the data into a smoother version of itself [23]. Wavelet transforms are widely used for signal denoising, especially in electrical engineering. Since our time-series data can be treated as a signal, the transform can be employed for denoising [5]. Wavelet analysis was first developed in the early 1980s in seismic signal analysis, in the form of an integral transform with a localized kernel function with continuous parameters of dilation and translation. Wavelets have been applied to various kinds of one-dimensional or multidimensional signals, for example, to identify an event or to denoise the signal. Apart from data denoising, applying wavelet transforms protects the model against over-fitting.
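As an illustration of the FG and DD steps, the following is a minimal sketch assuming pandas for indicator generation and PyWavelets for wavelet denoising (the paper does not fix either library); the indicator subset, window values, and soft-thresholding rule are illustrative assumptions rather than the exact configuration used in this study.

```python
import numpy as np
import pandas as pd
import pywt


def generate_features(ohlcv: pd.DataFrame) -> pd.DataFrame:
    """FG step: add many indicator columns, each with several window parameters."""
    feats = ohlcv.copy()
    delta = ohlcv["close"].diff()
    for window in (9, 14, 21, 28, 35):
        feats[f"EMA_close_{window}"] = ohlcv["close"].ewm(span=window).mean()
        feats[f"MA_close_{window}"] = ohlcv["close"].rolling(window).mean()
        gain = delta.clip(lower=0).rolling(window).mean()
        loss = (-delta.clip(upper=0)).rolling(window).mean()
        feats[f"RSI_close_{window}"] = 100 - 100 / (1 + gain / loss)
        feats[f"ROC_close_{window}"] = ohlcv["close"].pct_change(window) * 100
    return feats.dropna()


def wavelet_denoise(series: np.ndarray, wavelet: str = "haar", level: int = 2) -> np.ndarray:
    """DD step: soft-threshold the detail coefficients and reconstruct a smoother series."""
    coeffs = pywt.wavedec(series, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745            # rough noise estimate
    threshold = sigma * np.sqrt(2 * np.log(len(series)))      # universal threshold
    coeffs[1:] = [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(series)]
```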
Table 1. List of some common technical indicators used to generate features

ADX (Average directional index): determines when the price is trending strongly
ATR (Average true range): volatility indicator that shows how much an asset moves
BOLL (Bollinger bands): two standard deviations above and below a simple moving average
CCI (Commodity channel index): difference between the current and the historical average price
ChaikinAD: anticipates directional changes in the accumulation distribution
EMA: exponential moving average
GARCH: estimates the volatility of returns for stocks
MA: moving average
MACD (Moving average convergence divergence): trend-following momentum indicator
MFI (Money flow index): uses both price and volume to measure buying and selling pressure
MOM (Momentum indicator): measures a security's rate of change
ParabolicSAR: determines the direction in which an asset is moving
ROC (Rate of change): calculates the percent change in price between periods
RSI (Relative strength index): evaluates overbought or oversold conditions
Stochastic: momentum indicator that uses support and resistance levels
VWMA: volume weighted moving average
Wilder: measures a trend's strength and direction
Williams: measures overbought and oversold levels

The table shows that various indicators and oscillators are considered in order to cover the approaches commonly used when analyzing financial markets.

1.3 Feature engineering

One of the most essential steps in every machine learning task is feature engineering. Feature engineering is the process of using domain knowledge to extract features from raw data. This part cares about transforming raw data into components that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. So far, we have generated features and applied wavelet transforms for denoising. These two basic steps are not adequate to prepare the data. Adding several indicators to a price chart is simply feature generation, which even amateur traders can do. They may also ignore some upper or lower shadows of candles and imagine a smoother version of the chart. Since the mere use of these steps does not equip them with robust and dependable forecasting, it is no wonder that the FG and DD steps alone do not give us a significant edge. This part introduces a reliable method to extract valuable features. The paper divides the FE step into two sections: the first is explained here, and the second is covered in the auto-encoder part.

The market contains different complex states, and we must use a dynamic feature engineering method to represent them. To design it, this study takes advantage of another intuition. Financial market participants tend to consult experts and use their insights before investing in financial markets. In our case, there is a machine learning method whose primary duty is to determine feature importance. This paper selects the XGBoost classifier [24] to play the expert role, since tree-based machine learning algorithms are not sensitive to the quality of every single feature and can ignore features carrying little information [25]. This property makes the XGBoost classifier an excellent candidate for calculating feature importance. Each time it is fed with
different data, the feature importance changes. Fig 1 is a snapshot that illustrates what XGBoost's output looks like. Having access to the importance of features, we can simply select, for example, the top 25 features and reduce the number of columns to those whose importance has been endorsed by the XGBoost model.

Fig 1. Feature importance produced by XGBoost classifier. The figure illustrates the top 25 features, ranked according to their importance produced by XGBoost for daily data of the S&P500 market.

1.4 Data splitting

Before proceeding with auto-encoders and finishing the data preparation part, this research ensures that the auto-encoders are fed as expected. Auto-encoders expect to receive an adequately correlated set of features, but the current XGBoost-passed features are not necessarily correlated enough with each other. To address this challenge, this study suggests splitting the dataset into correlated bunches of features. It categorizes columns into three main groups named OHLCV, price-like, and non-price-like. The OHLCV group, as its name suggests, includes the open, high, low, close, and volume columns. The rest of the features are split into two groups. By the price-like group, we refer to a group whose features have the same unit as the price; in a nutshell, they can be plotted on a price chart. This group may include different kinds of moving average indicators or any other features which are reported in dollars. By contrast, the last group, the non-price-like group, has features whose units are not the same as the price unit. This group may include oscillators like RSI and CCI, which are dimensionless. Since we would not like to lose any information from our primary group, the OHLCV data is not passed through an auto-encoder. However, two separate auto-encoders are designed for the price-like and non-price-like groups to extract high-level features.
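A minimal sketch of the FE and DS steps is given below. The top-25 cut-off mirrors Fig 1, while the XGBoost hyperparameters and the prefix rule used here to separate price-like from non-price-like columns are illustrative assumptions.

```python
import pandas as pd
from xgboost import XGBClassifier

OHLCV = ["open", "high", "low", "close", "volume"]
PRICE_LIKE_PREFIXES = ("EMA", "MA", "BOLL", "VWMA", "Parabolic")  # columns reported in price units


def select_top_features(X: pd.DataFrame, y: pd.Series, k: int = 25) -> list:
    """FE step: rank columns by XGBoost feature importance and keep the top k."""
    model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    model.fit(X, y)                              # y: binary up/down labels encoded as 0/1
    importance = pd.Series(model.feature_importances_, index=X.columns)
    return importance.sort_values(ascending=False).head(k).index.tolist()


def split_feature_groups(selected: list) -> dict:
    """DS step: split the selected columns into the three groups used later."""
    price_like = [c for c in selected if c.startswith(PRICE_LIKE_PREFIXES)]
    non_price_like = [c for c in selected if c not in price_like and c not in OHLCV]
    return {"ohlcv": OHLCV, "price_like": price_like, "non_price_like": non_price_like}
```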
1.5 Create Representative Dataset

This section provides an intuitive method to form our training dataset. The price path is much more important than the current candle information. The fact that technical analysis, such as candlestick patterns, is all about tracking the price path is substantial evidence for why this study considers this step.

This research suggests defining each training instance as a set of previous consecutive candles. For example, if we set training samples to consist of 24 candles, the dimension of every sample would be number of features × 1 × 24, representing the number of channels, height, and width, respectively. Fig 2 illustrates how training instances are created from the dataset. These samples will soon be used to feed the auto-encoders.

Fig 2. Define training instances.
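The CD step can be sketched as follows, turning a feature matrix of shape (time, features) into instances of shape (features, 1, window); the non-overlapping stride of one window per instance mirrors Fig 2 and is an assumption of this sketch.

```python
import numpy as np


def make_instances(features: np.ndarray, window: int = 24) -> np.ndarray:
    """Group `window` consecutive candles into one training instance."""
    n_instances = features.shape[0] // window
    samples = []
    for i in range(n_instances):
        chunk = features[i * window:(i + 1) * window]   # (window, n_features)
        samples.append(chunk.T[:, np.newaxis, :])       # (n_features, 1, window)
    return np.stack(samples)                            # (n_instances, n_features, 1, window)
```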
1.6 Auto-encoder

This part describes the architecture of the auto-encoders whose job is to extract high-level features. Their usage is not limited to feature extraction; they can also be employed, for example, to predict bankruptcy [26]. As mentioned earlier, this section is the second step associated with feature engineering. The study designs two auto-encoders, one for price-like features and another for non-price-like features.

As their name suggests, auto-encoders are a specific type of neural network where the input is the same as the output. This kind of learning is categorized as an unsupervised technique since it does not need explicit training labels. More precisely, auto-encoders are self-supervised because they generate their own labels from the training data. Auto-encoders attempt to reconstruct the input data such that there is a minimum difference between the original input data and the reconstructed output. An auto-encoder consists of three components named the encoder, the latent-space representation, and the decoder. The encoder compresses the input and produces the latent-space representation, also called the code. The decoder then reconstructs the input using only this code. The applications of auto-encoders mainly include dimensionality reduction [27], latent-space extraction as high-level compressed features, and anomaly detection [28]. It is essential to feed auto-encoders with correlated data; this paper cares about this correlation and takes measures to correctly account for it. Fig 3 shows the most typical version of an auto-encoder, including encoder, code, and decoder. Even though most auto-encoders are designed as in Fig 3, this study constructs the auto-encoders with convolutional layers, since the shape of the input data requires such layers both for the encoder and for the decoder.

Fig 3. Typical auto-encoder. The figure illustrates the three parts of an auto-encoder. The input and output data are the same as each other. The encoder layer converts the input data into the code, and the decoder layer reconstructs the input data using the code. Once the auto-encoder is trained, the code, which represents the input data, is used instead of the input data for further processing.

Instead of hidden neural layers, we use one-dimensional convolutional layers (Conv1d layers) to encode the input data and decode the code. This paper uses the same auto-encoder architecture for price-like and non-price-like features, applied separately to their corresponding features.

1.6.1 Encoder

The encoder consists of Conv1d layers along with max pool and up sampling layers. The encoded data, red-colored in Fig 4, has the same height and width as the input data; the encoder only reduces the channel size. Max pool and up sampling layers make the auto-encoder learn better, so this paper incorporates up sampling layers in addition to max pool layers. More importantly, keeping the width and height unchanged requires us to use both max pool and up sampling layers so that they negate each other.

Fig 4. Encoder architecture (Conv 1, MaxPool 1, Conv 2, Upsample 1, Conv 3, MaxPool 2). This figure shows the code inside the encoder layer just for illustration purposes.

Different arguments for Conv1d layers result in different dimensions for the data. These layers must ensure that the dimension of the code differs from the input data only in channel size. This study manipulates the stride and padding arguments of the Conv1d layers to keep the data width constant. Eq 1 explains how these arguments relate to each other
as follows:

    L_out = floor( (L_in + 2·padding − dilation·(kernel_size − 1) − 1) / stride ) + 1,    (1)

where L_out and L_in are the lengths of the output and input data of the Conv1d layer, respectively. The dilation is the spacing between kernel elements and is set to one by default. If the stride is set to one and the padding size is calculated according to Eq 2,

    padding = (kernel_size − 1) / 2,    (2)

the layer does not change the data width. Fig 5 provides an example of how the encoder layers change the data dimensions.

Fig 5. Data dimension when passing through the encoder: (#Features, 1, 24) → Conv 1 → (9, 1, 24) → MaxPool 1 → (9, 1, 12) → Conv 2 → (6, 1, 12) → Upsample 1 → (6, 1, 24) → Conv 3 → (3, 1, 24) → MaxPool 2 → (3, 1, 12). The figure shows the encoder part of an auto-encoder, applied to the non-price-like features of the S&P500 market data.

1.6.2 Decoder

For the decoder part, similar layers are used, with the difference that here we reconstruct the code using the opposite setup for the Conv1d, max pool, and up sampling layers. As in the encoder part, the Conv1d layers of the decoder do not change any dimension but the channel size, and the max pool and up sampling layers behave as in the encoder. Fig 6 shows the decoder architecture used in this study. The only difference is that the last layer is a sequence of an up sampling and a Conv1d layer. Up sampling layers have no weights to update, so during backpropagation there is nothing associated with these layers to change; ending with such a layer, the auto-encoder would not be trained. That is why Fig 6 places the Conv F layer right after the up sampling layer. Fig 7 provides an example of how the decoder layers change the data dimensions.

Fig 6. Decoder architecture (Conv 4, Upsample 2, Conv 5, MaxPool 3, Conv 6, Upsample 3 & Conv F).
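As a concrete illustration of the encoder and decoder described above, the following is a minimal PyTorch sketch of the Conv1d auto-encoder. The channel sizes follow the non-price-like example in Figs 5 and 7, the height dimension of one is squeezed out so that Conv1d operates on (channels, width), and the kernel sizes and ReLU activations are assumptions; padding is chosen via Eq 2 so the convolutions keep the width unchanged.

```python
import torch
import torch.nn as nn


def same_width_conv(in_ch: int, out_ch: int, kernel_size: int = 3) -> nn.Conv1d:
    # Eq 1 with stride = 1 and padding from Eq 2: the width is left unchanged.
    return nn.Conv1d(in_ch, out_ch, kernel_size, stride=1, padding=(kernel_size - 1) // 2)


class ConvAutoEncoder(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.encoder = nn.Sequential(
            same_width_conv(n_features, 9), nn.ReLU(), nn.MaxPool1d(2),     # width 24 -> 12
            same_width_conv(9, 6), nn.ReLU(), nn.Upsample(scale_factor=2),  # width 12 -> 24
            same_width_conv(6, 3), nn.ReLU(), nn.MaxPool1d(2),              # code: (3, 12)
        )
        self.decoder = nn.Sequential(
            same_width_conv(3, 6), nn.ReLU(), nn.Upsample(scale_factor=2),
            same_width_conv(6, 9), nn.ReLU(), nn.MaxPool1d(2),
            same_width_conv(9, n_features), nn.ReLU(),
            nn.Upsample(scale_factor=2), same_width_conv(n_features, n_features),  # Conv F
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, n_features, window); training minimizes e.g. nn.MSELoss()(recon, x),
        # and after training only `code` is kept as the high-level feature representation.
        code = self.encoder(x)
        return self.decoder(code), code
```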
Fig 7. Data dimension when passing through the decoder: (3, 1, 12) → Conv 4 → (6, 1, 12) → Upsample 2 → (6, 1, 24) → Conv 5 → (9, 1, 24) → MaxPool 3 → (9, 1, 12) → Conv 6 → (#Features, 1, 12) → Upsample 3 & Conv F → (#Features, 1, 24). The figure shows the decoder part of an auto-encoder, applied to the non-price-like features of the S&P500 market data.

1.7 Overview of the data preparation part

To sum up the first part of this study, data preparation, Fig 8 provides a schematic overview of the sequence of steps taken so far. To prevent confusion between the auto-encoders' learning process and the primary model's learning process, this first part did not explain how to label the training instances; this is reviewed in detail in the second part.

Fig 8. Data preparation part overview: raw OHLCV data → feature generation → wavelet transform (denoising) → feature engineering using XGBoost → data splitting into OHLCV, price-like, and non-price-like features → create a representative dataset for each group → auto-encoder one (price-like) and auto-encoder two (non-price-like). The black-colored data in the figure is the prepared data that is ready to be fed to our main model.

2 Modeling

Preparing the data in an accurate manner is important, but putting together a reliable market forecasting network is equally important. Traders may have access to plenty of data but still cannot decide whether to buy or sell. This part ensures that there is a deep neural network in charge of forecasting the price movement direction for the one-step-ahead closing price. Similarly to the data preparation part, this part uses intuitions culled from financial markets to develop the model. Firstly, it explains how to label training instances. Once done, it constructs the body of the deep network in accordance with intuitive reasons.

2.1 Labeling

This step follows a straightforward labeling method. To label the first training instance, we compare the closing price of the window-th candle to the closing price of the (window+1)-th candle. If the price went up, the label +1 is assigned, and in the case that the price
went down, the label -1 is assigned to this instance. Eq 3,

    L_n = F(P_(n·window), P_(n·window+1)),  with  F(p, p′) = Sign((p′ − p) / p′),    (3)

is the general formula used to label the training samples, where L_n ∈ {−1, +1} is the label assigned to the n-th training instance and P_i is the closing price of the i-th candle.

2.2 Network architecture

This section introduces and designs a deep network through which the binary classification task is performed. The network is divided into three elements. The first element includes three separate lines of convolutional layers (Conv element), each applied to the OHLCV, price-like, and non-price-like features. The next element is a many-to-one LSTM unit (LSTM element), which detects and learns the sequential patterns in the data. Finally, two fully-connected layers (FC element) are considered to binary-classify the data instances.

2.2.1 Conv element

Before drilling down into the detailed construction of the convolutional layers, it is worth reviewing why we should consider this element. We are required to design such an element due to the concept of the time frame in financial markets. To better understand time frames, it is important to become familiar with tick data. A tick is a measure of the minimum upward or downward movement in the price of a security. Each candle may contain thousands of ticks, generally reported by their quantity and price. Although we can access the tick data, price charts avoid visualizing them because the number of ticks may exceed a hundred thousand for one day. Instead, a large number of ticks are summarized into one candle. This is done by determining equal intervals and aggregating all ticks within each interval to calculate the open, high, low, and close prices. This time interval is the time frame referred to previously. For example, when we set the time frame to one hour, all tick data in this interval are used to form the corresponding candle. Most of the platforms that provide analysis tools allow us to change this time frame; it can range from one minute to one month.

The intuition is that traders do not rely on just one time frame to open a position. They may analyze BTC/USDT charts on different time frames and try to find a pattern. Most importantly, after detecting a pattern, they try to obtain confirmation from low-frequency time frames such as the daily time frame. They may find a good-looking price channel for BTC/USDT in the four-hour time frame, inside which the price is fluctuating between the upper and lower lines. Suppose the price breaks out of the upper line, breaking its resistance, and it seems an excellent time to open a long position in the futures market or a buy position in the spot market. However, traders do not open any position until they get strong confirmation that this breakout is not fake. One way to obtain confirmation is to change the time frame and observe the price position in the daily time frame. Generally, financial events such as breaking out of a price channel or pulling back to broken resistance or support lines are more valid on low-frequency time frames like daily or weekly time frames.
This section considers convolutional layers whose duty is to reduce the width of every training instance. Recall that the width is the number of consecutive candles through which we want to analyze the price path. Reducing this number is exactly like moving to a lower-frequency time frame. This paper designs three separate lines, each containing one max pool and two one-dimensional convolutional layers, for the OHLCV, price-like, and non-price-like data. Fig 9 shows the detailed construction of these lines; it also illustrates that we finally stack the outputs together for the LSTM element. Fig 10 shows how these layers and lines change the data dimensions.

Fig 9. Conv element - the first element of the model (Conv 1, MaxPool, Conv 2 per line). The green, red, and blue colored data are the OHLCV, price-like, and non-price-like data, respectively.

Fig 10. Data dimension when passing through the Conv element: OHLCV (5, 1, 24) → Conv 1.1 → (5, 1, 20) → MaxPool → (5, 1, 10) → Conv 2.1 → (5, 1, 8); price-like (3, 1, 24) → Conv 1.2 → (3, 1, 20) → MaxPool → (3, 1, 10) → Conv 2.2 → (3, 1, 8); non-price-like (3, 1, 24) → Conv 1.3 → (3, 1, 20) → MaxPool → (3, 1, 10) → Conv 2.3 → (3, 1, 8); the three outputs are stacked together into (11, 1, 8). The figure shows a Conv element applied to training instances of S&P500 market data. Reducing the width from 24 to 8 could be considered like changing the time frame from 8h to daily.
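A minimal PyTorch sketch of the Conv element follows. The kernel sizes of 5 and 3 are assumptions chosen so that the widths match Fig 10 (24 → 20 → 10 → 8); each line keeps its channel count and the three outputs are stacked channel-wise before the LSTM element.

```python
import torch
import torch.nn as nn


class ConvElement(nn.Module):
    def __init__(self, ch_ohlcv: int = 5, ch_price: int = 3, ch_non_price: int = 3):
        super().__init__()

        def line(channels: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=5), nn.ReLU(),  # width 24 -> 20
                nn.MaxPool1d(2),                                          # width 20 -> 10
                nn.Conv1d(channels, channels, kernel_size=3), nn.ReLU(),  # width 10 -> 8
            )

        self.ohlcv_line = line(ch_ohlcv)
        self.price_line = line(ch_price)
        self.non_price_line = line(ch_non_price)

    def forward(self, ohlcv, price_like, non_price_like):
        # each input: (batch, channels, window); output: (batch, 11, 8) for the defaults
        return torch.cat(
            [self.ohlcv_line(ohlcv), self.price_line(price_like), self.non_price_line(non_price_like)],
            dim=1,
        )
```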
2.2.2 LSTM element

In financial markets, the price path is much more important than the current absolute price, because events on previous candles may affect price behavior on the next candles. When determining static or dynamic support and resistance lines, traders use previous candles, since they believe the information coming from previous candles can be used to analyze price fluctuations. RNNs [29, 30] are designed precisely to address such issues, in which information from the past is also essential for the current prediction. Hidden states are in charge of carrying this information forward. One of the disadvantages associated with RNNs is gradient vanishing when backpropagating through time [31, 32], which causes long-term dependencies not to be detected. To improve the ability of RNNs to also detect long-term dependencies, LSTM units were introduced [33]. However, LSTM is not the only way to address the memory problem: gated recurrent units (GRUs), a simpler version of LSTM units [34], address the same problem as well. The idea behind both LSTM and GRU is to consider a memory cell that determines how much information should be memorized and passed to the next units. LSTM, with three gates named the forget gate, the update gate, and the output gate, and GRU, with one gate called the update gate, handle this memory transfer. Fig 10 indicates that the data sequence is not long enough to cause gradient vanishing; however, this paper prefers to use LSTM units to detect chronological patterns. This entails more computation but ensures that the required information is passed to the next units as much as possible.

Compared to RNNs, LSTM takes advantage of a memory cell c<t> to carry information. c<t> is calculated by weighting the previous memory cell c<t−1> and a candidate cell c̃<t>, which is related to the current LSTM unit. The update and forget gates handle the weighting task: they specify how much memory from previous units must be carried over and how much information from the current unit must be considered. Having c<t> calculated, the output gate is used to calculate the current hidden state a<t>. Finally, a<t> is used to make a prediction. As LSTM units are consecutive, a<t> and c<t> are passed to the next LSTM unit, and this process continues until the last unit. The following equations describe the order in which each of the above elements is calculated:

    c̃<t> = tanh(W_c [a<t−1>, x<t>] + b_c),    (4)
    Γ_u = σ(W_u [a<t−1>, x<t>] + b_u),    (5)
    Γ_f = σ(W_f [a<t−1>, x<t>] + b_f),    (6)
    Γ_o = σ(W_o [a<t−1>, x<t>] + b_o),    (7)

where the W's and b's are the weight matrices, updated through the backpropagation step, and the biases, respectively. Γ_u, Γ_f and Γ_o are the update gate, forget gate and output gate, respectively, and σ is the sigmoid activation function. Using these gates along with the candidate cell c̃<t>, Eq 8 and Eq 9 calculate c<t> and a<t> to be passed through the rest of the LSTM units as follows:

    c<t> = Γ_u ∘ c̃<t> + Γ_f ∘ c<t−1>,    (8)
    a<t> = Γ_o ∘ tanh(c<t>).    (9)
In each unit, the corresponding value of the forget gate determines how much information is better forgotten. If it is close to 1, the previous memory cell contains valuable information from the past and should be carried into the current memory cell. If it is close to 0, the major part of the current memory cell is calculated from the candidate cell. Γ_u, Γ_f and Γ_o are not constant for all LSTM units; they are vectors containing the sigmoid outputs for each LSTM unit. It is worth mentioning that gates, hidden states, memory cells, and candidate cells are all vectors, so Eq 8 and Eq 9 use element-wise products between these vectors. As the last step, if we wish to make a prediction at the current LSTM unit, Eq 10,

    ŷ<t> = g(W_y a<t> + b_y),    (10)

is employed to return the output, where g is an activation function. Since LSTM units are not generally used to directly binary-classify the data instances, most of the time, as in this paper, g is a softmax activation function. Fig 11 illustrates a schematic overview of a classic LSTM unit. There are other kinds of LSTM units, such as peephole connections, which apart from a<t−1> and x<t> also consider c<t−1> to calculate the gate values in Eq 5, Eq 6 and Eq 7 [35]. This study takes advantage of a many-to-one LSTM architecture, shown in Fig 12.

Fig 11. Single LSTM unit architecture (forget, update, and output gates operating on the previous cell state, the previous hidden state, and the current input to produce the new cell state, hidden state, and output).

The output length can vary according to the hidden states' dimension. This dimension is a hyperparameter to set. This paper designs LSTM units whose output has the same length as the hidden states. Fig 13 illustrates how this process happens.

2.2.3 FC element

As the last element of the model, fully-connected layers binary-classify the LSTM output vector. Despite their name, the layers used in this section are not strictly fully-connected, since dropout layers are used to protect the model from over-fitting [36]. Two fully-connected layers are used to determine whether the next close price goes up or down. Fig 14 illustrates applying these layers to the LSTM output on actual data.
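The LSTM and FC elements can be sketched in PyTorch as below. The sequence length of 8, the input size of 11, the hidden size of 20, and the 20 → 10 → 1 fully-connected head follow the S&P500 example in Figs 13 and 14; the dropout rate and its placement are assumptions.

```python
import torch
import torch.nn as nn


class LSTMClassifierHead(nn.Module):
    def __init__(self, n_channels: int = 11, hidden_size: int = 20, dropout: float = 0.2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_channels, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 10), nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(10, 1), nn.Sigmoid(),
        )

    def forward(self, conv_out: torch.Tensor) -> torch.Tensor:
        # conv_out: (batch, channels, width) from the Conv element, e.g. (batch, 11, 8)
        sequence = conv_out.permute(0, 2, 1)   # -> (batch, 8, 11): 8 steps of 11 features each
        _, (h_n, _) = self.lstm(sequence)      # many-to-one: keep only the last hidden state
        return self.fc(h_n[-1])                # (batch, 1): probability that the close goes up
```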
Fig 12. LSTM element - the second element of the model: a many-to-one chain of L LSTM units fed with the sequence x<1>, ..., x<L>; x<t> and a<t> are the data sequence and hidden states, respectively.

Fig 13. Data dimension when passing through the LSTM element. The figure shows the implementation of the LSTM element on S&P500 market data: the stacked Conv-element output of shape (11, 1, 8) is unrolled into eight steps of length 11, and the output and hidden-state vectors both have length 20.

2.3 Model quality check

When it comes to financial time series data, the model is occasionally not trained at all. When backpropagating, the loss decreases, but not significantly enough to claim the model is trained. In other words, the model loss converges too soon. For example, you may consider 1000 epochs to train your model, but the model loss decreases only until the 200th epoch, and for the rest of the epochs there is no significant reduction. This section discusses how the training process can be checked.

In financial time series, both model over-fitting and under-fitting are issues. The concern of over-fitting is abated by training the model on a high number of data instances and by using denoisers and dropout layers in the network. This part mainly introduces an index by which we can make sure our model is not under-fitted and can detect patterns. The proposed index for quality-checking the training process is obtained by reviewing the loss function. Eq 11,

    L(y, ŷ) = −(1/m) Σ_{i=1..m} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ],    (11)

is the binary cross-entropy loss function widely used in binary classification tasks, where y_i ∈ {0, 1}, ŷ_i ∈ (0, 1) and m are the actual label of the i-th data instance, the predicted label
Fig 14. FC element - the third element of the model. The figure shows the implementation of the FC element on S&P500 market data: the LSTM output of length 20 passes through fully-connected layers with ReLU activations down to a single sigmoid output.

of the i-th data instance, and the number of data instances, respectively. Through backpropagation, the loss function is supposed to decrease. The equation says, on average, how wrongly the model is classifying data instances. This paper first suggests taking the value of the converged loss function and then calculating its corresponding sigmoid value, obtained using Eq 12,

    σ* = 10^(−L_c),    (12)

where σ* is the sigmoid value corresponding to the converged loss function L_c. The σ* represents what sigmoid value, on average, the model would have if we supposed the whole data to be a single data instance with an actual label of 1. According to Eq 13, by putting a threshold on σ*, we can figure out whether the model is trained properly:

    Training quality = Well-trained if σ* ≥ ζ;  Under-fitted if σ* < ζ,    (13)

where ζ is an optional threshold for σ*. This paper uses a value between 0.8 and 0.85 for ζ, obtained by grid search. Suppose it is set to 0.8, equivalent to a converged loss of about 0.097 according to Eq 12. In this case, if we consider all data instances as a single data instance with label 1, the model classifies this single data instance with a sigmoid value of 0.8, which is confident enough to claim the model is trained. On the other hand, if the model loss converges to 0.5 too soon and does not change anymore, the model classifies this single data instance with a sigmoid value of roughly 0.31, which is not confident at all and is a sign of model under-fitting. Forcing the model to pass higher values of ζ results in over-fitting. Fig 15 illustrates the plotted loss function for a healthy trained model, which is neither over-fitted nor under-fitted.

Studies typically use for-loops in the training code to make the model iterate for a certain number of epochs. However, this research uses a while-loop instead: whenever the converged loss results in a σ* greater than ζ, it breaks the while-loop. Fig 16 illustrates the model training process.
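A minimal sketch of the quality check and the while-style training loop is shown below. The window size N, the slope threshold, and the epoch cap are assumptions, and `run_epoch` is a hypothetical placeholder for one training pass that returns the mean binary cross-entropy loss.

```python
import numpy as np


def sigma_star(converged_loss: float) -> float:
    """Eq 12: sigmoid value corresponding to a converged BCE loss."""
    return 10.0 ** (-converged_loss)


def has_converged(losses, n: int = 20, slope_threshold: float = -1e-4) -> bool:
    """Fig 16: average slope from the first of the last N losses to the others."""
    if len(losses) < n:
        return False
    window = losses[-n:]
    slopes = [(window[i] - window[0]) / i for i in range(1, n)]
    return float(np.mean(slopes)) > slope_threshold   # flat enough (close to zero) -> converged


def train_until_quality(run_epoch, zeta: float = 0.8, max_epochs: int = 5000) -> list:
    """Keep training until the model passes the quality check of Eq 13."""
    losses = []
    for _ in range(max_epochs):                       # safety cap around the while-style loop
        losses.append(run_epoch())                    # placeholder: one epoch, returns mean BCE loss
        if sigma_star(losses[-1]) >= zeta and has_converged(losses):
            break                                     # well-trained per Eq 13 and converged per Fig 16
    return losses
```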
Fig 15. Loss function value used to measure training quality. The figure plots the BCE loss function value and the corresponding σ* against the number of epochs for three categories of models: under-fitted, well-trained, and over-fitted. The green curve, the well-trained model, was trained on BTC/USDT data (four-hour time frame); its converged loss function is near 0.07, which means that if we consider all of the data instances as a single instance with label 1, the model correctly predicts this instance with a sigmoid value of 0.85.

3 Trading strategy

Through the data preparation and modeling parts, we classified our data instances to predict the next closing price direction. However, this does not mean the trained model is ready to act on financial markets. When using artificial intelligence (AI), we should carefully map the model outputs to our needs; otherwise, the outputs are not helpful for our real-world problems. Unrealistic assumptions made when virtually testing the model's performance do not help us detect weaknesses: while we expect a sixty-percent accurate model to be profitable, it may still lose our margin. This part introduces a trading strategy whose essential duty is to remove these unrealistic assumptions and to select informative indicators to evaluate the model.

3.1 Unrealistic assumptions

One drawback of using this model without further measures is that it acts on every candle. This does not make sense, since no trader acts this way. Even with a one-hundred-percent accurate model, acting on every candle costs a fortune
because exchanges require a transaction fee for each position we wish to open. This is the first unrealistic assumption that must be amended.

Fig 16. Model training flow chart. This paper calculates σ* for each epoch. If it is equal to or greater than ζ, the latest N loss function values (X) are retrieved from storage, and the function f decides whether the loss has converged; if so, the loop breaks and the trained model is returned. Function f considers the first element of X as the primary point and then calculates the slopes of the N − 1 lines starting from the primary point and ending at the other elements of X. If the average of these slopes rises above a threshold (a negative value close to zero), the loss is considered converged.

The second unrealistic assumption is that, to evaluate the model performance, we use indicators such as accuracy, which give equal weight to all positions. In other words, it does not matter how big the positions' profit or loss is: positions with ten percent profit are treated the same as positions with one percent profit. Furthermore, accuracy does not care about the positions' margin. No matter how much money we assign to a position, it just checks whether the price direction was truly predicted or not. This is not realistic, since traders do not necessarily divide their margin equally among their positions, and it also matters to them how much profit they
earn from each position. They may open eight wrong positions but still be satisfied, since their other two correct positions give them much more profit than the loss coming from the wrong ones. That is why accuracy is not a solid and creditable index for evaluating the model performance.

3.1.1 Acting on every single candle

To deal with the first unrealistic assumption, acting on every single candle, this research recommends setting a threshold on the output of the last sigmoid function prior to binary-classifying the data. The sigmoid output lies between 0 and 1. A common approach to classify the data is to assign label 0 to instances with a sigmoid output lower than 0.5 and label 1 to instances with a sigmoid output equal to or greater than 0.5. By changing this threshold, we can sift out the predictions with higher confidence.

Traders may detect good entry points to open a position, but they do not risk opening all of them. They first tend to measure an index named Risk/Reward [37]. The lower this index, the more attractive the position appears to traders. The inverse of this index says how big the estimated reward (profit) is compared to the potential loss we undergo if we open the position. It is generally calculated from the stop loss (SL) and take profit (TP) lines which traders set before opening a position. The SL is the price at which traders no longer wish to undergo loss. Similarly, if the price moves according to their expectation, the TP is defined as an earlier price at which traders wish to secure their profit. The Risk/Reward index is nothing but the absolute difference between the entry point and the SL price divided by the absolute difference between the TP and the entry point price, as indicated in Eq 14.

Fig 17. Stop loss and take profit prices. Setting SL and TP prices is a form of art, and traders use different strategies to do it. This figure uses the highest and lowest prices as TP and SL, since these prices can play the role of support and resistance lines for this snapshot.

    Risk/Reward = |E − SL| / |TP − E|,    (14)

where E is the entry point price. Traders want this index to be as small as possible; however, each trader has a desired value at which they enter a position. The most common value for this index among traders is 1/3. The threshold for the sigmoid function can be determined according to the
Risk/Reward index. This paper refers to this index as θ for ease of reference. Eq 15,

    Upper = 1 / (1 + θ),
    Lower = θ / (1 + θ),    (15)

defines two thresholds for the sigmoid output, named the upper and lower thresholds, respectively. Data instances with a sigmoid value greater than the upper threshold or less than the lower threshold are selected as entry points. This way, we do not act on candles whose sigmoid output lies between the lower and upper thresholds. Eq 16,

    Signal = Long/Buy if σ_t ≥ Upper;  Short/Sell if σ_t ≤ Lower,    (16)

describes when this strategy opens either a long or a short position, where σ_t is the value of the sigmoid function corresponding to the t-th data instance. Depending on the value of θ, this strategy ignores the positions whose confidence is not adequate. Consequently, the problem of acting on every candle is addressed.

3.1.2 Bet sizing

To address the second issue, assigning equal amounts of margin and weight to each position, this part considers a simple approach to specify position margins. Position margins are calculated from the absolute difference between the sigmoid function value and 0.5. For example, if the sigmoid function value is 0.1 and this number is less than the lower threshold, we assign 0.5 − 0.1 = 0.4 of our total margin to this position. By increasing the lower threshold or decreasing the upper threshold, we can allow our strategy to open many more positions, each with a much smaller margin.

3.2 Evaluate the performance

There are various indexes by which the performance of models is evaluated, especially in our case where the model's task is to binary-classify the data; for example, accuracy, recall and precision. These indicators relate to the model outputs, not to the trading strategy outputs. We may have an excellent trained model that is 60 percent accurate but still makes us lose money, since the positions' margins and their corresponding profits and losses may not be equal. Such a model may correctly predict the 60 percent of instances whose corresponding margins and profits are much lower than the margins and losses of the other 40 percent of instances. For example, if we correctly open 60 positions with an average initial margin of $100 and an average profit of 4% and wrongly open 40 positions with an average initial margin of $150 and an average loss of 5%, on the whole we lose money. In contrast, a 45 percent accurate model could still be profitable; hence accuracy is not a good indicator of how rich we get if we use this trained model. Although this paper reports the confusion matrix to show the model's performance, it highly recommends using the absolute return to evaluate the trading strategy performance, since we make money from the trading strategy and not from the direct output of the model.
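The trading-strategy rules of Eqs 15 and 16, together with the bet-sizing step, can be sketched as follows; the default θ of 1/3 follows the common value mentioned above, and the sigmoid outputs are assumed to come from the trained model.

```python
import numpy as np


def thresholds(theta: float):
    """Eq 15: upper and lower confidence thresholds from the Risk/Reward index theta."""
    return 1.0 / (1.0 + theta), theta / (1.0 + theta)


def generate_signals(sigmoid_outputs: np.ndarray, theta: float = 1 / 3) -> list:
    """Eq 16 plus bet sizing: return (signal, fraction of total margin) per candle."""
    upper, lower = thresholds(theta)                  # e.g. 0.75 and 0.25 for theta = 1/3
    signals = []
    for sigma_t in sigmoid_outputs:
        if sigma_t >= upper:
            signals.append(("long", abs(sigma_t - 0.5)))
        elif sigma_t <= lower:
            signals.append(("short", abs(sigma_t - 0.5)))
        else:
            signals.append(("no_trade", 0.0))         # confidence too low: do not act on this candle
    return signals
```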
Results

The data selected to train the model are four-hour data of ETH/USDT and BTC/USDT, each starting in 2017 and ending in 2021, and daily data of the S&P500 index beginning in 2001 and ending in 2021. The BTC and ETH models are tested on one year of data, and the S&P500 model is tested on four years of data. This research reports two sets of results: the first investigates the model performance, and the second, more important one reports the profitability of the model.

Model performance

To evaluate the model performance, this part reports the accuracy, recall, precision, F1 score, converged value of the loss function, and its corresponding σ*. These indexes are provided for different values of θ in Table 2, Table 3 and Table 4.

Table 2. Model performance - S&P500

Table 3. Model performance - BTC/USDT
Table 4. Model performance - ETH/USDT

Trading strategy profitability

To evaluate the trading strategy profitability, this part reports the total percentage of profit and loss (PnL) under two conditions: without profit saving (the initial margin plus the profit earned so far is considered the accessible margin) and with profit saving (only the initial margin is considered the accessible margin). These indicators are provided for different values of θ in Table 5, Table 6 and Table 7.

Table 5. Trading strategy profitability - S&P500

The results show that when we set θ to 1/4, the profitability of the trading strategy decreases. The reason is that too much sensitivity about confidence deprives us of the opportunity to profit from positions with lower amounts of profit.
Table 6. Trading strategy profitability - BTC/USDT

Table 7. Trading strategy profitability - ETH/USDT

Discussion

This study takes each step according to an intuition, which emphasizes that we need domain knowledge of the field in which we wish to utilize artificial intelligence. This domain knowledge is exactly what helps us design each part of our research according to its corresponding reason in the real world. Introducing a comprehensive framework that cares about all steps, from manipulating raw data to defining a trading strategy, is a novel contribution whose main aim is to be implementable in the real world. Using traders' behavior and coming up with corresponding elements in our framework is the main edge of this research. It avoids unrealistic assumptions and tries to suggest solutions that help organizations, researchers, or individuals who would like to earn profit from their financial activities.

With respect to all the complexity in financial markets, many other topics,
components, and problems have remained unexplored, and further research is required to address them. This research is the first attempt to adopt a mindset compatible with financial markets and to define a guideline for better interpreting these time series data. Further research will be considered to address and solve other obstacles we may encounter in financial markets.

Conclusion

This research provides a novel framework to manipulate financial time series data and exploit deep neural networks to forecast the one-step-ahead closing price. It also recommends defining a trading strategy to be able to implement the work in real-world conditions. The first part of this study, the data preparation, consists of six important steps. Firstly it generates financial features and then denoises the data. Since there is no evidence of how important these features are, the feature engineering part is considered as the third step. Having done that, it splits the data into three categories and creates a representative dataset. Finally, it uses auto-encoders to extract high-level features and perform dimensionality reduction. The second part of this study, the modeling, aims at designing an intuitive neural network architecture by which data instances are classified. This part consists of three elements. The first element, the Conv element, aims at changing the data instances' resolution to obtain a more robust prediction. The second element, the LSTM element, is designed to carry information from the past so that the price path is taken into account when acting on the current candle. Finally, the FC element is considered to binary-classify the LSTM outputs. The last part of this study, the trading strategy, discusses how to use the trained model. It mentions some unrealistic assumptions and suggests solutions to deal with them. Finally, it reviews the indexes and indicators that we should use to evaluate the work. The performance of this framework is tested on three different pieces of data, including two coins from the cryptocurrency market and S&P500 market data. The results show that not only is this model able to provide a robust prediction, but it can also provide a considerable amount of profit, such that our balance would increase 10x-12x during one year.

References

1. Chong E, Han C, Park FC. Deep learning networks for stock market analysis and prediction: Methodology, data representations, and case studies. Expert Systems with Applications. 2017;83:187–205.

2. Tsantekidis A, Passalis N, Tefas A, Kanniainen J, Gabbouj M, Iosifidis A. Using deep learning to detect price change indications in financial markets. In: 2017 25th European Signal Processing Conference (EUSIPCO). IEEE; 2017. p. 2511–2515.

3. Maknickienė N, Rutkauskas AV, Maknickas A. Investigation of financial market prediction by recurrent neural network. Innovative Technologies for Science, Business and Education. 2011;2(11):3–8.

4. Chen Y, Liu K, Xie Y, Hu M. Financial trading strategy system based on machine learning. Mathematical Problems in Engineering. 2020;2020.

5. Bao W, Yue J, Rao Y. A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PloS one. 2017;12(7):e0180944.
6. Vahala J. Prediction of financial markets using deep learning. Masaryk University, Brno, Czechia, unpublished master's thesis. 2016.

7. Wang J, Sun T, Liu B, Cao Y, Wang D. Financial markets prediction with deep learning. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE; 2018. p. 97–104.

8. Hoseinzade E, Haratizadeh S, Khoeini A. U-CNNpred: A universal CNN-based predictor for stock markets. arXiv preprint arXiv:1911.12540. 2019.

9. Hoseinzade E, Haratizadeh S. CNNpred: CNN-based stock market prediction using a diverse set of variables. Expert Systems with Applications. 2019;129:273–285.

10. Hoseinzade E, Haratizadeh S. CNNPred: CNN-based stock market prediction using several data sources. arXiv preprint arXiv:1810.08923. 2018.

11. Samarawickrama A, Fernando T. A recurrent neural network approach in predicting daily stock prices: an application to the Sri Lankan stock market. In: 2017 IEEE International Conference on Industrial and Information Systems (ICIIS). IEEE; 2017. p. 1–6.

12. Zanc R, Cioara T, Anghel I. Forecasting financial markets using deep learning. In: 2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP). IEEE; 2019. p. 459–466.

13. Fischer T, Krauss C. Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research. 2018;270(2):654–669.

14. Zhelev S, Avresky DR. Using LSTM neural network for time series predictions in financial markets. In: 2019 IEEE 18th International Symposium on Network Computing and Applications (NCA). IEEE; 2019. p. 1–5.

15. Struzik ZR. Wavelet methods in (financial) time-series processing. Physica A: Statistical Mechanics and its Applications. 2001;296(1-2):307–319.

16. Nobre J, Neves RF. Combining principal component analysis, discrete wavelet transform and XGBoost to trade in the financial markets. Expert Systems with Applications. 2019;125:181–194.

17. Türkmen AC, Cemgil AT. An application of deep learning for trade signal prediction in financial markets. In: 2015 23rd Signal Processing and Communications Applications Conference (SIU). IEEE; 2015. p. 2521–2524.

18. Masci J, Meier U, Cireşan D, Schmidhuber J. Stacked convolutional auto-encoders for hierarchical feature extraction. In: International Conference on Artificial Neural Networks. Springer; 2011. p. 52–59.

19. Ding X, Zhang Y, Liu T, Duan J. Deep learning for event-driven stock prediction. In: Twenty-Fourth International Joint Conference on Artificial Intelligence; 2015.

20. Sirignano J, Cont R. Universal features of price formation in financial markets: perspectives from deep learning. Quantitative Finance. 2019;19(9):1449–1459.

21. Arthur WB. Complexity in economic and financial markets. Complexity. 1995;1(1):20–25.