Projects/bitcoin.Rmd at master · nguyenmmegan/Projects · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
---
title: "bitcoin"
author: "Megan Nguyen"
date: "12/3/2017"
output:
  word_document: default
  html_document: default
---
Betting on Bitcoin: An attempt to predict future prices

Abstract

In this project, I attempt to predict Bitcoin market prices by choosing the best fit model. In order to make the time series stationary, I first transformed the data using a Box-Cox transformation to stabilize the variance and make the time series more normal. After that, I differentiated the time series so remove trend. I analyzed the data to find the best fit model, using ACF and PACF graphs to locate significant lags for the theoretical ARMA(p,d,q) model before performing a AICc test to find the actual best fit model, which was ARMA(6,1,6). I made predictions according to this model.

After finding the model, I discovered that due to its heteroskedaticy, this model is no the best model for predicting Bitcoin prices. The predictions for future Bitcoin prices were inaccurate compared to the actual Bitcoin prices. In order to find a better fit model, I will need to use other financial models that account for volatility that are not normally distributed, such as a GARCH model.


Introduction

Bitcoin is an accounting system that records transactions and holds values digitally that utilizes an online open ledger called the blockchain that has become extremely popular within the last few years. It is a cryptocurrency that uses cryptography to secure its transactions, verify the creation of additional units, and verify the transfer of assets. Bitcoin lacks a central server, but instead uses its distributed blockchain maintained by miners, computers that are tasked with maintaining the ledger to verify, update, and make sure it is trustworthy. Because these transactions are maintained by a body of independent computers compelled by an incentive system, there is no need for a bank, getting rid of fees, inefficiencies, corruption, and the risk that comes with centralization. In addition, every Bitcoin is accounted for, so there are no counterfeits. Using this cryptocurrency is easily transferable, anonymous, and is a new system of global, decentralized money- the first currency of the internet that everyone is free to use.

In the past few years, the value of Bitcoin has soared. Because there is a cap of 21 million Bitcoins, the value of each Bitcoin increases as its popularity increases. People find value in Bitcoin because it is hard for the government to tax and trace. In addition, people are willing to accept and trade in Bitcoin, and it acts as an equity investment. In the past, the price of Bitcoin has gone through a number of appreciation and depreciation bubbles and busts that have made it extremely volatile. Despite its unpredictability, a large number of people continue to invest in the cryptocurrency, with many believing its value to continuously increase in the future.

In the project, I analyzed the daily prices of Bitcoin from December 2015 to the present day (https://blockchain.info/charts) using a number of statistical processes and models from R, and found the best fit model. First, I transformed the data using Box-Cox transformation and differentiation. After transforming the time series and fitting it into its best model using ACF and PACF graphs, the Yule-Walker method, and AICc to compare the AIC values with each possible (and calculable) model, I forecasted future Bitcoin price values, and compared them with its actual available current prices.

I found that the ARMA(6,1,6) model I used was not satisfactory. It did not follow a normal distribution, due to its heteroskedatic nature. Although the model was not able to accurately predict the actual values of Bitcoin prices, it is able to paint a picture of Bitcoin prices and its extremely volatile prices.

Initial Analysis

```{r}
#install data from 12/5/2015 - 12/2/2017

data <- read.csv("/Users/megannguyen/Desktop/Datasets/BitcoinPrice_2015-2017.csv")
names(data) <- c("DATETIME", "Closing Market Price")
head(data)
```

```{r}
bitcoin <- ts(data[,2]) # R graph
ts.plot(bitcoin, ylab = "Bitcoin Closing Market Price in $", xlab = "Time in Days", main = "Bitcoin Price Since December 6, 2015")

#Using ggplot2
library(tidyr)
data$DATETIME <- as.character(data$DATETIME)
data1 <- separate(data, "DATETIME", into = c("DATE", "TIME"), sep = "[[:space:]]", convert = TRUE)
data1 <- data1[,-2]
names(data1) <- c("DATE", "ClosingPrice")
data1$DATE <- as.Date(data1$DATE)
```
Plotting the time series of the bitcoin prices is strong linear trend, with large variability that increases with the trend. There is no seasonality, as there are no patterns that occur in regular time intervals. In addition, there are sharp changes in its behavior due to the increasing popularity of the cryptocurrency, as well as financial crises within the cryptocurrency market. Because there is a trend, large variability, and sharp changes in behavior, this dataset is not stationary. In order to stabilize the variability, I will perform a Box-Cox Transformation, using the appropriate lambda value. To remove trend from the time series, I will differentiate the data.

Transformation

```{r}
#BOXCOX TRANSFORMATION to make more normal and reduce variability
library(MASS)
t <- 1:length(bitcoin)
fit <- lm(bitcoin~t)
bcTransform <- boxcox(bitcoin~t, plotit = TRUE)

```
The Box-Cox plot shows the appropriate lambda value to use when transforming my data. The lambda value found is around -0.7474747. This transformation will stabilize the variation and make the time series more normally distributed.

```{r}
#Apply transformation using lambda
lambda = bcTransform$x[which(bcTransform$y == max(bcTransform$y))]
bitcoin.bc <- (1/lambda)*(bitcoin^lambda-1)

#Plot
op <- par(mfrow = c(1,2))
ts.plot(bitcoin, main = "Original data", ylab = expression(X[t]))
ts.plot(bitcoin.bc, main = "Box-Cox Transformed data", ylab = expression(Y[t]))
```
```{r}
var(bitcoin)
var(bitcoin.bc)
```
transformed data 𝑌 (the new data with respect to time t), the variability is stabilized throughout
the data. In the original data, the variability is small at first, but increases greatly over time. In the transformed data, the variability appears more normal and is stabilized with time.

```{r}
#ACF of original data
op <- par(mfrow = c(1,2))
acf(bitcoin, main = "Original ACF", ylim = c(-1,1), xlab = "h", ylab = expression(hat(rho)[X](h)))
pacf(bitcoin, main = "Original PACF", ylim = c(-1,1), xlab = "h", ylab = expression(hat(rho)[X](h)))
```
```{r}
#ACF of transformed data
op <- par(mfrow = c(1,2))
acf(bitcoin.bc, main = "Box-Cox Transformed ACF", ylim = c(-1,1), xlab = "h", ylab = expression(hat(rho)[X](h)))
pacf(bitcoin.bc, main = "Box-Cox Transformed PACF", ylim = c(-1,1), xlab = "h", ylab = expression(hat(rho)[X](h)))
```
The ACF of the original data and the Box-Cox transformed data shows that the data is highly dependent with each other. It is smooth and gradually decreases, although there is still a strong trend present. The variability of the original data is 3721153, and the variability of the box-cox transformed data is 1.962039e-05. The PACF appears normal.

```{r}
#Differencing trend at d = 1
bitcoin.d1 = diff(bitcoin.bc, 1)
plot(bitcoin.d1,main = "De-trended Time Series for d = 1",ylab = expression(nabla[1]~Y[t]))
abline(h = 0,lty = 2)
```
```{r}
#Perform unit root test to see if stationary
library(fUnitRoots)
unitrootTest(bitcoin.d1)
```
In order to remove trend from the data, the data was differentiated at d = 1. From the transformed time series, there is no longer a strong linear trend. The variability of the differentiated, Box-Cox transformed data is now minimized to 4.2264e-08. When differentiating the time series for d = 2, the variance is 8.01734e-08, which is greater than 4.2264e-08. At d = 2, the time series is over-differentiated, so the time series should be differentiated at d = 1.
After performing a root test and finding that the roots of the data fall outside the unit circle, it is confirmed that the model is now stationary.

Preliminary model

```{r}
#De-trended ACF and PACF at d = 1
op = par(mfrow = c(1,2))
acf(bitcoin.d1, main = "")
pacf(bitcoin.d1, main = "")
title("De-trended Time Series for d = 1", line = -1, outer=TRUE)
```
```{r}
#Variance at d=1
var(bitcoin.d1)

#ACf and PACF values
acf_d1 <- acf(bitcoin.d1, lag.max = 100, plot = FALSE)
pacf_d1 <- pacf(bitcoin.d1, lag.max = 100, plot = FALSE)

#CI Value
ci <- qnorm((1 + 0.95)/2)/sqrt(728)

#Significant ACF values .. MA presence... that determines MA model order ... only testing for 100 lags
acf_lag_id <- which(abs(acf_d1$acf) > ci)
acf_lags <- acf_d1$lag[acf_lag_id]
acf_lags

#Significant PACF values .. that determines AR model order ... only for 100 lags
pacf_lag_id <- which(abs(pacf_d1$acf) > ci)
pacf_lags <- pacf_d1$lag[pacf_lag_id]
pacf_lags

```
With the variability stabilized, and the trend removed, the ACF and PACF reflect a stationary time series. There are significant ACF values, meaning that there are significant MA (moving average) components present. The oscillating and decaying ACF values indicate the presence of an AR (autoregressive) process. The significant lags that fall outside the 95% confidence interval are 5, 9, 25, 41, and 85 when testing for the first 100 lags. This means that after lag 85, there are no significant ACF values, so the theoretical model would be MA(85).

From the PACF, the significant lags that fall outside the 95% confidence interval are 5, 25, 41, and 85. This means that after lag 85, there are no significant PACF values, so AR(85) would be

the best theoretical AR model. However, these values are too large to calculate for an ARMA (autoregressive moving-average) model in R, so the best theoretical preliminary model would be ARMA(5,1,5). The AIC value for this model is 8894.76.

Fitting Models


```{r}
#
#Yule Walker method
library(smooth)
fit_ar <- ar(bitcoin.d1, method="yule-walker")
fit_ar
fit_ma<- sma(bitcoin.d1)
fit_ma
```
When performing the Yule-Walker method, the estimated order for the AR model is 6, and when performing a smooth moving average function, the estimated order for the MA model is 200. Because this model, is difficult to compute using R, we will stick with the p = 6.

```{r}
library(qpcR)
aiccs1 <- matrix(NA, nr = 10, nc = 10)
dimnames(aiccs1) = list(p=0:9, q=0:9)
for(p in 0:9)
{
for(q in 0:9)
{
aiccs1[p+1,q+1] = AICc(arima(bitcoin.d1, order = c(p,0,q), method="ML", optim.control = list(maxit = 800)))
}
}
aiccs1

```

```{r}
min(aiccs1)

```
Performing the AIC(c) test on the data, we can see that the best fit model for the time series is ARMA(6,1,6), with a minimum AIC value of -10285.03. This aligns with the results we got from the Yule-Walked Method, which found that the best p order is 6. Because the orders in the ARMA(6,1,6) model are computable and have the least AIC value, this is the best fit model. The equation for this model is:
𝑋𝑡 = 0.0053𝑋𝑡−1 − 0.309𝑋𝑡−2 + 0.0668𝑋𝑡−3 + 0.2698𝑋𝑡−4 + 0.0751𝑋𝑡−5 + 0.8741𝑋𝑡−6 + 𝑍𝑡 + 0.1009𝑍𝑡−1 + 0.4457𝑍𝑡−2 + 0.0059𝑍𝑡−3 − 0.3418𝑍𝑡−4 − 0.0418𝑍𝑡−5 − 0.9195𝑍𝑡−6

Residual Analysis of Fitted Model
```{r}
#Residual analysis on ARIMA(6,1,6)
fit <- arima(bitcoin, order = c(6,1,6), method = "ML")
fit
```


```{r}
Box.test(residuals(fit), type = "Ljung")
```

```{r}
#normality test
shapiro.test(residuals(fit))
```


```{r}
ts.plot(residuals(fit),main = "Fitted Residuals")
```

```{r}
par(mfrow=c(1,2),oma=c(0,0,2,0))
# Plot diagnostics of residuals
op <- par(mfrow=c(2,2))
# acf
acf(residuals(fit), lag.max = 800, main = "Autocorrelation")
# pacf
pacf(residuals(fit), lag.max = 800, main = "Partial Autocorrelation")
5
# Histogram
hist(residuals(fit),main = "Histogram")
# q-q plot
qqnorm(residuals(fit))
qqline(residuals(fit),col ="blue")
# Add overall title
title("Fitted Residuals Diagnostics", outer=TRUE)
par(op)
```
Plotting the residuals of the model, we can see that within the past year, the data is extremely volatile. The model is not satisfactory. From the Normal Q-Q plot, it violates the normality assumption with its heavy tails that follow a White Noise distribution. And although it passed the Box-Ljung test, meaning that the data is independently distributed, the time series did not pass the Shapiro-Wilk test, meaning that the data is not normal, which can be seen in the Normal Q-Q plot. Although we were able to find the best fitted ARMA model, it is not the best model for this dataset. This dataset should be fitted with other models that do not follow a Gaussian distribution, such as the GARCH model, that fits better with financial data that are hetereoskedatic.

Forecasting

Although the ARMA(6,1,6) model is not satisfactory, we can still attempt to forecast future bitcoin prices.

```{r}
#Predictions for the next month
mypred <- predict(fit, n.ahead=31)
ts.plot(bitcoin, xlim=c(0, 760), ylim = c(0, 20000), xlab = "Days since December 5, 2015", ylab = "Bitcoin $USD Market Price", main = "Betting on Bitcoin")
points(729:759,mypred$pred)
lines(729:759,mypred$pred+1.96*mypred$se,lty=2, col = "red")
lines(729:759,mypred$pred-1.96*mypred$se,lty=2, col = "red")
```

```{r}
#Predictions for the next year
mypred <- predict(fit, n.ahead=365)
ts.plot(bitcoin, xlim=c(0, 1100), ylim = c(0, 50000), xlab = "Days since December 5, 2015", ylab = "Bitcoin $USD Market Price", main = "Betting on Bitcoin")
points(729:1093,mypred$pred)
lines(729:1093,mypred$pred+1.96*mypred$se,lty=2, col = "red")
lines(729:1093,mypred$pred-1.96*mypred$se,lty=2, col = "red")

```

```{r}
#Predictions for the next 3 years
mypred <- predict(fit, n.ahead=1095)
ts.plot(bitcoin, xlim=c(0, 1850), ylim = c(0, 120000), xlab = "Days since December 5, 2015", ylab = "Bitcoin $USD Market Price", main = "Betting on Bitcoin: 95% Confidence Interval")
points(729:1823,mypred$pred)
lines(729:1823,mypred$pred+1.96*mypred$se,lty=2, col = "red")
lines(729:1823,mypred$pred-1.96*mypred$se,lty=2, col = "red")
```
It is clear that the predicted data values (black) are extremely inaccurate with the actual results (red). Again, this is due to the highly volatile nature of the Bitcoin prices. This supports the fact that the ARMA(6,1,6) model is not a good fit for the data.

Conclusion

In conclusion, using the historical prices of Bitcoin, I created an ARMA(6,1,6) model that best- fit the data after applying a Box-Cox transformation and differentiation to stabilize variability and removing trend:
𝑋 =0.0053𝑋 −0.309𝑋 +0.0668𝑋 +0.2698𝑋 +0.0751𝑋 +0.8741𝑋
𝑡 𝑡−1 𝑡−2 𝑡−3 𝑡−4 𝑡−5 𝑡−6
+ 𝑍𝑡 + 0.1009𝑍𝑡−1 + 0.4457𝑍𝑡−2 + 0.0059𝑍𝑡−3 − 0.3418𝑍𝑡−4 − 0.0418𝑍𝑡−5
− 0.9195𝑍𝑡−6

This model proved to not be significant or accurate according to the Shapiro-Wilk test and the actual vs. predicted data. This is due to the highly volatile nature of the data. Because this financial dataset is hetereoskedatic, it should be fitted to a GARCH model in order to capture its volatility. Although this model is inaccurate, and is not the best fit model, it is still able to use historical data to create possible predictions for Bitcoin prices in the future.