An Introduction to Financial Time Series Analysis with Python

健谈始于戊戌年 2021-07-01


1. What is a time series?

Some data change regularly over time. A time series is a sequence of data points recorded at regular time intervals. A time series can have different frequencies: hourly, daily, weekly, monthly, annually, and so on.

Stock prices are a good example of time series data. We usually analyze stock prices using daily closing prices, trading volumes, and similar measures.

Some hedge funds analyze second-by-second or minute-by-minute price series for their high-frequency trading. You can therefore choose the frequency of your time series based on your purpose and on what data you can obtain.
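As a minimal sketch of what changing the frequency looks like in Pandas, the snippet below turns a daily series into a weekly one by keeping the last close of each week. The prices are synthetic and the variable names (daily_close, weekly_close) are illustrative assumptions, not data from this article.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closes on business days (synthetic, for illustration only)
dates = pd.date_range("2020-01-01", periods=60, freq="B")
rng = np.random.default_rng(0)
daily_close = pd.Series(100 + rng.normal(0, 1, size=60).cumsum(), index=dates, name="Close")

# Downsample: keep the last available close in each calendar week
weekly_close = daily_close.resample("W").last()

print(len(daily_close), "daily observations")
print(len(weekly_close), "weekly observations")
print(weekly_close.head())
```

Going the other way, from weekly to daily, would require filling in values that were never observed, which is why the lowest frequency you can analyze is limited by the data you can get.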

2. Why do we need time series analysis?

We need time series analysis because we want to know something about the future, and that goal has never changed. Since ancient times, people have wanted crystal balls to tell them what will happen next. Those crystal balls have evolved into time series analysis and other machine learning models, and yesterday's fortune-tellers now go by a nicer name: data scientists! As long as humans exist, there will be demand for predicting the future. And if your prediction turns out wrong, never mind, nobody will complain: just look at the weather forecast!

Sometimes time series forecasting is very successful. We know that the T-Mobile store and the Citibank branch near Columbia University will get more clients each August because of the new students. Hot dog sales on Coney Island rise every summer. In the long run, investing in the stock market earns a return close to 12%. If fortune-telling were never right, nobody would bother with it.

3. How to analyze financial time series with Python?

The first step is to read a financial time series into Python. How can we do that? I will use Tesla stock as an illustration.

3.1 Read data from Yahoo! Finance

We import the Python libraries Pandas, NumPy, matplotlib.pyplot and pandas_datareader. pandas_datareader is the package that helps us read Tesla's stock price from Yahoo! Finance. If you don't have pandas_datareader, open a command prompt and install it with:

pip install pandas_datareader

# Import some libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as pdr

Since we will analyze financial data over a fixed period, the next step is to set a start date and an end date.

# Set start and end date
start = '2010-12-31'
end = '2020-03-04'

Now we are ready to retrieve Tesla's data from Yahoo! Finance. A single line of code is enough:

TSLA = pdr.get_data_yahoo('TSLA', start=start, end=end)

If you want to retrieve other stocks such as AAPL, AMZN, GOOG or FB (Apple, Amazon, Google, Facebook), just change the symbol. For this article, I will keep using Tesla as my example.
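Because a download depends on the remote service being reachable, it can help to save the frame locally once and reload it later. The following is only a sketch under assumptions: the two-row frame stands in for the downloaded data (values copied from the output shown in section 3.2), and the file name tsla_sample.csv is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the downloaded frame; values copied from the article's output
prices = pd.DataFrame(
    {"Close": [26.629999, 26.620001], "Volume": [1417900, 1283000]},
    index=pd.DatetimeIndex(["2010-12-31", "2011-01-03"], name="Date"),
)

# Hypothetical cache location in the system's temporary directory
csv_path = os.path.join(tempfile.gettempdir(), "tsla_sample.csv")
prices.to_csv(csv_path)

# Reload, restoring the Date column as a parsed datetime index
reloaded = pd.read_csv(csv_path, index_col="Date", parse_dates=True)
print(reloaded)
```

With the real data, the same two calls (to_csv and read_csv) would be applied to TSLA instead of the stand-in frame.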

3.2 Check data with Pandas

It is a good habit to inspect your data with Pandas before you analyze them.

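The time series dataset TSLA starts on 2010-12-31 and ends on 2020-03-04. It has 2308 rows and 6 columns. Evaluating TSLA.info without parentheses simply echoes a truncated view of the DataFrame: it shows High, Low, Volume and Adj Close (the adjusted closing price), but two columns are hidden behind the ellipsis. Calling TSLA.info() with parentheses would instead print a summary of the columns and their types.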

TSLA.info

Out[83]:

<bound method DataFrame.info of                   High         Low  ...    Volume   Adj Close
Date                                ...
2010-12-31   27.250000   26.500000  ...   1417900   26.629999
2011-01-03   27.000000   25.900000  ...   1283000   26.620001
2011-01-04   26.950001   26.020000  ...   1187400   26.670000
2011-01-05   26.900000   26.190001  ...   1446700   26.830000
2011-01-06   28.000000   26.809999  ...   2061200   27.879999
2011-01-07   28.580000   27.900000  ...   2247900   28.240000
2011-01-10   28.680000   28.049999  ...   1342700   28.450001
2020-03-02  743.690002  686.669983  ...  20195000  743.619995
2020-03-03  806.979980  716.109985  ...  25784000  745.510010
2020-03-04  766.520020  724.729980  ...  15004800  749.500000

[2308 rows x 6 columns]>

If we use head(), however, we can also see the Open and Close columns.

TSLA.head()

Out[84]:

                 High        Low   Open      Close   Volume  Adj Close
Date
2010-12-31  27.250000  26.500000  26.57  26.629999  1417900  26.629999
2011-01-03  27.000000  25.900000  26.84  26.620001  1283000  26.620001
2011-01-04  26.950001  26.020000  26.66  26.670000  1187400  26.670000
2011-01-05  26.900000  26.190001  26.48  26.830000  1446700  26.830000
2011-01-06  28.000000  26.809999  26.83  27.879999  2061200  27.879999

The shape attribute (an attribute, not a function, so no parentheses) gives us similar information.

TSLA.shape

Out[85]: (2308, 6)

Ten years ago, a share of the then-small Tesla cost only about $26.

TSLA.Close.head()

Out[86]:

Date

2010-12-31 26.629999

2011-01-03 26.620001

2011-01-04 26.670000

2011-01-05 26.830000

2011-01-06 27.879999

Name: Close, dtype: float64

Ten years later, one share of Tesla is worth over $700, about 28 times its price at the end of 2010! So if you had spent about $2,600 on 100 shares of Tesla at the end of 2010, you would now, theoretically speaking, have about $75,000. But that is only in theory: very few people can hold a stock that long without touching it, so in this scenario you would have done best to forget about the shares after buying them. Try it yourself if you don't believe me!

TSLA.Close.tail()

Out[87]:

Date

2020-02-27 679.000000

2020-02-28 667.989990

2020-03-02 743.619995

2020-03-03 745.510010

2020-03-04 749.500000

Name: Close, dtype: float64
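The arithmetic behind the comparison of the first and last closing prices can be checked in a few lines. This is a small worked sketch: the two prices are copied from the head() and tail() outputs, and the variable names are illustrative.

```python
first_close = 26.629999   # close on 2010-12-31, from the head() output
last_close = 749.500000   # close on 2020-03-04, from the tail() output
shares = 100

price_multiple = last_close / first_close   # final price as a multiple of the first price
total_return = price_multiple - 1           # gain as a fraction of the initial outlay
cost = shares * first_close
value = shares * last_close

print(f"price multiple: {price_multiple:.2f}x")
print(f"total return:   {total_return:.0%}")
print(f"cost ${cost:,.0f} -> value ${value:,.0f}")
```

Note the difference between the two figures: a price that is about 28 times its starting level corresponds to a gain of about 27 times the initial outlay, since the original stake itself counts as one multiple.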

If we check the adjusted close price, we find that it is the same as the close price. This means that over these years Tesla paid no dividends to its shareholders and did not split its stock. However, if you are a disciple of value investing, you would not buy a stock like this at all, because it pays no dividend, and buying dividend-paying stocks is one of the principles of value investing. Life is a box of chocolates!

TSLA['Adj Close'].head()

Out[96]:

Date

2010-12-31 26.629999

2011-01-03 26.620001

2011-01-04 26.670000

2011-01-05 26.830000

2011-01-06 27.879999

Name: Adj Close, dtype: float64

TSLA['Adj Close'].tail()

Out[97]:

Date

2020-02-27 679.000000

2020-02-28 667.989990

2020-03-02 743.619995

2020-03-03 745.510010

2020-03-04 749.500000

Name: Adj Close, dtype: float64
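Rather than comparing the two columns by eye, the check can be done programmatically. The sketch below uses only the five rows shown in the head() output as a stand-in (the name sample is an assumption); on the real data the same comparison would be applied to TSLA.

```python
import numpy as np
import pandas as pd

# Five rows copied from the head() output; the real frame has 2308 rows
sample = pd.DataFrame(
    {
        "Close": [26.629999, 26.620001, 26.670000, 26.830000, 27.879999],
        "Adj Close": [26.629999, 26.620001, 26.670000, 26.830000, 27.879999],
    },
    index=pd.to_datetime(
        ["2010-12-31", "2011-01-03", "2011-01-04", "2011-01-05", "2011-01-06"]
    ),
)

# True if every pair of values agrees up to floating-point tolerance
same = np.allclose(sample["Close"], sample["Adj Close"])
# Largest absolute difference between the two columns
largest_gap = (sample["Close"] - sample["Adj Close"]).abs().max()

print(same, largest_gap)
```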

In Pandas, you can also select specific rows by position.

TSLA.iloc[10:13]

Out[88]:

                 High        Low   Open      Close   Volume  Adj Close
Date
2011-01-14  26.580000  25.610001  26.15  25.750000  1192000  25.750000
2011-01-18  25.639999  24.750000  25.48  25.639999  1621700  25.639999
2011-01-19  25.469999  23.750000  25.27  24.030001  2371500  24.030001
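Position-based selection with iloc has a label-based counterpart, loc, which is often more convenient for time series because rows can be picked by date. A minimal sketch, using the three closing prices shown above as a stand-in frame (sample is an illustrative name):

```python
import pandas as pd

# Three closes copied from the iloc output above
sample = pd.DataFrame(
    {"Close": [25.750000, 25.639999, 24.030001]},
    index=pd.to_datetime(["2011-01-14", "2011-01-18", "2011-01-19"]),
)

by_position = sample.iloc[0:2]                     # positions 0 and 1; end position excluded
by_label = sample.loc["2011-01-14":"2011-01-18"]   # date labels; end label included

print(by_label)
```

The two slices return the same rows here, but note the asymmetry: iloc excludes the end position, while loc includes the end label.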

You can also compute some summary statistics. The lowest closing price is about $21.83 and the highest is about $917.42. This means that if you had a crystal ball, bought at the bottom and sold at the peak, you would have multiplied your money about 42 times. Unfortunately, for ordinary people this is just a daydream.

TSLA['Close'].mean()

Out[91]: 201.01972715486895

TSLA['Close'].min()

Out[92]: 21.829999923706055

TSLA['Close'].max()

Out[93]: 917.4199829101562
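Besides the extreme values themselves, Pandas can report when they occurred via idxmin() and idxmax(). The sketch below demonstrates this on the five closes from the head() output only, so the dates it prints are for that excerpt, not for the full series; the peak-to-trough multiple at the end uses the min and max reported above.

```python
import pandas as pd

# Five closes copied from the head() output; the full series has 2308 rows
close = pd.Series(
    [26.629999, 26.620001, 26.670000, 26.830000, 27.879999],
    index=pd.to_datetime(
        ["2010-12-31", "2011-01-03", "2011-01-04", "2011-01-05", "2011-01-06"]
    ),
    name="Close",
)

low_date = close.idxmin()    # index label of the smallest value
high_date = close.idxmax()   # index label of the largest value
print(low_date.date(), close.min())
print(high_date.date(), close.max())

# Peak-to-trough multiple for the full series, from the min and max shown above
multiple = 917.4199829101562 / 21.829999923706055
print(f"{multiple:.1f}x")
```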

We can also use describe() to get all of these statistics in one step.

TSLA['Close'].describe()

Out[94]:

count 2308.000000

mean 201.019727

std 128.548347

min 21.830000

25% 47.615002

50% 219.450005

75% 277.867508

max 917.419983

Name: Close, dtype: float64

In Pandas, you can add new variables to a dataset. For example, you can add pct_return, the daily percentage return (that is, the daily price change), to the TSLA dataset with the following code:

TSLA['pct_return'] = TSLA['Close'].pct_change()

A new variable, pct_return, now appears in TSLA.

TSLA.head()

Out[102]:

                 High        Low   Open  ...   Volume  Adj Close  pct_return
Date                                     ...
2010-12-31  27.250000  26.500000  26.57  ...  1417900  26.629999         NaN
2011-01-03  27.000000  25.900000  26.84  ...  1283000  26.620001   -0.000375
2011-01-04  26.950001  26.020000  26.66  ...  1187400  26.670000    0.001878
2011-01-05  26.900000  26.190001  26.48  ...  1446700  26.830000    0.005999
2011-01-06  28.000000  26.809999  26.83  ...  2061200  27.879999    0.039135

[5 rows x 7 columns]
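To make explicit what pct_change() computes, the sketch below reproduces it by hand as today's close divided by the previous close, minus one. The five closing prices are copied from the output above; by_hand is an illustrative name.

```python
import pandas as pd

close = pd.Series(
    [26.629999, 26.620001, 26.670000, 26.830000, 27.879999],
    index=pd.to_datetime(
        ["2010-12-31", "2011-01-03", "2011-01-04", "2011-01-05", "2011-01-06"]
    ),
    name="Close",
)

by_method = close.pct_change()        # built-in daily percentage return
by_hand = close / close.shift(1) - 1  # P_t / P_(t-1) - 1, written out

print(by_method.round(6))
```

The first value is NaN in both versions because the first day has no previous close to compare with, which is also why describe() later counts 2307 returns for 2308 prices.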

We can describe() it as well. The mean is 0.001967, so if you had held Tesla over the past ten years, you would have earned about 0.2% per day on average. But some days are good and some days are bad: on the best day you gained 24% in a single day, and on the worst day you lost more than 19%. C'est la vie!

TSLA['pct_return'].describe()

Out[103]:

count 2307.000000

mean 0.001967

std 0.032396

min -0.193274

25% -0.013804

50% 0.000871

75% 0.017642

max 0.243951

Name: pct_return, dtype: float64
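One caveat about reading the mean as "what you earn every day": the arithmetic mean of daily returns is higher than the constant daily return that would actually compound to the observed price change. The sketch below works this out from numbers reported above. The rule of thumb it checks (geometric mean is roughly the arithmetic mean minus half the variance) is a standard approximation added here for illustration, not something taken from the article.

```python
first_close = 26.629999   # 2010-12-31, from the output above
last_close = 749.500000   # 2020-03-04, from the output above
n_days = 2307             # number of daily returns reported by describe()
mean_daily = 0.001967     # arithmetic mean of pct_return
std_daily = 0.032396      # standard deviation of pct_return

# Constant daily return that would compound to the observed price multiple
geometric_daily = (last_close / first_close) ** (1 / n_days) - 1

# Rule of thumb: geometric mean ~ arithmetic mean - variance / 2
approx = mean_daily - std_daily ** 2 / 2

print(f"arithmetic mean: {mean_daily:.4%}")
print(f"geometric mean:  {geometric_daily:.4%}")
print(f"approximation:   {approx:.4%}")
```

The gap between the two means grows with volatility, so for a stock as volatile as Tesla the compounded daily growth rate is noticeably below the 0.2% average.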

We definitely want to know which dates were our lucky days! Two days had a return higher than 18%: 2013-05-09 (Victory Day!), with 24.4%, and 2020-02-03, with 19.9%.

# Boolean mask: True on days when the return exceeded 18%
mask = (TSLA['pct_return'] > 0.18)
TSLA_high = TSLA[mask]
TSLA_high.head(10)

Out[106]:

                  High         Low  ...   Adj Close  pct_return
Date                                ...
2013-05-09   75.769997   63.689999  ...   69.400002    0.243951
2020-02-03  786.140015  673.520020  ...  780.000000    0.198949

[2 rows x 7 columns]
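A threshold mask requires guessing a cutoff such as 0.18 in advance. When the goal is simply "the best days", nlargest() and idxmax() get there directly. A minimal sketch on a short excerpt of returns copied from the outputs above (the series here is a stand-in, not the full 2307 returns):

```python
import pandas as pd

# Daily returns copied from the outputs above (an excerpt, not the full series)
pct_return = pd.Series(
    [-0.000375, 0.001878, 0.005999, 0.039135, 0.243951, 0.198949],
    index=pd.to_datetime(
        ["2011-01-03", "2011-01-04", "2011-01-05", "2011-01-06", "2013-05-09", "2020-02-03"]
    ),
    name="pct_return",
)

top2 = pct_return.nlargest(2)    # two largest values, sorted from high to low
best_day = pct_return.idxmax()   # date of the single best day

print(top2)
print(best_day.date())
```

The mirror-image calls, nsmallest() and idxmin(), would find the worst days in the same way.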

A common assumption in finance is that the log returns of stock prices are normally distributed with mean mu and standard deviation sigma, denoted N(mu, sigma^2).

Time series models, as well as other financial models (such as the Black-Scholes model), often rely on this assumption of normality. We should therefore include the daily log return (log_return) in our TSLA dataset. This is again basic Pandas functionality and can be done with the following code:

TSLA['log_price'] = np.log(TSLA.Close)
TSLA['log_return'] = TSLA['log_price'].diff()

Now we can see pct_return, log_price and log_return in the dataset.

TSLA.head()

Out[108]:

                 High        Low   Open  ...  pct_return  log_price  log_return
Date                                     ...
2010-12-31  27.250000  26.500000  26.57  ...         NaN   3.282038         NaN
2011-01-03  27.000000  25.900000  26.84  ...   -0.000375   3.281663   -0.000376
2011-01-04  26.950001  26.020000  26.66  ...    0.001878   3.283539    0.001876
2011-01-05  26.900000  26.190001  26.48  ...    0.005999   3.289521    0.005981
2011-01-06  28.000000  26.809999  26.83  ...    0.039135   3.327910    0.038389
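Two properties explain why log returns are convenient, and both can be verified on the five closes shown above. This is a small sketch with illustrative variable names: first, a log return equals log(1 + percentage return); second, log returns add up over time, so their sum equals the log of the last price over the first.

```python
import numpy as np
import pandas as pd

close = pd.Series([26.629999, 26.620001, 26.670000, 26.830000, 27.879999])

pct_return = close.pct_change()
log_return = np.log(close).diff()

# Property 1: log return = log(1 + percentage return)
from_pct = np.log1p(pct_return)
print(np.allclose(log_return.dropna(), from_pct.dropna()))

# Property 2: log returns are additive over time (sum() skips the leading NaN)
total_log = log_return.sum()
direct = np.log(close.iloc[-1] / close.iloc[0])
print(np.isclose(total_log, direct))
```

For small moves the two kinds of return are nearly identical (0.001878 versus 0.001876 above); they drift apart on large moves such as the 0.039135 versus 0.038389 on 2011-01-06.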

3.3 Test for normality

Let's use the Shapiro-Wilk test to check whether Tesla's daily log returns are normally distributed. For the Shapiro-Wilk test:

H0: the data are normally distributed

Ha: the data are not normally distributed

Since the p-value is much smaller than alpha, we reject the null hypothesis: Tesla's daily log returns are not normally distributed. How unfortunate!

# normality test
from scipy.stats import shapiro

# Drop the first row, whose log return is NaN
stat, p = shapiro(TSLA['log_return'].dropna())
print('Statistics=%.3f, p=%.3f' % (stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

Statistics=0.927, p=0.000

Sample does not look Gaussian (reject H0)
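A rejection by itself does not say how the data deviate from normality. For stock returns the usual suspect is fat tails, which show up as positive excess kurtosis. The sketch below illustrates the idea on synthetic data only: one normal sample and one Student-t sample with 3 degrees of freedom, both generated here as assumptions and unrelated to Tesla's actual returns. It also runs the Jarque-Bera test, a different normality test from the Shapiro-Wilk test used above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic "returns": one thin-tailed sample, one fat-tailed sample
samples = {
    "normal": rng.normal(loc=0.0, scale=0.03, size=2000),
    "student-t": 0.03 * rng.standard_t(df=3, size=2000),
}

results = {}
for name, sample in samples.items():
    excess_kurt = stats.kurtosis(sample)        # about 0 for normal data
    jb_stat, jb_p = stats.jarque_bera(sample)   # H0: the data are normal
    results[name] = (excess_kurt, jb_p)
    print(f"{name:10s} excess kurtosis={excess_kurt:6.2f}  JB p-value={jb_p:.4f}")
```

Applied to the real data, the same two calls would take TSLA['log_return'].dropna() as input.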

3.4 Visualization

Finally, we visualize our data. The first thing to plot is Tesla's stock price:

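import matplotlib.dates as mdate

# Set the figure size before plotting so that it applies to this figure
plt.rcParams['figure.figsize'] = (10.0, 6.0)
plt.plot(TSLA.Close, 'r', linewidth=0.8, markersize=12, label='Tesla Stock price')
plt.legend()
plt.xlabel('Date')
plt.ylabel('Price')
ax = plt.gca()
ax.xaxis.set_major_formatter(mdate.DateFormatter('%Y-%m-%d'))
# One tick every two years ('YS' = year start)
plt.xticks(pd.date_range(start, end, freq='2YS'))
plt.yticks(fontsize=10, rotation=0)
# Convert the date strings to timestamps for the axis limits
plt.xlim(pd.Timestamp(start), pd.Timestamp(end))
plt.grid(c='k', linestyle='--')
plt.show()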

Fig.1

Next, we plot the daily percentage returns.

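# Set the figure size before plotting so that it applies to this figure
plt.rcParams['figure.figsize'] = (10.0, 6.0)
plt.plot(TSLA.pct_return, 'r', linewidth=0.8, markersize=12, label='Tesla pct return')
plt.legend()
plt.xlabel('Date')
plt.ylabel('pct return')
ax = plt.gca()
ax.xaxis.set_major_formatter(mdate.DateFormatter('%Y-%m-%d'))
# One tick every two years ('YS' = year start)
plt.xticks(pd.date_range(start, end, freq='2YS'))
plt.yticks(fontsize=10, rotation=0)
# Convert the date strings to timestamps for the axis limits
plt.xlim(pd.Timestamp(start), pd.Timestamp(end))
plt.grid(c='k', linestyle='--')
plt.show()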

Fig.2

Finally, we plot the log returns.

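# Set the figure size before plotting so that it applies to this figure
plt.rcParams['figure.figsize'] = (10.0, 6.0)
plt.plot(TSLA.log_return, 'r', linewidth=0.8, markersize=12, label='Tesla log return')
plt.legend()
plt.xlabel('Date')
plt.ylabel('log return')
ax = plt.gca()
ax.xaxis.set_major_formatter(mdate.DateFormatter('%Y-%m-%d'))
# One tick every two years ('YS' = year start)
plt.xticks(pd.date_range(start, end, freq='2YS'))
plt.yticks(fontsize=10, rotation=0)
# Convert the date strings to timestamps for the axis limits
plt.xlim(pd.Timestamp(start), pd.Timestamp(end))
plt.grid(c='k', linestyle='--')
plt.show()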

Fig.3
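The three plots differ only in the series, the legend label and the y-axis label, so the shared styling can be wrapped in one function. The following is a sketch, not the article's own code: plot_series is a hypothetical helper name, and the price series is synthetic so that the example runs without a download.

```python
import matplotlib.dates as mdate
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_series(series, label, ylabel):
    """Plot one date-indexed series with the styling used in the three figures."""
    fig, ax = plt.subplots(figsize=(10.0, 6.0))
    ax.plot(series.index, series.values, "r", linewidth=0.8, label=label)
    ax.legend()
    ax.set_xlabel("Date")
    ax.set_ylabel(ylabel)
    ax.xaxis.set_major_formatter(mdate.DateFormatter("%Y-%m-%d"))
    ax.set_xlim(series.index[0], series.index[-1])
    ax.grid(c="k", linestyle="--")
    return fig, ax


# Synthetic stand-in for TSLA.Close (not real prices)
dates = pd.date_range("2019-01-01", periods=250, freq="B")
rng = np.random.default_rng(1)
close = pd.Series(100 * np.exp(rng.normal(0, 0.02, size=250).cumsum()), index=dates)

fig, ax = plot_series(close, "Synthetic price", "Price")
```

With the real data, calls such as plot_series(TSLA.Close, 'Tesla Stock price', 'Price') followed by plt.show() would replace the three near-identical blocks.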


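This article is reposted from 健谈始于戊戌年. If you believe it infringes your rights, please email contact@modb.pro with supporting evidence; once the claim is verified, 墨天轮 will remove the content immediately.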
