Temporal Data Analysis System


A Temporal Data Analysis System is a sequential data analysis system that applies a temporal data analysis algorithm to solve a temporal data analysis task.



References

2016

Gmelli_Supervised_Duration_Estimator.160913

A Supervised Duration Estimator

This program demonstrates a supervised duration estimation system (as defined in http://GM-RKB.com/supervised_demand_estimator)

Gabor Melli, Sept 13, 2016


Description

TBD


Source code


The following session uses Python 2.7 and iPython/Jupyter 4.2

Read the data into a Pandas DataFrame

In [1]:
import pandas as pd
import numpy as np # for random sampling and RMSE analysis

# read the data in a series
data_series = pd.to_datetime( pd.read_json('logins.json')['login_time'])
#print data_series[:4], "\n\n", data_series.describe() # describe the series

# cast the data into a dataframe
data_df = pd.DataFrame({'id':data_series.index, 'login_time':data_series.values})

# report a random sample
print data_df.reindex(np.random.permutation(data_df.index))[0:10].sort_index()
          id          login_time
36661  36661 1970-02-20 00:52:27
40499  40499 1970-02-22 21:35:11
50513  50513 1970-03-06 02:12:32
66557  66557 1970-03-19 22:36:50
68080  68080 1970-03-21 01:41:45
69344  69344 1970-03-21 23:01:33
70263  70263 1970-03-22 11:31:24
72386  72386 1970-03-25 00:31:57
72912  72912 1970-03-25 20:45:42
88183  88183 1970-04-09 00:34:50
In [2]:
# summarize/describe the data
data_df['login_time'].describe()
Out[2]:
count                   93142
unique                  92265
top       1970-02-12 11:16:53
freq                        3
first     1970-01-01 20:12:16
last      1970-04-13 18:57:38
Name: login_time, dtype: object

Observations on the raw timeseries

  • The events in the timeseries start on 1970-01-01 20:12:16 and end on 1970-04-13 18:57:38
  • There are 93,142 events (92,265 of them at unique times)

Let's analyze the 15-minute resolution behavior

  • Let's discretize the data at a 15-minute resolution (as required) to create a univariate timeseries.
In [3]:
mins15_df = data_df.set_index('login_time').groupby(pd.TimeGrouper(freq='15Min')).count().fillna(0)
mins15_df = mins15_df.reset_index(0) # make the index an integer (instead of the time period)
mins15_df.columns = ['login_period','count']

# Let's also add some additional temporal attributes
mins15_df['dayofweek'] = mins15_df['login_period'].dt.dayofweek
mins15_df['weekofyear'] = mins15_df['login_period'].dt.weekofyear
mins15_df['hourofday'] = mins15_df['login_period'].dt.hour
mins15_df

# report a random sample
# print mins15_df.reindex(np.random.permutation(mins15_df.index))[0:10].sort(); print "\n\n"

# summarize at this level
mins15_df.describe()
Out[3]:
              count    dayofweek   weekofyear    hourofday
count   9788.000000  9788.000000  9788.000000  9788.000000
mean       9.515938     3.035554     8.325296    11.496935
std        8.328818     2.012722     4.215948     6.922294
min        0.000000     0.000000     1.000000     0.000000
25%        3.000000     1.000000     5.000000     5.000000
50%        7.000000     3.000000     8.000000    11.000000
75%       13.000000     5.000000    12.000000    17.000000
max       73.000000     6.000000    16.000000    23.000000

Observations on the 15-minute timeperiods:

  • There are 9,788 15-minute timeperiods between the first and last event.
  • Across all timeperiods, the number of events ranges from 0 to 73.
  • On average there are 9.516 events per period, while the median is 7 logins.
  • Every day-of-the-week appears to have login events.
  • Every week from week 1 to week 16 appears to have login events.
  • Every hour-of-the-day appears to have login events.

Peek at the most recent timeperiods

In [4]:
mins15_df.tail(10)
Out[4]:
login_period count dayofweek weekofyear hourofday
9778 1970-04-13 16:30:00 4 0 16 16
9779 1970-04-13 16:45:00 3 0 16 16
9780 1970-04-13 17:00:00 5 0 16 17
9781 1970-04-13 17:15:00 3 0 16 17
9782 1970-04-13 17:30:00 9 0 16 17
9783 1970-04-13 17:45:00 5 0 16 17
9784 1970-04-13 18:00:00 5 0 16 18
9785 1970-04-13 18:15:00 2 0 16 18
9786 1970-04-13 18:30:00 7 0 16 18
9787 1970-04-13 18:45:00 6 0 16 18

Observations on the events in the most recent 15-minute timeperiod(s)

  • The login-events occurred on a Monday (day-of-week/dow=0)
  • The login-events occurred in the early evening (just before 19:00)
  • The very last 15-minute period appears to be complete
  • The median number of events in that range is 5 logins.
  • For prediction, then, the next few timeperiods will likely fall in the range of 4 to 6 logins.

Prepare for visualizations

In [5]:
import matplotlib as mpl
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.dates import MO, TU, WE, TH, FR, SA, SU

days = mdates.DayLocator()
dtfmt = mdates.DateFormatter('%a %b %d')
saturdays = mdates.WeekdayLocator(byweekday=SA)

Analyze the timeperiod event count history

In [6]:
fig, ax = plt.subplots()
mins15_df['count'].value_counts().plot(ax=ax, kind='bar')
fig.set_size_inches(16, 4)
plt.grid(True)
plt.show()

Observations on the timeperiod count history

  • Most 15-minute timeperiods contain few events: the mode is 2.
  • For predictions then, the naive "majority" (modal) prediction is 2.
  • There are many timeperiods with zero login-events (both observations are cross-checked in the sketch below).
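
A minimal cross-check of these two observations (hypothetical, not part of the original session; it assumes the mins15_df dataframe from In [3] is in scope):

# hypothetical cross-check of the modal count and the zero-login periods
print mins15_df['count'].mode()          # expect a single modal value of 2
print (mins15_df['count'] == 0).sum()    # number of timeperiods with zero logins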

Analyze the timeperiod event count distribution

In [7]:
fig, ax = plt.subplots()
plt.hist(mins15_df['count'], bins=73, normed=True, cumulative=False)
fig.set_size_inches(16, 4)
plt.grid(True)
plt.show()

Observations on the timeperiod count distribution

  • The distribution of 15-minute-period login-event counts appears to follow a Poisson distribution; a quick goodness-of-fit check is sketched below.
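
A minimal goodness-of-fit sketch (hypothetical, not in the original session; it assumes scipy is installed): compare the empirical 15-minute count frequencies against a Poisson pmf whose rate is set to the sample mean (~9.5).

# hypothetical Poisson comparison; a mismatch would suggest overdispersion
from scipy.stats import poisson

lam = mins15_df['count'].mean()
observed = mins15_df['count'].value_counts(normalize=True).sort_index()
expected = poisson.pmf(observed.index.values, lam)
for k, obs, exp in zip(observed.index.values, observed.values, expected)[:10]:
    print "count=%2d  observed=%.4f  poisson=%.4f" % (k, obs, exp)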

Analyze the cumulative timeperiod count distribution

In [8]:
fig, ax = plt.subplots()
plt.hist(mins15_df['count'], bins=73, normed=True, cumulative=True)
fig.set_size_inches(16, 4)
plt.grid(True)
plt.show()

Observations on cumulative timeperiod counts

  • Over half of the timeperiods have 7 or fewer login-events (a one-line cross-check follows).
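
The cross-check, as a hypothetical one-liner (not in the original session):

# fraction of 15-minute timeperiods with at most 7 login-events
print "Share of periods with <= 7 logins: %.3f" % (mins15_df['count'] <= 7).mean()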

Now let's look at the daily/day-of-year behavior

In [9]:
periodminutes  = 15
inaday15mins   = (60/periodminutes)*24  # 96 15-minute periods per day

# daily event counts, scaled to the average logins per 15-minute period
days1_df = data_df.set_index('login_time').groupby(pd.TimeGrouper(freq='1D')).count().fillna(0)/inaday15mins
days1_df = days1_df.reset_index(0)
days1_df.columns = ['login_period','count']
days1_trim_df = days1_df[1:-1]
In [10]:
fig = plt.figure()
ax = fig.add_subplot(111)

ax.plot(days1_df['login_period'], days1_df['count'], linestyle='-', linewidth=3)

ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
ax.grid(True, which = "both", linestyle=":")
ax.set_ylabel('Count')
labels = ax.get_xticklabels()
plt.setp(labels, rotation=90, fontsize=12)
fig.set_size_inches(16, 4)
plt.show()

Observations on daily/day-of-year behavior

  • There is weekly periodicity, and it appears to peak on the weekends (a lag-7 autocorrelation check is sketched below).
  • There appears to be a slightly increasing trend.
  • The week leading up to the weekend of Sat Mar 21 was unusually busy (followed by a relatively quieter weekend).
  • The starting day and the ending day are not fully represented.
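
A minimal sketch of the periodicity check mentioned above (hypothetical, not in the original session): the correlation of the trimmed daily series with itself at a 7-day lag should be clearly positive if the weekly pattern is real.

# lag-7 autocorrelation of the trimmed daily counts
s = days1_trim_df['count'].values
print "Lag-7 autocorrelation: %.3f" % np.corrcoef(s[7:], s[:-7])[0, 1]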

Let's visualize the "trimmed" daily timeseries

In [11]:
mins15_trim_df = mins15_df[1:-1]
In [12]:
fig = plt.figure()
ax = fig.add_subplot(111)

ax.plot(days1_trim_df['login_period'], days1_trim_df['count'], linestyle='-', linewidth=5)
ax.plot(days1_df['login_period'], days1_df['count'], linestyle='--', linewidth=2)

ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_minor_locator(days)
ax.xaxis.set_major_formatter(dtfmt)
ax.grid(True, which = "both", linestyle=":")
ax.set_ylabel('15min Login Count')
labels = ax.get_xticklabels()
plt.setp(labels, rotation=90, fontsize=12)
fig.set_size_inches(16, 4)
plt.show()

Observations

  • The trimming seems effective.
  • ...

Next, let's analyze the day-of-week behavior

In [13]:
dayofweek_df = mins15_df.groupby(['dayofweek']).describe().unstack()
dayofweek_df = dayofweek_df[['count']].reset_index(0)
dayofweek_df_ = dayofweek_df['count'][:] # extract only the count
print dayofweek_df_['count']
0    1420.0
1    1344.0
2    1344.0
3    1360.0
4    1440.0
5    1440.0
6    1440.0
Name: count, dtype: float64

Observations on day-of-week data

  • There are slightly fewer Mon(dow=0), Tue(dow=1), Wed(dow=2), and Thu(dow=3) timeperiods in the timeseries than Fri-Sun.
In [14]:
plt.bar(dayofweek_df_.index.values, 
        dayofweek_df_['mean'].values, 
        yerr=dayofweek_df_['std'].values,
        color='lightgray')

plt.ylabel('15min Logins')
plt.xlabel('Day-of-Week')
plt.title('Day-of-Week 15min-Logins Distribution')
axes = plt.gca()
fig = plt.gcf()
fig.set_size_inches(14, 8)
plt.show()

Observations on day-of-week behavior

  • Saturday(dow=5) and Sunday(dow=6) are typically the days with the most 15min login-events (~13 / 15min).
  • Monday(dow=0) and Tuesday(dow=1) are the ones with the fewest (~6.5 / 15min).
  • There is significant variance in 15min behavior for any given day of the week.
  • Therefore, day-of-week appears to be a predictive factor.

Now the week-level (week-of-year) analysis

In [15]:
weekofyear_df = mins15_df.groupby(['weekofyear']).describe().unstack()
weekofyear_df = weekofyear_df[['count']].reset_index(0)
weekofyear_df['count'][['count','mean']]
Out[15]:
count mean
0 304.0 7.809211
1 672.0 7.763393
2 672.0 7.474702
3 672.0 7.069940
4 672.0 7.059524
5 672.0 8.291667
6 672.0 8.802083
7 672.0 10.468750
8 672.0 9.752976
9 672.0 11.008929
10 672.0 10.919643
11 672.0 13.325893
12 672.0 10.840774
13 672.0 12.046131
14 672.0 9.659226
15 76.0 5.197368

Observations on week-of-year

  • The first and especially the most recent "week" have an incomplete number of 15min timeperiods. Therefore, be careful when making predictions based on week-level information.
  • The mean now more strongly suggests that login-events are increasing.
In [16]:
weekofyear_trim_df = weekofyear_df['count'][1:-1]
In [17]:
plt.bar(weekofyear_trim_df.index.values, 
        weekofyear_trim_df['mean'].values, 
        yerr=weekofyear_trim_df['std'].values,
        color='lightgray')

plt.ylabel('15min Logins')
plt.xlabel('Week-of-Year')
plt.title('Week-of-Year 15min-Logins Distribution')
axes = plt.gca()
axes.set_ylim([0,weekofyear_trim_df['mean'].max()+weekofyear_trim_df['std'].max()]) # force y to start at zero
fig = plt.gcf()
fig.set_size_inches(14, 8)

plt.show()

Observations

  • The count by week starts at ~7.5 and increases to ~11 (a least-squares slope estimate is sketched below).
  • The most recent complete week trended down.
  • As expected, there is significant variance in 15min behavior over an entire week.
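
A minimal sketch of the slope estimate mentioned above (hypothetical, not in the original session):

# least-squares slope of the trimmed weekly means
slope, intercept = np.polyfit(weekofyear_trim_df.index.values.astype(float),
                              weekofyear_trim_df['mean'].values, 1)
print "Estimated trend: %+.3f logins/15min per week" % slope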

Let's look at hour-of-day

In [18]:
hourofday_df = mins15_df.groupby(['hourofday']).describe().unstack()
hourofday_df = hourofday_df[['count']].reset_index(0)
hourofday_df['count'][['count','mean']]
Out[18]:
count mean
0 408.0 14.688725
1 408.0 15.482843
2 408.0 14.215686
3 408.0 11.840686
4 408.0 12.338235
5 408.0 7.218137
6 408.0 2.789216
7 408.0 1.997549
8 408.0 2.004902
9 408.0 3.742647
10 408.0 7.509804
11 408.0 14.213235
12 408.0 12.166667
13 408.0 8.850490
14 408.0 8.397059
15 408.0 7.446078
16 408.0 6.941176
17 408.0 6.333333
18 408.0 7.303922
19 404.0 8.007426
20 408.0 10.056373
21 408.0 13.781863
22 408.0 16.193627
23 408.0 14.848039

Observations on hour-of-day data

  • Every hour-of-day has a nearly equal number of 15min timeslots (404-408). Therefore it may be safe to use hour-level information.
In [19]:
plt.bar(hourofday_df.index,
        hourofday_df['count']['mean'],
        yerr=hourofday_df['count']['std'],
        color='lightgrey')

plt.ylabel('15min Logins')
plt.xlabel('Hour-of-Day')
plt.title('Hour-of-Day 15min Logins Distribution')
axes = plt.gca()
axes.set_ylim([0,30])
fig = plt.gcf()
fig.set_size_inches(14, 8)
plt.show()

Observations on hour-of-day behavior

  • There is a pattern to login-events by hour-of-day.
  • There is significant variance in 15min behavior for any given hour.
  • Therefore, time-of-day appears to be a predictive factor.

Decomposition of the 15min logins time series

In [20]:
from statsmodels.tsa.seasonal import seasonal_decompose

# Extract login counts indexed by date
mins15_ts = pd.Series (data=mins15_df ['count'].tolist (), 
                        index=mins15_df ['login_period'])
mins15_ts = mins15_ts.asfreq (pd.infer_freq (mins15_ts.index))
mins15_ts = mins15_ts.astype (np.float64)

# Decompose the time series with weekly frequency:
# 96 15-minute periods per day * 7 days = 672 periods per week
decomp_freq = 24*60/15 * 7
mins15_decomposition = seasonal_decompose (mins15_ts.values, freq=decomp_freq)
mins15_trend = mins15_decomposition.trend
mins15_seasonal = mins15_decomposition.seasonal
mins15_residual = mins15_decomposition.resid
C:\Anaconda2\lib\site-packages\statsmodels\tsa\filters\filtertools.py:28: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  return np.r_[[np.nan] * head, x, [np.nan] * tail]
In [21]:
### Plot decomposition

import matplotlib.gridspec as gspec

fig = plt.figure ()
gs =gspec.GridSpec (4, 1)
ax = fig.add_subplot (gs [0])
ax.plot (mins15_ts.index, mins15_ts.values, label="Original")
ax.set_ylabel ("Original")
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=12)

ax = fig.add_subplot (gs [1])
ax.plot (mins15_ts.index, mins15_trend, label="Trend")
ax.set_ylabel ("Trend")
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=12)

ax = fig.add_subplot (gs [2])
ax.plot (mins15_ts.index, mins15_seasonal, label="Seasonality")
ax.set_ylabel ("Seasonality")
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=12)

ax = fig.add_subplot (gs [3])
ax.plot (mins15_ts.index, mins15_residual, label="Residuals")
ax.set_ylabel ("Residuals")
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)

labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=12)

fig.set_size_inches (16, 20)
gs.tight_layout (fig)
plt.show()

Observations based on timeseries decomposition

  • There is a clear increasing trend in the logins over time
  • There is a clear seasonal effect in the logins, with a period of 1 week
  • The residuals appear to be normally distributed (a quick check is sketched below).
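
A minimal sketch of the normality check mentioned above (hypothetical, not in the original session; assumes scipy is installed). seasonal_decompose pads the residual array with NaNs at both ends, so these are dropped first.

# D'Agostino-Pearson normality test on the decomposition residuals
from scipy import stats
resid = mins15_residual[~np.isnan(mins15_residual)]
print stats.normaltest(resid)   # a small p-value would argue against normality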

Testing for stationarity

To test whether the time series is stationary, we perform an augmented Dickey-Fuller test.

In [22]:
from statsmodels.tsa.stattools import adfuller

# Get first difference of time series
mins15_ts_diff = mins15_ts - mins15_ts.shift ()
mins15_ts_diff.dropna (inplace=True)

test_result = adfuller (mins15_ts, autolag="AIC")

index= ['Test statistic',
        'p-value',
        'No. lags used',
        'No. of observations used'
        ]
print "Dikey-Fuller Test Results"
for i in range (len (index)):
    print index [i], ":", "%.5f" % test_result [i]
for key, value in test_result[4].items():
        print 'Critical Value (%s)' % key, "%.5f" % value
Dickey-Fuller Test Results
Test statistic : -10.33795
p-value : 0.00000
No. lags used : 38.00000
No. of observations used : 9749.00000
Critical Value (5%) -2.86184
Critical Value (1%) -3.43102
Critical Value (10%) -2.56693

Observations from Dickey-Fuller test

  • The Dickey-Fuller test statistic is smaller than the 1% critical value, so we reject the unit-root null hypothesis and conclude that the time series is stationary (Bisgaard & Kulahci, 2011).
  • Because the series is stationary, we can safely apply an ARIMA model.



Predictive Modeling


  • Let's predict the number of login-events
  • First let's evaluate which model in general achieves the lowest error
  • Then apply that model to the subsequent four 15-minute timeperiods.

Analyze the performance of a naive baseline model

  • A simple model is to naively predict a constant value, such as the overall expected value or median.
  • Recall that the overall median of login-events in 15-minute periods is 7.
  • Let's also explore other nearby values; specifically, the fifteen values from zero to fourteen.
In [23]:
# Let's switch to NumPy (np), which facilitates the calculation of several hard-coded predictions.
np.set_printoptions(precision=2) # for pretty print

# gather the squared errors over fifteen different "hard-coded" predictions.
predictions15 = np.array([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14])
In [24]:
# Calculate the root mean-squared error (RMSE) for all fifteen values;
# since there is no training, we do not need to worry about information leakage

arr = np.empty((0,15), int)
for index, row in mins15_df.iterrows():
    timeperiod_count  = row['count']
    predictions15_gap = predictions15 - timeperiod_count
    predictions15_gap_sq = np.square(predictions15_gap)

    arr = np.append(arr, np.array([predictions15_gap_sq]), axis=0)

# print arr.mean(axis=0)
print np.sqrt(arr.mean(axis=0))
[ 12.65  11.91  11.22  10.57   9.99   9.47   9.04   8.7    8.47   8.34
   8.34   8.46   8.69   9.03   9.46]

Observations of constant-value model predictions

  • The root mean-squared error for the overall median of 7 is 8.7.
  • The smallest root mean-squared error over the entire dataset occurs with a prediction of between 9 and 10 logins: 8.34. This is expected, since the constant that minimizes mean-squared error is the mean (9.516); an equivalent vectorized computation is sketched below.
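
The vectorized equivalent mentioned above (a hypothetical alternative to the loop in In [24], not part of the original session):

# broadcast all fifteen constant predictions against every timeperiod at once
counts = mins15_df['count'].values.astype(float)
rmse = np.sqrt(np.mean((predictions15[:, None] - counts) ** 2, axis=1))
print rmse   # minimum near the overall mean, as MSE theory predicts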

Use the "Mondays" model?

  • Let's try to exploit the known temporal pattern of day-of-week
  • Given that the expected predictions are for a Monday, what is the behavior of a "Monday" model?
In [25]:
arr = np.empty((0,15), int)
for index, row in mins15_df[(mins15_df.dayofweek == 0)].iterrows():
    timeperiod_count  = row['count']
    predictions15_gap = predictions15 - timeperiod_count
    predictions15_gap_sq = np.square(predictions15_gap)

    arr = np.append(arr, np.array([predictions15_gap_sq]), axis=0)

rmse=np.sqrt(arr.mean(axis=0))
# pd.DataFrame(data=rmse[1:])
print rmse
[ 7.99  7.24  6.56  5.97  5.49  5.17  5.03  5.09  5.34  5.75  6.29  6.94
  7.67  8.45  9.27]

Observations on Monday-model predictions

  • The minimal root mean-squared error over Mondays occurs with a prediction of 6 logins: 5.03.

What is the error when restricted to 7PM?

  • Let's try to exploit the known temporal pattern of hour-of-day.
  • Given that the expected predictions are for ~7PM, what is the behavior of a "7PM" model?
In [26]:
arr = np.empty((0,15), int)
for index, row in mins15_df[(mins15_df.hourofday == 19)].iterrows():
    timeperiod_count  = row['count']
    predictions15_gap = predictions15 - timeperiod_count
    predictions15_gap_sq = np.square(predictions15_gap)

    arr = np.append(arr, np.array([predictions15_gap_sq]), axis=0)

rmse=np.sqrt(arr.mean(axis=0))
# pd.DataFrame(data=rmse[1:])
print rmse
[ 9.5   8.67  7.89  7.16  6.5   5.93  5.49  5.21  5.11  5.21  5.49  5.93
  6.49  7.15  7.88]

Observations on 7PM predictions

  • The minimal root mean-squared error over the 7PM hour occurs with a prediction of 8 logins: 5.11 (higher than the Monday model's minimum).

What is the best prediction when restricted to Monday at 7PM?

  • Let's combine the two known temporal patterns of day-of-week and hour-of-day
In [27]:
mon7pm_series = mins15_df[(mins15_df.hourofday == 19) & (mins15_df.dayofweek == 0)]['count']
mon7pm_df = pd.DataFrame(data=mon7pm_series.reset_index()[['count']])

arr = np.empty((0,15), int)
for index, row in mon7pm_df.iterrows():
    timeperiod_count  = row['count']
    predictions15_gap = predictions15 - timeperiod_count
    predictions15_gap_sq = np.square(predictions15_gap)

    arr = np.append(arr, np.array([predictions15_gap_sq]), axis=0)

rmse=np.sqrt(arr.mean(axis=0))
# pd.DataFrame(data=rmse[1:])
print rmse
[ 5.48  4.66  3.91  3.31  2.93  2.87  3.15  3.68  4.38  5.18  6.04  6.94
  7.86  8.8   9.75]

Observations on Monday at 7PM predictions

  • The minimal root mean-squared error over Mondays at 7PM occurs with a prediction of 5 logins: 2.87.
  • This error is significantly smaller than using either pattern alone

Now characterize the distribution of this model

In [28]:
mon7pm_df.describe()
Out[28]:
count
count 56.000000
mean 4.678571
std 2.880070
min 0.000000
25% 3.000000
50% 4.000000
75% 6.250000
max 14.000000
  • This model's expected value is 4.68 and has a standard deviation of 2.88.

Unfortunately, the day-of-week and hour-of-day time-series models above were not evaluated as a general approach (for every day/hour combination), nor was testing performed without [[information leakage]] (of future day/hour combinations), so we cannot report this model's overall performance.

TO DO: test every combination, and do so based only on prior data. A sketch of such a walk-forward evaluation follows.
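
A minimal sketch of that walk-forward evaluation (hypothetical, not part of the original session): every prediction uses only strictly prior timeperiods, so there is no information leakage. The quadratic filtering is slow but keeps the protocol explicit.

# leakage-free evaluation of a (day-of-week, hour-of-day) conditional-mean model
sq_errors = []
start = len(mins15_df) / 2                 # hold out the second half as the test set
for i in range(start, len(mins15_df)):
    row = mins15_df.iloc[i]
    history = mins15_df.iloc[:i]           # strictly prior timeperiods
    cell = history[(history.dayofweek == row['dayofweek']) &
                   (history.hourofday == row['hourofday'])]['count']
    if len(cell) == 0:
        continue                           # no prior data for this day/hour cell
    sq_errors.append((cell.mean() - row['count']) ** 2)
print "Walk-forward RMSE: %.2f" % np.sqrt(np.mean(sq_errors))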




ARIMA modelling of 15-min logins

One of the best-practice approaches to stationary univariate timeseries predictive modeling is ARIMA.

  • Let's select the AR and MA orders from the first-order difference of the timeseries; since the Dickey-Fuller test indicates the series is already stationary, the integration order is set to d=0.

ARIMA model parameterization

First we need to determine sound parameter settings.

  • An intuitive way to do this is to look at the Autocorrelation and Partial autocorrelation plots.
In [29]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot the ACF and PACF
plot_acf (mins15_ts_diff, lags=20)
plot_pacf (mins15_ts_diff, lags=20)
plt.tight_layout ()
plt.show ()

Selection of ARIMA's parameters

  • The ACF quickly decays after lag 1. This indicates that the moving-average order (q) of the ARIMA model should be set to 1.
  • The PACF quickly decays after lag 3, which suggests that the autoregressive order (p) of the ARIMA model should be set to 3.

Building the ARIMA (3,0,1) model for the 15min logins data

In [30]:
from statsmodels.tsa.arima_model import ARIMA

# Create and fit the model 
model = ARIMA (mins15_ts, order=(3, 0, 1))
mins15_arima = model.fit ()
In [31]:
print mins15_arima.summary ()
                              ARMA Model Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 9788
Model:                     ARMA(3, 1)   Log Likelihood              -28362.052
Method:                       css-mle   S.D. of innovations              4.387
Date:                Wed, 10 Aug 2016   AIC                          56736.104
Time:                        23:08:50   BIC                          56779.238
Sample:                    01-01-1970   HQIC                         56750.720
                         - 04-13-1970
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          9.5027      0.434     21.874      0.000         8.651    10.354
ar.L1.y        0.6917      0.070      9.883      0.000         0.555     0.829
ar.L2.y        0.1481      0.041      3.612      0.000         0.068     0.228
ar.L3.y        0.0737      0.024      3.031      0.002         0.026     0.121
ma.L1.y       -0.1507      0.070     -2.157      0.031        -0.288    -0.014
                                    Roots
=============================================================================
                 Real           Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.0701           -0.0000j            1.0701           -0.0000
AR.2           -1.5401           -3.2114j            3.5616           -0.3212
AR.3           -1.5401           +3.2114j            3.5616            0.3212
MA.1            6.6358           +0.0000j            6.6358            0.0000
-----------------------------------------------------------------------------

Observations of ARIMA's performance

  • The RMSE of the ARIMA (3,0,1) model is approximately 4.4 which is significantly lower than the naive constant-value model's 8.3.
In [32]:
# Plot ARIMA model and RMSE
# The plot is using obs every 96 intervals to simplify the output
mins15_arima_ts = np.rint (pd.Series (mins15_arima.fittedvalues, copy=True))
rmse = np.sqrt (sum ((mins15_arima_ts-mins15_ts)**2)/len(mins15_ts))

fig = plt.figure ()
plt.plot (mins15_ts [::96], 'k')
plt.plot (mins15_arima_ts [::96], 'r--', lw=2)
plt.title ('RMSE: %.4f' % rmse)
plt.legend (["Original", "ARIMA (3,0,1)"])
fig.set_size_inches (15, 8)
plt.show ()
  • With the exception of the very sharp spikes in the number of logins, the ARIMA (3,0,1) model appears to be a good fit for the data.

Predict the next 4 data points using the ARIMA (3, 0, 1) model

In [33]:
mins15_pred = mins15_arima.predict (start=9760, end=9800, dynamic=False)
mins15_pred = np.rint (mins15_pred)

fig = plt.figure ()
plt.plot (mins15_ts [9760:9800], 'o')
plt.plot (mins15_pred, 'r--', lw=2)
plt.legend (["Original", "Predicted"])
fig.set_size_inches (16,8)
plt.show ()
C:\Anaconda2\lib\site-packages\statsmodels\base\data.py:503: FutureWarning: TimeSeries is deprecated. Please use Series
  return TimeSeries(result, index=self.predict_dates)
In [34]:
print "Predicted values:"
print mins15_pred[28:32]
Predicted values:
1970-04-13 19:00:00    6.0
1970-04-13 19:15:00    6.0
1970-04-13 19:30:00    7.0
1970-04-13 19:45:00    7.0
Freq: 15T, dtype: float64

ARIMA predictions for the next four 15min timeperiods:

  • 1970-04-13 19:00:00 6.0
  • 1970-04-13 19:15:00 6.0
  • 1970-04-13 19:30:00 7.0
  • 1970-04-13 19:45:00 7.0

Forecast probability distributions (confidence)

In the next cells we will take a closer look at the forecasted values and their confidence intervals.

In [35]:
fig = mins15_arima.plot_predict(start=9770, end=9789, dynamic=False, plot_insample=True)
plt.legend (["Forecast", "Original", "95% Confidence Interval"])
fig.set_size_inches (16, 8)
plt.show ()
C:\Anaconda2\lib\site-packages\statsmodels\tsa\arima_model.py:1724: FutureWarning: TimeSeries is deprecated. Please use Series
  forecast = TimeSeries(forecast, index=self.data.predict_dates)
In [36]:
print "Confidence intervals for the predicted parameters:"
print mins15_arima.conf_int ()
Confidence intervals for the predicted parameters:
                0          1
const    8.651236  10.354129
ar.L1.y  0.554510   0.828842
ar.L2.y  0.067722   0.228451
ar.L3.y  0.026034   0.121306
ma.L1.y -0.287658  -0.013739
In [37]:
forecast = mins15_arima.forecast (1, alpha=.1)
print "Predicted value for the next 15 mins: ", np.rint (forecast [0] [0])
print "90% confidence intervals for the prediction: ", np.rint (forecast [2] [0] [0]), np.rint (forecast [2] [0] [1])

forecast = mins15_arima.forecast (1, alpha=.05)
print "95% confidence intervals for the prediction: ", np.rint (forecast [2] [0] [0]), np.rint (forecast [2] [0] [1])
Predicted value for the next 15 mins:  6.0
90% confidence intervals for the prediction:  -1.0 13.0
95% confidence intervals for the prediction:  -2.0 15.0

References:

  • Søren Bisgaard, and Murat Kulahci. ([[2011]]). "Time Series Analysis and Forecasting by Example." Wiley.


Appendix


Let's look at a 7-day moving average

In [38]:
days1_trim_df['movavg7d'] = pd.rolling_mean(days1_trim_df['count'], 7)
days1_trim_df.tail(10)
C:\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: FutureWarning: pd.rolling_mean is deprecated for Series and will be removed in a future version, replace with 
	Series.rolling(window=7,center=False).mean()
  if __name__ == '__main__':
C:\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
Out[38]:
login_period count movavg7d
92 1970-04-03 15.666667 11.986607
93 1970-04-04 19.677083 12.583333
94 1970-04-05 12.104167 12.046131
95 1970-04-06 6.406250 11.913690
96 1970-04-07 6.145833 11.645833
97 1970-04-08 7.270833 11.287202
98 1970-04-09 8.520833 10.827381
99 1970-04-10 10.510417 10.090774
100 1970-04-11 14.083333 9.291667
101 1970-04-12 14.677083 9.659226
In [39]:
fig = plt.figure()
ax = fig.add_subplot(111)

ax.plot(days1_trim_df['login_period'], days1_trim_df['count'], linestyle='-', linewidth=1)
ax.plot(days1_trim_df['login_period'], days1_trim_df['movavg7d'], linestyle='--', linewidth=3)

ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
ax.grid(True, which = "both", linestyle=":")

ax.set_ylabel('Count')
labels = ax.get_xticklabels()
plt.setp(labels, rotation=90, fontsize=12)

fig.set_size_inches(16, 4)

plt.show()

The End