A Temporal Data Analysis System is a sequential data analysis system that applies a temporal data analysis algorithm to solve a temporal data analysis task.

Context:
- It can range from being a Temporal Outlier Detection System, to being a Temporal Predictive Modeling System, to being ...
- It can range from being a Python-based Temporal Data Analysis System, an R-based Temporal Data Analysis System, ...
Example(s):
- a Time-to-Event Prediction System (solve a survival analysis task).
- Statsmodels Timeseries Python Module (statsmodels).
- HTS R Package.
- …
Counter-Example(s):
- a Natural Language Analysis System.
- a DNA Analysis System.
See: Sequential Data Analysis.

References

2016

Gmelli_Supervised_Duration_Estimator.160913

A Supervised Duration Estimator¶

This program demonstrates a supervised duration estimation system (as defined in http://GM-RKB.com/supervised_demand_estimator)

Gabor Melli, Sept 13, 2016

Description¶

TBD

Source code¶

The following session uses Python 2.7 and iPython/Jupyter 4.2

Read in the data in a Pandas DataFrame¶

In [1]:

import pandas as pd
import numpy as np # for random sampling and RMSE analysis

# read the data in a series
data_series = pd.to_datetime( pd.read_json('logins.json')['login_time'])
#print data_series[:4], "\n\n", data_series.describe() # describe the series

# cast the data into a dataframe
data_df = pd.DataFrame({'id':data_series.index, 'login_time':data_series.values})

# report a random sample
print data_df.reindex(np.random.permutation(data_df.index))[0:10].sort_index()

          id          login_time
36661  36661 1970-02-20 00:52:27
40499  40499 1970-02-22 21:35:11
50513  50513 1970-03-06 02:12:32
66557  66557 1970-03-19 22:36:50
68080  68080 1970-03-21 01:41:45
69344  69344 1970-03-21 23:01:33
70263  70263 1970-03-22 11:31:24
72386  72386 1970-03-25 00:31:57
72912  72912 1970-03-25 20:45:42
88183  88183 1970-04-09 00:34:50

In [2]:

# summarize/describe the data
data_df['login_time'].describe()

Out[2]:

count                   93142
unique                  92265
top       1970-02-12 11:16:53
freq                        3
first     1970-01-01 20:12:16
last      1970-04-13 18:57:38
Name: login_time, dtype: object

Observations on the raw timeseries¶

The events in the timeseries start on 1970-01-01 20:12:16 and end on 1970-04-13 18:57:38
There are 93,142 events (92,265 of them at unique times)

Let's analyze the 15 minute resolution behavior¶

Let's discretize the data at a 15 minute resolution (as required) to creat a univariate timeseries.

In [3]:

mins15_df = data_df.set_index('login_time').groupby(pd.TimeGrouper(freq='15Min')).count().fillna(0)
mins15_df = mins15_df.reset_index(0) # make the index an integer (instead of the time period)
mins15_df.columns = ['login_period','count']

# Let's also add some additional termporal attributes
mins15_df['dayofweek'] = mins15_df['login_period'].dt.dayofweek
mins15_df['weekofyear'] = mins15_df['login_period'].dt.weekofyear
mins15_df['hourofday'] = mins15_df['login_period'].dt.hour
mins15_df

# report a random sample
# print mins15_df.reindex(np.random.permutation(mins15_df.index))[0:10].sort(); print "\n\n"

# summarize at this level
mins15_df.describe()

Out[3]:

	count	dayofweek	weekofyear	hourofday
count	9788.000000	9788.000000	9788.000000	9788.000000
mean	9.515938	3.035554	8.325296	11.496935
std	8.328818	2.012722	4.215948	6.922294
min	0.000000	0.000000	1.000000	0.000000
25%	3.000000	1.000000	5.000000	5.000000
50%	7.000000	3.000000	8.000000	11.000000
75%	13.000000	5.000000	12.000000	17.000000
max	73.000000	6.000000	16.000000	23.000000

Observations on the 15-minute timeperiods:¶

There are 9,788 of 15-minute timeperiods between the first and last event.
For all timeperids the minimum and maximum number of events range from 0 to 73.
On average there are 9.516 events per period, while the median is 7 logins
Every day-of-the-week appears to have login events.
Every week day from week 1 to week 16 appears to have login events.
Every hour-of-the-day appears to have login events

Peek at the most recent timeperiods¶

In [4]:

mins15_df.tail(10)

Out[4]:

	login_period	count	weekofyear	hourofday
9778	1970-04-13 16:30:00	4	16	16
9779	1970-04-13 16:45:00	3	16	16
9780	1970-04-13 17:00:00	5	16	17
9781	1970-04-13 17:15:00	3	16	17
9782	1970-04-13 17:30:00	9	16	17
9783	1970-04-13 17:45:00	5	16	17
9784	1970-04-13 18:00:00	5	16	18
9785	1970-04-13 18:15:00	2	16	18
9786	1970-04-13 18:30:00	7	16	18
9787	1970-04-13 18:45:00	6	16	18

Observations on the events in the most recent 15-minute timeperiod(s)¶

The login-events occurred on a Monday (day-of-week/dow=0)
The login-events occurred in the early evening (just before 19:00)
The very last 15-minute period appears to be complete
The median number of events in that range is 5 logins.
For predictions then the next few timeperiods will likely be in the range of 4 to 6 logins.

Prepare for visualizations¶

In [5]:

import matplotlib as mpl
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.dates import MO, TU, WE, TH, FR, SA, SU

days = mdates.DayLocator()
dtfmt = mdates.DateFormatter('%a %b %d')
saturdays = mdates.WeekdayLocator(byweekday=SA)

Analyze the timeperiod event count history¶

In [6]:

fig, ax = plt.subplots()
mins15_df['count'].value_counts().plot(ax=ax, kind='bar')
fig.set_size_inches(16, 4)
plt.grid(True)
plt.show()

Observations on the timeperiod count history¶

Most 15minute timeperiods contain few events: the mode is 2.
For predictions then, the naive "majority" prediction is 2.
There are many timeperiods with zero login-events.

Analyze the timeperiod event count distribution¶

In [7]:

fig, ax = plt.subplots()
plt.hist(mins15_df['count'], bins=73, normed=True, cumulative=False)
fig.set_size_inches(16, 4)
plt.grid(True)
plt.show()

Observations on the timeperiod count distribution¶

The distribution of 15minute-period login-event counts appears to follow a Poisson distribution

Analyze the cummulative timeperiod count distribution¶

In [8]:

fig, ax = plt.subplots()
plt.hist(mins15_df['count'], bins=73, normed=True, cumulative=True)
fig.set_size_inches(16, 4)
plt.grid(True)
plt.show()

Observations on cummulative timeperiods counts¶

Over half of the timeperiods have 7 or fewer login-events

Now let's look at the daily/day-of-year behavior¶

In [9]:

periodminutes  = 15
inaday15mins   = (60/periodminutes)*24

days1_df = data_df.set_index('login_time').groupby(pd.TimeGrouper(freq='1D')).count().fillna(0)/inaday15mins
days1_df = days1_df.reset_index(0)
days1_df.columns = ['login_period','count']
days1_trim_df = days1_df[1:-1]

In [10]:

fig = plt.figure()
ax = fig.add_subplot(111)

ax.plot(days1_df['login_period'], days1_df['count'], linestyle='-', linewidth=3)

ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
ax.grid(True, which = "both", linestyle=":")
ax.set_ylabel('Count')
labels = ax.get_xticklabels()
plt.setp(labels, rotation=90, fontsize=12)
fig.set_size_inches(16, 4)
plt.show()

Observations on daily/day-of-year behavior¶

There is weekly periodicity and it appears to peak on the weekends.
There appears to be a slightly increasing trend.
The week leading up to the weekend of Sat Mar 21 was unusually busy (and a relatively quieter weekend).
The starting day and the ending day are not fully represented.

Let's visualize the "trimmed" daily timeseries¶

In [11]:

mins15_trim_df = mins15_df[1:-1]

In [12]:

fig = plt.figure()
ax = fig.add_subplot(111)

ax.plot(days1_trim_df['login_period'], days1_trim_df['count'], linestyle='-', linewidth=5)
ax.plot(days1_df['login_period'], days1_df['count'], linestyle='--', linewidth=2)

ax.xaxis.set_major_locator(saturdays) ; #ax.xaxis.set_major_locator(tuesdays)
ax.xaxis.set_minor_locator(days)
ax.xaxis.set_major_formatter(dtfmt)
ax.grid(True, which = "both", linestyle=":")
ax.set_ylabel('15min Login Count')
labels = ax.get_xticklabels()
plt.setp(labels, rotation=90, fontsize=12)
fig.set_size_inches(16, 4)
plt.show()

Observations¶

The trimming seems effective.
...

Next, let's analyze the day-of-week behavior¶

In [13]:

dayofweek_df = mins15_df.groupby(['dayofweek']).describe().unstack()
dayofweek_df = dayofweek_df[['count']].reset_index(0)
dayofweek_df_ = dayofweek_df['count'][:] # extract only the count
print dayofweek_df_['count']

0    1420.0
1    1344.0
2    1344.0
3    1360.0
4    1440.0
5    1440.0
6    1440.0
Name: count, dtype: float64

Observations on day-of-week data¶

There are slightly fewer Tue(dow=1), Wed(dow=2), and Thursdays(dow=3) in the timeseries

In [14]:

plt.bar(dayofweek_df_.index.values, 
        dayofweek_df_['mean'].values, 
        yerr=dayofweek_df_['std'].values,
        color='lightgray')

plt.ylabel('15min Logins')
plt.xlabel('Day-of-Week')
plt.title('Day-of-Week 15min-Logins Distribution')
axes = plt.gca()
fig = plt.gcf()
fig.set_size_inches(14, 8)
plt.show()

Observations on day-of-week behavior¶

Saturday(dow=5) and Sunday(dow=6) are typically the days with most 15min login-events (~13 / 15min).
Monday(dow=0) and Tuesday(dow=1) the ones with the fewest (~6.5 / 15min)
There is significant variance in 15min behavior for any given day of the week.
Therefore, day-of-week appears to be a predictive factor.

Now the week-level (week-of-year) analysis¶

In [15]:

weekofyear_df = mins15_df.groupby(['weekofyear']).describe().unstack()
weekofyear_df = weekofyear_df[['count']].reset_index(0)
weekofyear_df['count'][['count','mean']]

Out[15]:

	count	mean
0	304.0	7.809211
1	672.0	7.763393
2	672.0	7.474702
3	672.0	7.069940
4	672.0	7.059524
5	672.0	8.291667
6	672.0	8.802083
7	672.0	10.468750
8	672.0	9.752976
9	672.0	11.008929
10	672.0	10.919643
11	672.0	13.325893
12	672.0	10.840774
13	672.0	12.046131
14	672.0	9.659226
15	76.0	5.197368

Observations on week-of-year¶

The first and especially the most recent "week" have an incomplete number of 15min timeperiods. Therefore be careful when/if making predictions based on week-level information.
The mean now more strongly suggests that login-events are increasing.

In [16]:

weekofyear_trim_df = weekofyear_df['count'][1:-1]

In [17]:

plt.bar(weekofyear_trim_df.index.values, 
        weekofyear_trim_df['mean'].values, 
        yerr=weekofyear_trim_df['std'].values,
        color='lightgray')

plt.ylabel('15min Logins')
plt.xlabel('Week-of-Year')
plt.title('Week-of-Year 15min-Logins Distribution')
axes = plt.gca()
axes.set_ylim([0,weekofyear_trim_df['mean'].max()+weekofyear_trim_df['std'].max()]) # force y to start at zero
fig = plt.gcf()
fig.set_size_inches(14, 8)

plt.show()

Observations¶

The count by week starts at ~7.5 and increases to ~11
The most recent week trended down
As expected, there is significant variance in 15min behavior over an entire week.

Let's look at hour-of-day¶

In [18]:

hourofday_df = mins15_df.groupby(['hourofday']).describe().unstack()
hourofday_df = hourofday_df[['count']].reset_index(0)
hourofday_df['count'][['count','mean']]

Out[18]:

	count	mean
0	408.0	14.688725
1	408.0	15.482843
2	408.0	14.215686
3	408.0	11.840686
4	408.0	12.338235
5	408.0	7.218137
6	408.0	2.789216
7	408.0	1.997549
8	408.0	2.004902
9	408.0	3.742647
10	408.0	7.509804
11	408.0	14.213235
12	408.0	12.166667
13	408.0	8.850490
14	408.0	8.397059
15	408.0	7.446078
16	408.0	6.941176
17	408.0	6.333333
18	408.0	7.303922
19	404.0	8.007426
20	408.0	10.056373
21	408.0	13.781863
22	408.0	16.193627
23	408.0	14.848039

Observations on hour-of-day data¶

Every hour-of-day has an equivalent number of 15min timeslots. Therefore it may be safe to use hour-level information.

In [19]:

plt.bar(hourofday_df.index,
        hourofday_df['count']['mean'],
        yerr=hourofday_df['count']['std'],
        color='lightgrey')

plt.ylabel('15min Logins')
plt.xlabel('Hour-of-Day')
plt.title('Hour-of-Day 15min Logins Distribution')
axes = plt.gca()
axes.set_ylim([0,30])
fig = plt.gcf()
fig.set_size_inches(14, 8)
plt.show()

Observations on hour-of-day behavior¶

There is a pattern to login-events by hour-of-day.
There is significant variance in 15min behavior for any given hour.
Therefore, time-of-day appears to be a predictive factor.

Decomposition of the 15min logins time series¶

In [20]:

from statsmodels.tsa.seasonal import seasonal_decompose

# Extract login counts indexed by date
mins15_ts = pd.Series (data=mins15_df ['count'].tolist (), 
                        index=mins15_df ['login_period'])
mins15_ts = mins15_ts.asfreq (pd.infer_freq (mins15_ts.index))
mins15_ts = mins15_ts.astype (np.float64)

# Decompose time series with weekly frequency
decomp_freq = 24*60/15 * 7
mins15_decomposition = seasonal_decompose (mins15_ts.values, freq=decomp_freq)
mins15_trend = mins15_decomposition.trend
mins15_seasonal = mins15_decomposition.seasonal
mins15_residual = mins15_decomposition.resid

C:\Anaconda2\lib\site-packages\statsmodels\tsa\filters\filtertools.py:28: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  return np.r_[[np.nan] * head, x, [np.nan] * tail]

In [21]:

### Plot decomposition

import matplotlib.gridspec as gspec

fig = plt.figure ()
gs =gspec.GridSpec (4, 1)
ax = fig.add_subplot (gs [0])
ax.plot (mins15_ts.index, mins15_ts.values, label="Original")
ax.set_ylabel ("Original")
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=12)

ax = fig.add_subplot (gs [1])
ax.plot (mins15_ts.index, mins15_trend, label="Trend")
ax.set_ylabel ("Trend")
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=12)

ax = fig.add_subplot (gs [2])
ax.plot (mins15_ts.index, mins15_seasonal, label="Seasonality")
ax.set_ylabel ("Seasonality")
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=12)

ax = fig.add_subplot (gs [3])
ax.plot (mins15_ts.index, mins15_residual, label="Residuals")
ax.set_ylabel ("Residuals")
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)

labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=12)

fig.set_size_inches (16, 20)
gs.tight_layout (fig)
plt.show()

Observations based on timeseries decomposition¶

There is a clear increasing trend in the logins over time
There is a clear seasonal effect in the logins, with a period of 1 week
The residuals appear to be normally distributed.

Testing for stationarity¶

To verify that the time series is stationary or not, we perform a Dickey-Fuller test.

In [22]:

from statsmodels.tsa.stattools import adfuller

# Get first difference of time series
mins15_ts_diff = mins15_ts - mins15_ts.shift ()
mins15_ts_diff.dropna (inplace=True)

test_result = adfuller (mins15_ts, autolag="AIC")

index= ['Test statistic',
        'p-value',
        'No. lags used',
        'No. of observations used'
        ]
print "Dikey-Fuller Test Results"
for i in range (len (index)):
    print index [i], ":", "%.5f" % test_result [i]
for key, value in test_result[4].items():
        print 'Critical Value (%s)' % key, "%.5f" % value

Dikey-Fuller Test Results
Test statistic : -10.33795
p-value : 0.00000
No. lags used : 38.00000
No. of observations used : 9749.00000
Critical Value (5%) -2.86184
Critical Value (1%) -3.43102
Critical Value (10%) -2.56693

Observations from Dickey-Fuller test¶

The Dickey-Fuller test statistics is smaller than the critical value for 1% which confirms that the time series is stationary (Bisgaard & Kulahci, 2011).
Because the series is stationary, we can safely apply an ARIMA model.

Predictive Modeling¶

Let's predict the number of login-events
First let's evaluate which model in general achieves the lowest error
Then apply that model the subsequent four 15 minute timeperiods.

Analyze the performance of a naive baseline model¶

A simple model is to naively predict a constant value, such as the overall expected value or median.
Recall that the overall median of login-events in 15minute periods is 5
Let's also explore other nearby values; specifically, the fifteen values from zero to fourteen.

In [23]:

# Let's switch to Numpy(np.) which faciliates the calculation of several hard-coded predictions.
np.set_printoptions(precision=2) # for pretty print

# gather the squared errors over fifteen different "hard-coded" predictions.
predictions15 = np.array([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14])

In [24]:

# Calculate the root mean-squared (RMS) error for all fifteen values
# since there is no training we do not need to worry about information leakage

arr = np.empty((0,15), int)
for index, row in mins15_df.iterrows():
    timeperiod_count  = row['count']
    predictions15_gap = predictions15 - timeperiod_count
    predictions15_gap_sq = np.square(predictions15_gap)

    arr = np.append(arr, np.array([predictions15_gap_sq]), axis=0)

# print arr.mean(axis=0)
print np.sqrt(arr.mean(axis=0))

[ 12.65  11.91  11.22  10.57   9.99   9.47   9.04   8.7    8.47   8.34
   8.34   8.46   8.69   9.03   9.46]

Observations of constant-value model predictions¶

The mean-squared error for the overall median of 7 is 8.7
The smallest root mean squared-error over the entire dataset occurs with a prediction of between 9 and 10 logins: 8.34.

Use the "Mondays" model?¶

Let's try to exploit the known temporal pattern of day-of-week
Given that the expected predictions are for a monday, what is the behavior of a "Monday" model?

In [25]:

arr = np.empty((0,15), int)
for index, row in mins15_df[(mins15_df.dayofweek == 0)].iterrows():
    timeperiod_count  = row['count']
    predictions15_gap = predictions15 - timeperiod_count
    predictions15_gap_sq = np.square(predictions15_gap)

    arr = np.append(arr, np.array([predictions15_gap_sq]), axis=0)

rmse=np.sqrt(arr.mean(axis=0))
# pd.DataFrame(data=rmse[1:])
print rmse

[ 7.99  7.24  6.56  5.97  5.49  5.17  5.03  5.09  5.34  5.75  6.29  6.94
  7.67  8.45  9.27]

Observations on Monday-model predictions¶

The minimal squared-error over Mondays occurs with a prediction of 6 logins: 5.03

What is it when restricted to 7PM?¶

Let's try to exploit the known temporal pattern of hour-of-day.
Given that the expected predictions are for ~7PM, what is the behavior of a "7PM" model?

In [26]:

arr = np.empty((0,15), int)
for index, row in mins15_df[(mins15_df.hourofday == 19)].iterrows():
    timeperiod_count  = row['count']
    predictions15_gap = predictions15 - timeperiod_count
    predictions15_gap_sq = np.square(predictions15_gap)

    arr = np.append(arr, np.array([predictions15_gap_sq]), axis=0)

rmse=np.sqrt(arr.mean(axis=0))
# pd.DataFrame(data=rmse[1:])
print rmse

[ 9.5   8.67  7.89  7.16  6.5   5.93  5.49  5.21  5.11  5.21  5.49  5.93
  6.49  7.15  7.88]

Observations on 7PM predictions¶

The minimal squared-error over 7PMs occurs with a prediction of 8 logins: 5.11 (this is higher than the Monday-model)

What is the best prediction when restricted to Monday at 7PM?¶

Let's combine the two known temporal patterns of day-of-week and hour-of-day

In [27]:

mon7pm_series = mins15_df[(mins15_df.hourofday == 19) & (mins15_df.dayofweek == 0)]['count']
mon7pm_df = pd.DataFrame(data=mon7pm_series.reset_index()[['count']])

arr = np.empty((0,15), int)
for index, row in mon7pm_df.iterrows():
    timeperiod_count  = row['count']
    predictions15_gap = predictions15 - timeperiod_count
    predictions15_gap_sq = np.square(predictions15_gap)

    arr = np.append(arr, np.array([predictions15_gap_sq]), axis=0)

rmse=np.sqrt(arr.mean(axis=0))
# pd.DataFrame(data=rmse[1:])
print rmse

[ 5.48  4.66  3.91  3.31  2.93  2.87  3.15  3.68  4.38  5.18  6.04  6.94
  7.86  8.8   9.75]

Observations on Monday at 7PM predictions¶

The minimal squared-error over Monday at 7PMs occurs with a prediction of 5 logins: 2.87.
This error is significantly smaller than using either pattern alone

Now charaterize the distribution of this model¶

In [28]:

mon7pm_df.describe()

Out[28]:

	count
count	56.000000
mean	4.678571
std	2.880070
min	0.000000
25%	3.000000
50%	4.000000
75%	6.250000
max	14.000000

This model's expected value is 4.67 and has a standard deviation of 2.88

Unfortunately the d-of-w and h-of-d time-series models above were not evaluated as a general approach (for every day/hour combination), nor was testing performed without [[information leakage]] (of future day/hour combinations), so we cannot report this models overall performance.

TO DO: test every combination and do so based only on looking at prior data

ARIMA modelling of 15-min logins¶

One of the best-practice approaches to stationary univariate timeseries predictive modeling is ARIMA.

Let's model the first order difference of timeseries.

ARIMA model parameterization¶

First we need to determine sound parameter settings.

An intuitive way to do this is to look at the Autocorrelation and Partial autocorrelation plots.

In [29]:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot the ACF and PACF
plot_acf (mins15_ts_diff, lags=20)
plot_pacf (mins15_ts_diff, lags=20)
plt.tight_layout ()
plt.show ()

Selection of ARIMA's paramaters¶

The ACF quicly decays after lag 1. This indicates that the moving average parameter of the ARIMA model should be set to 1.
The PACF quickly decays after lag 3 which suggests that the autoregreesive parameter of the ARIMA model should be set to 3.

Building the ARIMA (3,0,1) model for the 15mins logins data¶

In [30]:

from statsmodels.tsa.arima_model import ARIMA

# Create and fit the model 
model = ARIMA (mins15_ts, order=(3, 0, 1))
mins15_arima = model.fit ()

In [31]:

print mins15_arima.summary ()

                              ARMA Model Results

============================================================================== Dep. Variable:                      y   No. Observations:                 9788
Model:                     ARMA(3, 1)   Log Likelihood              -28362.052
Method:                       css-mle   S.D. of innovations              4.387
Date:                Wed, 10 Aug 2016   AIC                          56736.104
Time:                        23:08:50   BIC                          56779.238
Sample:                    01-01-1970   HQIC                         56750.720
                         - 04-13-1970 

============================================================================== coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          9.5027      0.434     21.874      0.000         8.651    10.354
ar.L1.y        0.6917      0.070      9.883      0.000         0.555     0.829
ar.L2.y        0.1481      0.041      3.612      0.000         0.068     0.228
ar.L3.y        0.0737      0.024      3.031      0.002         0.026     0.121
ma.L1.y       -0.1507      0.070     -2.157      0.031        -0.288    -0.014
                                    Roots

============================================================================= Real           Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.0701           -0.0000j            1.0701           -0.0000
AR.2           -1.5401           -3.2114j            3.5616           -0.3212
AR.3           -1.5401           +3.2114j            3.5616            0.3212
MA.1            6.6358           +0.0000j            6.6358            0.0000
-----------------------------------------------------------------------------

Observations of ARIMA's performance¶

The RMSE of the ARIMA (3,0,1) model is approximately 4.4 which is significantly lower than the naive constant-value model's 8.3.

In [32]:

# Plot ARIMA model and RMSE
# The plot is using obs every 96 intervals to simplify the output
mins15_arima_ts = np.rint (pd.Series (mins15_arima.fittedvalues, copy=True))
rmse = np.sqrt (sum ((mins15_arima_ts-mins15_ts)**2)/len(mins15_ts))

fig = plt.figure ()
plt.plot (mins15_ts [::96], 'k')
plt.plot (mins15_arima_ts [::96], 'r--', lw=2)
plt.title ('RMSE: %.4f' % rmse)
plt.legend (["Original", "ARIMA (3,0,1)"])
fig.set_size_inches (15, 8)
plt.show ()

Observations of ARIMA's visual trends¶

With the exception of the very sharp spikes in the number of logins, the ARIMA (3,0,1) appears to be a good fit for the data.

Predict the next 4 data points using the ARIMA (3, 0, 1) model¶

In [33]:

mins15_pred = mins15_arima.predict (start=9760, end=9800, dynamic=False)
mins15_pred = np.rint (mins15_pred)

fig = plt.figure ()
plt.plot (mins15_ts [9760:9800], 'o')
plt.plot (mins15_pred, 'r--', lw=2)
plt.legend (["Original", "Predicted"])
fig.set_size_inches (16,8)
plt.show ()

C:\Anaconda2\lib\site-packages\statsmodels\base\data.py:503: FutureWarning: TimeSeries is deprecated. Please use Series
  return TimeSeries(result, index=self.predict_dates)

In [34]:

print "Predicted values:"
print mins15_pred[28:32]

Predicted values:
1970-04-13 19:00:00    6.0
1970-04-13 19:15:00    6.0
1970-04-13 19:30:00    7.0
1970-04-13 19:45:00    7.0
Freq: 15T, dtype: float64

ARIMA prediction of next for 15min timeperiods.¶

1970-04-13 19:00:00 6.0
1970-04-13 19:15:00 6.0
1970-04-13 19:30:00 7.0
1970-04-13 19:45:00 7.0

Forecast probability distributions (confidence)¶

In the next cells we will take a closer look at the forecasted values and their confidence intervals.

In [35]:

fig = mins15_arima.plot_predict(start=9770, end=9789, dynamic=False, plot_insample=True)
plt.legend (["Forecast", "Original", "95% Confidence Interval"])
fig.set_size_inches (16, 8)
plt.show ()

C:\Anaconda2\lib\site-packages\statsmodels\tsa\arima_model.py:1724: FutureWarning: TimeSeries is deprecated. Please use Series
  forecast = TimeSeries(forecast, index=self.data.predict_dates)

In [36]:

print "Confidence intervals for the predicted parameters:"
print mins15_arima.conf_int ()

Confidence intervals for the predicted parameters:
                0          1
const    8.651236  10.354129
ar.L1.y  0.554510   0.828842
ar.L2.y  0.067722   0.228451
ar.L3.y  0.026034   0.121306
ma.L1.y -0.287658  -0.013739

In [37]:

forecast = mins15_arima.forecast (1, alpha=.1)
print "Predicted value for the next 15 mins: ", np.rint (forecast [0] [0])
print "90% confidence intervals for the prediction: ", np.rint (forecast [2] [0] [0]), np.rint (forecast [2] [0] [1])

forecast = mins15_arima.forecast (1, alpha=.05)
print "95% confidence intervals for the prediction: ", np.rint (forecast [2] [0] [0]), np.rint (forecast [2] [0] [1])

Predicted value for the next 15 mins:  6.0
90% confidence intervals for the prediction:  -1.0 13.0
95% confidence intervals for the prediction:  -2.0 15.0

References:¶

SÃ¸ren Bisgaard, and Murat Kulahci. ([[2011]]). “Time Series Analysis and Forecasting by Example." Wiley

Appendix¶

Let's look at moving average¶

In [38]:

days1_trim_df['movavg7d'] = pd.rolling_mean(days1_trim_df['count'], 7)
days1_trim_df.tail(10)

C:\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: FutureWarning: pd.rolling_mean is deprecated for Series and will be removed in a future version, replace with 
	Series.rolling(window=7,center=False).mean()
  if __name__ == '__main__':
C:\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

Out[38]:

	login_period	count	movavg7d
92	1970-04-03	15.666667	11.986607
93	1970-04-04	19.677083	12.583333
94	1970-04-05	12.104167	12.046131
95	1970-04-06	6.406250	11.913690
96	1970-04-07	6.145833	11.645833
97	1970-04-08	7.270833	11.287202
98	1970-04-09	8.520833	10.827381
99	1970-04-10	10.510417	10.090774
100	1970-04-11	14.083333	9.291667
101	1970-04-12	14.677083	9.659226

In [39]:

fig = plt.figure()
ax = fig.add_subplot(111)

ax.plot(days1_trim_df['login_period'], days1_trim_df['count'], linestyle='-', linewidth=1)
ax.plot(days1_trim_df['login_period'], days1_trim_df['movavg7d'], linestyle='--', linewidth=3)

ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
ax.grid(True, which = "both", linestyle=":")

ax.set_ylabel('Count')
labels = ax.get_xticklabels()
plt.setp(labels, rotation=90, fontsize=12)

fig.set_size_inches(16, 4)

plt.show()

The End¶

Temporal Data Analysis System

References

2016

A Supervised Duration Estimator¶

Description¶

Source code¶

Read in the data in a Pandas DataFrame¶

Observations on the raw timeseries¶

Let's analyze the 15 minute resolution behavior¶

Observations on the 15-minute timeperiods:¶

Peek at the most recent timeperiods¶

Observations on the events in the most recent 15-minute timeperiod(s)¶

Prepare for visualizations¶

Analyze the timeperiod event count history¶

Observations on the timeperiod count history¶

Analyze the timeperiod event count distribution¶

Observations on the timeperiod count distribution¶

Analyze the cummulative timeperiod count distribution¶

Observations on cummulative timeperiods counts¶

Now let's look at the daily/day-of-year behavior¶

Observations on daily/day-of-year behavior¶

Let's visualize the "trimmed" daily timeseries¶

Observations¶

Next, let's analyze the day-of-week behavior¶

Observations on day-of-week data¶

Observations on day-of-week behavior¶

Now the week-level (week-of-year) analysis¶

Observations on week-of-year¶

Observations¶

Let's look at hour-of-day¶

Observations on hour-of-day data¶

Observations on hour-of-day behavior¶

Decomposition of the 15min logins time series¶

Observations based on timeseries decomposition¶

Testing for stationarity¶

Observations from Dickey-Fuller test¶

Predictive Modeling¶

Analyze the performance of a naive baseline model¶

Observations of constant-value model predictions¶

Use the "Mondays" model?¶

Observations on Monday-model predictions¶

What is it when restricted to 7PM?¶

Observations on 7PM predictions¶

What is the best prediction when restricted to Monday at 7PM?¶

Observations on Monday at 7PM predictions¶

Now charaterize the distribution of this model¶

ARIMA modelling of 15-min logins¶

ARIMA model parameterization¶

Selection of ARIMA's paramaters¶

Building the ARIMA (3,0,1) model for the 15mins logins data¶

Observations of ARIMA's performance¶

Observations of ARIMA's visual trends¶

Predict the next 4 data points using the ARIMA (3, 0, 1) model¶

ARIMA prediction of next for 15min timeperiods.¶

Forecast probability distributions (confidence)¶

References:¶

Appendix¶

Let's look at moving average¶

The End¶

Navigation menu

Search