Temporal Data Analysis System
A Temporal Data Analysis System is a sequential data analysis system that applies a temporal data analysis algorithm to solve a temporal data analysis task.
- Context:
- It can range from being a Temporal Outlier Detection System, to being a Temporal Predictive Modeling System, to being ...
- It can range from being a Python-based Temporal Data Analysis System, an R-based Temporal Data Analysis System, ...
- Example(s):
- Counter-Example(s):
- See: Sequential Data Analysis.
References
2016
A Supervised Duration Estimator¶
This program demonstrates a supervised duration estimation system (as defined in http://GM-RKB.com/supervised_demand_estimator)
Gabor Melli, Sept 13, 2016
Description¶
TBD
import pandas as pd
import numpy as np # for random sampling and RMSE analysis
# read the data in a series
data_series = pd.to_datetime( pd.read_json('logins.json')['login_time'])
#print data_series[:4], "\n\n", data_series.describe() # describe the series
# cast the data into a dataframe
data_df = pd.DataFrame({'id':data_series.index, 'login_time':data_series.values})
# report a random sample
print data_df.reindex(np.random.permutation(data_df.index))[0:10].sort_index()
# summarize/describe the data
data_df['login_time'].describe()
Observations on the raw timeseries¶
- The events in the timeseries start on 1970-01-01 20:12:16 and end on 1970-04-13 18:57:38
- There are 93,142 events (92,265 of them at unique times)
Let's analyze the 15 minute resolution behavior¶
- Let's discretize the data at a 15 minute resolution (as required) to creat a univariate timeseries.
mins15_df = data_df.set_index('login_time').groupby(pd.TimeGrouper(freq='15Min')).count().fillna(0)
mins15_df = mins15_df.reset_index(0) # make the index an integer (instead of the time period)
mins15_df.columns = ['login_period','count']
# Let's also add some additional termporal attributes
mins15_df['dayofweek'] = mins15_df['login_period'].dt.dayofweek
mins15_df['weekofyear'] = mins15_df['login_period'].dt.weekofyear
mins15_df['hourofday'] = mins15_df['login_period'].dt.hour
mins15_df
# report a random sample
# print mins15_df.reindex(np.random.permutation(mins15_df.index))[0:10].sort(); print "\n\n"
# summarize at this level
mins15_df.describe()
Observations on the 15-minute timeperiods:¶
- There are 9,788 of 15-minute timeperiods between the first and last event.
- For all timeperids the minimum and maximum number of events range from 0 to 73.
- On average there are 9.516 events per period, while the median is 7 logins
- Every day-of-the-week appears to have login events.
- Every week day from week 1 to week 16 appears to have login events.
- Every hour-of-the-day appears to have login events
Peek at the most recent timeperiods¶
mins15_df.tail(10)
Observations on the events in the most recent 15-minute timeperiod(s)¶
- The login-events occurred on a Monday (day-of-week/dow=0)
- The login-events occurred in the early evening (just before 19:00)
- The very last 15-minute period appears to be complete
- The median number of events in that range is 5 logins.
- For predictions then the next few timeperiods will likely be in the range of 4 to 6 logins.
Prepare for visualizations¶
import matplotlib as mpl
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.dates import MO, TU, WE, TH, FR, SA, SU
days = mdates.DayLocator()
dtfmt = mdates.DateFormatter('%a %b %d')
saturdays = mdates.WeekdayLocator(byweekday=SA)
Analyze the timeperiod event count history¶
fig, ax = plt.subplots()
mins15_df['count'].value_counts().plot(ax=ax, kind='bar')
fig.set_size_inches(16, 4)
plt.grid(True)
plt.show()
fig, ax = plt.subplots()
plt.hist(mins15_df['count'], bins=73, normed=True, cumulative=False)
fig.set_size_inches(16, 4)
plt.grid(True)
plt.show()
fig, ax = plt.subplots()
plt.hist(mins15_df['count'], bins=73, normed=True, cumulative=True)
fig.set_size_inches(16, 4)
plt.grid(True)
plt.show()
periodminutes = 15
inaday15mins = (60/periodminutes)*24
days1_df = data_df.set_index('login_time').groupby(pd.TimeGrouper(freq='1D')).count().fillna(0)/inaday15mins
days1_df = days1_df.reset_index(0)
days1_df.columns = ['login_period','count']
days1_trim_df = days1_df[1:-1]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(days1_df['login_period'], days1_df['count'], linestyle='-', linewidth=3)
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
ax.grid(True, which = "both", linestyle=":")
ax.set_ylabel('Count')
labels = ax.get_xticklabels()
plt.setp(labels, rotation=90, fontsize=12)
fig.set_size_inches(16, 4)
plt.show()
Observations on daily/day-of-year behavior¶
- There is weekly periodicity and it appears to peak on the weekends.
- There appears to be a slightly increasing trend.
- The week leading up to the weekend of Sat Mar 21 was unusually busy (and a relatively quieter weekend).
- The starting day and the ending day are not fully represented.
Let's visualize the "trimmed" daily timeseries¶
mins15_trim_df = mins15_df[1:-1]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(days1_trim_df['login_period'], days1_trim_df['count'], linestyle='-', linewidth=5)
ax.plot(days1_df['login_period'], days1_df['count'], linestyle='--', linewidth=2)
ax.xaxis.set_major_locator(saturdays) ; #ax.xaxis.set_major_locator(tuesdays)
ax.xaxis.set_minor_locator(days)
ax.xaxis.set_major_formatter(dtfmt)
ax.grid(True, which = "both", linestyle=":")
ax.set_ylabel('15min Login Count')
labels = ax.get_xticklabels()
plt.setp(labels, rotation=90, fontsize=12)
fig.set_size_inches(16, 4)
plt.show()
dayofweek_df = mins15_df.groupby(['dayofweek']).describe().unstack()
dayofweek_df = dayofweek_df[['count']].reset_index(0)
dayofweek_df_ = dayofweek_df['count'][:] # extract only the count
print dayofweek_df_['count']
Observations on day-of-week data¶
- There are slightly fewer Tue(dow=1), Wed(dow=2), and Thursdays(dow=3) in the timeseries
plt.bar(dayofweek_df_.index.values,
dayofweek_df_['mean'].values,
yerr=dayofweek_df_['std'].values,
color='lightgray')
plt.ylabel('15min Logins')
plt.xlabel('Day-of-Week')
plt.title('Day-of-Week 15min-Logins Distribution')
axes = plt.gca()
fig = plt.gcf()
fig.set_size_inches(14, 8)
plt.show()
Observations on day-of-week behavior¶
- Saturday(dow=5) and Sunday(dow=6) are typically the days with most 15min login-events (~13 / 15min).
- Monday(dow=0) and Tuesday(dow=1) the ones with the fewest (~6.5 / 15min)
- There is significant variance in 15min behavior for any given day of the week.
- Therefore, day-of-week appears to be a predictive factor.
Now the week-level (week-of-year) analysis¶
weekofyear_df = mins15_df.groupby(['weekofyear']).describe().unstack()
weekofyear_df = weekofyear_df[['count']].reset_index(0)
weekofyear_df['count'][['count','mean']]
Observations on week-of-year¶
- The first and especially the most recent "week" have an incomplete number of 15min timeperiods. Therefore be careful when/if making predictions based on week-level information.
- The mean now more strongly suggests that login-events are increasing.
weekofyear_trim_df = weekofyear_df['count'][1:-1]
plt.bar(weekofyear_trim_df.index.values,
weekofyear_trim_df['mean'].values,
yerr=weekofyear_trim_df['std'].values,
color='lightgray')
plt.ylabel('15min Logins')
plt.xlabel('Week-of-Year')
plt.title('Week-of-Year 15min-Logins Distribution')
axes = plt.gca()
axes.set_ylim([0,weekofyear_trim_df['mean'].max()+weekofyear_trim_df['std'].max()]) # force y to start at zero
fig = plt.gcf()
fig.set_size_inches(14, 8)
plt.show()
hourofday_df = mins15_df.groupby(['hourofday']).describe().unstack()
hourofday_df = hourofday_df[['count']].reset_index(0)
hourofday_df['count'][['count','mean']]
Observations on hour-of-day data¶
- Every hour-of-day has an equivalent number of 15min timeslots. Therefore it may be safe to use hour-level information.
plt.bar(hourofday_df.index,
hourofday_df['count']['mean'],
yerr=hourofday_df['count']['std'],
color='lightgrey')
plt.ylabel('15min Logins')
plt.xlabel('Hour-of-Day')
plt.title('Hour-of-Day 15min Logins Distribution')
axes = plt.gca()
axes.set_ylim([0,30])
fig = plt.gcf()
fig.set_size_inches(14, 8)
plt.show()
Observations on hour-of-day behavior¶
- There is a pattern to login-events by hour-of-day.
- There is significant variance in 15min behavior for any given hour.
- Therefore, time-of-day appears to be a predictive factor.
Decomposition of the 15min logins time series¶
from statsmodels.tsa.seasonal import seasonal_decompose
# Extract login counts indexed by date
mins15_ts = pd.Series (data=mins15_df ['count'].tolist (),
index=mins15_df ['login_period'])
mins15_ts = mins15_ts.asfreq (pd.infer_freq (mins15_ts.index))
mins15_ts = mins15_ts.astype (np.float64)
# Decompose time series with weekly frequency
decomp_freq = 24*60/15 * 7
mins15_decomposition = seasonal_decompose (mins15_ts.values, freq=decomp_freq)
mins15_trend = mins15_decomposition.trend
mins15_seasonal = mins15_decomposition.seasonal
mins15_residual = mins15_decomposition.resid
### Plot decomposition
import matplotlib.gridspec as gspec
fig = plt.figure ()
gs =gspec.GridSpec (4, 1)
ax = fig.add_subplot (gs [0])
ax.plot (mins15_ts.index, mins15_ts.values, label="Original")
ax.set_ylabel ("Original")
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=12)
ax = fig.add_subplot (gs [1])
ax.plot (mins15_ts.index, mins15_trend, label="Trend")
ax.set_ylabel ("Trend")
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=12)
ax = fig.add_subplot (gs [2])
ax.plot (mins15_ts.index, mins15_seasonal, label="Seasonality")
ax.set_ylabel ("Seasonality")
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=12)
ax = fig.add_subplot (gs [3])
ax.plot (mins15_ts.index, mins15_residual, label="Residuals")
ax.set_ylabel ("Residuals")
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, fontsize=12)
fig.set_size_inches (16, 20)
gs.tight_layout (fig)
plt.show()
Observations based on timeseries decomposition¶
- There is a clear increasing trend in the logins over time
- There is a clear seasonal effect in the logins, with a period of 1 week
- The residuals appear to be normally distributed.
Testing for stationarity¶
To verify that the time series is stationary or not, we perform a Dickey-Fuller test.
from statsmodels.tsa.stattools import adfuller
# Get first difference of time series
mins15_ts_diff = mins15_ts - mins15_ts.shift ()
mins15_ts_diff.dropna (inplace=True)
test_result = adfuller (mins15_ts, autolag="AIC")
index= ['Test statistic',
'p-value',
'No. lags used',
'No. of observations used'
]
print "Dikey-Fuller Test Results"
for i in range (len (index)):
print index [i], ":", "%.5f" % test_result [i]
for key, value in test_result[4].items():
print 'Critical Value (%s)' % key, "%.5f" % value
Observations from Dickey-Fuller test¶
- The Dickey-Fuller test statistics is smaller than the critical value for 1% which confirms that the time series is stationary (Bisgaard & Kulahci, 2011).
- Because the series is stationary, we can safely apply an ARIMA model.
Predictive Modeling¶
- Let's predict the number of login-events
- First let's evaluate which model in general achieves the lowest error
- Then apply that model the subsequent four 15 minute timeperiods.
Analyze the performance of a naive baseline model¶
- A simple model is to naively predict a constant value, such as the overall expected value or median.
- Recall that the overall median of login-events in 15minute periods is 5
- Let's also explore other nearby values; specifically, the fifteen values from zero to fourteen.
# Let's switch to Numpy(np.) which faciliates the calculation of several hard-coded predictions.
np.set_printoptions(precision=2) # for pretty print
# gather the squared errors over fifteen different "hard-coded" predictions.
predictions15 = np.array([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14])
# Calculate the root mean-squared (RMS) error for all fifteen values
# since there is no training we do not need to worry about information leakage
arr = np.empty((0,15), int)
for index, row in mins15_df.iterrows():
timeperiod_count = row['count']
predictions15_gap = predictions15 - timeperiod_count
predictions15_gap_sq = np.square(predictions15_gap)
arr = np.append(arr, np.array([predictions15_gap_sq]), axis=0)
# print arr.mean(axis=0)
print np.sqrt(arr.mean(axis=0))
Observations of constant-value model predictions¶
- The mean-squared error for the overall median of 7 is 8.7
- The smallest root mean squared-error over the entire dataset occurs with a prediction of between 9 and 10 logins: 8.34.
Use the "Mondays" model?¶
- Let's try to exploit the known temporal pattern of day-of-week
- Given that the expected predictions are for a monday, what is the behavior of a "Monday" model?
arr = np.empty((0,15), int)
for index, row in mins15_df[(mins15_df.dayofweek == 0)].iterrows():
timeperiod_count = row['count']
predictions15_gap = predictions15 - timeperiod_count
predictions15_gap_sq = np.square(predictions15_gap)
arr = np.append(arr, np.array([predictions15_gap_sq]), axis=0)
rmse=np.sqrt(arr.mean(axis=0))
# pd.DataFrame(data=rmse[1:])
print rmse
Observations on Monday-model predictions¶
- The minimal squared-error over Mondays occurs with a prediction of 6 logins: 5.03
What is it when restricted to 7PM?¶
- Let's try to exploit the known temporal pattern of hour-of-day.
- Given that the expected predictions are for ~7PM, what is the behavior of a "7PM" model?
arr = np.empty((0,15), int)
for index, row in mins15_df[(mins15_df.hourofday == 19)].iterrows():
timeperiod_count = row['count']
predictions15_gap = predictions15 - timeperiod_count
predictions15_gap_sq = np.square(predictions15_gap)
arr = np.append(arr, np.array([predictions15_gap_sq]), axis=0)
rmse=np.sqrt(arr.mean(axis=0))
# pd.DataFrame(data=rmse[1:])
print rmse
mon7pm_series = mins15_df[(mins15_df.hourofday == 19) & (mins15_df.dayofweek == 0)]['count']
mon7pm_df = pd.DataFrame(data=mon7pm_series.reset_index()[['count']])
arr = np.empty((0,15), int)
for index, row in mon7pm_df.iterrows():
timeperiod_count = row['count']
predictions15_gap = predictions15 - timeperiod_count
predictions15_gap_sq = np.square(predictions15_gap)
arr = np.append(arr, np.array([predictions15_gap_sq]), axis=0)
rmse=np.sqrt(arr.mean(axis=0))
# pd.DataFrame(data=rmse[1:])
print rmse
mon7pm_df.describe()
- This model's expected value is 4.67 and has a standard deviation of 2.88
Unfortunately the d-of-w and h-of-d time-series models above were not evaluated as a general approach (for every day/hour combination), nor was testing performed without [[information leakage]] (of future day/hour combinations), so we cannot report this models overall performance.
TO DO: test every combination and do so based only on looking at prior data
ARIMA modelling of 15-min logins¶
One of the best-practice approaches to stationary univariate timeseries predictive modeling is ARIMA.
- Let's model the first order difference of timeseries.
ARIMA model parameterization¶
First we need to determine sound parameter settings.
- An intuitive way to do this is to look at the Autocorrelation and Partial autocorrelation plots.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Plot the ACF and PACF
plot_acf (mins15_ts_diff, lags=20)
plot_pacf (mins15_ts_diff, lags=20)
plt.tight_layout ()
plt.show ()
Selection of ARIMA's paramaters¶
- The ACF quicly decays after lag 1. This indicates that the moving average parameter of the ARIMA model should be set to 1.
- The PACF quickly decays after lag 3 which suggests that the autoregreesive parameter of the ARIMA model should be set to 3.
Building the ARIMA (3,0,1) model for the 15mins logins data¶
from statsmodels.tsa.arima_model import ARIMA
# Create and fit the model
model = ARIMA (mins15_ts, order=(3, 0, 1))
mins15_arima = model.fit ()
print mins15_arima.summary ()
Observations of ARIMA's performance¶
- The RMSE of the ARIMA (3,0,1) model is approximately 4.4 which is significantly lower than the naive constant-value model's 8.3.
# Plot ARIMA model and RMSE
# The plot is using obs every 96 intervals to simplify the output
mins15_arima_ts = np.rint (pd.Series (mins15_arima.fittedvalues, copy=True))
rmse = np.sqrt (sum ((mins15_arima_ts-mins15_ts)**2)/len(mins15_ts))
fig = plt.figure ()
plt.plot (mins15_ts [::96], 'k')
plt.plot (mins15_arima_ts [::96], 'r--', lw=2)
plt.title ('RMSE: %.4f' % rmse)
plt.legend (["Original", "ARIMA (3,0,1)"])
fig.set_size_inches (15, 8)
plt.show ()
mins15_pred = mins15_arima.predict (start=9760, end=9800, dynamic=False)
mins15_pred = np.rint (mins15_pred)
fig = plt.figure ()
plt.plot (mins15_ts [9760:9800], 'o')
plt.plot (mins15_pred, 'r--', lw=2)
plt.legend (["Original", "Predicted"])
fig.set_size_inches (16,8)
plt.show ()
print "Predicted values:"
print mins15_pred[28:32]
fig = mins15_arima.plot_predict(start=9770, end=9789, dynamic=False, plot_insample=True)
plt.legend (["Forecast", "Original", "95% Confidence Interval"])
fig.set_size_inches (16, 8)
plt.show ()
print "Confidence intervals for the predicted parameters:"
print mins15_arima.conf_int ()
forecast = mins15_arima.forecast (1, alpha=.1)
print "Predicted value for the next 15 mins: ", np.rint (forecast [0] [0])
print "90% confidence intervals for the prediction: ", np.rint (forecast [2] [0] [0]), np.rint (forecast [2] [0] [1])
forecast = mins15_arima.forecast (1, alpha=.05)
print "95% confidence intervals for the prediction: ", np.rint (forecast [2] [0] [0]), np.rint (forecast [2] [0] [1])
References:¶
- Søren Bisgaard, and Murat Kulahci. ([[2011]]). “Time Series Analysis and Forecasting by Example." Wiley
days1_trim_df['movavg7d'] = pd.rolling_mean(days1_trim_df['count'], 7)
days1_trim_df.tail(10)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(days1_trim_df['login_period'], days1_trim_df['count'], linestyle='-', linewidth=1)
ax.plot(days1_trim_df['login_period'], days1_trim_df['movavg7d'], linestyle='--', linewidth=3)
ax.xaxis.set_major_locator(saturdays)
ax.xaxis.set_major_formatter(dtfmt)
ax.xaxis.set_minor_locator(days)
ax.grid(True, which = "both", linestyle=":")
ax.set_ylabel('Count')
labels = ax.get_xticklabels()
plt.setp(labels, rotation=90, fontsize=12)
fig.set_size_inches(16, 4)
plt.show()