2019 EmbarrassinglyShallowAutoencode
- (Steck, 2019) ⇒ Harald Steck. (2019). “Embarrassingly Shallow Autoencoders for Sparse Data.” In: The World Wide Web Conference, (WWW-2019).
Subject Headings:
Notes
Cited By
Quotes
Abstract
Combining simple elements from the literature, we define a linear model that is geared toward sparse data, in particular implicit feedback data for recommender systems. We show that its training objective has a closed-form solution, and discuss the resulting conceptual insights. Surprisingly, this simple model achieves better ranking accuracy than various state-of-the-art collaborative-filtering approaches, including deep non-linear models, on most of the publicly available data-sets used in our experiments.
1 INTRODUCTION
Many recent improvements in collaborative filtering can be attributed to deep learning approaches, e.g., [5, 7–9, 13, 21, 25, 26]. Unlike in areas like computer vision, however, it was found that a small number of hidden layers achieved the best recommendation accuracy. In this paper, we take this to the extreme, and define a linear model without a hidden layer (see Figure 1). The (binary) input vector indicates which items a user has interacted with, and the model’s objective (in its output layer) is to predict the best items to recommend to the user. This is done by reproducing the input as its output, as is typical for autoencoders.¹ We hence named it Embarrassingly Shallow AutoEncoder (in reverse order: EASE^R).
This paper is organized as follows: we define EASE^R in the next section, using simple elements from the literature. In Section 3.1, we derive the closed-form solution of its convex training objective. This has several implications: (1) it reveals that the neighborhood-based approaches used in collaborative filtering are based on conceptually incorrect item-item similarity-matrices, while EASE^R may be considered a principled neighborhood model, see Sections 3.2 and 4.3; (2) the code for training EASE^R is comprised of only a few lines, see Section 3.3 and Algorithm 1; (3) if the model fits into memory, the wall-clock time for training EASE^R can be several orders of magnitude less than for training a Slim [16] model (see Section 3.4), which is the most similar model to EASE^R. Apart from that, we surprisingly found that EASE^R achieved competitive ranking accuracy, and even outperformed various deep, non-linear, or probabilistic models as well as neighborhood-based approaches on most of the publicly available data-sets used in our experiments (see Section 5).
¹ Note, however, that the proposed model does not follow the typical architecture of autoencoders, which are comprised of an encoder and a decoder. One may, however, introduce an implicit hidden layer by a (full-rank) decomposition of the learned weight-matrix of this model.
Figure 1: The self-similarity of each item is constrained to zero between the input and output layers.
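As a concrete illustration of the closed-form training referenced in the Introduction (Section 3.3, Algorithm 1), here is a minimal numpy sketch rather than the paper's own code; `X` denotes the binary user-item interaction matrix and `lambda_reg` the L2-regularization weight, and both names and the default value are illustrative assumptions:

```python
import numpy as np

def train_ease(X, lambda_reg=500.0):
    """Closed-form training sketch for an EASE^R-style linear model.

    X          : (num_users x num_items) binary interaction matrix
    lambda_reg : L2-regularization weight (illustrative default)
    Returns a dense item-item weight matrix B with a zero diagonal.
    """
    G = X.T @ X + lambda_reg * np.eye(X.shape[1])  # regularized item-item Gram matrix
    P = np.linalg.inv(G)                           # its inverse
    B = P / (-np.diag(P))                          # divide each column j by -P[j, j]
    np.fill_diagonal(B, 0.0)                       # enforce zero self-similarity
    return B
```

Training thus reduces to a single inversion of an item-item matrix, consistent with the claim above that the model can be trained in a few lines of code.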
2 MODEL DEFINITION
…
3 MODEL TRAINING
…
4 RELATED WORK
EASE^R can be viewed as an autoencoder, as a modified version of Slim, and as a neighborhood-based approach. We discuss each of these three related approaches in the following.
4.1 Deep Learning and Autoencoders
While the area of collaborative filtering has long been dominated by matrix factorization approaches, recent years have witnessed a surge in deep learning approaches [5, 7–9, 13, 21, 25, 26], spurred by their great successes in other fields. Autoencoders provide the model architecture that fits exactly the (plain-vanilla) collaborative filtering problem. While various network architectures have been explored, it was found that deep models with a large number of hidden layers typically do not obtain a notable improvement in ranking accuracy in collaborative filtering, compared to ’deep’ models with only one, two or three hidden layers, e.g., [7, 13, 21, 26], which is in stark contrast to other areas, like computer vision. A combination of deep and shallow elements in a single model was proposed in [5].
In contrast, EASE^R has no hidden layer. Instead, the self-similarity of each item between the input and output layer is constrained to zero, see also Figure 1. As a consequence, the model is forced to learn the similarity of an item in the output layer in terms of the other items in the input layer. The surprisingly good empirical results of EASE^R in Section 5 suggest that this constraint might be more effective at forcing the model to generalize well to unseen data than using hidden layers with limited capacity.
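To make the role of the zero-diagonal constraint concrete, here is a small, hypothetical scoring sketch (function and variable names are not from the paper): because diag(B) = 0, an item's score is predicted purely from the *other* items the user has interacted with.

```python
import numpy as np

def recommend_top_k(x_user, B, k=10):
    """Score all items for one user with a learned item-item weight matrix B.

    x_user : 1-D binary array of the items the user has interacted with
    B      : item-item weight matrix with a zero diagonal
    Returns the indices of the k highest-scoring unseen items.
    """
    scores = (x_user @ B).astype(float)   # each item is scored only from other items
    scores[x_user > 0] = -np.inf          # mask items the user has already seen
    return np.argsort(-scores)[:k]        # top-k item indices by predicted score
```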
4.2 Slim and Variants
While the Slim model [16] has shown competitive empirical results in numerous papers, it is computationally expensive to train, e.g., see [13, 16] and Section 3.4. This has sparked follow-up work proposing various modifications. In [12], both constraints on the weight matrix (non-negativity and zero diagonal) were dropped, resulting in a regression problem with an elastic-net regularization. While competitive ranking results were obtained in [12], in the experiments in [13] it was found that its performance was considerably below par. The square loss in Slim was replaced by the logistic loss in [20], which entailed that both constraints on the weight matrix could be dropped, as argued by the authors. Moreover, the L1-norm regularization was dropped, and a user-user weight-matrix was learned instead of an item-item matrix.
All these approaches take advantage of the fact that the optimization problem decomposes into independent and embarrassingly parallel problems. As discussed in the previous section, however, this is several orders of magnitude more costly than training EASE^R if the weight matrix fits into memory.
Most importantly, while those modifications of Slim dropped the constraint of a zero diagonal in the weight matrix, it is retained in EASE^R. In fact, we found it to be the most crucial property for achieving improved ranking accuracy (see Section 5). As we showed in Section 3.1, this constraint can easily be included in the training objective via the method of Lagrangian multipliers, allowing for a closed-form solution.
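As an illustration of this point, the constrained objective and its Lagrangian can be sketched as follows (the notation is meant to be consistent with the description above, not a verbatim copy of the elided Section 3.1):

```latex
% Training objective with the zero-diagonal constraint (sketch):
\min_{B} \; \| X - X B \|_F^2 \;+\; \lambda \, \| B \|_F^2
\qquad \text{subject to} \qquad \operatorname{diag}(B) = 0 .

% Lagrangian, with one multiplier \gamma_j per item:
L(B, \gamma) \;=\; \| X - X B \|_F^2 \;+\; \lambda \, \| B \|_F^2
\;+\; 2 \, \gamma^{\top} \operatorname{diag}(B) .
```

Setting the gradient with respect to B to zero and choosing the multipliers so that the diagonal vanishes yields the closed-form estimate, which is why no iterative optimization is needed.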
Compared to Slim [16], we dropped the constraint of non-negative weights, which we found to greatly improve ranking accuracy in our experiments (see Table 1 and Figure 2). Moreover, we did not use L1-norm regularization, for computational efficiency. We also did not find sparsity to noticeably improve ranking accuracy (see Section 5); the learned weight matrix B̂ of EASE^R is dense. Also note that extensions of Slim, like cSLIM [17], can be turned into analogous extensions of EASE^R.
4.3 Neighborhood-based Approaches
Numerous neighborhood-based approaches have been proposed in the literature (e.g., see [23, 24] and references therein). While model-based approaches were found to achieve better ranking accuracy on some data sets, neighborhood-based approaches dominated on others, e.g., the Million Song Data Competition on Kaggle [1, 14]. Essentially, the co-occurrence matrix (or some modified variant) is typically used as the item-item or user-user similarity matrix in neighborhood-based methods. These approaches are usually heuristics, as the similarity matrix is not learned by optimizing an objective function (loss function or likelihood). More importantly, the closed-form solution derived in Eqs. 4 and 8 reveals that the inverse of the data Gram matrix is the conceptually correct similarity matrix, see Section 3.2 for more details. This is in contrast to the typical neighborhood-based approaches, which use the data Gram-matrix without inversion. The use of the conceptually correct, inverse matrix in EASE^R may explain the improvement observed in Table 2 compared to the heuristics used by state-of-the-art neighborhood approaches.
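To make the contrast explicit in symbols (again a sketch; the paper's Eqs. 4 and 8 are not reproduced verbatim here): a typical neighborhood heuristic scores items with the Gram matrix G = XᵀX itself, whereas the closed-form solution builds its weights from the regularized inverse of G:

```latex
% Heuristic neighborhood similarity (co-occurrence / Gram matrix):
S \;\propto\; X^{\top} X .

% Weights built from the regularized inverse of the Gram matrix:
P \;=\; \bigl( X^{\top} X + \lambda I \bigr)^{-1},
\qquad
\hat{B}_{ij} \;=\;
\begin{cases}
0 & \text{if } i = j, \\
-\, P_{ij} / P_{jj} & \text{otherwise.}
\end{cases}
```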
5 EXPERIMENTS
…
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2019 EmbarrassinglyShallowAutoencode | Harald Steck | | | Embarrassingly Shallow Autoencoders for Sparse Data | | | | | | 2019 |