Consistent Weighted Sampling (CWS) Algorithm
Jump to navigation
Jump to search
A Consistent Weighted Sampling (CWS) Algorithm is a sampling algorithm that is based on Jaccard Distance.
- Context:
- It based on the the generalized Jaccard similarity is defined as:
[math]\displaystyle{ generalized\;J\left(\mathcal{S},\mathcal{T}\right) =\dfrac{\sum_k \mathrm{min}\left(S_k, T_k\right)}{\sum_k \mathrm{max}\left(S_k, T_k\right)} }[/math]
where $S$ and $T$ are two weighted sets.
- …
- It based on the the generalized Jaccard similarity is defined as:
- Example(s):
- Counter-Example(s):
- See: Weighted Set, Stochastic Approximate Bayesian Inference Algorithm, Hypothesis Evaluation Task, MinHash Algorithm.
References
2017a
- (Wu et al., 2017) ⇒ Wei Wu, Bin Li, Ling Chen, Chengqi Zhang and Philip S. Yu (2017). "Improved Consistent Weighted Sampling Revisited". In: arXiv:1706.01172.
- QUOTE: In most real-world scenarios, weighted sets are more commonly seen than binary sets. For example, a document is commonly represented as a tf-idf set. In order to reasonably compute the similarity of two weighted sets, the generalized Jaccard similarity was introduced in (Haveliwala et al., 2000)[1]. Considering two weighted sets, $S$ and $T$ , the generalized Jaccard similarity is defined as
- QUOTE: In most real-world scenarios, weighted sets are more commonly seen than binary sets. For example, a document is commonly represented as a tf-idf set. In order to reasonably compute the similarity of two weighted sets, the generalized Jaccard similarity was introduced in (Haveliwala et al., 2000)[1]. Considering two weighted sets, $S$ and $T$ , the generalized Jaccard similarity is defined as
[math]\displaystyle{ generalized\;J\left(\mathcal{S},\mathcal{T}\right) =\dfrac{\sum_k \mathrm{min}\left(S_k, T_k\right)}{\sum_k \mathrm{max}\left(S_k, T_k\right)} }[/math] | (1) |
- In order to efficiently compute the generalized Jaccard similarity, the Consistent Weighted Sampling (CWS) scheme has been proposed in Manasse et al. (2010).
Definition 2 (Consistent Weighted Sampling Manasse et al., 2010)). Given a weighted set $S = \{S_1, \cdots , S_n\}$, where $S_k \geq 0$ for $k \in \{1, \cdots , n\}$, Consistent Weighted Sampling (CWS) generates a sample $\left(k, y_k\right) : 0 \leq y_k \leq S_k$, which is uniform and consistent.
- Uniformity: The subelement $\left(k, y_k\right)$ should be uniformly sampled from $cup_k \left(\{k\}\times \left[0, S_k\right]\right)$, i.e., the probability of selecting the $k$-th element is proportional to $S_k$, and $y_k$ is uniformly distributed in $\left[0, S_k\right]$.
- Consistency: Given two non-empty weighted sets, $S$ and $T$ , if $\forall_k$, $T_k \leq S_k$, a subelement $\left(k, y_k\right)$ is selected from $S$ and satisfies $y_k \leq T_k$, then $\left(k, y_k\right)$ will also be selected from $T$ .
- CWS has the following property $Pr\left[CWS(\mathcal{S}) = CWS(\mathcal{T})\right] = generalized\;J\left(\mathcal{S}, \mathcal{T} \right)$
- In order to efficiently compute the generalized Jaccard similarity, the Consistent Weighted Sampling (CWS) scheme has been proposed in Manasse et al. (2010).
2017b
- (Wu et al., 2017) ⇒ Wei Wu, Bin Li, Ling Chen, and Chengqi Zhang (2017). "Consistent Weighted Sampling Made More Practical". In: Proceedings of the 26th International Conference on World Wide Web (WWW 2017). DOI:10.1145/3038912.3052598.
- QUOTE: In this paper, we propose a Practical CWS (PCWS) algorithm. We first transform the original form of ICWS into an equivalent expression, based on which we find some interesting properties that inspire us to make the ICWS algorithm simpler and more efficient in both space and time complexities. PCWS is not only mathematically equivalent to ICWS and preserves the same theoretical properties, but also saves 20% memory footprint and substantial computational cost compared to ICWS. The experimental results on a number of real-world text data sets demonstrate that PCWS obtains the same (even better) classification and retrieval performance as ICWS with 1/5~1/3 reduced empirical runtime.
2016
- (Wu et al., 2016) ⇒ Wei Wu, Bin Li, Ling Chen, and Chengqi Zhang (2016). "Canonical Consistent Weighted Sampling for Real-Value Weighted Min-Hash". In: Proceedings of 16th International Conference on Data Mining (IEEE 2016). DOI:10.1109/ICDM.2016.0174.
- QUOTE: In this paper, we propose a Canonical Consistent Weighted Sampling (CCWS) algorithm, which not only retains the same theoretical complexity as ICWS but also strictly complies with the definition of Min-Hash. The experimental results demonstrate that the proposed CCWS algorithm runs faster than the state-of-the-arts while achieving similar classification performance on a number of real-world text data sets.
2015
- (Li, 2015) ⇒ Ping Li. (2015). "0-Bit Consistent Weighted Sampling". In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2015). ISBN:978-1-4503-3664-2 doi:10.1145/2783258.2783406.
2010a
- (Manasse et al., 2010) ⇒ Mark Manasse, Frank McSherry, and Kunal Talwar. (2010). “Consistent Weighted Sampling". Unpublished technical report.
- QUOTE: We describe an efficient procedure for sampling representatives from a weighted set such that for any weightings S and T, the probability that the two choose the same sample is equal to the Jaccard similarity between …
2010b
- (Ioffe, 2010) ⇒ Sergey Ioffe (2010). "Improved Consistent Sampling, Weighted Minhash and L1 Sketching". In: Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010).
- ↑ T. H. Haveliwala, A. Gionis, and P. Indyk, “Scalable Techniques for Clustering the Web,” in WebDB, 2000, pp. 129–134