Consistent Weighted Sampling (CWS) Algorithm

A Consistent Weighted Sampling (CWS) Algorithm is a sampling algorithm for weighted sets in which the probability that two sets yield the same sample equals their generalized Jaccard similarity.

Context:
- It is based on the generalized Jaccard similarity, which is defined as:
  [math]\displaystyle{ generalized\;J\left(\mathcal{S},\mathcal{T}\right) =\dfrac{\sum_k \mathrm{min}\left(S_k, T_k\right)}{\sum_k \mathrm{max}\left(S_k, T_k\right)} }[/math]
  where $\mathcal{S}$ and $\mathcal{T}$ are two weighted sets, and $S_k$ and $T_k$ are the weights of their $k$-th elements.
- …
Example(s):
Counter-Example(s):
- Monte Carlo Sampling Algorithm,
- Poisson Sampling Algorithm.
See: Weighted Set, Stochastic Approximate Bayesian Inference Algorithm, Hypothesis Evaluation Task, MinHash Algorithm.
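
As a minimal sketch of the generalized Jaccard similarity given in the Context above, the following Python function represents each weighted set as a dictionary from element to non-negative weight. The function name, the dictionary representation, and the convention of returning 0.0 when both sets are empty are assumptions made for illustration; they are not part of the cited definition.

```python
def generalized_jaccard(s, t):
    """Generalized Jaccard similarity of two weighted sets.

    `s` and `t` map each element k to a non-negative weight
    (a hypothetical representation; missing elements have weight 0).
    """
    keys = set(s) | set(t)
    numerator = sum(min(s.get(k, 0.0), t.get(k, 0.0)) for k in keys)
    denominator = sum(max(s.get(k, 0.0), t.get(k, 0.0)) for k in keys)
    # Assumed convention: two empty sets have similarity 0.0.
    return numerator / denominator if denominator > 0 else 0.0


S = {"a": 2.0, "b": 1.0}
T = {"a": 1.0, "c": 1.0}
# Sum of minima is 1 (only "a" overlaps); sum of maxima is 2 + 1 + 1 = 4.
print(generalized_jaccard(S, T))  # 0.25
```

For binary sets (all weights 0 or 1) this reduces to the ordinary Jaccard similarity, intersection size over union size.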

References

2017a

(Wu et al., 2017) ⇒ Wei Wu, Bin Li, Ling Chen, Chengqi Zhang and Philip S. Yu (2017). "Improved Consistent Weighted Sampling Revisited". In: arXiv:1706.01172.
- QUOTE: In most real-world scenarios, weighted sets are more commonly seen than binary sets. For example, a document is commonly represented as a tf-idf set. In order to reasonably compute the similarity of two weighted sets, the generalized Jaccard similarity was introduced in (Haveliwala et al., 2000)^[1]. Considering two weighted sets, $S$ and $T$, the generalized Jaccard similarity is defined as

[math]\displaystyle{ generalized\;J\left(\mathcal{S},\mathcal{T}\right) =\dfrac{\sum_k \mathrm{min}\left(S_k, T_k\right)}{\sum_k \mathrm{max}\left(S_k, T_k\right)} }[/math]

(1)

In order to efficiently compute the generalized Jaccard similarity, the Consistent Weighted Sampling (CWS) scheme has been proposed in Manasse et al. (2010).

Definition 2 (Consistent Weighted Sampling (Manasse et al., 2010)). Given a weighted set $S = \{S_1, \cdots , S_n\}$, where $S_k \geq 0$ for $k \in \{1, \cdots , n\}$, Consistent Weighted Sampling (CWS) generates a sample $\left(k, y_k\right) : 0 \leq y_k \leq S_k$, which is uniform and consistent.

Uniformity: The subelement $\left(k, y_k\right)$ should be uniformly sampled from $\bigcup_k \left(\{k\}\times \left[0, S_k\right]\right)$, i.e., the probability of selecting the $k$-th element is proportional to $S_k$, and $y_k$ is uniformly distributed in $\left[0, S_k\right]$.
Consistency: Given two non-empty weighted sets, $S$ and $T$, if $\forall k$, $T_k \leq S_k$, a subelement $\left(k, y_k\right)$ is selected from $S$ and satisfies $y_k \leq T_k$, then $\left(k, y_k\right)$ will also be selected from $T$.

CWS has the following property

$Pr\left[CWS(\mathcal{S}) = CWS(\mathcal{T})\right] = generalized\;J\left(\mathcal{S}, \mathcal{T} \right)$
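
The collision property above can be checked empirically. The sketch below uses one published CWS scheme, Improved CWS (ICWS; Ioffe, 2010); this is a swapped-in construction that the quoted passage does not spell out, so it should be read as an illustration under stated assumptions and not as the method of the cited paper. The function names, the dictionary representation of weighted sets, and the string-based seeding used to share randomness across sets are all hypothetical choices.

```python
import math
import random


def icws_sample(weights, hash_index, seed=0):
    """Draw one ICWS-style sample (k, t_k) from a weighted set.

    `weights` maps element k to a weight; only weights > 0 take part.
    The random draws depend only on (seed, hash_index, k), so every set
    sees the same randomness for element k, which is what makes the
    sampling consistent across sets.
    """
    best = None
    for k, w in weights.items():
        if w <= 0:
            continue
        # Assumed seeding scheme: one deterministic generator per element.
        rng = random.Random(f"{seed}:{hash_index}:{k}")
        r = rng.gammavariate(2.0, 1.0)
        c = rng.gammavariate(2.0, 1.0)
        beta = rng.random()
        t = math.floor(math.log(w) / r + beta)
        y = math.exp(r * (t - beta))   # sampled position, at most w up to rounding
        a = c / (y * math.exp(r))      # the element with the smallest a is selected
        if best is None or a < best[0]:
            best = (a, k, t)
    if best is None:
        raise ValueError("weighted set has no positive weights")
    return best[1], best[2]


def estimate_similarity(s, t, num_hashes, seed=0):
    """Fraction of hash indices on which s and t yield the same sample."""
    hits = sum(
        icws_sample(s, i, seed) == icws_sample(t, i, seed)
        for i in range(num_hashes)
    )
    return hits / num_hashes


if __name__ == "__main__":
    S = {"a": 2.0, "b": 1.0}
    T = {"a": 1.0, "c": 1.0}
    # The exact generalized Jaccard similarity of S and T is 1/4;
    # the estimate should approach it as num_hashes grows.
    print(estimate_similarity(S, T, 2000))
```

The sample is returned as $(k, t_k)$ instead of $(k, y_k)$: with shared randomness, $y_k$ is a deterministic function of $k$ and $t_k$, so comparing the integer $t_k$ gives the same collisions while avoiding floating-point equality tests.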

[1] T. H. Haveliwala, A. Gionis, and P. Indyk, “Scalable Techniques for Clustering the Web,” in WebDB, 2000, pp. 129–134
