2015 StreamSamplingforFrequencyCapSt
- (Cohen, 2015) ⇒ Edith Cohen. (2015). “Stream Sampling for Frequency Cap Statistics.” In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2015). ISBN:978-1-4503-3664-2 doi:10.1145/2783258.2783279
Subject Headings:
Notes
Cited By
- http://scholar.google.com/scholar?q=%222015%22+Stream+Sampling+for+Frequency+Cap+Statistics
- http://dl.acm.org/citation.cfm?id=2783258.2783279&preflayout=flat#citedby
Quotes
Author Keywords
Abstract
Unaggregated data, in a streamed or distributed form, is prevalent and comes from diverse sources such as interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries) and elements with different keys interleave.
Analytics on such data typically utilizes statistics expressed as a sum over keys in a specified segment of a function f applied to the frequency (the total number of occurrences) of the key. In particular, Distinct is the number of active keys in the segment, Sum is the sum of their frequencies, and both are special cases of frequency cap statistics, which cap the frequency by a parameter T. One important application of cap statistics is staging advertisement campaigns, where the cap parameter is the limit of the maximum number of impressions per user and we estimate the total number of qualifying impressions.
The number of distinct active keys in the data can be very large, making exact computation of queries costly. Instead, we can estimate these statistics from a sample. An optimal sample for a given function f would include a key with frequency w with probability roughly proportional to f (w). But while such a “gold-standard” sample can be easily computed over the aggregated data (the set of key-frequency pairs), exact aggregation itself is costly and slow. Ideally, we would like to compute and maintain a sample without aggregation.
We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and state proportional to the desired sample size. Our design unifies classic solutions for Distinct and Sum. Specifically, our l-capped samples provide nonnegative unbiased estimates of any monotone non-decreasing frequency statistics, and close to gold-standard estimates for frequency cap statistics with T =Î (l). Furthermore, our design facilitates multi-objective samples, which provide tight estimates for a specified set of statistics using a single smaller sample. .
References
;
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year | |
---|---|---|---|---|---|---|---|---|---|---|
2015 StreamSamplingforFrequencyCapSt | Edith Cohen | Stream Sampling for Frequency Cap Statistics | 10.1145/2783258.2783279 | 2015 |