2009 ProbabilisticDatabases

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Probabilistic Data Record Set.

Notes

Quotes

Facet 1: Semantics and Representation

The de facto formal semantics of a probabilistic database is the possible worlds model.[Dalvi & Suciu, 2007] By contrast, there is no agreement on a representation system, instead there are several approaches covering a spectrum between expressive power and usability.[Benjelloun et al., 2008] A key concept in most representation systems is that of lineage, which is derived from early work on incomplete databases by Immelinski and Lipski.[Imielinski & Lipski, 1984]

Possible Worlds Semantics. In its most general form, a probabilistic database is a probability space over the possible contents of the database. It is customary to denote a (conventional) relational database instance with the letter I. Assuming there is a single table in our database, I is simply a set of tuples (records) representing that table; this is a conventional database. A probabilistic database is a discrete probability space PDB = (W, P), where W = {I1,I2, ..., In} is a set of possible instances, called possible worlds, and P: W → [0, 1] is such that Σj=1,nP(Ij) = 1. In the terminology of networks of belief, there is one random variable for each possible tuple whose values are 0 (meaning that the tuple is not present) or 1 (meaning that the tuple is present), and a probabilistic database is a joint probability distribution over the values of these random variables. *... Consider some tuple t (we use interchangeably the terms tuple and record in this article). The probability that the tuple belongs to a randomly chosen world is P(t) = ∑j:t∈Ij P(Ij), and is also called the marginal probability of the tuple t. Similarly, if we have two tuples t1, t2, we can examine the probability that both are present in a randomly chosen world, denoted P(t1t2). When the latter is P(t1)P(t2), we say that t1, t2 are independent tuples; if it is 0 then we say that t1, t2 are disjoint tuples or exclusive tuples. If none of these hold, then the tuples are correlated in a nonobvious way. Consider a query Q, expressed in some relational query language like SQL, and a possible tuple t in the query's answer. P(tQ) denotes the probability that, in a randomly chosen world, t is an answer to Q. The job of a probabilistic database system is to return all possible tuples t1, t2, ...together with their probabilities P(t1Q), P(t2Q),....

-

References

  • 1. Andritsos, P. and Fuxman, A., Miller, R.J. Clean answers over dirty databases. In ICDE (2006).

  • 2. Antova, L., Jansen, T., Koch, C. and Olteanu, D. Fast and simple relational processing of uncertain data. In ICDE (2008).

  • 3. Barbara, D., Garcia-Molina, H. and Porter, D. The management of probabilistic data. IEEE Trans. Knowl. Data Eng. 4, 5 (1992), 487–502.

  • 4. Benjelloun, O., Sarma, A.D., Halevy, A., Theobald, M. and Widom, J. Databases with uncertainty and lineage. VLDBJ 17, 2 (2008), 243–264.

  • 5. Burdick, D., Deshpande, P., Jayram, T.S., Ramakrishnan, R. and Vaithyanathan, S. Efficient allocation algorithms for OLAP over imprecise data. In VLDB (2006), 391–402.

  • 6. Cavallo, R. and Pittarelli, M. The theory of probabilistic databases. In Proceedings of VLDB (1987), 71–81.

  • 7. Cheng, R., Kalashnikov, D. and Prabhakar, S. Evaluating probabilistic queries over imprecise data. In SIGMOD (2003), 551–562.

  • 8. Codd, E.F. Relational completeness of data base sublanguages. In Database Systems (1972), Prentice-Hall, 65–98.

  • 9. Cowell, R., Dawid, P., Lauritzen, S. and Spiegelhalter D., eds. Probabilistic Networks and Expert Systems (1999), Springer.

  • 10. Dalvi, N. and Suciu, D. The dichotomy of conjunctive queries on probabilistic structures. In PODS (2007), 293–302.

  • 11. Dalvi, N. and Suciu, D. Efficient query evaluation on probabilistic databases. VLDB J. 16, 4 (2007), 523–544.

  • 12. Dalvi, N. and Suciu, D. Management of probabilistic data: Foundations and challenges. In PODS (Beijing, China, 2007) 1–12 (invited talk).

  • 13. Darwiche, A. A differential approach to inference in bayesian networks. J. ACM 50, 3 (2003), 280–305.

  • 14. DeRose, P., Shen, W., Chen, F., Lee, Y., Burdick, D., Doan, A. and Ramakrishnan, R. Dblife: A community information management platform for the database research community. In CIDR (2007), 169–172.

  • 15. Deshpande, A., Guestrin, C., Madden, S., Hellerstein, J.M. and Hong, W. Model-driven data acquisition in sensor networks. In VLDB (2004), 588–599.

  • 16. Fagin, R., Lotem, A. and Naor, M. Optimal aggregation algorithms for middleware. In Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (2001), ACM Press, 102–113.

  • 17. Friedman, N., Getoor, L., Koller, D. and Pfeffer A. Learning probabilistic relational models. In IJCAI (1999), 1300–1309.

  • 18. Fuhr, N. and Roelleke, T. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst. 15, 1 (1997), 32–66.

  • 19. Grädel, E., Gurevich, Y. and Hirsch, C. The complexity of query reliability. In PODS (1998), 227–234.

  • 20. Gupta, R. and Sarawagi, S. Creating probabilistic databases from information extraction models. In VLDB (2006), 965–976.

  • 21. Halevy, A. Answering queries using views: A survey. VLDB J. 10, 4 (2001), 270–294.

  • 22. Imielinski, T. and Lipski, W. Incomplete information in relational databases. J. ACM 31 (Oct. 1984), 761–791.

  • 23. Jampani, R., Xu, F., Wu, M., Perez, L., Jermaine, C. and Haas, P. MCDB: A Monte Carlo approach to managing uncertain data. In SIGMOD (2008), 687–700.

  • 24. Jayram, T., Kale, S. and Vee, E. Efficient aggregation algorithms for probabilistic data. In SODA (2007).

  • 25. Kanagal, B. and Deshpande, A. Online filtering, smoothing and probabilistic modeling of streaming data. In ICDE (2008), 1160–1169.

  • 26. Lafferty, J., Andrew McCallum and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML (2001).

  • 27. Lakshmanan, L., Leone, N., Ross, R. and Subrahmanian, V. Probview: A flexible probabilistic database system. ACM Trans. Database Syst. 22, 3 (1997).

  • 28. Nierman, A. and Jagadish, H. ProTDB: Probabilistic data in XML. In VLDB (2002), 646–657.

  • 29. Olteanu, D., Huang, J. and Koch, C. SPROUT: Lazy vs. eager query plans for tuple independent probabilistic databases. In ICDE (2009).

  • 30. Rastogi, V., Suciu, D. and Hong, S. The boundary between privacy and utility in data publishing. In VLDB (2007).

  • 31. Ré, C., Dalvi, N. and Suciu, D. Efficient Top-k query evaluation on probabilistic data. In ICDE (2007).

  • 32. Ré, C., Suciu, D. Efficient evaluation of having queries on a probabilistic database. In Proceedings of DBPL (2007).

  • 33. Ré, C. and Suciu, D. Materialized views in probabilistic databases for information exchange and query optimization. In: Proceedings of VLDB (2007)

  • 34. Ré, C., Letchner, J., Balazinska, M. and Suciu, D. Event queries on correlated probabilistic streams. In SIGMOD (Vancouver, Canada, 2008).

  • 35. Roth, D. On the hardness of approximate reasoning. Artif. Intell. 82, 1–2 (1996), 273–302.

  • 36. Sen, P. and Deshpande, A. Representing and querying correlated tuples in probabilistic databases. In ICDE, 2007.

  • 37. Soliman, M.A., Ilyas, I.F. and Chang, K.C.-C. Probabilistic top- and ranking-aggregate queries. ACM Trans. Database Syst. 33, 3 (2008).

  • 38. Vardi, M.Y. The complexity of relational query languages. In Proceedings of 14th ACM SIGACT Symposium on the Theory of Computing (San Francisco, California, 1982), 137–146.

  • 39. Verma, T. and Judea Pearl Causal networks: Semantics and expressiveness. Uncertainty Artif. Intell. 4 (1990), 69–76.

  • 40. Wong, E. A statistical approach to incomplete information in database systems. ACM Trans. Database Syst. 7, 3 (1982), 470–488.


,

 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2009 ProbabilisticDatabasesNilesh Dalvi
Christopher Ré
Dan Suciu
Probabilistic Databases: diamonds in the dirtCommunications of the ACM10.1145/1538788.15388102009