DeepDive KBC Platform

From GM-RKB
Jump to navigation Jump to search

A DeepDive KBC Platform is a KBC/knowledge base population system managed by Stanford University.



References

2017

2016

  • http://deepdive.stanford.edu/
    • QUOTE: DeepDive is a system to extract value from dark data. Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by existing software. DeepDive helps bring dark data to light by creating structured data (SQL tables) from unstructured information (text documents) and integrating such data with an existing structured database. DeepDive is used to extract sophisticated relationships between entities and make inferences about facts involving those entities. DeepDive helps one process a wide variety of dark data and put the results into a database. With the data in a database, one can use a variety of standard tools that consume structured data; e.g., visualization tools like Tablaeu or analytics tools like Excel.

       DeepDive is a new type of data management system that enables one to tackle extraction, integration, and prediction problems in a single system, which allows users to rapidly construct sophisticated end-to-end data pipelines, such as dark data BI (Business Intelligence) systems. By allowing users to build their system end-to-end, DeepDive allows users to focus on the portion of their system that most directly improves the quality of their application. By contrast, previous pipeline-based systems require developers to build extractors, integration code, and other components — without any clear idea of how their changes improve the quality of their data product. This simple insight is the key to how DeepDive systems produce higher quality data in less time. DeepDive-based systems are used by users without machine learning expertise in a number of domains from paleobiology to genomics to human trafficking; see our showcase for examples.

       DeepDive is a trained system that uses machine learning to cope with various forms of noise and imprecision. DeepDive is designed to make it easy for users to train the system through low-level feedback via the Mindtagger interface and rich, structured domain knowledge via rules. DeepDive wants to enable experts who do not have machine learning expertise. One of DeepDive's key technical innovations is the ability to solve statistical inference problems at massive scale.

       DeepDive differs from traditional systems in several ways:

      • DeepDive asks the developer to think about features — not algorithms. In contrast, other machine learning systems require the developer to think about which clustering algorithm, classification algorithm, etc. to use. In DeepDive's joint inference based approach, the user only specifies the necessary signals or features.
      • DeepDive systems can achieve high quality: PaleoDeepDive has higher quality than human volunteers in extracting complex knowledge in scientific domains and winning performance in entity relation extraction competitions.
      • DeepDive is aware that data is often noisy and imprecise: names are misspelled, natural language is ambiguous, and humans make mistakes. Taking such imprecision into account, DeepDive computes calibrated probabilities for every assertion it makes. For example, if DeepDive produces a fact with probability 0.9, the fact is 90% likely to be true.
      • DeepDive is able to use large amounts of data from a variety of sources. Applications built using DeepDive have extracted data from millions of documents, web pages, PDFs, tables, and figures.
      • DeepDive allows developers to use their knowledge of a given domain to improve the quality of the results by writing simple rules that inform the inference (learning) process. DeepDive can also take into account user feedback on the correctness of the predictions to improve the predictions.

2015

2015

  • (Zhang, 2015) ⇒ Ce Zhang. (2015). “DeepDive: A Data Management System for Automatic Knowledge Base Construction.” In: PhD thesis Journal.
    • QUOTE: DeepDive is a data management system that we built to study this problem; its ultimate goal is to allow scientists to build a KBC system by declaratively specifying domain knowledge without worrying about any algorithmic, performance, or scalability issues. DeepDive was built by generalizing from our experience in building more than ten high-quality KBC systems, many of which exceed human quality or are top-performing systems in KBC competitions, and many of which were built completely by scientists or industry users using DeepDive. From these examples, we designed a declarative language to specify a KBC system and a concrete protocol that iteratively improves the quality of KBC systems. This flexible framework introduces challenges of scalability and performance. Many KBC systems built with DeepDive contain statistical inference and learning tasks over terabytes of data, and the iterative protocol also requires executing similar inference problems multiple times. Motivated by these challenges, we designed techniques that make both the batch execution and incremental execution of a KBC program up to two orders of magnitude more efficient and/or scalable. This dissertation describes the DeepDive framework, its applications, and these techniques, to demonstrate the thesis that it is feasible to build an efficient and scalable data management system for the end-to-end workflow of building KBC systems.