Apache UIMA System
An Apache UIMA System is an Unstructured Data Management System that follows the UIMA Architecture.
- …
- Counter-Example(s):
- a GATE System.
- See: Apache OpenNLP, Apache Hadoop, Apache Lucene.
References
2011
- http://en.wikipedia.org/wiki/UIMA
- UIMA stands for Unstructured Information Management Architecture. An OASIS standard as of March 2009, UIMA is to date the only industry standard for content analytics.
UIMA is a component software architecture, originally developed by IBM, for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and their integration with search technologies. The source code for a reference implementation of this framework was first made available on SourceForge and later on the website of the Apache Software Foundation.
An example is a logistics analysis software system that could convert unstructured data such as repair logs and service notes into relational tables. These tables can then be used by automated tools to detect maintenance or manufacturing problems.
Other examples are systems that are used in medical environments to analyze clinical notes.
The UIMA architecture can be thought of in four dimensions:
- It specifies component interfaces in an analytics pipeline.
- It describes a set of design patterns.
- It suggests two data representations: an in-memory representation of annotations for high-performance analytics and an XML representation of annotations for integration with remote web services.
- It suggests development roles, allowing tools to be used by users with diverse skills.
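The two data representations named in the list above can be illustrated with a minimal Python sketch. It is not the UIMA API or the UIMA XML format: the `Annotation` class, the `to_xml` and `from_xml` helpers, and the XML element names are hypothetical and chosen only to show the idea of stand-off annotations that live in memory and round-trip through XML.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-off annotation: a typed span over the document text."""
    type: str
    begin: int  # inclusive character offset
    end: int    # exclusive character offset


def to_xml(text: str, annotations: list[Annotation]) -> str:
    """Serialize text and annotations to an illustrative (non-UIMA) XML layout."""
    root = ET.Element("document")
    ET.SubElement(root, "text").text = text
    for a in annotations:
        ET.SubElement(root, "annotation",
                      type=a.type, begin=str(a.begin), end=str(a.end))
    return ET.tostring(root, encoding="unicode")


def from_xml(xml: str) -> tuple[str, list[Annotation]]:
    """Rebuild the in-memory representation from the XML produced by to_xml."""
    root = ET.fromstring(xml)
    text = root.findtext("text") or ""
    annotations = [
        Annotation(e.get("type"), int(e.get("begin")), int(e.get("end")))
        for e in root.findall("annotation")
    ]
    return text, annotations
```

The in-memory list is what an analytic would work on directly, while the XML string is the form that could be sent to a remote service and read back without loss.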
- http://uima.apache.org/index.html
- Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
UIMA is made of many things. UIMA enables applications to be decomposed into components, for example "language identification" ⇒ "language specific segmentation" ⇒ "sentence boundary detection" ⇒ "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.
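The component decomposition described above can be sketched as a chain of functions that each add results to a shared document object. This is a toy illustration in Python, not the UIMA framework or its interfaces: the `Document` class, the four component functions, the stop-word heuristic, and the `GAZETTEER` lookup table are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical shared data structure passed from component to component."""
    text: str
    language: str = "unknown"
    annotations: list = field(default_factory=list)  # (type, begin, end) tuples


def identify_language(doc):
    # Toy heuristic (assumption): a few English stop words mean "en".
    words = set(re.findall(r"[a-z]+", doc.text.lower()))
    doc.language = "en" if words & {"the", "is", "in", "and"} else "unknown"


def segment_tokens(doc):
    # A real segmenter would branch on doc.language; this one treats any
    # run of word characters as a token.
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(("Token", m.start(), m.end()))


def detect_sentences(doc):
    # Naive rule: a sentence runs up to the next '.', '!' or '?'.
    for m in re.finditer(r"\S[^.!?]*[.!?]", doc.text):
        doc.annotations.append(("Sentence", m.start(), m.end()))


GAZETTEER = {"Ada Lovelace": "Person", "London": "Place"}  # hypothetical lookup


def detect_entities(doc):
    for name, entity_type in GAZETTEER.items():
        for m in re.finditer(re.escape(name), doc.text):
            doc.annotations.append((entity_type, m.start(), m.end()))


PIPELINE = [identify_language, segment_tokens, detect_sentences, detect_entities]


def run_pipeline(text):
    """Run every component in order over one document and return the result."""
    doc = Document(text)
    for component in PIPELINE:
        component(doc)
    return doc
```

Because every component reads and writes the same `Document`, components can be reordered or swapped without changing the others, which is the property the decomposition above is meant to provide.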
Apache UIMA is an Apache-licensed open source implementation of the UIMA specification (that specification is, in turn, being developed concurrently by a technical committee within OASIS, a standards organization). We invite and encourage you to participate in both the implementation and specification efforts.
- (Apache Blog, 2011-02-14) ⇒ Apache Blog. (2011). “Apache UIMA and Apache Hadoop Advance Data Intelligence and Semantics Capabilities of Watson Supercomputer.” Feb 14, 2011
- Apache UIMA: standards-based frameworks, infrastructure and components that facilitate the analysis and annotation of an array of unstructured content (such as text, audio and video). Watson uses Apache UIMA for real-time content analytics and natural language processing, to comprehend clues, find possible answers, gather supporting evidence, score each answer, compute its confidence in each answer, and improve contextual understanding (machine learning) – all under 3 seconds.
- Apache Hadoop: software framework that enables data-intensive distributed applications to work with thousands of nodes and petabytes of data. A foundation of Cloud computing, Apache Hadoop enables Watson to access, sort, and process data in a massively parallel system (90+ server cluster/2,880 processor cores/16 terabytes of RAM/4 terabytes of disk storage).
- The Watson system uses UIMA as its principal infrastructure for component interoperability and makes extensive use of the UIMA-AS scale-out capabilities that can exploit modern, highly parallel hardware architectures. UIMA manages all workflow and communication between processes, which are spread across the cluster. Apache Hadoop manages the task of preprocessing Watson's enormous information sources by deploying UIMA pipelines as Hadoop mappers, running UIMA analytics.
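The idea of running an analysis pipeline inside a mapper can be sketched in plain Python. This uses neither the Hadoop nor the UIMA API and says nothing about how Watson was actually implemented: `make_mapper`, `find_years`, and the record layout are hypothetical, and the built-in `map` stands in for a cluster that would apply the same function to many records in parallel.

```python
import re


def find_years(text):
    """Toy analytic (assumption): annotate four-digit numbers as years."""
    return [("Year", m.start(), m.end()) for m in re.finditer(r"\b\d{4}\b", text)]


def make_mapper(pipeline):
    """Wrap a list of analytics as a map function over (doc_id, text) records."""
    def mapper(record):
        doc_id, text = record
        annotations = []
        for analyze in pipeline:
            annotations.extend(analyze(text))
        return doc_id, annotations
    return mapper


def preprocess(records, pipeline):
    """Apply the mapper to every record; each record is processed independently."""
    return dict(map(make_mapper(pipeline), records))
```

Since the mapper touches only its own record, the records can be split across any number of workers, which is what makes this kind of preprocessing suitable for a map-style framework.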