This page describes the Snowball Relation Extraction Algorithm in detail.

Definition

Recall that: Snowball is an Eager Model-based Semi-Supervised Learning-based Relation Recognition Algorithm designed for Information Extraction Tasks that involve the extraction of Binary Relations.
Context:
- - Automated Text Annotation includes Named Entity Recognition.
  - Uses a Five-Tuple Lexically-based Relation Recognition Classifier Classification Model representation.
  - Uses Clustering to generalize Patterns.
  - Uses Bootstrapping to benefit from Unlabeled Data.
  - Uses Pattern Precision and a minimum precision Threshold to stop Pattern Generation.
  - Optimized for One-to-One Relations and One-to-Many Relations.
Example(s):
- http://snowball.cs.columbia.edu/
See: PPLRE Snowball, Snowball Internal Parameters, Snowball Algorithm Description.

Input/Output

This section describes Snowball's input/output requirements

Input: Initial Seed Tuples

One of the inputs is the initial set of seed tuples (Training Cases). For the OrganizationHeadquarterLocation() relation a sample set of seeds is:

| ORGANIZATION || LOCATION | | MICROSOFT || REDMOND | | IBM || ARMONK | | BOEING || SEATTLE | | INTEL || SANTA CLARA |

Input: Annotated Training Corpus

The second input is a corpus where the entities in the relation have been annotated.

Output: Tuples

This is an example of a Snowball output tuple: <Tuple> <TupleField> <FieldName>LOCALIZATION</FieldName> <FieldValue>+periplasm+</FieldValue> </TupleField> <TupleField> <FieldName>PROTEIN</FieldName> <FieldValue>+zncl2+</FieldValue> </TupleField> <TupleRank>0.30240786</TupleRank> <SupportingDocId>%2Fhome%2Fhomira%2Fdata%2FtrainSecondary%2F884</SupportingDocId> </Tuple> <Tuple> <TupleField> <FieldName>LOCALIZATION</FieldName> <FieldValue>+periplasm+</FieldValue> </TupleField> <TupleField> <FieldName>PROTEIN</FieldName> <FieldValue>+pao1161+</FieldValue> </TupleField> <TupleRank>0.305802</TupleRank> <SupportingDocId>%2Fhome%2Fhomira%2Fdata%2FtrainSecondary%2F6066</SupportingDocId> </Tuple>

Output: Patterns

This is an example of a Snowball pattern: <Pattern> <TermVector> </TermVector> <TermVector> <VectorElement> <Term>insertion</Term> <Weight>0.4082483</Weight> </VectorElement> <VectorElement> <Term>h</Term> <Weight>0.4082483</Weight> </VectorElement> <VectorElement> <Term>soluble</Term> <Weight>0.4082483</Weight> </VectorElement> <VectorElement> <Term>protein</Term> <Weight>0.4082483</Weight> </VectorElement> <VectorElement> <Term>produced</Term> <Weight>0.4082483</Weight> </VectorElement> <VectorElement> <Term>volcanii</Term> <Weight>0.4082483</Weight> </VectorElement> </TermVector> <TermVector> </TermVector> <SupportingTupleId>2</SupportingTupleId> <PatternRank>1.0</PatternRank> <PatternClosure>01</PatternClosure> </Pattern>

Algorithm

Process

Start with seed set R of tuples
Generate set P of patterns from R <= GeneratePatterns()
Compute support and confidence for each pattern in P
Discard patterns with low support or confidence
Generate new set T of tuples matching patterns P <= GenerateTuples(Patterns)
Compute confidence of each tuple in T
Add to R the tuples t2T with conf(t)> threshold.
Go back to step 2

Algorithm - Functions

GeneratePatterns() Function

Assign the first context vector representation V as the centroid of cluster C
For Vi, compute the similarity with the centroids of each existing clusters, let S be the highest similarity.
If S the threshold value (St), add the item to the corresponding cluster and re-compute the cluster centroids, else use Vi to form a new cluster
Go to step 2 if another V needs to be clustered, else proceed to step 5
Return the centroids of each cluster, add them (i.e. promoted extraction patterns) to candidate pattern set

  sub GeneratePatterns
    foreach (< oseed; `seed >, str) in Occurrences
  (1) tS =< ls; t1;ms; t2; rs > = makeOccurrence(str);
  (2) cluster best = FindClosestCluster(tS,  sim);
      if(best)
         best.Add(tS );
         best.UpdateCentroid(tS);
         best.AddTuple(< oseed; `seed >);
      else
         CreateNewCluster(< oseed; `seed >, tS);
  (3)Patterns = FilterPatterns( Clusters, sup);
    return Patterns;

GenerateTuples(Patterns) Function

  sub GenerateTuples(Patterns)
    foreach text segment in corpus
  (1) {<o,l>, < ls; t1;ms; t2; rs>} =
        = CreateOccurrence(text segment);
      TC = < o;` >;
      SimBest = 0;
      foreach p in Patterns
  (2)   sim = Match(< ls; t1;ms; t2; rs >; p);
        if (sim >= Tsim)
  (3)   UpdatePatternSelectivity(p, TC);
          if(sim   SimBest)
            SimBest = sim;
            PBest = p;
      if(SimBest >= Tsim)
        CandidateTuples [TC ].Patterns [PBest ] =
          = SimBest;
    return CandidateTuples;

Open Issues

Some of the recognized open problems for the application of Snowball are:

Automated setting of internal thresholds. Many thresholds are required. E.g.:
- the maximum number of words that a pattern can be found in.
- a minimum threshold on the similarity between records in clusters (similar to the selection of the number of clusters).

See: Snowball Internal Parameters

Snowball Algorithm Description