Snowball Algorithm Description
This page describes the Snowball Relation Extraction Algorithm in detail.
Definition
- Recall that: Snowball is an Eager Model-based Semi-Supervised Learning-based Relation Recognition Algorithm designed for Information Extraction Tasks that involve the extraction of Binary Relations.
- Context:
- Automated Text Annotation includes Named Entity Recognition.
- Uses a Five-Tuple Lexically-based Relation Recognition Classifier Classification Model representation.
- Uses Clustering to generalize Patterns.
- Uses Bootstrapping to benefit from Unlabeled Data.
- Uses Pattern Precision and a minimum precision Threshold to stop Pattern Generation.
- Optimized for One-to-One Relations and One-to-Many Relations.
- Example(s):
- See: PPLRE Snowball, Snowball Internal Parameters, Snowball Algorithm Description.
Input/Output
This section describes Snowball's input/output requirements
Input: Initial Seed Tuples
One of the inputs is the initial set of seed tuples (Training Cases). For the OrganizationHeadquarterLocation() relation a sample set of seeds is:
| ORGANIZATION || LOCATION | | MICROSOFT || REDMOND | | IBM || ARMONK | | BOEING || SEATTLE | | INTEL || SANTA CLARA |
Input: Annotated Training Corpus
The second input is a corpus where the entities in the relation have been annotated.
Output: Tuples
This is an example of a Snowball output tuple:
<Tuple>
<TupleField>
<FieldName>LOCALIZATION</FieldName>
<FieldValue>+periplasm+</FieldValue>
</TupleField>
<TupleField>
<FieldName>PROTEIN</FieldName>
<FieldValue>+zncl2+</FieldValue>
</TupleField>
<TupleRank>0.30240786</TupleRank>
<SupportingDocId>%2Fhome%2Fhomira%2Fdata%2FtrainSecondary%2F884</SupportingDocId>
</Tuple>
<Tuple>
<TupleField>
<FieldName>LOCALIZATION</FieldName>
<FieldValue>+periplasm+</FieldValue>
</TupleField>
<TupleField>
<FieldName>PROTEIN</FieldName>
<FieldValue>+pao1161+</FieldValue>
</TupleField>
<TupleRank>0.305802</TupleRank>
<SupportingDocId>%2Fhome%2Fhomira%2Fdata%2FtrainSecondary%2F6066</SupportingDocId>
</Tuple>
Output: Patterns
This is an example of a Snowball pattern:
<Pattern>
<TermVector>
</TermVector>
<TermVector>
<VectorElement>
<Term>insertion</Term>
<Weight>0.4082483</Weight>
</VectorElement>
<VectorElement>
<Term>h</Term>
<Weight>0.4082483</Weight>
</VectorElement>
<VectorElement>
<Term>soluble</Term>
<Weight>0.4082483</Weight>
</VectorElement>
<VectorElement>
<Term>protein</Term>
<Weight>0.4082483</Weight>
</VectorElement>
<VectorElement>
<Term>produced</Term>
<Weight>0.4082483</Weight>
</VectorElement>
<VectorElement>
<Term>volcanii</Term>
<Weight>0.4082483</Weight>
</VectorElement>
</TermVector>
<TermVector>
</TermVector>
<SupportingTupleId>2</SupportingTupleId>
<PatternRank>1.0</PatternRank>
<PatternClosure>01</PatternClosure>
</Pattern>
Algorithm
Process
- Start with seed set R of tuples
- Generate set P of patterns from R <= GeneratePatterns()
- Compute support and confidence for each pattern in P
- Discard patterns with low support or confidence
- Generate new set T of tuples matching patterns P <= GenerateTuples(Patterns)
- Compute confidence of each tuple in T
- Add to R the tuples t2T with conf(t)> threshold.
- Go back to step 2
Algorithm - Functions
GeneratePatterns() Function
- Assign the first context vector representation V as the centroid of cluster C
- For Vi, compute the similarity with the centroids of each existing clusters, let S be the highest similarity.
- If S the threshold value (St), add the item to the corresponding cluster and re-compute the cluster centroids, else use Vi to form a new cluster
- Go to step 2 if another V needs to be clustered, else proceed to step 5
- Return the centroids of each cluster, add them (i.e. promoted extraction patterns) to candidate pattern set
sub GeneratePatterns
foreach (< oseed; `seed >, str) in Occurrences
(1) tS =< ls; t1;ms; t2; rs > = makeOccurrence(str);
(2) cluster best = FindClosestCluster(tS, sim);
if(best)
best.Add(tS );
best.UpdateCentroid(tS);
best.AddTuple(< oseed; `seed >);
else
CreateNewCluster(< oseed; `seed >, tS);
(3)Patterns = FilterPatterns( Clusters, sup);
return Patterns;
GenerateTuples(Patterns) Function
sub GenerateTuples(Patterns)
foreach text segment in corpus
(1) {<o,l>, < ls; t1;ms; t2; rs>} =
= CreateOccurrence(text segment);
TC = < o;` >;
SimBest = 0;
foreach p in Patterns
(2) sim = Match(< ls; t1;ms; t2; rs >; p);
if (sim >= Tsim)
(3) UpdatePatternSelectivity(p, TC);
if(sim SimBest)
SimBest = sim;
PBest = p;
if(SimBest >= Tsim)
CandidateTuples [TC ].Patterns [PBest ] =
= SimBest;
return CandidateTuples;
Open Issues
Some of the recognized open problems for the application of Snowball are:
- Automated setting of internal thresholds. Many thresholds are required. E.g.:
- the maximum number of words that a pattern can be found in.
- a minimum threshold on the similarity between records in clusters (similar to the selection of the number of clusters).