Shortest Common Supersequence (SCS)

A Shortest Common Supersequence (SCS) is a string [math]\displaystyle{ S_1 }[/math] in a supersequence relation with every member of string set [math]\displaystyle{ S }[/math] and with no other string [math]\displaystyle{ (S_2\in S) }[/math] in a supersequence relation with the string set is shorter than [math]\displaystyle{ S_1 }[/math].

Context:
- It can be identified through a Shortest Common Supersequence System (that solves an SCS task).
- It can be the shortest string that contains two more strings as Substrings.
Example(s):
- a Shortest Common Superstring.
- S₁=babcab, for S={bcb, baab, babc}, given Alphabet {a,b,c} because a shorter string cannot be in a Supersequence Relation with S.
- …
Counter-Example(s):
- a Longest Common Substring.
- a Longest Common Subsequence.
See: Sequence Alignment Task, Approximate String Matching, Subsequence, Common Subgraph Task, Sequence Alignment Task.

References

2019

(Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Shortest_common_supersequence_problem Retrieved:2019-12-12.
- In computer science, the shortest common supersequence of two sequences X and Y is the shortest sequence which has X and Y as subsequences. This is a problem closely related to the longest common subsequence problem. Given two sequences X = < x₁,...,x_m > and Y = < y₁,...,y_n >, a sequence U = < u₁,...,u_k > is a common supersequence of X and Y if items can be removed from U to produce X or Y.
  A shortest common supersequence (SCS) is a common supersequence of minimal length. In the shortest common supersequence problem, the two sequences X and Y are given and the task is to find a shortest possible common supersequence of these sequences. In general, an SCS is not unique.
  For two input sequences, an SCS can be formed from a longest common subsequence (LCS) easily. For example, if X [math]\displaystyle{ [1..m] = abcbdab }[/math] and Y [math]\displaystyle{ [1..n] = bdcaba }[/math], the lcs is Z [math]\displaystyle{ [1..r] = bcba }[/math] . By inserting the non-lcs symbols while preserving the symbol order, we get the SCS: U [math]\displaystyle{ [1..t] = abdcabdab }[/math] .
  It is quite clear that [math]\displaystyle{ r + t = m + n }[/math] for two input sequences. However, for three or more input sequences this does not hold. Note also, that the LCS and the SCS problems are not dual problems.

2007

(Westling, 2007) ⇒ Andreas Westling. (2007). “The Shortest Common Supersequence Problem." In: Andreas Westling's Webpage.
- QUOTE: (...)
  - A sequence of length [math]\displaystyle{ n }[/math] is a concatenation of [math]\displaystyle{ n }[/math] symbols taken from an alphabet.
  - If [math]\displaystyle{ S }[/math] is a sequence of length [math]\displaystyle{ n }[/math] and [math]\displaystyle{ T }[/math] is a sequence of length [math]\displaystyle{ m }[/math] and [math]\displaystyle{ n \leq m }[/math] then [math]\displaystyle{ S }[/math] is a subsequence of [math]\displaystyle{ T }[/math] if [math]\displaystyle{ S }[/math] can be obtained by deleting [math]\displaystyle{ m-n }[/math] symbols from [math]\displaystyle{ T }[/math]. The symbols need not be contiguous.
  - A sequence [math]\displaystyle{ T }[/math] of length [math]\displaystyle{ m }[/math] is a supersequence of [math]\displaystyle{ S }[/math] of length [math]\displaystyle{ n }[/math] if [math]\displaystyle{ T }[/math] can be obtained by inserting [math]\displaystyle{ m-n }[/math] symbols. That is, [math]\displaystyle{ T }[/math] is a supersequence of [math]\displaystyle{ S }[/math] if and only if [math]\displaystyle{ S }[/math] is a subsequence of [math]\displaystyle{ T }[/math].
  - A sequence [math]\displaystyle{ T }[/math] is a common supersequence of the sequences [math]\displaystyle{ S_1 }[/math] and [math]\displaystyle{ S_2 }[/math] of [math]\displaystyle{ T }[/math] is a supersequence of both [math]\displaystyle{ S_1 }[/math] and [math]\displaystyle{ S_2 }[/math].

(...)

The problem is to find a shortest common supersequence (SCS), which is a common supersequence of minimal length. There could be more than one SCS for a given problem.

2002

(Navarro & Raffinot, 2002) ⇒ Gonzalo Navarro, and Mathieu Raffinot. (2002). “Flexible Pattern Matching in Strings." Cambridge University Press.
- QUOTE: Sequence comparison is about determining similarities and correspondences between two or more strings. It is related to approximate searching (Chapter 6) and has many applications in computational biology, speech recognition, computer science, coding theory, chromatography, and so on. These applications look for similarities between sequences of symbols. The general goal is to perform basic operation over the strings until they become equal. ... A concept of “distance” between two strings can be defined according to the minimum cost of making them equal. ... In some cases it is useful to measure the degree of similarity rather than of dissimilarity (i.e., a distance). One example is the LCS, a heavily studied measure. Other examples are the shortest common supersequence (SCS), longest common substring (LCG, different from LCS because the common string has to be a contiguous substring of both sequences), and shortest common superstring (SCG), as well as their version or more than two strings.

1996

(Branke & Middendorf, 1996) ⇒ J. Branke, and M. Middendorf. (1996). “Searching for Shortest Common Supersequences by Means of a Heuristic-based Genetic Algorithm". In: Proceeding of the 2nd Nordic Workshop on Genetic Algorithms and Their Applications.
- QUOTE: The Shortest Common Supersequence (SCS) problem is a classical problem from stringology which has applications e.g. in artificial intelligence (specifically planning), mechanical engineering and data compression ^[1], ^[2]. The problem is as follows: Given a finite set [math]\displaystyle{ L }[/math] of strings over an alphabet [math]\displaystyle{ \Sigma }[/math], find a string of minimal length that is a supersequence of each string in [math]\displaystyle{ L }[/math]. A string [math]\displaystyle{ S }[/math] is called a supersequence of a string [math]\displaystyle{ T }[/math] if [math]\displaystyle{ S }[/math] can be obtained from [math]\displaystyle{ T }[/math] by inserting zero or more symbols.
  For example, given the alphabet [math]\displaystyle{ \Sigma = \{a, b, c\} }[/math] and the set of strings [math]\displaystyle{ L = \{cbbc, abc, cbag\} }[/math], a shortest common supersequence of [math]\displaystyle{ L }[/math] is the string [math]\displaystyle{ cbabc }[/math]. Another, longer supersequence would be for example [math]\displaystyle{ cabbac }[/math].

↑ Foulser, D. E., Li, M., & Yang, Q. (1992). Theory and algorithms for plan merging. Artificial Intelligence, 57(2-3), 143-181.
↑ Timkovskii, V. G. (1989). Complexity of common subsequence and supersequence problems and related problems. Cybernetics, 25(5), 565-580.

1993

(Irving & Fraser, 1993) ⇒ R. Irving, and C. Fraser. (1993). “On the Worst-Case Behaviour of Some Approximation Algorithms for the Shortest Common Supersequence of k Strings". In: Combinatorial Pattern Matching. doi:10.1007/BFb0029797
- QUOTE: Two natural polynomial-time approximation algorithms for the shortest common supersequence (SCS) of k strings are analysed from the point of view of worst-case performance guarantee. Both algorithms behave badly in the worst case, whether the underlying alphabet is unbounded or of fixed size. For a Tournament style algorithm proposed by Timkovskii, we show that the length of the SCS found is between k/4 and (3k + 2)/8 times the length of the optimal in the worst case. The corresponding bounds proved for the obvious Greedy algorithm are (4k + 17)/27 and (k−1)/e. Even for a binary alphabet, no constant performance guarantee is possible for either algorithm, in contrast with the guarantee of 2 provided by a trivial algorithm in that case.

1981

(Räihä & Ukkonen, 1981) ⇒ Kari-Jouko Räihä, and Esko Ukkonen. (1981). “The Shortest Common Supersequence Problem Over Binary Alphabet is NP-complete.” In: Theoretical Computer Science, 16(2). doi:10.1016/0304-3975(81)90075-X
- ABSTRACT: We consider the complexity of the Shortest Common Supersequence (SCS) problem, i.e. the problem of finding for finite strings [math]\displaystyle{ S_1, S_2,\cdots, S_u }[/math] a shortest string [math]\displaystyle{ S }[/math] such that every [math]\displaystyle{ S_i }[/math] can be obtained by deleting zero or more elements from [math]\displaystyle{ S }[/math]. The SCS problem is shown to be NP-complete for strings over an alphabet of size greater-or-equal, slanted 2.

1978

(Garey & Johnson, 1979) ⇒ Michael R. Garey, and David S. Johnson (1979). “Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman. pg.228.

[1] Foulser, D. E., Li, M., & Yang, Q. (1992). Theory and algorithms for plan merging. Artificial Intelligence, 57(2-3), 143-181.

[2] Timkovskii, V. G. (1989). Complexity of common subsequence and supersequence problems and related problems. Cybernetics, 25(5), 565-580.

[1]

[2]