Needleman-Wunsch Sequence Alignment Algorithm
(Redirected from Needleman-Wunsch algorithm)
Jump to navigation
Jump to search
A Needleman-Wunsch Sequence Alignment Algorithm is a global sequence alignment algorithm that aligns two complete sequences.
- Example(s):
- MATLAB implementation(s) of the Needleman-Wunsch algorithm:
- C implementation(s) of the Needleman-Wunsch algorithm:
- Python implementation(s) of the Needleman-Wunsch algorithm:
nwalign 0.3.1
align.py
(developed by Hamilton & Ben-Hur 2019);
- …
- Counter-Example(s):
- See: Approximate String Matching Algorithm, Sequence Homology, Longest Common Subsequence, Shortest Common Supersequence, Longest Common Substring, Shortest Common Superstring, Approximate String Matching, Phylogenetic Analysis Task, Alignment-free Sequence Analysis Task, Levenshtein Distance, Edit Distance, Alignment Distance, Sequential Pattern Mining Task, Dynamic Programming.
References
2021a
- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm Retrieved:2021-2-24.
- The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970. The algorithm essentially divides a large problem (e.g. the full sequence) into a series of smaller problems, and it uses the solutions to the smaller problems to find an optimal solution to the larger problem. It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. The algorithm assigns a score to every possible alignment, and the purpose of the algorithm is to find all possible alignments having the highest score.
2021b
- (Kellis et al., 2021) ⇒ Manolis Kellis et al. (2021). "2.5: The Needleman-Wunsch Algorithm". In: "Book: Computational Biology - Genomes, Networks, and Evolution (Kellis et al.)", LibreTexts.
- QUOTE: We will now use dynamic programming to tackle the harder problem of general sequence alignment. Given two strings $S=\left(S_1,\ldots,S_n\right)$ and $T=\left(T_1,\ldots,T_m\right)$ , we want to find the longest common subsequence, which may or may not contain gaps. Rather than maximizing the length of a common subsequence we want to compute the common subsequence that optimizes the score as defined by our scoring function. Let $d$ denote the gap penalty cost and $s(x;y)$ the score of aligning a base $x$ and a base $y$. These are inferred from insertion/deletion and substitution probabilities which can be determined experimentally or by looking at sequences that we know are closely related. The algorithm we will develop in the following sections to solve sequence alignment is known as the Needleman-Wunsch algorithm.
1970
- (Needleman & Wunsch, 1970) ⇒ Saul B. Needleman, and Christian D. Wunsch. (1970). “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins." J Mol Biol, 48(3). (doi:10.1016/0022-2836(70)90057-4).
- QUOTE: The maximum match is a number dependent upon the similarity of the sequences. One of its definitions is the largest number of amino acids of one protein that can be matched with those of a second protein allowing for all possible interruptions in either of the sequences. While the interruptions give rise to a very large number of comparisons, the method efficiently excludes from consideration those comparisons that cannot contribute to the maximum match.
Comparisons are made from the smallest unit of significance, a pair of amino acids, one from each protein. All possible pairs are represented by a two-dimensional array, and all possible comparisons are represented by pathways through the array. For this maximum match only certain of the possible pathways must be evaluated. A numerical value, one in this case, is assigned to every cell in the array representing like amino acids. The maximum match is the largest number that would result from summing the cell values of every pathway.
- QUOTE: The maximum match is a number dependent upon the similarity of the sequences. One of its definitions is the largest number of amino acids of one protein that can be matched with those of a second protein allowing for all possible interruptions in either of the sequences. While the interruptions give rise to a very large number of comparisons, the method efficiently excludes from consideration those comparisons that cannot contribute to the maximum match.