logo

Needleman-Wunsch Algorithm: Global Sequence Alignment 📂Algorithm

Needleman-Wunsch Algorithm: Global Sequence Alignment

Overview

Finding an alignment between two sequences that maximizes the number of common parts is called Global Alignment, and the most widely used method for this is the Needleman-Wunsch Algorithm.

There are too many possible alignments in sequence alignment, so it’s necessary to compute them efficiently and quickly through dynamic programming.

Algorithm1

Input

Given two strings $\textbf{v}, \textbf{w}$ and a substitution matrix $\delta$, let’s represent them as follows.

$$ \textbf{v} := \left( v_{1} , \cdots , v_{m} \right) \\ \textbf{w} := \left( w_{1} , \cdots , w_{n} \right) $$

The length of $\textbf{v}$ is $m$, and the length of $\textbf{w}$ is $n$. The penalty for two characters $x,y$ is represented as $\delta (x,y)$, and specifically, the character representing a gap is represented as $-$.


Step 1. Initialization

Create an empty matrix $\left( s_{i,j} \right)_{i=0, \cdots, m \\ j = 0, \cdots , n}$ and initialize the first row and column as follows.

$$ s_{i,0} \gets i \cdot \delta \left( v_{i},- \right) \\ s_{0,j} \gets j \cdot \delta \left( -, w_{j} \right) $$


Step 2. Dynamic Programming

If $s_{i-1,j}$ and $s_{i,j-1}$ are already computed, then compute $s_{i,j}$ as follows. (Of course, $s_{i-1,j-1}$ would have been computed.)

$$ s_{i,j} \gets \max \begin{cases} s_{i-1,j} + \delta \left( v_{i}, - \right) \\ s_{i,j-1} + \delta \left( -, w_{j} \right) \\ s_{i-1, j-1} + \delta \left( v_{i} , w_{j} \right) \end{cases}$$


Output

Starting from $s_{mn}$, by tracing back from the bottom right to the upper left and selecting matching elements, one can obtain the most similar global alignment between the two strings. That score is $s_{mn}$.

Example

Input

julia> s1 = "ATTAGCTCAC";

julia> s2 = "ACTATAGCGCAG";

Let’s globally align the two sequences as above.

julia> EDNAFULL
SubstitutionMatrix{BioSymbols.DNA,Int64}:
     A  C  M  G  R  S  V  T  W  Y  H  K  D  B  N
  A  5 -4  1 -4  1 -4 -1 -4  1 -4 -1 -4 -1 -4 -2
  C -4  5  1 -4 -4  1 -1 -4 -4  1 -1 -4 -4 -1 -2
  M  1  1 -1 -4 -2 -2 -1 -4 -2 -2 -1 -4 -3 -3 -1
  G -4 -4 -4  5  1  1 -1 -4 -4 -4 -4  1 -1 -1 -2
  R  1 -4 -2  1 -1 -2 -1 -4 -2 -4 -3 -2 -1 -3 -1
  S -4  1 -2  1 -2 -1 -1 -4 -4 -2 -3 -2 -3 -1 -1
  V -1 -1 -1 -1 -1 -1 -1 -4 -3 -3 -2 -3 -2 -2 -1
  T -4 -4 -4 -4 -4 -4 -4  5  1  1 -1  1 -1 -1 -2
  W  1 -4 -2 -4 -2 -4 -3  1 -1 -2 -1 -2 -1 -3 -1
  Y -4  1 -2 -4 -4 -2 -3  1 -2 -1 -1 -2 -3 -1 -1
  H -1 -1 -1 -4 -3 -3 -2 -1 -1 -1 -1 -3 -2 -2 -1
  K -4 -4 -4  1 -2 -2 -3  1 -2 -2 -3 -1 -1 -1 -1
  D -1 -4 -3 -1 -1 -3 -2 -1 -1 -3 -2 -1 -1 -2 -1
  B -4 -1 -3 -1 -3 -1 -2 -1 -3 -1 -2 -1 -2 -1 -1
  N -2 -2 -1 -2 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1
(underlined values are default ones)

For the substitution matrix, let’s say the gap penalty is linearly $-5$ points. Then the characters used are only the four major bases, receiving $5$ points for matches, $-4$ points for mismatches, and the gap penalty of $-5$ points.

For easier understanding, let’s implement it simply in Excel.

Step 1. Write the query sequence as rows and the reference sequence as columns, and initialize.

step1


Step 2. Calculate the score in each cell.

step2

Enter the following function in cell C3 (3rd row, 3rd column) and copy it to all cells.

=MAX(B2+IF(C$1=$A3,5,-4), C2-5,B3-5)

Output. Trace back from the bottom right.

step3

Starting from the bottom right corner, follow these rules in reverse order.

  • Red ↖: Rows and columns do not match, so either side can be chosen.
  • Orange ↖: Rows and columns match, so select that base.
  • Grey ← or ↑: Represents a gap, an Indel. Choose between the top or left, whichever has a higher score.

As can be seen during the tracing process, the optimal sequence alignment may not be unique. The optimal global alignments that can be found are A-T-TAGCCAC, A-T-TAGCGAC, or even ACT-TAGCGAC, but the scores of all possible sequence alignments found are 22 points.

Code

Since it’s dynamic programming, it can be easily implemented in Excel as above. Although it’s okay for study purposes, as seen in the example, continually reusing this is certainly foolish. The following is code for performing the same task quickly and easily in Julia.

using BioAlignments

s1 = "ATTAGCTCAC";
s2 = "ACTATAGCGCAG";

scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=0);
global_result = pairalign(GlobalAlignment(), s1, s2, scoremodel)

Execution Result

using BioAlignments

julia> s1 = "ATTAGCTCAC";

julia> s2 = "ACTATAGCGCAG";

julia> scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=0);

julia> global_result = pairalign(GlobalAlignment(), s1, s2, scoremodel)
PairwiseAlignmentResult{Int64,String,String}:
  score: 22
  seq:  1 A-T-TAGCTCAC 10
          | | |||| || 
  ref:  1 ACTATAGCGCAG 12

  1. Del C. Jones, Pavel A. Pevzner. (2009). An Introduction to Bioinformatics Algorithms (Computational Molecular Biology): p177. ↩︎